SQL module

pycof.sql.remote_execute_sql(sql_query='', query_type='', table='', data={}, credentials={}, verbose=True, connection='direct', autofill_nan=True, engine='default', cache=False, cache_name=None, *args, **kwargs)[source]

Simplified function for executing SQL queries. By default, it looks for credentials at /etc/.pycof/config.json. Users can also pass a dictionary of credentials.

Parameters
  • sql_query (str): SQL query to be executed. Accepts either a string containing the SQL or a path to a file with the ‘.sql’ extension (defaults “”).

  • query_type (str): Type of SQL query to execute. Can either be SELECT, INSERT, COPY, DELETE or UNLOAD (defaults “SELECT”).

  • table (str): Table in which we want to operate, only used for INSERT and DELETE (defaults “”).

  • data (pandas.DataFrame): Data to load on the database (defaults {}).

  • credentials (dict): Credentials to use to connect to the database. Check the FAQ for arguments required depending on your type of connection. You can also provide the credentials path or the json file name from ‘/etc/.pycof/’ (defaults {}).

  • verbose (bool): Display progression bar (defaults True).

  • connection (str): Type of connection to establish. Can either be ‘direct’, ‘IAM’ or ‘SSH’ (defaults ‘direct’).

  • autofill_nan (bool): Replace NaN values by ‘NULL’ (defaults True).

  • cache (bool or str): Caches the data to avoid re-running the same SQL query (defaults False). Provide a str (e.g. '30mins') to set the cache duration.

  • cache_name (str): File name for storing cache data, if None the name will be generated by hashing the SQL (defaults None).

  • **kwargs (str): Arguments to be passed to the pycof.data.f_read() function.

Warning

Since version 1.2.0, argument useIAM is replaced by connection. To connect via AWS IAM, use connection='IAM'. You can also establish an SSH tunnel with connection='SSH', check FAQ below for credentials required.

Warning

Since version 1.2.0, the default credentials file on Unix needs to be /etc/.pycof/config.json. You can still use your old folder location by providing the full path. Note that you can also provide the credentials’ file name in the folder /etc/.pycof/ without having to specify the extension. Check FAQ below for more details.

Configuration

The function requires the below arguments in the configuration file.

  • DB_USER: Database user.

  • DB_PASSWORD: Password for connection to database, can remain empty if connection='IAM'.

  • DB_HOST: End point (hostname) of the database.

  • DB_PORT: Port to access the database.

  • CLUSTER_NAME: Name of the Redshift cluster. Can be found on the Redshift dashboard. Only required if the cluster to access is Redshift.

  • SSH_USER: User on the server to use for SSH connection. Only required if connection='SSH'.

  • SSH_KEY: Path to SSH private key for SSH connection. Only required if connection='SSH' and the path to the key is not the default (usually /home/<username>/.ssh/id_rsa on Linux/MacOS or C:/Users/<username>/.ssh/id_rsa on Windows).

  • SSH_PASSWORD: Password of the SSH user if no key is provided (or key is not registered on the destination host).

{
    "DB_USER": "",
    "DB_PASSWORD": "",
    "DB_HOST": "",
    "DB_PORT": "3306",
    "__COMMENT_1__": "Redshift specific",
    "CLUSTER_NAME": "",
    "__COMMENT_2__": "SSH specific",
    "SSH_USER": "",
    "SSH_KEY": "",
    "SSH_PASSWORD": ""
}
Example
>>> df = pycof.remote_execute_sql("SELECT * FROM SCHEMA.TABLE LIMIT 10")
Returns
  • pandas.DataFrame: Result of an SQL query if query_type = "SELECT".

Metadata are also available to users with additional information regarding the SQL query and the file.

  • df.meta.cache.creation_date: Datetime when the query has been run and cached.

  • df.meta.cache.cache_path: Path to the local cached file.

  • df.meta.cache.query_path: Path to the local cached SQL query.

  • df.meta.cache.age(): Function to evaluate the age of the data file. See pycof.misc.file_age() for formats available.
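As an illustration, a cache-age computation along the lines of df.meta.cache.age() could look like the following. This is a minimal sketch using only the standard library; the actual implementation in pycof.misc.file_age() may differ.

```python
from datetime import datetime

## Hypothetical illustration of a cache-age computation based on the
## cached file's creation date; pycof.misc.file_age() may differ.
creation_date = datetime(2024, 1, 1, 12, 0, 0)   # when the query was cached
now = datetime(2024, 1, 1, 12, 45, 0)            # current time
age_seconds = (now - creation_date).total_seconds()
print(age_seconds)  # 2700.0, i.e. 45 minutes
```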


Credentials

Save your credentials locally

The functions pycof.sql.remote_execute_sql() and pycof.format.send_email() will by default look for the credentials located in /etc/.pycof/config.json. On Windows, save the config file as C:/Windows/.pycof/config.json.

The file follows the below structure:

{
    "DB_USER": "",
    "DB_PASSWORD": "",
    "DB_HOST": "",
    "DB_PORT": "3306",
    "DB_DATABASE": "",
    "__COMMENT_1__": "Email specific, send_email",
    "EMAIL_USER": "",
    "EMAIL_PASSWORD": "",
    "EMAIL_SMTP": "smtp.gmail.com",
    "EMAIL_PORT": "587",
    "__COMMENT_2__": "IAM specific, if connection='IAM' in remote_execute_sql",
    "CLUSTER_NAME": "",
    "AWS_ACCESS_KEY_ID": "",
    "AWS_SECRET_ACCESS_KEY": "",
    "REGION": "eu-west-1",
    "__COMMENT_3__": "SSH specific",
    "SSH_USER": "",
    "SSH_KEY": "",
    "SSH_PASSWORD": ""
}

On Unix based system, run:

sudo nano /etc/.pycof/config.json

and paste the above json after filling the empty strings (pre-filled values are standard default values).

Reminder: To save the file with nano, press CTRL + O, confirm the file name with Enter, then CTRL + X to exit.

On Windows, use the path C:/Users/<username>/.pycof/config.json.

Pass your credentials in your code

Though it is strongly discouraged, you can pass your credentials directly to the functions with the argument credentials, as a dictionary using the same keys as described in the previous section.

The preferred option is to provide the json file’s path to the credentials argument.
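For example, a credentials dictionary reusing the keys from the config file above might look like this (all values below are placeholders, not real defaults):

```python
## Placeholder credentials, using the same keys as the config file
credentials = {
    "DB_USER": "my_user",
    "DB_PASSWORD": "my_password",
    "DB_HOST": "db.example.com",
    "DB_PORT": "3306",
    "DB_DATABASE": "my_db",
}

## The dictionary can then be passed directly
## (requires pycof and a reachable database):
# import pycof as pc
# df = pc.remote_execute_sql("SELECT 1", credentials=credentials)
```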


Example

Standard SELECT

The function executes a given SQL query with credentials automatically pulled from /etc/.pycof/config.json. To execute an SQL query, follow the below steps:

import pycof as pc

## Set up the SQL query
sql = "SELECT * FROM SCHEMA.TABLE LIMIT 10"

## The function will return a pandas dataframe
df = pc.remote_execute_sql(sql)

INSERT data

You can insert the content of a pandas.DataFrame into an SQL table. To execute an insert query, follow the below steps:

import pycof as pc

## The function will insert the pandas dataframe into the specified table
pc.remote_execute_sql(data=df, query_type='INSERT', table='SCHEMA.TABLE')

You only need to ensure your SQL user has write access on the table and to specify the target table.


FAQ

1 - What if I change an argument in the SQL query and run with cache='1h'?

The function pycof.sql.remote_execute_sql() looks at your SQL query as a whole when saving/loading the cached data. Even a slight change in the query (column name, filter, etc.) will trigger a fresh run of the new query before it is cached again. You can then safely use caching without worrying about the evolution of your SQL.
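To illustrate why any change invalidates the cache, assume the cache key is derived from a hash of the full SQL text, consistent with the cache_name behavior described above (the exact hashing scheme used by pycof is an assumption here):

```python
import hashlib

## Two queries differing by a single token produce different cache keys,
## so the modified query triggers a fresh run (hashing scheme assumed).
sql_a = "SELECT * FROM schema.table LIMIT 10"
sql_b = "SELECT * FROM schema.table LIMIT 20"
key_a = hashlib.sha256(sql_a.encode()).hexdigest()
key_b = hashlib.sha256(sql_b.encode()).hexdigest()
print(key_a != key_b)  # True
```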

2 - How to use different credential sets?

The credentials argument accepts either a full path or a json file name from which to load the credentials.

For instance, you can have multiple credential files such as /etc/.pycof/config.json, /etc/.pycof/MyNewHost.json and /home/.pycof/OtherHost.json. In pycof.sql.remote_execute_sql() you can play with the credentials argument.

  • To use the /etc/.pycof/config.json credentials you can use the default arguments by not providing anything.

  • To use /etc/.pycof/MyNewHost.json you can either pass credentials='MyNewHost' or credentials='MyNewHost.json' or the whole path to use them.

  • To use /home/.pycof/OtherHost.json you need to pass the whole path.

Example:

import pycof as pc

## Set the SQL query
sql = "SELECT * FROM SCHEMA.TABLE LIMIT 10"

## Run the query
df = pc.remote_execute_sql(sql, credentials='MyNewHost')
df2 = pc.remote_execute_sql(sql, credentials='/home/OtherHost.json')

3 - How to execute a query from an SQL file?

By providing the path to your SQL file to pycof.sql.remote_execute_sql(), the function will directly load the content of the file. You can also pass wildcards to be formatted within the file. The function uses pycof.data.f_read() to load your file, hence the same arguments can be passed as **kwargs.

Example:

Consider you have a first file /path/to/simple_query.sql:

SELECT *
FROM schema.table
WHERE 1 = 1

Then, you can directly use this file in pycof.sql.remote_execute_sql() as:

import pycof as pc

## Run the query
df1 = pc.remote_execute_sql('/path/to/simple_query.sql')

You can also put wildcards in your file that will be replaced when loading the file. Consider your file path is /path/to/query_wildcards.sql. The wildcards can have any name you want. Just make sure you do not use the same names as arguments of pycof.sql.remote_execute_sql() or pycof.data.f_read().

Note

TIP: Neither function uses arguments starting with an underscore _. You can use this structure for your wildcards, as illustrated below.

SELECT *
FROM schema.table
WHERE column1 = '{_value1}'
    AND column2 = '{_newfilter}'

You can then execute your query:

import pycof as pc

## Run the query
df2 = pc.remote_execute_sql('/path/to/query_wildcards.sql', _value1='random_value', _newfilter='test')
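Under the hood, the substitution behaves like Python's str.format with named placeholders. The sketch below illustrates that assumption; pycof.data.f_read()'s exact mechanism may differ.

```python
## Sketch of wildcard substitution with str.format-style placeholders;
## the template mirrors /path/to/query_wildcards.sql above.
sql_template = (
    "SELECT *\n"
    "FROM schema.table\n"
    "WHERE column1 = '{_value1}'\n"
    "    AND column2 = '{_newfilter}'"
)
formatted = sql_template.format(_value1='random_value', _newfilter='test')
print(formatted)
```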

4 - How to query with AWS IAM credentials?

The function pycof.sql.remote_execute_sql() can take into account IAM user’s credentials. You need to ensure that your credentials file /etc/.pycof/config.json includes the IAM access and secret keys with the Redshift cluster information. The only argument to change when calling the function is to set connection='IAM'.

The function will then use the AWS access and secret keys to ask AWS for the user name and password to connect to the cluster. This is a much safer approach to connecting to a Redshift cluster than using the cluster's credentials directly.

Note

TIP: When using PYCOF on EC2 or SageMaker, you can also rely on IAM roles and omit the access and secret keys. Roles do not depend on credentials. Just make sure your host has the required roles attached.

Example:

import pycof as pc

## Set the SQL query
sql = "SELECT * FROM SCHEMA.TABLE LIMIT 10"

## Run the query
df = pc.remote_execute_sql(sql, connection='IAM')

5 - How to cache the data?

pycof.sql.remote_execute_sql() allows you to cache the data with the cache argument and save time across multiple executions.

## Cache the results of the query for 30 minutes
df = pc.remote_execute_sql(sql, cache='30mins')

The cache argument will save time on the next execution of the same SQL query. The function will load the cached data instead of executing the whole SQL query if the age of the last execution is younger than the cache argument.
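A duration string such as '30mins' could be interpreted along the following lines. This is a hypothetical helper for illustration only; pycof's own parser and the exact set of accepted formats are assumptions.

```python
import re

def parse_cache_duration(value):
    """Convert a duration string such as '30mins' or '1h' to seconds.

    Hypothetical helper for illustration; pycof's parser may differ.
    """
    match = re.match(r"(\d+)\s*([a-z]+)", value.lower())
    number, unit = int(match.group(1)), match.group(2)
    factors = {"s": 1, "sec": 1, "secs": 1,
               "min": 60, "mins": 60,
               "h": 3600, "hour": 3600, "hours": 3600,
               "d": 86400, "day": 86400, "days": 86400}
    return number * factors[unit]

print(parse_cache_duration("30mins"))  # 1800
```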

6 - How to query a database with SSH tunneling?

pycof.sql.remote_execute_sql() allows SSH tunneling with the argument connection='SSH'. The only additional requirement is to specify the SSH user name in the /etc/.pycof/config.json config file.

If no SSH key path nor password is provided, the function will use the default SSH path: usually /home/<username>/.ssh/id_rsa on Linux/MacOS or C:/Users/<username>/.ssh/id_rsa on Windows.

Just make sure your private key named id_rsa is stored in that folder. You can also specify the SSH private key location or the user's password. We recommend, however, working with SSH key pairs for more secure connections.
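The default key location mentioned above can be resolved portably with the standard library. This is a sketch of how such a lookup could work; pycof's actual resolution logic is an assumption here.

```python
import os

## Resolve the default private key path for the current user;
## expanduser("~") yields the home directory on Linux/MacOS and Windows.
default_key = os.path.join(os.path.expanduser("~"), ".ssh", "id_rsa")
print(default_key)  # e.g. /home/<username>/.ssh/id_rsa on Linux
```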

## Use an SSH tunnel
df = pc.remote_execute_sql(sql, connection='SSH')