Data module

pycof.data.f_read(*args, **kwargs)[source]

Old function to load a data file. This function is on a deprecation path; consider using pycof.data.read() instead.

Warning

Note that from version 1.6.0, f_read will be fully deprecated and replaced by pycof.data.read().

Returns

Output from pycof.data.read()

Return type

pandas.DataFrame
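
As a minimal sketch of the migration, assuming f_read is exposed at the package level like read (the file path is a placeholder), the legacy call and its replacement take the same arguments and return the same output:

import pycof as pc

# Legacy call on the deprecation path; its output comes from pycof.data.read()
df_old = pc.f_read('/path/to/file.csv')
# Preferred equivalent going forward
df_new = pc.read('/path/to/file.csv')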

pycof.data.read(path, extension=None, parse=True, remove_comments=True, sep=',', sheet_name=0, engine='auto', credentials={}, profile_name=None, cache='30mins', cache_name=None, verbose=False, **kwargs)[source]

Read and parse a data file. It can read multiple formats. For data frame-like formats, the function returns a pandas data frame, otherwise a string. By default, the function detects the extension from the file path; you can force a specific one with the extension argument. It can remove comments, trailing spaces, line breaks and tabs. It can also replace f-string placeholders (e.g. {country}) with provided values.

Parameters
  • path (str): path to the file to load.

  • extension (str): extension to use. Can be ‘csv’, ‘txt’, ‘xlsx’, ‘sql’, ‘html’, ‘py’, ‘json’, ‘js’, ‘parquet’, ‘read-only’ (defaults None).

  • parse (bool): Format the query to remove trailing space and comments, ready to use format (defaults True).

  • remove_comments (bool): Remove comments from the loaded file (defaults True).

  • sep (str): Columns delimiter for pd.read_csv (defaults ‘,’).

  • sheet_name (str): Sheet (tab) to load when reading Excel files (defaults 0).

  • engine (str): Engine to use to load the file. Can be ‘pyarrow’ or the function from your preferred library (defaults ‘auto’).

  • credentials (dict): Credentials to use to connect to AWS S3. You can also provide the credentials path or the json file name from ‘/etc/.pycof’ (defaults {}).

  • profile_name (str): Profile name of the AWS profile configured with the command aws configure (defaults None).

  • cache (str): Caches the data to avoid downloading it again (defaults ‘30mins’).

  • cache_name (str): File name for storing cache data, if None the name will be generated by hashing the path (defaults None).

  • verbose (bool): Display intermediate steps (defaults False).

  • **kwargs (str): Arguments to be passed to the engine or values to be formatted in the file to load.
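
To illustrate a few of these arguments, here is a hedged sketch; the file paths, sheet name and cache name are placeholders, and extension='csv' simply forces the csv parser on a .txt file:

import pycof as pc

# Load a specific tab from an Excel file
df_xl = pc.read('/path/to/file.xlsx', sheet_name='Sheet1')
# Force the csv parser on a .txt file with a tab delimiter
df_txt = pc.read('/path/to/file.txt', extension='csv', sep='\t')
# Read from S3, cache the download and name the cached file
df_s3 = pc.read('s3://bucket/path/to/file.csv', cache='30mins', cache_name='my_extract.csv', verbose=True)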

Configuration

The function requires the arguments below in the configuration file.

  • AWS_ACCESS_KEY_ID: AWS access key, can remain empty if an IAM role is assigned to the host.

  • AWS_SECRET_ACCESS_KEY: AWS secret key, can remain empty if an IAM role is assigned to the host.

{
    "AWS_ACCESS_KEY_ID": "",
    "AWS_SECRET_ACCESS_KEY": ""
}
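
As a minimal sketch, the same credentials can be supplied at call time, either as a dictionary with the keys shown above or through an AWS CLI profile (bucket path, key values and profile name are placeholders):

import pycof as pc

# Explicit credentials dictionary, mirroring the configuration file above
df = pc.read('s3://bucket/path/to/file.csv',
             credentials={'AWS_ACCESS_KEY_ID': '****', 'AWS_SECRET_ACCESS_KEY': '****'})
# Or rely on a profile configured with `aws configure` (see FAQ 4)
df = pc.read('s3://bucket/path/to/file.csv', profile_name='default')
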
Example
>>> sql = pycof.read('/path/to/file.sql', country='FR')
>>> df1 = pycof.read('/path/to/df_file.json')
>>> df2 = pycof.read('/path/to/df.csv')
>>> df3 = pycof.read('s3://bucket/path/to/file.parquet')
Returns
  • pandas.DataFrame: Data frame or string from the file read.


FAQ

1 - How can I load a .json file as a dictionary?

The function pycof.data.read() allows you to read different formats. By default it will load the file as a pandas.DataFrame, but you can provide engine='json' to load it as a dict.

import pycof as pc

pc.read('/path/to/file.json', engine='json')

2 - How can I read .parquet files?

When you provide a path containing the keyword parquet to pycof.data.read(), it will by default call the pyarrow engine. You can also pass pyarrow-specific arguments to load your file. In particular, you can parallelize the loading step by providing the argument metadata_nthreads and setting it to the number of threads to be used.

import pycof as pc

# Reading a local file
df1 = pc.read('/path/to/file.parquet', metadata_nthreads=32)
# Reading remote file from S3
df2 = pc.read('s3://bucket/path/to/parquet/folder', extension='parquet', metadata_nthreads=32)

You can also find more details in the pyarrow read_table documentation.

Warning

When loading a file from S3, credentials are required to access AWS. See the setup section for the config file or FAQ 3 for boto3.

3 - How can I get my AWS credentials with boto3?

You can use boto3 to get your credentials before passing them to a function:

import boto3

session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

config = {"AWS_ACCESS_KEY_ID": creds.access_key,
          "AWS_SECRET_ACCESS_KEY": creds.secret_key,
          "REGION": "eu-west-3"}

4 - How do I save my AWS credentials?

If you work from an AWS EC2 instance or a SageMaker notebook instance, you may not need to set up your credentials. Just ensure you assigned an IAM role with the correct permissions to your instance. You should then be able to use PYCOF normally.

If you use it locally, or on an instance that uses an AWS IAM user rather than an IAM role, you then need to run:

ubuntu@ip-123-45-67-890:~$ aws configure
AWS Access Key ID [None]: ****
AWS Secret Access Key [None]: ****
Default region name [None]: us-east-1
Default output format [None]: json