Data module

pycof.data.f_read(path, extension=None, parse=True, remove_comments=True, sep=',', sheet_name=0, engine='pyarrow', credentials={}, cache='30mins', verbose=False, **kwargs)

Read and parse a data file. It can read multiple formats. For data frame-like formats, the function returns a pandas data frame, otherwise a string. By default, the function detects the extension from the file path; you can force one with the extension argument. It can remove comments, trailing spaces, line breaks and tabs. It can also replace f-string placeholders with provided values.

Parameters
  • path (str): path to the file to read.

  • extension (str): extension to use. Can be ‘csv’, ‘txt’, ‘xlsx’, ‘sql’, ‘html’, ‘py’, ‘json’, ‘js’, ‘parquet’, ‘read-only’ (defaults None).

  • parse (bool): Format the query by removing trailing spaces and comments so that it is ready to use (defaults True).

  • remove_comments (bool): Remove comments from the loaded file (defaults True).

  • sep (str): Columns delimiter for pd.read_csv (defaults ‘,’).

  • sheet_name (str): Tab to load when reading Excel files (defaults 0).

  • engine (str): Engine to use to load the file. Can be ‘pyarrow’ or the function from your preferred library (defaults ‘pyarrow’).

  • credentials (dict): Credentials to use to connect to AWS S3. You can also provide the credentials path or the json file name from ‘/etc/.pycof’ (defaults {}).

  • **kwargs (str): Arguments to be passed to the engine or values to be formatted in the file to load.
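
The effect of parse, remove_comments and **kwargs on a SQL file can be pictured with the minimal sketch below. It is not pycof's implementation: the helper name render_sql, the comment-stripping regular expressions and the use of str.format for placeholder substitution are all assumptions made for illustration. The regular expressions are deliberately naive and would also strip a -- sequence that appears inside a string literal.

```python
import re


def render_sql(text, remove_comments=True, **values):
    """Hypothetical helper: strip SQL comments, collapse whitespace, fill placeholders."""
    if remove_comments:
        # Drop /* ... */ block comments first, then -- line comments.
        text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
        text = re.sub(r"--[^\n]*", " ", text)
    # Collapse line breaks, tabs and repeated spaces into single spaces.
    text = " ".join(text.split())
    # Substitute {name} placeholders with the keyword values supplied.
    return text.format(**values)


raw = """
SELECT *          -- all columns
FROM sales
/* filter on one market */
WHERE country = '{country}'
"""
query = render_sql(raw, country="FR")
print(query)
```

Here country plays the same role as the country='FR' keyword in the pycof.f_read example: it is not an engine option but a value injected into the loaded text.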

Configuration

The function requires the below arguments in the configuration file.

  • AWS_ACCESS_KEY_ID: AWS access key, can remain empty if an IAM role is assigned to the host.

  • AWS_SECRET_ACCESS_KEY: AWS secret key, can remain empty if an IAM role is assigned to the host.

{
"AWS_ACCESS_KEY_ID": "",
"AWS_SECRET_ACCESS_KEY": ""
}
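
As a sketch of how a file with this layout might be consumed, the snippet below loads a JSON configuration file and checks that both keys are present. The helper name load_credentials and the temporary file are hypothetical; pycof's own loading logic may differ. Empty values are accepted, matching the IAM role case described above.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_credentials(path):
    """Hypothetical helper: read a JSON config file and check the AWS keys exist."""
    with open(path) as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing configuration keys: {missing}")
    return config


# Write a throwaway config file so the sketch runs without touching /etc/.pycof.
with tempfile.TemporaryDirectory() as folder:
    config_path = os.path.join(folder, "config.json")
    with open(config_path, "w") as handle:
        json.dump({"AWS_ACCESS_KEY_ID": "", "AWS_SECRET_ACCESS_KEY": ""}, handle)
    credentials = load_credentials(config_path)

print(credentials)
```
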
Example
>>> sql = pycof.f_read('/path/to/file.sql', country='FR')
>>> df1 = pycof.f_read('/path/to/df_file.json')
>>> df2 = pycof.f_read('/path/to/df.csv')
>>> df3 = pycof.f_read('s3://bucket/path/to/file.parquet')
Returns
  • pandas.DataFrame: Data frame or string read from the file.


FAQ

1 - How can I load a .json file as a dictionary?

The function pycof.data.f_read() can read different formats. By default it loads a .json file as a pandas.DataFrame, but you can provide engine='json' to load it as a dict.

import pycof as pc

pc.f_read('/path/to/file.json', engine='json')
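
For comparison, the sketch below shows what loading JSON as a dict looks like with Python's standard json module. Whether engine='json' delegates to this module internally is an assumption; the file content is made up for the example.

```python
import json
import os
import tempfile

# Create a small JSON file in a temporary folder, then read it back as a dict.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "file.json")
    with open(path, "w") as handle:
        json.dump({"name": "pycof", "tags": ["data", "sql"]}, handle)
    with open(path) as handle:
        loaded = json.load(handle)

print(type(loaded).__name__)
print(loaded["tags"])
```

A dict keeps nested structures such as the tags list as they are, which is often more convenient than a data frame for configuration-style files.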

2 - How can I read .parquet files?

When the path passed to pycof.data.f_read() contains the keyword parquet, the function calls the pyarrow engine by default. You can also pass pyarrow-specific arguments for loading your file. In particular, you can parallelize the loading step by setting the metadata_nthreads argument to the number of threads to use.

import pycof as pc

# Reading a local file
df1 = pc.f_read('/path/to/file.parquet', metadata_nthreads=32)
# Reading remote file from S3
df2 = pc.f_read('s3://bucket/path/to/parquet/folder', extension='parquet', metadata_nthreads=32)

You can also find more details on the pyarrow read_table documentation.
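
One plausible reading of why the S3 folder example passes extension='parquet' explicitly is that a folder path carries no file suffix to detect. The sketch below illustrates that idea with a suffix-based rule. The helper name detect_extension and the rule itself are assumptions; pycof's actual detection may differ.

```python
import os


def detect_extension(path, extension=None):
    """Hypothetical helper: return the forced extension, else the path suffix."""
    if extension is not None:
        return extension.lower()
    # splitext returns ('root', '.ext'); a folder path yields an empty suffix.
    suffix = os.path.splitext(path)[1]
    return suffix.lstrip(".").lower()


local_ext = detect_extension("/path/to/file.parquet")
folder_ext = detect_extension("s3://bucket/path/to/parquet/folder")
forced_ext = detect_extension("s3://bucket/path/to/parquet/folder", extension="parquet")
print(repr(local_ext), repr(folder_ext), repr(forced_ext))
```
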

Warning

When loading a file from S3, credentials are required to access AWS. See setup for the config file or FAQ3 for boto3.


3 - How can I get my AWS credentials with boto3?

You can use boto3 to get your credentials before passing them to a function:

import boto3

session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

config = {"AWS_ACCESS_KEY_ID": creds.access_key,
          "AWS_SECRET_ACCESS_KEY": creds.secret_key,
          "REGION": "eu-west-3"}
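
The mapping from frozen credentials to the config keys can be wrapped in a small function, sketched below. The helper name build_config is hypothetical, and FrozenCredentials is a stand-in with the same field names as botocore's ReadOnlyCredentials (access_key, secret_key, token) so that the sketch runs without boto3 or an AWS account. The key values are placeholders.

```python
from collections import namedtuple

# Stand-in mirroring the fields of botocore's ReadOnlyCredentials named tuple.
FrozenCredentials = namedtuple("FrozenCredentials", ["access_key", "secret_key", "token"])


def build_config(frozen, region):
    """Hypothetical helper: map frozen credentials to the config keys used above."""
    if frozen is None:
        raise ValueError("no AWS credentials were found")
    return {
        "AWS_ACCESS_KEY_ID": frozen.access_key,
        "AWS_SECRET_ACCESS_KEY": frozen.secret_key,
        "REGION": region,
    }


fake = FrozenCredentials("AKIAEXAMPLE", "secret-example", None)
config = build_config(fake, "eu-west-3")
print(config["AWS_ACCESS_KEY_ID"])
```

In real use, boto3's Session.get_credentials() returns None when no credentials can be resolved, so a similar check belongs before calling get_frozen_credentials().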