Data module

pycof.f_read(*args, **kwargs)[source]

Old function to load a data file. This function is on the deprecation path. Consider using read() instead.


Note that from version 1.6.0, f_read will be fully deprecated and replaced by the current read() function.


Returns
    Output from read().

Return type
    pandas.DataFrame

pycof.read(path, extension=None, parse=True, remove_comments=True, sep=',', sheet_name=0, engine='auto', credentials={}, profile_name=None, cache='30mins', cache_name=None, verbose=False, **kwargs)[source]

Read and parse a data file. It can read multiple formats. For data frame-like formats, the function will return a pandas data frame; otherwise, a string. By default, the function will detect the extension from the file path. You can force an extension with the extension argument. It can remove comments, trailing spaces, line breaks and tabs. It can also replace f-strings with provided values.

  • path (str): path to the file to load.

  • extension (str): extension to use. Can be ‘csv’, ‘txt’, ‘xlsx’, ‘sql’, ‘html’, ‘py’, ‘json’, ‘js’, ‘parquet’, ‘read-only’ (defaults None).

  • parse (bool): Format the query to remove trailing spaces and comments, for a ready-to-use format (defaults True).

  • remove_comments (bool): Remove comments from the loaded file (defaults True).

  • sep (str): Columns delimiter for pd.read_csv (defaults ‘,’).

  • sheet_name (str): Sheet or tab to load when reading Excel files (defaults 0).

  • engine (str): Engine to use to load the file. Can be ‘pyarrow’ or the function from your preferred library (defaults ‘auto’).

  • credentials (dict): Credentials to use to connect to AWS S3. You can also provide the credentials path or the json file name from ‘/etc/.pycof’ (defaults {}).

  • profile_name (str): Profile name of the AWS profile configured with the command aws configure (defaults None).

  • cache (str): Caches the data to avoid downloading it again (defaults ‘30mins’).

  • cache_name (str): File name for storing cache data, if None the name will be generated by hashing the path (defaults None).

  • verbose (bool): Display intermediate steps (defaults False).

  • **kwargs (str): Arguments to be passed to the engine or values to be formatted in the file to load.


The function requires the arguments below in the configuration file.

  • AWS_ACCESS_KEY_ID: AWS access key; can remain empty if an IAM role is assigned to the host.

  • AWS_SECRET_ACCESS_KEY: AWS secret key; can remain empty if an IAM role is assigned to the host.
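
As a sketch, a minimal configuration file with these arguments could look like the following (the values are placeholders; the REGION key follows the boto3 example in FAQ 3 below):

```json
{
    "AWS_ACCESS_KEY_ID": "****",
    "AWS_SECRET_ACCESS_KEY": "****",
    "REGION": "eu-west-3"
}
```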

>>> import pycof as pc
>>> sql = pc.read('/path/to/file.sql', country='FR')
>>> df1 = pc.read('/path/to/df_file.json')
>>> df2 = pc.read('/path/to/df.csv')
>>> df3 = pc.read('s3://bucket/path/to/file.parquet')
  • pandas.DataFrame: Data frame or a string from the file read.
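
The f-string substitution in the first example works like Python's str.format: placeholders in the loaded file are replaced by the keyword arguments you pass. A minimal, self-contained sketch of the idea (the query below stands in for the content of a hypothetical file.sql):

```python
# Content of a hypothetical /path/to/file.sql, with an f-string placeholder
query_template = "SELECT * FROM sales WHERE country = '{country}'"

# Passing country='FR' fills the placeholder, as str.format would
query = query_template.format(country='FR')
print(query)  # SELECT * FROM sales WHERE country = 'FR'
```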


1 - How can I load a .json file as a dictionary?

The function allows reading different formats. By default it will load the file as a pandas.DataFrame, but you can provide engine='json' to load it as a dict.

import pycof as pc

my_dict = pc.read('/path/to/file.json', engine='json')

2 - How can I read .parquet files?

If the path provided to the function contains the keyword parquet, it will by default call the pyarrow engine. You can also pass pyarrow-specific arguments for loading your file. In particular, you can parallelize the loading step by providing the argument metadata_nthreads, set to the number of threads to be used.

import pycof as pc

# Reading a local file
df1 = pc.read('/path/to/file.parquet', metadata_nthreads=32)
# Reading a remote file from S3
df2 = pc.read('s3://bucket/path/to/parquet/folder', extension='parquet', metadata_nthreads=32)

You can find more details in the pyarrow read_table documentation.


When loading a file from S3, credentials are required to access AWS. See the setup instructions for the config file, or FAQ 3 below for boto3.

3 - How can I get my AWS credentials with boto3?

You can use boto3 to retrieve your credentials before passing them to the function:

import boto3

session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

# ReadOnlyCredentials is a namedtuple, so index and attribute access are equivalent
config = {"AWS_ACCESS_KEY_ID": creds.access_key,
          "AWS_SECRET_ACCESS_KEY": creds.secret_key,
          "REGION": "eu-west-3"}
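
get_frozen_credentials() returns botocore's ReadOnlyCredentials, which is a namedtuple, so creds[0] and creds.access_key are interchangeable. A self-contained sketch mimicking the returned object (the namedtuple below is a stand-in, not the botocore class itself):

```python
from collections import namedtuple

# Stand-in for botocore's ReadOnlyCredentials(access_key, secret_key, token)
ReadOnlyCredentials = namedtuple('ReadOnlyCredentials',
                                 ['access_key', 'secret_key', 'token'])
creds = ReadOnlyCredentials('AKIA_EXAMPLE', 'SECRET_EXAMPLE', None)

# Index access and attribute access give the same values
print(creds[0] == creds.access_key)  # True
print(creds[1] == creds.secret_key)  # True
```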

4 - How do I save my AWS credentials?

If you work from an AWS EC2 instance or a SageMaker notebook instance, you may not need to setup your credentials. Just ensure you assigned an IAM role with the correct permissions to your instance. You should then be able to use PYCOF normally.

If you use it locally, or with an instance not using an AWS IAM role (but an IAM user), you then need to run:

ubuntu@ip-123-45-67-890:~$ aws configure
AWS Access Key ID [None]: ****
AWS Secret Access Key [None]: ****
Default region name [None]: us-east-1
Default output format [None]: json
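
For reference, aws configure stores these values in plain-text files under ~/.aws; the section name (default below) is what the profile_name argument refers to. A sketch of the resulting files:

```
# ~/.aws/credentials
[default]
aws_access_key_id = ****
aws_secret_access_key = ****

# ~/.aws/config
[default]
region = us-east-1
output = json
```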