Data Uploaders#
In this tutorial we are going to demonstrate the usage of the Bayesline Uploaders API. The Uploaders API provides a generalized mechanism to bring different types of data into the Bayesline ecosystem.
Specifically, we will introduce and explore:
Data Types
Datasets
Schemas and Parsers
The staging concept
The commit concept
Staging Validation
Housekeeping
Imports & Setup#
For this tutorial notebook, you will need to import the following packages.
import tempfile
from pathlib import Path
import polars as pl
from bayesline.apiclient import BayeslineApiClient
We will also need to have a Bayesline API client configured.
bln = BayeslineApiClient.new_client(
endpoint="https://[ENDPOINT]",
api_key="[API-KEY]",
)
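To avoid hard-coding credentials, the same constructor can also be fed from environment variables; a minimal sketch, assuming the variable names BAYESLINE_ENDPOINT and BAYESLINE_API_KEY (placeholders, not defined by the client):
import os
bln = BayeslineApiClient.new_client(
    endpoint=os.environ["BAYESLINE_ENDPOINT"],  # hypothetical variable name
    api_key=os.environ["BAYESLINE_API_KEY"],  # hypothetical variable name
)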
The main entrypoint for the Uploaders API sits on bln.equity.uploaders. All upload functionality can be reached from there.
uploaders = bln.equity.uploaders
Data Types#
Data types distinguish the different kinds of data that can be brought into the Bayesline ecosystem. They are pre-configured by Bayesline and include portfolio holdings, factor exposures, etc.
uploaders.get_data_types()
['exposures', 'factors', 'hierarchies', 'portfolios']
We can obtain a specific uploader for a data type. In this tutorial, we will be working with the exposure uploader, but all other uploaders operate analogously.
The get_data_type method returns a DataTypeUploaderApi instance, which introduces the concept of datasets (see below).
exposure_uploader = uploaders.get_data_type("exposures")
Datasets#
For each data type (e.g. exposures) we can create isolated datasets. For instance, we might want to upload different sets of exposures, which we can do by creating a dataset for each. We can always retrieve existing datasets using the get_datasets method. Since we haven’t created any datasets yet, this will be empty.
exposure_uploader.get_datasets()
[]
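Since datasets are scoped per data type, we could list the datasets that exist for every data type with a small loop; a sketch using only the calls shown above:
# print the existing datasets for each pre-configured data type
for data_type in uploaders.get_data_types():
    print(data_type, uploaders.get_data_type(data_type).get_datasets())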
Now we start by creating a new dataset called "tutorial".
dataset = exposure_uploader.create_dataset("tutorial")
Schemas and Parsers#
Every data type comes with its own dataframe schema. Every uploaded dataframe will be converted into this schema to ensure a uniform way to view the data for a specific data type.
dataset.get_schema()
{'date': Date,
'asset_id': String,
'asset_id_type': String,
'factor_group': String,
'factor': String,
'exposure': Float32}
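If our data already comes in this long layout, we could build a dataframe that matches the schema directly; a hypothetical minimal frame (not staged in this tutorial):
from datetime import date

long_df = pl.DataFrame(
    {
        "date": [date(2025, 1, 6), date(2025, 1, 6)],
        "asset_id": ["GOOG", "AAPL"],
        "asset_id_type": ["cusip9", "cusip9"],
        "factor_group": ["style", "style"],
        "factor": ["growth", "growth"],
        "exposure": [1.2, 1.1],
    },
    schema_overrides={"exposure": pl.Float32},  # match the declared schema
)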
We may have input data in a different format than the schema the exposures data type declares (e.g. a wide format). We can either convert it ourselves (a sketch of the manual conversion follows the list below) or use one of the predefined input data parsers.
A parser:
Is defined for an input format and will convert it to the schema that the uploader expects.
Will add operations such as null-filtering.
Will record error messages if a given input cannot be parsed.
Provides access to example dataframes for the expected input.
Will ensure that the dataframe is valid if the parsing succeeds.
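As an illustration of the manual route, a wide frame like the example below could be reshaped into the long schema with plain polars; a sketch assuming the factor_group^factor column naming used by the wide example, not part of the Uploaders API:
def wide_to_long(df: pl.DataFrame) -> pl.DataFrame:
    # unpivot the factor columns, then split "factor_group^factor" names apart
    id_cols = ["date", "asset_id", "asset_id_type"]
    return (
        df.unpivot(index=id_cols, variable_name="factor_key", value_name="exposure")
        .with_columns(
            pl.col("factor_key").str.split("^").list.get(0).alias("factor_group"),
            pl.col("factor_key").str.split("^").list.get(1).alias("factor"),
            pl.col("exposure").cast(pl.Float32),
        )
        .drop("factor_key")
        .drop_nulls("exposure")
        .select(id_cols + ["factor_group", "factor", "exposure"])
    )
In the rest of this tutorial we rely on the predefined parsers instead.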
dataset.get_parser_names()
['Long-Format', 'Wide-Format']
For demonstration, we will use the Wide-Format parser. When uploading dataframes, we can simply pass the name of the parser.
parser = dataset.get_parser("Wide-Format")
example_df = parser.get_examples()[0]
example_df
date | asset_id | asset_id_type | style^momentum_6 | style^momentum_12 | style^growth | market^market |
---|---|---|---|---|---|---|
date | str | str | f64 | f64 | f64 | f64 |
2025-01-06 | "GOOG" | "cusip9" | -0.3 | -0.2 | 1.2 | 1.0 |
2025-01-06 | "AAPL" | "cusip9" | 0.1 | 0.5 | 1.1 | 1.0 |
2025-01-07 | "GOOG" | "cusip9" | -0.28 | -0.19 | 1.21 | 1.0 |
Before running the parser, we can check whether the data can be successfully parsed with the can_handle method.
parser.can_handle(example_df)
UploadParserResult(parser='Wide-Format', success=True, messages=[])
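Conversely, a frame that lacks required identifier columns should fail this check; a hypothetical illustration (the exact messages depend on the parser):
# dropping the identifier columns should make can_handle report a failure
bad_df = example_df.drop("asset_id", "asset_id_type")
parser.can_handle(bad_df)  # expected: success=False with explanatory messages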
parser.parse(example_df)
(shape: (12, 6)
┌────────────┬──────────┬───────────────┬──────────────┬─────────────┬──────────┐
│ date ┆ asset_id ┆ asset_id_type ┆ factor_group ┆ factor ┆ exposure │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ str ┆ str ┆ str ┆ f32 │
╞════════════╪══════════╪═══════════════╪══════════════╪═════════════╪══════════╡
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_6 ┆ -0.3 │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ style ┆ momentum_6 ┆ 0.1 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_6 ┆ -0.28 │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_12 ┆ -0.2 │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ style ┆ momentum_12 ┆ 0.5 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ style ┆ growth ┆ 1.1 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ style ┆ growth ┆ 1.21 │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ market ┆ market ┆ 1.0 │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ market ┆ market ┆ 1.0 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ market ┆ market ┆ 1.0 │
└────────────┴──────────┴───────────────┴──────────────┴─────────────┴──────────┘,
UploadParserResult(parser='Wide-Format', success=True, messages=[]))
Staging Data#
Staging takes an input dataframe (or file), parses it, and keeps it in a separate area (the stage). We can repeat this staging step to stage multiple files (e.g. if we have daily files). The staging area can then be committed, which concatenates all staged dataframes and writes them to versioned storage.
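For instance, a daily workflow might stage every file in a drop directory and then commit once; a minimal sketch, assuming a hypothetical daily_exposures directory of wide-format csv files (stage_file and commit are covered in detail below):
for daily_file in sorted(Path("daily_exposures").glob("*.csv")):
    dataset.stage_file(daily_file, parser="Wide-Format")
dataset.commit(mode="append")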
Adding to the Staging Area#
We use the example wide dataframe for staging. We give it the name example-1 so that we can tell the staged dataframes apart later on. We also specify the concrete parser we want to use. Note that the parser can be left blank, in which case all available parsers will be tried and the first one that succeeds will be chosen.
dataset.stage_df(name="example-1", df=example_df, parser="Wide-Format")
UploadStagingResult(name='example-1', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 489867, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
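As noted above, the parser argument is optional; a commented-out sketch (the name example-auto is hypothetical, and we keep the call disabled so the staging area holds only the named examples used below):
# dataset.stage_df(name="example-auto", df=example_df)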
Below we’re adding a second dataframe for demonstration purposes. For this we use the existing example dataframe and roll the dates by one week.
example_df2 = example_df.with_columns(pl.col("date").dt.add_business_days(5))
example_df2
date | asset_id | asset_id_type | style^momentum_6 | style^momentum_12 | style^growth | market^market |
---|---|---|---|---|---|---|
date | str | str | f64 | f64 | f64 | f64 |
2025-01-13 | "GOOG" | "cusip9" | -0.3 | -0.2 | 1.2 | 1.0 |
2025-01-13 | "AAPL" | "cusip9" | 0.1 | 0.5 | 1.1 | 1.0 |
2025-01-14 | "GOOG" | "cusip9" | -0.28 | -0.19 | 1.21 | 1.0 |
# note that if we used the same name "example-1" this cell would fail
dataset.stage_df(name="example-2", df=example_df2, parser="Wide-Format")
UploadStagingResult(name='example-2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 518148, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
Retrieving Staged Data#
Below we demonstrate how to obtain previously staged data. We can either obtain the staging results (as we saw above when calling the stage_df method) or the data itself.
dataset.get_staging_results()
{'example-2': UploadStagingResult(name='example-2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 518148, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])]),
'example-1': UploadStagingResult(name='example-1', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 489867, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])}
dataset.get_staging_data().collect()
_name | date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|---|
str | date | str | str | str | str | f32 |
"example-2" | 2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.3 |
"example-2" | 2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.1 |
"example-2" | 2025-01-14 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.28 |
"example-2" | 2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_12" | -0.2 |
"example-2" | 2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_12" | 0.5 |
… | … | … | … | … | … | … |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "style" | "growth" | 1.1 |
"example-1" | 2025-01-07 | "GOOG" | "cusip9" | "style" | "growth" | 1.21 |
"example-1" | 2025-01-06 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
"example-1" | 2025-01-07 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
dataset.get_staging_data(names=["example-1"]).collect()
_name | date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|---|
str | date | str | str | str | str | f32 |
"example-1" | 2025-01-06 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.3 |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.1 |
"example-1" | 2025-01-07 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.28 |
"example-1" | 2025-01-06 | "GOOG" | "cusip9" | "style" | "momentum_12" | -0.2 |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "style" | "momentum_12" | 0.5 |
… | … | … | … | … | … | … |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "style" | "growth" | 1.1 |
"example-1" | 2025-01-07 | "GOOG" | "cusip9" | "style" | "growth" | 1.21 |
"example-1" | 2025-01-06 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
"example-1" | 2025-01-06 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
"example-1" | 2025-01-07 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
Retrieving Summary Data#
We can also obtain predefined summary data for each staged file.
dataset.get_staging_data_summary()
_name | n_dates | n_assets | min_date | max_date | n_factor_groups | n_factors | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|---|---|---|---|---|
str | u32 | u32 | date | date | u32 | u32 | f32 | f32 | f32 | f32 | f32 |
"example-1" | 2 | 2 | 2025-01-06 | 2025-01-07 | 2 | 4 | -0.3 | 1.21 | 0.511667 | 0.75 | 0.610803 |
"example-2" | 2 | 2 | 2025-01-13 | 2025-01-14 | 2 | 4 | -0.3 | 1.21 | 0.511667 | 0.75 | 0.610803 |
Drilling down, we can obtain a more detailed summary as well.
dataset.get_staging_data_detail_summary()
_name | date | n_assets | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|---|
str | date | u32 | f32 | f32 | f32 | f32 | f32 |
"example-1" | 2025-01-06 | 2 | -0.3 | 1.2 | 0.55 | 0.75 | 0.572276 |
"example-1" | 2025-01-07 | 1 | -0.28 | 1.21 | 0.435 | 0.405 | 0.674852 |
"example-2" | 2025-01-13 | 2 | -0.3 | 1.2 | 0.55 | 0.75 | 0.572276 |
"example-2" | 2025-01-14 | 1 | -0.28 | 1.21 | 0.435 | 0.405 | 0.674852 |
Staging from a File#
Instead of passing a dataframe directly, we can also stage data from an existing csv, csv.gz, parquet, or zip file.
Below we’ll demonstrate how to:
Stage a file
Obtain the staged data
Remove the file from the staging area
First, we define an output path where we will write our example_df2 dataframe.
path = Path(tempfile.mkdtemp()) / "example2.csv"
example_df2.write_csv(path)
We can then stage the output file with stage_file, retrieve the data with get_staging_data, and wipe the staging area with wipe_staging.
dataset.stage_file(path, parser="Wide-Format")
UploadStagingResult(name='example2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 637686, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
dataset.get_staging_data(names=["example2"]).collect()
_name | date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|---|
str | date | str | str | str | str | f32 |
"example2" | 2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.3 |
"example2" | 2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.1 |
"example2" | 2025-01-14 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.28 |
"example2" | 2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_12" | -0.2 |
"example2" | 2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_12" | 0.5 |
… | … | … | … | … | … | … |
"example2" | 2025-01-13 | "AAPL" | "cusip9" | "style" | "growth" | 1.1 |
"example2" | 2025-01-14 | "GOOG" | "cusip9" | "style" | "growth" | 1.21 |
"example2" | 2025-01-13 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
"example2" | 2025-01-13 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
"example2" | 2025-01-14 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
dataset.wipe_staging(names=["example2"])
{'example2': UploadStagingResult(name='example2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 35, 637686, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])}
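For completeness, the same flow works for a parquet file; a sketch not executed in this tutorial, assuming the staged name is again derived from the file stem (here example2_pq):
parquet_path = Path(tempfile.mkdtemp()) / "example2_pq.parquet"
example_df2.write_parquet(parquet_path)
dataset.stage_file(parquet_path, parser="Wide-Format")
dataset.wipe_staging(names=["example2_pq"])  # clean up the staged file again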
Committing Data#
Once we’re happy with the staged files, we can commit them into versioned storage. Versions are immutable, so every commit creates a new version, which allows for full time travel.
When committing staged data, we need to choose a mode, which defines how the data is written in the context of versioned storage.
dataset.get_commit_modes()
{'append': 'Appends new factor/date combinations to the existing data.Collisions will be ignored.',
'append_factor': 'Appends new factors to the existing data.Collisions with existing factors will be ignored.',
'append_from': 'Appends new factor/date combinations to the existing data but only after the last date in the existing data. Collisions will be ignored.'}
We’ll follow the steps below to demonstrate the commit and versioning process:
Commit entire staging area.
Show empty staging area (committed staged names are cleared from the staging area).
Show version history.
Get data at latest version.
Re-stage the example-1 dataframe and commit in append mode.
Re-stage the example-2 dataframe and commit in append_from mode.
Get data at different versions.
We start by committing the entire staging area. When we do this, we see that the commit was created as version 1.
dataset.commit(mode="append")
UploadCommitResult(version=1, committed_names=['example-2', 'example-1'])
After committing, the staging area should now be empty.
dataset.get_staging_data().collect()
_name | date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|---|
str | date | str | str | str | str | f32 |
We can get the list of all historical versions with the version_history method. We see below that there are two versions: version 0 corresponds to the automatic creation of the dataset, and version 1 contains our committed changes.
dataset.version_history()
{1: datetime.datetime(2025, 7, 22, 0, 54, 35, 980000, tzinfo=datetime.timezone.utc),
0: datetime.datetime(2025, 7, 22, 0, 54, 35, 946000, tzinfo=datetime.timezone.utc)}
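Since version_history returns a mapping of version number to timestamp, we can also pin a read to the newest version explicitly; a small sketch (omitting version, as in the next cell, reads the current data):
latest_version = max(dataset.version_history())  # highest version number
dataset.get_data(version=latest_version).collect()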
dataset.get_data().collect()
date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|
date | str | str | str | str | f32 |
2025-01-06 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-06 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
2025-01-07 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.280029 |
2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
… | … | … | … | … | … |
2025-01-13 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
2025-01-14 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
2025-01-06 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
2025-01-06 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
2025-01-07 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
Below we will append some more data to demonstrate the versioning.
dataset.stage_df(
"example-1",
example_df.with_columns(pl.col("date").dt.add_business_days(10)),
parser="Wide-Format"
)
UploadStagingResult(name='example-1', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 37, 397199, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
dataset.commit(mode="append")
UploadCommitResult(version=2, committed_names=['example-1'])
dataset.version_history()
{2: datetime.datetime(2025, 7, 22, 0, 54, 38, 290000, tzinfo=datetime.timezone.utc),
1: datetime.datetime(2025, 7, 22, 0, 54, 35, 980000, tzinfo=datetime.timezone.utc),
0: datetime.datetime(2025, 7, 22, 0, 54, 35, 946000, tzinfo=datetime.timezone.utc)}
dataset.stage_df(
"example-2",
example_df2.with_columns(pl.col("date").dt.add_business_days(10)),
parser="Wide-Format"
)
UploadStagingResult(name='example-2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 38, 796359, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
dataset.commit(mode="append_from")
UploadCommitResult(version=3, committed_names=['example-2'])
dataset.version_history()
{3: datetime.datetime(2025, 7, 22, 0, 54, 39, 918000, tzinfo=datetime.timezone.utc),
2: datetime.datetime(2025, 7, 22, 0, 54, 38, 290000, tzinfo=datetime.timezone.utc),
1: datetime.datetime(2025, 7, 22, 0, 54, 35, 980000, tzinfo=datetime.timezone.utc),
0: datetime.datetime(2025, 7, 22, 0, 54, 35, 946000, tzinfo=datetime.timezone.utc)}
# both example-1 and example-2 at this version
dataset.get_data(version=3).collect()
date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|
date | str | str | str | str | f32 |
2025-01-06 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-06 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
2025-01-07 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.280029 |
2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
… | … | … | … | … | … |
2025-01-27 | "GOOG" | "cusip9" | "style" | "growth" | 1.200195 |
2025-01-27 | "AAPL" | "cusip9" | "style" | "growth" | 1.099609 |
2025-01-27 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-27 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
2025-01-28 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.280029 |
# only example-1 at this version
dataset.get_data(version=2).collect()
date | asset_id | asset_id_type | factor_group | factor | exposure |
---|---|---|---|---|---|
date | str | str | str | str | f32 |
2025-01-06 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-06 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
2025-01-07 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.280029 |
2025-01-13 | "GOOG" | "cusip9" | "style" | "momentum_6" | -0.300049 |
2025-01-13 | "AAPL" | "cusip9" | "style" | "momentum_6" | 0.099976 |
… | … | … | … | … | … |
2025-01-20 | "AAPL" | "cusip9" | "style" | "momentum_12" | 0.5 |
2025-01-21 | "GOOG" | "cusip9" | "style" | "momentum_12" | -0.189941 |
2025-01-20 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
2025-01-20 | "AAPL" | "cusip9" | "market" | "market" | 1.0 |
2025-01-21 | "GOOG" | "cusip9" | "market" | "market" | 1.0 |
Note that if we append data that already exists, as identified by its primary key (i.e. there is nothing new to append), then no new version will be recorded.
dataset.stage_df("no-new-data", example_df)
UploadStagingResult(name='no-new-data', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 42, 223413, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])
dataset.commit(mode="append")
UploadCommitResult(version=3, committed_names=['no-new-data'])
dataset.version_history()
{3: datetime.datetime(2025, 7, 22, 0, 54, 39, 918000, tzinfo=datetime.timezone.utc),
2: datetime.datetime(2025, 7, 22, 0, 54, 38, 290000, tzinfo=datetime.timezone.utc),
1: datetime.datetime(2025, 7, 22, 0, 54, 35, 980000, tzinfo=datetime.timezone.utc),
0: datetime.datetime(2025, 7, 22, 0, 54, 35, 946000, tzinfo=datetime.timezone.utc)}
Retrieving Summary Data#
Just as with the staged data, we can obtain summary data for the committed data (at different versions).
dataset.get_data_summary()
n_dates | n_assets | min_date | max_date | n_factor_groups | n_factors | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | date | date | u32 | u32 | f32 | f32 | f32 | f32 | f32 |
8 | 2 | 2025-01-06 | 2025-01-28 | 2 | 4 | -0.300049 | 1.209961 | 0.511648 | 0.75 | 0.610786 |
dataset.get_data_summary(version=2)
n_dates | n_assets | min_date | max_date | n_factor_groups | n_factors | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | date | date | u32 | u32 | f32 | f32 | f32 | f32 | f32 |
6 | 2 | 2025-01-06 | 2025-01-21 | 2 | 4 | -0.300049 | 1.209961 | 0.511648 | 0.75 | 0.610786 |
dataset.get_data_detail_summary()
date | n_assets | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|
date | u32 | f32 | f32 | f32 | f32 | f32 |
2025-01-06 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-07 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
2025-01-13 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-14 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
2025-01-20 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-21 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
2025-01-27 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-28 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
dataset.get_data_detail_summary(version=2)
date | n_assets | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
---|---|---|---|---|---|---|
date | u32 | f32 | f32 | f32 | f32 | f32 |
2025-01-06 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-07 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
2025-01-13 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-14 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
2025-01-20 | 2 | -0.300049 | 1.200195 | 0.549973 | 0.75 | 0.57226 |
2025-01-21 | 1 | -0.280029 | 1.209961 | 0.434998 | 0.405029 | 0.674835 |
Validating Staging Data#
When staging multiple files, it’s possible that their combined contents may not be valid and so cannot be committed. For example, if the files introduce duplicate entries, the commit method will fail.
The example below illustrates how to validate the staging area before committing. To simulate a validation failure, we intentionally stage the same example dataframe twice, resulting in duplicate records.
dataset.stage_df("example-1", example_df, parser="Wide-Format")
dataset.stage_df("example-2", example_df, parser="Wide-Format")
dataset.get_staging_results()
{'example-2': UploadStagingResult(name='example-2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 47, 151083, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])]),
'example-1': UploadStagingResult(name='example-1', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 47, 135495, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])}
The validation check below will produce non-empty dataframes if any validation errors occurred. In this case, it produces the duplicated records together with a _name column which indicates the names of the staged dataframes that introduced the duplication.
dataset.validate_staging_data()
{'Duplication Check': shape: (12, 8)
┌────────────┬──────────┬─────────────┬─────────────┬────────────┬──────────┬─────────┬────────────┐
│ date ┆ asset_id ┆ asset_id_ty ┆ factor_grou ┆ factor ┆ exposure ┆ n_dupes ┆ _name │
│ --- ┆ --- ┆ pe ┆ p ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ --- ┆ --- ┆ str ┆ f32 ┆ u32 ┆ str │
│ ┆ ┆ str ┆ str ┆ ┆ ┆ ┆ │
╞════════════╪══════════╪═════════════╪═════════════╪════════════╪══════════╪═════════╪════════════╡
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ style ┆ growth ┆ 1.1 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ style ┆ growth ┆ 1.2 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ style ┆ growth ┆ 1.21 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ market ┆ market ┆ 1.0 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ market ┆ market ┆ 1.0 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_6 ┆ -0.3 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_6 ┆ -0.28 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ example-1 │
│ 2025-01-06 ┆ AAPL ┆ cusip9 ┆ style ┆ momentum_1 ┆ 0.5 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ 2 ┆ ┆ ┆ example-1 │
│ 2025-01-06 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_1 ┆ -0.2 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ 2 ┆ ┆ ┆ example-1 │
│ 2025-01-07 ┆ GOOG ┆ cusip9 ┆ style ┆ momentum_1 ┆ -0.19 ┆ 2 ┆ example-2, │
│ ┆ ┆ ┆ ┆ 2 ┆ ┆ ┆ example-1 │
└────────────┴──────────┴─────────────┴─────────────┴────────────┴──────────┴─────────┴────────────┘}
# this call would fail with: `UploadError: Staging data fails validation checks.`
# dataset.commit(mode="append")
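In an automated pipeline, a simple guard is to check the validation result before committing; a minimal sketch using only the calls shown in this section:
issues = dataset.validate_staging_data()
if not issues:
    dataset.commit(mode="append")
else:
    for check_name, offending in issues.items():
        # inspect the offending rows for each failed check
        print(check_name, offending.shape)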
To resolve this error, we can delete one of the erroneously staged dataframes, after which the validation will return an empty result, indicating that all checks pass.
dataset.wipe_staging(names=["example-2"])
{'example-2': UploadStagingResult(name='example-2', timestamp=datetime.datetime(2025, 7, 22, 0, 54, 47, 151083, tzinfo=TzInfo(UTC)), success=True, results=[UploadParserResult(parser='Wide-Format', success=True, messages=[])])}
dataset.validate_staging_data()
{}
Fast Commit#
We can skip the staging process and commit a dataframe straight into versioned storage as demonstrated below.
dataset.fast_commit(example_df, mode="append", parser="Wide-Format")
UploadCommitResult(version=3, committed_names=[])
Housekeeping#
To delete a dataset entirely, we can call its destroy method. Warning: this cannot be undone, so exercise caution when deleting datasets.
dataset.destroy()