Exposure Upload Commit Modes#
In this tutorial we demonstrate the usage of the different commit modes for the Exposure Uploader.
As a prerequisite see the Uploaders Tutorial for a comprehensive overview of the Data Uploaders API.
Specifically, we will explore:
- the initial upload and appending arbitrary records with append
- overwriting the entire dataset with overwrite
- appending strictly new dates with append_from
- adding new factors with append_factor
- replacing existing factors with overwrite_factor
For this tutorial we use a very small dataframe, which makes it easier to point out the behavior of the different commit modes. We test our implementation on global datasets with 20 years of history to ensure it scales in a performant way.
For simplicity we’ll use the fast_commit functionality and skip the staging step.
Imports & Setup#
For this tutorial notebook, you will need to import the following packages.
import io
import pandas as pd
import polars as pl
from bayesline.apiclient import BayeslineApiClient
We will also need to have a Bayesline API client configured.
bln = BayeslineApiClient.new_client(
    endpoint="https://[ENDPOINT]",
    api_key="[API-KEY]",
)
uploaders = bln.equity.uploaders
exposure_uploaders = uploaders.get_data_type("exposures")
my_exposure_dataset = exposure_uploaders.create_or_replace_dataset("Test-Exposure-Modes")
Finally, we add a helper function below to print out the data changes between versions.
def highlight_changes(old: pl.DataFrame, new: pl.DataFrame):
    cols = ["date", "asset_id", "asset_id_type", "factor_group", "factor", "exposure"]
    # records present in both versions (all columns identical)
    unchanged = pl.concat([old, new]).group_by(cols).len().filter(pl.col("len") > 1).drop("len").to_pandas()
    # records whose key (all columns except exposure) exists only in one version
    dropped = old.join(new, on=cols[:-1], how="anti").to_pandas()
    added = new.join(old, on=cols[:-1], how="anti").to_pandas()
    df = pd.concat([added, dropped, unchanged], ignore_index=True)
    # style added records green and dropped records red with strike-through
    mask = pd.concat([
        pd.DataFrame({col: ["background-color: #d4edda"] * len(added) for col in cols}),
        pd.DataFrame({col: ["background-color: #f8d7da; text-decoration: line-through"] * len(dropped) for col in cols}),
        pd.DataFrame({col: [""] * len(unchanged) for col in cols}),
    ], ignore_index=True)
    return df.style.apply(lambda _: mask, axis=None)
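As a quick sanity check, we can call the helper on two toy frames (made-up values, purely for illustration): the old record should render struck through and the new one highlighted as added.
old = pl.DataFrame({
    "date": ["2025-01-01"], "asset_id": ["AAPL"], "asset_id_type": ["ticker"],
    "factor_group": ["style"], "factor": ["momentum"], "exposure": [1.0],
})
new = pl.DataFrame({
    "date": ["2025-01-02"], "asset_id": ["AAPL"], "asset_id_type": ["ticker"],
    "factor_group": ["style"], "factor": ["momentum"], "exposure": [1.5],
})
highlight_changes(old, new)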
Available Commit Modes#
Each uploader has different commit modes that are tailored to the type of data being handled. For example, for exposures a common use case is to add a new set of factors while leaving any existing factors untouched. For this use case the exposures uploader provides a dedicated commit mode, which we’ll explore later.
Below we show how to obtain the available commit modes together with a description of their behavior.
my_exposure_dataset.get_commit_modes()
{'append': 'Appends new factor/date combinations to the existing data. Collisions will be ignored.',
'append_factor': 'Appends new factors to the existing data. Collisions with existing factors will be ignored.',
'overwrite': 'Overwrites the entire dataset with the new data.',
'overwrite_factor': 'Overwrites every factor present in the incoming data.',
'append_from': 'Appends new factor/date combinations to the existing data but only after the last date in the existing data. Collisions will be ignored.',
'overwrite_from': 'Overwrites the entire dataset with the new data but only after the last date in the existing data.'}
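Since the result is a plain dict mapping mode name to description, a pipeline can defensively verify that a mode is available before committing. A minimal sketch:
# check that the commit mode we intend to use is actually available
modes = my_exposure_dataset.get_commit_modes()
mode = "append_from"
if mode not in modes:
    raise ValueError(f"commit mode {mode!r} not supported; available: {sorted(modes)}")
print(modes[mode])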
Initial Exposure Upload#
The initial exposure upload is applied against a new exposure dataset. For this we can use the append mode, but since the dataset is still empty, overwrite would work just the same.
Below we create our starting dataframe which we’ll modify as we go through this tutorial. Recall that we can get examples of the required input format from the API as well.
my_exposure_dataset.get_parser("Wide-Format").get_examples()[0]
| date | asset_id | asset_id_type | style^momentum_6 | style^momentum_12 | style^growth | market^market | industry^consumer | industry^tech |
|---|---|---|---|---|---|---|---|---|
| date | str | str | f64 | f64 | f64 | f64 | f64 | f64 |
| 2025-01-06 | "GOOG" | "cusip9" | -0.3 | -0.2 | 1.2 | 1.0 | null | 1.0 |
| 2025-01-06 | "AAPL" | "cusip9" | 0.1 | 0.5 | 1.1 | 1.0 | 1.0 | null |
| 2025-01-07 | "GOOG" | "cusip9" | -0.28 | -0.19 | 1.21 | 1.0 | null | 1.0 |
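The wide headers follow the factor_group^factor convention, so each wide column maps to one (factor_group, factor) pair in the committed long data. The polars sketch below illustrates that mapping on a tiny hand-made frame; it is our own illustration, not necessarily how the parser is implemented (unpivot requires polars >= 1.0, older versions call it melt).
wide = pl.DataFrame({
    "date": ["2025-01-06", "2025-01-06"],
    "asset_id": ["GOOG", "AAPL"],
    "asset_id_type": ["ticker", "ticker"],
    "style^momentum": [-0.3, 0.1],
    "market^market": [1.0, 1.0],
})
long = (
    wide
    # one row per (date, asset, factor column)
    .unpivot(
        index=["date", "asset_id", "asset_id_type"],
        variable_name="factor_key",
        value_name="exposure",
    )
    # split "style^momentum" into factor_group="style", factor="momentum"
    .with_columns(
        pl.col("factor_key")
        .str.split_exact("^", 1)
        .struct.rename_fields(["factor_group", "factor"])
    )
    .unnest("factor_key")
    .drop_nulls("exposure")
)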
data_csv = """
date asset_id asset_id_type style^momentum
2025-01-01 AAPL ticker 1
2025-01-01 GOOG ticker 2
2025-01-02 AAPL ticker 1.5
2025-01-02 GOOG ticker 2.5
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
exposures_df
| date | asset_id | asset_id_type | style^momentum |
|---|---|---|---|
| date | str | str | f64 |
| 2025-01-01 | "AAPL" | "ticker" | 1.0 |
| 2025-01-01 | "GOOG" | "ticker" | 2.0 |
| 2025-01-02 | "AAPL" | "ticker" | 1.5 |
| 2025-01-02 | "GOOG" | "ticker" | 2.5 |
my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=1, committed_names=[])
If we now inspect what was committed, it is (unsurprisingly) exactly the dataset we uploaded (after parsing).
highlight_changes(
pl.DataFrame(schema=my_exposure_dataset.get_schema()),
my_exposure_dataset.get_data(version=1).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-01 00:00:00 | AAPL | ticker | style | momentum | 1.000000 |
| 1 | 2025-01-01 00:00:00 | GOOG | ticker | style | momentum | 2.000000 |
| 2 | 2025-01-02 00:00:00 | AAPL | ticker | style | momentum | 1.500000 |
| 3 | 2025-01-02 00:00:00 | GOOG | ticker | style | momentum | 2.500000 |
Overwriting exposures#
Next up, suppose we made a mistake and would like to overwrite the entire dataset. The simplest approach would be to delete the dataset and recreate it, but that would also delete the entire version history; the overwrite commit mode replaces the data while keeping earlier versions recoverable.
data_csv = """
date asset_id asset_id_type style^value
2025-01-03 TSLA ticker 8
2025-01-03 AMZN ticker 9
2025-01-04 TSLA ticker 10
2025-01-04 AMZN ticker 11
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
my_exposure_dataset.fast_commit(exposures_df, mode="overwrite")
UploadCommitResult(version=2, committed_names=[])
As expected, the data has been completely overwritten.
highlight_changes(
my_exposure_dataset.get_data(version=1).collect(),
my_exposure_dataset.get_data(version=2).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-03 00:00:00 | TSLA | ticker | style | value | 8.000000 |
| 1 | 2025-01-03 00:00:00 | AMZN | ticker | style | value | 9.000000 |
| 2 | 2025-01-04 00:00:00 | TSLA | ticker | style | value | 10.000000 |
| 3 | 2025-01-04 00:00:00 | AMZN | ticker | style | value | 11.000000 |
| 4 | 2025-01-01 00:00:00 | AAPL | ticker | style | momentum | 1.000000 |
| 5 | 2025-01-01 00:00:00 | GOOG | ticker | style | momentum | 2.000000 |
| 6 | 2025-01-02 00:00:00 | AAPL | ticker | style | momentum | 1.500000 |
| 7 | 2025-01-02 00:00:00 | GOOG | ticker | style | momentum | 2.500000 |
Note that we can still recover the previous version.
my_exposure_dataset.get_data(version=1).collect()
| date | asset_id | asset_id_type | factor_group | factor | exposure |
|---|---|---|---|---|---|
| date | str | str | str | str | f32 |
| 2025-01-01 | "AAPL" | "ticker" | "style" | "momentum" | 1.0 |
| 2025-01-01 | "GOOG" | "ticker" | "style" | "momentum" | 2.0 |
| 2025-01-02 | "AAPL" | "ticker" | "style" | "momentum" | 1.5 |
| 2025-01-02 | "GOOG" | "ticker" | "style" | "momentum" | 2.5 |
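Because each commit produced a new version number, we can audit how the dataset evolved by looping over versions (assuming, as the commit results above suggest, that versions are numbered sequentially from 1):
for version in [1, 2]:
    df = my_exposure_dataset.get_data(version=version).collect()
    print(f"version {version}: {df.height} rows, factors: {sorted(df['factor'].unique())}")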
Appending Arbitrary Records#
append means "don't overwrite existing records". Its purpose is to prevent accidentally changing already-uploaded data. For exposures, two records are considered the same if their date, asset_id, asset_id_type, factor_group and factor match. This means that any upload that uses the append mode will ignore (i.e. throw away) any record that already exists in the version that is appended to.
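Conceptually, append behaves like the polars sketch below (our own illustration of the semantics, not the server's actual implementation):
KEY = ["date", "asset_id", "asset_id_type", "factor_group", "factor"]

def conceptual_append(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # keep every existing record; only add incoming records whose key is new
    # (assumes both frames share the long exposure schema)
    return pl.concat([existing, incoming.join(existing, on=KEY, how="anti")])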
To show this with an example, we append the dataframe below. Note how the record for TSLA on 2025-01-04 already exists with value 10.0, and note how this value stays unchanged after the commit.
append is the most general form of adding new data without overwriting existing data. In this example we append time (a new date), a new factor, and a new asset all at once. Later we’ll see how append_from and append_factor add more protections to prevent accidentally overwriting existing data.
Also note that we’re using the long format for this upload, an alternative way of formatting the input data prior to ingestion.
data_csv = """
date asset_id asset_id_type factor_group factor exposure
2025-01-03 XOM ticker style growth 50
2025-01-04 XOM ticker style growth 51
2025-01-05 XOM ticker style growth 51
2025-01-03 XOM ticker style value 12
2025-01-04 XOM ticker style value 13
2025-01-05 XOM ticker style value 14
2025-01-04 TSLA ticker style value 9999
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=3, committed_names=[])
highlight_changes(
my_exposure_dataset.get_data(version=2).collect(),
my_exposure_dataset.get_data(version=3).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-03 00:00:00 | XOM | ticker | style | growth | 50.000000 |
| 1 | 2025-01-03 00:00:00 | XOM | ticker | style | value | 12.000000 |
| 2 | 2025-01-04 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 3 | 2025-01-05 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 4 | 2025-01-04 00:00:00 | XOM | ticker | style | value | 13.000000 |
| 5 | 2025-01-05 00:00:00 | XOM | ticker | style | value | 14.000000 |
| 6 | 2025-01-03 00:00:00 | AMZN | ticker | style | value | 9.000000 |
| 7 | 2025-01-04 00:00:00 | AMZN | ticker | style | value | 11.000000 |
| 8 | 2025-01-04 00:00:00 | TSLA | ticker | style | value | 10.000000 |
| 9 | 2025-01-03 00:00:00 | TSLA | ticker | style | value | 8.000000 |
Appending New Exposures#
Day over day, new exposures arrive that need to be appended to the existing exposure upload. For this use case we may want to ensure that we don’t accidentally add any data for dates that already exist.
Consider the example where our last upload was on 2025-01-05 and we are now appending exposures for 2025-01-06 and 2025-01-07. If records dated 2025-01-05 or earlier that do not already exist in the uploaded exposures slipped into our input data (e.g. exposures for newly added assets), using append would insert these new records and therefore introduce new data into historical dates. This may or may not be desired, depending on the use case.
The commit mode append_from was designed to only append data that is strictly after the last date of the existing data, thereby ensuring that no historical data accidentally slips in.
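In terms of the sketch from the previous section, append_from adds a date filter before appending (again only an illustration of the semantics):
def conceptual_append_from(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # only incoming records strictly after the last existing date survive
    last_date = existing.get_column("date").max()
    return conceptual_append(existing, incoming.filter(pl.col("date") > last_date))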
In the example below we add data for 2025-01-06 but also a new historical record for the asset MSFT on 2025-01-05. Note how this record won’t make it through to the committed data.
data_csv = """
date asset_id asset_id_type style^value style^growth
2025-01-05 MSFT ticker 1000 2000
2025-01-06 AMZN ticker 40 70
2025-01-06 TSLA ticker 50 80
2025-01-06 XOM ticker 60 90
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
my_exposure_dataset.fast_commit(exposures_df, mode="append_from")
UploadCommitResult(version=4, committed_names=[])
highlight_changes(
my_exposure_dataset.get_data(version=3).collect(),
my_exposure_dataset.get_data(version=4).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-06 00:00:00 | AMZN | ticker | style | value | 40.000000 |
| 1 | 2025-01-06 00:00:00 | AMZN | ticker | style | growth | 70.000000 |
| 2 | 2025-01-06 00:00:00 | TSLA | ticker | style | value | 50.000000 |
| 3 | 2025-01-06 00:00:00 | TSLA | ticker | style | growth | 80.000000 |
| 4 | 2025-01-06 00:00:00 | XOM | ticker | style | value | 60.000000 |
| 5 | 2025-01-06 00:00:00 | XOM | ticker | style | growth | 90.000000 |
| 6 | 2025-01-04 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 7 | 2025-01-03 00:00:00 | AMZN | ticker | style | value | 9.000000 |
| 8 | 2025-01-05 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 9 | 2025-01-04 00:00:00 | TSLA | ticker | style | value | 10.000000 |
| 10 | 2025-01-04 00:00:00 | AMZN | ticker | style | value | 11.000000 |
| 11 | 2025-01-03 00:00:00 | XOM | ticker | style | value | 12.000000 |
| 12 | 2025-01-03 00:00:00 | TSLA | ticker | style | value | 8.000000 |
| 13 | 2025-01-03 00:00:00 | XOM | ticker | style | growth | 50.000000 |
| 14 | 2025-01-05 00:00:00 | XOM | ticker | style | value | 14.000000 |
| 15 | 2025-01-04 00:00:00 | XOM | ticker | style | value | 13.000000 |
Adding New Factors#
Adding new factors works similarly to appending new exposures, except that here we ensure that no existing factors are edited: any incoming record that contains a factor name that already exists will be discarded.
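The semantics can be sketched the same way as before (illustrative only, not the server implementation):
def conceptual_append_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # discard every incoming record whose (factor_group, factor) already exists
    existing_factors = existing.select("factor_group", "factor").unique()
    new_rows = incoming.join(existing_factors, on=["factor_group", "factor"], how="anti")
    return pl.concat([existing, new_rows])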
The example below uses the append_factor mode, but the input dataframe also contains values for the existing factor value. Note how these records do not make it into the committed dataframe.
One possible use case for this mode is when the incoming data comes from a single spreadsheet to which an additional factor column was added; we want to ignore all factors that we have already captured.
data_csv = """
date asset_id asset_id_type style^value style^momentum
2025-01-03 JPM ticker 1000 -5
2025-01-03 MSFT ticker 1000 -10
2025-01-04 MSFT ticker 2000 -20
2025-01-05 MSFT ticker 3000 -30
2025-01-06 MSFT ticker 4000 -40
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
my_exposure_dataset.fast_commit(exposures_df, mode="append_factor")
UploadCommitResult(version=5, committed_names=[])
highlight_changes(
my_exposure_dataset.get_data(version=4).collect(),
my_exposure_dataset.get_data(version=5).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-03 00:00:00 | JPM | ticker | style | momentum | -5.000000 |
| 1 | 2025-01-03 00:00:00 | MSFT | ticker | style | momentum | -10.000000 |
| 2 | 2025-01-04 00:00:00 | MSFT | ticker | style | momentum | -20.000000 |
| 3 | 2025-01-05 00:00:00 | MSFT | ticker | style | momentum | -30.000000 |
| 4 | 2025-01-06 00:00:00 | MSFT | ticker | style | momentum | -40.000000 |
| 5 | 2025-01-03 00:00:00 | AMZN | ticker | style | value | 9.000000 |
| 6 | 2025-01-05 00:00:00 | XOM | ticker | style | value | 14.000000 |
| 7 | 2025-01-06 00:00:00 | TSLA | ticker | style | growth | 80.000000 |
| 8 | 2025-01-06 00:00:00 | AMZN | ticker | style | value | 40.000000 |
| 9 | 2025-01-06 00:00:00 | AMZN | ticker | style | growth | 70.000000 |
| 10 | 2025-01-03 00:00:00 | TSLA | ticker | style | value | 8.000000 |
| 11 | 2025-01-06 00:00:00 | XOM | ticker | style | growth | 90.000000 |
| 12 | 2025-01-03 00:00:00 | XOM | ticker | style | growth | 50.000000 |
| 13 | 2025-01-05 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 14 | 2025-01-06 00:00:00 | TSLA | ticker | style | value | 50.000000 |
| 15 | 2025-01-04 00:00:00 | XOM | ticker | style | value | 13.000000 |
| 16 | 2025-01-04 00:00:00 | TSLA | ticker | style | value | 10.000000 |
| 17 | 2025-01-03 00:00:00 | XOM | ticker | style | value | 12.000000 |
| 18 | 2025-01-06 00:00:00 | XOM | ticker | style | value | 60.000000 |
| 19 | 2025-01-04 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 20 | 2025-01-04 00:00:00 | AMZN | ticker | style | value | 11.000000 |
Replacing Factors#
Replacing factors is the inverse of appending factors: any factor that has at least one record in the incoming dataframe will be wiped from the existing dataset and fully replaced with the incoming factor data.
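As a sketch (again only illustrating the semantics), overwrite_factor first wipes and then appends:
def conceptual_overwrite_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # wipe all existing records for any factor present in the incoming data...
    incoming_factors = incoming.select("factor_group", "factor").unique()
    kept = existing.join(incoming_factors, on=["factor_group", "factor"], how="anti")
    # ...then replace them with the incoming records
    return pl.concat([kept, incoming])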
In the example below we fully replace the growth and value factors.
data_csv = """
date asset_id asset_id_type style^value style^growth
2025-01-05 AVGO ticker 1000 2000
"""
exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)
my_exposure_dataset.fast_commit(exposures_df, mode="overwrite_factor")
UploadCommitResult(version=6, committed_names=[])
highlight_changes(
my_exposure_dataset.get_data(version=5).collect(),
my_exposure_dataset.get_data(version=6).collect()
)
| date | asset_id | asset_id_type | factor_group | factor | exposure | |
|---|---|---|---|---|---|---|
| 0 | 2025-01-05 00:00:00 | AVGO | ticker | style | value | 1000.000000 |
| 1 | 2025-01-05 00:00:00 | AVGO | ticker | style | growth | 2000.000000 |
| 2 | 2025-01-03 00:00:00 | TSLA | ticker | style | value | 8.000000 |
| 3 | 2025-01-03 00:00:00 | AMZN | ticker | style | value | 9.000000 |
| 4 | 2025-01-04 00:00:00 | TSLA | ticker | style | value | 10.000000 |
| 5 | 2025-01-04 00:00:00 | AMZN | ticker | style | value | 11.000000 |
| 6 | 2025-01-03 00:00:00 | XOM | ticker | style | growth | 50.000000 |
| 7 | 2025-01-03 00:00:00 | XOM | ticker | style | value | 12.000000 |
| 8 | 2025-01-04 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 9 | 2025-01-05 00:00:00 | XOM | ticker | style | growth | 51.000000 |
| 10 | 2025-01-04 00:00:00 | XOM | ticker | style | value | 13.000000 |
| 11 | 2025-01-05 00:00:00 | XOM | ticker | style | value | 14.000000 |
| 12 | 2025-01-06 00:00:00 | AMZN | ticker | style | value | 40.000000 |
| 13 | 2025-01-06 00:00:00 | AMZN | ticker | style | growth | 70.000000 |
| 14 | 2025-01-06 00:00:00 | TSLA | ticker | style | value | 50.000000 |
| 15 | 2025-01-06 00:00:00 | TSLA | ticker | style | growth | 80.000000 |
| 16 | 2025-01-06 00:00:00 | XOM | ticker | style | value | 60.000000 |
| 17 | 2025-01-06 00:00:00 | XOM | ticker | style | growth | 90.000000 |
| 18 | 2025-01-06 00:00:00 | MSFT | ticker | style | momentum | -40.000000 |
| 19 | 2025-01-04 00:00:00 | MSFT | ticker | style | momentum | -20.000000 |
| 20 | 2025-01-05 00:00:00 | MSFT | ticker | style | momentum | -30.000000 |
| 21 | 2025-01-03 00:00:00 | JPM | ticker | style | momentum | -5.000000 |
| 22 | 2025-01-03 00:00:00 | MSFT | ticker | style | momentum | -10.000000 |
Housekeeping#
my_exposure_dataset.destroy()