Exposure Upload Commit Modes#

In this tutorial we are going to demonstrate the usage of the different commit modes for the Exposure Uploader.

As a prerequisite, see the Uploaders Tutorial for a comprehensive overview of the Data Uploaders API.

Specifically, we will explore:

- how to discover the available commit modes and what each one does
- the general-purpose append and overwrite modes
- append_from, which only admits data strictly after the last existing date
- append_factor and overwrite_factor, which operate on whole factors

For this tutorial we will use a very small dataframe, which makes it easy to point out the behavior of the different commit modes. (We test our implementation on global datasets with 20 years of history to ensure it scales in a performant way.)

For simplicity we’ll be using the fast_commit functionality and skip the staging step.

Imports & Setup#

For this tutorial notebook, you will need to import the following packages.

import io

import pandas as pd
import polars as pl

from bayesline.apiclient import BayeslineApiClient

We will also need to have a Bayesline API client configured.

bln = BayeslineApiClient.new_client(
    endpoint="https://[ENDPOINT]",
    api_key="[API-KEY]",
)
uploaders = bln.equity.uploaders
exposure_uploaders = uploaders.get_data_type("exposures")

my_exposure_dataset = exposure_uploaders.create_or_replace_dataset("Test-Exposure-Modes")

Finally, we add a helper function below that highlights the data changes between two versions.

def highlight_changes(old: pl.DataFrame, new: pl.DataFrame):
    cols = ["date", "asset_id", "asset_id_type", "factor_group", "factor", "exposure"]

    # Rows present in both versions (all columns identical): after concatenating
    # both frames, any full-row group that occurs more than once is unchanged.
    unchanged = (
        pl.concat([old, new])
        .group_by(cols)
        .len()
        .filter(pl.col("len") > 1)
        .drop("len")
        .to_pandas()
    )
    # Rows whose key (all columns except "exposure") only exists in the old version.
    dropped = old.join(new, on=cols[:-1], how="anti").to_pandas()
    # Rows whose key only exists in the new version.
    added = new.join(old, on=cols[:-1], how="anti").to_pandas()

    df = pd.concat([added, dropped, unchanged], ignore_index=True)
    # Style added rows green, dropped rows red with strike-through, rest unstyled.
    mask = pd.concat([
        pd.DataFrame({col: ["background-color: #d4edda"] * len(added) for col in cols}),
        pd.DataFrame({col: ["background-color: #f8d7da; text-decoration: line-through"] * len(dropped) for col in cols}),
        pd.DataFrame({col: [""] * len(unchanged) for col in cols}),
    ], ignore_index=True)
    return df.style.apply(lambda _: mask, axis=None)

Available Commit Modes#

Each uploader has different commit modes that are tailored to the type of data being handled. For example, a common use case for exposures is to add a new set of factors while leaving any existing factors untouched. For this use case the exposure uploader provides a dedicated commit mode, which we'll explore later.

Below we show how to obtain the available commit modes together with a description of their behavior.

my_exposure_dataset.get_commit_modes()
{'append': 'Appends new factor/date combinations to the existing data. Collisions will be ignored.',
 'append_factor': 'Appends new factors to the existing data. Collisions with existing factors will be ignored.',
 'overwrite': 'Overwrites the entire dataset with the new data.',
 'overwrite_factor': 'Overwrites every factor present in the incoming data.',
 'append_from': 'Appends new factor/date combinations to the existing data but only after the last date in the existing data. Collisions will be ignored.',
 'overwrite_from': 'Overwrites the entire dataset with the new data but only after the last date in the existing data.'}
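For quick reference in a notebook, we can print each mode next to its description; this is just a small convenience loop over the dictionary returned above.

# Print each commit mode with its description for quick reference.
for mode, description in my_exposure_dataset.get_commit_modes().items():
    print(f"{mode}: {description}")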

Initial Exposure Upload#

The initial exposure upload is applied against a new exposure dataset. For this we can use the append mode, but overwrite would work just the same.

Below we create our starting dataframe which we’ll modify as we go through this tutorial. Recall that we can get examples of the required input format from the API as well.

my_exposure_dataset.get_parser("Wide-Format").get_examples()[0]
shape: (3, 9)
date        asset_id  asset_id_type  style^momentum_6  style^momentum_12  style^growth  market^market  industry^consumer  industry^tech
date        str       str            f64               f64                f64           f64            f64                f64
2025-01-06  "GOOG"    "cusip9"       -0.3              -0.2               1.2           1.0            null               1.0
2025-01-06  "AAPL"    "cusip9"       0.1               0.5                1.1           1.0            1.0                null
2025-01-07  "GOOG"    "cusip9"       -0.28             -0.19              1.21          1.0            null               1.0
data_csv = """
date	asset_id	asset_id_type	style^momentum
2025-01-01	AAPL	ticker	1
2025-01-01	GOOG	ticker	2
2025-01-02	AAPL	ticker	1.5
2025-01-02	GOOG	ticker	2.5
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

exposures_df
shape: (4, 4)
date        asset_id  asset_id_type  style^momentum
date        str       str            f64
2025-01-01  "AAPL"    "ticker"       1.0
2025-01-01  "GOOG"    "ticker"       2.0
2025-01-02  "AAPL"    "ticker"       1.5
2025-01-02  "GOOG"    "ticker"       2.5
my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=1, committed_names=[])

If we now confirm what was committed, it is (unsurprisingly) exactly the dataset we uploaded (after parsing).

highlight_changes(
    pl.DataFrame(schema=my_exposure_dataset.get_schema()),
    my_exposure_dataset.get_data(version=1).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-01 00:00:00 AAPL ticker style momentum 1.000000
1 2025-01-01 00:00:00 GOOG ticker style momentum 2.000000
2 2025-01-02 00:00:00 AAPL ticker style momentum 1.500000
3 2025-01-02 00:00:00 GOOG ticker style momentum 2.500000

Overwriting exposures#

Next up we say 'we made a mistake and would like to overwrite the entire dataset'. An alternative would be to delete the dataset and recreate it, but that would also delete the entire version history; the overwrite commit mode starts fresh while keeping earlier versions recoverable.
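For contrast, the snippet below is deliberately left commented out: recreating the dataset would also give us a clean slate, but unlike an overwrite commit it would discard the version history.

# Not executed here: recreating the dataset also starts from a clean slate,
# but it would discard the entire version history along with the data.
# my_exposure_dataset = exposure_uploaders.create_or_replace_dataset("Test-Exposure-Modes")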

data_csv = """
date	asset_id	asset_id_type	style^value
2025-01-03	TSLA	ticker	8
2025-01-03	AMZN	ticker	9
2025-01-04	TSLA	ticker	10
2025-01-04	AMZN	ticker	11
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="overwrite")
UploadCommitResult(version=2, committed_names=[])

As expected, the data has been completely overwritten.

highlight_changes(
    my_exposure_dataset.get_data(version=1).collect(),
    my_exposure_dataset.get_data(version=2).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 TSLA ticker style value 8.000000
1 2025-01-03 00:00:00 AMZN ticker style value 9.000000
2 2025-01-04 00:00:00 TSLA ticker style value 10.000000
3 2025-01-04 00:00:00 AMZN ticker style value 11.000000
4 2025-01-01 00:00:00 AAPL ticker style momentum 1.000000
5 2025-01-01 00:00:00 GOOG ticker style momentum 2.000000
6 2025-01-02 00:00:00 AAPL ticker style momentum 1.500000
7 2025-01-02 00:00:00 GOOG ticker style momentum 2.500000

Note that we can still recover the previous version.

my_exposure_dataset.get_data(version=1).collect()
shape: (4, 6)
date        asset_id  asset_id_type  factor_group  factor      exposure
date        str       str            str           str         f32
2025-01-01  "AAPL"    "ticker"       "style"       "momentum"  1.0
2025-01-01  "GOOG"    "ticker"       "style"       "momentum"  2.0
2025-01-02  "AAPL"    "ticker"       "style"       "momentum"  1.5
2025-01-02  "GOOG"    "ticker"       "style"       "momentum"  2.5

Appending Arbitrary Records#

append means don't overwrite existing records. Its purpose is to prevent accidentally changing already-uploaded data. For exposures, a record already exists if its date, asset_id, asset_id_type, factor_group and factor all match an existing record. This means that any upload using the append mode will ignore (i.e. throw away) any record that already exists in the version being appended to.
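To make the collision rule concrete, here is a minimal polars sketch of the append semantics on long-format dataframes. This is our own illustration, not the uploader's actual implementation.

# Illustration only -- not the uploader's actual implementation.
# A record "exists" if all five key columns match; "exposure" is not part of the key.
key_cols = ["date", "asset_id", "asset_id_type", "factor_group", "factor"]

def simulate_append(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # The anti-join drops every incoming row whose key already exists.
    new_records = incoming.join(existing, on=key_cols, how="anti")
    return pl.concat([existing, new_records.select(existing.columns)])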

To show this with an example, we append the dataframe below. Note how the record for TSLA on 2025-01-04 already exists with value 10.0, and note how after the commit this value stays unchanged.

append is the most general way of adding new data without overwriting existing data. In this example we append time (a new date), a new factor, and a new asset all at once. Later we'll see how append_from and append_factor add further guards against accidentally introducing data into existing dates or factors.

Also note that we're using the long format for this upload, an alternative way of formatting the input data prior to ingestion.

data_csv = """
date	asset_id	asset_id_type	factor_group	factor	exposure
2025-01-03	XOM	ticker	style	growth	50
2025-01-04	XOM	ticker	style	growth	51
2025-01-05	XOM	ticker	style	growth	51
2025-01-03	XOM	ticker	style	value	12
2025-01-04	XOM	ticker	style	value	13
2025-01-05	XOM	ticker	style	value	14
2025-01-04	TSLA	ticker	style	value	9999

"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=3, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=2).collect(),
    my_exposure_dataset.get_data(version=3).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 XOM ticker style growth 50.000000
1 2025-01-03 00:00:00 XOM ticker style value 12.000000
2 2025-01-04 00:00:00 XOM ticker style growth 51.000000
3 2025-01-05 00:00:00 XOM ticker style growth 51.000000
4 2025-01-04 00:00:00 XOM ticker style value 13.000000
5 2025-01-05 00:00:00 XOM ticker style value 14.000000
6 2025-01-03 00:00:00 AMZN ticker style value 9.000000
7 2025-01-04 00:00:00 AMZN ticker style value 11.000000
8 2025-01-04 00:00:00 TSLA ticker style value 10.000000
9 2025-01-03 00:00:00 TSLA ticker style value 8.000000

Appending New Exposures#

Day over day, new exposures arrive and need to be appended to the existing exposure dataset. For this use case we may want to ensure that we don't accidentally add any data for dates that already exist.

Consider the example where our last upload was on 2025-01-05 and we are now appending exposures for 2025-01-06 and 2025-01-07. If some records from 2025-01-05 or before slipped into our input data that do not already exist in the uploaded exposures (e.g. exposures for newly added assets), using append would insert these records and therefore introduce new data into historical dates. This may or may not be desired, depending on the use case.

The commit mode append_from was designed to only append data that is strictly after the last date of the existing data, ensuring that no historical data accidentally slips in.
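Conceptually it behaves like the following polars sketch (again our own illustration on long-format dataframes, not the actual implementation):

# Illustration only: keep incoming rows strictly after the last existing date.
def simulate_append_from(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    last_date = existing["date"].max()
    later_records = incoming.filter(pl.col("date") > last_date)
    return pl.concat([existing, later_records.select(existing.columns)])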

In the example below we add data for 2025-01-06 but also a new historical record for asset MSFT on 2025-01-05. Note how this record won't make it into the committed data.

data_csv = """
date	asset_id	asset_id_type	style^value	style^growth
2025-01-05	MSFT	ticker	1000	2000
2025-01-06	AMZN	ticker	40	70
2025-01-06	TSLA	ticker	50	80
2025-01-06	XOM	ticker	60	90
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append_from")
UploadCommitResult(version=4, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=3).collect(),
    my_exposure_dataset.get_data(version=4).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-06 00:00:00 AMZN ticker style value 40.000000
1 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
2 2025-01-06 00:00:00 TSLA ticker style value 50.000000
3 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
4 2025-01-06 00:00:00 XOM ticker style value 60.000000
5 2025-01-06 00:00:00 XOM ticker style growth 90.000000
6 2025-01-04 00:00:00 XOM ticker style growth 51.000000
7 2025-01-03 00:00:00 AMZN ticker style value 9.000000
8 2025-01-05 00:00:00 XOM ticker style growth 51.000000
9 2025-01-04 00:00:00 TSLA ticker style value 10.000000
10 2025-01-04 00:00:00 AMZN ticker style value 11.000000
11 2025-01-03 00:00:00 XOM ticker style value 12.000000
12 2025-01-03 00:00:00 TSLA ticker style value 8.000000
13 2025-01-03 00:00:00 XOM ticker style growth 50.000000
14 2025-01-05 00:00:00 XOM ticker style value 14.000000
15 2025-01-04 00:00:00 XOM ticker style value 13.000000

Adding New Factors#

Adding new factors works similarly to appending new exposures, except that here we ensure no existing factors are edited. Any incoming record whose factor name already exists will be discarded.
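In the same illustrative polars terms (our own sketch, not the actual implementation), the rule looks like this:

# Illustration only: drop incoming rows whose factor already exists.
def simulate_append_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    existing_factors = existing.select("factor_group", "factor").unique()
    new_records = incoming.join(existing_factors, on=["factor_group", "factor"], how="anti")
    return pl.concat([existing, new_records.select(existing.columns)])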

The example below uses the append_factor mode, but the input dataframe also contains values for the existing factor value. Note how these records do not make it into the committed dataframe.

One possible use case for this mode is when the incoming data comes from a single spreadsheet to which an additional factor was added: we want to ignore all factors we have already captured and commit only the new one.

data_csv = """
date	asset_id	asset_id_type	style^value	style^momentum
2025-01-03	JPM	ticker	1000	-5
2025-01-03	MSFT	ticker	1000	-10
2025-01-04	MSFT	ticker	2000	-20
2025-01-05	MSFT	ticker	3000	-30
2025-01-06	MSFT	ticker	4000	-40
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append_factor")
UploadCommitResult(version=5, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=4).collect(),
    my_exposure_dataset.get_data(version=5).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 JPM ticker style momentum -5.000000
1 2025-01-03 00:00:00 MSFT ticker style momentum -10.000000
2 2025-01-04 00:00:00 MSFT ticker style momentum -20.000000
3 2025-01-05 00:00:00 MSFT ticker style momentum -30.000000
4 2025-01-06 00:00:00 MSFT ticker style momentum -40.000000
5 2025-01-03 00:00:00 AMZN ticker style value 9.000000
6 2025-01-05 00:00:00 XOM ticker style value 14.000000
7 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
8 2025-01-06 00:00:00 AMZN ticker style value 40.000000
9 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
10 2025-01-03 00:00:00 TSLA ticker style value 8.000000
11 2025-01-06 00:00:00 XOM ticker style growth 90.000000
12 2025-01-03 00:00:00 XOM ticker style growth 50.000000
13 2025-01-05 00:00:00 XOM ticker style growth 51.000000
14 2025-01-06 00:00:00 TSLA ticker style value 50.000000
15 2025-01-04 00:00:00 XOM ticker style value 13.000000
16 2025-01-04 00:00:00 TSLA ticker style value 10.000000
17 2025-01-03 00:00:00 XOM ticker style value 12.000000
18 2025-01-06 00:00:00 XOM ticker style value 60.000000
19 2025-01-04 00:00:00 XOM ticker style growth 51.000000
20 2025-01-04 00:00:00 AMZN ticker style value 11.000000

Replacing Factors#

Replacing factors is the inverse of appending factors: any factor that has at least one record in the incoming dataframe is wiped from the existing dataset and fully replaced with the incoming factor data.
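Sketched in the same illustrative polars terms (not the actual implementation): first wipe all existing records for the incoming factors, then append the incoming data wholesale.

# Illustration only: remove existing records for incoming factors, then append.
def simulate_overwrite_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    incoming_factors = incoming.select("factor_group", "factor").unique()
    kept = existing.join(incoming_factors, on=["factor_group", "factor"], how="anti")
    return pl.concat([kept, incoming.select(existing.columns)])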

In the example below we fully replace the growth and value factors.

data_csv = """
date	asset_id	asset_id_type	style^value	style^growth
2025-01-05	AVGO	ticker	1000	2000
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="overwrite_factor")
UploadCommitResult(version=6, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=5).collect(),
    my_exposure_dataset.get_data(version=6).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-05 00:00:00 AVGO ticker style value 1000.000000
1 2025-01-05 00:00:00 AVGO ticker style growth 2000.000000
2 2025-01-03 00:00:00 TSLA ticker style value 8.000000
3 2025-01-03 00:00:00 AMZN ticker style value 9.000000
4 2025-01-04 00:00:00 TSLA ticker style value 10.000000
5 2025-01-04 00:00:00 AMZN ticker style value 11.000000
6 2025-01-03 00:00:00 XOM ticker style growth 50.000000
7 2025-01-03 00:00:00 XOM ticker style value 12.000000
8 2025-01-04 00:00:00 XOM ticker style growth 51.000000
9 2025-01-05 00:00:00 XOM ticker style growth 51.000000
10 2025-01-04 00:00:00 XOM ticker style value 13.000000
11 2025-01-05 00:00:00 XOM ticker style value 14.000000
12 2025-01-06 00:00:00 AMZN ticker style value 40.000000
13 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
14 2025-01-06 00:00:00 TSLA ticker style value 50.000000
15 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
16 2025-01-06 00:00:00 XOM ticker style value 60.000000
17 2025-01-06 00:00:00 XOM ticker style growth 90.000000
18 2025-01-06 00:00:00 MSFT ticker style momentum -40.000000
19 2025-01-04 00:00:00 MSFT ticker style momentum -20.000000
20 2025-01-05 00:00:00 MSFT ticker style momentum -30.000000
21 2025-01-03 00:00:00 JPM ticker style momentum -5.000000
22 2025-01-03 00:00:00 MSFT ticker style momentum -10.000000

Housekeeping#

Finally, we delete the test dataset we created for this tutorial.

my_exposure_dataset.destroy()