Exposure Upload Commit Modes#

In this tutorial we are going to demonstrate the usage of the different commit modes for the Exposure Uploader.

As a prerequisite, see the Uploaders Tutorial for a comprehensive overview of the Data Uploaders API.

Specifically, we will explore:

- how to discover the available commit modes and what each one does
- the general-purpose append and overwrite modes
- append_from, which only admits data strictly after the last existing date
- append_factor and overwrite_factor, which operate on whole factors

For this tutorial we will use a very small dataframe, which makes it easy to point out the behavior of the different commit modes. (We test our implementation on global datasets with 20 years of history to ensure it scales in a performant way.)

For simplicity we’ll be using the fast_commit functionality and skip the staging step.

Imports & Setup#

For this tutorial notebook, you will need to import the following packages.

import io

import pandas as pd
import polars as pl

from bayesline.apiclient import BayeslineApiClient

We will also need to have a Bayesline API client configured.

bln = BayeslineApiClient.new_client(
    endpoint="https://[ENDPOINT]",
    api_key="[API-KEY]",
)
uploaders = bln.equity.uploaders
exposure_uploaders = uploaders.get_data_type("exposures")

my_exposure_dataset = exposure_uploaders.create_or_replace_dataset("Test-Exposure-Modes")

Finally, we add a helper function below that highlights the data changes between two versions.

def highlight_changes(old: pl.DataFrame, new: pl.DataFrame):
    cols = ["date", "asset_id", "asset_id_type", "factor_group", "factor", "exposure"]

    # Rows present in both versions (all columns identical): after concatenating
    # both frames, any full-row group that occurs more than once is unchanged.
    unchanged = (
        pl.concat([old, new])
        .group_by(cols)
        .len()
        .filter(pl.col("len") > 1)
        .drop("len")
        .to_pandas()
    )
    # Rows whose key (all columns except "exposure") only exists in the old version.
    dropped = old.join(new, on=cols[:-1], how="anti").to_pandas()
    # Rows whose key only exists in the new version.
    added = new.join(old, on=cols[:-1], how="anti").to_pandas()

    df = pd.concat([added, dropped, unchanged], ignore_index=True)
    # Style added rows green, dropped rows red with strike-through, rest unstyled.
    mask = pd.concat([
        pd.DataFrame({col: ["background-color: #d4edda"] * len(added) for col in cols}),
        pd.DataFrame({col: ["background-color: #f8d7da; text-decoration: line-through"] * len(dropped) for col in cols}),
        pd.DataFrame({col: [""] * len(unchanged) for col in cols}),
    ], ignore_index=True)
    return df.style.apply(lambda _: mask, axis=None)

Available Commit Modes#

Each uploader has different commit modes that are tailored to the type of data being handled. For example, a common use case for exposures is to add a new set of factors while leaving any existing factors untouched. For this use case the exposure uploader provides a dedicated commit mode, which we'll explore later.

Below we show how to obtain the available commit modes together with a description of their behavior.

my_exposure_dataset.get_commit_modes()
{'append': 'Appends new factor/date combinations to the existing data. Collisions will be ignored.',
 'append_factor': 'Appends new factors to the existing data. Collisions with existing factors will be ignored.',
 'overwrite': 'Overwrites the entire dataset with the new data.',
 'overwrite_factor': 'Overwrites every factor present in the incoming data.',
 'append_from': 'Appends new factor/date combinations to the existing data but only after the last date in the existing data. Collisions will be ignored.',
 'overwrite_from': 'Overwrites the entire dataset with the new data but only after the last date in the existing data.'}
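For quick reference in a notebook, we can print each mode next to its description; this is just a small convenience loop over the dictionary returned above.

# Print each commit mode with its description for quick reference.
for mode, description in my_exposure_dataset.get_commit_modes().items():
    print(f"{mode}: {description}")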

Initial Exposure Upload#

The initial exposure upload is applied against a new exposure dataset. For this we can use the append mode, but overwrite would work just the same.

Below we create our starting dataframe which we’ll modify as we go through this tutorial. Recall that we can get examples of the required input format from the API as well.

my_exposure_dataset.get_parser("Wide-Format").get_examples()[0]
shape: (3, 9)
date        asset_id  asset_id_type  style^momentum_6  style^momentum_12  style^growth  market^market  industry^consumer  industry^tech
date        str       str            f64               f64                f64           f64            f64                f64
2025-01-06  "GOOG"    "cusip9"       -0.3              -0.2               1.2           1.0            null               1.0
2025-01-06  "AAPL"    "cusip9"       0.1               0.5                1.1           1.0            1.0                null
2025-01-07  "GOOG"    "cusip9"       -0.28             -0.19              1.21          1.0            null               1.0
data_csv = """
date	asset_id	asset_id_type	style^momentum
2025-01-01	AAPL	ticker	1
2025-01-01	GOOG	ticker	2
2025-01-02	AAPL	ticker	1.5
2025-01-02	GOOG	ticker	2.5
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

exposures_df
shape: (4, 4)
date        asset_id  asset_id_type  style^momentum
date        str       str            f64
2025-01-01  "AAPL"    "ticker"       1.0
2025-01-01  "GOOG"    "ticker"       2.0
2025-01-02  "AAPL"    "ticker"       1.5
2025-01-02  "GOOG"    "ticker"       2.5
my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=1, committed_names=[])

If we now confirm what was committed, it is (unsurprisingly) exactly the dataset we uploaded (after parsing).

highlight_changes(
    pl.DataFrame(schema=my_exposure_dataset.get_schema()),
    my_exposure_dataset.get_data(version=1).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-01 00:00:00 AAPL ticker style momentum 1.000000
1 2025-01-01 00:00:00 GOOG ticker style momentum 2.000000
2 2025-01-02 00:00:00 AAPL ticker style momentum 1.500000
3 2025-01-02 00:00:00 GOOG ticker style momentum 2.500000

Overwriting exposures#

Next up we say 'we made a mistake and would like to overwrite the entire dataset'. An alternative would be to delete the dataset and recreate it, but that would also delete the entire version history; the overwrite commit mode starts fresh while keeping earlier versions recoverable.
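For contrast, the snippet below is deliberately left commented out: recreating the dataset would also give us a clean slate, but unlike an overwrite commit it would discard the version history.

# Not executed here: recreating the dataset also starts from a clean slate,
# but it would discard the entire version history along with the data.
# my_exposure_dataset = exposure_uploaders.create_or_replace_dataset("Test-Exposure-Modes")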

data_csv = """
date	asset_id	asset_id_type	style^value
2025-01-03	TSLA	ticker	8
2025-01-03	AMZN	ticker	9
2025-01-04	TSLA	ticker	10
2025-01-04	AMZN	ticker	11
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="overwrite")
UploadCommitResult(version=2, committed_names=[])

As expected, the data has been completely overwritten.

highlight_changes(
    my_exposure_dataset.get_data(version=1).collect(),
    my_exposure_dataset.get_data(version=2).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 TSLA ticker style value 8.000000
1 2025-01-03 00:00:00 AMZN ticker style value 9.000000
2 2025-01-04 00:00:00 TSLA ticker style value 10.000000
3 2025-01-04 00:00:00 AMZN ticker style value 11.000000
4 2025-01-01 00:00:00 AAPL ticker style momentum 1.000000
5 2025-01-01 00:00:00 GOOG ticker style momentum 2.000000
6 2025-01-02 00:00:00 AAPL ticker style momentum 1.500000
7 2025-01-02 00:00:00 GOOG ticker style momentum 2.500000

Note that we can still recover the previous version.

my_exposure_dataset.get_data(version=1).collect()
shape: (4, 6)
date        asset_id  asset_id_type  factor_group  factor      exposure
date        str       str            str           str         f32
2025-01-01  "AAPL"    "ticker"       "style"       "momentum"  1.0
2025-01-01  "GOOG"    "ticker"       "style"       "momentum"  2.0
2025-01-02  "AAPL"    "ticker"       "style"       "momentum"  1.5
2025-01-02  "GOOG"    "ticker"       "style"       "momentum"  2.5

Appending Arbitrary Records#

append means don't overwrite existing records. Its purpose is to prevent accidentally changing already-uploaded data. For exposures, a record already exists if its date, asset_id, asset_id_type, factor_group and factor all match an existing record. This means that any upload using the append mode will ignore (i.e. throw away) any record that already exists in the version being appended to.
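To make the collision rule concrete, here is a minimal polars sketch of the append semantics on long-format dataframes. This is our own illustration, not the uploader's actual implementation.

# Illustration only -- not the uploader's actual implementation.
# A record "exists" if all five key columns match; "exposure" is not part of the key.
key_cols = ["date", "asset_id", "asset_id_type", "factor_group", "factor"]

def simulate_append(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    # The anti-join drops every incoming row whose key already exists.
    new_records = incoming.join(existing, on=key_cols, how="anti")
    return pl.concat([existing, new_records.select(existing.columns)])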

To show this with an example, we append the dataframe below. Note how the record for TSLA on 2025-01-04 already exists with value 10.0, and note how after the commit this value stays unchanged.

append is the most general way of adding new data without overwriting existing data. In this example we append time (a new date), a new factor, and a new asset all at once. Later we'll see how append_from and append_factor add further guards against accidentally introducing data into existing dates or factors.

Also note that we're using the long format for this upload, an alternative way of formatting the input data prior to ingestion.

data_csv = """
date	asset_id	asset_id_type	factor_group	factor	exposure
2025-01-03	XOM	ticker	style	growth	50
2025-01-04	XOM	ticker	style	growth	51
2025-01-05	XOM	ticker	style	growth	51
2025-01-03	XOM	ticker	style	value	12
2025-01-04	XOM	ticker	style	value	13
2025-01-05	XOM	ticker	style	value	14
2025-01-04	TSLA	ticker	style	value	9999

"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append")
UploadCommitResult(version=3, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=2).collect(),
    my_exposure_dataset.get_data(version=3).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 XOM ticker style growth 50.000000
1 2025-01-03 00:00:00 XOM ticker style value 12.000000
2 2025-01-04 00:00:00 XOM ticker style growth 51.000000
3 2025-01-05 00:00:00 XOM ticker style growth 51.000000
4 2025-01-04 00:00:00 XOM ticker style value 13.000000
5 2025-01-05 00:00:00 XOM ticker style value 14.000000
6 2025-01-03 00:00:00 AMZN ticker style value 9.000000
7 2025-01-04 00:00:00 AMZN ticker style value 11.000000
8 2025-01-04 00:00:00 TSLA ticker style value 10.000000
9 2025-01-03 00:00:00 TSLA ticker style value 8.000000

Appending New Exposures#

Day over day, new exposures arrive and need to be appended to the existing exposure dataset. For this use case we may want to ensure that we don't accidentally add any data for dates that already exist.

Consider the example where our last upload was on 2025-01-05 and we are now appending exposures for 2025-01-06 and 2025-01-07. If some records from 2025-01-05 or before slipped into our input data that do not already exist in the uploaded exposures (e.g. exposures for newly added assets), using append would insert these records and therefore introduce new data into historical dates. This may or may not be desired, depending on the use case.

The commit mode append_from was designed to only append data that is strictly after the last date of the existing data, ensuring that no historical data accidentally slips in.
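Conceptually it behaves like the following polars sketch (again our own illustration on long-format dataframes, not the actual implementation):

# Illustration only: keep incoming rows strictly after the last existing date.
def simulate_append_from(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    last_date = existing["date"].max()
    later_records = incoming.filter(pl.col("date") > last_date)
    return pl.concat([existing, later_records.select(existing.columns)])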

In the example below we add data for 2025-01-06 but also a new historical record for asset MSFT on 2025-01-05. Note how this record won't make it into the committed data.

data_csv = """
date	asset_id	asset_id_type	style^value	style^growth
2025-01-05	MSFT	ticker	1000	2000
2025-01-06	AMZN	ticker	40	70
2025-01-06	TSLA	ticker	50	80
2025-01-06	XOM	ticker	60	90
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append_from")
UploadCommitResult(version=4, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=3).collect(),
    my_exposure_dataset.get_data(version=4).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-06 00:00:00 AMZN ticker style value 40.000000
1 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
2 2025-01-06 00:00:00 TSLA ticker style value 50.000000
3 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
4 2025-01-06 00:00:00 XOM ticker style value 60.000000
5 2025-01-06 00:00:00 XOM ticker style growth 90.000000
6 2025-01-04 00:00:00 XOM ticker style growth 51.000000
7 2025-01-03 00:00:00 AMZN ticker style value 9.000000
8 2025-01-05 00:00:00 XOM ticker style growth 51.000000
9 2025-01-04 00:00:00 TSLA ticker style value 10.000000
10 2025-01-04 00:00:00 AMZN ticker style value 11.000000
11 2025-01-03 00:00:00 XOM ticker style value 12.000000
12 2025-01-03 00:00:00 TSLA ticker style value 8.000000
13 2025-01-03 00:00:00 XOM ticker style growth 50.000000
14 2025-01-05 00:00:00 XOM ticker style value 14.000000
15 2025-01-04 00:00:00 XOM ticker style value 13.000000

Adding New Factors#

Adding new factors works similarly to appending new exposures, except that here we ensure no existing factors are edited. Any incoming record whose factor name already exists will be discarded.
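In the same illustrative polars terms (our own sketch, not the actual implementation), the rule looks like this:

# Illustration only: drop incoming rows whose factor already exists.
def simulate_append_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    existing_factors = existing.select("factor_group", "factor").unique()
    new_records = incoming.join(existing_factors, on=["factor_group", "factor"], how="anti")
    return pl.concat([existing, new_records.select(existing.columns)])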

The example below uses the append_factor mode, but the input dataframe also contains values for the existing factor value. Note how these records do not make it into the committed dataframe.

One possible use case for this mode is when the incoming data comes from a single spreadsheet to which an additional factor was added: we want to ignore all factors we have already captured and commit only the new one.

data_csv = """
date	asset_id	asset_id_type	style^value	style^momentum
2025-01-03	JPM	ticker	1000	-5
2025-01-03	MSFT	ticker	1000	-10
2025-01-04	MSFT	ticker	2000	-20
2025-01-05	MSFT	ticker	3000	-30
2025-01-06	MSFT	ticker	4000	-40
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="append_factor")
UploadCommitResult(version=5, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=4).collect(),
    my_exposure_dataset.get_data(version=5).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-03 00:00:00 JPM ticker style momentum -5.000000
1 2025-01-03 00:00:00 MSFT ticker style momentum -10.000000
2 2025-01-04 00:00:00 MSFT ticker style momentum -20.000000
3 2025-01-05 00:00:00 MSFT ticker style momentum -30.000000
4 2025-01-06 00:00:00 MSFT ticker style momentum -40.000000
5 2025-01-03 00:00:00 AMZN ticker style value 9.000000
6 2025-01-05 00:00:00 XOM ticker style value 14.000000
7 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
8 2025-01-06 00:00:00 AMZN ticker style value 40.000000
9 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
10 2025-01-03 00:00:00 TSLA ticker style value 8.000000
11 2025-01-06 00:00:00 XOM ticker style growth 90.000000
12 2025-01-03 00:00:00 XOM ticker style growth 50.000000
13 2025-01-05 00:00:00 XOM ticker style growth 51.000000
14 2025-01-06 00:00:00 TSLA ticker style value 50.000000
15 2025-01-04 00:00:00 XOM ticker style value 13.000000
16 2025-01-04 00:00:00 TSLA ticker style value 10.000000
17 2025-01-03 00:00:00 XOM ticker style value 12.000000
18 2025-01-06 00:00:00 XOM ticker style value 60.000000
19 2025-01-04 00:00:00 XOM ticker style growth 51.000000
20 2025-01-04 00:00:00 AMZN ticker style value 11.000000

Replacing Factors#

Replacing factors is the inverse of appending factors: any factor that has at least one record in the incoming dataframe is wiped from the existing dataset and fully replaced with the incoming factor data.
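Sketched in the same illustrative polars terms (not the actual implementation): first wipe all existing records for the incoming factors, then append the incoming data wholesale.

# Illustration only: remove existing records for incoming factors, then append.
def simulate_overwrite_factor(existing: pl.DataFrame, incoming: pl.DataFrame) -> pl.DataFrame:
    incoming_factors = incoming.select("factor_group", "factor").unique()
    kept = existing.join(incoming_factors, on=["factor_group", "factor"], how="anti")
    return pl.concat([kept, incoming.select(existing.columns)])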

In the example below we fully replace the growth and value factors.

data_csv = """
date	asset_id	asset_id_type	style^value	style^growth
2025-01-05	AVGO	ticker	1000	2000
"""

exposures_df = pl.read_csv(io.StringIO(data_csv.strip()), separator="\t", try_parse_dates=True)

my_exposure_dataset.fast_commit(exposures_df, mode="overwrite_factor")
UploadCommitResult(version=6, committed_names=[])
highlight_changes(
    my_exposure_dataset.get_data(version=5).collect(),
    my_exposure_dataset.get_data(version=6).collect()
)
  date asset_id asset_id_type factor_group factor exposure
0 2025-01-05 00:00:00 AVGO ticker style value 1000.000000
1 2025-01-05 00:00:00 AVGO ticker style growth 2000.000000
2 2025-01-03 00:00:00 TSLA ticker style value 8.000000
3 2025-01-03 00:00:00 AMZN ticker style value 9.000000
4 2025-01-04 00:00:00 TSLA ticker style value 10.000000
5 2025-01-04 00:00:00 AMZN ticker style value 11.000000
6 2025-01-03 00:00:00 XOM ticker style growth 50.000000
7 2025-01-03 00:00:00 XOM ticker style value 12.000000
8 2025-01-04 00:00:00 XOM ticker style growth 51.000000
9 2025-01-05 00:00:00 XOM ticker style growth 51.000000
10 2025-01-04 00:00:00 XOM ticker style value 13.000000
11 2025-01-05 00:00:00 XOM ticker style value 14.000000
12 2025-01-06 00:00:00 AMZN ticker style value 40.000000
13 2025-01-06 00:00:00 AMZN ticker style growth 70.000000
14 2025-01-06 00:00:00 TSLA ticker style value 50.000000
15 2025-01-06 00:00:00 TSLA ticker style growth 80.000000
16 2025-01-06 00:00:00 XOM ticker style value 60.000000
17 2025-01-06 00:00:00 XOM ticker style growth 90.000000
18 2025-01-06 00:00:00 MSFT ticker style momentum -40.000000
19 2025-01-04 00:00:00 MSFT ticker style momentum -20.000000
20 2025-01-05 00:00:00 MSFT ticker style momentum -30.000000
21 2025-01-03 00:00:00 JPM ticker style momentum -5.000000
22 2025-01-03 00:00:00 MSFT ticker style momentum -10.000000

Housekeeping#

Finally, we delete the test dataset we created for this tutorial.

my_exposure_dataset.destroy()