Risk Datasets#

In this tutorial we demonstrate the usage of the Bayesline Risk Datasets API, which lets us define new datasets that serve as the foundation for estimating factor risk models.

A risk dataset comprises all underlying data necessary to build a risk model:

  • factor exposures (including style exposures, industry/regional exposures, etc.)

  • asset price data

  • fundamental data (e.g. market caps)

  • security master data

In this first iteration, custom exposures can be ingested into the Bayesline ecosystem while Bayesline data is used for the rest. In subsequent product iterations users will be able to bring custom data for all other items, allowing them to mix and match which data is supplied by the user and which by Bayesline.

In this notebook we will introduce and explore:

  • system datasets and user datasets and how to list them

  • how to create a new risk dataset

  • how to create a new risk dataset with custom exposures

  • how to add exposures to an existing risk dataset

Imports & Setup#

For this tutorial notebook, you will need to import the following packages.

import datetime as dt
import numpy as np
import polars as pl

from bayesline.apiclient import BayeslineApiClient

from bayesline.api.equity import (
    ExposureSettings,
    FactorRiskModelSettings, 
    IndustrySettings,
    RegionSettings,
    RiskDatasetHuberRegressionExposureSettings,
    RiskDatasetSettings,
    RiskDatasetUnitExposureSettings,
    RiskDatasetUploadedExposureSettings,
    UniverseSettings
)

We will also need to have a Bayesline API client configured.

bln = BayeslineApiClient.new_client(
    endpoint="https://[ENDPOINT]",
    api_key="[API-KEY]",
)

The main entrypoint for the Risk Datasets API sits on bln.equity.riskdatasets. All dataset functionality can be reached from there.

See here for relevant docs:

risk_datasets = bln.equity.riskdatasets

Obtaining Available Datasets#

To list existing datasets we use the get_dataset_names method. When new datasets are created, their names appear in this list and are used downstream when creating risk model specifications.

We distinguish system and user datasets. System datasets are available to all users, e.g. the Bayesline-Global dataset. User datasets are created and owned by an individual user.

# default is "All", i.e. both system and user datasets
risk_datasets.get_dataset_names()
['Bayesline-US-500-1y', 'Bayesline-US-All-1y']
risk_datasets.get_dataset_names(mode="User")
[]
risk_datasets.get_dataset_names(mode="System")
['Bayesline-US-500-1y', 'Bayesline-US-All-1y']

A default dataset is a system dataset that is used in the absence of a concretely specified dataset when creating a risk model. By definition it is the first result of risk_datasets.get_dataset_names().

risk_datasets.get_default_dataset_name()
'Bayesline-US-500-1y'
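
For example, a risk model built from default settings without a dataset argument should resolve to this default. A minimal sketch; we assume here that FactorRiskModelSettings.default can also be called without a dataset argument:

# Assumption: omitting the dataset argument falls back to the default dataset.
default_settings = FactorRiskModelSettings.default()

# Equivalent to pinning the default explicitly:
pinned_settings = FactorRiskModelSettings.default(
    dataset=risk_datasets.get_default_dataset_name()
)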

Creating a New Dataset#

When creating a new risk dataset we use the create_dataset method, for which we need to provide a dataset name and a RiskDatasetSettings object.

At the bare minimum we need to specify a reference dataset, i.e. an existing dataset from which all input data will be sourced. Customization is then introduced by selectively specifying which data the user brings.

Note that the minimal configuration below effectively creates a copy of the reference dataset.

See here for relevant docs:

settings = RiskDatasetSettings(
    reference_dataset="Bayesline-US-All-1y"
)
risk_dataset_api = risk_datasets.create_dataset("My-Dataset", settings=settings)

The create_dataset invocation above does two things:

  1. It adds the given settings to the settings registry under the given name.

  2. It produces the physical dataset according to those settings.

Note that we could also have saved the settings to the settings registry directly, which would have skipped step 2. This is perfectly feasible but requires us to invoke the dataset creation separately (explained below).

We can verify that the settings were registered by inspecting the registry directly. Note that system datasets are not included here.

risk_datasets.settings.names()
{'My-Dataset': 0}
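
If we only want step 1, i.e. registering the settings without producing the physical dataset, we can write to the settings registry directly. A minimal sketch, assuming a hypothetical save method on the registry:

# Hypothetical: register the settings only; no physical dataset is produced yet.
# The dataset can later be materialized via load(...) and update() (see below).
risk_datasets.settings.save("My-Dataset-Settings-Only", settings)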

One thing we can do immediately with the returned risk_dataset_api is describe it and inspect the available styles, industries, etc. Note that these flow straight through to the relevant settings menus on bln.equity.universes, bln.equity.exposures, etc.

See here for relevant docs:

risk_dataset_props = risk_dataset_api.describe()

risk_dataset_props.exposure_settings_menu.styles
{'size': ['log_market_cap', 'log_total_assets'],
 'value': ['book_to_price'],
 'growth': ['price_to_earnings'],
 'volatility': ['sigma', 'sigma_eps', 'beta'],
 'momentum': ['mom6', 'mom12'],
 'dividend': ['dividend_yield'],
 'leverage': ['debt_to_assets', 'debt_to_equity']}
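
The group keys under styles are the labels we later pass to ExposureSettings, and the listed substyle names are the valid values. For example, a sketch mirroring the pattern used later in this tutorial:

sketch_exposure_settings = ExposureSettings(
    market=True,
    styles={
        # user-chosen group labels mapping to substyles from the menu above
        "Size": ["log_market_cap"],
        "Momentum": ["mom6", "mom12"],
    },
    regions=None,
    industries=None,
)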

Loading and Updating an Existing Dataset#

Step 2 from above can be invoked manually at any time to trigger a full recreation of the dataset, using the latest versions of all referenced datasets. To do this we simply load back the dataset we previously created, using either its name or its globally unique identifier. Note that the system tracks the versions of all input data, so a dataset won't be updated if it is already at the latest version.

risk_dataset_api = risk_datasets.load("My-Dataset")
update_result = risk_dataset_api.update()

The RiskDatasetUpdateResult gives summary information about the update process.

update_result
RiskDatasetUpdateResult()

Using the Custom Dataset#

We can now use the dataset to produce risk models.

riskmodel_engine = bln.equity.riskmodels.load(
    FactorRiskModelSettings.default(dataset="My-Dataset")
)
risk_model_api = riskmodel_engine.get()
risk_model_api.fret().head()
shape: (5, 23)
columns: date (date), plus 22 f32 factor-return columns over 2024-07-11 through 2024-07-17:
market.Market, industry.Energy, industry.Basic Materials, industry.Industrials,
industry.Consumer Cyclicals, industry.Consumer Non-Cyclicals, industry.Financials,
industry.Healthcare, industry.Technology, industry.Utilities, industry.Real Estate,
industry.Institutions, Associations & Organizations, industry.Government Activity,
industry.Academic & Educational Services, region.United States, style.Size,
style.Value, style.Growth, style.Volatility, style.Momentum, style.Dividend,
style.Leverage
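
The fret() result is a plain Polars DataFrame with a date column plus one f32 column per factor, so standard transformations apply directly. For example, compounding the daily factor returns into cumulative factor returns (assumes a recent Polars version with cum_prod):

fret_df = risk_model_api.fret()
factor_cols = [c for c in fret_df.columns if c != "date"]

# Compound each daily factor return series into a cumulative return.
cum_fret_df = fret_df.sort("date").with_columns(
    [(pl.col(c) + 1.0).cum_prod() - 1.0 for c in factor_cols]
)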

Updating System Datasets#

Users with administrator permissions can update system wide datasets, e.g. Bayesline-Global. In practice this means that the remote source will be checked for new data (e.g. as part of a daily data update) and any changes will be incorporated into the Bayesline ecosystem. Updating a system dataset affects all users.

bayesline_risk_dataset = risk_datasets.load("Bayesline-US-All-1y")
bayesline_risk_dataset.update()
RiskDatasetUpdateResult()

Creating a Custom Risk Dataset#

For the remainder of the tutorial we will:

  • upload sample exposures

  • upload a simulated time series (e.g. an oil price)

  • create a custom risk dataset using the uploaded exposures, and estimate Huber regression exposures for the simulated time series.

Uploading Custom Exposures#

In the example below we upload a set of sample exposures for the top 100 US companies. We first use the Exposures API to create the sample exposures, and then the Uploaders API (see the Data Uploaders Tutorial for a detailed walkthrough) to upload them as a custom exposure dataset.

Creating Sample Exposures#

exposures_api = bln.equity.exposures.load(ExposureSettings(
    market=False,
    styles={
        "Size": ["log_market_cap"],
        "Momentum": ["mom12"],
    },
    regions=None,
    industries=None,
))
universe_settings = UniverseSettings(
    dataset="Bayesline-US-All-1y",
    region=RegionSettings(
        hierarchy="continent",
        include=["USA"],
    )
)
exposures_df = exposures_api.get(universe_settings)
exposures_df.tail()
shape: (5, 4)
date        bayesid  style.Size  style.Momentum
date        str      f32         f32
2025-07-08  "ZSPC"   -0.812988   -0.376465
2025-07-08  "ZTR"    -0.786621   1.366211
2025-07-08  "ZUMZ"   -0.904785   -1.102539
2025-07-08  "ZVIA"   -0.904785   1.3671875
2025-07-08  "ZVVT"   -1.046875   0.839844
top_100_assets = (
    exposures_df
    .group_by("bayesid")
    .agg(pl.col("style.Size").mean())
    .sort("style.Size")
    .tail(100)
    .select("bayesid")
)

top_100_assets.head()
shape: (5, 1)
bayesid
str
"VRTX"
"BSX"
"92290874"
"MDT"
"SBUX"
exposures_df = (
    exposures_df
    .join(top_100_assets, on="bayesid", how="semi")
)

Upload the Sample Exposures#

uploaders = bln.equity.uploaders
uploaders.get_data_types()
['exposures', 'factors', 'hierarchies', 'portfolios']
exposure_uploader = uploaders.get_data_type("exposures")
exposure_dataset = exposure_uploader.get_or_create_dataset("My-US-Top100-Exposures")

For the uploader we need to provide one of the accepted input formats. Below we choose the Long-Format parser and transform our exposures_df to fit this format.

exposure_dataset.get_parser_names()
['Long-Format', 'Wide-Format']
exposure_dataset.get_parser("Long-Format").get_examples()[0]
shape: (8, 6)
date        asset_id  asset_id_type  factor_group  factor        exposure
date        str       str            str           str           f64
2025-01-06  "GOOG"    "cusip9"       "style"       "momentum_6"  -0.3
2025-01-06  "GOOG"    "cusip9"       "market"      "market"      1.0
2025-01-06  "AAPL"    "cusip9"       "style"       "momentum_6"  0.1
2025-01-06  "AAPL"    "cusip9"       "market"      "market"      1.0
2025-01-07  "GOOG"    "cusip9"       "style"       "momentum_6"  -0.28
2025-01-07  "GOOG"    "cusip9"       "market"      "market"      1.0
2025-01-07  "AAPL"    "cusip9"       "style"       "momentum_6"  0.0
2025-01-07  "AAPL"    "cusip9"       "market"      "market"      1.0
upload_df = (
    exposures_df.filter(pl.col("date") <= dt.date(2024, 12, 31))
    .rename({"bayesid": "asset_id", "style.Size": "size", "style.Momentum": "momentum"})
    .unpivot(
        on=["size", "momentum"],
        index=["date", "asset_id"],
        variable_name="factor",
        value_name="exposure",
    )
    .with_columns(
        pl.lit("bayesid").alias("asset_id_type"),
        pl.lit("style").alias("factor_group")
    )
)

upload_df.tail()
shape: (5, 6)
date        asset_id    factor      exposure   asset_id_type  factor_group
date        str         str         f32        str            str
2024-12-31  "VZ"        "momentum"  0.144531   "bayesid"      "style"
2024-12-31  "WFC"       "momentum"  1.463867   "bayesid"      "style"
2024-12-31  "WMT"       "momentum"  2.277344   "bayesid"      "style"
2024-12-31  "XOM"       "momentum"  -0.068665  "bayesid"      "style"
2024-12-31  "Y0486S10"  "momentum"  0.584961   "bayesid"      "style"
exposure_dataset.fast_commit(upload_df, mode="append")
UploadCommitResult(version=1, committed_names=[])

To verify that our data was uploaded correctly we can obtain the data back from the exposure dataset.

exposure_dataset.get_data().collect().head()
shape: (5, 6)
date        asset_id    asset_id_type  factor_group  factor  exposure
date        str         str            str           str     f32
2024-07-10  "00163T10"  "bayesid"      "style"       "size"  2.400391
2024-07-10  "00287Y10"  "bayesid"      "style"       "size"  3.003906
2024-07-10  "04206820"  "bayesid"      "style"       "size"  2.431641
2024-07-10  "09253U10"  "bayesid"      "style"       "size"  2.611328
2024-07-10  "30303M10"  "bayesid"      "style"       "size"  3.517578
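
As an additional consistency check (assuming the dataset was empty before our commit), the number of stored rows should match the number of rows we uploaded:

uploaded_df = exposure_dataset.get_data().collect()

# Every committed row should be present exactly once.
assert uploaded_df.height == upload_df.height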

Uploading Time Series Data#

In the example below we upload two hypothetical time series (e.g. the oil price or the total returns of a technology index). These can then be used to create asset-level exposures using Bayesline's Huber regression framework.

Creating Sample Time Series#

Below we simply create two random time series by sampling a normal distribution with a positive drift.

dates = upload_df["date"].unique().sort()
mu, sigma = 0.0002, 0.01
rng = np.random.default_rng(seed=42)
returns_oil = rng.normal(mu, sigma, size=len(dates))
returns_tech = rng.normal(mu, sigma, size=len(dates))

returns_df = pl.DataFrame({
    "date": dates,
    "returns_oil": returns_oil,
    "returns_tech": returns_tech
})

returns_df.tail()
shape: (5, 3)
date        returns_oil  returns_tech
date        f64          f64
2024-12-27  -0.006348    -0.020432
2024-12-28  0.004661     -0.005711
2024-12-29  -0.00435     0.006109
2024-12-30  -0.012056    -0.015616
2024-12-31  -0.012579    0.014959
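
As a quick sanity check on the simulation, the sample standard deviation should come out close to sigma; with only around 120 daily observations the sample mean remains noisy around mu:

# Sample moments of the simulated draws.
print(returns_oil.mean(), returns_oil.std())
print(returns_tech.mean(), returns_tech.std())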

Upload the Factor Time Series#

This is the same as above, except that we use the factors data type for the uploader.

uploaders.get_data_types()
['exposures', 'factors', 'hierarchies', 'portfolios']
factor_uploader = bln.equity.uploaders.get_data_type("factors")
factor_ts_dataset = factor_uploader.get_or_create_dataset("Oil-and-Tech-Returns")
factor_ts_dataset.fast_commit(returns_df, mode="append")
UploadCommitResult(version=1, committed_names=[])
factor_ts_dataset.get_data().collect().tail()
shape: (5, 3)
date        factor          value
date        str             f32
2024-12-27  "returns_tech"  -0.020432
2024-12-28  "returns_tech"  -0.005711
2024-12-29  "returns_tech"  0.006109
2024-12-30  "returns_tech"  -0.015616
2024-12-31  "returns_tech"  0.014959

Creating the Custom Risk Dataset#

Recall that we named the custom exposure dataset My-US-Top100-Exposures and the factor time series dataset Oil-and-Tech-Returns. We will use these names to specify that the exposure input data for the new risk dataset should be sourced from these uploads.

Also note from above that we specified style as the factor group. Factor groups are used to logically group exposures into styles, regions, industries, etc. This is particularly important if we bring more than one industry or region schema (e.g. TRBC and GICS). Below we specify only the style_factor_group (meaning no other exposure groups will be brought in, even if they existed in our uploaded exposure dataset).

Lastly, note that below we specify exposures as a list. We can reference more than one exposure upload and create a consolidated risk dataset from it. In fact we do just that in this example, where we bring in two different sources of exposures.

Note the following nuance:

  • we always need to specify 1) a market factor, 2) at least one region hierarchy and 3) at least one industry hierarchy. In the absence of bringing our own, we can stub in unit exposure dummies.

Below we use default settings for both the uploaded exposures and the Huber regressions. There is an extensive set of available options; see below for relevant docs:

riskdataset_settings = RiskDatasetSettings(
    reference_dataset="Bayesline-US-All-1y",
    exposures=[
        RiskDatasetUnitExposureSettings.market(),
        RiskDatasetUnitExposureSettings.industry(),
        RiskDatasetUnitExposureSettings.region(),
        RiskDatasetUploadedExposureSettings(
            exposure_source="My-US-Top100-Exposures",
            style_factor_group="style",
        ),
        RiskDatasetHuberRegressionExposureSettings(
            tsfactors_source="Oil-and-Tech-Returns",
        ),
    ],
)
risk_dataset_api = risk_datasets.create_dataset("My-Risk-Dataset", settings=riskdataset_settings)
risk_dataset_props = risk_dataset_api.describe()
risk_dataset_props.exposure_settings_menu.styles
{'momentum-all': ['momentum'],
 'size-all': ['size'],
 'returns_oil-all': ['returns_oil'],
 'returns_tech-all': ['returns_tech']}
risk_dataset_props.exposure_settings_menu.industries
{'industry': {'ALL': ['industry']}}
risk_dataset_props.exposure_settings_menu.regions
{'region': {'ALL': ['world']}}

First, we might be interested in what the Huber-regression-based exposures worked out to be. Everything is linked with the rest of the Bayesline ecosystem, so we can simply pick them up through the Exposures API.

universe_settings = UniverseSettings(
    dataset="My-Risk-Dataset",
    industry=IndustrySettings(
        hierarchy="industry",
        include=["ALL"],
    ),
    region=RegionSettings(
        hierarchy="region",
        include=["ALL"],
    ),
)

exposure_settings = ExposureSettings(
    market=True,
    styles={
        "momentum-all": ["momentum"],
        "size-all": ["size"],
        "returns_oil-all": ["returns_oil"],
        "returns_tech-all": ["returns_tech"],
    },
    regions=None, # no region factors
    industries=None, # no industry factors
)
exposures_api = bln.equity.exposures.load(exposure_settings)
exposures_df_my_model = exposures_api.get(universe_settings)
exposures_df_my_model.filter(pl.col("bayesid") == "AAPL").tail()
shape: (5, 7)
date        bayesid  market.market  style.momentum  style.size  style.returns_oil  style.returns_tech
date        str      f32            f32             f32         f32                f32
2024-12-27  "AAPL"   1.0            1.905273        3.683594    -0.304932          0.029251
2024-12-28  "AAPL"   1.0            1.902344        3.652344    -0.306152          0.031982
2024-12-29  "AAPL"   1.0            1.905273        3.652344    -0.305176          0.029434
2024-12-30  "AAPL"   1.0            1.831055        3.65625     -0.27417           0.042938
2024-12-31  "AAPL"   1.0            1.999023        3.720703    -0.150391          -0.03952
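
Note the market.market column: it stems from the unit exposure stub and is therefore identically 1.0. A quick verification on the frame we just pulled:

# The unit market exposure is constant 1.0 across all assets and dates.
assert (exposures_df_my_model["market.market"] == 1.0).all()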

We can now build a factor risk model with the risk dataset we just created.

riskmodel_engine = bln.equity.riskmodels.load(
    FactorRiskModelSettings(
        universe=universe_settings,
        exposures=exposure_settings
    )
)
risk_model_api = riskmodel_engine.get()
risk_model_api.fret().tail()
shape: (5, 6)
date        market.market  style.momentum  style.size  style.returns_oil  style.returns_tech
date        f32            f32             f32         f32                f32
2024-12-24  0.008057       0.000097        0.001062    -0.00185           -0.001395
2024-12-26  0.007146       -0.000793       -0.002486   -0.000478          -0.00074
2024-12-27  -0.010691      0.000498        0.000093    0.00051            0.00228
2024-12-30  -0.006427      0.000687        -0.002408   0.000911           0.001666
2024-12-31  0.002228       -0.001488       -0.000777   0.000137           0.00099
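
As with any factor model, these factor returns can feed a covariance estimate. A minimal sketch using the plain sample covariance of the daily factor returns (annualization and shrinkage left aside):

fret_df = risk_model_api.fret()
factor_cols = [c for c in fret_df.columns if c != "date"]

# Plain sample covariance of daily factor returns (factors x factors).
factor_cov = np.cov(fret_df.select(factor_cols).to_numpy(), rowvar=False)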

Adding Exposures to an Existing Custom Risk Dataset#

As a last step in this tutorial we will add exposures for 2025 to our existing exposures upload and then update the risk dataset we already created.

Recall the steps from above to obtain sample exposures up to the end of 2024. We follow the same steps here (note that we reuse the same top 100 assets from above).

Also note that:

  • we won't update the factor time series, to demonstrate the behavior when exposures are only partially available.

  • the dataframe we upload also contains 2024 dates. These will be ignored when uploading in append mode (i.e. any existing date/factor combinations are ignored).

exposures_df.tail()
shape: (5, 4)
date        bayesid     style.Size  style.Momentum
date        str         f32         f32
2025-07-08  "VZ"        2.525391    0.363037
2025-07-08  "WFC"       2.775391    0.862305
2025-07-08  "WMT"       3.322266    1.860352
2025-07-08  "XOM"       3.144531    -0.605469
2025-07-08  "Y0486S10"  3.322266    1.59082
exposures_df = exposures_df.join(top_100_assets, on="bayesid", how="semi")
exposures_df
shape: (36_400, 4)
date        bayesid     style.Size  style.Momentum
date        str         f32         f32
2024-07-10  "00163T10"  2.400391    -0.861328
2024-07-10  "00287Y10"  3.003906    1.042969
2024-07-10  "04206820"  2.431641    0.567383
2024-07-10  "09253U10"  2.611328    0.699219
2024-07-10  "30303M10"  3.517578    1.431641
…           …           …           …
2025-07-08  "VZ"        2.525391    0.363037
2025-07-08  "WFC"       2.775391    0.862305
2025-07-08  "WMT"       3.322266    1.860352
2025-07-08  "XOM"       3.144531    -0.605469
2025-07-08  "Y0486S10"  3.322266    1.59082
upload_df = (
    exposures_df
    .rename({"bayesid": "asset_id", "style.Size": "size", "style.Momentum": "momentum"})
    .unpivot(
        on=["size", "momentum"],
        index=["date", "asset_id"],
        variable_name="factor",
        value_name="exposure",
    )
    .with_columns(
        pl.lit("bayesid").alias("asset_id_type"),
        pl.lit("style").alias("factor_group")
    )
)

upload_df.tail()
shape: (5, 6)
date        asset_id    factor      exposure   asset_id_type  factor_group
date        str         str         f32        str            str
2025-07-08  "VZ"        "momentum"  0.363037   "bayesid"      "style"
2025-07-08  "WFC"       "momentum"  0.862305   "bayesid"      "style"
2025-07-08  "WMT"       "momentum"  1.860352   "bayesid"      "style"
2025-07-08  "XOM"       "momentum"  -0.605469  "bayesid"      "style"
2025-07-08  "Y0486S10"  "momentum"  1.59082    "bayesid"      "style"

Note how below we choose append mode, which allows us to add data rather than overwrite previous data.

exposure_dataset.fast_commit(upload_df, mode="append")
UploadCommitResult(version=2, committed_names=[])
exposure_dataset.version_history()
{2: datetime.datetime(2025, 7, 22, 0, 51, 27, 491000, tzinfo=datetime.timezone.utc),
 1: datetime.datetime(2025, 7, 22, 0, 50, 55, 321000, tzinfo=datetime.timezone.utc),
 0: datetime.datetime(2025, 7, 22, 0, 50, 55, 291000, tzinfo=datetime.timezone.utc)}
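
One way to confirm that the new data landed is to check the latest date now present in the dataset:

# After the second commit the dataset should extend through 2025-07-08.
exposure_dataset.get_data().collect()["date"].max()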

New exposures have been uploaded; as a last step we need to update our risk dataset. Note that, as of now, we need to manually update the risk dataset to bring in the changes. A future release will add functionality to automatically trigger the risk dataset update when input data changes.

risk_dataset_api.update()
RiskDatasetUpdateResult()

Fitting the risk model again, we find that the new exposures have been captured. Note that, since we did not extend the factor time series, the time-series factor returns eventually drop to zero (visible in the final row below).

riskmodel_engine = bln.equity.riskmodels.load(
    FactorRiskModelSettings(
        universe=universe_settings,
        exposures=exposure_settings,
    )
)
risk_model_api = riskmodel_engine.get()
risk_model_api.fret().tail()
shape: (5, 6)
date        market.market  style.momentum  style.size  style.returns_oil  style.returns_tech
date        f32            f32             f32         f32                f32
2025-07-01  0.011207       -0.000886       -0.002891   -0.00641           0.004614
2025-07-02  0.010951       -0.003504       0.000299    -0.001785          0.001008
2025-07-03  0.00763        0.000962        -0.000798   0.065474           0.064018
2025-07-07  -0.012822      0.002141        0.000374    -0.000057          -0.000315
2025-07-08  0.007008       -0.005467       0.00151     0.0                0.0
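
To inspect how the time-series factors behave once their source data ends, we can isolate them for 2025:

# Time-series factor returns in 2025; the final row drops to zero once the
# underlying exposures are no longer available.
(
    risk_model_api.fret()
    .filter(pl.col("date") > dt.date(2024, 12, 31))
    .select("date", "style.returns_oil", "style.returns_tech")
    .tail()
)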

Housekeeping#

Below we demonstrate how to delete the risk datasets and the uploaded datasets created in this tutorial.

risk_datasets.delete_dataset("My-Dataset")
RawSettings(model_type='RiskDatasetSettings', name='My-Dataset', identifier=0, exists=True, raw_json={'reference_dataset': 'Bayesline-US-All-1y', 'exposures': [{'exposure_type': 'referenced', 'market_factor_groups': None, 'region_factor_groups': None, 'industry_factor_groups': None, 'style_factor_groups': None, 'other_factor_groups': None}], 'exchange_codes': None, 'trim_assets': 'ccy_union', 'trim_start_date': 'earliest_start', 'trim_end_date': 'latest_end'}, references=[], extra={})
risk_datasets.delete_dataset("My-Risk-Dataset")
RawSettings(model_type='RiskDatasetSettings', name='My-Risk-Dataset', identifier=1, exists=True, raw_json={'reference_dataset': 'Bayesline-US-All-1y', 'exposures': [{'exposure_type': 'unit', 'factor': 'market', 'factor_group': 'market', 'factor_type': 'market'}, {'exposure_type': 'unit', 'factor': 'industry', 'factor_group': 'industry', 'factor_type': 'industry'}, {'exposure_type': 'unit', 'factor': 'world', 'factor_group': 'region', 'factor_type': 'region'}, {'exposure_type': 'uploaded', 'exposure_source': 'My-US-Top100-Exposures', 'market_factor_group': None, 'region_factor_group': None, 'industry_factor_group': None, 'style_factor_group': 'style', 'other_factor_group': None, 'style_factor_fill_miss': True, 'style_factor_huberize': True}, {'exposure_type': 'huber_regression', 'tsfactors_source': 'Oil-and-Tech-Returns', 'factor_group': 'huber_style', 'include': 'All', 'exclude': [], 'fill_miss': True, 'window': 126, 'epsilon': 1.35, 'alpha': 0.0001, 'alpha_start': 10.0, 'student_t_level': None, 'clip': [None, None], 'huberize': True, 'huberize_maintain_zeros': False, 'impute': True, 'currency': 'USD', 'calendar': {'dataset': None, 'filters': [['XNYS']]}}], 'exchange_codes': None, 'trim_assets': 'ccy_union', 'trim_start_date': 'earliest_start', 'trim_end_date': 'latest_end'}, references=[], extra={})
exposure_dataset.destroy()
factor_ts_dataset.destroy()