# Initial Exposures Upload and Risk Dataset Creation

This notebook performs the initial upload of a set of factor exposures from daily CSV files and then creates a risk dataset from them.

In [2]:
import datetime as dt
import itertools as it
import shutil
import tempfile
import zipfile

from pathlib import Path

from tqdm import tqdm

from bayesline.apiclient import BayeslineApiClient
from bayesline.api.equity import (
    ExposureSettings, 
    IndustrySettings, 
    RegionSettings,
    RiskDatasetSettings,
    RiskDatasetUploadedExposureSettings,
    UniverseSettings,
)

In [None]:
bln = BayeslineApiClient.new_client(
    endpoint="https://[ENDPOINT]",
    api_key="[API-KEY]",
)

## Exposure Upload

In [None]:
exposure_dir = Path("PATH/TO/EXPOSURES")
assert exposure_dir.exists()


In [4]:
exposure_dataset_name = "My-Exposures"

The cell below creates a new exposure uploader for the chosen dataset name `My-Exposures`. See the [Uploaders Tutorial](https://docs.bayesline.com/latest/notebooks/tutorial_uploaders.html) for a deep dive into the `Uploaders API`.

In [5]:
exposure_uploader = bln.equity.uploaders.get_data_type("exposures")
uploader = exposure_uploader.get_or_create_dataset(exposure_dataset_name)

In [6]:
# list all csv files and group them by year
# expects file pattern "*_YYYY-MM-DD.csv"

all_files = sorted(exposure_dir.glob("*.csv"))
existing_files = uploader.get_staging_results().keys()

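# it.groupby only merges consecutive items, so it relies on all_files being sorted;
# rsplit takes the date after the last underscore, so prefixes may contain underscores
files_by_year = {
    year: list(files)
    for year, files in it.groupby(all_files, lambda f: int(f.stem.rsplit("_", 1)[-1][:4]))
}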

print(f"Found {len(all_files)} files.")
print("Years:", ", ".join(map(str, files_by_year)))

Found 31 files.
Years: 2025
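
The grouping above relies on sorted input and on every file name ending in `_YYYY-MM-DD.csv`. A more defensive variant could validate each name explicitly. The following is a minimal sketch using only the standard library; the helper names `file_date` and `group_by_year` and the example file names are assumptions made for illustration and are not part of the Bayesline API.

```python
import datetime as dt
import re
from collections import defaultdict
from pathlib import Path

# Matches a trailing "_YYYY-MM-DD.csv"; the prefix may itself contain underscores.
DATE_SUFFIX = re.compile(r"_(\d{4}-\d{2}-\d{2})\.csv$")


def file_date(path: Path) -> dt.date:
    """Parse the date from a file name such as 'exposures_2025-05-01.csv'."""
    match = DATE_SUFFIX.search(path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    return dt.date.fromisoformat(match.group(1))


def group_by_year(paths: list[Path]) -> dict[int, list[Path]]:
    """Group files by year; the input does not need to be sorted beforehand."""
    groups: dict[int, list[Path]] = defaultdict(list)
    for path in sorted(paths, key=file_date):
        groups[file_date(path).year].append(path)
    return dict(groups)


# Hypothetical file names, deliberately out of order.
example = [
    Path("my_exposures_2025-01-02.csv"),
    Path("my_exposures_2024-12-31.csv"),
    Path("my_exposures_2025-01-01.csv"),
]
grouped = group_by_year(example)
print({year: [p.name for p in files] for year, files in grouped.items()})
```

A file that does not match the pattern raises a `ValueError` here instead of ending up in the wrong group.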


Next, we batch the daily CSV files into annual ZIP files. Batching is recommended because a few ZIP archives upload and process much faster than many individual daily files.

In [7]:
temp_dir = Path(tempfile.mkdtemp())
print(f"Created temp directory: {temp_dir}")

Created temp directory: /tmp/tmphtk46llo


In [8]:
for year, files in tqdm(files_by_year.items()):
    zip_path = temp_dir / f"exposures_{year}.zip"

    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for file in files:
            zipf.writestr(file.name, file.read_bytes())

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:16<00:00, 16.14s/it]
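
Since the archives are uploaded in the next step, it can be worth checking them locally first. The sketch below is one possible check using only the standard library; `verify_zip` is a hypothetical helper, and the CSV column names in the demo are invented stand-ins rather than the required upload schema. It also uses `ZipFile.write` with `arcname`, which streams each file from disk instead of reading it fully into memory as `writestr` does.

```python
import tempfile
import zipfile
from pathlib import Path


def verify_zip(zip_path: Path, expected_names: list[str]) -> None:
    """Raise if the archive is corrupt or its members differ from expected_names."""
    with zipfile.ZipFile(zip_path) as zipf:
        bad_member = zipf.testzip()  # first corrupt member, or None if all are fine
        if bad_member is not None:
            raise ValueError(f"corrupt member in {zip_path.name}: {bad_member}")
        if sorted(zipf.namelist()) != sorted(expected_names):
            raise ValueError(f"unexpected members in {zip_path.name}")


# Self-contained demo with two tiny stand-in CSV files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    csv_files = []
    for day in ("2025-05-01", "2025-05-02"):
        csv_file = tmp_dir / f"exposures_{day}.csv"
        csv_file.write_text("asset_id,factor,exposure\nA,Size,0.5\n")
        csv_files.append(csv_file)

    zip_path = tmp_dir / "exposures_2025.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for csv_file in csv_files:
            # arcname stores only the file name, without the directory part
            zipf.write(csv_file, arcname=csv_file.name)

    verify_zip(zip_path, [f.name for f in csv_files])
    archive_ok = True
```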





Next, we iterate over the annual ZIP files and stage them in the uploader. See the [Uploaders Tutorial](https://docs.bayesline.com/latest/notebooks/tutorial_uploaders.html#staging-data) for more details on the *staging* and *commit* concepts.

In [9]:
for year in files_by_year.keys():
    zip_path = temp_dir / f"exposures_{year}.zip"
    result = uploader.stage_file(zip_path)
    assert result.success
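
The loop above stops at the first archive that fails to stage. When uploading many years, it may be more convenient to stage everything and report the failures together. The sketch below assumes only what the cell above shows, namely that `stage_file` accepts a path and returns an object with a `success` attribute. `stage_all`, `FakeResult` and `fake_stage_file` are hypothetical names that exist solely to make the example runnable without a server.

```python
from pathlib import Path
from typing import Any, Callable


def stage_all(stage_file: Callable[[Path], Any], zip_paths: list[Path]) -> list[Path]:
    """Stage every archive and return the paths whose result was not successful."""
    failed = []
    for zip_path in zip_paths:
        result = stage_file(zip_path)
        if not result.success:
            failed.append(zip_path)
    return failed


# Stand-in for the uploader, only to make the sketch runnable (hypothetical).
class FakeResult:
    def __init__(self, success: bool) -> None:
        self.success = success


def fake_stage_file(path: Path) -> FakeResult:
    # Pretend that the 2024 archive fails to stage.
    return FakeResult(success="2024" not in path.name)


paths = [Path("exposures_2024.zip"), Path("exposures_2025.zip")]
failed = stage_all(fake_stage_file, paths)
print("failed:", [p.name for p in failed])
```

With the real uploader, the call would be `stage_all(uploader.stage_file, zip_paths)`.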

In [10]:
shutil.rmtree(temp_dir)

### Data Commit

Before committing the staged data, we collect some descriptive statistics as a sanity check.

In [11]:
staging_data_summary_df = uploader.get_staging_data_detail_summary()
staging_data_summary_df

| _name | date | n_assets | min_exposure | max_exposure | mean_exposure | median_exposure | std_exposure |
|---|---|---|---|---|---|---|---|
| str | date | u32 | f32 | f32 | f32 | f32 | f32 |
| "exposures_2025-05-01" | 2025-05-01 | 35804 | -4.0625 | 4.09375 | 0.241719 | 0.49707 | 1.013115 |
| "exposures_2025-05-02" | 2025-05-02 | 35798 | -4.0625 | 4.09375 | 0.24169 | 0.497559 | 1.013026 |
| "exposures_2025-05-03" | 2025-05-03 | 35793 | -4.0625 | 4.09375 | 0.241701 | 0.497559 | 1.012994 |
| "exposures_2025-05-04" | 2025-05-04 | 35793 | -4.0625 | 4.09375 | 0.24172 | 0.497559 | 1.013005 |
| "exposures_2025-05-05" | 2025-05-05 | 35796 | -4.0625 | 4.09375 | 0.241675 | 0.49707 | 1.012967 |
| … | … | … | … | … | … | … | … |
| "exposures_2025-05-27" | 2025-05-27 | 35767 | -4.046875 | 4.097656 | 0.245158 | 0.498535 | 1.010661 |
| "exposures_2025-05-28" | 2025-05-28 | 35763 | -4.046875 | 4.09375 | 0.243136 | 0.495605 | 1.01173 |
| "exposures_2025-05-29" | 2025-05-29 | 35761 | -4.046875 | 4.09375 | 0.243173 | 0.49585 | 1.011687 |
| "exposures_2025-05-30" | 2025-05-30 | 35756 | -4.046875 | 4.09375 | 0.243129 | 0.496094 | 1.011694 |

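Eyeballing the summary works for a month of data but becomes tedious over several years. The sketch below shows one way to automate two basic checks: missing calendar days and abrupt changes in `n_assets`. It works on plain dictionaries with the `date` and `n_assets` columns shown above. `check_summary` and its 5% threshold are assumptions for illustration, and the calendar-day check assumes one file per calendar day, as in this upload. If the summary is a polars DataFrame, as the dtype row suggests, `staging_data_summary_df.to_dicts()` should provide suitable rows.

```python
import datetime as dt


def check_summary(rows: list[dict], max_asset_change: float = 0.05) -> list[str]:
    """Return warnings for calendar gaps and large day-over-day changes in n_assets."""
    warnings = []
    rows = sorted(rows, key=lambda r: r["date"])
    for prev, curr in zip(rows, rows[1:]):
        gap = (curr["date"] - prev["date"]).days
        if gap > 1:
            warnings.append(f"{gap - 1} missing day(s) before {curr['date']}")
        change = abs(curr["n_assets"] - prev["n_assets"]) / prev["n_assets"]
        if change > max_asset_change:
            warnings.append(f"n_assets changed by {change:.0%} on {curr['date']}")
    return warnings


rows = [
    {"date": dt.date(2025, 5, 1), "n_assets": 35804},
    {"date": dt.date(2025, 5, 2), "n_assets": 35798},
    {"date": dt.date(2025, 5, 5), "n_assets": 30000},  # made-up row that triggers both checks
]
for warning in check_summary(rows):
    print(warning)
```
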

In [12]:
uploader.commit(mode="append")

UploadCommitResult(version=1, committed_names=['exposures_2025-05-28', 'exposures_2025-05-08', 'exposures_2025-05-19', 'exposures_2025-05-04', 'exposures_2025-05-11', 'exposures_2025-05-15', 'exposures_2025-05-23', 'exposures_2025-05-02', 'exposures_2025-05-22', 'exposures_2025-05-20', 'exposures_2025-05-07', 'exposures_2025-05-25', 'exposures_2025-05-05', 'exposures_2025-05-31', 'exposures_2025-05-29', 'exposures_2025-05-30', 'exposures_2025-05-27', 'exposures_2025-05-21', 'exposures_2025-05-06', 'exposures_2025-05-17', 'exposures_2025-05-26', 'exposures_2025-05-24', 'exposures_2025-05-09', 'exposures_2025-05-01', 'exposures_2025-05-14', 'exposures_2025-05-18', 'exposures_2025-05-16', 'exposures_2025-05-12', 'exposures_2025-05-13', 'exposures_2025-05-10', 'exposures_2025-05-03'])

## Risk Dataset Creation

The cells below create a new *Risk Dataset* from the exposures uploaded above. See the [Risk Datasets Tutorial](https://docs.bayesline.com/latest/notebooks/tutorial_datasets.html) for a deep dive into the `Risk Datasets API`.

In [13]:
risk_datasets = bln.equity.riskdatasets

In [14]:
# existing datasets that can be used as reference datasets
risk_datasets.get_dataset_names()

['Bayesline-US-500-1y', 'Bayesline-US-All-1y']

In [15]:
risk_dataset_name = "My-Risk-Dataset"

In [16]:
risk_datasets.delete_dataset_if_exists(risk_dataset_name)

We need to specify which exposures are *style*, *region*, etc. The cell below lists the *factor groups* as they were extracted from the uploaded exposures.

In [17]:
uploader.get_data(columns=["factor_group"], unique=True).collect()

| factor_group |
|---|
| str |
| "industry" |
| "market" |
| "style" |
| "region" |


See the API docs for [`RiskDatasetSettings`](https://docs.bayesline.com/latest/_autosummary/bayesline.api.equity.RiskDatasetSettings.html) and [`RiskDatasetUploadedExposureSettings`](https://docs.bayesline.com/latest/_autosummary/bayesline.api.equity.RiskDatasetUploadedExposureSettings.html) for other available settings.

In [18]:
settings = RiskDatasetSettings(
    reference_dataset="Bayesline-US-All-1y",
    exposures=[
        RiskDatasetUploadedExposureSettings(
            exposure_source=exposure_dataset_name,
            market_factor_group="market",
            style_factor_group="style",
            industry_factor_group="industry",
            region_factor_group="region",
            style_factor_fill_miss=True,
        ),
    ],
    trim_start_date=dt.date(2025, 5, 1),
)
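
Before creating the dataset, it can help to confirm that every factor group named in the settings actually exists among the uploaded groups, since a typo such as `"styles"` is easy to miss. The following is a plain-Python sketch of that check; `missing_factor_groups` is a hypothetical helper and does not call the Bayesline API.

```python
def missing_factor_groups(assigned: dict[str, str], available: set[str]) -> dict[str, str]:
    """Return the assignments whose factor group is not among the uploaded ones."""
    return {role: group for role, group in assigned.items() if group not in available}


# Factor groups as listed by the uploader above.
available = {"industry", "market", "style", "region"}

# Keyword argument -> factor group, mirroring the settings above.
assigned = {
    "market_factor_group": "market",
    "style_factor_group": "style",
    "industry_factor_group": "industry",
    "region_factor_group": "region",
}

missing = missing_factor_groups(assigned, available)
print("missing:", missing)
```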

Finally, we create the new dataset and then describe its properties.

In [19]:
my_risk_dataset = risk_datasets.create_dataset(risk_dataset_name, settings)

In [20]:
my_risk_dataset.describe().universe_settings_menu

UniverseSettingsMenu(id_types=['bayesid'], exchanges=['ARCX', 'BVCA', 'BVMF', 'DIFX', 'DSMD', 'ETFP', 'FRAB', 'HSTC', 'JBUL', 'PFTS', 'ROCO', 'SHSC', 'SZSC', 'WBDM', 'XADS', 'XAMM', 'XAMS', 'XASE', 'XASX', 'XATH', 'XBAH', 'XBEL', 'XBEY', 'XBKF', 'XBKK', 'XBOG', 'XBOM', 'XBOS', 'XBRA', 'XBRU', 'XBRV', 'XBUD', 'XBUE', 'XCAI', 'XCAN', 'XCAS', 'XCSE', 'XCYS', 'XDUB', 'XEQY', 'XETB', 'XHEL', 'XHKG', 'XHNX', 'XICE', 'XIDX', 'XJAM', 'XJAS', 'XJSE', 'XKAR', 'XKLS', 'XKOS', 'XKRX', 'XKUW', 'XLIM', 'XLIS', 'XLIT', 'XLJU', 'XLON', 'XLUX', 'XMAD', 'XMAL', 'XMAU', 'XMEX', 'XMUS', 'XNAI', 'XNAM', 'XNAS', 'XNCM', 'XNSA', 'XNSE', 'XNYS', 'XNZE', 'XOSL', 'XPAE', 'XPAR', 'XPHS', 'XPRM', 'XPSX', 'XQUI', 'XRIS', 'XSAU', 'XSEC', 'XSES', 'XSGO', 'XSHE', 'XSHG', 'XSSC', 'XSTC', 'XSTO', 'XSWX', 'XTAE', 'XTAI', 'XTAL', 'XTKS', 'XTSE', 'XTSX', 'XTUN', 'XWAR', 'XZAG', 'XZIM'], industry={'industry': {'ALL': ['industry.Academic & Educational Services', 'industry.Basic Materials', 'industry.Consumer Cyclicals', 'in

We can pull some exposures from the new risk dataset to verify.

In [21]:
exposures_api = bln.equity.exposures.load(
    ExposureSettings(
        industries=None,
        regions=None,
    )
)

In [22]:
# note that the industry and region hierarchy names tie out with the factor groups we specified above

df = exposures_api.get(
    UniverseSettings(
        dataset=risk_dataset_name, 
        industry=IndustrySettings(hierarchy="industry", include="All"),
        region=RegionSettings(hierarchy="region", include="All")
    )
)

df.tail()

| date | bayesid | market.market.Market | style.style.Dividend | style.style.Growth | style.style.Leverage | style.style.Momentum | style.style.Size | style.style.Value | style.style.Volatility |
|---|---|---|---|---|---|---|---|---|---|
| date | str | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 |
| 2025-05-31 | "ZSPC" | 1.0 | -0.568359 | 0.364502 | -0.083862 | -0.183472 | -1.768555 | -1.72168 | 1.799805 |
| 2025-05-31 | "ZTR" | 1.0 | 1.834961 | -0.286621 | -0.321289 | 1.28125 | -0.738281 | 1.160156 | -1.362305 |
| 2025-05-31 | "ZUMZ" | 1.0 | -1.166016 | -2.34375 | -0.848145 | -1.31543 | -0.782715 | 1.0625 | 0.819824 |
| 2025-05-31 | "ZVIA" | 1.0 | -0.069031 | -1.301758 | -1.975586 | 1.680664 | -1.431641 | -0.696289 | 1.248047 |
| 2025-05-31 | "ZVVT" | 1.0 | -0.533203 | -0.907715 | 1.454102 | 0.082458 | -1.37207 | -1.277344 | 2.097656 |
