Data
finance-datagen produces synthetic financial time series. Every
Rust generator builds an Apache Arrow RecordBatch in a polars-free
Rust core; the Python layer then wraps it into a polars.DataFrame via
the Arrow PyCapsule interface. Python-only generators return
polars.DataFrame objects directly.
All public generator classes inherit from DataGenerator, a pydantic
base model that validates typed parameters at construction time. Each
generator can be run with .generate() or with one-shot iterator syntax
via next(generator).
This page documents the data: stochastic models, generator parameters, and output schemas.
Conventions
All tabular outputs share the following conventions:
| Aspect | Convention |
|---|---|
| Timestamp column | |
| Symbol column | |
| Numeric columns | |
| Path length | A path generator with |
| Time grid | Uniform: |
| Reproducibility | Fixed |
Most generators also have a matching generate_* convenience function
that instantiates the model for validation and returns .generate().
generate_prices is a plain alias for generate_gbm, intended for
examples and tests that want a model-neutral name.
Enum-backed metadata
Generators that emit tradeable assets can add optional enum-backed metadata columns without changing their primary schema:
| Family | Optional fields |
|---|---|
| | |
| | |
| | |
| | |
Metadata values are validated by finance-enums. When an exchange is
provided to the portfolio, transaction, order, or execution generators,
dates and timestamps are drawn from the matching finance-dates
exchange calendar.
The dt parameter controls the modeling time step used in SDE
discretization; step_ms controls only the timestamp column. They are
independent: a daily model (dt = 1/252) can be emitted on a second or
minute timestamp grid for testing.
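The independence of the two parameters can be illustrated with a minimal sketch. The helper names `timestamp_grid` and `per_step_volatility`, and the `start_ms` and `sigma` arguments, are hypothetical and not part of the finance-datagen API; the sketch only shows that the timestamp spacing never enters the model scaling.

```python
import math


def timestamp_grid(start_ms: int, step_ms: int, n_rows: int) -> list[int]:
    """Uniform epoch-millisecond grid; depends only on start_ms and step_ms."""
    return [start_ms + i * step_ms for i in range(n_rows)]


def per_step_volatility(sigma: float, dt: float) -> float:
    """Standard deviation of one model step; depends only on sigma and dt."""
    return sigma * math.sqrt(dt)


# Same daily model step (dt = 1/252) on two different timestamp spacings.
daily_grid = timestamp_grid(start_ms=0, step_ms=86_400_000, n_rows=3)
minute_grid = timestamp_grid(start_ms=0, step_ms=60_000, n_rows=3)
step_vol = per_step_volatility(sigma=0.2, dt=1 / 252)
```

Switching from the daily to the minute grid changes only the timestamps; `step_vol` is the same in both cases.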
Models
Geometric Brownian Motion (GBM)
The classic Black-Scholes log-normal price process.
Discretized exactly in log-space:
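As a rough illustration, the standard exact log-space update is S_{t+dt} = S_t * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z) with Z standard normal. The sketch below implements that textbook step in pure Python. The function name and the parameter names `s0`, `mu`, `sigma`, and `n_steps`, as well as the default values and the choice to return `n_steps + 1` points, are assumptions for illustration and may not match the library.

```python
import math
import random


def simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, dt=1 / 252, n_steps=252, seed=None):
    """Exact log-space GBM path; returns n_steps + 1 prices including s0."""
    if s0 <= 0:
        raise ValueError("s0 must be > 0")
    if sigma < 0:
        raise ValueError("sigma must be nonnegative")
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma * sigma) * dt  # Ito-corrected log drift per step
    diffusion = sigma * math.sqrt(dt)        # log-return std dev per step
    log_price = math.log(s0)
    prices = [s0]
    for _ in range(n_steps):
        log_price += drift + diffusion * rng.gauss(0.0, 1.0)
        prices.append(math.exp(log_price))
    return prices
```

Because the update happens in log-space, every price is the exponential of a finite number and stays strictly positive regardless of the step size.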
GBM Parameters
| Param | Default | Meaning |
|---|---|---|
| | | initial price, must be > 0 |
| | | drift, annualized |
| | | volatility, annualized and nonnegative |
| | | model time step in years |
| | | number of return draws |
| | | label written into the |
| | | first timestamp, epoch ms UTC |
| | | timestamp spacing, one day by default |
| | | RNG seed |
GBM Schema
| Column | Type | Notes |
|---|---|---|
| | | uniform grid |
| | | constant |
| | | strictly positive |
Heston Stochastic Volatility
Two-factor SDE with mean-reverting variance and correlated price and variance shocks:
The implementation uses full-truncation Euler on variance, then log-Euler on price:
where \(v_t^+ = \max(v_t, 0)\).
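The scheme described above can be sketched in pure Python as follows. This is the textbook full-truncation Euler step, not the library's source: the parameter names `kappa`, `theta`, `xi`, and `rho`, the defaults, and the way the correlated shocks are built are assumptions for illustration.

```python
import math
import random


def simulate_heston(s0=100.0, v0=0.04, mu=0.0, kappa=2.0, theta=0.04,
                    xi=0.3, rho=-0.7, dt=1 / 252, n_steps=252, seed=None):
    """Full-truncation Euler on variance, log-Euler on price."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    rho_perp = math.sqrt(1.0 - rho * rho)
    s, v = s0, v0
    prices, variances = [s0], [max(v0, 0.0)]
    for _ in range(n_steps):
        z_s = rng.gauss(0.0, 1.0)
        z_v = rho * z_s + rho_perp * rng.gauss(0.0, 1.0)  # corr(z_s, z_v) = rho
        v_pos = max(v, 0.0)  # truncated variance v_t^+
        v_next = v + kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos) * sqrt_dt * z_v
        s *= math.exp((mu - 0.5 * v_pos) * dt + math.sqrt(v_pos) * sqrt_dt * z_s)
        v = v_next
        prices.append(s)
        variances.append(max(v, 0.0))  # report the truncated value
    return prices, variances
```

The raw variance state is allowed to dip below zero between steps; only the truncated value enters the drift, the diffusion, and the reported column.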
Heston Parameters
| Param | Default | Meaning |
|---|---|---|
| | | initial price > 0 |
| | | initial variance >= 0 |
| | | risk-neutral or physical drift |
| | | mean-reversion speed, nonnegative |
| | | long-run variance, nonnegative |
| | | vol-of-vol, nonnegative |
| | | leverage correlation, must satisfy |
| | | model time step in years |
| | | number of draws |
Heston Schema
| Column | Type | Notes |
|---|---|---|
| | | uniform grid |
| | | constant |
| | | strictly positive |
| | | nonnegative after truncation |
GARCH(1,1) Returns
Discrete-time conditional-variance model in log returns:
When alpha + beta < 1, the initial variance is the unconditional
variance omega / (1 - alpha - beta). Otherwise it falls back to
omega.
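A minimal sketch of the initial-variance rule and the standard GARCH(1,1) recursion h_{t+1} = omega + alpha * (r_t - mu)^2 + beta * h_t is shown below. The function names, the defaults, and the row alignment of the volatility series are assumptions for illustration, not the library's implementation.

```python
import math
import random


def initial_variance(omega: float, alpha: float, beta: float) -> float:
    """Unconditional variance when alpha + beta < 1, otherwise omega."""
    persistence = alpha + beta
    if persistence < 1.0:
        return omega / (1.0 - persistence)
    return omega


def simulate_garch(s0=100.0, mu=0.0, omega=1e-6, alpha=0.05, beta=0.90,
                   n_steps=252, seed=None):
    """GARCH(1,1) log returns compounded into a price path."""
    rng = random.Random(seed)
    h = initial_variance(omega, alpha, beta)
    prices, returns, sigmas = [s0], [0.0], [math.sqrt(h)]
    for _ in range(n_steps):
        r = mu + math.sqrt(h) * rng.gauss(0.0, 1.0)
        prices.append(prices[-1] * math.exp(r))      # compound the log return
        h = omega + alpha * (r - mu) ** 2 + beta * h  # next conditional variance
        returns.append(r)
        sigmas.append(math.sqrt(h))
    return prices, returns, sigmas
```

With `omega > 0` the conditional variance can never reach zero, which is why the volatility column stays strictly positive.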
GARCH Parameters
| Param | Default | Meaning |
|---|---|---|
| | | initial price > 0 |
| | | mean log return |
| | | constant variance term >= 0 |
| | | shock weight >= 0 |
| | | persistence weight >= 0 |
| | | number of return draws |
GARCH Schema
| Column | Type | Notes |
|---|---|---|
| | | uniform grid |
| | | constant |
| | | updated with |
| | | first row is |
| | | conditional volatility, strictly positive |
OHLCV Synthesis From Close
ohlc_from_close() takes any close series and synthesizes plausible
Open/High/Low/Volume columns around it.
For each bar i:
```text
open_i = close_{i-1}                                        (open_0 = close_0)
high_i = max(open_i, close_i) * (1 + |U_1| * intrabar_vol)
low_i  = min(open_i, close_i) * (1 - |U_2| * intrabar_vol)
ret_i  = log(close_i / close_{i-1})                         (ret_0 = 0)
vol_i  = base_volume + vol_factor * |ret_i|
```
This guarantees high >= max(open, close) and
low <= min(open, close).
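The per-bar rules above can be sketched in pure Python. The function name `ohlc_sketch` is hypothetical, the default values are placeholders, and drawing U_1 and U_2 from a standard normal is an assumption, since the distribution is not stated here.

```python
import math
import random


def ohlc_sketch(close, intrabar_vol=0.005, base_volume=1_000_000.0,
                vol_factor=1e8, seed=None):
    """Synthesize open/high/low/volume around a close series."""
    rng = random.Random(seed)
    bars = []
    for i, c in enumerate(close):
        prev = close[i - 1] if i > 0 else c
        o = prev                                   # open_i = close_{i-1}
        u1 = abs(rng.gauss(0.0, 1.0))              # assumed distribution
        u2 = abs(rng.gauss(0.0, 1.0))
        high = max(o, c) * (1.0 + u1 * intrabar_vol)
        low = min(o, c) * (1.0 - u2 * intrabar_vol)
        ret = math.log(c / prev) if i > 0 else 0.0
        volume = base_volume + vol_factor * abs(ret)
        bars.append({"open": o, "high": high, "low": low,
                     "close": c, "volume": volume})
    return bars


bars = ohlc_sketch([100.0, 101.0, 99.5, 100.2], seed=0)
```

Because both envelope multipliers are built from absolute values, the high can only move up from the bar's body and the low can only move down.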
OHLCV Parameters
| Param | Default | Meaning |
|---|---|---|
| | required | iterable, numpy array, or |
| | | per-bar high/low envelope width |
| | | floor volume |
| | | volume sensitivity to absolute log return |
| | | symbol label |
| | | first timestamp, epoch ms UTC |
| | | timestamp spacing |
| | | RNG seed |
OHLCV Schema
| Column | Type |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
Cross-Sectional Panels
Three pydantic generator models produce panels for testing alpha and
risk code that operates over (date, symbol) pairs. The legacy
generate_signal, generate_factor_loadings, and generate_benchmark
helpers remain as thin wrappers around the matching model classes. All
three generators draw their randomness from
numpy.random.default_rng(seed).
SignalGenerator
Long-form panel [date, symbol, signal, fwd_returns] constructed so
that the cross-sectional Pearson IC of signal against fwd_returns
is approximately ic per date:
| Param | Default | Meaning |
|---|---|---|
| | | rows in the date dimension |
| | | rows in the asset dimension |
| | | target per-date Pearson IC, in |
| | | cross-sectional standard deviation of fwd returns |
| | | numpy RNG seed |
| | | first date |
| | | optional explicit symbol list |
Output schema: date, symbol, signal, fwd_returns.
Convenience wrapper: generate_signal(...).
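One common way to hit a target cross-sectional correlation is to mix the signal with independent noise: fwd_returns = sigma * (ic * signal + sqrt(1 - ic^2) * noise). The sketch below uses that construction for a single date. It is an assumption about how such a panel could be built, not a description of SignalGenerator's internals; the function names and `sigma` are hypothetical.

```python
import math
import random


def signal_cross_section(n_symbols, ic, sigma, rng):
    """One date of (signal, fwd_return) pairs with population correlation ic."""
    noise_weight = math.sqrt(1.0 - ic * ic)
    rows = []
    for _ in range(n_symbols):
        signal = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        rows.append((signal, sigma * (ic * signal + noise_weight * noise)))
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = signal_cross_section(20_000, ic=0.1, sigma=0.02, rng=random.Random(0))
sample_ic = pearson([r[0] for r in rows], [r[1] for r in rows])
```

The realized IC on any one date only approximates the target; the sampling error shrinks roughly with the square root of the number of symbols.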
FactorLoadingsGenerator
Wide-form Barra-style factor loadings, one row per asset. The market
factor is set to 1.0 when present. Other factors are drawn from a
standard normal distribution and standardized cross-sectionally.
| Param | Default | Meaning |
|---|---|---|
| | | row count |
| | | factor columns |
| | | numpy RNG seed |
| | | optional explicit symbol list |
Output schema: symbol plus one numeric column per factor.
Convenience wrapper: generate_factor_loadings(...).
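A minimal sketch of the standardization step, assuming a factor literally named "market" and a population (not sample) standard deviation; both are illustrative choices, and `factor_loadings_sketch` is a hypothetical helper rather than the library function.

```python
import random
import statistics


def factor_loadings_sketch(n_assets, factors=("market", "value", "momentum"), seed=None):
    """Column-oriented loadings: market is 1.0, other factors are z-scored."""
    rng = random.Random(seed)
    columns = {}
    for name in factors:
        if name == "market":
            columns[name] = [1.0] * n_assets
            continue
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_assets)]
        mean = statistics.fmean(draws)
        std = statistics.pstdev(draws)
        # Cross-sectional z-score: mean 0 and unit dispersion across assets.
        columns[name] = [(x - mean) / std for x in draws]
    return columns


loadings = factor_loadings_sketch(500, seed=1)
```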
BenchmarkGenerator
Independent Gaussian benchmark return series with target annualized mean and volatility.
| Param | Default | Meaning |
|---|---|---|
| | | row count |
| | | target annualized mean |
| | | target annualized volatility |
| | | annualization factor |
| | | numpy RNG seed |
| | | first date |
Output schema: date, benchmark.
Convenience wrapper: generate_benchmark(...).
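The usual way to map annualized targets to per-period draws is to divide the mean by the annualization factor and the volatility by its square root. The sketch below assumes that convention; the function and parameter names are hypothetical.

```python
import math
import random
import statistics


def benchmark_returns(n, annual_mean=0.08, annual_vol=0.15,
                      periods_per_year=252, seed=None):
    """Independent Gaussian per-period returns scaled from annual targets."""
    rng = random.Random(seed)
    step_mean = annual_mean / periods_per_year
    step_vol = annual_vol / math.sqrt(periods_per_year)
    return [rng.gauss(step_mean, step_vol) for _ in range(n)]


sample = benchmark_returns(100_000, seed=3)
realized_mean = statistics.fmean(sample) * 252
realized_vol = statistics.pstdev(sample) * math.sqrt(252)
```

On short samples the realized annualized mean can sit far from the target, because the mean estimate converges much more slowly than the volatility estimate.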
Portfolio and Transaction Generators
These Python generators create deterministic post-trade fixtures for turnover, transaction cost, and execution quality tests.
PositionsGenerator
Long-form position panel with one row per (date, symbol). Per-date
absolute weights are normalized to gross_exposure; market_value is
weight * portfolio_value, and quantity is market_value / price.
| Param | Default | Meaning |
|---|---|---|
| | | number of dates |
| | | assets per date |
| | | equity denominator for market value |
| | | per-date sum of absolute weights |
| | | center of synthetic price distribution |
| | | daily log-return volatility for marks |
| | | numpy RNG seed |
| | | first date |
| | | optional explicit symbol list |
Output schema: date, symbol, price, quantity, market_value,
weight.
Convenience wrapper: generate_positions(...).
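The arithmetic for one date can be sketched as follows. `normalize_positions` is a hypothetical helper that takes already-drawn raw weights and prices; only the three relations stated above (weight normalization, market value, quantity) are taken from this page.

```python
def normalize_positions(raw_weights, prices, portfolio_value=1_000_000.0,
                        gross_exposure=1.0):
    """Scale weights so their absolute sum equals gross_exposure."""
    gross = sum(abs(w) for w in raw_weights)
    if gross == 0:
        raise ValueError("at least one raw weight must be nonzero")
    rows = []
    for raw, price in zip(raw_weights, prices):
        weight = raw * gross_exposure / gross
        market_value = weight * portfolio_value
        rows.append({
            "price": price,
            "quantity": market_value / price,
            "market_value": market_value,
            "weight": weight,
        })
    return rows


rows = normalize_positions([2.0, -1.0, 1.0], [10.0, 20.0, 50.0],
                           portfolio_value=1_000.0, gross_exposure=1.0)
```

Short positions keep their negative sign through every column, so the signed weights need not sum to the gross exposure.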
TransactionsGenerator
Synthetic transaction log with enum-backed side and position_effect
labels from finance-enums. amount is positive for Buy and
negative for Sell; opening and closing intent is represented by
position_effect. notional is abs(amount) * price; fees are
computed from fee_bps.
| Param | Default | Meaning |
|---|---|---|
| | | number of trade dates |
| | | symbol universe size |
| | | rows per date |
| | | center of trade-price distribution |
| | | lognormal price dispersion |
| | | max absolute share amount per row |
| | | explicit commission per trade |
| | | explicit fee rate in basis points |
| | | slippage or cost assumption column |
| | | numpy RNG seed |
| | | first trade date |
| | | optional explicit symbol list |
Output schema: timestamp, symbol, amount, price, side,
position_effect, notional, commission, fees, bps.
Convenience wrapper: generate_transactions(...).
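A minimal sketch of the per-row arithmetic follows. It assumes fees equal notional * fee_bps / 10,000, which is the usual basis-point convention but is not spelled out on this page, and it uses plain strings in place of the finance-enums side values. `transaction_row` is a hypothetical helper.

```python
def transaction_row(amount, price, fee_bps=0.0, commission=0.0):
    """Derive side, notional, and fees from a signed share amount."""
    if amount == 0:
        raise ValueError("amount must be nonzero")
    notional = abs(amount) * price
    return {
        "amount": amount,
        "price": price,
        "side": "Buy" if amount > 0 else "Sell",  # string stand-in for the enum
        "notional": notional,
        "commission": commission,
        "fees": notional * fee_bps / 10_000,       # assumed bps convention
    }


sell = transaction_row(amount=-200, price=50.25, fee_bps=2.0, commission=1.0)
```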
OrdersGenerator
Enum-backed order fixtures for execution-quality and post-trade tests.
Generated rows include side, order type, order status, and time-in-force
labels from finance-enums. When exchange is supplied, timestamps are
sampled from that exchange calendar’s regular sessions.
| Param | Default | Meaning |
|---|---|---|
| | | number of order dates |
| | | symbol universe size |
| | | rows per date |
| | | center of synthetic limit-price distribution |
| | | lognormal limit-price dispersion |
| | | max order quantity per row |
| | | numpy RNG seed |
| | | first order date |
| | | optional explicit symbol list |
Output schema: timestamp, symbol, order_id, side, order_type,
quantity, limit_price, order_status, time_in_force.
Convenience wrapper: generate_orders(...).
ExecutionsGenerator
Enum-backed execution fixtures for simulated fills. Generated rows have
execution IDs, synthetic order IDs, sides, fill prices, fill quantities,
liquidity flags, and time-in-force labels. When exchange is supplied,
timestamps are sampled from that exchange calendar’s regular sessions.
| Param | Default | Meaning |
|---|---|---|
| | | number of execution dates |
| | | symbol universe size |
| | | rows per date |
| | | center of synthetic fill-price distribution |
| | | lognormal fill-price dispersion |
| | | max execution quantity per row |
| | | numpy RNG seed |
| | | first execution date |
| | | optional explicit symbol list |
Output schema: timestamp, execution_id, order_id, symbol,
side, price, quantity, liquidity_flag, time_in_force.
Convenience wrapper: generate_executions(...).
Multi-Asset, Regime, and Market-Impact Generators
MultiAssetGBMGenerator
Correlated multi-asset GBM with either a constant off-diagonal
correlation rho or a caller-provided correlation matrix corr. The
output is long-form [timestamp, symbol, price, return]; the first row
for each symbol has return = 0.0.
Convenience wrapper: generate_multi_asset_gbm(...).
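For the constant-correlation case, one simple construction is a single common shock: each asset's shock is sqrt(rho) * common + sqrt(1 - rho) * own, which gives pairwise correlation rho for 0 <= rho <= 1. The sketch below uses that construction with zero drift and log returns. Both are assumptions; the library may instead factor the full correlation matrix, and all names here are hypothetical.

```python
import math
import random


def correlated_log_returns(n_assets, n_steps, rho, sigma=0.2, dt=1 / 252, seed=None):
    """Per-asset GBM log returns sharing one common shock (zero drift)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this one-factor construction needs 0 <= rho <= 1")
    rng = random.Random(seed)
    scale = sigma * math.sqrt(dt)
    common_w, own_w = math.sqrt(rho), math.sqrt(1.0 - rho)
    columns = [[] for _ in range(n_assets)]
    for _ in range(n_steps):
        common = rng.gauss(0.0, 1.0)
        for col in columns:
            shock = common_w * common + own_w * rng.gauss(0.0, 1.0)
            col.append(-0.5 * sigma * sigma * dt + scale * shock)
    return columns


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


a, b = correlated_log_returns(2, 20_000, rho=0.5, seed=1)
sample_rho = pearson(a, b)
```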
RegimeSwitchingGenerator
Single-symbol Markov regime-switching path. The transition matrix rows
must sum to one. Regime-specific means and volatilities generate log
returns, and the output includes an integer regime label per timestamp.
Convenience wrapper: generate_regime_switching(...).
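A minimal sketch of such a path in pure Python. The function name, the argument names, starting in regime 0, and drawing the next regime before each return are assumptions for illustration.

```python
import math
import random


def simulate_regimes(transition, means, vols, n_steps, s0=100.0, seed=None):
    """Markov regime path with regime-specific Gaussian log returns."""
    for row in transition:
        if not math.isclose(sum(row), 1.0, abs_tol=1e-9):
            raise ValueError("each transition-matrix row must sum to one")
    rng = random.Random(seed)
    states = range(len(transition))
    regime = 0  # assumed starting regime
    prices, regimes = [s0], [regime]
    for _ in range(n_steps):
        # Draw the next regime from the current regime's transition row.
        regime = rng.choices(states, weights=transition[regime])[0]
        log_return = rng.gauss(means[regime], vols[regime])
        prices.append(prices[-1] * math.exp(log_return))
        regimes.append(regime)
    return prices, regimes


prices, regimes = simulate_regimes(
    transition=[[0.95, 0.05], [0.10, 0.90]],
    means=[0.0005, -0.001],
    vols=[0.01, 0.03],
    n_steps=100,
    seed=5,
)
```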
MarketImpactCurveGenerator
Generates participation-rate curves for temporary and permanent market impact. The default model uses square-root temporary impact and linear permanent impact:
```text
temporary_impact_bps = temporary_impact_coef * volatility * sqrt(participation_rate) * 10_000
permanent_impact_bps = permanent_impact_coef * volatility * participation_rate * 10_000
```
Output schema: symbol, participation_rate, adv, volatility,
temporary_impact_bps, permanent_impact_bps, total_impact_bps.
Convenience wrapper: generate_market_impact_curve(...).
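The two formulas translate directly into code. The sketch below evaluates them for a single participation rate; treating total_impact_bps as the sum of the two components is an assumption, and `impact_bps` is a hypothetical helper rather than the library function.

```python
import math


def impact_bps(participation_rate, volatility,
               temporary_impact_coef=1.0, permanent_impact_coef=0.5):
    """Square-root temporary and linear permanent impact, in basis points."""
    temporary = temporary_impact_coef * volatility * math.sqrt(participation_rate) * 10_000
    permanent = permanent_impact_coef * volatility * participation_rate * 10_000
    return {
        "temporary_impact_bps": temporary,
        "permanent_impact_bps": permanent,
        "total_impact_bps": temporary + permanent,  # assumed to be the sum
    }


point = impact_bps(participation_rate=0.04, volatility=0.02)
```

At low participation the square-root term dominates; the linear permanent term only catches up as participation approaches the ratio of the two coefficients squared.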
Risk-Model Generators
StatisticalRiskModelGenerator
Creates a synthetic asset-return matrix from latent factors, then fits a
PCA-style statistical risk model. .generate() returns a dictionary
with three polars frames:
| Key | Schema |
|---|---|
| | |
| | |
| | |
Convenience wrapper: generate_statistical_risk_model(...).
FundamentalRiskModelGenerator
Creates Barra-style factor loadings with a categorical sector drawn
from the finance-enums sector taxonomy, a constant market exposure
of 1.0, standardized style factors
(value, momentum, size, quality, low_vol, growth by
default), and positive specific_variance.
Convenience wrapper: generate_fundamental_risk_model(...).
FactorCovarianceGenerator
Creates a symmetric positive semidefinite covariance matrix with a
leading factor label column and one numeric column per factor. Factor
volatilities decay by eigen_decay, and cross-factor correlations decay
with factor distance.
Convenience wrapper: generate_factor_covariance(...).
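One way to build a matrix with these properties is cov[i][j] = vol_i * vol_j * corr_decay ** |i - j|, with vol_k = base_vol * eigen_decay ** k. The sketch below uses that form. Only `eigen_decay` appears on this page; `base_vol`, `corr_decay`, and the geometric shape of both decays are assumptions.

```python
def factor_covariance_sketch(n_factors, base_vol=0.2, eigen_decay=0.8, corr_decay=0.5):
    """Covariance with geometrically decaying vols and distance-decaying correlations."""
    if not 0.0 <= corr_decay < 1.0:
        raise ValueError("corr_decay must lie in [0, 1)")
    vols = [base_vol * eigen_decay ** k for k in range(n_factors)]
    return [
        [vols[i] * vols[j] * corr_decay ** abs(i - j) for j in range(n_factors)]
        for i in range(n_factors)
    ]


cov = factor_covariance_sketch(3)
```

The correlation part is the familiar AR(1)-style matrix, which is positive definite for decay values in [0, 1), so scaling it by the volatilities keeps the result symmetric and positive semidefinite.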
SpecificVarianceGenerator
Creates a positive idiosyncratic variance vector with lognormal
dispersion around target_vol ** 2.
Convenience wrapper: generate_specific_variance(...).
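A minimal sketch, assuming target_vol ** 2 is the median of the lognormal draw; the function name and the `dispersion` parameter are hypothetical.

```python
import math
import random


def specific_variances(n_assets, target_vol=0.2, dispersion=0.3, seed=None):
    """Lognormal variances whose median is target_vol ** 2."""
    rng = random.Random(seed)
    center = target_vol ** 2
    # exp() of a Gaussian is always positive, so every variance is > 0.
    return [center * math.exp(dispersion * rng.gauss(0.0, 1.0)) for _ in range(n_assets)]


variances = specific_variances(100, seed=4)
```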
Reproducibility
Every generator and ohlc_from_close accept an optional seed: int.
Rust generators initialize a ChaCha8 PRNG (via rand_chacha), which
is portable across platforms and architectures. Python generators
initialize numpy.random.default_rng(seed) and are deterministic within
the same numpy version.
```python
from finance_datagen import GBMGenerator

# Two generators built with the same seed produce identical frames.
a = GBMGenerator(seed=42).generate()
b = GBMGenerator(seed=42).generate()
assert a.equals(b)
```
If seed is omitted, the generator seeds from OS entropy and the path
will differ on every call.
Why Arrow?
The Rust core never imports polars. Polars-rs and the polars Python
wheel use incompatible internal ABIs, so linking polars on both sides
of the FFI boundary leads to crashes that are extremely hard to debug.
Arrow is a stable, language-agnostic columnar format: the Rust side
builds an arrow_array::RecordBatch, hands it to Python over the
Arrow C Data Interface
PyCapsule, and the Python side calls polars.from_arrow(batch) to wrap
the same buffers into a polars.DataFrame.
If you prefer to skip the polars wrapping, you can pull the raw
pyarrow.RecordBatch out of the Rust extension directly:
```python
from finance_datagen.finance_datagen import GBMGenerator as RustGBM

batch = RustGBM(seed=0).record_batch()  # pyarrow.RecordBatch
```