Experience #01 · Summer 2025
Sybella SA
Data engineering foundations for an internal backtesting platform in quantitative finance.
Context
Sybella is a young firm building quantitative finance tools. They needed an in-house backtesting platform to replace the opaque proprietary "black boxes" they were using, with the goal of full auditability over signal construction, performance evaluation, and risk management. The original mission also included an AI-driven signal layer on top.
Ramp-up
The first three weeks were reading, not coding. Two books: Quantitative Momentum by Wesley Gray and Jack Vogel, and Tobias Carlisle's The Acquirer's Multiple. They argue for opposing philosophies (momentum and value) but share a recurring lesson: durable performance comes from accumulating small robust blocks, not from finding a silver bullet.
Then five VectorBT Pro tutorials, in order: Basic RSI, Stop Signals, Pairs Trading, Portfolio Optimization, Signal Development. The first four covered standard quant patterns — momentum thresholds, stop-loss / take-profit / trailing-stop logic, mean-reversion on cointegrated pairs, Sharpe-based allocation. Signal Development was the most formative because it reframed how I thought about what a signal even is.
In VectorBT Pro a signal is a boolean mask (True or False per timestamp) rather than a pre-sized order. The distinction looks academic and turns out to be the heart of the discipline. Generating the signal (the theoretical hypothesis about when to enter or exit) becomes strictly separate from sizing it into a portfolio (how much capital, what frictions, what stops). Variants of the same idea can then be compared under identical conditions: same data, same friction assumptions, same look-ahead guard. Grouping consecutive True values into partitions, and acting on each partition once, prevents signal bursts when an indicator hovers around a threshold; missing values are treated explicitly rather than dropped silently.
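A minimal sketch of the partition idea, written in plain Python rather than with VectorBT Pro's own API. The function name, the threshold of 30, and the RSI readings are invented for illustration.

```python
from typing import List


def first_of_each_partition(mask: List[bool]) -> List[bool]:
    """Keep only the first True of every run of consecutive True values."""
    cleaned = []
    previous = False
    for current in mask:
        # A True survives only if the previous timestamp was False.
        cleaned.append(current and not previous)
        previous = current
    return cleaned


# Hypothetical RSI readings hovering around an oversold threshold of 30.
rsi = [35.0, 29.0, 28.5, 31.0, 29.5, 29.0, 40.0]

# The signal is just a boolean mask: no size, no fees, no stops yet.
raw_entries = [value < 30.0 for value in rsi]

# One entry per partition instead of one per bar below the threshold.
entries = first_of_each_partition(raw_entries)
```

Here the raw mask fires on four bars, while the cleaned mask fires twice, once at the start of each run. Sizing and frictions would be applied later, by the simulation step, to whichever mask is chosen.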
The methodological points that carried over to the data lake work later came from this tutorial more than from the books: never let a future timestamp leak into a feature, sanitise NaNs deliberately, separate signal generation from simulation, and treat parameter sweeps as experiments rather than as hyperparameter tuning.
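The look-ahead rule can be illustrated with a small sketch in plain Python. This is an assumed example, not code from the project: the feature at index t is a rolling mean that only sees prices strictly before t, and the warm-up period is marked as explicitly missing instead of being dropped.

```python
from typing import List, Optional


def lagged_rolling_mean(prices: List[float], window: int) -> List[Optional[float]]:
    """Mean of the `window` prices strictly before index t; None during warm-up."""
    feature: List[Optional[float]] = []
    for t in range(len(prices)):
        if t < window:
            feature.append(None)  # explicit missing value, not a silent drop
        else:
            # The slice prices[t - window:t] excludes prices[t] itself.
            feature.append(sum(prices[t - window:t]) / window)
    return feature


prices = [10.0, 11.0, 12.0, 13.0, 14.0]
feature = lagged_rolling_mean(prices, window=2)

# Changing the last price must not change any earlier feature value.
tampered = lagged_rolling_mean(prices[:-1] + [999.0], window=2)
```

Comparing `feature` and `tampered` gives a cheap leak check: if a future price can alter a past feature value, the feature is looking ahead.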
The pivot
By the end of the ramp-up, the picture was clear: the project lacked a usable data foundation. No reproducibility, no audit trail, no time-travel. Every backtest result would have been impossible to defend. The mission was deliberately re-scoped to building that foundation first — a versioned, auditable financial data lake. The AI layer was deferred.
Data foundations come before models. AI on bad data is worse than no AI.
Three iterations, one delivered
01 — Apache Iceberg + PyIceberg + MinIO
Promising on paper, but blocked by three things: pandas nanosecond timestamps (the Iceberg timestamp type I was writing to only supported microsecond precision), painful S3 path-style configuration on Windows, and ingestion times far above the one-hour target. Documented and pivoted.
02 — Spark + Hive locally
Faster ingestion in theory, unstable in practice on Windows: heavy dependency stack, metastore issues, out-of-memory errors during audits. Pivoted again.
03 — PostgreSQL with an SCD2 model
The simple choice, and the robust one. Each row is versioned with valid_from, valid_thru, and is_active columns. Snapshots provide true point-in-time queries. Full-load uses yearly partitioning for parallelism, COPY ... FROM STDIN for streamed insertion, and index-drop-then-recreate to accelerate massive writes. Incremental updates use a configurable lookback window (typically 7–14 days) and a source_hash per row to detect changes.
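A minimal sketch of the SCD2 mechanics, using SQLite from the Python standard library as a stand-in for PostgreSQL so that it runs anywhere. Only valid_from, valid_thru, is_active, and source_hash come from the description above; the table name, the other columns, the helper functions, and the choice of NULL for an open-ended valid_thru are assumptions.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE candles (
        symbol      TEXT NOT NULL,
        day         TEXT NOT NULL,
        close       REAL NOT NULL,
        source_hash TEXT NOT NULL,
        valid_from  TEXT NOT NULL,
        valid_thru  TEXT,             -- NULL while the version is current (assumption)
        is_active   INTEGER NOT NULL
    )
""")


def row_hash(symbol: str, day: str, close: float) -> str:
    """Fingerprint of the source payload, used to detect upstream changes."""
    return hashlib.sha256(f"{symbol}|{day}|{close!r}".encode()).hexdigest()


def upsert(symbol: str, day: str, close: float, run_ts: str) -> bool:
    """Write a new version only if the payload changed; return True if written."""
    new_hash = row_hash(symbol, day, close)
    current = conn.execute(
        "SELECT source_hash FROM candles WHERE symbol=? AND day=? AND is_active=1",
        (symbol, day),
    ).fetchone()
    if current is not None and current[0] == new_hash:
        return False  # unchanged upstream, keep the active version
    # Close the active version (if any), then insert the new one.
    conn.execute(
        "UPDATE candles SET valid_thru=?, is_active=0 "
        "WHERE symbol=? AND day=? AND is_active=1",
        (run_ts, symbol, day),
    )
    conn.execute(
        "INSERT INTO candles VALUES (?, ?, ?, ?, ?, NULL, 1)",
        (symbol, day, close, new_hash, run_ts),
    )
    return True


def close_as_of(symbol: str, day: str, as_of: str):
    """Point-in-time read: the value as it was known at `as_of`."""
    row = conn.execute(
        "SELECT close FROM candles WHERE symbol=? AND day=? "
        "AND valid_from<=? AND (valid_thru IS NULL OR valid_thru>?)",
        (symbol, day, as_of, as_of),
    ).fetchone()
    return None if row is None else row[0]


first = upsert("ABC", "2025-06-02", 100.0, "2025-06-03T00:00:00")
repeat = upsert("ABC", "2025-06-02", 100.0, "2025-06-04T00:00:00")      # no change
corrected = upsert("ABC", "2025-06-02", 101.5, "2025-06-05T00:00:00")   # correction
```

After the upstream correction, `close_as_of("ABC", "2025-06-02", "2025-06-04T12:00:00")` still returns the value a backtest would have seen on that date, while a later `as_of` returns the corrected one. ISO-8601 strings are used for timestamps here because they compare correctly as text in SQLite; a PostgreSQL schema would more naturally use a timestamp type.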
Result
The data engineering deliverable is a small ecosystem rather than a single script. The full-load pipeline (03_run_full.py) ingests 80 million rows in under one hour by parallelising across years and threading by symbol, dropping indexes during the write and recreating them afterwards. The incremental updater (03_run_incr.py) re-checks the last 7–14 days against a per-row source_hash to catch upstream corrections, and inserts around 1 million versioned rows per day. Three reference-data syncers keep the schema's static dimensions current: metadata, fundamentals, and index memberships.
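The yearly partitioning behind the full load can be sketched as follows. This is a simplified illustration, not the contents of 03_run_full.py: the function names are hypothetical, a thread pool stands in for whatever mix of processes and threads the real pipeline uses, and the loader only counts calendar days where the real one would fetch and write rows.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from typing import List, Tuple

Partition = Tuple[date, date]


def yearly_partitions(start: date, end: date) -> List[Partition]:
    """Split [start, end] into inclusive slices that never cross a year boundary."""
    parts = []
    for year in range(start.year, end.year + 1):
        lo = max(start, date(year, 1, 1))
        hi = min(end, date(year, 12, 31))
        parts.append((lo, hi))
    return parts


def load_partition(part: Partition) -> int:
    """Stand-in for the real work on one slice; returns the days it covers."""
    lo, hi = part
    return (hi - lo).days + 1


parts = yearly_partitions(date(2023, 3, 15), date(2025, 2, 10))

# Each slice is independent, so slices can be processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(load_partition, parts))

total_days = sum(counts)
```

Because the slices are disjoint and together cover the full range, the per-slice counts can be summed and checked against the expected total, which is a useful sanity check after a parallel load.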
A server-side run_ts guarantees temporal coherence across symbols, candles, and snapshots: every record written in the same run shares the same timestamp, regardless of client clock drift. A versioned migration system and a pg_dump-based backup/restore harness make the whole thing operable on a personal workstation, with a clear path to a server deployment when the team is ready. The data lake is now the foundation on which Sybella's future backtesting work stands.
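The run_ts idea can be sketched in a few lines, again with SQLite standing in for PostgreSQL. The table layout and the start_run helper are assumptions; the point is only that the timestamp is requested from the database once per run and then reused for every write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (symbol TEXT, run_ts TEXT)")
conn.execute("CREATE TABLE candles (symbol TEXT, close REAL, run_ts TEXT)")


def start_run(connection: sqlite3.Connection) -> str:
    """Ask the database, not the client clock, for the run timestamp."""
    row = connection.execute(
        "SELECT strftime('%Y-%m-%dT%H:%M:%fZ', 'now')"
    ).fetchone()
    return row[0]


# Fetched once, then stamped on every record written during this run.
run_ts = start_run(conn)

conn.executemany(
    "INSERT INTO symbols VALUES (?, ?)",
    [("ABC", run_ts), ("XYZ", run_ts)],
)
conn.executemany(
    "INSERT INTO candles VALUES (?, ?, ?)",
    [("ABC", 100.0, run_ts), ("XYZ", 42.0, run_ts)],
)

# UNION removes duplicates, so a coherent run yields exactly one timestamp.
distinct = conn.execute(
    "SELECT run_ts FROM symbols UNION SELECT run_ts FROM candles"
).fetchall()
```

With a single shared value, "everything written by run X" becomes an equality filter on run_ts across all tables, with no tolerance window for clock differences between machines.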
What I learned
Pivoting is engineering, not failure.
Robust simplicity beat fragile complexity. PostgreSQL plus SCD2 turned out more auditable, more maintainable, and faster in practice than the state-of-the-art Iceberg and Spark stacks at this scope. The hardest part wasn't the technical work; it was having the clarity to abandon two attempts before they trapped the entire project.
Documenting why three iterations happened is more valuable than pretending the first guess was right.
Working alone wouldn't have worked. I'm a third-year intern in a domain I had never touched, with one summer to ship something useful. Asking before guessing, integrating into the team's daily rhythm, and shipping modest pieces reliably — that's what made anything compound.
On the AI side: I used LLMs daily as a productivity tool, but treated them as a junior colleague — never an autopilot. You get the answers your questions deserve.