Experience #01 · Summer 2025

Sybella SA

Data engineering foundations for an internal backtesting platform in quantitative finance.

Context

Sybella is a young firm working on quantitative finance tools. They needed an in-house backtesting platform to replace the opaque proprietary "black boxes" they were using — the goal being full auditability over signal construction, performance evaluation, and risk management. The original mission also included an AI-driven signal layer on top.

Ramp-up

The first three weeks were reading, not coding. Two books: Wesley Gray and Jack Vogel's Quantitative Momentum and Tobias Carlisle's The Acquirer's Multiple. They argue for opposing philosophies, momentum and value, but share a recurring lesson: durable performance comes from accumulating small robust blocks, not from finding a silver bullet.

Then five VectorBT Pro tutorials, in order: Basic RSI, Stop Signals, Pairs Trading, Portfolio Optimization, Signal Development. The first four covered standard quant patterns — momentum thresholds, stop-loss / take-profit / trailing-stop logic, mean-reversion on cointegrated pairs, Sharpe-based allocation. Signal Development was the most formative because it reframed how I thought about what a signal even is.

In VectorBT Pro a signal is a boolean mask, True or False per timestamp, rather than a pre-sized order. The distinction looks academic and turns out to be the heart of the discipline. Generating the signal (the theoretical hypothesis about when to enter or exit) becomes strictly separate from sizing it into a portfolio (how much capital, what frictions, what stops). Variants of the same idea can then be compared under identical conditions: same data, same friction assumptions, same look-ahead guard. Each run of consecutive True values is treated as one partition, and acting only on its first element prevents signal bursts when an indicator hovers around a threshold. Missing values are handled explicitly rather than dropped silently.
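The mask-and-partition idea can be sketched in a few lines of plain Python. This is an illustration only, not the VectorBT Pro API: the function names, the RSI values, and the threshold of 30 are assumptions made for the example.

```python
import math


def threshold_signal(values, threshold):
    """Boolean mask: True where the indicator is below the threshold.

    Missing values (None or NaN) are mapped to False explicitly
    instead of being dropped from the series.
    """
    mask = []
    for v in values:
        missing = v is None or math.isnan(v)
        mask.append((not missing) and v < threshold)
    return mask


def first_of_partition(mask):
    """Keep only the first True of each run of consecutive True values."""
    entries = []
    previous = False
    for flag in mask:
        entries.append(flag and not previous)
        previous = flag
    return entries


# Hypothetical RSI readings hovering around the threshold.
rsi = [35.0, 29.0, 28.5, 31.0, float("nan"), 29.9, 29.5]
raw = threshold_signal(rsi, threshold=30.0)
entries = first_of_partition(raw)

print(raw)      # one flag per timestamp
print(entries)  # one entry per partition
```

The raw mask fires on four bars, but only two entries survive, one per partition. Nothing here says how much to buy: sizing would be a separate step applied to the mask afterwards.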

The methodological points that carried over to the data lake work later came from this tutorial more than from the books: never let a future timestamp leak into a feature, sanitise NaNs deliberately, separate signal generation from simulation, and treat parameter sweeps as experiments rather than as hyperparameter tuning.
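The look-ahead rule is easy to state and easy to break, so a small check helps. The sketch below is an assumption-laden illustration (the trailing_mean helper, the window of 2, and the prices are invented for the example): a feature at time t is built only from bars strictly before t, and the guard confirms that tampering with the latest price leaves every feature unchanged.

```python
def trailing_mean(prices, window):
    """Mean of the `window` prices strictly before each timestamp.

    The slice ends at t (exclusive), so prices[t] itself never
    contributes to the feature available at time t. Warm-up bars and
    windows containing a missing value yield None deliberately.
    """
    features = []
    for t in range(len(prices)):
        past = prices[max(0, t - window):t]
        if len(past) < window or any(p is None for p in past):
            features.append(None)
        else:
            features.append(sum(past) / window)
    return features


prices = [10.0, 11.0, 12.0, 13.0, 14.0]
baseline = trailing_mean(prices, window=2)

# Leak guard: overwrite the most recent bar and recompute.
# No feature may depend on it, so the result must be identical.
tampered = prices[:-1] + [999.0]
leak_free = trailing_mean(tampered, window=2) == baseline

print(baseline)
print(leak_free)
```

The same tampering test can be pointed at any feature function: if changing a bar alters a feature dated at or before that bar, something is leaking.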

The pivot

By the end of the ramp-up, the picture was clear: the project lacked a usable data foundation. No reproducibility, no audit trail, no time-travel. Every backtest result would have been impossible to defend. The mission was deliberately re-scoped to building that foundation first — a versioned, auditable financial data lake. The AI layer was deferred.

Data foundations come before models. AI on bad data is worse than no AI.

Three iterations, one delivered

01 — Apache Iceberg + PyIceberg + MinIO

Promising on paper but blocked by Pandas nanosecond timestamps (Iceberg's standard timestamp type is microsecond-precision), painful S3 path-style configuration on Windows, and ingestion times far above the one-hour target. Documented and pivoted.
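The precision mismatch itself is simple to illustrate with integers. The sketch below uses only the standard library and is not taken from the project: the helper names and sample values are assumptions. It converts epoch nanoseconds to epoch microseconds and reports whether any sub-microsecond detail was discarded on the way.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_us(epoch_ns):
    """Truncate epoch nanoseconds to microseconds.

    Returns (microseconds, lossy), where lossy is True when the
    nanosecond value carried detail below one microsecond.
    """
    micros, leftover_ns = divmod(epoch_ns, 1_000)
    return micros, leftover_ns != 0


def us_to_datetime(epoch_us):
    """Python datetimes are microsecond-precision, so this is exact."""
    return EPOCH + timedelta(microseconds=epoch_us)


# Hypothetical timestamps: one aligned to a microsecond, one not.
exact_us, exact_lossy = ns_to_us(1_700_000_000_123_456_000)
trunc_us, trunc_lossy = ns_to_us(1_700_000_000_123_456_789)

print(exact_us, exact_lossy)
print(trunc_us, trunc_lossy)
print(us_to_datetime(trunc_us).isoformat())
```

For daily or minute candles the truncation is usually harmless, but it has to be applied explicitly on every timestamp column before the write.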

02 — Spark + Hive locally

Faster ingestion in theory, unstable in practice on Windows: heavy dependency stack, metastore issues, out-of-memory errors during audits. Pivoted again.

03 — PostgreSQL with an SCD2 model

The simple choice, and the robust one. Each row is versioned with valid_from, valid_thru, and is_active columns, so a correction closes the old version instead of overwriting it. Snapshots provide true point-in-time queries. The full load uses yearly partitioning for parallelism, COPY ... FROM STDIN for streamed insertion, and an index-drop-then-recreate step to accelerate bulk writes. Incremental updates use a configurable lookback window (typically 7–14 days) and a per-row source_hash to detect changes.
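A minimal sketch of the SCD2 mechanics, using the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere. Only valid_from, valid_thru, is_active, and source_hash come from the description above. The table layout, the other column names, the hashing recipe, and the helper functions are assumptions for illustration, not the project's actual schema.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE candles (
    symbol      TEXT NOT NULL,
    day         TEXT NOT NULL,
    close       REAL NOT NULL,
    source_hash TEXT NOT NULL,
    valid_from  TEXT NOT NULL,
    valid_thru  TEXT,            -- NULL while the version is current
    is_active   INTEGER NOT NULL
)
"""


def source_hash(symbol, day, close):
    """Fingerprint of the upstream values, used to detect corrections."""
    payload = f"{symbol}|{day}|{close!r}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_version(conn, symbol, day, close, run_ts):
    """Insert a new version only if the upstream row changed."""
    new_hash = source_hash(symbol, day, close)
    current = conn.execute(
        "SELECT source_hash FROM candles "
        "WHERE symbol = ? AND day = ? AND is_active = 1",
        (symbol, day),
    ).fetchone()
    if current is not None and current[0] == new_hash:
        return False  # unchanged upstream: nothing to write
    # Close the previous version (no-op if there is none) ...
    conn.execute(
        "UPDATE candles SET valid_thru = ?, is_active = 0 "
        "WHERE symbol = ? AND day = ? AND is_active = 1",
        (run_ts, symbol, day),
    )
    # ... then open the new one.
    conn.execute(
        "INSERT INTO candles VALUES (?, ?, ?, ?, ?, NULL, 1)",
        (symbol, day, close, new_hash, run_ts),
    )
    return True


def close_as_of(conn, symbol, day, ts):
    """Point-in-time read: the value as it was known at timestamp ts."""
    row = conn.execute(
        "SELECT close FROM candles "
        "WHERE symbol = ? AND day = ? AND valid_from <= ? "
        "AND (valid_thru IS NULL OR valid_thru > ?)",
        (symbol, day, ts, ts),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = apply_version(conn, "AAA", "2025-06-02", 100.0, "2025-06-03T00:00:00")
repeat = apply_version(conn, "AAA", "2025-06-02", 100.0, "2025-06-04T00:00:00")
fixed = apply_version(conn, "AAA", "2025-06-02", 101.0, "2025-06-10T00:00:00")

print(first, repeat, fixed)
print(close_as_of(conn, "AAA", "2025-06-02", "2025-06-05T00:00:00"))
print(close_as_of(conn, "AAA", "2025-06-02", "2025-06-11T00:00:00"))
```

The unchanged re-check writes nothing, the correction adds a second version, and a backtest dated 5 June still sees the value that was known on 5 June. Timestamps are ISO-8601 strings here so that they compare correctly as text; a PostgreSQL table would more naturally use a timestamp column.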

Result

The data engineering deliverable is a small ecosystem rather than a single script. The full-load pipeline (03_run_full.py) ingests 80 million rows in under one hour by parallelising across years and threading by symbol, dropping indexes before the write and recreating them afterwards. The incremental updater (03_run_incr.py) re-checks the last 7–14 days against a per-row source_hash to catch upstream corrections, and inserts around 1 million versioned rows per day. Three reference-data syncers keep the schema's static dimensions current: metadata, fundamentals, and index memberships.
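The shape of the full load can be sketched without a database. The code below is not 03_run_full.py: every function name and the row layout are assumptions, threads stand in for whatever mix of processes and threads the real pipeline uses, and the database write is replaced by a stand-in that counts the lines a COPY ... FROM STDIN call would receive. It shows only the fan-out by year and the in-memory CSV buffer.

```python
import csv
import io
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_year(rows):
    """Group rows by the year prefix of their ISO date."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["day"][:4]].append(row)
    return dict(buckets)


def to_copy_buffer(rows):
    """Serialise rows to an in-memory CSV stream, ready to be streamed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([row["symbol"], row["day"], row["close"]])
    buffer.seek(0)
    return buffer


def load_year(item):
    """Load one yearly partition and return (year, rows_written).

    Stand-in only: a real implementation would hand the buffer to the
    PostgreSQL driver's COPY support instead of counting lines.
    """
    year, rows = item
    buffer = to_copy_buffer(rows)
    return year, sum(1 for _ in buffer)


def run_full_load(rows, max_workers=4):
    # In the real pipeline, indexes would be dropped before this
    # point and recreated once every partition has been written.
    buckets = partition_by_year(rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(load_year, buckets.items()))


# Hypothetical sample rows.
rows = [
    {"symbol": "AAA", "day": "2023-01-02", "close": 10.0},
    {"symbol": "BBB", "day": "2023-01-02", "close": 20.0},
    {"symbol": "AAA", "day": "2024-01-02", "close": 11.0},
]
loaded = run_full_load(rows)
print(loaded)
```

Partitioning by year keeps the workers independent of one another, which is what lets the write phase scale with the number of workers rather than with the total row count.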

A server-side run_ts guarantees temporal coherence across symbols, candles, and snapshots — every record written in the same run shares the same timestamp, regardless of client clock drift. A versioned migration system and a pg_dump-based backup/restore harness make the whole thing operable on a personal workstation, with a clear path to a server deployment when the team is ready. The data lake is now the foundation on which Sybella's future backtesting work stands.
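The run_ts idea reduces to one rule: ask the database for the time once, then stamp every write in the run with that value. The sketch below again uses sqlite3 as a runnable stand-in. The three table names echo the ones mentioned above, but their columns and the helper functions are assumptions. With PostgreSQL, the single query would typically be SELECT now().

```python
import sqlite3

TABLES = ("symbols", "candles", "snapshots")


def fetch_run_ts(conn):
    """Read the clock from the database, not from the client."""
    query = "SELECT strftime('%Y-%m-%dT%H:%M:%fZ', 'now')"
    return conn.execute(query).fetchone()[0]


def run_once(conn, symbol):
    """Write to every table using one shared run timestamp."""
    run_ts = fetch_run_ts(conn)  # fetched once, reused everywhere
    for table in TABLES:
        conn.execute(
            f"INSERT INTO {table} (symbol, run_ts) VALUES (?, ?)",
            (symbol, run_ts),
        )
    conn.commit()
    return run_ts


conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (symbol TEXT, run_ts TEXT)")

run_ts = run_once(conn, "AAA")
stamps = {
    row[0]
    for table in TABLES
    for row in conn.execute(f"SELECT run_ts FROM {table}")
}
print(run_ts, stamps)
```

Because the timestamp is read once and passed down, rows from the same run can later be joined or rolled back as a unit, whatever the clocks on the ingesting machines say.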

What I learned

Pivoting is engineering, not failure.

Robust simplicity beat fragile complexity. At this scope, PostgreSQL plus SCD2 turned out more auditable, more maintainable, and faster in practice than the state-of-the-art Iceberg and Spark stacks. The hardest part wasn't the technical work; it was the lucidity to abandon two attempts before they trapped the entire project.

Documenting why three iterations happened is more valuable than pretending the first guess was right.

Going it alone wouldn't have worked. I'm a third-year intern in a domain I had never touched, with one summer to ship something useful. Asking before guessing, integrating into the team's daily rhythm, and shipping modest pieces reliably: that's what made the work compound.

On the AI side: I used LLMs daily as a productivity tool, but treated them as a junior colleague — never an autopilot. You get the answers your questions deserve.