The Data Lake: 93 Million Rows and Growing

Section A

Infrastructure overview

Every decision the system makes flows through four stages. Raw data enters from multiple sources, is processed into signals, is stored in a purpose-built data layer, and is consumed by the trading engines in real time.

Stage 1: Data Sources → Stage 2: Processing → Stage 3: Storage → Stage 4: Engines

Source 01

Market Data

Yahoo Finance, Databento (CME GLBX.MDP3), real-time pricing for 3,792 instruments

Source 02

Earnings Transcripts

Quarterly earnings call transcripts for 500+ public companies, cached locally in transcripts.db

Source 03

Fundamental Data

Point-in-time fundamentals (127 dates ingested), analyst estimates (10K+ rows, 1,115 instruments)

Source 04

Alternative Data

Crypto funding rates, exchange order books, VIX term structure

Section B

The numbers

Instruments Monitored

3,792

Updated daily across all asset classes

Signal Bank Rows

93M+

DuckDB columnar storage

Transcripts Scored

2,896

NLP-scored earnings calls

CME Futures Contracts

39

Databento tick data pipeline

Equities in Universe

4,784

Dynamic, auto-updated universe

Crypto Perpetuals

212

Perpetual futures contracts

Fundamental Snapshots

127

Point-in-time, no lookahead

Analyst Estimates

10K+

Records across 1,115 instruments

Historical Depth

16 yr

2010 – 2026, continuous coverage

Section C

Why point-in-time matters

No lookahead bias

Every fundamental data point is stored as it was known on the date it became available. We never use information that wasn't available when the decision was made. This prevents lookahead bias, one of the most common and most damaging errors in quantitative research.

When a company reports earnings on February 3rd, that data enters our system on February 3rd. Not before. Not backdated. Our 127 fundamental snapshots reconstruct the exact information landscape that existed at each decision point across 16 years of history.

Many quantitative shops skip this step because it is expensive and tedious. We consider it non-negotiable.
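The point-in-time rule described above can be illustrated with a minimal sketch. It assumes each snapshot is stored with the date it became known and that a lookup returns only the latest value known on or before the decision date. The `EPS_SNAPSHOTS` list, its values, and the `as_of` helper are hypothetical and not part of the actual system.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical snapshot store: (known_on, value) pairs sorted by the date
# the figure became available. The values are illustrative, not real data.
EPS_SNAPSHOTS = [
    (date(2023, 11, 2), 1.10),
    (date(2024, 2, 3), 1.25),
    (date(2024, 5, 4), 1.31),
]


def as_of(snapshots, decision_date):
    """Return the latest value known on or before decision_date, else None."""
    known_dates = [known_on for known_on, _ in snapshots]
    # bisect_right counts the snapshots with known_on <= decision_date.
    idx = bisect_right(known_dates, decision_date)
    return snapshots[idx - 1][1] if idx else None


# A decision made on February 2nd cannot see the February 3rd report.
print(as_of(EPS_SNAPSHOTS, date(2024, 2, 2)))  # 1.1
print(as_of(EPS_SNAPSHOTS, date(2024, 2, 3)))  # 1.25
```

The key design choice in this sketch is indexing by the date the data became known rather than by the fiscal period it describes, so a backtest can never read a value earlier than it was published.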

Section D

The data moat

Instruments matter more than strategies. Scale matters more than exploration.

Every new instrument we add creates combinatorial expansion with every existing archetype. Adding 100 crypto perpetuals doesn't just test 100 things: it tests 100 instruments × 173 archetypes = 17,300 new configurations.

This is why the factory gets more powerful every day. The data moat compounds. Each new data source multiplies the value of every existing strategy template. The marginal cost of testing drops while the marginal value of discovery rises.
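The combinatorial expansion above can be checked with a short sketch, assuming a configuration is simply one (instrument, archetype) pair. The identifiers below are placeholders; the real symbols and archetype names are not given here.

```python
from itertools import product

ARCHETYPE_COUNT = 173  # archetype count quoted in the text


def new_configurations(new_instruments, archetypes):
    """Pair every newly added instrument with every existing archetype."""
    return list(product(new_instruments, archetypes))


# Placeholder identifiers, for illustration only.
instruments = [f"PERP_{i:03d}" for i in range(100)]
archetypes = [f"archetype_{j:03d}" for j in range(ARCHETYPE_COUNT)]

configs = new_configurations(instruments, archetypes)
print(len(configs))  # 17300
```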

Section E

Infrastructure stack

Every component was chosen for a specific reason. No general-purpose frameworks, no cloud dependencies for latency-sensitive operations, no unnecessary abstractions.

DuckDB

Signal bank (93M+ rows, columnar, fast analytics). Optimized for analytical queries across tens of millions of signal observations, with sub-second aggregation across the full history.

SQLite

Research database, paper trading state, runner state. Lightweight, zero-config, battle-tested. One file per purpose, no server overhead.

Databento

CME futures tick data pipeline. 39 instruments, GLBX.MDP3 feed, continuous forward updates through May 2026.

QuantConnect

Walk-forward backtesting platform. Every strategy is validated on genuinely out-of-sample data before it enters the portfolio.

Mac Mini

24/7 paper trading engine. Always-on, low power, dedicated hardware. No cloud provider outages, no billing surprises.

launchd

Automated scheduling. The factory pipeline runs at 2 a.m.; data updates, monitoring, and health checks all run without human intervention.
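As an illustration of the 2 a.m. schedule, the sketch below builds a launchd job definition with Python's standard `plistlib`. `Label`, `ProgramArguments`, and `StartCalendarInterval` are standard launchd keys. The job label and script path are hypothetical placeholders, and the actual job files may look different.

```python
import plistlib


def build_launchd_job(label, program_args, hour, minute=0):
    """Build a launchd job definition that runs once a day at hour:minute."""
    return {
        "Label": label,
        "ProgramArguments": list(program_args),
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }


# The label and script path are hypothetical placeholders.
job = build_launchd_job(
    "com.example.factory-pipeline",
    ["/usr/bin/python3", "/path/to/research/factory_pipeline.py"],
    hour=2,
)

# Serialize to the XML property-list format that launchd reads.
plist_bytes = plistlib.dumps(job)
print(plist_bytes.decode("utf-8"))
```

A per-user job of this kind is typically saved under ~/Library/LaunchAgents and loaded with launchctl.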

Section F

30,000+ lines of purpose-built code

runner/

Paper trading engine, regime detection, circuit breakers, signal aggregation

research/

Factory pipeline, scanner, strategy validation, walk-forward testing

signal_bank/

DuckDB producers, signal composers, information coefficient tracker

crypto/

Crypto carry executor, funding rate monitor, exchange integration

showcase/

NLP pipeline, earnings transcript scoring, CEO psychology analysis

Every line of code was written to solve a specific problem. No frameworks, no templates, no boilerplate. Purpose-built infrastructure for one thing: finding and deploying uncorrelated edges.

The data infrastructure is not a feature we market. It is the foundation that makes everything else possible. The strategies change, the allocations shift, the markets evolve — but the data layer compounds silently underneath, getting deeper and wider every day. That is the moat.