
The Data Lake: 93 Million Rows and Growing

The infrastructure behind the system — what we monitor, how we store it, and why the data moat matters more than the signal moat.

Technical document · Prism Capital Research · Updated June 2026
Section A

Infrastructure overview

Every decision the system makes flows through four stages. Raw data enters from multiple sources, is processed into signals, is stored in a purpose-built data layer, and is consumed by the trading engines in real time.

Stage 1: Data Sources
Stage 2: Processing
Stage 3: Storage
Stage 4: Engines

Source 01: Market Data
Yahoo Finance, Databento (CME GLBX.MDP3), real-time pricing for 3,792 instruments
Source 02: Earnings Transcripts
Quarterly earnings call transcripts for 500+ public companies, cached locally in transcripts.db
Source 03: Fundamental Data
Point-in-time fundamentals (127 dates ingested), analyst estimates (10K+ rows, 1,115 instruments)
Source 04: Alternative Data
Crypto funding rates, exchange order books, VIX term structure

Section B

The numbers

Transcripts Scored: 2,896 (NLP-scored earnings calls)
CME Futures Contracts: 39 (Databento tick data pipeline)
Equities in Universe: 4,784 (dynamic, auto-updated universe)
Crypto Perpetuals: 212 (perpetual futures contracts)
Fundamental Snapshots: 127 (point-in-time, no lookahead)
Analyst Estimates: 10K+ (records across 1,115 instruments)
Historical Depth: 16 yr (2010–2026, continuous coverage)

Section C

Why point-in-time matters

No lookahead bias

Every fundamental data point is stored as it was known on the date it became available. We never use information that wasn't available when the decision was made. This prevents lookahead bias, one of the most common and most damaging errors in quantitative research.

When a company reports earnings on February 3rd, that data enters our system on February 3rd. Not before. Not backdated. Our 127 fundamental snapshots reconstruct the exact information landscape that existed at each decision point across 16 years of history.

Most quantitative shops skip this step because it is expensive and tedious. We consider it non-negotiable.
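
As a minimal sketch of the point-in-time idea, the lookup below returns only values whose availability date is on or before the decision date. The table name `fundamentals`, the `known_at` column, the ticker `ACME`, and the helper `as_of` are hypothetical names chosen for illustration and are not taken from the production schema.

```python
import sqlite3

# Hypothetical schema: each row records a value and the date it became known.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fundamentals ("
    " ticker TEXT, metric TEXT, value REAL,"
    " period_end TEXT, known_at TEXT)"
)
conn.executemany(
    "INSERT INTO fundamentals VALUES (?, ?, ?, ?, ?)",
    [
        # Illustrative rows only; the Q4 figure is published on February 3rd.
        ("ACME", "eps", 1.10, "2023-09-30", "2023-11-02"),
        ("ACME", "eps", 1.25, "2023-12-31", "2024-02-03"),
    ],
)


def as_of(conn, ticker, metric, decision_date):
    """Return the latest value that was already known on decision_date."""
    row = conn.execute(
        "SELECT value FROM fundamentals"
        " WHERE ticker = ? AND metric = ? AND known_at <= ?"
        " ORDER BY known_at DESC LIMIT 1",
        (ticker, metric, decision_date),
    ).fetchone()
    return row[0] if row else None


# ISO-8601 date strings compare correctly as plain text.
before = as_of(conn, "ACME", "eps", "2024-02-02")  # still sees the Q3 figure
after = as_of(conn, "ACME", "eps", "2024-02-03")   # now sees the Q4 figure
```

Filtering on the availability date rather than the fiscal period end is what keeps a backtest from seeing a report before it was published.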

Section D

The data moat

Instruments matter more than strategies. Scale matters more than exploration.

Every new instrument we add is tested against every existing archetype, so the number of configurations grows multiplicatively. Adding 100 crypto perpetuals doesn't just test 100 things: it tests 100 × 173 archetypes = 17,300 new configurations.
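
The arithmetic can be checked with a short sketch that pairs each new instrument with each archetype. The instrument and archetype names are placeholders, and `new_configurations` is a hypothetical helper, not part of the factory code.

```python
from itertools import product


def new_configurations(new_instruments, archetypes):
    """Pair every newly added instrument with every existing archetype."""
    return list(product(new_instruments, archetypes))


# Placeholder identifiers; only the counts (100 and 173) come from the text.
instruments = [f"PERP_{i:03d}" for i in range(100)]
archetypes = [f"archetype_{j:03d}" for j in range(173)]

configs = new_configurations(instruments, archetypes)
print(len(configs))  # 17300
```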

This is why the factory gets more powerful every day. The data moat compounds. Each new data source multiplies the value of every existing strategy template. The marginal cost of testing drops while the marginal value of discovery rises.

Section E

Infrastructure stack

Every component was chosen for a specific reason. No general-purpose frameworks, no cloud dependencies for latency-sensitive operations, no unnecessary abstractions.

DuckDB
Signal bank (93M rows, columnar, fast analytics). Optimized for analytical queries across the full set of signal observations. Sub-second aggregation across the full history.
SQLite
Research database, paper trading state, runner state. Lightweight, zero-config, battle-tested. One file per purpose, no server overhead.
Databento
CME futures tick data pipeline. 39 instruments, GLBX.MDP3 feed, continuous forward updates through May 2026.
QuantConnect
Walk-forward backtesting platform. Every strategy is validated on genuinely out-of-sample data before it enters the portfolio.
Mac Mini
24/7 paper trading engine. Always-on, low power, dedicated hardware. No cloud provider outages, no billing surprises.
launchd
Automated scheduling. Factory pipeline at 2am, data updates, monitoring, and health checks, all running without human intervention.
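
To illustrate the walk-forward validation mentioned in the QuantConnect entry above, here is a minimal sketch of rolling train/test windows in which each test window lies strictly after its training window. It is a generic illustration and does not use the QuantConnect API; the function name and window sizes are assumptions.

```python
def walk_forward_windows(n_obs, train_size, test_size):
    """Yield (train, test) index ranges that roll forward through time.

    Each test range starts where its training range ends, so the model
    is always evaluated on observations it has not seen.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # advance by one test window


# Assumed sizes for illustration: 10 observations, train on 4, test on 2.
windows = list(walk_forward_windows(n_obs=10, train_size=4, test_size=2))
for train, test in windows:
    print(list(train), "->", list(test))
```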
Section F

30,000+ lines of purpose-built code

runner/
Paper trading engine, regime detection, circuit breakers, signal aggregation
research/
Factory pipeline, scanner, strategy validation, walk-forward testing
signal_bank/
DuckDB producers, signal composers, information coefficient tracker
crypto/
Crypto carry executor, funding rate monitor, exchange integration
showcase/
NLP pipeline, earnings transcript scoring, CEO psychology analysis
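
The information coefficient tracked in signal_bank/ is not defined in this document. One common definition is the Spearman rank correlation between a signal and the returns that follow it, and the sketch below assumes that definition. All function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def information_coefficient(signal, forward_returns):
    """Spearman rank correlation between signal values and later returns."""
    return pearson(average_ranks(signal), average_ranks(forward_returns))


# Made-up numbers: the signal orders the returns perfectly.
ic = information_coefficient([0.1, 0.4, 0.2, 0.9], [0.01, 0.03, 0.02, 0.05])
```
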

Every line of code was written to solve a specific problem. No frameworks, no templates, no boilerplate. Purpose-built infrastructure for one thing: finding and deploying uncorrelated edges.

The data infrastructure is not a feature we market. It is the foundation that makes everything else possible. The strategies change, the allocations shift, the markets evolve — but the data layer compounds silently underneath, getting deeper and wider every day. That is the moat.
