Data Layer MVP
Source: Notion | Last edited: 2025-12-26 | ID: 2be2d2dc-3ef...
Build an MVP market data warehouse and ingestion pipeline that AlphaForge can use for backtests, backed by our own DB rather than local files. The MVP must support reproducible experiments via snapshot_id / data_version and include a real-time feed proposal.
Data Sources
Spot:
- Binance Spot
- OKX Spot
- Bybit Spot

Perp / Futures:
- Binance Perp Futures (market data + funding rate)
- OKX Perp (funding rate)

Access method: CCXT Pro and/or native exchange APIs (vendor to recommend the best approach).
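The backfill side of ingestion typically pages through history in fixed-size chunks. A minimal sketch, assuming a ccxt-style `fetch_ohlcv(symbol, timeframe, since, limit)` callable (the function name and candle layout follow CCXT's convention; the pagination logic itself is generic and is shown with an injected fetcher rather than a live exchange client):

```python
from typing import Callable, List


def backfill_ohlcv(
    fetch_ohlcv: Callable[[str, str, int, int], List[list]],
    symbol: str,
    timeframe_ms: int,
    since_ms: int,
    until_ms: int,
    limit: int = 1000,
) -> List[list]:
    """Page through history in ccxt-style chunks of `limit` candles.

    Each candle is [timestamp_ms, open, high, low, close, volume].
    """
    out: List[list] = []
    cursor = since_ms
    while cursor < until_ms:
        batch = fetch_ohlcv(symbol, "1m", cursor, limit)
        if not batch:
            break
        out.extend(c for c in batch if c[0] < until_ms)
        # Advance past the last candle we received to avoid refetching it.
        cursor = batch[-1][0] + timeframe_ms
    return out
```

The same loop works against CCXT Pro or a native REST client; only the injected fetcher changes, which also makes the pagination logic unit-testable offline.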
Instruments / Symbols
Spot symbols (across Spot venues):
- BTC/USDT, ETH/USDT, SOL/USDT, XRP/USDT, SOL/BTC, ETH/BTC

Funding (perp) symbols (Binance + OKX):
- BTC/USDT, ETH/USDT, SOL/USDT, XRP/USDT, BNB/USDT

Index:
- Crypto Fear & Greed Index (Alternative.me). Source: https://alternative.me/crypto/fear-and-greed-index/ (daily is sufficient)
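Alternative.me serves the index over a public JSON API; the payload shape below is an assumption based on its documented response (string `value` and unix `timestamp` entries under a `data` key). A sketch of normalizing that payload into daily UTC rows:

```python
import json
from datetime import datetime, timezone

# Sample of the assumed payload shape from Alternative.me's Fear & Greed API.
SAMPLE = """{
  "data": [
    {"value": "73", "value_classification": "Greed", "timestamp": "1735171200"},
    {"value": "70", "value_classification": "Greed", "timestamp": "1735084800"}
  ]
}"""


def parse_fng(payload: str):
    """Normalize Fear & Greed rows to (utc_date, int_value, label) tuples."""
    rows = []
    for item in json.loads(payload)["data"]:
        ts = datetime.fromtimestamp(int(item["timestamp"]), tz=timezone.utc)
        rows.append((ts.date().isoformat(), int(item["value"]),
                     item["value_classification"]))
    return rows
```

Since daily granularity is sufficient, a once-a-day scheduled fetch feeding this parser would cover the index requirement.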
Required Data Types
- OHLCV bars: at minimum 1m and 1h
- Funding rates (perp): timestamped, normalized
- Instrument metadata: canonical IDs/mapping per venue (precision/tick/lot sizes if available)
- Reproducibility: snapshot_id / data_version so a backtest can pin a dataset version
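One common way to implement snapshot_id is to content-address a dataset manifest, so the same pinned inputs always produce the same identifier. A sketch under that assumption (the manifest fields shown are illustrative, not a fixed contract):

```python
import hashlib
import json


def snapshot_id(manifest: dict) -> str:
    """Derive a stable snapshot_id from a dataset manifest.

    The manifest pins what a backtest saw: venues, symbols, the ingestion
    watermark, and the loader's data_version. Canonical JSON (sorted keys,
    no whitespace) makes the hash independent of dict insertion order.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Any change to the pinned inputs (e.g. a later ingestion watermark) yields a new id, while re-deriving from identical inputs reproduces the old one.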
Storage & Consumption (MVP)
- Preferred DB: **ClickHouse + Apache Iceberg engine** (open to alternatives with justification)
- No Data API required for MVP. Instead provide stable read contracts:
  - canonical DB views: v_bars, v_funding, v_instruments, v_fear_greed
  - optional thin Python SDK for querying (nice-to-have)
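A thin Python SDK could be little more than a parameterized query builder over the canonical views. A hypothetical sketch against v_bars (the column names venue, symbol, timeframe, ts and the optional snapshot_id column are assumptions about the view contract, not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BarsQuery:
    """Builds a parameterized read against the canonical v_bars view."""
    venue: str
    symbol: str
    timeframe: str
    start: str  # ISO-8601, inclusive
    end: str    # ISO-8601, exclusive
    snapshot_id: Optional[str] = None

    def to_sql(self) -> Tuple[str, dict]:
        sql = (
            "SELECT ts, open, high, low, close, volume FROM v_bars "
            "WHERE venue = %(venue)s AND symbol = %(symbol)s "
            "AND timeframe = %(timeframe)s "
            "AND ts >= %(start)s AND ts < %(end)s"
        )
        params = {"venue": self.venue, "symbol": self.symbol,
                  "timeframe": self.timeframe,
                  "start": self.start, "end": self.end}
        if self.snapshot_id is not None:
            # Pin the read to a specific dataset version.
            sql += " AND snapshot_id = %(snapshot_id)s"
            params["snapshot_id"] = self.snapshot_id
        return sql + " ORDER BY ts", params
```

Keeping the SDK to query construction means the views remain the real contract; the SDK can be dropped or swapped without breaking consumers.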
Ingestion Requirements
- Historical backfill for the specified venues/symbols/timeframes (date ranges provided by us)
- Incremental updater (scheduled jobs are fine)
- Idempotent writes / deduplication (re-running ingestion must not create duplicates or corrupt data)
- Data quality checks (missing bars, invalid OHLC, duplicates, etc.)
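The listed quality checks can be expressed as a single pass over time-sorted bars. A minimal sketch covering the three failure classes named above (the row layout `(ts_ms, o, h, l, c, v)` is an assumption for illustration):

```python
def check_bars(bars, timeframe_ms):
    """Scan (ts_ms, o, h, l, c, v) rows sorted by timestamp and report
    missing bars, duplicate timestamps, and invalid OHLC relations."""
    issues = {"gaps": [], "duplicates": [], "invalid_ohlc": []}
    prev_ts = None
    for ts, o, h, l, c, v in bars:
        if prev_ts is not None:
            if ts == prev_ts:
                issues["duplicates"].append(ts)
            elif ts - prev_ts > timeframe_ms:
                # Record the missing range as (first absent ts, last absent ts).
                issues["gaps"].append((prev_ts + timeframe_ms, ts - timeframe_ms))
        # Valid OHLC: low <= open/close <= high, non-negative volume.
        if not (l <= min(o, c) and max(o, c) <= h) or v < 0:
            issues["invalid_ohlc"].append(ts)
        prev_ts = ts
    return issues
```

The same pass could emit rows into a quality-report table, which ties the checks to the "actionable output" acceptance criterion.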
Real-time Feed (Proposal Required)
In addition to backfill + incremental updates, deliver a real-time feed proposal covering:
- WebSocket vs polling vs hybrid per exchange
- Minimal outputs (e.g., real-time bars builder, trades stream, best bid/ask—recommend scope)
- Reconnect + gap-fill strategy, ordering/dedup keys, correctness guarantees
- Operational plan: deployment, monitoring metrics, alerting

Optional stretch goal: minimal real-time implementation for 1–2 exchanges.
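To make the dedup/ordering part of the proposal concrete, an in-memory real-time bar builder might look like the following sketch (the `(venue, trade_id)` dedup key and in-order delivery within a bar bucket are simplifying assumptions; a production version would also evict old dedup keys and close finished bars):

```python
from collections import OrderedDict


class BarBuilder:
    """Aggregates a trade stream into fixed-interval bars, deduplicating by
    (venue, trade_id) so messages replayed after a reconnect are no-ops."""

    def __init__(self, timeframe_ms: int = 60_000):
        self.timeframe_ms = timeframe_ms
        self.seen = set()          # (venue, trade_id) dedup keys
        self.bars = OrderedDict()  # bucket_start_ms -> [o, h, l, c, volume]

    def on_trade(self, venue, trade_id, ts_ms, price, qty):
        key = (venue, trade_id)
        if key in self.seen:
            return  # duplicate delivery, e.g. overlap from gap-fill
        self.seen.add(key)
        bucket = ts_ms - ts_ms % self.timeframe_ms
        bar = self.bars.get(bucket)
        if bar is None:
            self.bars[bucket] = [price, price, price, price, qty]
        else:
            bar[1] = max(bar[1], price)   # high
            bar[2] = min(bar[2], price)   # low
            bar[3] = price                # close (assumes in-order trades)
            bar[4] += qty                 # volume
```

Because replays are idempotent here, the gap-fill strategy can simply re-request an overlapping trade window after a reconnect and feed everything back through `on_trade`.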
Deliverables
- Design Doc (schema + key strategy, ORDER BY / PARTITION BY, snapshot strategy, dedup/idempotency, failure modes)
- DB DDL + init/migrations
- Backfill tool + incremental updater
- Canonical views (and optional Python SDK)
- Data quality checks + report
- AlphaForge integration demo (run at least one backtest from DB)
- Deployment README (AWS) + runbooks
- Real-time feed proposal
Acceptance Criteria (AWS Environment)
We will accept the MVP when:
- The database and ingestion services are deployed and running reliably in our AWS environment
- The database is already populated with the agreed historical dataset (backfill completed)
- Incremental updates run automatically on a defined schedule, with observable job status and failure alerts/logs
- AlphaForge can run ≥ 1 backtest using the provided views/SDK against the AWS-hosted database
- Re-running a backtest pinned to a specific snapshot_id is reproducible (within float tolerance)
- Data quality checks can be executed in AWS and produce actionable output (gaps/duplicates/invalid data)
- The real-time feed proposal is delivered with clear tradeoffs and a recommended implementation plan
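The float-tolerance criterion can be checked mechanically. A sketch, assuming backtest results are reduced to a flat dict of named float metrics (the metric names and tolerances here are illustrative, not part of the spec):

```python
import math


def runs_match(metrics_a: dict, metrics_b: dict, rel_tol: float = 1e-9) -> bool:
    """Compare two backtest metric dicts key-by-key within float tolerance,
    as an acceptance check for re-running a snapshot-pinned backtest."""
    if metrics_a.keys() != metrics_b.keys():
        return False
    return all(
        math.isclose(metrics_a[k], metrics_b[k], rel_tol=rel_tol, abs_tol=1e-12)
        for k in metrics_a
    )
```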