
Data Layer MVP

Source: Notion | Last edited: 2025-12-26 | ID: 2be2d2dc-3ef...


Build an MVP market data warehouse and ingestion pipeline that AlphaForge can use for backtests, backed by our own DB (not local files). The MVP must support reproducible experiments via snapshot_id / data_version and include a real-time feed proposal.


Spot:

  • Binance Spot

  • OKX Spot

  • Bybit Spot

Perp / Futures:

  • Binance Perp Futures (market data + funding rate)

  • OKX Perp (funding rate)

Access method: CCXT Pro and/or native exchange APIs (vendor to recommend the best approach).


Spot symbols (across Spot venues):

  • BTC/USDT, ETH/USDT, SOL/USDT, XRP/USDT, SOL/BTC, ETH/BTC

Funding (perp) symbols (Binance + OKX):

  • BTC/USDT, ETH/USDT, SOL/USDT, XRP/USDT, BNB/USDT

Index:

  • Crypto Fear & Greed Index (Alternative.me), daily is sufficient. Source: https://alternative.me/crypto/fear-and-greed-index/
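Since the Fear & Greed feed is daily, its ingestion can be a simple fetch-and-normalize job. A minimal parsing sketch; the public JSON endpoint (https://api.alternative.me/fng/) and its field names (`value`, `value_classification`, `timestamp`) are assumptions based on the linked site, not confirmed in this doc:

```python
import json
from datetime import datetime, timezone

# Sample payload shaped like the assumed https://api.alternative.me/fng/ response.
SAMPLE = json.loads("""
{"data": [
  {"value": "54", "value_classification": "Neutral", "timestamp": "1735171200"},
  {"value": "61", "value_classification": "Greed",   "timestamp": "1735084800"}
]}
""")

def parse_fng(payload: dict) -> list[dict]:
    """Normalize Fear & Greed rows into daily records ready for ingestion."""
    rows = []
    for item in payload["data"]:
        ts = datetime.fromtimestamp(int(item["timestamp"]), tz=timezone.utc)
        rows.append({
            "date": ts.date().isoformat(),
            "value": int(item["value"]),
            "classification": item["value_classification"],
        })
    return rows

print(parse_fng(SAMPLE)[0])
```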


  • OHLCV bars: at minimum 1m and 1h
  • Funding rates (perp): timestamped, normalized
  • Instrument metadata: canonical IDs/mapping per venue (precision/tick/lot sizes if available)
  • Reproducibility: snapshot_id/data_version so a backtest can pin a dataset version
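The instrument-metadata requirement above hinges on a canonical ID scheme. One illustrative sketch, assuming a `venue:market:BASE-QUOTE` ID format and CCXT-style `BASE/QUOTE` symbols (the scheme itself is a vendor design choice, not fixed by this doc):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instrument:
    """Canonical instrument record; per-venue precision/tick/lot sizes
    would hang off this ID in the real metadata table."""
    venue: str   # e.g. "binance", "okx", "bybit"
    market: str  # "spot" or "perp"
    base: str
    quote: str

    @property
    def canonical_id(self) -> str:
        # Assumed format: "<venue>:<market>:<base>-<quote>"
        return f"{self.venue}:{self.market}:{self.base}-{self.quote}"

def from_ccxt_symbol(venue: str, market: str, symbol: str) -> Instrument:
    """Map a CCXT-style unified symbol like 'BTC/USDT' to a canonical instrument."""
    base, quote = symbol.split("/")
    return Instrument(venue, market, base, quote)

print(from_ccxt_symbol("binance", "spot", "BTC/USDT").canonical_id)
# → binance:spot:BTC-USDT
```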

  • Preferred DB: **ClickHouse + Apache Iceberg engine** (open to alternatives with justification)
  • No Data API required for MVP. Instead provide stable read contracts:
    • canonical DB views: v_bars, v_funding, v_instruments, v_fear_greed
    • optional thin Python SDK for querying (nice-to-have)
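The "thin SDK" can be little more than a query builder over the canonical views. A sketch of one such helper; the view name `v_bars` and the `snapshot_id` column come from this doc, while the remaining column names are assumptions pending the Design Doc, and actual execution would go through a ClickHouse client:

```python
def bars_query(symbol: str, timeframe: str, snapshot_id: str,
               start: str, end: str) -> tuple[str, dict]:
    """Build a parameterized read against the canonical v_bars view.

    Pinning snapshot_id in the WHERE clause is what makes a backtest
    reproducible against a fixed dataset version.
    """
    sql = (
        "SELECT ts, open, high, low, close, volume "
        "FROM v_bars "
        "WHERE symbol = %(symbol)s AND timeframe = %(timeframe)s "
        "AND snapshot_id = %(snapshot_id)s "
        "AND ts >= %(start)s AND ts < %(end)s "
        "ORDER BY ts"
    )
    params = {"symbol": symbol, "timeframe": timeframe,
              "snapshot_id": snapshot_id, "start": start, "end": end}
    return sql, params

sql, params = bars_query("BTC/USDT", "1m", "snap_2025_12_01",
                         "2025-01-01", "2025-02-01")
print(sql)
```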

  • Historical backfill for the specified venues/symbols/timeframes (date ranges provided by us)
  • Incremental updater (scheduled jobs are fine)
  • Idempotent writes / deduplication (re-running ingestion must not create duplicates or corrupt data)
  • Data quality checks (missing bars, invalid OHLC, duplicates, etc.)
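The three quality checks named above (missing bars, invalid OHLC, duplicates) can share one pass over a sorted run of bars. A minimal sketch, assuming epoch-millisecond `ts` open times and a fixed bar step:

```python
def check_bars(bars: list[dict], step_ms: int = 60_000) -> dict:
    """Flag gaps, duplicates, and invalid OHLC in a time-sorted run of bars.

    Each bar is assumed to carry 'ts' (epoch ms) plus open/high/low/close;
    the real schema is defined by the Design Doc deliverable.
    """
    seen, gaps, dupes, invalid = set(), [], [], []
    prev = None
    for b in bars:
        ts = b["ts"]
        if ts in seen:
            dupes.append(ts)
        seen.add(ts)
        if prev is not None and ts - prev > step_ms:
            gaps.append((prev, ts))  # bar(s) missing between prev and ts
        prev = ts
        lo, hi = b["low"], b["high"]
        if not (lo <= b["open"] <= hi and lo <= b["close"] <= hi):
            invalid.append(ts)
    return {"gaps": gaps, "duplicates": dupes, "invalid_ohlc": invalid}

bars = [
    {"ts": 0,       "open": 1.0, "high": 1.2, "low": 0.9, "close": 1.1},
    {"ts": 60_000,  "open": 1.1, "high": 1.1, "low": 1.3, "close": 1.1},  # low > high
    {"ts": 180_000, "open": 1.1, "high": 1.2, "low": 1.0, "close": 1.2},  # one bar missing
]
print(check_bars(bars))
```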

In addition to backfill + incremental updates, deliver a real-time feed proposal covering:

  • WebSocket vs polling vs hybrid per exchange
  • Minimal outputs (e.g., real-time bars builder, trades stream, best bid/ask—recommend scope)
  • Reconnect + gap-fill strategy, ordering/dedup keys, correctness guarantees
  • Operational plan: deployment, monitoring metrics, alerting

Optional stretch goal: minimal real-time implementation for 1–2 exchanges.
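The reconnect + gap-fill item above reduces to a small piece of arithmetic: after a WebSocket drop, every bar open-time between the last persisted bar and the currently-forming bar must be re-fetched over REST, keyed on (venue, symbol, timeframe, open_time) for dedup. A sketch under those assumptions:

```python
def gap_fill_range(last_ts: int, now_ts: int, step_ms: int = 60_000) -> list[int]:
    """After a WebSocket reconnect, list the bar open-times (epoch ms) to
    re-request over REST: everything after the last bar we persisted, up to
    but excluding the currently-forming bar.

    Because re-fetched bars reuse the same (venue, symbol, timeframe,
    open_time) dedup key, replaying this range is idempotent.
    """
    first_missing = last_ts + step_ms
    current_open = (now_ts // step_ms) * step_ms  # the still-open bar; skip it
    return list(range(first_missing, current_open, step_ms))

# Last bar persisted at minute 600; reconnected mid-minute 603 →
# backfill the minute-601 and minute-602 bars.
print(gap_fill_range(last_ts=600 * 60_000, now_ts=603 * 60_000 + 30_000))
```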

  1. Design Doc (schema + key strategy, ORDER BY/PARTITION BY, snapshot strategy, dedup/idempotency, failure modes)
  2. DB DDL + init/migrations
  3. Backfill tool + incremental updater
  4. Canonical views (and optional Python SDK)
  5. Data quality checks + report
  6. AlphaForge integration demo (run at least one backtest from DB)
  7. Deployment README (AWS) + runbooks
  8. Real-time feed proposal

We will accept the MVP when:

  • The database and ingestion services are deployed and running reliably in our AWS environment
  • The database is already populated with the agreed historical dataset (backfill completed)
  • Incremental updates run automatically on a defined schedule, with observable job status and failure alerts/logs
  • AlphaForge can run ≥ 1 backtest using the provided views/SDK against the AWS-hosted database
  • Re-running a backtest pinned to a specific snapshot_id is reproducible (within float tolerance)
  • Data quality checks can be executed in AWS and produce actionable output (gaps/duplicates/invalid data)
  • The real-time feed proposal is delivered with clear tradeoffs and a recommended implementation plan
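The "reproducible within float tolerance" criterion can be made mechanical in the acceptance test. One illustrative check, assuming backtest runs are summarized as flat metric dicts (the metric names here are hypothetical):

```python
import math

def runs_match(metrics_a: dict, metrics_b: dict, rel_tol: float = 1e-9) -> bool:
    """Compare two backtest runs pinned to the same snapshot_id.

    Metric keys must match exactly; numeric values may differ only within
    a relative float tolerance.
    """
    if metrics_a.keys() != metrics_b.keys():
        return False
    return all(math.isclose(metrics_a[k], metrics_b[k], rel_tol=rel_tol)
               for k in metrics_a)

# Hypothetical metric names; tiny float drift between re-runs is accepted.
run1 = {"pnl": 1234.5600000001, "sharpe": 1.42}
run2 = {"pnl": 1234.5600000002, "sharpe": 1.42}
print(runs_match(run1, run2))  # → True
```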