Data Layer Snapshot Strategy
Source: Notion | Last edited: 2025-12-10 | ID: 2c12d2dc-3ef...
AlphaForge Data Layer – Snapshot Strategy with Chained Materialized Views
This design integrates the three-layer architecture (Landing → Staging → Data Mart) with a snapshot tracking system for reproducible backtests and fast latest reads.
Key principle: Data flows through validation layers, then automatically propagates to downstream tables via chained materialized views, eliminating manual dual INSERTs.
1) Architecture Overview
Complete Data Flow
```
Ingestion Workflow
    ↓ (minimal validation)
landing_bars (raw data)
    ↓ [Application-level DQ validation + transformation]
staging_bars (validated, source of truth)
    ↓ [Automatic via chained materialized views]
    ├→ bars (production table)
    ├→ bars_snapshots (audit trail for reproducibility)
    └→ bars_current (fast latest reads for live trading)
```

Data flow: Write raw data to landing_bars, validate and transform in the application layer, then INSERT to staging_bars. All downstream tables update automatically via materialized views.
2) Goals
- Reproducibility: Every experiment/backtest can pin a specific dataset version and remain stable even if we later fill gaps or correct historical points.
- Live-read performance: Live trading / monitoring can read “latest bars” quickly without doing snapshot-folding on every request.
- Operational simplicity:
  - Three-stage ingestion: landing → staging (with validation) → production
  - Single INSERT point to production
  - ClickHouse handles downstream propagation automatically via MVs
  - Avoid per-snapshot databases/tables and OLTP-style row updates
3) Data Model
3.1 staging_bars (validated source of truth)
Purpose: Clean, validated data after DQ checks. Single source of truth for all downstream tables.
Schema:
```sql
CREATE TABLE IF NOT EXISTS analytics.staging_bars (
    id UInt64,
    instrument_id UInt32,
    interval String,
    timestamp DateTime,
    timestamp_unix Int64,
    open Float64,
    high Float64,
    low Float64,
    close Float64,
    volume Float64,
    metadata String,
    -- Lineage columns
    deleted_at DateTime,
    ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);
```

Key columns:
- ingested_at: Timestamp when data was ingested (used as the ReplacingMergeTree version column for deduplication)
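The keep-latest semantics of ReplacingMergeTree(ingested_at) can be illustrated with a pure-Python sketch (a simulation for intuition, not ClickHouse itself): for each ORDER BY key, only the row with the greatest ingested_at survives after merges.

```python
def replacing_merge(rows):
    """Simulate ReplacingMergeTree(ingested_at): for each ORDER BY key
    (instrument_id, interval, timestamp_unix), keep only the row with
    the greatest ingested_at version."""
    latest = {}
    for row in rows:
        key = (row["instrument_id"], row["interval"], row["timestamp_unix"])
        if key not in latest or row["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60,
     "close": 100.0, "ingested_at": "2025-12-01 10:00:00"},
    # A later correction of the same bar wins after merges:
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60,
     "close": 101.5, "ingested_at": "2025-12-02 10:00:00"},
]
merged = replacing_merge(rows)
assert len(merged) == 1 and merged[0]["close"] == 101.5
```

ISO-formatted timestamp strings compare correctly lexicographically, which is why plain `>` works here.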
3.2 bars (production table)
Purpose: Main production table for end users. Queried via v_bars view.
Schema:
```sql
CREATE TABLE IF NOT EXISTS analytics.bars (
    id UInt64,
    instrument_id UInt32,
    interval String,
    timestamp DateTime,
    timestamp_unix Int64,
    open Float64,
    high Float64,
    low Float64,
    close Float64,
    volume Float64,
    metadata String,
    deleted_at DateTime,
    ingested_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);
```

Populated by:
```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars
TO bars
AS SELECT
    id, instrument_id, interval, timestamp, timestamp_unix,
    open, high, low, close, volume, metadata,
    deleted_at, ingested_at
FROM staging_bars;
```

3.3 bars_snapshots (append-only audit trail)
Purpose: Full snapshot log for audit & reproducible “as-of” queries.
Schema:
```sql
CREATE TABLE IF NOT EXISTS analytics.bars_snapshots (
    id UInt64,
    instrument_id UInt32,
    interval String,
    timestamp DateTime,
    timestamp_unix Int64,
    open Float64,
    high Float64,
    low Float64,
    close Float64,
    volume Float64,
    metadata String,
    ingested_at DateTime,
    deleted_at DateTime
)
ENGINE = MergeTree
-- Daily partitioning by ingestion time for snapshot query performance
PARTITION BY toDate(ingested_at)
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);
```

Key differences from bars:
- No ReplacingMergeTree: keeps ALL snapshots
- ORDER BY includes ingested_at for efficient as-of queries

Populated by:
```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_snapshots
TO bars_snapshots
AS SELECT
    id, instrument_id, interval, timestamp, timestamp_unix,
    open, high, low, close, volume, metadata,
    deleted_at, ingested_at
FROM staging_bars;
```

Note: This MV includes ALL rows in the append-only audit trail.
Partitioning strategy:
- Why PARTITION BY toDate(ingested_at): Daily partitions for the large volume of bars data. Snapshot queries filter by ingested_at <= cutoff, so partitioning by ingestion date allows ClickHouse to skip entire daily partitions where all data was ingested after the snapshot cutoff.
- Performance impact: Daily partitions provide finer granularity than monthly, which is important given the high volume of 1-minute bars across multiple exchanges and instruments. This reduces partition scan overhead for snapshot queries.
- Time complexity: With daily ingested_at partitioning, snapshot queries run in O(M) time where M = rows ingested before the cutoff, with better partition pruning than monthly partitions.
- Example: For snapshot 2025-12-08, ClickHouse only reads daily partitions from the start of data through 2025-12-08, automatically skipping any later partitions that contain subsequent corrections.
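The pruning effect can be sketched in a few lines of Python (a hypothetical partition layout, not real ClickHouse metadata): with partitions keyed by toDate(ingested_at), a query filtering on ingested_at <= cutoff never needs to open a partition dated after the cutoff.

```python
from datetime import date

def partitions_to_read(partition_dates, cutoff_date):
    """Return the daily partitions a snapshot query must scan.
    Partitions dated after the cutoff are pruned entirely."""
    return [d for d in partition_dates if d <= cutoff_date]

# Daily partitions 2025-12-01 .. 2025-12-10; snapshot cutoff 2025-12-08.
parts = [date(2025, 12, d) for d in range(1, 11)]
scanned = partitions_to_read(parts, date(2025, 12, 8))
assert scanned == [date(2025, 12, d) for d in range(1, 9)]
```

The two partitions holding post-cutoff corrections (2025-12-09 and 2025-12-10) are skipped without being read.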
3.4 bars_current (fast latest reads)
Purpose: Optimized for fast “latest data” queries (live trading, dashboards).
Schema:
```sql
CREATE TABLE IF NOT EXISTS analytics.bars_current (
    id UInt64,
    instrument_id UInt32,
    interval String,
    timestamp DateTime,
    timestamp_unix Int64,
    open Float64,
    high Float64,
    low Float64,
    close Float64,
    volume Float64,
    metadata String,
    deleted_at DateTime,
    ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);
```

Populated by:
```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_current
TO bars_current
AS SELECT
    id, instrument_id, interval, timestamp, timestamp_unix,
    open, high, low, close, volume, metadata,
    ingested_at, deleted_at
FROM bars;
```

Note: This MV chains from the bars table (not staging_bars), creating a clean separation.
4) Ingestion Flow
4.1 Snapshot Publishing
Publishing a new snapshot before starting the ingestion flow:
```sql
CREATE TABLE IF NOT EXISTS analytics.snapshots (
    snapshot_id String,
    cutoff_ingested_at DateTime,
    created_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY snapshot_id;
```
```sql
INSERT INTO analytics.snapshots VALUES (
    'snapshot_2025_12_08',    -- identifier / UUID
    '2025-12-08 12:00:00',    -- cutoff_ingested_at
    now()
);
```

Snapshot is immutable: snapshot_id → cutoff_ingested_at
This gives us a clean “release” boundary for datasets used in experiments.
4.2 Two-Stage INSERT Flow
For each ingestion run:
Step 1: Write raw data to landing layer
```sql
INSERT INTO landing_bars (
    id, instrument_id, interval, timestamp, timestamp_unix,
    open, high, low, close, volume, metadata
) VALUES (
    ...
);
```

Step 2: Application validates and transforms
- Read from landing_bars
- Perform DQ checks (invalid OHLC, missing data, duplicates, etc.)
- Transform and enrich data

Step 3: Write validated data to staging
```sql
INSERT INTO staging_bars (
    id, instrument_id, interval, timestamp, timestamp_unix,
    open, high, low, close, volume, metadata
) VALUES (
    ...
);
```

Step 4: ClickHouse automatically populates downstream tables:
- bars (via mv_bars)
- bars_snapshots (via mv_bars_snapshots)
- bars_current (via mv_bars_current)
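A minimal sketch of the application-level DQ step (Step 2), assuming bars arrive as dicts with OHLCV fields; the real checks in the ingestion service may be richer (gap detection, schema validation, etc.):

```python
def validate_bar(bar):
    """Return a list of DQ violations for one bar (empty list = valid)."""
    errors = []
    if bar["high"] < max(bar["open"], bar["close"]):
        errors.append("high below open/close")
    if bar["low"] > min(bar["open"], bar["close"]):
        errors.append("low above open/close")
    if bar["volume"] < 0:
        errors.append("negative volume")
    return errors

def dq_filter(bars):
    """Drop duplicates on the natural key and bars that fail validation."""
    seen, clean = set(), []
    for bar in bars:
        key = (bar["instrument_id"], bar["interval"], bar["timestamp_unix"])
        if key in seen:
            continue  # duplicate of an already-accepted bar
        seen.add(key)
        if not validate_bar(bar):
            clean.append(bar)
    return clean

raw = [
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60,
     "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5, "volume": 100.0},
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60,   # duplicate
     "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5, "volume": 100.0},
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 120,  # invalid OHLC
     "open": 10.0, "high": 9.0, "low": 9.5, "close": 10.5, "volume": 100.0},
]
clean = dq_filter(raw)
assert len(clean) == 1 and clean[0]["timestamp_unix"] == 60
```

Only bars that pass this filter are written to staging_bars; everything else stays visible in landing_bars for debugging.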
5) Querying Patterns
5.1 Latest Reads (Live Trading)
Use bars_current for fast latest reads:
```sql
SELECT *
FROM bars_current FINAL
WHERE instrument_id = 1234
  AND interval = '1m'
  AND timestamp >= now() - INTERVAL 7 DAY
  AND deleted_at IS NULL
ORDER BY timestamp;
```

Important: ReplacingMergeTree deduplication is eventual (after merges). For strong consistency in small windows:
- Use argMax(..., ingested_at) for narrow ranges
- Use FINAL only for narrow queries (avoid for large scans)
5.2 Reproducible “As-Of” Snapshot Reads (Backtesting)
Query bars_snapshots with snapshot folding:
```sql
-- Resolve snapshot to cutoff timestamp
WITH (
    SELECT cutoff_ingested_at
    FROM snapshots
    WHERE snapshot_id = 'snapshot_2025_12_08'
) AS cutoff
SELECT
    instrument_id,
    interval,
    timestamp,
    argMax(open, ingested_at)       AS open,
    argMax(high, ingested_at)       AS high,
    argMax(low, ingested_at)        AS low,
    argMax(close, ingested_at)      AS close,
    argMax(volume, ingested_at)     AS volume,
    argMax(deleted_at, ingested_at) AS deleted_at
FROM bars_snapshots
WHERE ingested_at <= cutoff
  AND timestamp >= '2025-01-01'
  AND timestamp < '2026-01-01'
  AND interval = '1m'
GROUP BY instrument_id, interval, timestamp
-- Filter on the folded deleted_at (latest version as of the cutoff),
-- not on individual pre-fold rows
HAVING deleted_at IS NULL
ORDER BY instrument_id, interval, timestamp;
```

This returns the dataset exactly as it existed at the snapshot cutoff, even if we later patched the same period.
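The folding logic can be sketched in pure Python (an illustration of the semantics, not the actual query path): drop rows ingested after the cutoff, then keep the latest surviving version per bar, which is what the argMax aggregation does. This shows why later patches never leak into a pinned snapshot.

```python
def as_of(rows, cutoff):
    """Fold a bars_snapshots-style row set as of `cutoff`: ignore rows
    ingested after the cutoff, keep the latest remaining version per bar,
    and drop soft-deleted bars."""
    latest = {}
    for row in rows:
        if row["ingested_at"] > cutoff:
            continue  # ingested after the snapshot boundary: invisible
        key = (row["instrument_id"], row["interval"], row["timestamp_unix"])
        if key not in latest or row["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = row
    return [r for r in latest.values() if r["deleted_at"] is None]

rows = [
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60, "close": 100.0,
     "ingested_at": "2025-12-01 09:00:00", "deleted_at": None},
    # Correction ingested AFTER the snapshot cutoff:
    {"instrument_id": 1, "interval": "1m", "timestamp_unix": 60, "close": 99.0,
     "ingested_at": "2025-12-09 09:00:00", "deleted_at": None},
]
snapshot = as_of(rows, "2025-12-08 12:00:00")
assert snapshot[0]["close"] == 100.0  # the pre-cutoff value, not the patch
```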
6) Why This Design?
Why Single INSERT with Materialized Views?
ClickHouse lacks transaction safety:
- ClickHouse does not support ACID transactions across multiple tables
- No BEGIN/COMMIT/ROLLBACK for coordinating multi-table writes
- Each INSERT is atomic for that table only, but multiple INSERTs can fail independently

Problems with manual dual/triple INSERT:
```sql
-- ❌ UNSAFE: No transaction coordination
INSERT INTO bars (...) VALUES (...);            -- Might succeed
INSERT INTO bars_snapshots (...) VALUES (...);  -- Might fail
INSERT INTO bars_current (...) VALUES (...);    -- Might fail
```

If any INSERT fails:
- ❌ Data inconsistency: Some tables updated, others not
- ❌ Orphaned records: bars has data but bars_snapshots doesn’t
- ❌ Manual cleanup required: Need application-level compensation logic
- ❌ Retry complexity: Must track which INSERTs succeeded

Materialized views solve this:
```sql
-- ✅ SAFE: Single INSERT + automatic propagation
INSERT INTO staging_bars (...) VALUES (...);
-- ClickHouse automatically triggers all 3 MVs atomically
```

Why this is better:
- Single write path: If the staging_bars INSERT succeeds, ClickHouse guarantees MV propagation
- ClickHouse manages consistency: The database handles MV triggers internally
- Simplified error handling: Only need to retry the single INSERT to staging_bars
- No orphaned data: MVs are triggered as part of the same internal operation
- Reduced code complexity: The application layer doesn’t coordinate multi-table writes

Important caveat:

- MV propagation is eventually consistent (not immediate)
- For critical reads requiring strong consistency, query staging_bars directly
- For normal production queries, the slight delay is acceptable
Why ingested_at Instead of snapshot_id?
Advantages of using ingested_at timestamp:
- ✅ Natural ordering: timestamps are self-explanatory
- ✅ No need to manage monotonic ID generation
- ✅ Human-readable: easier to debug and audit
- ✅ Works directly as the ReplacingMergeTree version column

Clock skew consideration:
- If ingestion runs on multiple servers with slightly different clocks, you could have ordering issues
- Mitigation: Ensure NTP sync across ingestion servers, or use a centralized timestamp service
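A hypothetical illustration of the hazard (invented server names and a 30-second skew, chosen for the example): server B applies a correction 10 seconds after server A's original write, but B's clock runs behind, so version resolution by ingested_at picks the stale row.

```python
from datetime import datetime, timedelta

# Real wall-clock order of the two writes:
true_order = [
    ("server_a", datetime(2025, 12, 8, 12, 0, 0), 100.0),   # original write
    ("server_b", datetime(2025, 12, 8, 12, 0, 10), 101.0),  # correction, 10 s later
]
# Hypothetical skew: server_b's clock runs 30 s behind.
skew = {"server_a": timedelta(0), "server_b": timedelta(seconds=-30)}

# What each server actually stamps as ingested_at:
stamped = [(host, t + skew[host], close) for host, t, close in true_order]

# argMax(close, ingested_at) keeps the row with the greatest stamp:
winner = max(stamped, key=lambda r: r[1])
assert winner[2] == 100.0  # the stale value wins; NTP sync prevents this
```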
Advantages
✅ Clear ingestion flow: Landing → validation → staging → automatic downstream propagation
✅ Automatic propagation: ClickHouse handles downstream updates
✅ Clear validation boundary: DQ checks before staging
✅ Reproducibility: Complete snapshot history with snapshot pinning
✅ Fast latest reads: Optimized bars_current table
✅ Operational simplicity: No manual multi-table coordination
✅ Consistency without transactions: ClickHouse MVs provide better consistency than manual INSERTs
✅ Simple versioning: Natural timestamps instead of managing monotonic IDs
Comparison to Alternatives
vs. Dual/Triple INSERT:
- Eliminates manual coordination across tables
- Reduces code complexity (1 INSERT vs 3)
- ClickHouse guarantees consistency via MVs
- Simpler error handling and retry logic
- No risk of partial writes causing data inconsistency

vs. Table cloning per snapshot:

- No cold-start (no “restore DB from S3” workflow)
- No proliferation of per-snapshot tables
- Efficient small patches (append-only updates)