
Data Layer Snapshot Strategy

Source: Notion | Last edited: 2025-12-10 | ID: 2c12d2dc-3ef...


AlphaForge Data Layer – Snapshot Strategy with Chained Materialized Views


This design integrates the three-layer architecture (Landing → Staging → Data Mart) with a snapshot tracking system for reproducible backtests and fast latest reads.

Key principle: Data flows through validation layers, then automatically propagates to downstream tables via chained materialized views, eliminating manual dual INSERTs.


Ingestion Workflow

Raw source data
↓ [Minimal validation]
landing_bars (raw data)
↓ [Application-level DQ validation + transformation]
staging_bars (validated, source of truth)
↓ [Automatic via chained materialized views]
├→ bars (production table)
├→ bars_snapshots (audit trail for reproducibility)
└→ bars_current (fast latest reads for live trading)

Data flow: Write raw data to landing_bars, validate and transform in application layer, then INSERT to staging_bars. All downstream tables update automatically via materialized views.
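This flow can be sketched in application code. Here `client` is a hypothetical stand-in for a real ClickHouse driver (e.g. clickhouse-connect), and the DQ check is deliberately simplified to OHLC ordering only:

```python
from datetime import datetime, timezone

def ingest(client, raw_rows):
    """Sketch of one ingestion run: raw rows land first, validated rows
    get the single INSERT into staging_bars, and the MVs fan out from there."""
    client.insert("landing_bars", raw_rows)  # Step 1: raw data, minimal validation
    validated = [
        r for r in raw_rows
        if r["low"] <= min(r["open"], r["close"]) <= max(r["open"], r["close"]) <= r["high"]
    ]  # Step 2: simplified DQ check (OHLC ordering only)
    version = datetime.now(timezone.utc)
    for r in validated:
        r["ingested_at"] = version  # version column consumed by ReplacingMergeTree
    client.insert("staging_bars", validated)  # Step 3: single INSERT; no dual writes
    return len(validated)
```

The application never writes to bars, bars_snapshots, or bars_current directly; those are populated by the chained materialized views described below.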


  1. Reproducibility: Every experiment/backtest can pin a specific dataset version and remain stable even if we later fill gaps or correct historical points.

  2. Live-read performance: Live trading / monitoring can read “latest bars” quickly without doing snapshot-folding on every request.

  3. Operational simplicity

  • Three-stage ingestion: landing → staging (with validation) → production
  • Single INSERT point to production
  • ClickHouse handles downstream propagation automatically via MVs
  • Avoid per-snapshot databases/tables and OLTP-style row updates

3.1 staging_bars (validated source of truth)


Purpose: Clean, validated data after DQ checks. Single source of truth for all downstream tables.

Schema:

CREATE TABLE IF NOT EXISTS analytics.staging_bars (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
-- Lineage column
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);

Key columns:

  • ingested_at: Timestamp when data was ingested (used as version for deduplication)
  • deleted_at: Soft-delete marker (NULL for live rows); downstream queries filter on deleted_at IS NULL

3.2 bars (production table)

Purpose: Main production table for end users. Queried via v_bars view.

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);

Populated by:

CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars
TO bars
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
deleted_at,
ingested_at
FROM staging_bars;

3.3 bars_snapshots (append-only audit trail)


Purpose: Full snapshot log for audit & reproducible “as-of” queries.

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars_snapshots (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
ingested_at DateTime,
deleted_at Nullable(DateTime)
)
ENGINE = MergeTree
PARTITION BY toDate(ingested_at) -- Daily partitioning by ingestion time for snapshot query performance
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);

Key differences from bars:

  • No ReplacingMergeTree — keeps ALL snapshots
  • ORDER BY includes ingested_at for efficient as-of queries

Populated by:
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_snapshots
TO bars_snapshots
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
deleted_at,
ingested_at
FROM staging_bars;

Note: This MV includes ALL rows in the append-only audit trail.

Partitioning strategy:

  • Why PARTITION BY toDate(ingested_at): Daily partitions for the large volume of bars data. Snapshot queries filter by ingested_at <= cutoff, so partitioning by ingestion date allows ClickHouse to skip entire daily partitions where all data was ingested after the snapshot cutoff
  • Performance impact: Daily partitions provide finer granularity than monthly, which is important given the high volume of 1-minute bars across multiple exchanges and instruments. This reduces partition scan overhead for snapshot queries
  • Time complexity: With daily ingested_at partitioning, snapshot queries run in O(M) time where M = rows ingested before cutoff, with better partition pruning than monthly partitions
  • Example: For snapshot 2025-12-08, ClickHouse only reads daily partitions from the start of data through 2025-12-08, automatically skipping any future partitions that contain later corrections
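The pruning arithmetic can be illustrated in plain Python (a model of which partitions survive the cutoff filter, not of ClickHouse internals):

```python
from datetime import date

def partitions_scanned(partition_days, cutoff_day):
    """Daily partitions a snapshot query must read: partitions whose
    ingestion date is after the cutoff are pruned without being opened."""
    return sorted(d for d in partition_days if d <= cutoff_day)

# Corrections ingested on 2025-12-09 never reach a 2025-12-08 snapshot read.
days = [date(2025, 12, 6), date(2025, 12, 7), date(2025, 12, 8), date(2025, 12, 9)]
```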

3.4 bars_current (fast latest reads)

Purpose: Optimized for fast “latest data” queries (live trading, dashboards).

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars_current (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);

Populated by:

CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_current
TO bars_current
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
ingested_at,
deleted_at
FROM bars;

Note: This MV chains from bars table (not staging_bars), creating a clean separation.


Publishing a new snapshot before starting the ingestion flow:

CREATE TABLE IF NOT EXISTS analytics.snapshots (
snapshot_id String,
cutoff_ingested_at DateTime,
created_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY snapshot_id;
INSERT INTO analytics.snapshots VALUES (
'snapshot_2025_12_08', -- Identifier / UUID
'2025-12-08 12:00:00', -- cutoff_ingested_at
now()
);

Snapshot is immutable: snapshot_id → cutoff_ingested_at

This gives us a clean “release” boundary for datasets used in experiments.

For each ingestion run:

Step 1: Write raw data to landing layer

INSERT INTO landing_bars (
id, instrument_id, interval, timestamp, timestamp_unix,
open, high, low, close, volume, metadata
) VALUES (
...
);

Step 2: Application validates and transforms

  • Read from landing_bars
  • Perform DQ checks (invalid OHLC, missing data, duplicates, etc.)
  • Transform and enrich data

Step 3: Write validated data to staging

INSERT INTO staging_bars (
id, instrument_id, interval, timestamp, timestamp_unix,
open, high, low, close, volume, metadata
) VALUES (
...
);
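The DQ checks in Step 2 might look like the following sketch (illustrative rules only; the field names mirror the staging_bars schema):

```python
def dq_errors(row):
    """Return data-quality violations for one bar; an empty list means clean."""
    errors = []
    if not (row["low"] <= min(row["open"], row["close"])
            and max(row["open"], row["close"]) <= row["high"]):
        errors.append("invalid OHLC ordering")
    if row["volume"] < 0:
        errors.append("negative volume")
    return errors

def dedupe(rows):
    """Drop duplicate (instrument_id, interval, timestamp_unix) keys, keeping the first seen."""
    seen, out = set(), []
    for r in rows:
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```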

Step 4: ClickHouse automatically populates downstream tables:

  • bars (via mv_bars)
  • bars_snapshots (via mv_bars_snapshots)
  • bars_current (via mv_bars_current)

Use bars_current for fast latest reads:

SELECT *
FROM bars_current FINAL
WHERE instrument_id = 1234
AND interval = '1m'
AND timestamp >= now() - INTERVAL 7 DAY
AND deleted_at IS NULL
ORDER BY timestamp;

Important: ReplacingMergeTree deduplication is eventual (after merges). For strong consistency in small windows:

  • Use argMax(..., ingested_at) for narrow ranges
  • Use FINAL only for narrow queries (avoid for large scans)
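What the argMax folding computes can be shown with plain Python over in-memory rows (a semantic model, not how ClickHouse executes it):

```python
def latest_by_key(rows):
    """Keep, per (instrument_id, interval, timestamp_unix) key, the row with
    the greatest ingested_at: the state ReplacingMergeTree converges to."""
    latest = {}
    for r in rows:
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in latest or r["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = r
    return list(latest.values())
```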

5.2 Reproducible “As-Of” Snapshot Reads (Backtesting)


Query bars_snapshots with snapshot folding:

-- Resolve snapshot to cutoff timestamp
WITH (SELECT cutoff_ingested_at FROM snapshots WHERE snapshot_id = 'snapshot_2025_12_08') AS cutoff
SELECT
instrument_id,
interval,
timestamp,
argMax(open, ingested_at) AS open,
argMax(high, ingested_at) AS high,
argMax(low, ingested_at) AS low,
argMax(close, ingested_at) AS close,
argMax(volume, ingested_at) AS volume,
argMax(deleted_at, ingested_at) AS deleted_at
FROM bars_snapshots
WHERE ingested_at <= cutoff
AND timestamp >= '2025-01-01'
AND timestamp < '2026-01-01'
AND interval = '1m'
GROUP BY instrument_id, interval, timestamp
HAVING deleted_at IS NULL
ORDER BY instrument_id, interval, timestamp;

This returns the dataset exactly as it existed at the snapshot cutoff, even if later we patched the same period.
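The same folding with a cutoff filter reproduces the as-of semantics. Assuming in-memory rows shaped like bars_snapshots, a correction ingested after the cutoff stays invisible to the pinned snapshot:

```python
def as_of(rows, cutoff):
    """Snapshot read: ignore versions ingested after the cutoff, fold to the
    latest surviving version per key, then drop soft-deleted rows."""
    latest = {}
    for r in rows:
        if r["ingested_at"] > cutoff:
            continue  # pruned by the ingested_at <= cutoff filter
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in latest or r["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = r
    return [r for r in latest.values() if r.get("deleted_at") is None]
```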


Why Single INSERT with Materialized Views?


ClickHouse lacks transaction safety:

  • ClickHouse does not support ACID transactions across multiple tables
  • No BEGIN/COMMIT/ROLLBACK for coordinating multi-table writes
  • Each INSERT is atomic for that table only, but multiple INSERTs can fail independently

Problems with manual dual/triple INSERT:

-- ❌ UNSAFE: No transaction coordination
INSERT INTO bars (...) VALUES (...);           -- Might succeed
INSERT INTO bars_snapshots (...) VALUES (...); -- Might fail
INSERT INTO bars_current (...) VALUES (...);   -- Might fail

If any INSERT fails:

  • ❌ Data inconsistency: Some tables updated, others not
  • ❌ Orphaned records: bars has data but bars_snapshots doesn’t
  • ❌ Manual cleanup required: Need application-level compensation logic
  • ❌ Retry complexity: Must track which INSERTs succeeded

Materialized views solve this:

-- ✅ SAFE: Single INSERT + automatic propagation
INSERT INTO staging_bars (...) VALUES (...);
-- ClickHouse triggers all three MVs as part of the same insert

Why this is better:

  1. Single point of failure: If staging_bars INSERT succeeds, ClickHouse guarantees MV propagation
  2. ClickHouse manages consistency: The database handles MV triggers internally
  3. Simplified error handling: Only need to retry the single INSERT to staging_bars
  4. No orphaned data: MVs are triggered as part of the same internal operation
  5. Reduced code complexity: Application layer doesn’t coordinate multi-table writes

Important caveat:
  • MV propagation is eventually consistent (not immediate)
  • For critical reads requiring strong consistency, query staging_bars directly
  • For normal production queries, the slight delay is acceptable
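The simplified error handling (point 3 above) can be sketched as follows; `client.insert` is a hypothetical driver call, and note that staging_bars deduplicates replayed batches with identical ingested_at values via ReplacingMergeTree, while the append-only bars_snapshots will still record the replay:

```python
import time

def insert_with_retry(client, rows, attempts=3, backoff=0.5):
    """Retry the single INSERT into staging_bars with exponential backoff.
    Only one write needs retrying; the MVs fan out from it."""
    for attempt in range(attempts):
        try:
            client.insert("staging_bars", rows)
            return True
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted retries; surface the failure
            time.sleep(backoff * 2 ** attempt)
```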

Advantages of using ingested_at timestamp:

  • ✅ Natural ordering - timestamps are self-explanatory

  • ✅ No need to manage monotonic ID generation

  • ✅ Human-readable - easier to debug and audit

  • ✅ Works perfectly with ClickHouse’s ReplacingMergeTree

Clock skew consideration:

  • If ingestion runs on multiple servers with slightly different clocks, you could have ordering issues

  • Mitigation: Ensure NTP sync across ingestion servers, or use a centralized timestamp service
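On a single ingestion node, one cheap guard (an illustrative helper, not a substitute for NTP sync across servers) is to never issue an ingested_at that goes backwards:

```python
from datetime import datetime, timedelta, timezone

class MonotonicVersioner:
    """Issue ingested_at versions that never decrease, even if the local
    clock steps backwards between calls."""
    def __init__(self):
        self._last = datetime.min.replace(tzinfo=timezone.utc)

    def next_version(self, now=None):
        now = now or datetime.now(timezone.utc)
        if now <= self._last:
            now = self._last + timedelta(microseconds=1)  # clamp regressions forward
        self._last = now
        return now
```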


Clear ingestion flow: Landing → validation → staging → automatic downstream propagation

Automatic propagation: ClickHouse handles downstream updates

Clear validation boundary: DQ checks before staging

Reproducibility: Complete snapshot history with snapshot pinning

Fast latest reads: Optimized bars_current table

Operational simplicity: No manual multi-table coordination

Consistency without transactions: ClickHouse MVs provide better consistency than manual INSERTs

Simple versioning: Natural timestamps instead of managing monotonic IDs

vs. Dual/Triple INSERT:

  • Eliminates manual coordination across tables

  • Reduces code complexity (1 INSERT vs 3)

  • ClickHouse guarantees consistency via MVs

  • Simpler error handling and retry logic

  • No risk of partial writes causing data inconsistency

vs. Table cloning per snapshot:

  • No cold-start (no “restore DB from S3” workflow)

  • No proliferation of per-snapshot tables

  • Efficient small patches (append-only updates)