
Data Layer Snapshot Strategy

Source: Notion | Last edited: 2025-12-10 | ID: 2c12d2dc-3ef...


AlphaForge Data Layer – Snapshot Strategy with Chained Materialized Views


This design integrates the three-layer architecture (Landing → Staging → Data Mart) with a snapshot tracking system for reproducible backtests and fast latest reads.

Key principle: Data flows through validation layers, then automatically propagates to downstream tables via chained materialized views, eliminating manual dual INSERTs.


Ingestion Workflow

Raw source data
↓ [Minimal validation]
landing_bars (raw data)
↓ [Application-level DQ validation + transformation]
staging_bars (validated, source of truth)
↓ [Automatic via chained materialized views]
├→ bars (production table)
├→ bars_snapshots (audit trail for reproducibility)
└→ bars_current (fast latest reads for live trading)

Data flow: Write raw data to landing_bars, validate and transform in application layer, then INSERT to staging_bars. All downstream tables update automatically via materialized views.
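This flow can be sketched in application code. Here `client` is a hypothetical stand-in for a real ClickHouse driver (e.g. clickhouse-connect), and the DQ check is deliberately simplified to OHLC ordering only:

```python
from datetime import datetime, timezone

def ingest(client, raw_rows):
    """Sketch of one ingestion run: raw rows land first, validated rows
    get the single INSERT into staging_bars, and the MVs fan out from there."""
    client.insert("landing_bars", raw_rows)  # Step 1: raw data, minimal validation
    validated = [
        r for r in raw_rows
        if r["low"] <= min(r["open"], r["close"]) <= max(r["open"], r["close"]) <= r["high"]
    ]  # Step 2: simplified DQ check (OHLC ordering only)
    version = datetime.now(timezone.utc)
    for r in validated:
        r["ingested_at"] = version  # version column consumed by ReplacingMergeTree
    client.insert("staging_bars", validated)  # Step 3: single INSERT; no dual writes
    return len(validated)
```

The application never writes to bars, bars_snapshots, or bars_current directly; those are populated by the chained materialized views described below.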


  1. Reproducibility: Every experiment/backtest can pin a specific dataset version and remain stable even if we later fill gaps or correct historical points.

  2. Live-read performance: Live trading / monitoring can read “latest bars” quickly without doing snapshot-folding on every request.

  3. Operational simplicity

  • Three-stage ingestion: landing → staging (with validation) → production
  • Single INSERT point to production
  • ClickHouse handles downstream propagation automatically via MVs
  • Avoid per-snapshot databases/tables and OLTP-style row updates

3.1 staging_bars (validated source of truth)


Purpose: Clean, validated data after DQ checks. Single source of truth for all downstream tables.

Schema:

CREATE TABLE IF NOT EXISTS analytics.staging_bars (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
-- Lineage column
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);

Key columns:

  • ingested_at: Timestamp when data was ingested (used as version for deduplication)
  • deleted_at: Soft-delete marker (NULL for live rows); downstream queries filter on deleted_at IS NULL

3.2 bars (production table)

Purpose: Main production table for end users. Queried via v_bars view.

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);

Populated by:

CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars
TO bars
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
deleted_at,
ingested_at
FROM staging_bars;

3.3 bars_snapshots (append-only audit trail)


Purpose: Full snapshot log for audit & reproducible “as-of” queries.

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars_snapshots (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
ingested_at DateTime,
deleted_at Nullable(DateTime)
)
ENGINE = MergeTree
PARTITION BY toDate(ingested_at) -- Daily partitioning by ingestion time for snapshot query performance
ORDER BY (instrument_id, interval, timestamp_unix, ingested_at);

Key differences from bars:

  • No ReplacingMergeTree — keeps ALL snapshots
  • ORDER BY includes ingested_at for efficient as-of queries

Populated by:
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_snapshots
TO bars_snapshots
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
deleted_at,
ingested_at
FROM staging_bars;

Note: This MV includes ALL rows in the append-only audit trail.

Partitioning strategy:

  • Why PARTITION BY toDate(ingested_at): Daily partitions for the large volume of bars data. Snapshot queries filter by ingested_at <= cutoff, so partitioning by ingestion date allows ClickHouse to skip entire daily partitions where all data was ingested after the snapshot cutoff
  • Performance impact: Daily partitions provide finer granularity than monthly, which is important given the high volume of 1-minute bars across multiple exchanges and instruments. This reduces partition scan overhead for snapshot queries
  • Time complexity: With daily ingested_at partitioning, snapshot queries run in O(M) time where M = rows ingested before cutoff, with better partition pruning than monthly partitions
  • Example: For snapshot 2025-12-08, ClickHouse only reads daily partitions from the start of data through 2025-12-08, automatically skipping any future partitions that contain later corrections
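The pruning arithmetic can be illustrated in plain Python (a model of which partitions survive the cutoff filter, not of ClickHouse internals):

```python
from datetime import date

def partitions_scanned(partition_days, cutoff_day):
    """Daily partitions a snapshot query must read: partitions whose
    ingestion date is after the cutoff are pruned without being opened."""
    return sorted(d for d in partition_days if d <= cutoff_day)

# Corrections ingested on 2025-12-09 never reach a 2025-12-08 snapshot read.
days = [date(2025, 12, 6), date(2025, 12, 7), date(2025, 12, 8), date(2025, 12, 9)]
```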

3.4 bars_current (fast latest reads)

Purpose: Optimized for fast “latest data” queries (live trading, dashboards).

Schema:

CREATE TABLE IF NOT EXISTS analytics.bars_current (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
deleted_at Nullable(DateTime),
ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);

Populated by:

CREATE MATERIALIZED VIEW IF NOT EXISTS mv_bars_current
TO bars_current
AS SELECT
id,
instrument_id,
interval,
timestamp,
timestamp_unix,
open,
high,
low,
close,
volume,
metadata,
ingested_at,
deleted_at
FROM bars;

Note: This MV chains from bars table (not staging_bars), creating a clean separation.


Publishing a new snapshot before starting the ingestion flow:

CREATE TABLE IF NOT EXISTS analytics.snapshots (
snapshot_id String,
cutoff_ingested_at DateTime,
created_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY snapshot_id;
INSERT INTO analytics.snapshots VALUES (
'snapshot_2025_12_08', -- Identifier / UUID
'2025-12-08 12:00:00', -- cutoff_ingested_at
now()
);

Snapshot is immutable: snapshot_id → cutoff_ingested_at

This gives us a clean “release” boundary for datasets used in experiments.

For each ingestion run:

Step 1: Write raw data to landing layer

INSERT INTO landing_bars (
id, instrument_id, interval, timestamp, timestamp_unix,
open, high, low, close, volume, metadata
) VALUES (
...
);

Step 2: Application validates and transforms

  • Read from landing_bars
  • Perform DQ checks (invalid OHLC, missing data, duplicates, etc.)
  • Transform and enrich data

Step 3: Write validated data to staging

INSERT INTO staging_bars (
id, instrument_id, interval, timestamp, timestamp_unix,
open, high, low, close, volume, metadata
) VALUES (
...
);
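The DQ checks in Step 2 might look like the following sketch (illustrative rules only; the field names mirror the staging_bars schema):

```python
def dq_errors(row):
    """Return data-quality violations for one bar; an empty list means clean."""
    errors = []
    if not (row["low"] <= min(row["open"], row["close"])
            and max(row["open"], row["close"]) <= row["high"]):
        errors.append("invalid OHLC ordering")
    if row["volume"] < 0:
        errors.append("negative volume")
    return errors

def dedupe(rows):
    """Drop duplicate (instrument_id, interval, timestamp_unix) keys, keeping the first seen."""
    seen, out = set(), []
    for r in rows:
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```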

Step 4: ClickHouse automatically populates downstream tables:

  • bars (via mv_bars)
  • bars_snapshots (via mv_bars_snapshots)
  • bars_current (via mv_bars_current)

Use bars_current for fast latest reads:

SELECT *
FROM bars_current FINAL
WHERE instrument_id = 1234
AND interval = '1m'
AND timestamp >= now() - INTERVAL 7 DAY
AND deleted_at IS NULL
ORDER BY timestamp;

Important: ReplacingMergeTree deduplication is eventual (after merges). For strong consistency in small windows:

  • Use argMax(..., ingested_at) for narrow ranges
  • Use FINAL only for narrow queries (avoid for large scans)
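What the argMax folding computes can be shown with plain Python over in-memory rows (a semantic model, not how ClickHouse executes it):

```python
def latest_by_key(rows):
    """Keep, per (instrument_id, interval, timestamp_unix) key, the row with
    the greatest ingested_at: the state ReplacingMergeTree converges to."""
    latest = {}
    for r in rows:
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in latest or r["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = r
    return list(latest.values())
```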

5.2 Reproducible “As-Of” Snapshot Reads (Backtesting)


Query bars_snapshots with snapshot folding:

-- Resolve snapshot to cutoff timestamp
WITH (SELECT cutoff_ingested_at FROM snapshots WHERE snapshot_id = 'snapshot_2025_12_08') AS cutoff
SELECT
instrument_id,
interval,
timestamp,
argMax(open, ingested_at) AS open,
argMax(high, ingested_at) AS high,
argMax(low, ingested_at) AS low,
argMax(close, ingested_at) AS close,
argMax(volume, ingested_at) AS volume,
argMax(deleted_at, ingested_at) AS deleted_at
FROM bars_snapshots
WHERE ingested_at <= cutoff
AND timestamp >= '2025-01-01'
AND timestamp < '2026-01-01'
AND interval = '1m'
GROUP BY instrument_id, interval, timestamp
HAVING deleted_at IS NULL
ORDER BY instrument_id, interval, timestamp;

This returns the dataset exactly as it existed at the snapshot cutoff, even if later we patched the same period.
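The same folding with a cutoff filter reproduces the as-of semantics. Assuming in-memory rows shaped like bars_snapshots, a correction ingested after the cutoff stays invisible to the pinned snapshot:

```python
def as_of(rows, cutoff):
    """Snapshot read: ignore versions ingested after the cutoff, fold to the
    latest surviving version per key, then drop soft-deleted rows."""
    latest = {}
    for r in rows:
        if r["ingested_at"] > cutoff:
            continue  # pruned by the ingested_at <= cutoff filter
        key = (r["instrument_id"], r["interval"], r["timestamp_unix"])
        if key not in latest or r["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = r
    return [r for r in latest.values() if r.get("deleted_at") is None]
```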


Why Single INSERT with Materialized Views?


ClickHouse lacks transaction safety:

  • ClickHouse does not support ACID transactions across multiple tables
  • No BEGIN/COMMIT/ROLLBACK for coordinating multi-table writes
  • Each INSERT is atomic for that table only, but multiple INSERTs can fail independently

Problems with manual dual/triple INSERT:

-- ❌ UNSAFE: No transaction coordination
INSERT INTO bars (...) VALUES (...);           -- Might succeed
INSERT INTO bars_snapshots (...) VALUES (...); -- Might fail
INSERT INTO bars_current (...) VALUES (...);   -- Might fail

If any INSERT fails:

  • ❌ Data inconsistency: Some tables updated, others not
  • ❌ Orphaned records: bars has data but bars_snapshots doesn’t
  • ❌ Manual cleanup required: Need application-level compensation logic
  • ❌ Retry complexity: Must track which INSERTs succeeded

Materialized views solve this:

-- ✅ SAFE: Single INSERT + automatic propagation
INSERT INTO staging_bars (...) VALUES (...);
-- ClickHouse triggers all three MVs as part of the same insert

Why this is better:

  1. Single point of failure: If staging_bars INSERT succeeds, ClickHouse guarantees MV propagation
  2. ClickHouse manages consistency: The database handles MV triggers internally
  3. Simplified error handling: Only need to retry the single INSERT to staging_bars
  4. No orphaned data: MVs are triggered as part of the same internal operation
  5. Reduced code complexity: Application layer doesn’t coordinate multi-table writes

Important caveat:
  • MV propagation is eventually consistent (not immediate)
  • For critical reads requiring strong consistency, query staging_bars directly
  • For normal production queries, the slight delay is acceptable
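The simplified error handling (point 3 above) can be sketched as follows; `client.insert` is a hypothetical driver call, and note that staging_bars deduplicates replayed batches with identical ingested_at values via ReplacingMergeTree, while the append-only bars_snapshots will still record the replay:

```python
import time

def insert_with_retry(client, rows, attempts=3, backoff=0.5):
    """Retry the single INSERT into staging_bars with exponential backoff.
    Only one write needs retrying; the MVs fan out from it."""
    for attempt in range(attempts):
        try:
            client.insert("staging_bars", rows)
            return True
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted retries; surface the failure
            time.sleep(backoff * 2 ** attempt)
```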

Advantages of using ingested_at timestamp:

  • ✅ Natural ordering - timestamps are self-explanatory

  • ✅ No need to manage monotonic ID generation

  • ✅ Human-readable - easier to debug and audit

  • ✅ Works perfectly with ClickHouse’s ReplacingMergeTree

Clock skew consideration:

  • If ingestion runs on multiple servers with slightly different clocks, you could have ordering issues

  • Mitigation: Ensure NTP sync across ingestion servers, or use a centralized timestamp service
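On a single ingestion node, one cheap guard (an illustrative helper, not a substitute for NTP sync across servers) is to never issue an ingested_at that goes backwards:

```python
from datetime import datetime, timedelta, timezone

class MonotonicVersioner:
    """Issue ingested_at versions that never decrease, even if the local
    clock steps backwards between calls."""
    def __init__(self):
        self._last = datetime.min.replace(tzinfo=timezone.utc)

    def next_version(self, now=None):
        now = now or datetime.now(timezone.utc)
        if now <= self._last:
            now = self._last + timedelta(microseconds=1)  # clamp regressions forward
        self._last = now
        return now
```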


Clear ingestion flow: Landing → validation → staging → automatic downstream propagation

Automatic propagation: ClickHouse handles downstream updates

Clear validation boundary: DQ checks before staging

Reproducibility: Complete snapshot history with snapshot pinning

Fast latest reads: Optimized bars_current table

Operational simplicity: No manual multi-table coordination

Consistency without transactions: ClickHouse MVs provide better consistency than manual INSERTs

Simple versioning: Natural timestamps instead of managing monotonic IDs

vs. Dual/Triple INSERT:

  • Eliminates manual coordination across tables

  • Reduces code complexity (1 INSERT vs 3)

  • ClickHouse guarantees consistency via MVs

  • Simpler error handling and retry logic

  • No risk of partial writes causing data inconsistency

vs. Table cloning per snapshot:

  • No cold-start (no “restore DB from S3” workflow)

  • No proliferation of per-snapshot tables

  • Efficient small patches (append-only updates)