DSL Similarity Measure
Source: Notion | Last edited: 2025-11-18 | ID: 2af2d2dc-3ef...
Absolutely—and this is the hidden superpower of using a DSL. If you engineer the pipeline across compile → canonicalize → fingerprint → vectorize, your strategies become addressable data objects. That unlocks “was this run before?”, “what’s similar?”, “how novel is this?”, plus reproducibility, dedup, caching, and compositional optimization.
What extra advantages do we gain?
1) One-click “has this been run?” via content addressing
- Canonicalization: strip comments/whitespace; sort keys; normalize numerics (1e-3 → 0.001); expand macros; α-rename variables; topologically sort the DAG into a deterministic sequence.
- Stable serialization: deterministically serialize the canonical IR/DAG (JSON/YAML/MsgPack).
- Fingerprint: ExperimentID = SHA256(Canonical(IR) || DataSnapshotID || EngineVersion). Same spec + same data + same engine ⇒ same ID.
- Use it as a cache key (skip redundant runs) and as a registry primary key (exact dedup).
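The canonicalize-then-hash step can be sketched in a few lines; a minimal Python sketch, assuming the canonical IR is already a plain dict (the argument names mirror the formula above):

```python
import hashlib
import json

def canonical_bytes(ir: dict) -> bytes:
    # Deterministic serialization: sorted keys, no insignificant whitespace.
    return json.dumps(ir, sort_keys=True, separators=(",", ":")).encode()

def experiment_id(ir: dict, data_snapshot_id: str, engine_version: str) -> str:
    # ExperimentID = SHA256(Canonical(IR) || DataSnapshotID || EngineVersion)
    h = hashlib.sha256()
    h.update(canonical_bytes(ir))
    h.update(data_snapshot_id.encode())
    h.update(engine_version.encode())
    return h.hexdigest()

# Same spec + same data + same engine => same ID, regardless of key order.
a = experiment_id({"op": "ema", "window": 24}, "snap-01", "v1.2")
b = experiment_id({"window": 24, "op": "ema"}, "snap-01", "v1.2")
assert a == b
```

Any change to the spec, the data snapshot, or the engine version flips the hash, which is exactly what makes it safe as both a cache key and a registry primary key.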
2) “What’s similar?” via vector search + near-duplicate hashing
- Structure embeddings: learn embeddings over the AST/DAG (path-based or graph2vec/node2vec-style).
- Semantic embeddings: embed descriptive text (objectives, constraints, alpha notes).
- Parameter vector: normalized/scaled hyperparameters as a numeric vector.
- Near-dup hash: SimHash/MinHash to catch near duplicates (robust to reordering/comments).
- Storage: keep the vectors in a vector DB (FAISS/Milvus); support top-k similarity and dedup thresholds (e.g., cosine > 0.97).
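The near-dup hash can be sketched as a classic 64-bit SimHash; a minimal sketch, assuming tokens are the whitespace-separated terms of the canonical text (a production version would hash weighted n-grams):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # Each token votes +1/-1 on every bit of its hash; threshold the sums.
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    # Near duplicates land within a small Hamming radius (e.g., <= 3).
    return bin(a ^ b).count("1")

# Token reordering does not move the hash at all (it is a bag of tokens):
assert simhash("ema window 24 cap 0.1") == simhash("cap 0.1 ema window 24")
```

This is why SimHash is robust to reordering and comment stripping: only the multiset of tokens matters, and small edits flip only a few bits.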
3) Quantified “Novelty Score”
- Structural novelty: 1 − max_cosine(struct_vec, library)
- Parameter novelty: distance to common/optimal parameter neighborhoods (Mahalanobis or spherical distance).
- Functional novelty: new operators/domains plus uncovered constraint combinations (coverage).
- Outcome novelty: distance to the Pareto frontier (return–drawdown–turnover).
- Composite: Novelty = w1*structure + w2*params + w3*function + w4*outcome (tunable weights).
4) Reproducibility, auditability, and diffability
- The fingerprint anchors the evidence chain: DSL → IR/DAG → compiler output → runner version → container/dependency image → data snapshot → metrics → order trail, all under the same ExperimentID.
- Explainable diffs: AST/DAG diffs beat text diffs (“added Kalman smoothing”, “window 12→24”).
- Compliance templates: auto-inject baseline checks (backtest baselines, risk caps, slippage/fee model versions) into reports.
5) Compute throttling & cost control
- Exact dedup execution: hit the cache by ExperimentID and return results immediately.
- Incremental reuse: node-level DAG caching (features, data cuts) to skip redundant work.
- Batch co-runs: merge shared prefixes across similar experiments—often 30%+ savings.
6) Search & automation
- Uncovered-space discovery: coverage heatmaps from the vector index to find parameter/operator gaps.
- Design agent: generate the next DSL patches around successful/failed neighbors.
- Guardrails: static analysis to flag overfit patterns (rule density, overly tight thresholds, leakage paths) and execution risks (liquidity, margin, order-laddering issues).
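To illustrate the guardrail idea, here is a hedged sketch of a rule-density check using Python's standard `ast` module as a stand-in for the DSL's own AST (real DSL node types would differ, and the 0.5 threshold is an arbitrary illustration):

```python
import ast

def rule_density(src: str) -> float:
    """Ratio of comparison/boolean nodes to statements: a crude overfit signal."""
    tree = ast.parse(src)
    compares = sum(isinstance(n, (ast.Compare, ast.BoolOp)) for n in ast.walk(tree))
    stmts = sum(isinstance(n, ast.stmt) for n in ast.walk(tree))
    return compares / max(stmts, 1)

def flag_overfit(src: str, threshold: float = 0.5) -> bool:
    # Many tightly stacked conditions per statement suggests a curve-fit rule.
    return rule_density(src) > threshold

dense = "if a > 1 and b < 2 and c > 3 and d < 4: x = 1"
sparse = "x = a + b"
```

Here `flag_overfit(dense)` fires while `flag_overfit(sparse)` does not; a real linter would add checks for threshold tightness and leakage paths on the compiled DAG.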
Minimal viable blueprint
A. Canonicalization & fingerprinting
- ast = dsl.parse() → AST
- ir = compile_to_ir(ast) (macro expansion, defaults, type checks)
- canon = canonicalize(ir) (strip comments/whitespace, sort keys, α-rename, topo sort, numeric normalization)
- id = sha256(canon || data_snapshot_id || engine_version)
- Registry insert: {id, canon, hash_components, owner, created_at}
B. Vectors & near-dup hashing
- vec_struct = embed_graph(ir)
- vec_text = embed_text(ir.doc + alpha.description)
- vec_params = encode_params(ir.hyperparams)
- simhash = simhash(canon_text)
- Upsert: {id, vec_struct, vec_text, vec_params, simhash, tags}
C. Dedup / similarity / novelty
- Dedup: exists(id) → return cached result
- Similarity: topk = faiss.search(vec_struct ⊕ vec_params, k=20) plus simhash_hamming <= 3
- Novelty: compute the composite score and persist it to novelty_scores
D. Caching & write-back
- Result key: ResultKey = sha256(id || runner_flags)
- After a run: write curves, orders, slippage, metrics, and artifacts (models, features, graphs) to artifacts/{id}/…
- Reporting: auto-generate “A/B comparison + neighbor leaderboard + novelty score”
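Part C’s composite score and part D’s result key can be sketched directly; the weights and the flag encoding below are illustrative assumptions, not prescribed values:

```python
import hashlib

def composite_novelty(structure: float, params: float, function: float,
                      outcome: float, w=(0.4, 0.2, 0.2, 0.2)) -> float:
    # Novelty = w1*structure + w2*params + w3*function + w4*outcome
    return w[0] * structure + w[1] * params + w[2] * function + w[3] * outcome

def result_key(experiment_id: str, runner_flags: str) -> str:
    # ResultKey = sha256(id || runner_flags): same run config => same cache slot.
    return hashlib.sha256((experiment_id + runner_flags).encode()).hexdigest()
```

Keeping the runner flags out of the ExperimentID but inside the ResultKey lets one spec own several cached runs (e.g., different fill models) without breaking exact dedup.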
Registry schema (essentials)
- experiments(id PK, canon_hash, data_hash, engine_ver, owner, created_at, status, novelty_score)
- vectors(id FK, struct_vec[], text_vec[], param_vec[], simhash)
- artifacts(id FK, path, type, sha256, created_at)
- metrics(id FK, regime, horizon, sharpe, mdd, turnover, cost_model_ver, …)
- neighbors(id FK, neighbor_id, sim, method, created_at)
Engineering tips
- Float stability: canonicalize numeric literals to a fixed format to avoid hash jitter.
- α-renaming: normalize variable names (e.g., f1, f2, …) so naming doesn’t affect similarity.
- Topo order: topological sort plus stable multi-key ordering (op-type/in-degree/position).
- Multi-view embeddings: structure/text/params as separate channels; combine at query time.
- MinHash coverage: apply to “feature-set/data-domain” bags to quantify functional differences.
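The MinHash-coverage tip can be sketched as a Jaccard estimate over feature-set bags; a pure-stdlib sketch (a real system would likely use a library such as datasketch, and the example feature names are hypothetical):

```python
import hashlib

def minhash_sig(items: set, k: int = 64) -> list:
    # One min over salted 64-bit hashes per "permutation" slot.
    return [min(int.from_bytes(hashlib.md5(f"{i}:{x}".encode()).digest()[:8], "big")
                for x in items)
            for i in range(k)]

def jaccard_est(sig_a: list, sig_b: list) -> float:
    # Fraction of matching slots estimates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_sig({"momentum", "vol_21d", "sector_neutral"})
b = minhash_sig({"momentum", "vol_21d", "beta_hedge"})
overlap = jaccard_est(a, b)  # estimates the true Jaccard of 2/4 = 0.5
```

Comparing signatures of “feature-set/data-domain” bags quantifies functional overlap cheaply, which feeds both the functional-novelty term and the coverage heatmaps.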
One-liner summary
A DSL isn’t just “nicer to write”—it makes strategies addressable, comparable, deduplicable, traceable, and searchable.
With canonicalization + fingerprinting + vector search, you can instantly answer:
- Was this run before? (exact dedup)
- What is it similar to, and how? (neighbor recall + AST/DAG diffs)
- Is it worth running? (novelty/coverage vs. cost)
- How do we iterate next? (agent-generated DSL patches in uncovered neighborhoods)

If you want, I can draft a reference skeleton (Python package + minimal CLI + FAISS schema) wired to your current DSL fields.