DSL Similarity Measure

Source: Notion | Last edited: 2025-11-18 | ID: 2af2d2dc-3ef...


Absolutely, and this is the hidden superpower of using a DSL. If you engineer the pipeline as compile → canonicalize → fingerprint → vectorize, your strategies become addressable data objects. That unlocks "was this run before?", "what's similar?", and "how novel is this?", plus reproducibility, dedup, caching, and compositional optimization.

1) One-click “has this been run?” via content addressing

  • Canonicalization: strip comments/whitespace; sort keys; normalize numerics (1e-3 → 0.001); expand macros; α-rename variables; topologically sort the DAG to a deterministic sequence.

  • Stable serialization: deterministically serialize the canonical IR/DAG (JSON/YAML/MsgPack).

  • Fingerprint: ExperimentID = SHA256( Canonical(IR) || DataSnapshotID || EngineVersion )

    Same spec + same data + same engine ⇒ same ID.

    Use it as a cache key (skip redundant runs) and as a registry primary key (exact dedup).
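The canonicalize-then-hash step can be sketched in a few lines of Python (a minimal sketch, assuming the IR is a JSON-like dict; macro expansion, α-renaming, and topo sorting are elided):

```python
import hashlib
import json

def canonicalize(ir: dict) -> str:
    """Deterministic serialization: sorted keys, compact separators, normalized floats."""
    def norm(x):
        if isinstance(x, float):
            return format(x, ".12g")  # 1e-3 and 0.001 both render as "0.001"
        if isinstance(x, dict):
            return {k: norm(v) for k, v in x.items()}
        if isinstance(x, list):
            return [norm(v) for v in x]
        return x
    return json.dumps(norm(ir), sort_keys=True, separators=(",", ":"))

def experiment_id(ir: dict, data_snapshot_id: str, engine_version: str) -> str:
    """ExperimentID = SHA256(Canonical(IR) || DataSnapshotID || EngineVersion)."""
    payload = "||".join([canonicalize(ir), data_snapshot_id, engine_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because `1e-3` and `0.001` are the same float, the fixed `.12g` formatting guarantees both spellings hash identically, which is exactly the hash-jitter problem the numeric-normalization step exists to prevent.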

2) “What’s similar?” via vector search + near-duplicate hashing

  • Structure embeddings: learn embeddings over the AST/DAG (path-based or graph2vec/node2vec-style).
  • Semantic embeddings: embed descriptive text (objectives, constraints, alpha notes).
  • Parameter vector: normalized/scaled hyperparameters as a numeric vector.
  • Near-dup hash: SimHash/MinHash to catch near duplicates (robust to reordering/comments).
  • Indexing: store vectors in a vector DB (FAISS/Milvus); support top-k similarity and dedup thresholds (e.g., cosine > 0.97).
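A 64-bit SimHash is small enough to sketch directly (token extraction from the canonical text is assumed to happen upstream):

```python
import hashlib

def simhash64(tokens: list[str]) -> int:
    """64-bit SimHash: every token votes on each bit; the majority sets it."""
    votes = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Count differing bits; near-duplicates land within a small radius (e.g., <= 3)."""
    return bin(a ^ b).count("1")
```

Since each token votes independently, the hash is invariant to token order, which is what makes it robust to reordering and comment churn.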

3) Quantified “Novelty Score”

  • Structural novelty: 1 − max_cosine(struct_vec, library)

  • Parameter novelty: distance to common/optimal parameter neighborhoods (Mahalanobis/spherical).

  • Functional novelty: new operators/domains + uncovered constraint combinations (coverage).

  • Outcome novelty: distance to the Pareto frontier (return–drawdown–turnover).

  • Composite: Novelty = w1*structure + w2*params + w3*function + w4*outcome (tunable weights).

4) Reproducibility, auditability, and diffability

  • The fingerprint anchors the evidence chain: DSL → IR/DAG → compiler output → runner version → container/dep image → data snapshot → metrics → order trail, all under the same ExperimentID.

  • Explainable diffs: AST/DAG diffs beat text diffs (“added Kalman smoothing”, “window 12→24”).

  • Compliance templates: auto-inject baseline checks (backtest baselines, risk caps, slippage/fee model versions) into reports.

5) Compute throttling & cost control

  • Exact dedup execution: hit cache by ExperimentID and return results immediately.

  • Incremental reuse: node-level DAG caching (features/data cuts) to skip redundant work.

  • Batch co-runs: merge shared prefixes across similar experiments, often 30%+ savings.

6) Search & automation

  • Uncovered-space discovery: coverage heatmaps from the vector index to find parameter/operator gaps.

  • Design agent: generate next DSL patches around successful/failed neighbors.

  • Guardrails: static analysis to flag overfit patterns (rule density, tight thresholds, leakage paths) and execution risks (liquidity/margin/laddering issues).
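A guardrail pass like this can start as plain dictionary checks over the IR (a toy sketch; `rules` and `window` are hypothetical field names, not part of any stated schema):

```python
def lint_overfit(ir: dict, max_rules: int = 20, min_window: int = 5) -> list[str]:
    """Toy static checks: rule density and suspiciously tight lookback windows."""
    warnings = []
    rules = ir.get("rules", [])
    if len(rules) > max_rules:
        warnings.append(f"rule density: {len(rules)} rules exceeds {max_rules}")
    for rule in rules:
        window = rule.get("window")
        if window is not None and window < min_window:
            warnings.append(f"tight threshold: window={window} is below {min_window}")
    return warnings
```

Real checks (leakage paths, liquidity/margin issues) would walk the DAG rather than a flat rule list, but the shape is the same: pure functions over the canonical IR, run before any compute is spent.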


A. Canonicalization & fingerprinting

  1. dsl.parse() → AST
  2. ir = compile_to_ir(ast) (macro expansion, defaults, type checks)
  3. canon = canonicalize(ir) (strip comments/whitespace, sort keys, α-rename, topo sort, numeric normalization)
  4. id = sha256(canon || data_snapshot_id || engine_version)
  5. Registry insert: {id, canon, hash_components, owner, created_at}

B. Vectors & near-dup hashing
  • vec_struct = embed_graph(ir)

  • vec_text = embed_text(ir.doc + alpha.description)

  • vec_params = encode_params(ir.hyperparams)

  • simhash = simhash(canon_text)

  • Upsert: {id, vec_struct, vec_text, vec_params, simhash, tags}

C. Dedup / similarity / novelty

  • Dedup: exists(id) → return cached result

  • Similarity: topk = faiss.search(vec_struct ⊕ vec_params, k=20) + simhash_hamming<=3

  • Novelty: compute composite score and persist to novelty_scores

D. Caching & write-back

  • Result key: ResultKey = sha256(id || runner_flags)

  • After run: write curves, orders, slippage, metrics, and artifacts (models, features, graphs) to artifacts/{id}/…

  • Reporting: auto-generate “A/B comparison + neighbor leaderboard + novelty score”
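The result-key derivation and artifact write-back above can be sketched as (assuming `runner_flags` is a JSON-serializable dict; the directory layout is an assumption):

```python
import hashlib
import json
import pathlib

def result_key(experiment_id: str, runner_flags: dict) -> str:
    """ResultKey = sha256(id || runner_flags), stable under flag-dict reordering."""
    flags = json.dumps(runner_flags, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{experiment_id}||{flags}".encode("utf-8")).hexdigest()

def write_artifact(root: str, experiment_id: str, name: str, payload: bytes) -> str:
    """Write an artifact under artifacts/{id}/ and return its sha256 for the registry."""
    directory = pathlib.Path(root) / experiment_id
    directory.mkdir(parents=True, exist_ok=True)
    (directory / name).write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()
```

Hashing the sorted, compact JSON of the flags means two runs that differ only in flag ordering hit the same cache entry.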


  • experiments(id PK, canon_hash, data_hash, engine_ver, owner, created_at, status, novelty_score)
  • vectors(id FK, struct_vec[], text_vec[], param_vec[], simhash)
  • artifacts(id FK, path, type, sha256, created_at)
  • metrics(id FK, regime, horizon, sharpe, mdd, turnover, cost_model_ver, …)
  • neighbors(id FK, neighbor_id, sim, method, created_at)
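The registry tables above can be stood up as SQLite DDL for prototyping (a subset of the tables; column types are assumptions, and vectors are stored as opaque blobs here rather than in a vector DB):

```python
import sqlite3

DDL = """
CREATE TABLE experiments(
  id TEXT PRIMARY KEY, canon_hash TEXT, data_hash TEXT, engine_ver TEXT,
  owner TEXT, created_at TEXT, status TEXT, novelty_score REAL);
CREATE TABLE vectors(
  id TEXT REFERENCES experiments(id),
  struct_vec BLOB, text_vec BLOB, param_vec BLOB, simhash INTEGER);
CREATE TABLE neighbors(
  id TEXT REFERENCES experiments(id), neighbor_id TEXT,
  sim REAL, method TEXT, created_at TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

In production the `vectors` table would live behind FAISS/Milvus, with SQLite (or Postgres) keeping only the IDs and metadata.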

  • Float stability: canonicalize numeric literals to a fixed format to avoid hash jitter.
  • α-renaming: normalize variable names (e.g., f1,f2,…) so naming doesn’t affect similarity.
  • Topo order: topo sort + stable multi-key ordering (op-type/in-degree/position).
  • Multi-view embeddings: structure/text/params as separate channels; combine at query time.
  • MinHash coverage: apply to “feature-set/data-domain” bags to quantify functional differences.
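The MinHash-over-bags idea can be sketched with salted hashes standing in for k random permutations:

```python
import hashlib

def minhash_signature(items: set[str], k: int = 64) -> list[int]:
    """k salted hash functions stand in for k random permutations of the item space."""
    def h(i: int, item: str) -> int:
        digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return [min(h(i, item) for item in items) for i in range(k)]

def jaccard_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity of the bags."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Applied to "feature-set/data-domain" bags, the estimate converges to the true Jaccard similarity as k grows, giving a cheap scalar for functional overlap between two strategies.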

A DSL isn’t just “nicer to write”—it makes strategies addressable, comparable, deduplicable, traceable, and searchable.

With canonicalization + fingerprinting + vector search, you can instantly answer:

  • Was this run before? (exact dedup)
  • What is it similar to, and how? (neighbor recall + AST/DAG diffs)
  • Is it worth running? (novelty/coverage vs. cost)
  • How do we iterate next? (agent-generated DSL patches in uncovered neighborhoods)

If you want, I can draft a reference skeleton (Python package + minimal CLI + FAISS schema) wired to your current DSL fields.