Advanced Guidelines for Zero-Magic-Number Financial Time Series Feature Engineering
Source: Notion | Last edited: 2025-06-04 | ID: 2082d2dc-3ef...
Seven Fundamental Principles (ABSOLUTE REQUIREMENTS)
PRINCIPLE 1: ZERO SYNTHETIC DATA TOLERANCE
- Never create, generate, simulate, or interpolate data points
- Never use bootstrap resampling, synthetic distributions, or artificial data
- Never fill missing data with interpolated values
PRINCIPLE 2: PURE DATA-DRIVEN DERIVATION
- All parameters must derive from actual observed data
- Never use literature priors, empirical multipliers, or external defaults
- Never use hardcoded constants, including machine precision multipliers
- Example:
threshold = np.percentile(historical_data, optimal_percentile), where optimal_percentile is itself derived from the data
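A minimal sketch of one way optimal_percentile might itself come from the data, by locating the point of maximum curvature in the sorted magnitudes; the function name, the curvature criterion, and the file path are illustrative assumptions, not a prescribed method:
import numpy as np

def derive_percentile_from_data(values):
    # Sort observed magnitudes and approximate the curvature of the empirical quantile curve.
    sorted_vals = np.sort(np.abs(values))
    curvature = np.diff(sorted_vals, n=2)
    # The point of maximum curvature marks a data-driven tail boundary.
    knee_index = int(np.argmax(curvature)) + 1
    return 100.0 * knee_index / (len(sorted_vals) - 1)

historical_data = np.loadtxt("observed_returns.csv")  # recorded observations; path is illustrative
optimal_percentile = derive_percentile_from_data(historical_data)
threshold = np.percentile(historical_data, optimal_percentile)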
PRINCIPLE 3: FAIL-FAST ON INSUFFICIENT DATA
- No fallbacks, no degradation, no synthetic alternatives
- Insufficient data = explicit failure with clear error message
- Better to have no feature than a feature based on fake data
- Implementation:
if len(data) < data_derived_minimum:
    raise InsufficientDataError(f"Need {data_derived_minimum}, got {len(data)}")
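A slightly fuller sketch of the same fail-fast guard, assuming a project-level InsufficientDataError; the class and function names are illustrative, and the minimum itself is expected to come from the data's own structure (for example an autocorrelation decay length measured on a reference history), never a hardcoded constant:
class InsufficientDataError(ValueError):
    """Raised instead of falling back to synthetic, interpolated, or degraded alternatives."""

def require_sufficient_data(data, data_derived_minimum):
    # data_derived_minimum is computed elsewhere from the data's own structure.
    if len(data) < data_derived_minimum:
        raise InsufficientDataError(f"Need {data_derived_minimum}, got {len(data)}")
    return data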
PRINCIPLE 4: UNIVERSAL INSTRUMENT-AGNOSTIC OPERATION
- Algorithms must work identically across all instruments without modification
- No crypto/forex/equity-specific logic anywhere in the system
- Raw numbers are meaningless in isolation; always use relative comparisons within windows (see the sketch after this list)
- Adaptive mechanisms (like change point detection thresholds or window sizing logic) must be instrument-agnostic and data-derived.
- Performance criteria derived from instrument’s own historical characteristics
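A minimal sketch of one instrument-agnostic transform consistent with these points: each observation is expressed as its percentile rank within its own trailing window, so the feature lives in [0, 1] for any instrument. The function name is illustrative, and the window is assumed to be data-derived (see RULE 2).
import pandas as pd

def rolling_percentile_rank(series: pd.Series, window: int) -> pd.Series:
    # Rank each value against its own trailing window so the result is scale-free.
    def rank_of_last(values):
        return (values < values[-1]).mean()
    return series.rolling(window, min_periods=window).apply(rank_of_last, raw=True)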
PRINCIPLE 5: TEMPORAL INTEGRITY
- Never use future information to make current decisions (no look-ahead bias)
- Lag between threshold calculation and application derived from data autocorrelation
- Never use current value to set its own threshold
- Strict temporal separation in all calculations
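A sketch of how the autocorrelation-derived lag and the temporal separation might be implemented; the decay criterion below uses the conventional large-sample noise band, which is itself a choice that, under these guidelines, would be replaced by a data-derived rule:
import numpy as np
import pandas as pd

def autocorr_decay_lag(x: pd.Series) -> int:
    # First lag at which the sample autocorrelation falls inside the approximate
    # noise band; the resulting lag therefore comes from the series itself.
    noise_band = 2.0 / np.sqrt(len(x))
    for lag in range(1, len(x) // 2):
        if abs(x.autocorr(lag)) < noise_band:
            return lag
    return len(x) // 2

def lagged_rolling_threshold(x: pd.Series, window: int, pct: float) -> pd.Series:
    # The threshold is computed on trailing data and shifted forward by the derived lag,
    # so no observation ever sets its own threshold (no look-ahead).
    lag = autocorr_decay_lag(x)
    return x.rolling(window, min_periods=window).quantile(pct / 100.0).shift(lag)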
PRINCIPLE 6: REAL-TIME ADAPTIVITY
- System continuously adapts to changing market conditions without manual intervention
- Adaptivity is driven by Change Point Detection and Regime Switching, allowing dynamic responses to market shifts.
- Automatic regime detection using unsupervised methods (HMM, change point detection)
- Adaptation speed scales with data-derived volatility characteristics
- Seamless parameter transitions during regime shifts (no data interpolation)
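A sketch of unsupervised regime detection consistent with this principle, using a Gaussian Mixture on rolling volatility with the number of regimes chosen by BIC on the data itself; window and max_regimes are assumed to be data-derived inputs, and the function name is illustrative:
import pandas as pd
from sklearn.mixture import GaussianMixture

def detect_volatility_regimes(returns: pd.Series, window: int, max_regimes: int) -> pd.Series:
    # Rolling volatility summarises local market conditions without external labels.
    vol = returns.rolling(window, min_periods=window).std().dropna()
    X = vol.to_numpy().reshape(-1, 1)
    # Fit 1..max_regimes components and keep the model the data itself prefers (BIC).
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, max_regimes + 1)]
    best = min(models, key=lambda m: m.bic(X))
    return pd.Series(best.predict(X), index=vol.index)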
PRINCIPLE 7: COMPUTATIONAL DETERMINISM
- Same inputs always produce same outputs
- Computational resource bounds derived from data complexity and processing requirements
- Adaptation latency derived from data frequency and volatility characteristics
- Execution performance benchmarked against data-derived baselines
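A minimal determinism check, assuming feature_fn is any feature computation in the system (the helper name is illustrative): run it twice on identical inputs and require identical outputs.
import numpy as np

def assert_deterministic(feature_fn, data):
    # Identical inputs must give bit-identical outputs (NaNs compared as equal).
    first = np.asarray(feature_fn(data))
    second = np.asarray(feature_fn(data))
    if not np.array_equal(first, second, equal_nan=True):
        raise AssertionError("Non-deterministic output for identical inputs")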
Four Implementation Rules
RULE 1: Data-Driven Parameter Derivation
- Never use hardcoded constants: thresholds, percentages, window sizes, time factors, multipliers
- Derive from data distribution: percentiles, quantiles, statistical measures, autocorrelation decay
- Regularization factors: Derive from data magnitude relationships, not machine precision constants
- Statistical constants: Derive from data-driven optimization, not mathematical formulas
- Time calculations: Extract from actual timestamp patterns in data, account for observed irregularities
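A sketch of timestamp-derived time calculations; the median interval tolerates gaps and session breaks, and both function names are illustrative:
import pandas as pd

def median_sampling_interval(index: pd.DatetimeIndex) -> pd.Timedelta:
    # The data's own sampling interval, robust to observed irregularities.
    return pd.Series(index).diff().dropna().median()

def bars_per_horizon(index: pd.DatetimeIndex, horizon: pd.Timedelta) -> float:
    # Convert a calendar horizon into a bar count using the observed interval,
    # instead of assuming 24/7 markets or fixed session lengths.
    return horizon / median_sampling_interval(index)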
RULE 2: Window-Based Relative Analysis
- Always use window-based relativity: Compare within rolling windows for universal applicability
- Windowing and segmentation are driven by Change Point Detection, adapting dynamically to market regimes.
- Autocorrelation-based sizing:
window = autocorr_decay_point * data_derived_multiplier
- Multi-scale analysis: Window sizes determined by data’s natural scale hierarchy, often guided by regime characteristics.
- Memory efficiency: Buffer sizes derived from data streaming characteristics
- Implementation:
threshold = np.percentile(rolling_window_values, data_optimized_percentile)
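A streaming sketch combining window-based relativity with bounded memory; the class name is illustrative, and both window and percentile are assumed to be derived from the data as described in RULE 1:
from collections import deque
import numpy as np

class StreamingRelativeThreshold:
    # Buffer size follows the data-derived window, so memory use tracks the data's
    # own autocorrelation structure rather than an arbitrary cap.
    def __init__(self, window: int, percentile: float):
        self.buffer = deque(maxlen=window)
        self.percentile = percentile

    def update(self, value: float) -> float:
        # The threshold is computed before the new value is appended, so the current
        # observation never participates in its own threshold.
        threshold = (np.percentile(self.buffer, self.percentile)
                     if len(self.buffer) == self.buffer.maxlen else float("nan"))
        self.buffer.append(value)
        return threshold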
RULE 3: Robust Statistical Methods
- Replace mean/std with robust alternatives: median/MAD, winsorization
- Outlier resistance: Use data-adaptive robust methods
- Adaptive thresholds: Optimization based on data distribution characteristics, often regime-specific.
- Multi-hypothesis correction: Significance levels derived from multiple testing burden in data
- Exponential weighting: Decay rates derived from data’s volatility clustering patterns
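A sketch of the robust replacements named above; the usual 1.4826 MAD consistency constant is deliberately omitted because it is a fixed multiplier, and the winsorization bounds are assumed to be data-derived percentiles:
import numpy as np

def robust_zscore(values):
    # Median/MAD in place of mean/std, so a few outliers cannot distort the scale.
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero; no scale can be derived from this window")
    return (values - median) / mad

def winsorize_to_quantiles(values, lower_pct, upper_pct):
    # Clip to quantiles of the observed distribution rather than fixed cutoffs.
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)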
RULE 4: Multi-Scale Consensus Validation
- Cross-validation: Window sizes and validation periods derived from data regime patterns
- Hierarchical scale validation: Ensure consistency across data’s natural time scales
- Multi-resolution consensus: Weighting derived from scale-specific performance on historical data
- Multi-factor validation: Factor importance derived from historical predictive performance
- Contradiction resolution: Voting weights derived from factor reliability in data
- Ensemble/Multi-Expert Systems: Combine outputs from different models or windowing schemes for robust validation.
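A minimal sketch of consensus weighting in which the voting weights come from each expert's realised performance on historical data (the argument names are illustrative):
import numpy as np

def consensus_vote(signals, historical_hit_rates):
    # signals: one entry per scale/expert for the current bar, e.g. values in {-1, 0, +1}.
    # historical_hit_rates: each expert's accuracy measured on past data, so the
    # weights are derived from observed reliability rather than assigned by hand.
    weights = np.asarray(historical_hit_rates, dtype=float)
    weights = weights / weights.sum()
    return float(np.dot(weights, signals))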
Implementation Hierarchy
Tier 1: Turnkey Statistical Solutions
- scipy.stats: All statistical functions, distributions, tests
- numpy: Mathematical operations, percentiles, array operations
- pandas: Data manipulation, rolling operations, time series
Tier 2: Specialized Financial Libraries (Strongly Recommended Out-of-the-Box)
- ruptures: State-of-the-art Change Point Detection library
- quantstats: Financial metrics, volatility calculations
- arch: GARCH models, volatility forecasting
- statsmodels: Time series analysis, regime detection (includes some change point methods)
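A sketch of ruptures-based change point detection; PELT with an l2 cost is one of several detectors the library offers, and the BIC-style penalty below (built from the series' own length and variance) is an illustrative choice rather than a prescribed formula:
import numpy as np
import ruptures as rpt

def detect_change_points(values):
    # Penalty grows with sample size and observed variance, so it is derived from
    # the data rather than tuned by hand.
    values = np.asarray(values, dtype=float)
    penalty = np.log(len(values)) * np.var(values)
    algo = rpt.Pelt(model="l2").fit(values.reshape(-1, 1))
    return algo.predict(pen=penalty)  # breakpoint indices; last entry equals len(values)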
Tier 3: Machine Learning Frameworks
- sklearn: Gaussian Mixture Models, preprocessing, clustering, ensemble methods
- scipy.optimize: Parameter estimation, threshold optimization
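A sketch of scipy.optimize-based threshold optimisation; the separation objective, the train/validation split, and the search bounds are all placeholders for whatever data-driven criterion the feature actually requires:
import numpy as np
from scipy.optimize import minimize_scalar

def optimise_percentile(train, validation):
    def objective(pct):
        threshold = np.percentile(train, pct)
        exceed = validation[validation > threshold]
        rest = validation[validation <= threshold]
        if len(exceed) == 0 or len(rest) == 0:
            return np.inf
        scale = np.median(np.abs(validation - np.median(validation)))
        if scale == 0:
            return np.inf
        # Negative separation in robust units, so minimisation maximises separation.
        return -(np.median(exceed) - np.median(rest)) / scale
    result = minimize_scalar(objective, bounds=(50.0, 99.9), method="bounded")
    return float(result.x)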
Tier 4: Custom Implementation (Last Resort)
- Only when no off-the-shelf solution exists
- Must follow all 7 principles and 4 rules above
- Extensive testing against known benchmarks
- Mandatory peer review and benchmark comparison
Validation Requirements
Empirical Testing Standards
- Historical Backtesting: Test across multiple market regimes, including periods with significant change points.
- Stress Testing: Extreme scenarios (2008 crisis, Terra Luna, COVID) - ensure adaptivity during crises.
- Cross-Asset Validation: Ensure generalizability across instruments.
- Scale-Invariance Testing: Verify behavior across volatility differences observed in data.
- Regime Transition Testing: Ensure smooth adaptation during market shifts - specifically test detection and transition speed.
- Data Sufficiency Testing: Validate explicit failure when data insufficient.
- Real-Time Performance: Computational efficiency benchmarked against data processing requirements, including the overhead of adaptive mechanisms.
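A pytest-style sketch of the Data Sufficiency Testing requirement; the imported names are hypothetical stand-ins for the project's own exception and feature function (see the fail-fast sketch under PRINCIPLE 3):
import numpy as np
import pytest

from features import InsufficientDataError, compute_feature  # hypothetical module and names

def test_insufficient_data_fails_explicitly():
    # Too few observations must raise, never fall back to synthetic or degraded output.
    too_short = np.array([0.01, -0.02])
    with pytest.raises(InsufficientDataError):
        compute_feature(too_short)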
Documentation Standards
- Mathematical Derivation: Document all formula sources and data-driven optimizations.
- Parameter Justification: Explain all data-driven parameter choices with empirical evidence, including how they adapt to regimes.
- Failure Logic: Document explicit failure scenarios and data requirements.
- Performance Benchmarks: Compare against data-derived baselines, not arbitrary targets. Document adaptation speed and change point detection latency.
Success Metrics
Principle Compliance Indicators
- Zero synthetic data contamination across all features (Principle 1)
- Zero hardcoded constants in core calculations (Principle 2)
- Explicit failure handling for insufficient data scenarios (Principle 3)
- Cross-instrument universality measured relative to instrument characteristics (Principle 4)
- Zero look-ahead bias in all temporal calculations (Principle 5)
- Regime detection accuracy measured against ground truth regime changes (Principle 6)
- Deterministic outputs given same inputs (Principle 7)
Performance Indicators
- Gradient informativeness compared to data-derived baseline methods.
- Crisis identification within timeframes derived from historical crisis patterns.
- False positive rate relative to data’s natural noise characteristics.
- Adaptation latency relative to data frequency and volatility patterns.
- Memory efficiency relative to data complexity requirements.
- Computational scalability benchmarked against data processing demands.
- Change Point Detection Accuracy: Measured against ground truth or through robust proxies.
- Regime Transition Speed: Time taken to adapt parameters after a detected change point.
This framework integrates state-of-the-art adaptive techniques and recommends out-of-the-box solutions, further enhancing the robustness and real-time capabilities of the feature engineering process while strictly adhering to the core principles.