
Terry’s Project Details Q1 2025

Source: Notion | Last edited: 2025-04-01 | ID: 1c82d2dc-3ef...


SR&ED Claim – Terry Li’s Work on Advanced Data Infrastructure for Financial Time Series Forecasting - Q1 2025

Building on the foundational work from 2024, the first quarter of 2025 presented increasing challenges in data acquisition, storage, and retrieval for financial time series analysis. Market uncertainty continued from 2024, with even higher volatility creating unprecedented demands on our data infrastructure:

  • Existing data retrieval mechanisms couldn’t efficiently handle the increased volume of high-frequency trading data
  • Standard caching solutions proved inadequate for the ultra-low-latency requirements of our trading models
  • Open-source data integrity tools failed to provide the validation guarantees required for financial data
  • Off-the-shelf APIs lacked the sophistication to handle time boundary precision necessary for accurate backtesting

Existing Technologies and Their Limitations

While the 2024 work focused on dynamic lookback frameworks for feature engineering, we discovered that the underlying data infrastructure itself had become a bottleneck. Existing technologies presented several limitations:

  • Although Binance publishes sample code and API documentation (https://github.com/binance/binance-spot-api-docs), these samples lack robust error handling, rate-limit management, and performance optimization
  • Standard Binance API clients (Python-Binance: https://github.com/sammchardy/python-binance, CCXT: https://github.com/ccxt/ccxt) encountered rate limiting and connection instability issues
  • Conventional data storage formats (CSV, Parquet) couldn’t support the random access patterns needed for our dynamic window calculations
  • The lack of industry standards in cryptocurrency and high-frequency trading domains meant we couldn’t rely on established solutions
  • Third-party data vendors such as Kaiko (https://www.kaiko.com/) failed to provide coherent standards that would meet our requirements

To overcome these limitations, a complete redesign of our Data Source Manager was necessary, resulting in the creation of the Binance Data Services repository (https://github.com/Eon-Labs/binance-data-services). This technological advancement:

  • Implements zero-copy memory-mapped file access using Apache Arrow, a significant departure from traditional file-based approaches
  • Creates an intelligent mediator architecture to dynamically select optimal data sources based on recency and availability
  • Develops a sophisticated multi-stage validation pipeline ensuring data integrity across the entire data lifecycle
  • Introduces comprehensive error handling and rate limit management not available in any standard solution

The development of this system represented a true technological uncertainty, as it required innovative approaches to solve problems not adequately addressed by existing solutions.
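A minimal sketch of how such a multi-stage validation pipeline might be structured; the stage names, the `validate` helper, and the metadata fields here are illustrative assumptions, not the actual API of the Binance Data Services repository:

```python
import hashlib
from typing import Callable, Dict, List

# Each stage inspects the raw bytes plus metadata and raises on failure.
Stage = Callable[[bytes, Dict], None]

def check_nonempty(data: bytes, meta: Dict) -> None:
    if not data:
        raise ValueError("empty payload")

def check_checksum(data: bytes, meta: Dict) -> None:
    # The payload's SHA-256 must match the digest recorded at download time.
    digest = hashlib.sha256(data).hexdigest()
    if digest != meta["sha256"]:
        raise ValueError("checksum mismatch: " + digest)

def check_row_count(data: bytes, meta: Dict) -> None:
    # One record per line; the count must agree with the metadata.
    if data.count(b"\n") != meta["rows"]:
        raise ValueError("row count mismatch")

PIPELINE: List[Stage] = [check_nonempty, check_checksum, check_row_count]

def validate(data: bytes, meta: Dict) -> bool:
    """Run every stage in order; any failure aborts the pipeline."""
    for stage in PIPELINE:
        stage(data, meta)
    return True

payload = b"1,42.0\n2,43.5\n"
meta = {"sha256": hashlib.sha256(payload).hexdigest(), "rows": 2}
print(validate(payload, meta))  # True
```

Keeping each check as an independent callable lets new stages (schema checks, gap detection) be appended without touching existing ones.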

Systematic Investigation & Experimental Development

The development process followed a structured experimental methodology:

  1. Problem Identification - Isolating specific performance and reliability bottlenecks in the data pipeline
  2. Hypothesis Formulation - Proposing novel architectural approaches to address each bottleneck
  3. Prototype Development - Implementing experimental solutions for each component
  4. Performance Testing - Rigorously measuring improvements against quantifiable metrics
  5. Integration and Refinement - Combining successful components into a cohesive system

First Iteration: Memory-Mapped Data Access Architecture

  • Approach: Implemented initial Apache Arrow MMAP integration for data access
  • Technical Details: Developed column-based data storage with zero-copy reads using memory mapping
  • Risks: Memory management complexities and potential memory leaks with large datasets
  • Results: Achieved at least threefold (3x) performance improvement over traditional file access methods but encountered stability issues with concurrent access
  • Documentation: Performance testing results archived at [internal reference: EON-LABS-DS-PERF-2025-Q1-001]
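The repository's Apache Arrow implementation is not reproduced here; the following standard-library sketch illustrates the underlying idea of this iteration: memory-mapping a column of values so that reads are views into the page cache rather than copies. File layout and names are illustrative assumptions.

```python
import mmap
import os
import struct
import tempfile

# Write a small column of float64 prices to disk.
prices = [101.5, 102.0, 99.75, 103.25]
path = os.path.join(tempfile.mkdtemp(), "prices.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(prices), *prices))

# Map the file; slicing a memoryview over an mmap yields zero-copy
# views into the OS page cache, enabling cheap random access.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)
    # Random access to the third value (offset 2 * 8 bytes).
    (third,) = struct.unpack_from("<d", view, 2 * 8)
    print(third)  # 99.75
    view.release()  # must release exported views before closing the map
    mm.close()
```

Apache Arrow applies the same principle to whole columnar tables via `pyarrow.memory_map`, which is what makes the random access patterns of the dynamic window calculations affordable.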

Second Iteration: Unified Cache Management System

  • Approach: Developed specialized caching system with partial data loading capabilities
  • Technical Details: Implemented monthly file organization with JSON metadata tracking and SHA-256 integrity verification
  • Risks: Cache invalidation logic complexity and potential for stale data
  • Results: Reduced repeated data access times from seconds to milliseconds while maintaining data integrity
  • Documentation: Implementation details and testing results archived at [internal reference: EON-LABS-DS-CACHE-2025-Q1-002]
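The monthly-file-plus-JSON-sidecar scheme described above can be sketched as follows; the function names and file layout are assumptions for illustration, not the repository's actual interface:

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def cache_put(symbol: str, month: str, payload: bytes) -> str:
    """Store one month of data plus a JSON sidecar recording its digest."""
    path = os.path.join(CACHE_DIR, "%s-%s.bin" % (symbol, month))
    with open(path, "wb") as f:
        f.write(payload)
    meta = {"sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload)}
    with open(path + ".json", "w") as f:
        json.dump(meta, f)
    return path

def cache_get(symbol: str, month: str) -> bytes:
    """Return cached bytes, verifying SHA-256 integrity before use."""
    path = os.path.join(CACHE_DIR, "%s-%s.bin" % (symbol, month))
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".json") as f:
        meta = json.load(f)
    if hashlib.sha256(payload).hexdigest() != meta["sha256"]:
        raise ValueError("cache entry corrupted")
    return payload

cache_put("BTCUSDT", "2025-01", b"kline,kline\n")
print(cache_get("BTCUSDT", "2025-01"))
```

Verifying the digest on every read is what lets repeated accesses stay in the millisecond range without sacrificing the integrity guarantees the trading models depend on.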

Third Iteration: Intelligent Data Source Selection

  • Approach: Created dynamic source selection between REST and Vision APIs based on data age and availability
  • Technical Details: Implemented 36-hour threshold logic with automatic fallback mechanisms and exponential backoff retry
  • Risks: Race conditions between sources and potential for incomplete data with API failures
  • Technical Challenges: Aligning data terminologies between different sources and designing for extensibility to other exchanges and data types
  • Results: Successfully balanced data freshness and retrieval performance while enhancing system resilience
  • Documentation: Source selection algorithm and performance metrics archived at [internal reference: EON-LABS-DS-SOURCE-2025-Q1-003]
  • Implemented comprehensive benchmarking showing at least 3x improvement in data retrieval speed
  • Developed time boundary handling that follows Binance REST API convention (inclusive start date, exclusive end date)
  • Implemented SHA-256 checksum verification to ensure data integrity and prevent network data drop issues
  • Documentation of performance testing methodology and results archived at [internal reference: EON-LABS-DS-METRICS-2025-Q1-004]
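The 36-hour selection rule described above can be sketched as a simple predicate; the function name and the string labels for the two sources are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Data newer than this threshold is served by the REST API; older data
# comes from the bulk Vision archive (mirroring the 36-hour rule above).
VISION_THRESHOLD = timedelta(hours=36)

def select_source(requested: datetime, now: Optional[datetime] = None) -> str:
    """Pick the REST API for recent data, the Vision archive otherwise."""
    now = now or datetime.now(timezone.utc)
    return "rest" if now - requested < VISION_THRESHOLD else "vision"

now = datetime(2025, 3, 31, 12, 0, tzinfo=timezone.utc)
print(select_source(now - timedelta(hours=2), now))   # rest
print(select_source(now - timedelta(hours=72), now))  # vision
```

In the real system this predicate would sit behind the mediator, combined with the fallback and exponential-backoff logic, so a Vision outage degrades gracefully to REST rather than failing the request.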

Connection to 2024 Feature Engineering Work

This data infrastructure work directly supports and enhances the feature engineering work from 2024:

  • The dynamic lookback framework developed in 2024 requires efficient access to historical data at varying time windows
  • The precise time boundary handling in the new system ensures accurate feature calculation without look-ahead bias
  • The improved data retrieval performance enables more complex feature engineering techniques that were previously impractical
  • The enhanced data integrity validation ensures that feature calculations are based on reliable and accurate market data

Several alternative approaches were investigated and abandoned during this work:

  • Initial REST-Only Approach: Attempted to use the Binance REST API exclusively for all data
    • Failure Reason: Rate limiting and connection stability issues made this impractical for large historical data sets
    • Technical Insight: Led to the development of the dual-source architecture with smart selection logic
    • Documentation: Failure analysis and API limitations documented at [internal reference: EON-LABS-DS-FAIL-2025-Q1-001]
  • File-Based Caching: Tested traditional file-based caching using CSV and Parquet formats
    • Failure Reason: Unacceptable performance degradation with large datasets and high concurrency
    • Technical Insight: Drove the adoption of memory-mapped Arrow files with columnar access
    • Documentation: Comparative performance analysis archived at [internal reference: EON-LABS-DS-FAIL-2025-Q1-002]
  • Standard Checksum Verification: Relied on Binance’s provided checksums for data validation
    • Failure Reason: Discovered systematic issues with checksum verification for historical data files, where valid data content had consistent checksum mismatches correlating with file modification dates
    • Technical Insight: Necessitated development of our own fail-safe system to verify data integrity and potentially generate synthetic data when official sources contained inconsistencies
    • Documentation: Formal complaint filed with Binance at https://github.com/binance/binance-public-data/issues/405 with the following content:
      “We have identified a systematic issue with checksum verification for historical data files from Binance’s public data repository. Our investigation shows that while the data content appears valid, there are consistent checksum mismatches that correlate with file modification dates.”
  • Archived Evidence: Screenshots of the issue and correspondence with Binance support archived at [internal reference: EON-LABS-DS-BINANCE-ISSUE-2025-Q1-001]
  • Rate Limiting Management: Initial implementation struggled with Binance’s rate limitations
    • Failure Reason: Binance restricts each retrieval to 1000 data points with strict endpoint throttling
    • Technical Insight: Led to development of sophisticated caching and batching mechanisms to maximize throughput without triggering API blocking
    • Documentation: Rate limiting patterns and mitigation strategies documented at [internal reference: EON-LABS-DS-RATE-2025-Q1-001]
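The batching-with-backoff mechanism described above can be sketched as follows. The `fetch` callable, the retry count, and the sleep schedule are hypothetical stand-ins; only the 1000-row-per-request cap comes from the text:

```python
import time

LIMIT = 1000  # Binance caps each klines request at 1000 rows

def fetch_range(fetch, start: int, end: int, step_ms: int) -> list:
    """Page through [start, end) in LIMIT-sized batches, retrying each
    batch with exponential backoff on transient connection failures."""
    out = []
    cursor = start
    while cursor < end:
        batch_end = min(cursor + LIMIT * step_ms, end)
        for attempt in range(5):
            try:
                out.extend(fetch(cursor, batch_end))
                break
            except ConnectionError:
                time.sleep(min(2 ** attempt * 0.1, 5.0))  # backoff, capped
        else:
            raise RuntimeError("retries exhausted")
        cursor = batch_end
    return out

# Stand-in fetcher returning one row per minute in [s, e).
def fake_fetch(s, e):
    return list(range(s, e, 60_000))

rows = fetch_range(fake_fetch, 0, 120_000_000, 60_000)
print(len(rows))  # 2000
```

Sizing each batch to exactly `LIMIT * step_ms` keeps every request at the maximum allowed payload, which maximizes throughput per rate-limit token.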

Several technical challenges remain to be addressed in future quarters:

  • Integration of time boundary handling with backtesting systems, as the current implementation follows API conventions but has not yet been validated against backtesting requirements
  • Development of more sophisticated data integrity validation for edge cases in market data
  • Resolving inconsistencies in Binance’s checksum system, potentially requiring synthetic data generation when official sources are unreliable
  • Integration with real-time streaming data sources for live trading applications

The work completed in Q1 2025 has established a solid foundation for these future developments while significantly advancing our capability to efficiently process and analyze financial time series data.
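The half-open time-boundary convention used throughout this work (inclusive start, exclusive end) can be sketched as follows; the function name is an assumption for illustration:

```python
def window_bounds(start_ms: int, end_ms: int, interval_ms: int):
    """Yield contiguous half-open [s, e) sub-windows. Because each
    window excludes its end, adjacent windows never share a bar, which
    is what prevents double-counting and look-ahead bias."""
    s = start_ms
    while s < end_ms:
        e = min(s + interval_ms, end_ms)
        yield (s, e)
        s = e

# Three one-hour windows over a three-hour span: no overlap, no gap.
bounds = list(window_bounds(0, 10_800_000, 3_600_000))
print(bounds)
```

Validating that backtesting systems consume these boundaries with the same half-open semantics is precisely the open integration question noted above.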