
Terry’s Project Details Q1 2025

Source: Notion | Last edited: 2025-04-01 | ID: 1c82d2dc-3ef...


SR&ED Claim – Terry Li’s Work on Advanced Data Infrastructure for Financial Time Series Forecasting - Q1 2025

Building on the foundational work from 2024, the first quarter of 2025 presented increasing challenges in data acquisition, storage, and retrieval for financial time series analysis. Market uncertainty continued from 2024, with even higher volatility creating unprecedented demands on our data infrastructure:

  • Existing data retrieval mechanisms couldn’t efficiently handle the increased volume of high-frequency trading data
  • Standard caching solutions proved inadequate for the ultra-low-latency requirements of our trading models
  • Open-source data integrity tools failed to provide the validation guarantees required for financial data
  • Off-the-shelf APIs lacked the sophistication to handle time boundary precision necessary for accurate backtesting

Existing Technologies and Their Limitations

While the 2024 work focused on dynamic lookback frameworks for feature engineering, we discovered that the underlying data infrastructure itself had become a bottleneck. Existing technologies presented several limitations:

  • Although Binance publishes sample code and API documentation (https://github.com/binance/binance-spot-api-docs), these samples lack robust error handling, rate-limit management, and performance optimization
  • Standard Binance API clients (Python-Binance: https://github.com/sammchardy/python-binance, CCXT: https://github.com/ccxt/ccxt) encountered rate limiting and connection instability issues
  • Conventional data storage formats (CSV, Parquet) couldn’t support the random access patterns needed for our dynamic window calculations
  • The lack of industry standards in cryptocurrency and high-frequency trading domains meant we couldn’t rely on established solutions
  • Third-party data vendors such as Kaiko (https://www.kaiko.com/) failed to provide coherent standards that would meet our requirements

To overcome these limitations, a complete redesign of our Data Source Manager was necessary, resulting in the creation of the Binance Data Services repository (https://github.com/Eon-Labs/binance-data-services). This technological advancement:

  • Implements zero-copy memory-mapped file access using Apache Arrow, a significant departure from traditional file-based approaches
  • Creates an intelligent mediator architecture to dynamically select optimal data sources based on recency and availability
  • Develops a sophisticated multi-stage validation pipeline ensuring data integrity across the entire data lifecycle
  • Introduces comprehensive error handling and rate limit management not available in any standard solution

The development of this system represented a true technological uncertainty, as it required innovative approaches to solve problems not adequately addressed by existing solutions.
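A minimal sketch of how such a multi-stage validation pipeline might be structured; the stage names, the `validate` helper, and the metadata fields here are illustrative assumptions, not the actual API of the Binance Data Services repository:

```python
import hashlib
from typing import Callable, Dict, List

# Each stage inspects the raw bytes plus metadata and raises on failure.
Stage = Callable[[bytes, Dict], None]

def check_nonempty(data: bytes, meta: Dict) -> None:
    if not data:
        raise ValueError("empty payload")

def check_checksum(data: bytes, meta: Dict) -> None:
    # The payload's SHA-256 must match the digest recorded at download time.
    digest = hashlib.sha256(data).hexdigest()
    if digest != meta["sha256"]:
        raise ValueError("checksum mismatch: " + digest)

def check_row_count(data: bytes, meta: Dict) -> None:
    # One record per line; the count must agree with the metadata.
    if data.count(b"\n") != meta["rows"]:
        raise ValueError("row count mismatch")

PIPELINE: List[Stage] = [check_nonempty, check_checksum, check_row_count]

def validate(data: bytes, meta: Dict) -> bool:
    """Run every stage in order; any failure aborts the pipeline."""
    for stage in PIPELINE:
        stage(data, meta)
    return True

payload = b"1,42.0\n2,43.5\n"
meta = {"sha256": hashlib.sha256(payload).hexdigest(), "rows": 2}
print(validate(payload, meta))  # True
```

Keeping each check as an independent callable lets new stages (schema checks, gap detection) be appended without touching existing ones.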

Systematic Investigation & Experimental Development

The development process followed a structured experimental methodology:

  1. Problem Identification - Isolating specific performance and reliability bottlenecks in the data pipeline
  2. Hypothesis Formulation - Proposing novel architectural approaches to address each bottleneck
  3. Prototype Development - Implementing experimental solutions for each component
  4. Performance Testing - Rigorously measuring improvements against quantifiable metrics
  5. Integration and Refinement - Combining successful components into a cohesive system

First Iteration: Memory-Mapped Data Access Architecture

  • Approach: Implemented initial Apache Arrow MMAP integration for data access
  • Technical Details: Developed column-based data storage with zero-copy reads using memory mapping
  • Risks: Memory management complexities and potential memory leaks with large datasets
  • Results: Achieved at least threefold (3x) performance improvement over traditional file access methods but encountered stability issues with concurrent access
  • Documentation: Performance testing results archived at [internal reference: EON-LABS-DS-PERF-2025-Q1-001]
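The repository's Apache Arrow implementation is not reproduced here; the following standard-library sketch illustrates the underlying idea of this iteration: memory-mapping a column of values so that reads are views into the page cache rather than copies. File layout and names are illustrative assumptions.

```python
import mmap
import os
import struct
import tempfile

# Write a small column of float64 prices to disk.
prices = [101.5, 102.0, 99.75, 103.25]
path = os.path.join(tempfile.mkdtemp(), "prices.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(prices), *prices))

# Map the file; slicing a memoryview over an mmap yields zero-copy
# views into the OS page cache, enabling cheap random access.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)
    # Random access to the third value (offset 2 * 8 bytes).
    (third,) = struct.unpack_from("<d", view, 2 * 8)
    print(third)  # 99.75
    view.release()  # must release exported views before closing the map
    mm.close()
```

Apache Arrow applies the same principle to whole columnar tables via `pyarrow.memory_map`, which is what makes the random access patterns of the dynamic window calculations affordable.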

Second Iteration: Unified Cache Management System

  • Approach: Developed specialized caching system with partial data loading capabilities
  • Technical Details: Implemented monthly file organization with JSON metadata tracking and SHA-256 integrity verification
  • Risks: Cache invalidation logic complexity and potential for stale data
  • Results: Reduced repeated data access times from seconds to milliseconds while maintaining data integrity
  • Documentation: Implementation details and testing results archived at [internal reference: EON-LABS-DS-CACHE-2025-Q1-002]
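The monthly-file-plus-JSON-sidecar scheme described above can be sketched as follows; the function names and file layout are assumptions for illustration, not the repository's actual interface:

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def cache_put(symbol: str, month: str, payload: bytes) -> str:
    """Store one month of data plus a JSON sidecar recording its digest."""
    path = os.path.join(CACHE_DIR, "%s-%s.bin" % (symbol, month))
    with open(path, "wb") as f:
        f.write(payload)
    meta = {"sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload)}
    with open(path + ".json", "w") as f:
        json.dump(meta, f)
    return path

def cache_get(symbol: str, month: str) -> bytes:
    """Return cached bytes, verifying SHA-256 integrity before use."""
    path = os.path.join(CACHE_DIR, "%s-%s.bin" % (symbol, month))
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".json") as f:
        meta = json.load(f)
    if hashlib.sha256(payload).hexdigest() != meta["sha256"]:
        raise ValueError("cache entry corrupted")
    return payload

cache_put("BTCUSDT", "2025-01", b"kline,kline\n")
print(cache_get("BTCUSDT", "2025-01"))
```

Verifying the digest on every read is what lets repeated accesses stay in the millisecond range without sacrificing the integrity guarantees the trading models depend on.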

Third Iteration: Intelligent Data Source Selection

  • Approach: Created dynamic source selection between REST and Vision APIs based on data age and availability
  • Technical Details: Implemented 36-hour threshold logic with automatic fallback mechanisms and exponential backoff retry
  • Risks: Race conditions between sources and potential for incomplete data with API failures
  • Technical Challenges: Aligning data terminologies between different sources and designing for extensibility to other exchanges and data types
  • Results: Successfully balanced data freshness and retrieval performance while enhancing system resilience
  • Documentation: Source selection algorithm and performance metrics archived at [internal reference: EON-LABS-DS-SOURCE-2025-Q1-003]
  • Implemented comprehensive benchmarking showing at least 3x improvement in data retrieval speed
  • Developed time boundary handling that follows Binance REST API convention (inclusive start date, exclusive end date)
  • Implemented SHA-256 checksum verification to ensure data integrity and prevent network data drop issues
  • Documentation of performance testing methodology and results archived at [internal reference: EON-LABS-DS-METRICS-2025-Q1-004]
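The 36-hour selection rule described above can be sketched as a simple predicate; the function name and the string labels for the two sources are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Data newer than this threshold is served by the REST API; older data
# comes from the bulk Vision archive (mirroring the 36-hour rule above).
VISION_THRESHOLD = timedelta(hours=36)

def select_source(requested: datetime, now: Optional[datetime] = None) -> str:
    """Pick the REST API for recent data, the Vision archive otherwise."""
    now = now or datetime.now(timezone.utc)
    return "rest" if now - requested < VISION_THRESHOLD else "vision"

now = datetime(2025, 3, 31, 12, 0, tzinfo=timezone.utc)
print(select_source(now - timedelta(hours=2), now))   # rest
print(select_source(now - timedelta(hours=72), now))  # vision
```

In the real system this predicate would sit behind the mediator, combined with the fallback and exponential-backoff logic, so a Vision outage degrades gracefully to REST rather than failing the request.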

Connection to 2024 Feature Engineering Work

This data infrastructure work directly supports and enhances the feature engineering work from 2024:

  • The dynamic lookback framework developed in 2024 requires efficient access to historical data at varying time windows
  • The precise time boundary handling in the new system ensures accurate feature calculation without look-ahead bias
  • The improved data retrieval performance enables more complex feature engineering techniques that were previously impractical
  • The enhanced data integrity validation ensures that feature calculations are based on reliable and accurate market data

Several alternative approaches were investigated and abandoned during this work:

  • Initial REST-Only Approach: Attempted to use the Binance REST API exclusively for all data
    • Failure Reason: Rate limiting and connection stability issues made this impractical for large historical data sets
    • Technical Insight: Led to the development of the dual-source architecture with smart selection logic
    • Documentation: Failure analysis and API limitations documented at [internal reference: EON-LABS-DS-FAIL-2025-Q1-001]
  • File-Based Caching: Tested traditional file-based caching using CSV and Parquet formats
    • Failure Reason: Unacceptable performance degradation with large datasets and high concurrency
    • Technical Insight: Drove the adoption of memory-mapped Arrow files with columnar access
    • Documentation: Comparative performance analysis archived at [internal reference: EON-LABS-DS-FAIL-2025-Q1-002]
  • Standard Checksum Verification: Relied on Binance’s provided checksums for data validation
    • Failure Reason: Discovered systematic issues with checksum verification for historical data files, where valid data content had consistent checksum mismatches correlating with file modification dates
    • Technical Insight: Necessitated development of our own fail-safe system to verify data integrity and potentially generate synthetic data when official sources contained inconsistencies
    • Documentation: Formal complaint filed with Binance at https://github.com/binance/binance-public-data/issues/405 with the following content:
      “We have identified a systematic issue with checksum verification for historical data files from Binance’s public data repository. Our investigation shows that while the data content appears valid, there are consistent checksum mismatches that correlate with file modification dates.”
  • Archived Evidence: Screenshots of the issue and correspondence with Binance support archived at [internal reference: EON-LABS-DS-BINANCE-ISSUE-2025-Q1-001]
  • Rate Limiting Management: Initial implementation struggled with Binance’s rate limitations
    • Failure Reason: Binance restricts each retrieval to 1000 data points with strict endpoint throttling
    • Technical Insight: Led to development of sophisticated caching and batching mechanisms to maximize throughput without triggering API blocking
    • Documentation: Rate limiting patterns and mitigation strategies documented at [internal reference: EON-LABS-DS-RATE-2025-Q1-001]
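The batching-with-backoff mechanism described above can be sketched as follows. The `fetch` callable, the retry count, and the sleep schedule are hypothetical stand-ins; only the 1000-row-per-request cap comes from the text:

```python
import time

LIMIT = 1000  # Binance caps each klines request at 1000 rows

def fetch_range(fetch, start: int, end: int, step_ms: int) -> list:
    """Page through [start, end) in LIMIT-sized batches, retrying each
    batch with exponential backoff on transient connection failures."""
    out = []
    cursor = start
    while cursor < end:
        batch_end = min(cursor + LIMIT * step_ms, end)
        for attempt in range(5):
            try:
                out.extend(fetch(cursor, batch_end))
                break
            except ConnectionError:
                time.sleep(min(2 ** attempt * 0.1, 5.0))  # backoff, capped
        else:
            raise RuntimeError("retries exhausted")
        cursor = batch_end
    return out

# Stand-in fetcher returning one row per minute in [s, e).
def fake_fetch(s, e):
    return list(range(s, e, 60_000))

rows = fetch_range(fake_fetch, 0, 120_000_000, 60_000)
print(len(rows))  # 2000
```

Sizing each batch to exactly `LIMIT * step_ms` keeps every request at the maximum allowed payload, which maximizes throughput per rate-limit token.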

Several technical challenges remain to be addressed in future quarters:

  • Integration of time boundary handling with backtesting systems, as the current implementation follows API conventions but has not yet been validated against backtesting requirements
  • Development of more sophisticated data integrity validation for edge cases in market data
  • Resolving inconsistencies in Binance’s checksum system, potentially requiring synthetic data generation when official sources are unreliable
  • Integration with real-time streaming data sources for live trading applications

The work completed in Q1 2025 has established a solid foundation for these future developments while significantly advancing our capability to efficiently process and analyze financial time series data.
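The half-open time-boundary convention used throughout this work (inclusive start, exclusive end) can be sketched as follows; the function name is an assumption for illustration:

```python
def window_bounds(start_ms: int, end_ms: int, interval_ms: int):
    """Yield contiguous half-open [s, e) sub-windows. Because each
    window excludes its end, adjacent windows never share a bar, which
    is what prevents double-counting and look-ahead bias."""
    s = start_ms
    while s < end_ms:
        e = min(s + interval_ms, end_ms)
        yield (s, e)
        s = e

# Three one-hour windows over a three-hour span: no overlap, no gap.
bounds = list(window_bounds(0, 10_800_000, 3_600_000))
print(bounds)
```

Validating that backtesting systems consume these boundaries with the same half-open semantics is precisely the open integration question noted above.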