Why we do novel data reduction (feature extraction) for OHLCV in sec(s) & <5min?

Source: Notion | Last edited: 2025-03-26 | ID: 1062d2dc-3ef...


Technical Explanation of Feature Engineering for High-Frequency OHLCV Data

In the realm of financial data analysis, OHLCV—which stands for Open, High, Low, Close, and Volume—serves as a fundamental dataset for modeling and predicting market movements. When dealing with high-frequency OHLCV data captured in seconds and under five minutes, feature engineering becomes a critical process to harness the raw data’s potential while mitigating inherent challenges.

Pros and Cons of High-Frequency OHLCV Data

High-frequency OHLCV data offers substantial advantages:

  • Raw and Informative: The granularity of data captured every second or within five-minute intervals provides a detailed, comprehensive view of market dynamics. This richness enables the detection of subtle patterns and immediate market reactions that lower-frequency data would miss.

However, this abundance of data also introduces a significant drawback:

  • Too Much Noise: High-frequency data is riddled with random fluctuations and transient anomalies that do not reflect meaningful market trends. This noise complicates the extraction of the reliable signals needed for accurate prediction and decision-making.
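To make the noise problem concrete, here is a minimal sketch using synthetic per-second prices (not real market data): even a simple rolling mean strips out most of the bar-to-bar variance, leaving the underlying trend.

```python
import random

# Synthetic per-second "closes": a slow drift plus Gaussian noise.
random.seed(42)
trend = [100 + 0.01 * i for i in range(300)]
closes = [p + random.gauss(0, 0.5) for p in trend]

def rolling_mean(xs, window):
    """Simple moving average; the first window-1 points have no value."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

smooth = rolling_mean(closes, 30)

# Bar-to-bar changes of the raw series are far noisier than the smoothed one.
raw_diffs = [b - a for a, b in zip(closes, closes[1:])]
smooth_diffs = [b - a for a, b in zip(smooth, smooth[1:])]
print(variance(raw_diffs) > variance(smooth_diffs))  # smoothing shrinks the noise
```

The smoothing comes at a price, though: the averaged series reacts to genuine moves with a delay, which is exactly the immediacy-versus-noise trade-off discussed next.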

The excessive noise in high-frequency OHLCV data necessitates careful consideration of two primary trade-offs:

  1. Immediacy vs. Noise:
  • High Predictive Power: Leveraging the most recent data can enhance the accuracy of short-term predictions by capturing immediate market movements.
  • Increased Noise: However, the same immediacy introduces more noise, making it challenging to distinguish genuine signals from random fluctuations.
  2. Profit vs. Transaction Costs:
  • More Trading Opportunities: High-frequency trading can capitalize on numerous short-term opportunities, potentially increasing profitability.
  • Higher Transaction Costs: Conversely, the frequency of trades leads to increased costs, such as brokerage fees and slippage, which can erode overall profits.
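The second trade-off can be sketched with deliberately made-up numbers: if the total edge available per day is roughly fixed but every trade pays a fixed cost, net profit falls as trade count rises. All figures below are illustrative assumptions, not market estimates.

```python
def net_pnl(trades_per_day, daily_edge, cost_per_trade):
    """Net daily P&L when a fixed pool of edge is split across more trades.

    Per-trade edge shrinks as frequency rises, but the per-trade cost
    (fees + slippage) does not -- so costs eventually dominate.
    """
    per_trade_edge = daily_edge / trades_per_day
    return trades_per_day * (per_trade_edge - cost_per_trade)

# Hypothetical: $50/day of gross edge, $0.30 round-trip cost per trade.
for n in (10, 100, 1000):
    print(n, round(net_pnl(n, 50.0, 0.30), 2))
# Net P&L declines as trade count grows; at 1000 trades/day it goes negative.
```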

Feature Extraction as a Mitigation Strategy

To address these trade-offs, Feature Extraction emerges as a pivotal solution. By transforming raw OHLCV data into meaningful features, we can enhance signal quality and optimize trading strategies. Feature extraction encompasses various methodologies, each with its own set of advantages and challenges.

Classical Technical Analysis (TA) involves using technical indicators derived from price and volume data to predict future market movements. Common indicators include Moving Averages (MA), Moving Average Convergence Divergence (MACD), and the Relative Strength Index (RSI).
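For illustration, here is a pure-Python sketch of one such indicator, RSI with Wilder's smoothing; in practice one would call TA-Lib's implementation instead, and the 14-bar period below is just the conventional default.

```python
def rsi(closes, period=14):
    """Relative Strength Index with Wilder's smoothing (pure-Python sketch)."""
    gains, losses = [], []
    for prev, cur in zip(closes, closes[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    # Seed the averages over the first `period` bars, then smooth recursively.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    out = []
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
        rs = avg_gain / avg_loss if avg_loss else float("inf")
        out.append(100.0 - 100.0 / (1.0 + rs))  # bounded in [0, 100]
    return out  # one value per bar after the warm-up period
```

A steadily rising series pins the RSI at 100, a steadily falling one at 0, and choppy data lands in between, which is what makes it usable as a bounded overbought/oversold feature.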

  • Pros:
    • TA-Lib Availability: A widely-used library that offers a comprehensive suite of technical indicators, facilitating their implementation and integration into models.
  • Cons:
    1. Priced-In Effect / α Decay:
    • Issue: Technical indicators may lose their effectiveness over time as market conditions evolve.
    • Solutions:
      • Novelties: Techniques such as anchor-rebasing and dynamic lookback periods are employed to adapt indicators to changing market environments, thereby mitigating α decay.
    2. Too Lagging:
    • Issue: Indicators often rely on historical data, causing delays in signal generation.
    • Solutions:
      • Weight Recent Data More Heavily: Using Exponential Moving Averages (EMA) instead of Simple Moving Averages (SMA) gives recent prices more influence, improving responsiveness and reducing lag.
    3. Still Too Much Noise:
    • Issue: Even with technical indicators, high-frequency data remains noisy due to rapid price fluctuations.
    • Solutions:
      • Smoothing by Reducing Variance: Implementing techniques such as cost-adjusted smoothing helps minimize noise, leading to clearer and more reliable signals.
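The EMA-versus-SMA point is easy to demonstrate: after a step change in price, an EMA (with the conventional alpha = 2/(window+1)) closes the gap to the new level faster than an SMA of the same window. The price series below is synthetic.

```python
def sma(xs, window):
    """Simple moving average; output starts at input index window-1."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def ema(xs, window):
    """Exponential moving average, seeded on the first value."""
    alpha = 2.0 / (window + 1)
    out = [xs[0]]
    for x in xs[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

# Price jumps from 100 to 110 at bar 50; compare the averages 10 bars later.
prices = [100.0] * 50 + [110.0] * 50
i = 59
e = ema(prices, 20)[i]
s = sma(prices, 20)[i - 19]  # align SMA output with input index i
print(round(e, 2), round(s, 2))  # the EMA sits closer to the new 110 level
```

The SMA still averages ten pre-jump bars, so it sits exactly halfway at 105, while the EMA has already discounted most of the stale history; that responsiveness is the "reducing lag" benefit, at the cost of passing more noise through.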

Classical Normalization encompasses techniques like min-max scaling, z-score normalization, ARIMA, and one-hot encoding to preprocess data for machine learning models.

  • Pros:
    • Python Modules Widely Available: Extensive support through libraries such as scikit-learn makes these techniques easily accessible and implementable.
  • Neutral Aspect:
    • Final Touch-Up for ML Training Only: While normalization is essential for preparing data for machine learning algorithms, it does not inherently address the underlying noise in high-frequency data. Instead, it serves as a preprocessing step to ensure that features are on a comparable scale, enhancing model performance.
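In practice one would reach for scikit-learn's `StandardScaler` and `MinMaxScaler`; the pure-Python sketch below just shows what the two transforms compute, applied to a made-up volume column.

```python
def zscore(xs):
    """Z-score normalization: shift to zero mean, scale to unit variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def minmax(xs):
    """Min-max scaling onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Example feature column, e.g. per-bar volume (illustrative numbers).
volumes = [120.0, 90.0, 300.0, 45.0, 210.0]
z = zscore(volumes)
m = minmax(volumes)
```

Note that both transforms only rescale the feature; the bar-to-bar noise survives unchanged, which is why normalization is a final touch-up rather than a substitute for feature extraction.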

Feature engineering for high-frequency OHLCV data is a balancing act between leveraging the richness of raw data and mitigating the challenges posed by excessive noise. By employing robust feature extraction techniques, such as Classical Technical Analysis and Normalization, analysts can enhance predictive power and optimize trading strategies while managing transaction costs. Additionally, thoughtful visualization practices ensure that complex processes are communicated effectively, fostering better understanding and informed decision-making across diverse audiences.

This comprehensive approach not only harnesses the full potential of high-frequency data but also addresses the practical limitations inherent in its application, paving the way for more accurate and efficient financial modeling and trading strategies.

https://mermaid.ink/svg/pako:eNqtVv9u4jgQfhUrq13tSoGD0NIqe1oJhVZFKj9U0P1xx-lkHAd8TezIdtqmVR9qX-Se6WacEKBwvX8KKk3smfk-j78Z-8VjKuZe6K01zTdkMVxKAp_Pn0nEpdU0JROYJ0xlKyGFXJPpzW30m09mWhmfUBmTSElTObmpP5ae-0-EJIazr-Yb-UK-pPb7eSbkryv9A__QO8SHO_pIfiEjmSidUSseOA5iRDe7UIpkBdsQqYThS-_PCqehuNA05i2VJIZQLQzSS7TK3lIirdYPMhoDs1GW8VhQVpIH00aEyUHgnfVsAdbAMhF2awpg0lBmhZIAYKw55nMQnlSx3dxoXHOYQdgbsd5A_sCSbVc8U49cN_Ea6wgpS6Y5NTwmk9NJ2LEkxxSd0WxRrwnRx0pzlzjIFkJP81xpW0hhBd8tqXGJasJc70evdulkDq45tQVAXD2Beioq1JC5Sgt83i4wcuGvryD6scMei63d_4OMud2ouF7x9ZXzWwwgfpRSYwQDJS8GTlXjgU_Gg2jok7v5yCfcsgaw9osmB34TFGcqnikCuRAg5VZGn3zy3DIMEurj4OBuhKGV5K2Nsgdxd0W1R8bV0JsSgtGK-KzSX10Hg9atWBH6QEVKV-lOBI151EXG27qZafEMgoEK5EnCmYUS--cniTmj5QnXYN8VSy6l6zWo44Rpb990bkWaEntQozgeFxxGSa4F48Q4I7ahcu0mV07KsF-cgk_dIJSuOgSBvJKYWkpyJaQ9Tt9EPfAUhUoKGaMkYeFbjlHXsZxMsQU1ho4plWyjdEtzQIeFud2KS0kzwUiq1P2KsnsnhPYx5JDnHLCkRdIZFo_mjEuocUe04RE0PALHY3iFW3jkjdD7ARy_r1TSVK3LkFyNB7BtxnIKSUrIfDz4dkxpnillN9jtViWwiQuGzw_QAqm0e6npNZR6jtJ8PJ0uboDVyQBIZBujYsXb67YPrd_Yv2j8dwGs4m_vSfqgTnbqnvDCHSQDk4MWK79oUhfagc5nJdQw5ikuUtjjRxHztDwh-8Yb-0cd3gW4FpBIEB_qschxBA4WMr4lYFEdX0qm5YmM2jJ1pwdYsxSSYEtHnYHWBawbdquyZbjYIU9A3bC6BLQdforPkrM49o3V6p6HnwJ-EfeC-rUFa7CbMMihVyRK2paBygy7Z_i-Uho2qoWtuDBht5M_fX8DgvA1CEDAtwFh_eAyuPwQEFO35gYpSXqrToOUnF8k3YsPQZK1EGqcDn4bHNrB78ekrb65NOuBT4PTcZ_3cfrN-yOHs8-GK5XG7yA3OhrkeVqikBwXbvaI1ZeLmts-ZTK7m859aPk-lIMT1sFsNJ3MfTgxfTgNfWxz-BP4rqpRIAfGcIA1-3kwAW3Rh5bkVz3gP4ygoLabBOOe72UcKlrEcD18QbulZzc8gzoM4THFzCy9pXwFQ1pYNS8l80KrC-57WhXrjRcmNDXwVuTQ6vhQULhkZs0oXICs0uPq-uluob6XU_m7Utk2DLx64Yv35MHeB-3gon92edYPgqDX7fpeCaNt3MxecNk575_3-p3g8tX3nl2A7uu_fyx4mw

graph TD
%% Central Node combining OHLCV, Pros, and Cons
OHLCV["OHLCV in sec(s) & <5min<br><br>Pros:<br>Raw / Informative<br>Cons:<br>Too much noise"]
%% Trade-offs arising from Cons
OHLCV --> IM["Immediacy vs.<br>Noise"]
OHLCV --> PT["Profit vs.<br>Transaction Costs"]
%% Immediacy vs. Noise
IM --> IMP["High Predictive<br>Power"]
IM --> IMC["Increased Noise"]
%% Profit vs. Transaction Costs
PT --> PTP["More Trading<br>Opportunities"]
PT --> PTC["Higher Transaction<br>Costs"]
%% Feature Extraction as Solution
IMC --> FE["Feature Extraction"]
PTC --> FE
%% Feature Extraction Methods
FE --> TA["Classical TA:<br>MA, MACD, RSI, etc"]
FE --> CN["Classical Normalization:<br>min-max, z-score,<br>ARIMA, one-hot etc"]
%% Classical TA Pros and Cons
TA --> TAP["Pros:<br>TA-Lib available"]
TA --> TAC1["Cons:<br>Priced-in effect / α decay"]
TA --> TAC2["Cons:<br>Too lagging"]
TA --> TAC3["Cons:<br>Still too much noise<br>since prices still move<br>with every sec(s) or <5 min data point"]
%% Novelties under TAC1
TAC1 --> NOV["Novelties:<br>anchor-rebasing,<br>dynamic lookback, etc."]
%% Dependent on more recency data under TAC2
TAC2 --> DEP["Dependent on more<br>recency data:<br>(analogy: EMA instead of SMA)"]
%% Smoothing by reducing variants under TAC3
TAC3 --> SMOOTH["Smoothing by reducing<br>variants:<br>(e.g., cost_adjusted)"]
%% Classical Normalization Pros and Neutral Aspect
CN --> CNP["Pros:<br>Python modules widely available"]
CN --> CNE["Neutral:<br>Final touch up<br>for ML training only"]
%% Styling for dark theme
classDef pros fill:#2e7d32,stroke:#4caf50,stroke-width:2px,font-size:14px,color:#ffffff;
classDef cons fill:#c62828,stroke:#ef5350,stroke-width:2px,font-size:14px,color:#ffffff;
classDef solutions fill:#f57f17,stroke:#ffa000,stroke-width:2px,font-size:14px,color:#ffffff;
classDef neutral fill:#616161,stroke:#9e9e9e,stroke-width:2px,font-size:14px,color:#ffffff;
classDef central fill:#424242,stroke:#ffffff,stroke-width:2px,font-size:16px,font-weight:bold,color:#ffffff;
class OHLCV central;
class TAP,CNP pros;
class IMC,PTC,TAC1,TAC2,TAC3 cons;
class FE solutions;
class NOV,DEP,SMOOTH solutions;
class CNE neutral;