Data Pre-processing

Source: Notion | Last edited: 2022-12-21 | ID: 522e70da-5da...


  1. Polars is faster and easier to learn than Pandas and Dask.
  2. Polars cannot import from a folder containing multiple CSV files. Use Dask for that instead.
  3. Use a Python dictionary when looping row by row; it is faster than looping over a dataframe. A dataframe is fastest only for vectorized manipulation.
  4. A TradingView indicator written in Pine Script can be converted to a Python script.
  • Range bars are a useful way to make hellishly large tick data files smaller and more manageable.
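
As a rough illustration of the range-bar idea, the sketch below collapses a stream of ticks into bars that close once their high-low range reaches a fixed size. The function name `build_range_bars`, the `(timestamp, price)` tick layout and the dictionary keys are assumptions made for this example, not names from the original notes. It loops over plain tuples and dictionaries rather than dataframe rows, in line with item 3.

```python
def build_range_bars(ticks, bar_range):
    """Collapse (timestamp, price) ticks into range bars.

    Simplified convention (an assumption): the tick that stretches the
    bar's high-low range to `bar_range` or more closes the bar, and the
    next tick opens a new one. A trailing unfinished bar is dropped.
    """
    bars = []
    current = None
    for ts, price in ticks:
        if current is None:
            # First tick of a new bar sets open, high and low.
            current = {"open_time": ts, "open": price,
                       "high": price, "low": price, "ticks": 0}
        current["high"] = max(current["high"], price)
        current["low"] = min(current["low"], price)
        current["close"] = price
        current["close_time"] = ts
        current["ticks"] += 1
        if current["high"] - current["low"] >= bar_range:
            bars.append(current)
            current = None
    return bars


if __name__ == "__main__":
    demo_ticks = [(1, 100), (2, 101), (3, 102), (4, 103), (5, 102), (6, 101)]
    for bar in build_range_bars(demo_ticks, bar_range=2):
        print(bar)
```

Many ticks end up in a single bar when the price is quiet, which is where the size reduction comes from. Production range-bar builders often cap each bar at exactly the range and split large jumps across several bars; this sketch does not.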

Zigzag (unevenly distributed non-repainting high/low pivots)

  • Uneven: Unlike a moving average over the last n bars, Zigzag pivot points appear at uneven intervals in a time series
  • Non-repainting: The classical Zigzag indicator repaints because its end points are uncertain
  • Stand-alone: Proven to be significant by NST
  • Besides serving as a feature input itself, Zigzag is potentially an ideal candidate for supplying more feature columns within every range bar. Think of it as a matryoshka doll (套娃).
  • dateUnix, used as the reference index for joining the dataframe
  • a, the bar index (loop position) WHEN a new pivot is found
  • i, the bar index (loop position) WHERE the pivot is located
  • p, price level of the pivot
  • h, **True** means a Zigzag pivot High, **False** means a Zigzag pivot Low
  • h1, created by shift(2), meaning the previous pivot high (pivots alternate between high and low, so the previous pivot of the same kind sits two rows back)
  • w, higher water mark magnitude: the current pivot high minus the previous pivot high, or the current pivot low minus the previous pivot low
  • std = rolling_std(100), qua = rolling_quantile(100), skw = rolling_skew(100)
    • they are used to scale w (higher water mark) and the distance between pivots
  • The current bar position in TradingView Pine is translated to Series[bi_], where bi_ is the bar index of the Python loop
  • A Pine reference to a number of bars back, written Series[reference_position], is translated to:
    • list[bi_ - reference_position]
    • where list is a Python list
  • Feature scaling https://en.wikipedia.org/wiki/Feature_scaling
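
To make the pivot columns above concrete, here is a minimal sketch of a confirm-on-reversal Zigzag that emits `dateUnix`, `a`, `i`, `p`, `h` and then derives `w`. It is an illustration under stated assumptions, not the original implementation: the function names `zigzag_pivots` and `add_water_mark` are hypothetical, a single price list stands in for bar highs and lows, the threshold is given in percent, and `dateUnix` is taken from the bar that confirms the pivot. Because a pivot is only emitted after the price has reversed by the threshold, emitted rows are never revised, which is one way to avoid repainting.

```python
def zigzag_pivots(date_unix, prices, deviation_pct):
    """Return confirmed pivots as rows with keys dateUnix, a, i, p, h."""
    dev = deviation_pct / 100.0
    pivots = []
    direction = 0      # 0 = no leg yet, 1 = up leg, -1 = down leg
    hi_i = lo_i = 0    # bar indices of the running high and low
    for bi_ in range(1, len(prices)):
        price = prices[bi_]  # Pine's current bar maps to prices[bi_]
        if price > prices[hi_i]:
            hi_i = bi_
        if price < prices[lo_i]:
            lo_i = bi_
        if direction != -1 and price <= prices[hi_i] * (1 - dev):
            # Price fell far enough from the running high: confirm a pivot high.
            found_i, is_high, direction = hi_i, True, -1
        elif direction != 1 and price >= prices[lo_i] * (1 + dev):
            # Price rose far enough from the running low: confirm a pivot low.
            found_i, is_high, direction = lo_i, False, 1
        else:
            continue
        pivots.append({
            "dateUnix": date_unix[bi_],  # assumption: time of the confirming bar
            "a": bi_,                    # WHEN the pivot was found
            "i": found_i,                # WHERE the pivot is located
            "p": prices[found_i],
            "h": is_high,
        })
        hi_i = lo_i = bi_  # restart extreme tracking from the confirming bar
    return pivots


def add_water_mark(pivots):
    """Add w: pivot price minus the previous pivot of the same kind."""
    for k, row in enumerate(pivots):
        # Highs and lows alternate, so the same kind is two rows back,
        # which is what shift(2) expresses on a dataframe.
        row["w"] = row["p"] - pivots[k - 2]["p"] if k >= 2 else None
    return pivots


if __name__ == "__main__":
    prices = [100, 102, 104, 103, 101, 99, 100, 103, 106, 104, 101]
    for row in add_water_mark(zigzag_pivots(list(range(len(prices))), prices, 3.0)):
        print(row)
```

The gap between `a` and `i` is the confirmation lag: the pivot sits at bar `i` but only becomes known at bar `a`, so any feature joined on `dateUnix` should use the confirming bar to avoid look-ahead.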

  • Increase the frequency of changes, e.g. use a smaller deviation percentage threshold in Zigzag (0.618% instead of 1.618%)
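
The effect of the threshold can be checked with a small counter. The sketch below is a simplified, close-only reversal counter written for this note; the name `count_pivots` and the toy price list are assumptions, and it is not the full Zigzag logic.

```python
def count_pivots(prices, deviation_pct):
    """Count confirmed reversals of at least `deviation_pct` percent."""
    dev = deviation_pct / 100.0
    hi = lo = prices[0]   # running extremes since the last confirmation
    direction = 0         # 0 = no leg yet, 1 = up leg, -1 = down leg
    count = 0
    for price in prices[1:]:
        hi = max(hi, price)
        lo = min(lo, price)
        if direction != -1 and price <= hi * (1 - dev):
            count += 1            # a pivot high is confirmed
            direction = -1
            hi = lo = price
        elif direction != 1 and price >= lo * (1 + dev):
            count += 1            # a pivot low is confirmed
            direction = 1
            hi = lo = price
    return count


if __name__ == "__main__":
    toy_prices = [100.0, 101.0, 100.0, 101.0, 100.0, 102.0, 100.0]
    print(count_pivots(toy_prices, 0.618), count_pivots(toy_prices, 1.618))
```

On this toy series the 1% swings count as reversals at 0.618% but are ignored at 1.618%, so the smaller threshold produces more pivots from the same data.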

  • Pre-decompress very large zip data into CSV first.

    • Reading the Binance trade-by-trade data files (several hundred megabytes each) straight from zip with the Python zipfile module, in order to conserve HDD space, failed: opening and decompressing such huge zip files takes too long. Make decompression a separate procedure, and run the processing on CSV files that have already been decompressed.
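
One way to keep decompression as its own step is a small stand-alone routine that is run once before any processing. This is a sketch under assumptions: the function name `predecompress` and the folder arguments are hypothetical, and it still uses the standard-library zipfile module for portability, so an external unzip tool can be swapped in if that is faster on a given machine.

```python
import shutil
import zipfile
from pathlib import Path


def predecompress(zip_dir, csv_dir):
    """Extract every CSV member of every .zip in zip_dir into csv_dir.

    Files that already exist in csv_dir are skipped, so the step can be
    re-run cheaply. Returns the list of newly written paths.
    """
    zip_dir, csv_dir = Path(zip_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if not member.lower().endswith(".csv"):
                    continue
                target = csv_dir / Path(member).name
                if target.exists():
                    continue  # decompressed on an earlier run
                # Stream in chunks so a huge member never sits fully in memory.
                with archive.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
    return extracted
```

The processing code then reads only from the CSV folder and never touches the archives, at the cost of the extra disk space.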
  • Train, Validation, Test Datasets