Data Pre-processing

Source: Notion | Last edited: 2022-12-21 | ID: 522e70da-5da...

Polars is faster and easier to learn than Pandas and Dask.
Importing a folder of multiple CSV files did not work with Polars here, so Dask was used instead. Note that Polars' scan_csv accepts glob patterns such as folder/*.csv.
For row-by-row loops, use a Python dictionary instead of a dataframe; looping over plain Python objects is faster. A dataframe is fastest only for vectorized manipulation.
TradingView indicators written in Pine Script can be converted to Python scripts.
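A minimal sketch of the dictionary-over-dataframe note above, using only the standard library. The column names and values are made up for illustration; with a real dataframe, the dictionary of lists would come from whatever export call the dataframe library offers (for example pandas' DataFrame.to_dict("list"), which is an assumption since the notes do not name one).

```python
# Hypothetical column data; in practice this would be exported from a dataframe.
columns = {
    "price": [100.0, 101.5, 99.8, 102.3],
    "qty": [2.0, 1.0, 4.0, 3.0],
}


def running_notional(cols):
    """Row-by-row loop over plain Python lists, accumulating price * qty."""
    total = 0.0
    out = []
    for price, qty in zip(cols["price"], cols["qty"]):
        total += price * qty
        out.append(total)
    return out


result = running_notional(columns)
```

The loop touches only Python floats and lists, so there is no per-row dataframe indexing overhead.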

Data Representation

Range bars are a useful way to make hellishly large tick data files smaller and more manageable.
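As a rough illustration of how range bars shrink tick data, here is a simplified sketch. It assumes a bar closes as soon as its high minus its low reaches a fixed range_size; real range-bar definitions vary, and the function name, field names, and tick values are hypothetical.

```python
def build_range_bars(prices, range_size):
    """Group consecutive tick prices into bars.

    Simplifying assumption: a bar closes once high - low >= range_size.
    A trailing bar that never reaches the range is dropped.
    """
    bars = []
    bar = None
    for price in prices:
        if bar is None:
            bar = {"open": price, "high": price, "low": price,
                   "close": price, "ticks": 0}
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["ticks"] += 1
        if bar["high"] - bar["low"] >= range_size:
            bars.append(bar)
            bar = None
    return bars


# Nine made-up ticks collapse into two completed bars.
ticks = [100.0, 101.0, 102.0, 103.0, 102.0, 101.0, 100.0, 99.0, 100.0]
bars = build_range_bars(ticks, range_size=2.0)
```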

Uneven: unlike a moving average of the last n bars, Zigzag pivot points show up at uneven intervals in a time series
Non-repainting: the classical Zigzag indicator repaints because its end points are uncertain
Stand-alone: proven to be significant by NST
Apart from being a feature input itself, Zigzag is potentially an ideal candidate for offering more feature columns within every range bar. Think of it as a matryoshka doll (套娃).
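One way to get a Zigzag that does not repaint is to record a pivot only once price has reversed from the candidate extreme by the deviation threshold, and never to revise it afterwards. The sketch below is an assumption about how that could look, not the implementation behind these notes; the function name and sample prices are hypothetical, and the record keys a, i, p, h follow the column names used in these notes.

```python
def zigzag_pivots(prices, deviation_pct):
    """Confirm pivots only after a reversal of deviation_pct percent.

    Assumes positive prices and deviation_pct > 0.
    """
    threshold = deviation_pct / 100.0
    pivots = []
    hi_i = lo_i = 0   # locations of the candidate high and candidate low
    direction = 0     # 0 undecided, +1 seeking a high, -1 seeking a low
    for a, price in enumerate(prices):
        if direction >= 0 and price > prices[hi_i]:
            hi_i = a
        if direction <= 0 and price < prices[lo_i]:
            lo_i = a
        if direction >= 0 and price <= prices[hi_i] * (1 - threshold):
            # Price fell far enough from the candidate high: confirm it.
            pivots.append({"a": a, "i": hi_i, "p": prices[hi_i], "h": True})
            direction = -1
            lo_i = a
        elif direction <= 0 and price >= prices[lo_i] * (1 + threshold):
            # Price rose far enough from the candidate low: confirm it.
            pivots.append({"a": a, "i": lo_i, "p": prices[lo_i], "h": False})
            direction = 1
            hi_i = a
    return pivots


prices = [100.0, 102.0, 105.0, 103.0, 101.0, 99.0,
          100.0, 103.0, 106.0, 104.0, 102.0]
pivots = zigzag_pivots(prices, deviation_pct=3.0)
```

Each record is appended at bar a, strictly after the pivot's own bar i, so a pivot is only known with a delay but is never moved once recorded.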

dateUnix, the reference index used to join the dataframes
a, the bar index (loop location) WHEN a new pivot is found
i, the bar index (loop location) WHERE the pivot is located
p, price level of the pivot
h, **True** means a Zigzag pivot High, **False** means a Zigzag pivot Low
h1, created by shift(2), meaning the previous pivot high
w, the higher water mark magnitude: the current pivot high minus the previous pivot high, or the current pivot low minus the previous pivot low
std = rolling_std(100), qua = rolling_quantile(100), skw = rolling_skew(100)
- they are used to scale w (the higher water mark) and the distance between pivots
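The w and std columns above can be sketched in plain Python. This assumes pivots strictly alternate between high and low, so that stepping back two rows (the shift(2) in the notes) reaches the previous pivot of the same type. It also assumes the rolling standard deviation is taken over w itself, which the notes do not state. The function names and sample prices are hypothetical, and a window of 3 stands in for the 100 used in the notes.

```python
import statistics


def water_mark(pivot_prices):
    """w[k] = p[k] - p[k-2]; None where no earlier same-type pivot exists."""
    return [None if k < 2 else pivot_prices[k] - pivot_prices[k - 2]
            for k in range(len(pivot_prices))]


def rolling_std(values, window):
    """Sample standard deviation over the trailing window; None until full."""
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - window + 1): k + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(statistics.stdev(chunk))
    return out


# Alternating low/high pivot prices (made up).
p = [100.0, 105.0, 99.0, 106.0, 101.0, 109.0]
w = water_mark(p)
std = rolling_std(w, window=3)
# Scale w by its rolling standard deviation where both are available.
scaled = [None if s is None or s == 0 else wk / s for wk, s in zip(w, std)]
```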

The current bar position in TradingView Pine Script is translated to Series[bi_]
A Pine Script reference to a number of bars back, written Series[reference_position], is translated to:
- list[bi_ - reference_position]
- where list is a Python list
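A small sketch of that index translation, using the Pine expression close > close[1] as the example. The list contents are made up. One caveat that is easy to miss: a negative Python index wraps around to the end of the list, so the loop has to guard the first bars explicitly, where Pine would return na.

```python
close = [10.0, 11.0, 10.5, 12.0]  # hypothetical closes, oldest first

rising = []
for bi_ in range(len(close)):
    reference_position = 1  # Pine: close[1], i.e. one bar back
    if bi_ - reference_position < 0:
        # No history yet; without this guard, close[-1] would silently
        # read the LAST element of the list.
        rising.append(None)
    else:
        # Pine: close > close[1]
        rising.append(close[bi_] > close[bi_ - reference_position])
```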

Feature scaling https://en.wikipedia.org/wiki/Feature_scaling
Increase the frequency of changes, e.g. use a smaller deviation percentage threshold in Zigzag (0.618% instead of 1.618%)
Pre-decompress very large zip data into CSV first.
- An attempt to conserve HDD space by reading the Binance trade-by-trade data files (several hundred megabytes each) directly through the Python zipfile module failed: the module's open operation takes too long when decompressing huge zip files. Make decompression a separate procedure, and run the processing on CSV files that have already been decompressed.
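The separate decompression step could look roughly like the following sketch, which streams every CSV member of an archive to disk in chunks so later processing reads plain CSV files. The function name and the flat output layout are assumptions for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir):
    """Stream each CSV member of a zip archive into out_dir.

    Returns the list of extracted file paths. Members are written under
    their base name only, so nested folders in the archive are flattened.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir() or not member.filename.lower().endswith(".csv"):
                continue
            target = out_dir / Path(member.filename).name
            # Copy in 1 MiB chunks instead of loading the member into memory.
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, length=1024 * 1024)
            extracted.append(target)
    return extracted
```

Running this once per archive keeps the slow decompression out of the processing loop, at the cost of the extra disk space the CSV files take.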
Train, Validation, Test Datasets