Data Pre-processing
Source: Notion | Last edited: 2022-12-21 | ID: 522e70da-5da...
- Polars is faster and easier to learn than Pandas and Dask.
- Polars cannot import a folder containing multiple CSV files in one step; use Dask for that instead.
- Use a Python dictionary instead of a dataframe when looping bar by bar; the loop runs faster. A dataframe is only the fastest option for vectorized manipulation.
- TradingView indicators written in Pine Script can be converted to Python.
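A minimal sketch of the dictionary-based loop mentioned above, assuming the columns have already been pulled out of the dataframe into a plain dict of lists (pandas produces this shape with `df.to_dict("list")`). The column names and the `count_new_highs` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical column data in "dict of lists" form.
columns = {
    "dateUnix": [1, 2, 3, 4, 5],
    "price": [100.0, 101.0, 99.5, 102.0, 101.5],
}


def count_new_highs(cols):
    """Bar-by-bar loop over plain lists: count bars that set a new running high."""
    prices = cols["price"]            # one dict lookup, then plain list indexing
    running_high = float("-inf")
    new_highs = 0
    for bi_ in range(len(prices)):    # bi_ = current bar index
        if prices[bi_] > running_high:
            running_high = prices[bi_]
            new_highs += 1
    return new_highs


result = count_new_highs(columns)
```

The loop body touches only Python lists, so it avoids the per-row overhead of indexing into a dataframe on every iteration.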
Data Representation
down-sampling
- Range bars are a useful way to make hellishly large tick data files smaller and more manageable.
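A rough sketch of range-bar down-sampling, under a simplified closing rule that is an assumption of this example: a bar closes as soon as its high minus its low reaches `range_size`. The function name, the bar field names, and the sample ticks are hypothetical.

```python
def build_range_bars(ticks, range_size):
    """Collapse (dateUnix, price) ticks into bars.

    Simplified rule (an assumption): a bar closes once high - low >= range_size.
    """
    bars = []
    bar = None
    for date_unix, price in ticks:
        if bar is None:
            # Start a new bar at this tick.
            bar = {"dateUnix": date_unix, "open": price, "high": price,
                   "low": price, "close": price, "ticks": 0}
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["ticks"] += 1
        if bar["high"] - bar["low"] >= range_size:
            bars.append(bar)
            bar = None
    if bar is not None:
        bars.append(bar)  # last bar is still unfinished
    return bars


ticks = [(1, 100.0), (2, 100.4), (3, 101.0), (4, 100.8), (5, 99.9), (6, 100.1)]
bars = build_range_bars(ticks, range_size=1.0)
```

Here six ticks collapse into two bars; on real tick files the reduction depends on `range_size` and on how volatile the instrument is.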
Zigzag (unevenly distributed non-repainting high/low pivots)
- Uneven: unlike a moving average over the last n bars, Zigzag pivot points appear at uneven intervals in a time series.
- Non-repainting: the classical Zigzag indicator repaints because its most recent end point is still uncertain.
- Stand-alone: proven to be significant by NST.
- Apart from serving as a feature input itself, Zigzag is potentially an ideal candidate for supplying additional feature columns within every range bar. Think of it as a matryoshka doll (套娃).
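One way to get non-repainting pivots is to record a pivot only once price has reversed by a deviation threshold from the running extreme, and never to revise it afterwards. The sketch below is an illustration under that assumption, not necessarily the exact method behind these notes. The function name and `deviation_pct` parameter are hypothetical; the output keys `a`, `i`, `p`, `h` follow the column names used in these notes.

```python
def zigzag_pivots(prices, deviation_pct):
    """Return confirmed pivots; each one is fixed at the bar where it is found."""
    t = deviation_pct / 100.0
    pivots = []
    direction = 0        # 0 = unknown, 1 = looking for a high, -1 = looking for a low
    hi_i = lo_i = 0      # bar indices of the running high / low candidates
    for bi_, price in enumerate(prices):
        if price > prices[hi_i]:
            hi_i = bi_
        if price < prices[lo_i]:
            lo_i = bi_
        if direction >= 0 and price <= prices[hi_i] * (1 - t):
            # Price fell far enough from the candidate high: confirm it.
            pivots.append({"a": bi_, "i": hi_i, "p": prices[hi_i], "h": True})
            direction = -1
            lo_i = bi_   # start tracking the next low from this bar
        elif direction <= 0 and price >= prices[lo_i] * (1 + t):
            # Price rose far enough from the candidate low: confirm it.
            pivots.append({"a": bi_, "i": lo_i, "p": prices[lo_i], "h": False})
            direction = 1
            hi_i = bi_   # start tracking the next high from this bar
    return pivots


prices = [100.0, 101.0, 103.0, 102.0, 100.0, 99.0, 101.0, 104.0]
pivots = zigzag_pivots(prices, deviation_pct=1.618)
```

Each pivot carries both the bar where it sits (`i`) and the later bar where it was confirmed (`a`), so nothing is known before bar `a` and nothing changes after it.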
Column name
- dateUnix, used as the reference index for joining dataframes
- a, the bar index (loop position) WHEN a new pivot is found
- i, the bar index (loop position) WHERE the pivot is located
- p, price level of the pivot
- h, **True** means a Zigzag pivot High, **False** means a Zigzag pivot Low
- h1, created by shift(2), meaning the previous pivot high
- w, higher water mark magnitude calculated by the current pivot high minus the previous pivot high; and current pivot low minus the previous pivot low
- std, rolling_std(100), qua, rolling_quantile(100), skw, rolling_skew(100)
- they are used to scale w (higher water mark) and the distance between pivots
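A small sketch of how the shift(2) and rolling-std columns could fit together, assuming that shift(2) is applied to the pivot price column p. The notes do not spell out which column is shifted, so that reading is an assumption. Because pivots alternate between high and low, the row two positions back is the previous pivot of the same type. The helper names and the window of 3 (instead of 100) are hypothetical, chosen to keep the example small.

```python
import statistics


def shift(values, n):
    """Mimic a dataframe shift(n): move values down n rows, padding with None."""
    return ([None] * n + list(values))[:len(values)]


def rolling_std(values, window):
    """Sample standard deviation over the last `window` rows; None until the window is full."""
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - window + 1): k + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(statistics.stdev(chunk))
    return out


# Alternating pivot prices: low, high, low, high, low, high.
p = [100.0, 103.0, 99.0, 104.0, 101.0, 107.0]

prev_same = shift(p, 2)   # previous pivot of the same type (high vs high, low vs low)
w = [None if prev is None else cur - prev for cur, prev in zip(p, prev_same)]

std = rolling_std(w, window=3)
w_scaled = [None if s in (None, 0) else x / s for x, s in zip(w, std)]
```

Dividing w by its rolling standard deviation puts quiet and volatile stretches on a comparable scale, which is the usual motivation for this kind of feature scaling.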
TradingView Pine Script
- The current bar position in TradingView Pine Script is translated to Series[bi_], where bi_ is the bar index of the Python loop.
- A Pine Script reference to a number of bars back, written as Series[reference_position], is translated to:
- list[bi_ - reference_position]
- where list is a Python list
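The index translation above can be sketched as follows. The `back` helper is hypothetical. It guards against one pitfall: a negative index in Python wraps around to the end of the list, whereas Pine Script returns `na` when the reference reaches back before the first bar.

```python
def back(series, bi_, reference_position):
    """Pine-style series[reference_position] at bar bi_; None (like na) before the first bar."""
    k = bi_ - reference_position
    return series[k] if k >= 0 else None


close = [10.0, 10.5, 10.2, 10.8]   # hypothetical closing prices
rising = []
for bi_ in range(len(close)):
    prev = back(close, bi_, 1)     # Pine: close[1]
    # Pine: close > close[1]
    rising.append(prev is not None and close[bi_] > prev)
```

Without the guard, `close[bi_ - 1]` at `bi_ = 0` would silently read the last bar of the list.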
Tips & Tricks to Improve Performance
- Feature scaling https://en.wikipedia.org/wiki/Feature_scaling
- Increase frequency of changes, e.g. smaller deviation percentage threshold in Zigzag (0.618% instead of 1.618%)
- Pre-decompress very large zip data into CSV first.
- An attempt to conserve HDD space by reading the Binance trade-by-trade data files directly through the Python zipfile module failed. Each file is several hundred megabytes, and the zipfile module's open operation takes too long when decompressing archives that large. Make decompression a separate procedure, and run the processing on CSV files that have already been decompressed.
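A minimal sketch of the separate decompression step, using only the standard-library `zipfile` and `pathlib` modules. The function name and directory layout are hypothetical. Archives are extracted once, and members that already exist on disk are skipped, so later processing runs only touch plain CSV files.

```python
import zipfile
from pathlib import Path


def predecompress(zip_dir, csv_dir):
    """Extract every .zip in zip_dir into csv_dir; skip members already extracted."""
    zip_dir, csv_dir = Path(zip_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                target = csv_dir / name
                if not target.exists():
                    zf.extract(name, csv_dir)   # one-off decompression cost
                extracted.append(target)
    return extracted
```

The trade-off is disk space for time: the CSV files take more room than the archives, but each processing run no longer pays the decompression cost.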