Turing Design (WIP)
Source: Notion | Last edited: 2024-06-05 | ID: be2bf5de-6b5...
Background
As of today, the feature file in the Turing Model (previously named Enigma) is static; its version is controlled by git branches.
When making predictions with models trained on a different feature version, we have to build a separate Docker container holding the same repo checked out on a different branch. As the number of versioned features that need to be trained on or predicted with grows, this becomes very inconvenient.
Solution
To solve this problem, we want to make the following configurable: the feature files, the model structure file, and the activation and loss functions.
To do that, we will start by making feature files dynamic. Namely, we want to achieve the following:
- When making a prediction, the code knows precisely how to process the features based on a featureset_name parameter we pass in. The featureset_name can be any of the following:
  - prod_v11_size50
  - prod_v15_size79
  - prod_v18_size126
  - prod_v15_size79_based_test_validation_v3_size80
  - prod_v18_size126_based_test_validation_v12_size119
# Once a test featureset satisfies certain criteria, it can be promoted to a prod featureset.
# The size postfix should be auto-added.
The feature code snippet should be stored somewhere:
- option 1: in S3 when first submitted, then moved into a GitHub repo once verified and promoted to the prod featureset.
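The naming scheme above (auto-appended size postfix, test sets named after the prod set they are based on) can be generated mechanically. A minimal sketch; make_featureset_name is a hypothetical helper, not existing code:

```python
def make_featureset_name(base_name, n_features, test_name=None, n_test_features=None):
    """Build a versioned featureset name with the size postfix auto-added.

    base_name is like "prod_v15"; for a test featureset derived from a
    prod one, also pass test_name like "test_validation_v3" and its size.
    """
    name = f"{base_name}_size{n_features}"
    if test_name is not None:
        name = f"{name}_based_{test_name}_size{n_test_features}"
    return name

make_featureset_name("prod_v11", 50)
# "prod_v11_size50"
make_featureset_name("prod_v15", 79, "test_validation_v3", 80)
# "prod_v15_size79_based_test_validation_v3_size80"
```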
- When testing new features, the method that generates the featureset, together with the featureset_name, will be submitted to a Touchstone service. The Touchstone service evaluates how good a featureset is on a given dataset. More specifically, it will do the following:
- use the featureset to generate features on the given train and validation datasets (on which all other featuresets are also evaluated)
- train and evaluate models on a given test dataset, possibly many times, for a statistically significant result, and examine the trained models' feature importance
- save the evaluation and feature importance results to a db table, an S3 file, or something like an MLflow service, so we can easily compare the performance of this featureset with the other ones
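The three Touchstone steps can be orchestrated roughly as follows. This is a sketch only: the callables are injected as placeholders for the submitted featureset method and the Turing training code, and the returned record stands in for whatever store (db table, S3, MLflow-style tracker) we pick.

```python
import statistics

def evaluate_featureset(featureset_name, datasets, generate_features,
                        train_model, evaluate_model, n_runs=5):
    """Sketch of the Touchstone evaluation flow (placeholder callables)."""
    train_data, validation_data, test_data = datasets

    # Step 1: generate features on the shared train/validation datasets
    # (the same datasets all other featuresets are evaluated on).
    train_feats = generate_features(featureset_name, train_data)
    val_feats = generate_features(featureset_name, validation_data)

    # Step 2: train and evaluate several times for a statistically
    # significant result, collecting each model's feature importances.
    scores, importances = [], []
    for _ in range(n_runs):
        model = train_model(train_feats, val_feats)
        scores.append(evaluate_model(model, test_data))
        importances.append(getattr(model, "feature_importances_", None))

    # Step 3: return one record to persist so featuresets can be
    # compared against each other easily.
    return {
        "featureset": featureset_name,
        "mean_score": statistics.mean(scores),
        "stdev_score": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "importances": importances,
    }
```

Injecting the callables keeps the service agnostic of the concrete feature and model code, which is the point of making featuresets configurable.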
Define the signature of the method
The entire feature file can be modified. For each feature group, if a type is not specified, that type is unchanged (the base featureset's definition is used).
```python
import importlib

def create_features(base_featureset_name, new_featureset_name, selectors):
    base_feature_helpers = importlib.import_module(base_featureset_name)
    new_feature_helpers = importlib.import_module(new_featureset_name)
    # ohlcv is assumed to be a frame-like object with named columns and
    # len(ohlcv) equal to the number of rows.
    ohlcv = fetch_ohlcv()

    feature_list = []
    for selector in selectors:
        feature = new_feature_helpers.extract_by_type(
            selector,
            dates=ohlcv.dates,
            open_prices=ohlcv.open_prices,
            close_prices=ohlcv.close_prices,
            high_prices=ohlcv.high_prices,
            low_prices=ohlcv.low_prices,
            volumes=ohlcv.volumes,
        )
        # If the new featureset does not define this feature type,
        # use the definition from the base featureset.
        if len(feature) == 0:
            feature = base_feature_helpers.extract_by_type(
                selector,
                dates=ohlcv.dates,
                open_prices=ohlcv.open_prices,
                close_prices=ohlcv.close_prices,
                high_prices=ohlcv.high_prices,
                low_prices=ohlcv.low_prices,
                volumes=ohlcv.volumes,
            )
        if len(feature) != len(ohlcv):
            raise ValueError(
                f"feature {selector} parse is incorrect: "
                f"feature length ({len(feature)}) != ohlcv length ({len(ohlcv)})"
            )
        feature_list.append(feature)
    return feature_list
```
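The override semantics (the new featureset wins; an empty result falls back to the base featureset) can be shown in isolation. The two dicts below are hypothetical stand-ins for the imported helper modules:

```python
def resolve_feature(selector, new_helpers, base_helpers, **series):
    # Try the new featureset first; an empty result means the feature
    # type is not overridden, so fall back to the base definition.
    feature = new_helpers.get(selector, lambda **_: [])(**series)
    if len(feature) == 0:
        feature = base_helpers.get(selector, lambda **_: [])(**series)
    return feature

# Hypothetical helper tables standing in for the imported modules.
base = {"MA": lambda close_prices, **_: [sum(close_prices) / len(close_prices)]}
new = {}  # MA is not overridden in the new featureset

resolve_feature("MA", new, base, close_prices=[1.0, 2.0, 3.0])  # falls back: [2.0]
```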
```python
def extract_raw_features(
    selectors,
    feature_type,
    dates=None,
    open_prices=None,
    close_prices=None,
    high_prices=None,
    low_prices=None,
    volumes=None,
):
    # Generate the features from the selectors.
    features = []
    for selector in selectors:
        features.append(
            extract_by_type(
                feature_type,
                dates=dates,
                open_prices=open_prices,
                close_prices=close_prices,
                high_prices=high_prices,
                low_prices=low_prices,
                volumes=volumes,
            )
        )
    return features
```

Supported feature types:
- ROCP
- MACD
- RSI
- VROCP
- BOLL
- MA
- VMA
- PRICE_VOLUME
- MIN_MAX
- TIME
- VOLATILITY
- PATTERN
- STOCH
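One way extract_by_type could dispatch on these type names is a simple registry. This is an assumption about a possible implementation, not the current code; only ROCP is filled in as an illustration:

```python
# Hypothetical registry mapping feature-type names to extractor functions.
_EXTRACTORS = {}

def register(feature_type):
    def wrap(fn):
        _EXTRACTORS[feature_type] = fn
        return fn
    return wrap

@register("ROCP")
def rocp(close_prices, period=1, **_):
    # Rate of change of the close price over `period` steps.
    return [
        (curr - prev) / prev
        for prev, curr in zip(close_prices, close_prices[period:])
    ]

def extract_by_type(feature_type, **series):
    # Unknown types return [], so a caller can detect a length-0 result
    # and fall back to the base featureset's definition.
    fn = _EXTRACTORS.get(feature_type)
    return fn(**series) if fn else []
```

Registering each of MACD, RSI, VMA, etc. the same way would let a new featureset file override only the types it changes.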
Open question: should we add new features individually, or modify the entire feature file?