Turing Design (WIP)
Source: Notion | Last edited: 2024-06-05 | ID: be2bf5de-6b5...
Background
As of today, the feature file in the Turing Model (previously named Enigma) is static; its version is controlled by git branches.
When making predictions with models trained on a different feature version, we have to build a separate Docker container holding the same repo checked out on a different branch. As the number of versioned features that need to be trained on or predicted with grows, this becomes very inconvenient.
Solution
To solve this problem, we want to make the following configurable: the feature files, the model structure file, and the activation and loss functions.
To do that, we will start by making feature files dynamic. Namely, we want to achieve the following:
- When making a prediction, the code knows precisely how to process the features based on a featureset_name parameter we pass in. The featureset_name can be any of the following:
  - prod_v11_size50
  - prod_v15_size79
  - prod_v18_size126
  - prod_v15_size79_based_test_validation_v3_size80
  - prod_v18_size126_based_test_validation_v12_size119
# Once a test featureset satisfies certain criteria, it can be promoted to a prod featureset.
# The size postfix should be auto-added.
The feature code snippet should be stored somewhere:
- option 1: in S3 when first submitted, then moved into a GitHub repo once verified and promoted to the prod featureset.
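The naming scheme above (auto-appended size postfix, test sets named after the prod set they are based on) can be generated mechanically. A minimal sketch; make_featureset_name is a hypothetical helper, not existing code:

```python
def make_featureset_name(base_name, n_features, test_name=None, n_test_features=None):
    """Build a versioned featureset name with the size postfix auto-added.

    base_name is like "prod_v15"; for a test featureset derived from a
    prod one, also pass test_name like "test_validation_v3" and its size.
    """
    name = f"{base_name}_size{n_features}"
    if test_name is not None:
        name = f"{name}_based_{test_name}_size{n_test_features}"
    return name

make_featureset_name("prod_v11", 50)
# "prod_v11_size50"
make_featureset_name("prod_v15", 79, "test_validation_v3", 80)
# "prod_v15_size79_based_test_validation_v3_size80"
```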
- When testing new features, the method that generates the featureset, together with the featureset_name, will be submitted to a Touchstone service. The Touchstone service evaluates how good a featureset is on a given dataset. More specifically, it will do the following:
- use the featureset to generate features on the given train and validation datasets (on which all other featuresets are also evaluated)
- train and evaluate models on a given test dataset, possibly many times, for a statistically significant result, and examine the trained models' feature importance
- save the evaluation and feature importance results to a db table, an S3 file, or something like an MLflow service, so we can easily compare the performance of this featureset with the other ones
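The three Touchstone steps can be orchestrated roughly as follows. This is a sketch only: the callables are injected as placeholders for the submitted featureset method and the Turing training code, and the returned record stands in for whatever store (db table, S3, MLflow-style tracker) we pick.

```python
import statistics

def evaluate_featureset(featureset_name, datasets, generate_features,
                        train_model, evaluate_model, n_runs=5):
    """Sketch of the Touchstone evaluation flow (placeholder callables)."""
    train_data, validation_data, test_data = datasets

    # Step 1: generate features on the shared train/validation datasets
    # (the same datasets all other featuresets are evaluated on).
    train_feats = generate_features(featureset_name, train_data)
    val_feats = generate_features(featureset_name, validation_data)

    # Step 2: train and evaluate several times for a statistically
    # significant result, collecting each model's feature importances.
    scores, importances = [], []
    for _ in range(n_runs):
        model = train_model(train_feats, val_feats)
        scores.append(evaluate_model(model, test_data))
        importances.append(getattr(model, "feature_importances_", None))

    # Step 3: return one record to persist so featuresets can be
    # compared against each other easily.
    return {
        "featureset": featureset_name,
        "mean_score": statistics.mean(scores),
        "stdev_score": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "importances": importances,
    }
```

Injecting the callables keeps the service agnostic of the concrete feature and model code, which is the point of making featuresets configurable.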
Define the signature of the method
The entire feature file can be modified. For each feature group, if a type is not specified, that type is unchanged (the base featureset's definition is used).
```python
import importlib

def create_features(base_featureset_name, new_featureset_name, selectors):
    base_feature_helpers = importlib.import_module(base_featureset_name)
    new_feature_helpers = importlib.import_module(new_featureset_name)
    # ohlcv is assumed to be a frame-like object with named columns and
    # len(ohlcv) equal to the number of rows.
    ohlcv = fetch_ohlcv()

    feature_list = []
    for selector in selectors:
        feature = new_feature_helpers.extract_by_type(
            selector,
            dates=ohlcv.dates,
            open_prices=ohlcv.open_prices,
            close_prices=ohlcv.close_prices,
            high_prices=ohlcv.high_prices,
            low_prices=ohlcv.low_prices,
            volumes=ohlcv.volumes,
        )
        # If the new featureset does not define this feature type,
        # use the definition from the base featureset.
        if len(feature) == 0:
            feature = base_feature_helpers.extract_by_type(
                selector,
                dates=ohlcv.dates,
                open_prices=ohlcv.open_prices,
                close_prices=ohlcv.close_prices,
                high_prices=ohlcv.high_prices,
                low_prices=ohlcv.low_prices,
                volumes=ohlcv.volumes,
            )
        if len(feature) != len(ohlcv):
            raise ValueError(
                f"feature {selector} parse is incorrect: "
                f"feature length ({len(feature)}) != ohlcv length ({len(ohlcv)})"
            )
        feature_list.append(feature)
    return feature_list
```
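The override semantics (the new featureset wins; an empty result falls back to the base featureset) can be shown in isolation. The two dicts below are hypothetical stand-ins for the imported helper modules:

```python
def resolve_feature(selector, new_helpers, base_helpers, **series):
    # Try the new featureset first; an empty result means the feature
    # type is not overridden, so fall back to the base definition.
    feature = new_helpers.get(selector, lambda **_: [])(**series)
    if len(feature) == 0:
        feature = base_helpers.get(selector, lambda **_: [])(**series)
    return feature

# Hypothetical helper tables standing in for the imported modules.
base = {"MA": lambda close_prices, **_: [sum(close_prices) / len(close_prices)]}
new = {}  # MA is not overridden in the new featureset

resolve_feature("MA", new, base, close_prices=[1.0, 2.0, 3.0])  # falls back: [2.0]
```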
```python
def extract_raw_features(
    selectors,
    feature_type,
    dates=None,
    open_prices=None,
    close_prices=None,
    high_prices=None,
    low_prices=None,
    volumes=None,
):
    # Generate the features from the selectors.
    features = []
    for selector in selectors:
        features.append(
            extract_by_type(
                feature_type,
                dates=dates,
                open_prices=open_prices,
                close_prices=close_prices,
                high_prices=high_prices,
                low_prices=low_prices,
                volumes=volumes,
            )
        )
    return features
```

Supported feature types:
- ROCP
- MACD
- RSI
- VROCP
- BOLL
- MA
- VMA
- PRICE_VOLUME
- MIN_MAX
- TIME
- VOLATILITY
- PATTERN
- STOCH
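One way extract_by_type could dispatch on these type names is a simple registry. This is an assumption about a possible implementation, not the current code; only ROCP is filled in as an illustration:

```python
# Hypothetical registry mapping feature-type names to extractor functions.
_EXTRACTORS = {}

def register(feature_type):
    def wrap(fn):
        _EXTRACTORS[feature_type] = fn
        return fn
    return wrap

@register("ROCP")
def rocp(close_prices, period=1, **_):
    # Rate of change of the close price over `period` steps.
    return [
        (curr - prev) / prev
        for prev, curr in zip(close_prices, close_prices[period:])
    ]

def extract_by_type(feature_type, **series):
    # Unknown types return [], so a caller can detect a length-0 result
    # and fall back to the base featureset's definition.
    fn = _EXTRACTORS.get(feature_type)
    return fn(**series) if fn else []
```

Registering each of MACD, RSI, VMA, etc. the same way would let a new featureset file override only the types it changes.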
Open question: should we add new features individually, or modify the entire feature file?