Feature Set Development Guide: Keywords and Workflow

Source: Notion | Last edited: 2025-05-08 | ID: 1eb2d2dc-3ef...


This guide provides a visual overview of the key concepts and step-by-step workflow for creating new feature sets in the ml-feature-set framework.

```mermaid
graph TD
    FS[FeatureSet] --> |Base Class| CFS[CustomFeatureSet]
    CFS --> |Implements| DD[data_dependencies]
    CFS --> |Implements| SLL[get_source_lookback_length]
    CFS --> |Implements| EF[extract_feature]
    CFS --> |Optional| GL[generate_labels]
    DD --> |Defines| DS[Data Sources]
    DS --> |Has| RF[Resample Factors]
    DS --> |Has| IP[is_primary]
    EF --> |Uses| GDS[get_data_source]
    EF --> |Uses| GSC[get_source_column]
    EF --> |Uses| SFB[set_features_batch]
    EF --> |Uses| BFS[backfill_features_safely]
    subgraph "Core Components"
        FS
        DD
        SLL
        EF
        GL
    end
    subgraph "Data Access"
        GDS
        GSC
        DS
    end
    subgraph "Feature Management"
        SFB
        BFS
    end
    style FS fill:#f9f,stroke:#333,stroke-width:2px
    style CFS fill:#bbf,stroke:#333,stroke-width:2px
```
```mermaid
flowchart TD
    A[Start: Define Requirements] --> B[Create CustomFeatureSet Class]
    B --> C{Define data_dependencies}
    C --> D[Specify Data Sources]
    D --> E[Set Resampling Factors]
    E --> F[Implement get_source_lookback_length]
    F --> G[Implement extract_feature Method]
    G --> H[Get Data Sources]
    H --> I[Calculate Technical Indicators]
    I --> J[Process Data]
    J --> K[Store Features with set_features_batch]
    K --> L{Need Custom Labels?}
    L -- Yes --> M[Implement generate_labels]
    L -- No --> N[Use Default Labels]
    M --> O[Test Feature Set]
    N --> O
    O --> P[Validate Output]
    P --> Q[End: Feature Set Complete]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style Q fill:#9f9,stroke:#333,stroke-width:2px
```
  • Identify what market patterns or signals you want to capture
  • Determine what technical indicators or calculations are needed
  • Consider what time scales are relevant (1x, 4x, 12x, etc.)
```python
import numpy as np
import pandas as pd
import talib

from ml_feature_set.feature_set import FeatureSet


class CustomFeatureSet(FeatureSet):
    """
    Your feature set description here.

    Explain what market patterns this feature set captures
    and what indicators or techniques it uses.
    """

    @property
    def data_dependencies(self):
        """Return data source dependency information."""
        return [
            {"source": "ohlcv", "resample_factors": [1, 12], "is_primary": True},
            # Add other data sources if needed:
            # {"source": "fear_greed_index", "resample_factors": [1]},
        ]

    def get_source_lookback_length(self, source_name):
        """
        Get the required historical data length for a specific data source.

        Args:
            source_name: Data source name (e.g., "ohlcv_1x", "ohlcv_12x")

        Returns:
            Required historical data length for the source
        """
        # Strip the resample-factor suffix (e.g., "_12x") to get the base source
        parts = source_name.split("_")
        if len(parts) > 1 and parts[-1].endswith("x"):
            try:
                int(parts[-1].replace("x", ""))
                base_source = "_".join(parts[:-1])
            except ValueError:
                base_source = source_name
        else:
            base_source = source_name

        # Return the appropriate lookback length
        if base_source == "ohlcv":
            return 200  # Adjust based on your feature needs
        # Handle other data sources here
        raise ValueError(f"Unsupported data source: {source_name}")

    def extract_feature(self):
        """Extract features from data sources."""
        # Get data sources
        ohlcv_source = self.get_data_source("ohlcv_1x")
        df = ohlcv_source["data_df"].copy()

        # Validate data
        if "actual_ready_time" not in df.columns:
            raise ValueError("Data source missing 'actual_ready_time' column")

        # Ensure the time index is set correctly
        if not isinstance(df.index, pd.DatetimeIndex):
            df["actual_ready_time"] = pd.to_datetime(df["actual_ready_time"])
            df.set_index("actual_ready_time", inplace=True)

        # Extract data
        close = df["close"].values

        # Calculate features, clipped to [-1, 1]
        features = {}
        features["rocp"] = np.clip(talib.ROCP(close, timeperiod=1), -1, 1)
        # Add more features...

        # Save features
        self.set_features_batch(features)
```

6. Handle Multiple Time Scales (if needed)

```python
# Inside extract_feature, after computing features on the 1x data:
# For resampled data (e.g., 12x)
ohlcv_12x_source = self.get_data_source("ohlcv_12x")
df_12x = ohlcv_12x_source["data_df"].copy()

# Generate features for the 12x data
features_12x = self.generate_features(df_12x, suffix="_12x")

# Align with the original (1x) time scale
backfilled_features_12x = self.backfill_features_safely(
    df.index, df_12x.index, features_12x
)

# Combine and save features
all_features = {**features_original, **backfilled_features_12x}
self.set_features_batch(all_features)
```
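The idea behind aligning resampled features is to carry each coarse-grained value forward onto the fine-grained index without looking ahead. The framework provides `backfill_features_safely` for this; the sketch below (function name and integer timestamps are illustrative, not the framework's implementation) shows the underlying mechanics with NumPy:

```python
import numpy as np

def backfill_to_fine_index(fine_times, coarse_times, coarse_features):
    """For each fine-grained timestamp, carry forward the most recent
    coarse-grained feature value (never look ahead)."""
    # Index of the latest coarse timestamp <= each fine timestamp
    pos = np.searchsorted(coarse_times, fine_times, side="right") - 1
    out = {}
    for name, values in coarse_features.items():
        values = np.asarray(values, dtype=float)
        aligned = np.full(len(fine_times), np.nan)
        valid = pos >= 0  # fine timestamps before the first coarse bar stay NaN
        aligned[valid] = values[pos[valid]]
        out[name] = aligned
    return out

fine = np.array([0, 1, 2, 3, 4, 5])    # 1x timestamps
coarse = np.array([0, 3])              # resampled timestamps
aligned = backfill_to_fine_index(fine, coarse, {"ma_12x": [10.0, 20.0]})
# aligned["ma_12x"] → [10., 10., 10., 20., 20., 20.]
```

Each coarse value is repeated until the next coarse bar arrives, which is what keeps the alignment free of lookahead bias.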
  • Create test data sources
  • Instantiate your feature set
  • Call build_features with test data
  • Verify feature shapes, ranges, and absence of NaN values
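The verification step above can be sketched as a small helper (the function name is hypothetical; it assumes features are a dict of equal-length NumPy arrays, as produced by `extract_feature`):

```python
import numpy as np

def validate_features(features, expected_len):
    """Basic sanity checks for an extracted feature dictionary."""
    for name, values in features.items():
        values = np.asarray(values, dtype=float)
        # Every feature should cover the full time range
        assert values.shape == (expected_len,), f"{name}: bad shape {values.shape}"
        # No NaNs should survive extraction
        assert not np.isnan(values).any(), f"{name}: contains NaN"
        # Normalized features should stay within [-1, 1]
        assert np.abs(values).max() <= 1.0, f"{name}: out of range"

features = {"rocp": np.array([0.0, 0.1, -0.2, 0.05])}
validate_features(features, expected_len=4)  # passes silently
```

Run checks like these against several market regimes (trending, ranging, gappy data) before relying on a feature set downstream.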
```mermaid
mindmap
  root((Feature Types))
    Price-Based
      Moving Averages
      Bollinger Bands
      Support/Resistance
      Pivot Points
    Momentum
      RSI
      MACD
      Stochastic
      ROC
    Volume-Based
      OBV
      Volume MA
      Price-Volume
      Chaikin Money Flow
    Volatility
      ATR
      Standard Deviation
      Bollinger Width
      Donchian Channel
    Pattern Recognition
      Fractals
      Candlestick Patterns
    Multi-Timeframe
      Resampled Indicators
      Trend Alignment
    External Data
      Fear and Greed Index
      Market Sentiment
```
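As a concrete instance of a momentum feature from the map above, here is a simplified RSI in plain NumPy. It uses simple averages rather than Wilder's smoothing, so its values differ slightly from `talib.RSI`; it is an illustration of the category, not a drop-in replacement:

```python
import numpy as np

def simple_rsi(close, period=14):
    """Simplified RSI: average gain vs. average loss over a rolling window.
    (talib.RSI applies Wilder smoothing instead of simple averages.)"""
    close = np.asarray(close, dtype=float)
    delta = np.diff(close)
    rsi = np.full(close.shape, np.nan)  # warm-up region stays NaN
    for i in range(period, len(close)):
        window = delta[i - period:i]
        gain = window[window > 0].sum() / period
        loss = -window[window < 0].sum() / period
        rs = gain / loss if loss > 0 else np.inf
        rsi[i] = 100.0 - 100.0 / (1.0 + rs)
    return rsi
```

To use RSI as a normalized feature, map it from [0, 100] into [-1, 1] with `rsi / 50.0 - 1.0`, in line with the normalization practice below.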
  1. Normalize Features: Keep feature values in reasonable ranges (often [-1, 1])
  2. Handle Edge Cases: Use np.nan_to_num() to handle NaN values
  3. Use Batch Operations: Set features in batches with set_features_batch()
  4. Document Your Features: Explain what each feature represents
  5. Validate Inputs: Check for required columns and proper time indexing
  6. Optimize Performance: Avoid redundant calculations
  7. Test Thoroughly: Verify feature behavior with different market conditions
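Practices 1 and 2 are often combined into one post-processing step per feature. A minimal sketch (the helper name `clean_feature` is hypothetical, not a framework API):

```python
import numpy as np

def clean_feature(values, clip=1.0):
    """Clip a raw feature into [-clip, clip] and replace NaN/inf with safe values."""
    values = np.asarray(values, dtype=float)
    # NaN -> 0, +inf -> upper bound, -inf -> lower bound
    values = np.nan_to_num(values, nan=0.0, posinf=clip, neginf=-clip)
    return np.clip(values, -clip, clip)

raw = np.array([0.5, np.nan, 3.0, -np.inf])
cleaned = clean_feature(raw)
# cleaned → [0.5, 0.0, 1.0, -1.0]
```

Applying this once, just before `set_features_batch`, keeps the per-indicator code free of repetitive range handling.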
```python
class CustomFeatureSet(FeatureSet):
    """Feature set description"""

    @property
    def data_dependencies(self):
        """Define data dependencies"""
        return [...]

    def get_source_lookback_length(self, source_name):
        """Specify historical data requirements"""
        return ...

    def extract_feature(self):
        """Main feature extraction logic"""
        # Get data
        # Calculate features
        # Save features

    def generate_features(self, df, suffix=""):
        """Helper method for feature calculation"""
        features = {}
        # Calculate features
        return features

    def generate_labels(self):
        """Optional custom label generation"""
        # Calculate custom labels
        return labels
```