Feature Set Development Guide: Keywords and Workflow
Source: Notion | Last edited: 2025-05-08 | ID: 1eb2d2dc-3ef...
This guide provides a visual overview of the key concepts and step-by-step workflow for creating new feature sets in the ml-feature-set framework.
Key Concepts
```mermaid
graph TD
    FS[FeatureSet] --> |Base Class| CFS[CustomFeatureSet]
    CFS --> |Implements| DD[data_dependencies]
    CFS --> |Implements| SLL[get_source_lookback_length]
    CFS --> |Implements| EF[extract_feature]
    CFS --> |Optional| GL[generate_labels]

    DD --> |Defines| DS[Data Sources]
    DS --> |Has| RF[Resample Factors]
    DS --> |Has| IP[is_primary]

    EF --> |Uses| GDS[get_data_source]
    EF --> |Uses| GSC[get_source_column]
    EF --> |Uses| SFB[set_features_batch]
    EF --> |Uses| BFS[backfill_features_safely]

    subgraph "Core Components"
        FS
        DD
        SLL
        EF
        GL
    end

    subgraph "Data Access"
        GDS
        GSC
        DS
    end

    subgraph "Feature Management"
        SFB
        BFS
    end

    style FS fill:#f9f,stroke:#333,stroke-width:2px
    style CFS fill:#bbf,stroke:#333,stroke-width:2px
```

Key Terms
Development Workflow
```mermaid
flowchart TD
    A[Start: Define Requirements] --> B[Create CustomFeatureSet Class]
    B --> C{Define data_dependencies}
    C --> D[Specify Data Sources]
    D --> E[Set Resampling Factors]
    E --> F[Implement get_source_lookback_length]
    F --> G[Implement extract_feature Method]
    G --> H[Get Data Sources]
    H --> I[Calculate Technical Indicators]
    I --> J[Process Data]
    J --> K[Store Features with set_features_batch]
    K --> L{Need Custom Labels?}
    L -- Yes --> M[Implement generate_labels]
    L -- No --> N[Use Default Labels]
    M --> O[Test Feature Set]
    N --> O
    O --> P[Validate Output]
    P --> Q[End: Feature Set Complete]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style Q fill:#9f9,stroke:#333,stroke-width:2px
```

Detailed Steps
1. Define Requirements
- Identify which market patterns or signals you want to capture
- Determine what technical indicators or calculations are needed
- Consider what time scales are relevant (1x, 4x, 12x, etc.)
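
The factors above (1x, 4x, 12x) read most naturally as multiples of the base bar interval. As a rough illustration, assuming a factor of k means aggregating every k consecutive base bars into one coarser OHLCV bar, the idea could be sketched as below. The framework's actual resampling rules are not described in this guide, and `resample_bars` is a hypothetical helper, not part of ml-feature-set.

```python
def resample_bars(bars, factor):
    """Aggregate every `factor` consecutive OHLCV bars into one coarser bar.

    Hypothetical helper for illustration; not part of ml-feature-set.
    An incomplete trailing group is dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    resampled = []
    for start in range(0, len(bars) - factor + 1, factor):
        group = bars[start:start + factor]
        resampled.append({
            "open": group[0]["open"],                  # first open in the group
            "high": max(b["high"] for b in group),     # highest high
            "low": min(b["low"] for b in group),       # lowest low
            "close": group[-1]["close"],               # last close in the group
            "volume": sum(b["volume"] for b in group), # total volume
        })
    return resampled


# Eight synthetic base (1x) bars with steadily rising prices
base_bars = [
    {"open": 100 + i, "high": 102 + i, "low": 99 + i, "close": 101 + i, "volume": 10}
    for i in range(8)
]
bars_4x = resample_bars(base_bars, 4)
```

Indicators computed on `bars_4x` respond to slower moves than the same indicators on the base bars, which is the usual reason for listing more than one factor.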
2. Create Feature Set Class
```python
from ml_feature_set.feature_set import FeatureSet


class CustomFeatureSet(FeatureSet):
    """
    Your feature set description here

    Explain what market patterns this feature set captures
    and what indicators or techniques it uses.
    """
```

3. Define Data Dependencies
```python
@property
def data_dependencies(self):
    """Return data source dependencies information"""
    return [
        {"source": "ohlcv", "resample_factors": [1, 12], "is_primary": True},
        # Add other data sources if needed
        # {"source": "fear_greed_index", "resample_factors": [1]},
    ]
```

4. Specify Historical Data Requirements
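
The method in this step receives names such as "ohlcv_1x" and "ohlcv_12x", which appear to combine a base source with a resample factor. A minimal sketch, assuming that `<source>_<factor>x` convention holds for every entry in `data_dependencies` (`expand_source_names` is a hypothetical helper, not a framework function):

```python
def expand_source_names(dependencies):
    """Turn dependency entries into names such as "ohlcv_12x".

    Assumes the "<source>_<factor>x" naming seen in this guide;
    entries without resample_factors are treated as 1x (an assumption).
    """
    names = []
    for dep in dependencies:
        for factor in dep.get("resample_factors", [1]):
            names.append(f"{dep['source']}_{factor}x")
    return names


dependencies = [
    {"source": "ohlcv", "resample_factors": [1, 12], "is_primary": True},
    {"source": "fear_greed_index", "resample_factors": [1]},
]
source_names = expand_source_names(dependencies)
```

Under this assumption a lookback method has to cope with base sources that themselves contain underscores, such as "fear_greed_index_1x", so only the final `_<factor>x` segment should be treated as the factor.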
```python
def get_source_lookback_length(self, source_name):
    """
    Get required historical data length for specific data source

    Args:
        source_name: Data source name (e.g., "ohlcv_1x", "ohlcv_12x")

    Returns:
        Required historical data length for the source
    """
    # Parse source name and resample factor
    parts = source_name.split("_")
    if len(parts) > 1 and parts[-1].endswith("x"):
        try:
            resample_factor = int(parts[-1].replace("x", ""))
            base_source = "_".join(parts[:-1])
        except ValueError:
            base_source = source_name
    else:
        base_source = source_name

    # Return appropriate lookback length
    if base_source == "ohlcv":
        return 200  # Adjust based on your feature needs

    # Handle other data sources
    raise ValueError(f"Unsupported data source: {source_name}")
```

5. Implement Feature Extraction
```python
import numpy as np
import pandas as pd
import talib


def extract_feature(self):
    """Extract features from data sources"""
    # Get data sources
    ohlcv_source = self.get_data_source("ohlcv_1x")
    df = ohlcv_source["data_df"].copy()

    # Validate data
    if "actual_ready_time" not in df.columns:
        raise ValueError("Data source missing 'actual_ready_time' column")

    # Ensure time index is set correctly
    if not isinstance(df.index, pd.DatetimeIndex):
        df["actual_ready_time"] = pd.to_datetime(df["actual_ready_time"])
        df.set_index("actual_ready_time", inplace=True)

    # Extract data (TA-Lib expects float64 input arrays)
    close = df["close"].values.astype(float)

    # Calculate features
    features = {}
    # One-period rate of change, clipped to [-1, 1]
    features["rocp"] = np.clip(talib.ROCP(close, timeperiod=1), -1, 1)
    # Add more features...

    # Save features
    self.set_features_batch(features)
```

6. Handle Multiple Time Scales (if needed)
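
Features computed on a coarser time scale have fewer rows than the base series, so they need aligning to the base index before they can be stored together. The sketch below shows one way to do this without look-ahead, assuming each coarse value becomes visible at its own ready time and is carried forward until the next one. `forward_fill_align` is a hypothetical stand-in written for illustration; how `backfill_features_safely` actually behaves is not specified in this guide.

```python
from bisect import bisect_right


def forward_fill_align(base_times, coarse_times, coarse_values, fill=float("nan")):
    """Map coarse-scale values onto base timestamps without look-ahead.

    Each base timestamp receives the latest coarse value whose timestamp is
    less than or equal to it; earlier base timestamps receive `fill`.
    Both time sequences must be sorted in ascending order.
    """
    aligned = []
    for t in base_times:
        # Number of coarse timestamps that are <= t
        pos = bisect_right(coarse_times, t)
        aligned.append(coarse_values[pos - 1] if pos > 0 else fill)
    return aligned


# Base bars every minute; coarse bars become ready at minutes 3 and 6
base_times = [1, 2, 3, 4, 5, 6, 7]
coarse_times = [3, 6]
coarse_values = [0.25, -0.5]
aligned = forward_fill_align(base_times, coarse_times, coarse_values, fill=0.0)
```

The point of the `<=` comparison is that a base row never sees a coarse bar that was not yet ready at that row's time.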
```python
# For resampled data (e.g., 12x)
ohlcv_12x_source = self.get_data_source("ohlcv_12x")
df_12x = ohlcv_12x_source["data_df"].copy()

# Generate features for 12x data
features_12x = self.generate_features(df_12x, suffix="_12x")

# Align with original time scale
backfilled_features_12x = self.backfill_features_safely(df.index, df_12x.index, features_12x)

# Combine features
# (features_original is the dict of 1x features built from df in the previous step)
all_features = {**features_original, **backfilled_features_12x}
self.set_features_batch(all_features)
```

7. Test and Validate
- Create test data sources
- Instantiate your feature set
- Call `build_features` with test data
- Verify feature shapes, ranges, and absence of NaN values
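
The final check in this list can be automated with a small helper. The sketch below assumes features are one-dimensional arrays keyed by name, as passed to `set_features_batch` in this guide, and that the target range is [-1, 1]; `validate_features` is a hypothetical helper, not a framework function.

```python
import numpy as np


def validate_features(features, expected_length, low=-1.0, high=1.0):
    """Return a list of human-readable problems found in a feature dict.

    Hypothetical helper: checks shape, NaN/inf values, and value range.
    """
    problems = []
    for name, values in features.items():
        arr = np.asarray(values, dtype=float)
        if arr.shape != (expected_length,):
            problems.append(f"{name}: shape {arr.shape}, expected ({expected_length},)")
        if not np.all(np.isfinite(arr)):
            problems.append(f"{name}: contains NaN or infinite values")
        # Range check only looks at the finite entries
        finite = arr[np.isfinite(arr)]
        if finite.size and (finite.min() < low or finite.max() > high):
            problems.append(f"{name}: values outside [{low}, {high}]")
    return problems


good = {"rocp": np.array([0.0, 0.01, -0.02, 0.03])}
bad = {"rocp": np.array([np.nan, 0.01, 2.5])}

good_report = validate_features(good, expected_length=4)
bad_report = validate_features(bad, expected_length=4)
```

Returning a list of messages rather than raising on the first problem makes it easier to see every failing feature in one test run.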
Common Feature Types
```mermaid
mindmap
  root((Feature Types))
    Price-Based
      Moving Averages
      Bollinger Bands
      Support/Resistance
      Pivot Points
    Momentum
      RSI
      MACD
      Stochastic
      ROC
    Volume-Based
      OBV
      Volume MA
      Price-Volume
      Chaikin Money Flow
    Volatility
      ATR
      Standard Deviation
      Bollinger Width
      Donchian Channel
    Pattern Recognition
      Fractals
      Candlestick Patterns
    Multi-Timeframe
      Resampled Indicators
      Trend Alignment
    External Data
      Fear and Greed Index
      Market Sentiment
```

Best Practices
- Normalize Features: Keep feature values in reasonable ranges (often [-1, 1])
- Handle Edge Cases: Use `np.nan_to_num()` to handle NaN values
- Use Batch Operations: Set features in batches with `set_features_batch()`
- Document Your Features: Explain what each feature represents
- Validate Inputs: Check for required columns and proper time indexing
- Optimize Performance: Avoid redundant calculations
- Test Thoroughly: Verify feature behavior with different market conditions
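
The first two practices combine naturally into one cleaning step. A minimal sketch, assuming NaN should become 0.0 and infinities should map to the range bounds (`clean_feature` is a hypothetical helper, and the right replacement values depend on the feature):

```python
import numpy as np


def clean_feature(values, low=-1.0, high=1.0):
    """Replace NaN with 0.0 and infinities with the bounds, then clip to [low, high]."""
    arr = np.asarray(values, dtype=float)
    arr = np.nan_to_num(arr, nan=0.0, posinf=high, neginf=low)
    return np.clip(arr, low, high)


raw = np.array([np.nan, 0.4, 3.0, -np.inf, -0.2])
cleaned = clean_feature(raw)
```

Replacing NaN with 0.0 hides indicator warm-up periods, so it may be worth checking how many leading values were filled before relying on the result.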
Example Feature Set Structure
```python
from ml_feature_set.feature_set import FeatureSet


class CustomFeatureSet(FeatureSet):
    """Feature set description"""

    @property
    def data_dependencies(self):
        """Define data dependencies"""
        return [...]

    def get_source_lookback_length(self, source_name):
        """Specify historical data requirements"""
        return ...

    def extract_feature(self):
        """Main feature extraction logic"""
        # Get data
        # Calculate features
        # Save features

    def generate_features(self, df, suffix=""):
        """Helper method for feature calculation"""
        features = {}
        # Calculate features
        return features

    def generate_labels(self):
        """Optional custom label generation"""
        # Calculate custom labels
        labels = ...  # placeholder for the computed labels
        return labels
```
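
To exercise a structure like this without the framework installed, one option is a small stand-in base class. Everything about `StubFeatureSet` below is an assumption made for illustration: the real `FeatureSet` constructor, feature storage, and data source format are not documented in this guide, and only the method names mirror the ones used above.

```python
class StubFeatureSet:
    """Stand-in base class for offline experiments; NOT the real FeatureSet."""

    def __init__(self, data_sources):
        self._data_sources = data_sources  # e.g. {"ohlcv_1x": {"data_df": ...}}
        self.features = {}

    def get_data_source(self, name):
        return self._data_sources[name]

    def set_features_batch(self, features):
        self.features.update(features)


class ReturnFeatureSet(StubFeatureSet):
    """Toy feature set: one-period return of the close, clipped to [-1, 1]."""

    @property
    def data_dependencies(self):
        return [{"source": "ohlcv", "resample_factors": [1], "is_primary": True}]

    def get_source_lookback_length(self, source_name):
        if source_name == "ohlcv_1x":
            return 2  # current and previous close
        raise ValueError(f"Unsupported data source: {source_name}")

    def extract_feature(self):
        close = self.get_data_source("ohlcv_1x")["data_df"]["close"]
        returns = [0.0]  # no previous bar for the first row
        for prev, cur in zip(close, close[1:]):
            returns.append(max(-1.0, min(1.0, (cur - prev) / prev)))
        self.set_features_batch({"return_1": returns})


# A plain dict of lists stands in for the DataFrame used by the framework
sources = {"ohlcv_1x": {"data_df": {"close": [100.0, 110.0, 99.0]}}}
feature_set = ReturnFeatureSet(sources)
feature_set.extract_feature()
```

Once the logic behaves as expected against the stub, the same methods can be moved onto a class that inherits from the real `FeatureSet`.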