Feature Set Development Guide: Keywords and Workflow

Source: Notion | Last edited: 2025-05-08 | ID: 1eb2d2dc-3ef...


This guide provides a visual overview of the key concepts and step-by-step workflow for creating new feature sets in the ml-feature-set framework.

```mermaid
graph TD
    FS[FeatureSet] --> |Base Class| CFS[CustomFeatureSet]
    CFS --> |Implements| DD[data_dependencies]
    CFS --> |Implements| SLL[get_source_lookback_length]
    CFS --> |Implements| EF[extract_feature]
    CFS --> |Optional| GL[generate_labels]
    DD --> |Defines| DS[Data Sources]
    DS --> |Has| RF[Resample Factors]
    DS --> |Has| IP[is_primary]
    EF --> |Uses| GDS[get_data_source]
    EF --> |Uses| GSC[get_source_column]
    EF --> |Uses| SFB[set_features_batch]
    EF --> |Uses| BFS[backfill_features_safely]
    subgraph "Core Components"
        FS
        DD
        SLL
        EF
        GL
    end
    subgraph "Data Access"
        GDS
        GSC
        DS
    end
    subgraph "Feature Management"
        SFB
        BFS
    end
    style FS fill:#f9f,stroke:#333,stroke-width:2px
    style CFS fill:#bbf,stroke:#333,stroke-width:2px
```
```mermaid
flowchart TD
    A[Start: Define Requirements] --> B[Create CustomFeatureSet Class]
    B --> C{Define data_dependencies}
    C --> D[Specify Data Sources]
    D --> E[Set Resampling Factors]
    E --> F[Implement get_source_lookback_length]
    F --> G[Implement extract_feature Method]
    G --> H[Get Data Sources]
    H --> I[Calculate Technical Indicators]
    I --> J[Process Data]
    J --> K[Store Features with set_features_batch]
    K --> L{Need Custom Labels?}
    L -- Yes --> M[Implement generate_labels]
    L -- No --> N[Use Default Labels]
    M --> O[Test Feature Set]
    N --> O
    O --> P[Validate Output]
    P --> Q[End: Feature Set Complete]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style Q fill:#9f9,stroke:#333,stroke-width:2px
```
  • Identify what market patterns or signals you want to capture
  • Determine what technical indicators or calculations are needed
  • Consider what time scales are relevant (1x, 4x, 12x, etc.)
```python
import numpy as np
import pandas as pd
import talib

from ml_feature_set.feature_set import FeatureSet


class CustomFeatureSet(FeatureSet):
    """
    Your feature set description here.

    Explain what market patterns this feature set captures
    and what indicators or techniques it uses.
    """

    @property
    def data_dependencies(self):
        """Return data source dependency information."""
        return [
            {"source": "ohlcv", "resample_factors": [1, 12], "is_primary": True},
            # Add other data sources if needed:
            # {"source": "fear_greed_index", "resample_factors": [1]},
        ]

    def get_source_lookback_length(self, source_name):
        """
        Get the required historical data length for a specific data source.

        Args:
            source_name: Data source name (e.g., "ohlcv_1x", "ohlcv_12x")

        Returns:
            Required historical data length for the source
        """
        # Strip the resample-factor suffix (e.g., "_12x") to get the base source
        parts = source_name.split("_")
        if len(parts) > 1 and parts[-1].endswith("x"):
            try:
                int(parts[-1].replace("x", ""))
                base_source = "_".join(parts[:-1])
            except ValueError:
                base_source = source_name
        else:
            base_source = source_name

        # Return the appropriate lookback length
        if base_source == "ohlcv":
            return 200  # Adjust based on your feature needs
        # Handle other data sources here
        raise ValueError(f"Unsupported data source: {source_name}")

    def extract_feature(self):
        """Extract features from data sources."""
        # Get data sources
        ohlcv_source = self.get_data_source("ohlcv_1x")
        df = ohlcv_source["data_df"].copy()

        # Validate data
        if "actual_ready_time" not in df.columns:
            raise ValueError("Data source missing 'actual_ready_time' column")

        # Ensure the time index is set correctly
        if not isinstance(df.index, pd.DatetimeIndex):
            df["actual_ready_time"] = pd.to_datetime(df["actual_ready_time"])
            df.set_index("actual_ready_time", inplace=True)

        # Extract data
        close = df["close"].values

        # Calculate features, clipped to [-1, 1]
        features = {}
        features["rocp"] = np.clip(talib.ROCP(close, timeperiod=1), -1, 1)
        # Add more features...

        # Save features
        self.set_features_batch(features)
```

6. Handle Multiple Time Scales (if needed)

```python
# Inside extract_feature, after computing features on the 1x data:
# For resampled data (e.g., 12x)
ohlcv_12x_source = self.get_data_source("ohlcv_12x")
df_12x = ohlcv_12x_source["data_df"].copy()

# Generate features for the 12x data
features_12x = self.generate_features(df_12x, suffix="_12x")

# Align with the original (1x) time scale
backfilled_features_12x = self.backfill_features_safely(
    df.index, df_12x.index, features_12x
)

# Combine and save features
all_features = {**features_original, **backfilled_features_12x}
self.set_features_batch(all_features)
```
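The idea behind aligning resampled features is to carry each coarse-grained value forward onto the fine-grained index without looking ahead. The framework provides `backfill_features_safely` for this; the sketch below (function name and integer timestamps are illustrative, not the framework's implementation) shows the underlying mechanics with NumPy:

```python
import numpy as np

def backfill_to_fine_index(fine_times, coarse_times, coarse_features):
    """For each fine-grained timestamp, carry forward the most recent
    coarse-grained feature value (never look ahead)."""
    # Index of the latest coarse timestamp <= each fine timestamp
    pos = np.searchsorted(coarse_times, fine_times, side="right") - 1
    out = {}
    for name, values in coarse_features.items():
        values = np.asarray(values, dtype=float)
        aligned = np.full(len(fine_times), np.nan)
        valid = pos >= 0  # fine timestamps before the first coarse bar stay NaN
        aligned[valid] = values[pos[valid]]
        out[name] = aligned
    return out

fine = np.array([0, 1, 2, 3, 4, 5])    # 1x timestamps
coarse = np.array([0, 3])              # resampled timestamps
aligned = backfill_to_fine_index(fine, coarse, {"ma_12x": [10.0, 20.0]})
# aligned["ma_12x"] → [10., 10., 10., 20., 20., 20.]
```

Each coarse value is repeated until the next coarse bar arrives, which is what keeps the alignment free of lookahead bias.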
  • Create test data sources
  • Instantiate your feature set
  • Call build_features with test data
  • Verify feature shapes, ranges, and absence of NaN values
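The verification step above can be sketched as a small helper (the function name is hypothetical; it assumes features are a dict of equal-length NumPy arrays, as produced by `extract_feature`):

```python
import numpy as np

def validate_features(features, expected_len):
    """Basic sanity checks for an extracted feature dictionary."""
    for name, values in features.items():
        values = np.asarray(values, dtype=float)
        # Every feature should cover the full time range
        assert values.shape == (expected_len,), f"{name}: bad shape {values.shape}"
        # No NaNs should survive extraction
        assert not np.isnan(values).any(), f"{name}: contains NaN"
        # Normalized features should stay within [-1, 1]
        assert np.abs(values).max() <= 1.0, f"{name}: out of range"

features = {"rocp": np.array([0.0, 0.1, -0.2, 0.05])}
validate_features(features, expected_len=4)  # passes silently
```

Run checks like these against several market regimes (trending, ranging, gappy data) before relying on a feature set downstream.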
```mermaid
mindmap
  root((Feature Types))
    Price-Based
      Moving Averages
      Bollinger Bands
      Support/Resistance
      Pivot Points
    Momentum
      RSI
      MACD
      Stochastic
      ROC
    Volume-Based
      OBV
      Volume MA
      Price-Volume
      Chaikin Money Flow
    Volatility
      ATR
      Standard Deviation
      Bollinger Width
      Donchian Channel
    Pattern Recognition
      Fractals
      Candlestick Patterns
    Multi-Timeframe
      Resampled Indicators
      Trend Alignment
    External Data
      Fear and Greed Index
      Market Sentiment
```
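As a concrete instance of a momentum feature from the map above, here is a simplified RSI in plain NumPy. It uses simple averages rather than Wilder's smoothing, so its values differ slightly from `talib.RSI`; it is an illustration of the category, not a drop-in replacement:

```python
import numpy as np

def simple_rsi(close, period=14):
    """Simplified RSI: average gain vs. average loss over a rolling window.
    (talib.RSI applies Wilder smoothing instead of simple averages.)"""
    close = np.asarray(close, dtype=float)
    delta = np.diff(close)
    rsi = np.full(close.shape, np.nan)  # warm-up region stays NaN
    for i in range(period, len(close)):
        window = delta[i - period:i]
        gain = window[window > 0].sum() / period
        loss = -window[window < 0].sum() / period
        rs = gain / loss if loss > 0 else np.inf
        rsi[i] = 100.0 - 100.0 / (1.0 + rs)
    return rsi
```

To use RSI as a normalized feature, map it from [0, 100] into [-1, 1] with `rsi / 50.0 - 1.0`, in line with the normalization practice below.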
  1. Normalize Features: Keep feature values in reasonable ranges (often [-1, 1])
  2. Handle Edge Cases: Use np.nan_to_num() to handle NaN values
  3. Use Batch Operations: Set features in batches with set_features_batch()
  4. Document Your Features: Explain what each feature represents
  5. Validate Inputs: Check for required columns and proper time indexing
  6. Optimize Performance: Avoid redundant calculations
  7. Test Thoroughly: Verify feature behavior with different market conditions
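Practices 1 and 2 are often combined into one post-processing step per feature. A minimal sketch (the helper name `clean_feature` is hypothetical, not a framework API):

```python
import numpy as np

def clean_feature(values, clip=1.0):
    """Clip a raw feature into [-clip, clip] and replace NaN/inf with safe values."""
    values = np.asarray(values, dtype=float)
    # NaN -> 0, +inf -> upper bound, -inf -> lower bound
    values = np.nan_to_num(values, nan=0.0, posinf=clip, neginf=-clip)
    return np.clip(values, -clip, clip)

raw = np.array([0.5, np.nan, 3.0, -np.inf])
cleaned = clean_feature(raw)
# cleaned → [0.5, 0.0, 1.0, -1.0]
```

Applying this once, just before `set_features_batch`, keeps the per-indicator code free of repetitive range handling.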
```python
class CustomFeatureSet(FeatureSet):
    """Feature set description"""

    @property
    def data_dependencies(self):
        """Define data dependencies"""
        return [...]

    def get_source_lookback_length(self, source_name):
        """Specify historical data requirements"""
        return ...

    def extract_feature(self):
        """Main feature extraction logic"""
        # Get data
        # Calculate features
        # Save features

    def generate_features(self, df, suffix=""):
        """Helper method for feature calculation"""
        features = {}
        # Calculate features
        return features

    def generate_labels(self):
        """Optional custom label generation"""
        # Calculate custom labels
        return labels
```