Touchstone Service Original Spec
Source: Notion | Last edited: 2025-01-21 | ID: 786c98b2-0a6...
System Overview
See EI-1050 for a description of the system’s goals and basic flow.
A note regarding scope: while the immediate motivation for this work is the testing of feature sets or model structures, we are essentially proposing a configurable automated model training environment in the cloud.
Requirements By Component
EL-nigma
- We’ll need the following enhancements to the training logic:
- Ability to train one model and exit. Currently it assumes “continuous training”.
- Ability to upload its results to permanent storage (S3).
- Ability to load a pluggable module given its path or name.
- Ability to override any part of the configuration file at invocation time (implementation note: this can also be done by writing a temporary configuration file per run and pointing el-nigma to it).
- A way to query a training job for liveness and progress status.
- A way to immediately abort and clean up a training task.
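The configuration-override approach suggested in the implementation note above can be sketched as follows. This is illustrative only: the config format (JSON) and the `--single-run`/`--config` flags are assumptions, not el-nigma's actual interface.

```python
import json
import tempfile

def build_run_command(base_config: dict, overrides: dict) -> list[str]:
    """Merge per-run overrides into the base config, write the result to a
    temporary config file, and return an el-nigma invocation pointing at it.

    The --single-run and --config flags are hypothetical placeholders for
    whatever el-nigma actually accepts after the enhancements above.
    """
    merged = {**base_config, **overrides}
    with tempfile.NamedTemporaryFile(mode="w", suffix=".json",
                                     delete=False) as f:
        json.dump(merged, f)
        config_path = f.name
    # --single-run: train one model and exit, instead of continuous training.
    return ["el-nigma", "--single-run", "--config", config_path]

cmd = build_run_command({"epochs": 10, "lr": 0.001}, {"lr": 0.01})
```

Each run gets its own temporary file, so concurrent tasks never share or clobber configuration state.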
DynamoDB
- Add a table `TouchstoneSubmissions` which stores metadata about module submissions:
- System (feature_set, model_structure, …)
- Submission ID
- Submitting user
- Identifier of the submitted content in storage (Amplify Storage, backed by S3)
- Configuration parameters updated by our engineer:
- Prediction intervals and other command-line parameters
- Configuration overrides
- Resource requirements
- Baseline model used for comparisons
- Status: submitted, accepted, rejected, queued, running, success, error
- Queue position (for queued submissions)
- Training result:
- Performance score
- Similarity score vs. our baseline
- Links to charts
- Timestamps:
- Submission created
- Submission sent to job queue
- Submission evaluation ended
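As a sketch of how the fields above might map onto a `TouchstoneSubmissions` item: the attribute names below are assumptions for illustration; the real schema is whatever the Amplify/DynamoDB models end up defining.

```python
from datetime import datetime, timezone

def new_submission_item(system: str, submission_id: str, user: str,
                        storage_key: str) -> dict:
    """Illustrative shape of a TouchstoneSubmissions record at creation time.

    Attribute names are hypothetical; fields under "config" and "result" are
    filled in later (by our engineer and by the training run, respectively).
    """
    return {
        "submission_id": submission_id,      # assumed partition key
        "system": system,                    # feature_set, model_structure, ...
        "submitted_by": user,
        "storage_key": storage_key,          # object in Amplify Storage (S3)
        "config": {
            "cli_params": {},                # prediction intervals, etc.
            "overrides": {},
            "resources": {},
            "baseline_model": None,
        },
        "status": "submitted",               # submitted/accepted/rejected/...
        "queue_position": None,
        "result": None,                      # score, similarity, chart links
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

item = new_submission_item("feature_set", "sub-001", "alice", "uploads/fs.py")
```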
- Add a table `TouchstoneTasks` which stores runtime information per training task:
- Task ID
- Submission ID that created the task
- Current status: queued, running, success, error
- Cloud resource identifier (e.g. instance ID)
- Job start timestamp
- Progress metrics:
- Percent done estimate
- Estimated time to finish
- Existing post-train results table:
- Detailed model results (drawdown, etc.) are saved to this table as usual.
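The task status values listed above imply a simple lifecycle. A sketch of the allowed transitions; the transition map itself is an assumption (the spec only names the states), so adjust it if, e.g., queued tasks can fail before launch differently:

```python
# Assumed TouchstoneTasks lifecycle: queued -> running -> success | error.
# Terminal states have no successors; queued -> error covers launch failures.
TRANSITIONS = {
    "queued": {"running", "error"},
    "running": {"success", "error"},
    "success": set(),
    "error": set(),
}

def advance(current: str, new: str) -> str:
    """Validate a status change before writing it back to DynamoDB."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Validating transitions in one place keeps the dispatcher and the admin UI from writing contradictory statuses.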
Result storage
The training script normally stores results as follows:
- Model evaluation (metrics and parameters) in DynamoDB and in MLflow
- Performance charts as MLflow artifacts
- The models themselves in S3
For the time being we will continue with this approach. However, for evaluation of a submitted module we need to be able to aggregate the results from all models trained using that module. To enable this, a new field `model_group_id` will be added to the DynamoDB records for models trained as part of a submission.
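The aggregation step this field enables might look like the following. In production the records would come from a DynamoDB query (e.g. a GSI on `model_group_id`); here they are plain dicts, and the `performance` field name is an assumption.

```python
from statistics import mean

def aggregate_group(records: list[dict], model_group_id: str) -> dict:
    """Summarize per-model results for all models trained from one submission.

    records: model-evaluation rows as dicts (in reality, DynamoDB items).
    """
    group = [r for r in records if r.get("model_group_id") == model_group_id]
    return {
        "model_group_id": model_group_id,
        "n_models": len(group),
        "mean_performance": (mean(r["performance"] for r in group)
                             if group else None),
    }

records = [
    {"model_group_id": "g1", "performance": 0.8},
    {"model_group_id": "g1", "performance": 0.6},
    {"model_group_id": "g2", "performance": 0.9},
]
summary = aggregate_group(records, "g1")
```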
EL-Admin and Amplify
- Add a new user group for feature set testing.
- Users in this group can see a new tab “Feature Sets”, containing:
- A form for submitting a new feature set for evaluation
- For previous submissions by this user, basic metadata and possibly current status and/or results.
- The feature set submission form uploads a file via Amplify’s Storage API and populates related metadata.
- Similarly, a tab and a form are available for submitting a new model structure implementation.
- EL-Admin administrators can:
- See metadata and current status for all submissions to the system
- Add or remove submissions from the evaluation queue
- Configure parameters for a submission
- Halt all running tasks belonging to a submission
Cloud resources for training
- We will use RunPod to run training jobs.
- We use their Pods architecture (not Serverless) at this time, as it’s a better fit for our training architecture, and the costs are more predictable. One pod = one running container.
- By default we use RTX 3090 or 4090 GPUs for training. These are available in RunPod’s Secure Cloud as well as Community Cloud, with the community version being 27% cheaper. We can use either cloud.
- We allocate on-demand instances (rather than spot), since our training logic has no interrupt/resume capability.
- RunPod offers 1-GPU pods and 8-GPU pods. There seem to be no cost savings associated with 8-GPU pods, so we may as well use 1-GPU pods. This has another advantage: we can use a new pod for every new task, simplifying pod management. (Note that pod usage is billed by the minute.)
- To control costs, we define a configurable limit on the number of pods running at any given time. For convenience, the maximum capacity should reside in DynamoDB.
- Management of the RunPod resources is done via Lambda functions that call RunPod’s Python API client.
Lambda
- Sanity check: when a new module file is submitted, this function performs basic validation and stores the result to DynamoDB.
- Job dispatcher: a scheduled Lambda, running once every 10-30 seconds, that dispatches queued training jobs to RunPod and manages the lifecycle of those jobs.
- The dispatcher uses the `TouchstoneTasks` table to determine current training status.
- If there are queued tasks and there is free capacity on RunPod: launch as many tasks as there is room for.
- If an engineer has requested that a task be terminated, call RunPod to terminate the task.
- The dispatcher uses the
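One scheduled run of the dispatcher could be sketched as below. This is a minimal sketch, assuming the task rows are plain dicts; `launch` and `terminate` are stand-ins for the RunPod API client calls, and the field names (`pod_id`, `terminate_requested`) are assumptions.

```python
def dispatcher_tick(tasks: list[dict], max_pods: int, launch, terminate) -> None:
    """One scheduled dispatcher run (sketch).

    tasks: TouchstoneTasks rows as dicts (in reality, read from DynamoDB).
    max_pods: configurable capacity limit (in reality, read from DynamoDB).
    launch/terminate: callables standing in for RunPod client calls.
    """
    # Honor engineer-requested terminations first.
    for t in tasks:
        if t["status"] == "running" and t.get("terminate_requested"):
            terminate(t["pod_id"])
            t["status"] = "error"

    # Fill remaining capacity with queued tasks, in queue order.
    free = max_pods - sum(1 for t in tasks if t["status"] == "running")
    queued = sorted((t for t in tasks if t["status"] == "queued"),
                    key=lambda t: t["queue_position"])
    for t in queued[:max(free, 0)]:
        t["pod_id"] = launch(t["task_id"])   # one pod per task
        t["status"] = "running"
```

Because each task gets its own 1-GPU pod, the capacity check reduces to counting running tasks against the configured pod limit.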