
Touchstone Service Original Spec

Source: Notion | Last edited: 2025-01-21 | ID: 786c98b2-0a6...


See EI-1050 for a description of the system’s goals and basic flow.

A note regarding scope: while the immediate motivation for this work is the testing of feature sets or model structures, we are essentially proposing a configurable automated model training environment in the cloud.

  • We’ll need the following enhancements to the training logic:
    • Ability to train one model and exit. Currently it assumes “continuous training”.
    • Ability to upload its results to permanent storage (S3).
    • Ability to load a pluggable module given its path or name.
    • Ability to override any part of the configuration file at invocation time (implementation note: this can also be done by writing a temporary configuration file per run and pointing el-nigma to it).
    • A way to query a training job for liveness and progress status.
    • A way to immediately abort and clean up a training task.
  • Add a table TouchstoneSubmissions which stores metadata about module submissions:
    • System (feature_set, model_structure, …)
    • Submission ID
    • Submitting user
    • Identifier of the submitted content in storage (Amplify Storage, backed by S3)
    • Configuration parameters updated by our engineer:
      • Prediction intervals and other command-line parameters
      • Configuration overrides
      • Resource requirements
      • Baseline model used for comparisons
    • Status: submitted, accepted, rejected, queued, running, success, error
    • Queue position (for queued submissions)
    • Training result:
      • Performance score
      • Similarity score vs. our baseline
      • Links to charts
    • Timestamps:
      • Submission created
      • Submission sent to job queue
      • Submission evaluation ended
  • Add a table TouchstoneTasks which stores runtime information per training task:
    • Task ID
    • Submission ID that created the task
    • Current status: queued, running, success, error
    • Cloud resource identifier (e.g. instance ID)
    • Job start timestamp
    • Progress metrics:
      • Percent done estimate
      • Estimated time to finish
  • Existing post-train results table:
    • Detailed model results (drawdown, etc.) are saved to this table as usual.
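The configuration-override enhancement described above (writing a temporary configuration file per run) can be sketched as follows. This is a minimal illustration, assuming a JSON configuration format and a shallow key merge; the actual el-nigma config format and invocation are not specified here.

```python
import json
import tempfile

def build_run_config(base_config: dict, overrides: dict) -> str:
    """Merge invocation-time overrides into the base configuration and
    write the result to a temporary file, returning its path.
    The training script would then be pointed at this file."""
    merged = {**base_config, **overrides}  # shallow merge; nested keys may need deep merging
    f = tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", prefix="touchstone-", delete=False
    )
    json.dump(merged, f, indent=2)
    f.close()
    return f.name

# Example: override the epoch count for a single evaluation run.
base = {"model": "baseline", "epochs": 100, "lr": 0.001}
path = build_run_config(base, {"epochs": 10})
with open(path) as fh:
    print(json.load(fh)["epochs"])  # 10
```

A per-run file keeps the base configuration immutable and makes each run reproducible from its own config artifact.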

The training script normally stores results as follows:

  • Model evaluation (metrics and parameters) in DynamoDB and in MLflow
  • Performance charts as MLflow artifacts
  • The models themselves in S3

For the time being we will continue with this approach. However, to evaluate a submitted module we need to aggregate the results from all models trained using that module. To enable this, a new field model_group_id will be added to the DynamoDB records of models trained as part of a submission.
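Aggregation by model_group_id could then look roughly like this. The record fields below (performance_score, etc.) are illustrative, not the actual DynamoDB schema.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_group(records: list[dict]) -> dict:
    """Group per-model result records by model_group_id and average the
    performance score across all models trained for a submission."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["model_group_id"]].append(rec["performance_score"])
    return {gid: mean(scores) for gid, scores in groups.items()}

# Illustrative records for two models trained from one submitted module.
records = [
    {"model_id": "m1", "model_group_id": "sub-42", "performance_score": 0.5},
    {"model_id": "m2", "model_group_id": "sub-42", "performance_score": 0.75},
]
print(aggregate_by_group(records))  # {'sub-42': 0.625}
```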
  • Add a new user group for feature set testing.
  • Users in this group can see a new tab “Feature Sets”, containing:
    • A form for submitting a new feature set for evaluation
    • A list of the user’s previous submissions, with basic metadata and, where available, current status and/or results.
  • The feature set submission form uploads a file via Amplify’s Storage API and populates related metadata.
  • Similarly, a tab and a form are available for submitting new model structure implementations.
  • EL-Admin administrators can:
    • See metadata and current status for all submissions to the system
    • Add or remove submissions from the evaluation queue
    • Configure parameters for a submission
    • Halt all running tasks belonging to a submission
  • We will use RunPod to run training jobs.
    • We use their Pods architecture (not Serverless) at this time, as it’s a better fit for our training architecture, and the costs are more predictable. One pod = one running container.
    • By default we use RTX 3090 or 4090 GPUs for training. These are available in RunPod’s Secure Cloud as well as Community Cloud, with the community version being 27% cheaper. We can use either cloud.
    • We allocate on-demand instances (rather than spot), since our training logic has no interrupt/resume capability.
  • RunPod offers 1-GPU pods and 8-GPU pods. There seem to be no cost savings associated with 8-GPU pods, so we may as well use 1-GPU pods. This has the added advantage that we can use a new pod for every task, simplifying pod management. (Note that pod usage is billed by the minute.)
  • To control costs, we define a configurable limit on the number of pods running at any given time. For convenience, the maximum capacity should reside in DynamoDB.
  • Management of the RunPod resources is done via Lambda functions that call RunPod’s Python API client.
  • Sanity check: a function that, when a new module file is submitted, performs basic validation and stores the result in DynamoDB.
  • Job dispatcher: a scheduled Lambda, running once every 10-30 seconds, that dispatches queued training jobs to RunPod and manages the lifecycle of those jobs.
    • The dispatcher uses the TouchstoneTasks table to determine current training status.
    • If there are queued tasks and there is free capacity on RunPod: launch as many tasks as there is room for.
    • If an engineer has requested that a task be terminated, call RunPod to terminate the task.
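The dispatcher cycle above can be sketched as follows. This is a minimal sketch: launch_task and terminate_task stand in for the actual RunPod pod creation/termination calls, and the task-record shape loosely mimics TouchstoneTasks rows; none of these names come from the RunPod API.

```python
def dispatch(tasks: list[dict], max_pods: int, launch_task, terminate_task) -> None:
    """One dispatcher cycle: honour termination requests, then launch
    queued tasks up to the configured pod capacity.

    `tasks` mimics rows of the TouchstoneTasks table; `max_pods` would be
    read from DynamoDB. launch_task/terminate_task are placeholders for
    the RunPod API calls."""
    # Terminate any running task an engineer has flagged for abort.
    for task in tasks:
        if task["status"] == "running" and task.get("terminate_requested"):
            terminate_task(task)
            task["status"] = "error"

    running = sum(1 for t in tasks if t["status"] == "running")
    free = max_pods - running

    # Launch as many queued tasks as there is capacity for (FIFO by queue position).
    queued = sorted(
        (t for t in tasks if t["status"] == "queued"),
        key=lambda t: t.get("queue_position", 0),
    )
    for task in queued[:max(free, 0)]:
        launch_task(task)
        task["status"] = "running"

# Example cycle with a capacity of 2 pods: one task is already running,
# so only the first queued task is launched.
tasks = [
    {"task_id": "t1", "status": "running"},
    {"task_id": "t2", "status": "queued", "queue_position": 1},
    {"task_id": "t3", "status": "queued", "queue_position": 2},
]
dispatch(tasks, max_pods=2, launch_task=lambda t: None, terminate_task=lambda t: None)
print([t["status"] for t in tasks])  # ['running', 'running', 'queued']
```

Because the cycle re-derives free capacity from the task table on every run, a crashed dispatcher invocation loses no state; the next scheduled run simply recomputes and continues.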