
Touchstone Service Original Spec

Source: Notion | Last edited: 2025-01-21 | ID: 786c98b2-0a6...


See EI-1050 for a description of the system’s goals and basic flow.

A note regarding scope: while the immediate motivation for this work is the testing of feature sets or model structures, we are essentially proposing a configurable automated model training environment in the cloud.

  • We’ll need the following enhancements to the training logic:
    • Ability to train one model and exit. Currently it assumes “continuous training”.
    • Ability to upload its results to permanent storage (S3).
    • Ability to load a pluggable module given its path or name.
    • Ability to override any part of the configuration file at invocation time (implementation note: this can also be done by writing a temporary configuration file per run and pointing el-nigma to it).
    • A way to query a training job for liveness and progress status.
    • A way to immediately abort and clean up a training task.
  • Add a table TouchstoneSubmissions which stores metadata about module submissions:
    • System (feature_set, model_structure, …)
    • Submission ID
    • Submitting user
    • Identifier of the submitted content in storage (Amplify Storage, backed by S3)
    • Configuration parameters updated by our engineer:
      • Prediction intervals and other command-line parameters
      • Configuration overrides
      • Resource requirements
      • Baseline model used for comparisons
    • Status: submitted, accepted, rejected, queued, running, success, error
    • Queue position (for queued submissions)
    • Training result:
      • Performance score
      • Similarity score vs. our baseline
      • Links to charts
    • Timestamps:
      • Submission created
      • Submission sent to job queue
      • Submission evaluation ended
  • Add a table TouchstoneTasks which stores runtime information per training task:
    • Task ID
    • Submission ID that created the task
    • Current status: queued, running, success, error
    • Cloud resource identifier (e.g. instance ID)
    • Job start timestamp
    • Progress metrics:
      • Percent done estimate
      • Estimated time to finish
  • Existing post-train results table:
    • Detailed model results (drawdown, etc.) are saved to this table as usual.
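The configuration-override enhancement described above (writing a temporary configuration file per run) can be sketched as follows. This is a minimal illustration, assuming a JSON configuration format and a shallow key merge; the actual el-nigma config format and invocation are not specified here.

```python
import json
import tempfile

def build_run_config(base_config: dict, overrides: dict) -> str:
    """Merge invocation-time overrides into the base configuration and
    write the result to a temporary file, returning its path.
    The training script would then be pointed at this file."""
    merged = {**base_config, **overrides}  # shallow merge; nested keys may need deep merging
    f = tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", prefix="touchstone-", delete=False
    )
    json.dump(merged, f, indent=2)
    f.close()
    return f.name

# Example: override the epoch count for a single evaluation run.
base = {"model": "baseline", "epochs": 100, "lr": 0.001}
path = build_run_config(base, {"epochs": 10})
with open(path) as fh:
    print(json.load(fh)["epochs"])  # 10
```

A per-run file keeps the base configuration immutable and makes each run reproducible from its own config artifact.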

The training script normally stores results as follows:

  • Model evaluation (metrics and parameters) in DynamoDB and in MLflow
  • Performance charts as MLflow artifacts
  • The models themselves in S3

For the time being we will continue with this approach. However, to evaluate a submitted module we need to aggregate the results from all models trained using that module. To enable this, a new field model_group_id will be added to the DynamoDB records of models trained as part of a submission.
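Aggregation by model_group_id could then look roughly like this. The record fields below (performance_score, etc.) are illustrative, not the actual DynamoDB schema.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_group(records: list[dict]) -> dict:
    """Group per-model result records by model_group_id and average the
    performance score across all models trained for a submission."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["model_group_id"]].append(rec["performance_score"])
    return {gid: mean(scores) for gid, scores in groups.items()}

# Illustrative records for two models trained from one submitted module.
records = [
    {"model_id": "m1", "model_group_id": "sub-42", "performance_score": 0.5},
    {"model_id": "m2", "model_group_id": "sub-42", "performance_score": 0.75},
]
print(aggregate_by_group(records))  # {'sub-42': 0.625}
```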
  • Add a new user group for feature set testing.
  • Users in this group can see a new tab “Feature Sets”, containing:
    • A form for submitting a new feature set for evaluation
    • A list of the user’s previous submissions, with basic metadata and, where available, current status and/or results.
  • The feature set submission form uploads a file via Amplify’s Storage API and populates related metadata.
  • Similarly, a tab and a form are available for submitting new model structure implementations.
  • EL-Admin administrators can:
    • See metadata and current status for all submissions to the system
    • Add or remove submissions from the evaluation queue
    • Configure parameters for a submission
    • Halt all running tasks belonging to a submission
  • We will use RunPod to run training jobs.
    • We use their Pods architecture (not Serverless) at this time, as it’s a better fit for our training architecture, and the costs are more predictable. One pod = one running container.
    • By default we use RTX 3090 or 4090 GPUs for training. These are available in RunPod’s Secure Cloud as well as Community Cloud, with the community version being 27% cheaper. We can use either cloud.
    • We allocate on-demand instances (rather than spot), since our training logic has no interrupt/resume capability.
  • RunPod offers 1-GPU pods and 8-GPU pods. There seem to be no cost savings associated with 8-GPU pods, so we may as well use 1-GPU pods. This has the added advantage that we can use a new pod for every task, simplifying pod management. (Note that pod usage is billed by the minute.)
  • To control costs, we define a configurable limit on the number of pods running at any given time. For convenience, the maximum capacity should reside in DynamoDB.
  • Management of the RunPod resources is done via Lambda functions that call RunPod’s Python API client.
  • Sanity check: a function that, when a new module file is submitted, performs basic validation and stores the result in DynamoDB.
  • Job dispatcher: a scheduled Lambda, running once every 10-30 seconds, that dispatches queued training jobs to RunPod and manages the lifecycle of those jobs.
    • The dispatcher uses the TouchstoneTasks table to determine current training status.
    • If there are queued tasks and there is free capacity on RunPod: launch as many tasks as there is room for.
    • If an engineer has requested that a task be terminated, call RunPod to terminate the task.
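The dispatcher cycle above can be sketched as follows. This is a minimal sketch: launch_task and terminate_task stand in for the actual RunPod pod creation/termination calls, and the task-record shape loosely mimics TouchstoneTasks rows; none of these names come from the RunPod API.

```python
def dispatch(tasks: list[dict], max_pods: int, launch_task, terminate_task) -> None:
    """One dispatcher cycle: honour termination requests, then launch
    queued tasks up to the configured pod capacity.

    `tasks` mimics rows of the TouchstoneTasks table; `max_pods` would be
    read from DynamoDB. launch_task/terminate_task are placeholders for
    the RunPod API calls."""
    # Terminate any running task an engineer has flagged for abort.
    for task in tasks:
        if task["status"] == "running" and task.get("terminate_requested"):
            terminate_task(task)
            task["status"] = "error"

    running = sum(1 for t in tasks if t["status"] == "running")
    free = max_pods - running

    # Launch as many queued tasks as there is capacity for (FIFO by queue position).
    queued = sorted(
        (t for t in tasks if t["status"] == "queued"),
        key=lambda t: t.get("queue_position", 0),
    )
    for task in queued[:max(free, 0)]:
        launch_task(task)
        task["status"] = "running"

# Example cycle with a capacity of 2 pods: one task is already running,
# so only the first queued task is launched.
tasks = [
    {"task_id": "t1", "status": "running"},
    {"task_id": "t2", "status": "queued", "queue_position": 1},
    {"task_id": "t3", "status": "queued", "queue_position": 2},
]
dispatch(tasks, max_pods=2, launch_task=lambda t: None, terminate_task=lambda t: None)
print([t["status"] for t in tasks])  # ['running', 'running', 'queued']
```

Because the cycle re-derives free capacity from the task table on every run, a crashed dispatcher invocation loses no state; the next scheduled run simply recomputes and continues.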