Ron’s Project Details Q1 2025

Source: Notion | Last edited: 2025-07-07 | ID: 2292d2dc-3ef...


Project: Model training orchestration using in-house servers

In 2024 we built a system named Touchstone Service to orchestrate machine-learning model training tasks using cloud resources. The goal was to scale up the company’s ability to train new models and test different approaches, both in terms of the amount of hardware we can throw at the problem, and in terms of the number of engineers who have access to that hardware at a given time.

Later in the year it became evident that we would benefit from similar orchestration of the company’s in-house GPU servers - machines purchased for the purpose of training models. At that time, those machines were managed manually by our engineers, which was wasteful in terms of computing power as well as engineer time.

With regard to adding orchestration capability for in-house machines, we identified the following requirements:

  • A single system should dispatch jobs to either cloud or in-house servers. This is in order to maximize the return on investment in in-house servers, while also allowing quick cloud scale-out when needed.
  • The in-house servers should not be open to inbound connections from outside, in order to minimize the risk of security breaches.
  • Because in-house servers may occasionally still be needed for manual engineering work, we need a safe and efficient method of attaching and detaching servers to/from the system.
  • Since in-house server capacity does not scale arbitrarily, the system needs to manage the available capacity. That means not dispatching more tasks than a server can handle, and also ensuring that unused server capacity is reclaimed effectively.

Discussion of available technologies and standard practices

While the problem of distributed task orchestration is well-studied, and multiple technologies exist to address it, there was no solution that we could simply slot in that would fulfil the requirements. Most orchestration systems require inbound network access, adding complexity, risk, and cost (due to the need for static IPs) to our compute infrastructure. Remote-control systems that bypass this requirement, such as TeamViewer or SplashTop, are designed for GUI-mode access rather than automation, and carry their own security risks.

One technology that we were able to leverage easily is Docker containerization. The Touchstone system already relied on containerized tasks, and running in containers in-house is an effective way to ensure tasks are self-contained and easily cleaned up. Although container orchestration systems like Kubernetes are able to run in-house, they don’t solve the issue of secure connectivity, and are not designed to share resources with manual engineer access.

Systematic Investigation & Experimental Development

To allow remote control without inbound connections, we developed minimal client software, called touchstone agent, that runs on each in-house server, subscribes to a pub-sub queue (specifically, an AWS SQS FIFO queue), executes commands coming in via the queue, and posts the results to a second queue. While this was enough to run some simple commands issued by the Touchstone backend service on an in-house server, several more iterations were needed to make the protocol robust enough for real-world use. The metric used to evaluate performance was simply how often the service crashes or freezes while running a real-world task.
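The agent's core loop can be sketched roughly as follows. This is a hypothetical illustration, assuming a boto3 SQS client and a simple JSON message shape (`request_id` plus an `argv` command array); the queue URLs and message format of the actual Touchstone protocol are not shown here.

```python
import json
import subprocess

def poll_once(sqs, request_queue_url, response_queue_url):
    """Receive one command from the request queue, run it locally,
    and post the result to the response queue (illustrative sketch)."""
    resp = sqs.receive_message(
        QueueUrl=request_queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling avoids busy-waiting on an empty queue
    )
    for msg in resp.get("Messages", []):
        command = json.loads(msg["Body"])
        # Execute the command; no inbound connection is ever needed,
        # since the agent pulls work from the queue.
        result = subprocess.run(command["argv"], capture_output=True, text=True)
        sqs.send_message(
            QueueUrl=response_queue_url,
            MessageBody=json.dumps({
                "request_id": command["request_id"],
                "returncode": result.returncode,
                "stdout": result.stdout,
            }),
            MessageGroupId="agent",  # FIFO queues require a message group ID
            MessageDeduplicationId=command["request_id"],
        )
        # Delete only after the response is posted, so a crash
        # mid-processing causes the command to be redelivered.
        sqs.delete_message(
            QueueUrl=request_queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
```

Because the agent only ever initiates outbound HTTPS connections to SQS, the in-house servers need no open inbound ports at all.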

The backend and the agent need to stay in sync, detect out-of-sync situations, and recover back to a synced state. Out-of-sync situations arise primarily when either the backend or the agent experience an error and need to reset. We eventually ended up with the following enhancements:

  • Multiple pairs of queues per server are used to support commands that arrive in parallel from different functions of the backend.
  • Unique request IDs are used to ensure that a given response from the agent matches a given request from the backend.
  • If an unexpected request ID is encountered, the request and response queues are drained, so that the protocol can start from a clean slate.
  • In addition to text-based input and output, arbitrary data files can be transferred over the protocol.

To support capacity management and manual engineering work, we augmented the Touchstone service’s data model with a maximal task count for each in-house server, and implemented a cleanup process that removes any unrecognized running tasks. The containerized implementation ensured that this cleanup process is robust and complete. Setting the maximal task count to zero makes the machine available for engineer use once all running tasks have completed.
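The capacity-management and cleanup decision described above can be sketched as a pure function. The container-name inputs and the task registry are illustrative assumptions, not the actual Touchstone data model; the point is the rule: unrecognized containers are removed, and new dispatches are capped by the per-server task limit.

```python
def plan_dispatch_and_cleanup(running, recognized, max_tasks):
    """Decide which containers to remove and how many new tasks fit.

    running     -- names of containers currently running on the server
    recognized  -- names of tasks the backend knows it dispatched there
    max_tasks   -- per-server task limit; 0 reserves the server for
                   manual engineer use once running tasks complete
    """
    # Unrecognized containers are leftovers from crashed or orphaned
    # tasks; removing them reclaims unused capacity.
    to_remove = [name for name in running if name not in recognized]
    surviving = len(running) - len(to_remove)
    # Never dispatch more tasks than the server's remaining capacity.
    free_slots = max(0, max_tasks - surviving)
    return to_remove, free_slots
```

In practice the removal step would be carried out on containerized tasks (e.g. by force-removing the listed containers), which is what makes the cleanup robust and complete.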