One of Martian's customers trains millions of customer service employees, helping them handle customer requests through training simulations. These simulations were previously created entirely by hand. When the company evaluated LLM alternatives to manual creation, the quality was insufficient to replace human authors. Martian was tasked with building an end-to-end system capable of producing responses of high enough quality that the manual labor was no longer needed.
In the POC, Martian worked with this customer to define quality metrics for generated customer service responses that allowed automated scoring. These metrics were then used to tune Martian's router until its output was preferred over GPT-4's by users 79.2% of the time, at 300x lower cost than GPT-4.
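To make this concrete, here is a minimal sketch of what an automated scoring harness of this kind could look like. The criteria and the `judge` function below are illustrative assumptions, not the actual metrics, which were defined jointly with the customer and aren't described here.

```python
from typing import Callable

# Hypothetical rubric: the real criteria were defined with the customer.
CRITERIA = ["resolves the request", "factually consistent", "appropriate tone"]

def automated_quality_score(response: str,
                            judge: Callable[[str, str], float]) -> float:
    """Average per-criterion scores (each 0.0-1.0) into one quality number.

    `judge` stands in for any scoring function -- e.g. an LLM-as-judge
    prompt or a trained classifier -- that rates `response` against a
    single criterion.
    """
    scores = [judge(response, criterion) for criterion in CRITERIA]
    return sum(scores) / len(scores)
```

A scalar score like this is what makes automated tuning possible: router candidates can be compared on thousands of responses without a human in the loop.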
We then optimized the speed of the solution as well. The initial version of the system ran at 13 tokens/second; by dynamically selecting the best providers, we increased throughput to 113 tokens/second.
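As an illustration, dynamic provider selection can be as simple as tracking observed throughput per provider and preferring the fastest one. The `ThroughputTracker` class and provider names below are hypothetical, shown only to convey the idea, not Martian's actual implementation.

```python
from collections import defaultdict

# Hypothetical providers; in practice these would be different
# hosting services serving the same underlying model.
PROVIDERS = ["provider_a", "provider_b", "provider_c"]

class ThroughputTracker:
    """Tracks a running average of observed tokens/second per provider."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])  # provider -> [sum_tps, count]

    def record(self, provider: str, tokens: int, seconds: float) -> None:
        tps = tokens / max(seconds, 1e-6)
        entry = self.totals[provider]
        entry[0] += tps
        entry[1] += 1

    def best_provider(self) -> str:
        measured = {p: s / n for p, (s, n) in self.totals.items() if n > 0}
        if not measured:
            # No measurements yet: fall back to a default choice.
            return PROVIDERS[0]
        return max(measured, key=measured.get)
```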
How we achieved better cost and performance with routing

With routing, we use multiple models instead of a single one, sending each question to the lowest-cost model that can answer it correctly. For example, for this company's conversational customer service application, here's the breakdown of the models we route to:
It's easy to see why this reduces cost and latency: we're routing to models that are much less expensive and much faster than GPT-4. But if that's the case, how can we also achieve better performance? Isn't GPT-4's higher cost and lower speed the price of better quality?
The key is that we've identified which requests each model is best at, and we route an individual request to a model only when that model performs well on that kind of request. Although GPT-4 has the best mean performance (i.e., averaged across all requests), it's not necessarily the best at every individual request. By identifying the best model for each individual request, we can achieve higher performance at lower cost and higher speed.
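Here is a minimal sketch of this routing rule, under the assumption that we have some learned `quality_predictor` estimating how well each model handles a given request. The predictor, model names, prices, and threshold are all illustrative, not Martian's actual router.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float  # USD; illustrative numbers only

def route(request: str,
          models: list[Model],
          quality_predictor: Callable[[str, str], float],
          threshold: float = 0.9) -> Model:
    """Pick the cheapest model predicted to handle the request well enough."""
    for model in sorted(models, key=lambda m: m.cost_per_1k_tokens):
        if quality_predictor(model.name, request) >= threshold:
            return model
    # If nothing clears the bar, fall back to the strongest (most
    # expensive) model rather than serving a low-quality answer.
    return max(models, key=lambda m: m.cost_per_1k_tokens)
```

Sorting by cost and taking the first model that clears the quality bar is what lets the cheapest adequate model win; the threshold is the knob that trades cost savings against quality risk.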
Another key property of this system is resilience: if a single model goes down, a provider needs to be swapped out, or newer, more performant models are released, the router can switch to a new model to make up the difference.
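One way such failover could look in code, assuming each provider exposes a `complete()` call (a hypothetical interface, shown only to illustrate the pattern):

```python
import logging

class AllProvidersFailed(Exception):
    """Raised when every provider in the preference list has failed."""

def complete_with_failover(request: str, providers: list) -> str:
    """Try providers in preference order, falling back on any failure."""
    for provider in providers:
        try:
            return provider.complete(request)  # hypothetical client interface
        except Exception as exc:  # timeouts, outages, rate limits, etc.
            logging.warning("provider %s failed: %s; trying next", provider, exc)
    raise AllProvidersFailed("no provider could serve the request")
```

Because the router already knows several models that can serve each request, failover is just routing with the unavailable option removed.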