Accelerating RAG Apps

Martian's multi-model routing transformed a major search engine's RAG application, cutting costs while delivering faster responses and higher accuracy than the company's previous state-of-the-art LLM.
GOALS

→ Prove value in the application of AI (reduce costs by 50%)

→ Maintain user experience

→ Latency better than their prior solution (52.07 tok/s)

→ Answer quality at least as good as GPT-3.5-turbo

RESULTS

→ 67% reduction in cost
→ 62.33 tok/s (+19.7% throughput)
→ 2% higher accuracy

DETAILS

One of Martian's customers is a search engine handling millions of requests per day. To provide a better experience for users, they let an LLM synthesize answers to questions from search results. This is a quintessential RAG application: external data is placed in an LLM's context so that its answers can be more accurate and up-to-date. Other RAG applications include tools that interface with proprietary data sources (such as a company's internal documents or Slack channels), tools that help synthesize answers for customer service, and AI applications that depend on the news.
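
The basic pattern is simple. Below is a minimal sketch of the RAG loop described above, assuming the OpenAI Python client; the `search()` function is a hypothetical stand-in for whatever retrieval layer the application uses:

```python
# A minimal RAG sketch. `search()` is a hypothetical stand-in for the
# search engine's own retrieval layer.
from openai import OpenAI

client = OpenAI()

def search(query: str) -> list[str]:
    """Placeholder: return relevant text snippets from the search index."""
    raise NotImplementedError

def answer(query: str) -> str:
    # Retrieve external data and place it in the model's context window
    context = "\n\n".join(search(query))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer the question using only these search results:\n\n"
                        + context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```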

This organization originally used GPT-3.5-turbo (OpenAI's fastest and cheapest model) for all the API calls in their application. At their volume and for their use case, however, even GPT-3.5-turbo was expensive and slow. Martian improved the application by routing each request to the most suitable of multiple models.
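
Conceptually, a router picks a model per request rather than sending everything to one model. The sketch below is a simplified illustration of the idea, not Martian's actual router: `predict_difficulty()` is a hypothetical scoring function, and the cheaper model names are made up; a production system learns this mapping from data.

```python
# Simplified multi-model routing: send easy requests to a cheap, fast
# model and hard ones to a stronger one. Illustrative only.
def predict_difficulty(messages: list[dict]) -> float:
    """Placeholder: estimate how hard this request is, in [0, 1]."""
    raise NotImplementedError

MODELS = [
    # (difficulty ceiling, model) -- cheapest model that can handle it
    (0.4, "small-fast-model"),   # hypothetical
    (0.8, "mid-tier-model"),     # hypothetical
    (1.0, "gpt-3.5-turbo"),
]

def route(messages: list[dict]) -> str:
    difficulty = predict_difficulty(messages)
    for ceiling, model in MODELS:
        if difficulty <= ceiling:
            return model
    return MODELS[-1][1]
```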

Another advantage of Martian's system is that it future-proofs the search provider's LLM deployment and improves over time: it automatically incorporates new models, inference methods, and other advances as they are released, and it learns from data as more requests flow through it.
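
As one illustration of learning from data (again a sketch, not Martian's algorithm), a router can log the outcome of each request and keep running per-model statistics that sharpen its choices over time:

```python
# Illustrative feedback loop: running quality statistics per model.
# A real system would also balance exploration, cost, and latency.
from collections import defaultdict

stats = defaultdict(lambda: {"n": 0, "quality": 0.0})

def log_outcome(model: str, quality_score: float) -> None:
    s = stats[model]
    s["n"] += 1
    # Incremental running-mean update
    s["quality"] += (quality_score - s["quality"]) / s["n"]

def best_model(candidates: list[str]) -> str:
    # Prefer the candidate with the highest observed quality so far
    return max(candidates, key=lambda m: stats[m]["quality"])
```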

For example, let's dive deeper into model latency. On average, Martian completed requests 19.7% faster than the company's prior solution. But the overall distribution tells an even more compelling story: the median OpenAI latency actually falls in the worst 27.5% of latencies from the router.
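
A comparison like that can be computed directly from per-request throughput logs. The sketch below is illustrative; the function name and data sources are assumptions:

```python
# Illustrative check of the distribution claim: find what fraction of the
# router's requests were no faster than the baseline's median throughput.
import numpy as np

def share_at_or_below_baseline_median(router_tps: np.ndarray,
                                      baseline_tps: np.ndarray) -> float:
    baseline_median = np.median(baseline_tps)
    return float(np.mean(router_tps <= baseline_median))

# If this returns ~0.275, the baseline's median sits at the router's 27.5th
# percentile -- i.e. in the router's worst 27.5% of requests.
```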

Over time, the router's distribution will shift further to the right with additional optimization. Routing already achieves 100+ tok/s on some requests (better than the single model ever achieved), and it will hit that mark more consistently as it learns from additional data on this use case. The same holds for the cost reduction and quality improvements Martian provides.