Unveiling the Magic: Scaling Large Language Models to Serve Millions
Ever wondered what happens behind the curtain when you chat with AI like OpenAI’s ChatGPT or Anthropic’s Claude?
Imagine walking into a massive library where every book can converse with you, answer your questions, and even tell stories on demand. This library needs to serve millions of visitors simultaneously, each requesting different books at different times. How does the library manage this colossal task without making you wait hours for a response? Hosting large language models (LLMs) for millions of users is a similarly herculean endeavor.
Recently, I ventured into this topic myself, working on serving open-source LLMs such as Llama 3 and Phi at scale, much like AWS Bedrock does. I’m here to pull back the curtain and share insights on how you, too, can build and scale these models.
From User to Provider: A New Perspective
As users, we’re accustomed to the simplicity of accessing AI models: we have a URL, an authentication key, and a selection of models to choose from. It’s like ordering a meal at a restaurant without knowing the flurry of activity happening in the kitchen. But what does it take to prepare that meal — or in this case, to serve those AI responses?
When integrating LLMs into applications, we often prefer a pay-per-use model over subscription limitations. This flexibility demands a robust backend capable of handling numerous requests efficiently. So, what happens behind the scenes? How are requests routed to the right model? How do we build an OpenAI-compatible server? And crucially, how do we scale our models to handle increasing loads?
I’ve identified several key challenges to address:
Model Acquisition and Caching: How to obtain models from a registry and efficiently cache them.
Building an OpenAI-Compatible Server and Inference: Creating a server that processes user input, interacts with the model, and returns responses seamlessly.
Authentication and Authorization: Ensuring secure access to the models.
Usage Tracking for Billing: Counting prompt and completion tokens for accurate pay-per-use billing.
Autoscaling Models and Distributed Inference: Dynamically adjusting resources based on user demand.
Hosting Multiple Models: Providing various models under a single domain.
Let’s dive into each of these challenges, explore the strategies I found effective, and discuss potential solutions you can adopt.
1. Model Registry and Caching: The Art of Efficient Storage
Imagine large language models (LLMs) as massive encyclopedias stored in a vault. Accessing them quickly requires a highly organized system. For smaller models, storage in a database may be sufficient. But with much larger models — ranging from 30 GB to over 200 GB — traditional databases fall short. Managing such vast data demands specialized solutions, such as object storage systems like Amazon S3 or Google Cloud Storage. These systems efficiently handle large files and integrate smoothly with content delivery networks (CDNs) for faster access.
However, downloading large models each time a server starts can feel like retrieving that massive encyclopedia from a distant warehouse whenever it’s needed. To optimize access, several approaches can be used. Caching models locally on disk reduces download times for future uses. Network File Systems (NFS) enable shared storage accessible to multiple servers, while Container Storage Interface (CSI) drivers allow direct mounting of object storage to containers for seamless, on-demand access.
I would opt for a stateless environment where models are downloaded fresh with each deployment. Thanks to high-speed networks (up to 1000 MB/s), downloading even the largest models takes only a few minutes. The primary bottleneck is often not the download itself but loading the model into GPU memory. This approach offers simplicity and flexibility, especially when deploying on Kubernetes without dynamic persistent volume claims (PVCs). By treating models as transient data, we avoid the complexities of managing stateful storage across multiple nodes, making deployment and scaling much easier.
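To make the stateless approach concrete, here is a minimal Python sketch of pulling a model’s weight files from object storage into a local cache only when they are missing. The bucket name, prefix layout, and cache path are assumptions for illustration, and boto3 credentials are expected to be configured already.

from pathlib import Path

import boto3

BUCKET = "llm-models"           # hypothetical bucket holding model weights
CACHE_DIR = Path("/models")     # local disk cache, e.g. an emptyDir or ephemeral volume

def fetch_model(model_id: str) -> Path:
    """Download every object under <model_id>/ unless it is already cached on disk."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{model_id}/"):
        for obj in page.get("Contents", []):
            local_file = CACHE_DIR / obj["Key"]
            # Skip files that already exist locally with the expected size
            if local_file.exists() and local_file.stat().st_size == obj["Size"]:
                continue
            local_file.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], str(local_file))
    return CACHE_DIR / model_id

weights_path = fetch_model("meta-llama/Meta-Llama-3-8B-Instruct")

The same pattern works with Google Cloud Storage or any S3-compatible store; in a fully stateless setup the cache simply lives and dies with the pod.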
2. Building the Inference Service: Powering Real-Time Responses
Serving large language models (LLMs) is like orchestrating a symphony where each musician plays a unique part in perfect harmony. To deliver real-time responses, inference requests are executed on AI accelerators such as NVIDIA’s A100s and H100s, AMD’s MI300, or specialized chips like AWS Inferentia.
The Concurrency Challenge
With countless users sending requests simultaneously, managing each individually can strain resources. A common solution is batching, where multiple requests are processed together in a single model pass. However, this approach introduces certain challenges. Latency can become an issue, as waiting to gather enough requests for a batch delays the response time, especially for the first token. Additionally, when requests of varying lengths are mixed in a batch, shorter tasks are held up by longer ones, leading to inefficiencies.
The Continuous Batching Solution
To balance these demands, continuous batching breaks down the generation process into smaller iterations, enabling new requests to join ongoing processing with minimal delay. This method functions like an assembly line, allowing different stages of processing to handle tasks concurrently.
Throughput vs. Latency
Increasing batch sizes can enhance overall throughput, but it may also slow down response times for individual requests. The key is to find a balance where GPU resources are optimally utilized without compromising the user experience.
Choosing the Right Tools
Many inference engines are available, each with distinct strengths. After extensive benchmarking, and with extensibility for future model additions in mind, I would select vLLM as the inference engine (over NVIDIA’s TensorRT, Ollama, or any other engine). It has several advantages: strong community support and continuous enhancements, competitive performance, extensibility for easy model integration, and OpenAI compatibility, offering an OpenAI-like server interface for seamless interactions.
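To illustrate what this looks like in practice, here is a minimal sketch of serving a model with vLLM’s OpenAI-compatible server and querying it with the standard openai Python client. The model name, port, and API key are placeholders, not a fixed recipe.

# Start the server (shell):
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000
# vLLM then exposes OpenAI-compatible endpoints such as /v1/chat/completions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
print(response.usage)  # prompt_tokens and completion_tokens as reported by the server

Because the interface mirrors OpenAI’s API, existing client code and SDKs work against the self-hosted endpoint without modification.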
3. Authentication and Authorization: Guarding the Gates
Imagine our AI service as a grand museum filled with priceless artifacts — each AI model is an exhibit of immense value. Authentication serves as the ticket allowing visitors to enter, while authorization determines which exhibits they can access. Managing this system at scale is akin to ensuring that thousands of visitors with different tickets can enjoy the museum seamlessly, without overcrowding or security breaches.
The Role of JSON Web Tokens (JWTs)
JSON Web Tokens (JWTs) act like personalized tickets that not only grant entry but also specify which exhibits a visitor can view. They are ideal for high-scale environments for several reasons. JWTs carry self-contained information, meaning each token holds all necessary details — like a ticket listing the exhibits a visitor can see. This self-sufficiency eliminates the need for museum staff (servers) to check a central database every time a visitor approaches a new exhibit. Moreover, JWTs allow fast verification; with all information stored in the token, guards (servers) can quickly authenticate it without causing bottlenecks, a necessity when managing high visitor flow. JWTs also offer scalability, being stateless, meaning servers don’t need to retain any session information. This is ideal for services requiring horizontal scaling, where more servers can be added to handle increased load without complex synchronization.
A challenge with JWTs, however, lies in token invalidation. Since they are self-contained and stateless, revoking a token (like invalidating a ticket if a visitor misbehaves) isn’t straightforward; once issued, a JWT remains valid until it expires.
Addressing Token Invalidation
Managing token invalidation is like updating access rights for a visitor already inside the museum. There are several approaches to handling this. One option is to issue short-lived tokens with brief expiration times, such as 15 minutes, so that any changes in access rights naturally propagate as old tokens expire. However, this would mean frequent reissuing of tickets, creating hassle for both visitors and staff. Alternatively, a token blacklist could be maintained, listing tokens that have been revoked; every time a visitor attempts to access an exhibit, the guard checks if their ticket is blacklisted. Though effective, this approach reintroduces statefulness and requires a centralized database, which could become a performance bottleneck. Another method is token whitelisting, which involves maintaining a list of valid tokens. This is akin to a guest list at an exclusive event, where only those listed are granted access. While database lookups are necessary, this approach provides tighter control over access.
I would choose a token whitelist for several reasons. First, user control over token lifespan is essential: users want the ability to configure a token’s time-to-live (TTL), including tokens that never expire. This rules out short-lived tokens. Additionally, authorization should be managed separately from authentication, meaning that even with a valid token, a database lookup determines the user’s resource access. This approach provides greater flexibility in managing permissions dynamically.
By maintaining a whitelist, we can ensure that only tokens explicitly marked as valid can access the service, adding a security layer and allowing for easy token invalidation by simply removing tokens from the whitelist.
Balancing Security and Performance
While implementing a token whitelist introduces overhead due to required database lookups, we can mitigate this by using caching. Frequently accessed tokens are cached in memory, reducing database load and improving performance.
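As a rough sketch of how this can look in code, the following Python snippet verifies a JWT with PyJWT and then checks a whitelist keyed by the token’s jti claim in Redis, with a short-lived in-process cache in front of the database. The signing key, Redis key layout, and cache TTL are assumptions for illustration.

import time

import jwt      # PyJWT
import redis

SECRET = "replace-with-your-signing-key"    # assumption: HS256 symmetric signing key
r = redis.Redis(host="localhost", port=6379)
_cache: dict[str, float] = {}               # token -> time until the cached approval expires
CACHE_TTL = 60                              # seconds to trust a cached whitelist lookup

def authorize(token: str) -> dict:
    """Verify the JWT signature, then confirm the token is still whitelisted."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])   # raises if invalid or expired
    now = time.time()
    if _cache.get(token, 0) > now:
        return claims                        # recently confirmed, skip the database
    # Whitelist lookup: the token id (jti) must still be present to be accepted
    if not r.exists(f"token-whitelist:{claims['jti']}"):
        raise PermissionError("Token has been revoked")
    _cache[token] = now + CACHE_TTL
    return claims

Revoking a token is then a single delete of its whitelist entry; the short cache window bounds how long a revoked token can still slip through.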
4. Providing Pay Per Use: Tracking Usage Accurately
Imagine our AI service as a utility company supplying electricity. Just as an electric meter measures consumption for accurate billing, we need to track each user’s usage of the AI models with precision. This is crucial for a pay-per-use model, where transparency and accuracy build customer trust.
Implementing Token Counting
Accurate usage tracking requires counting the prompt tokens (user input) and completion tokens (model output) for each request. Although this seems straightforward, it is actually complex in practice.
Challenges in Token Counting
One approach is to integrate token counting within the inference server itself, logging token usage directly. However, this approach ties the counting mechanism to the server, reducing modularity and making future component swaps more challenging. Another option is to implement middleware to intercept and tally tokens, but this introduces its own set of challenges. For example, some clients request AI models to stream responses token by token, as if in a live conversation. To intercept and count tokens in this scenario, the middleware needs to parse and reassemble the stream without introducing latency or errors. Additionally, some clients use customized request and response formats, or custom schemas, which the middleware must understand and parse correctly, adding complexity to the implementation.
Solution: A Custom HTTP Proxy
To maintain independence between components and allow for future flexibility, we can implement a custom HTTP proxy for token counting. This solution effectively addresses the challenges. The proxy operates independently of the inference server, which means we can swap out the server (for example, switching to Ray) without affecting the token counting mechanism. It also handles parsing for both incoming requests and outgoing responses to ensure accurate token counting, and it manages different content types and adapts to various client specifications with ease. For streaming responses, the proxy inspects chunks as they are streamed, counts the tokens they contain, and forwards them to the client without adding noticeable delay.
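For clarity, here is a simplified Python sketch of the streaming part: it relays an OpenAI-style server-sent-event stream unchanged while accumulating the generated text, then counts completion tokens with the model’s tokenizer. It assumes the tokenizer is available via Hugging Face transformers; the production proxy described in the next section is a Rust PoC built on Pingora, but the logic is the same.

import json

from transformers import AutoTokenizer

# Assumption: the model's tokenizer can be downloaded (may require Hugging Face access)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

async def count_and_forward(upstream_lines, send_to_client) -> int:
    """Relay an OpenAI-style SSE stream unchanged while tallying completion tokens."""
    completion_text = ""
    async for line in upstream_lines:        # e.g. lines from httpx's aiter_lines()
        await send_to_client(line)           # forward immediately; never hold back the stream
        if not line.startswith("data:") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data:"):])
        delta = chunk["choices"][0].get("delta", {})
        completion_text += delta.get("content") or ""
    return len(tokenizer.encode(completion_text))   # completion tokens to record for billing

Prompt tokens can be counted the same way from the request body before it is forwarded upstream.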
Overcoming the Proxy Challenges
Implementing a proxy with these responsibilities presented a few significant challenges, which we addressed by focusing on performance and reliability. Introducing a proxy can add latency, so we optimized performance by choosing a high-performance language such as Go or Rust, implementing asynchronous I/O to manage many concurrent connections efficiently, and using efficient parsing libraries to minimize processing time. I implemented a proof of concept with Pingora called GenAI Gateway.
Benefits of this Approach
By investing in a robust proxy solution, we can achieve several advantages. The architecture is modular, allowing components to be developed, updated, or replaced independently. The solution is also flexible, supporting various client requirements without overcomplicating the inference server itself. Moreover, precise token counting enables accurate billing, fostering trust with customers and ensuring they are charged fairly for their usage.
5. Autoscaling Models: Adapting to the Tides of Demand
Serving large language models (LLMs) is akin to managing a fleet of ships, where the number of passengers can fluctuate unpredictably. Some days, docks are overwhelmed with travelers, while on others, ships sail nearly empty. To operate efficiently, the number of active vessels must adjust dynamically, ensuring resources aren’t wasted and passengers aren’t left stranded. This is the essence of autoscaling in hosting LLMs, where user traffic can surge or drop unexpectedly. Sudden spikes can strain resources, causing delays or interruptions, while low activity can leave expensive hardware idle, wasting resources. Autoscaling addresses this by adjusting resources in real time to match demand.
Strategies for Autoscaling
In cloud and AI services, autoscaling strategies include horizontal and vertical scaling, as well as hybrid approaches. Horizontal scaling adds server instances to handle increased load, like deploying extra ships to accommodate more passengers. Each new instance can host additional model replicas, balancing the workload more evenly. Vertical scaling, by contrast, upgrades resources on existing servers — adding powerful GPUs or more memory, much like enhancing a ship’s capacity to carry more passengers without adding new vessels. Combining both methods provides greater flexibility, allowing servers to be individually upgraded while also adding new ones to optimize performance and resource utilization.
The Challenge with Scaling LLMs
Scaling LLMs presents unique challenges absent in traditional web services, as both horizontal and vertical scaling face practical limitations with specialized hardware and startup times. Vertical scaling, for instance, quickly reaches a limit with GPUs, as there’s only so much hardware that can be upgraded on a single node before hitting physical or financial constraints. In practice, the largest server configurations I encountered capped at eight GPUs per node, making further vertical scaling impractical.
Horizontal scaling, though theoretically unlimited, also introduces complexities. Adding more nodes requires effective workload distribution, and the choice of scaling metrics is critical. Common metrics include Requests Per Second (RPS), which can reflect demand but not the variability in request complexity, and resource utilization, where monitoring CPU, memory, and GPU usage can help determine when to scale up or down. Custom metrics, such as token counts in requests and responses, often provide a more accurate gauge of computational load, as larger requests demand more resources. However, implementing autoscaling based on custom metrics like token counts is challenging due to limitations in available tools.
Exploring Autoscaling Solutions
Navigating these challenges required evaluating several tools and frameworks, including Ray, KServe, and Kubernetes’ Horizontal Pod Autoscaler (HPA). Ray, a distributed computing platform tailored for scaling Python applications, manages workloads through a central head node that distributes tasks across worker nodes hosting models. Its autoscaling capabilities, especially for multi-model hosting, make it a good option, although the head node acting as a single point of failure was a drawback for me.
KServe, a serverless model serving platform built on Kubernetes and Knative, supports TensorFlow, PyTorch, and LLMs, offering a unified inference service interface. KServe’s autoscaling, powered by Knative, can scale to zero and back based on demand, proving to be cost-effective. In benchmarking, KServe outperformed both Ray and Kubernetes’ HPA in latency and throughput for LLM inference, though it faces limitations when dealing with very large models.
The Kubernetes Horizontal Pod Autoscaler (HPA) scales pods based on observed CPU utilization or other selected metrics, and custom metrics such as requests per second can be integrated with little effort, as in the following example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      metric:
        name: istio_requests_total
      describedObject:          # the object the metric is collected for (example)
        apiVersion: v1
        kind: Service
        name: podinfo
      target:
        type: AverageValue
        averageValue: "100"     # target RPS
The Startup Time Bottleneck
Despite these autoscaling tools, a significant challenge persists: the startup time to load and initialize LLMs. When a scaling event triggers, the model must be loaded into memory before handling requests. For smaller models, this delay is noticeable but generally acceptable. For instance, Llama 3 with 8 billion parameters can take about 70 seconds to start, which users may tolerate if rare. Larger models, like Llama 3 with 70 billion parameters, can take up to 50 minutes to load, making such delays impractical for real-time applications.
Predictive Scaling: Anticipating Demand
To address these delays, we can implement predictive scaling rather than relying solely on reactive autoscaling. By proactively scaling resources based on anticipated demand patterns, we ensure that models are loaded and ready when needed, bypassing lengthy startup times. This approach involves analyzing historical usage data to identify peak times, then automating scaling events ahead of surges with a simple Python script interfacing with the Kubernetes API. During periods of low demand, typically at night, instances can be scaled down to conserve resources and reduce costs. This proactive strategy ensures models are available when needed without the delays inherent in reactive scaling.
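As a minimal sketch of this idea, the script below uses the official kubernetes Python client to scale a serving deployment up ahead of peak hours and back down at night. The deployment name, namespace, replica counts, and schedule are assumptions; in a real setup they would come from the historical usage analysis described above.

from datetime import datetime, timezone

from kubernetes import client, config

DEPLOYMENT = "llama-3-70b"     # hypothetical deployment serving the large model
NAMESPACE = "llm-serving"
PEAK_HOURS = range(7, 22)      # e.g. 07:00-22:00 UTC, derived from historical traffic

def scale_for_current_hour() -> None:
    config.load_incluster_config()   # or config.load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    hour = datetime.now(timezone.utc).hour
    replicas = 4 if hour in PEAK_HOURS else 1
    # Patch only the scale subresource so the rest of the deployment spec is untouched
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    scale_for_current_hour()   # run periodically, e.g. from a Kubernetes CronJob

If an HPA also manages the same deployment, adjusting the HPA’s minReplicas instead of the raw replica count avoids the two controllers working against each other.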
Balancing Costs and User Experience
Predictive scaling requires careful planning to balance operational costs with user experience. Over-provisioning can lead to excessive costs, while under-provisioning can harm service quality. Continuous monitoring and adjustments are essential to strike this balance, ensuring an efficient, cost-effective, and user-friendly autoscaling solution.
6. Hosting Multiple Models Under One Roof: Navigating the Model Maze
Imagine a vast library where readers come to explore books across different genres, authors, and languages. They expect to locate their desired book efficiently, without wandering through endless aisles. In a similar way, providing access to multiple large language models (LLMs) within a single domain demands an effective system to route user requests to the appropriate model without delay.
The Complexity of Model Routing
The core challenge in multi-model hosting lies in accurately routing incoming requests to the correct model based on user selection. Traditional routing methods, such as using URL paths or HTTP headers, fall short when the model specification is embedded within the request body. Standard web servers and proxies, like Nginx, typically base routing decisions on request metadata, not the body itself. Inspecting the request body for routing is resource-intensive and unsupported by most proxies, making this a non-trivial problem to solve.
KServe’s Multi-Model Hosting Limitations
Initially, I explored KServe’s multi-model hosting capabilities, which allow for dynamic loading and unloading of models within a single inference service, theoretically optimizing resource usage. However, this approach has notable drawbacks when applied to LLMs. Due to their exceptional size, LLMs demand substantial memory and GPU resources. Hosting multiple LLMs in a single inference service can exceed hardware capabilities. Furthermore, dynamically loading models still incurs significant delays, much like the startup times previously discussed, which compromises the benefit of dynamic hosting when immediate response times are essential.
Developing a Custom Model Router
To address these issues, we can develop a custom HTTP proxy that acts as a model router. It reads the incoming request body to extract the model identifier specified by the user and, based on this information, routes the request to the KServe inference service hosting the desired model. The router relies on a static configuration, provided via environment variables, that maps model identifiers to service endpoints. When new models are introduced, updating these environment variables and restarting the router is enough for them to be recognized.
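A condensed Python sketch of such a router, using FastAPI and httpx, is shown below. The MODEL_ENDPOINTS variable and the service URLs it contains are assumptions, and streaming responses are omitted for brevity.

import json
import os

import httpx
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Example value: {"llama-3-8b": "http://llama-3-8b.llm-serving:8000"}
MODEL_ENDPOINTS = json.loads(os.environ.get("MODEL_ENDPOINTS", "{}"))

@app.post("/v1/chat/completions")
async def route_chat_completion(request: Request):
    body = await request.json()
    model = body.get("model")
    upstream = MODEL_ENDPOINTS.get(model)
    if upstream is None:
        raise HTTPException(status_code=404, detail=f"Unknown model: {model}")
    # Forward the original payload to the inference service hosting this model
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(f"{upstream}/v1/chat/completions", json=body)
    return JSONResponse(content=resp.json(), status_code=resp.status_code)

In practice, the token-counting proxy from earlier and this router can be combined into a single gateway that sits in front of all inference services.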
Advantages of the Custom Approach
This custom solution offers several benefits. Scalability is achieved as each model operates within its own KServe inference service, allowing for independent scaling based on demand for each model. The router’s flexible design accommodates complex routing logic, supporting various request formats and model requirements. Additionally, isolating each model reduces the risk of resource contention and simplifies troubleshooting, as each service operates independently.
Conclusion: Embarking on Your LLM Journey
In the complex world of hosting large language models, the orchestration of routing, scaling, and token counting might feel like juggling chainsaws while riding a unicycle. Each component, from custom routers to predictive scaling, plays a critical role in delivering an efficient, flexible, and user-friendly service. And for those who need precise billing to avoid that end-of-month surprise, the journey would be incomplete without tracking every token spent along the way.
That’s where the GenAI Gateway steps in. It’s your backstage pass to smoother model hosting, ensuring each request is tracked and counted down to the last token. So, if you’re tired of your models freeloading or your infrastructure buckling under the pressure, check out GenAI Gateway on GitHub. Because, like in any good library or toll booth, every word counts — and so does every token.