Building a Scalable AI Infrastructure: Kubernetes, NVIDIA GPUs, and Beyond

The journey to creating a scalable and efficient AI infrastructure is fraught with challenges, particularly when dealing with GPU-optimized models for text, image, and audio generation. This blog describes our experience building a scalable multi-tenant AI system, focusing on GPU resource management, virtualization, queue structuring, and adaptability to diverse computing environments.
Initial Challenges in Scalable AI Infrastructure
The primary hurdle in managing AI workloads is limited GPU VRAM, which constrains both how many AI work requests can run simultaneously and the largest workload a single GPU can support. This constraint necessitates an intelligent system to prioritize and manage requests, ensuring that higher SLA tier requests are served before others.
In a development setting, running GPU-optimized models locally is common, but this setup falters in a production environment where automated CI/CD pipelines and deployable units, like VM images or Docker containers, are essential. Our project, aiAgent, required a shift from local development to a Kubernetes-based architecture capable of scaling across various cloud and local data center environments.
Kubernetes and NVIDIA GPU Integration
Our decision to configure a Kubernetes cluster from scratch, using Ubuntu 22 servers, was driven by the need for a universal solution that could operate across different cloud providers and in-house servers. The flexibility to scale between AWS, Azure, Google Cloud, and local data centers is crucial, especially given the scarcity of GPU resources and the demands of intensive training periods.
Virtual GPUs and Resource Management
Integrating NVIDIA GPU resources with Kubernetes presented significant challenges, notably the need for the NVIDIA GPU operator to make efficient use of consumer-level GPUs like the RTX 4090 and 3090. The GPU operator allows multiple virtual GPUs to be defined per physical GPU, so that multiple pods can share the same physical GPU's VRAM and processing power. The deployment strategy also had to ensure compatibility with the specific NVIDIA driver versions dictated by the GPU operator, avoiding conflicts and ensuring optimal performance. Newer versions of the GPU operator have started to support multiple GPU driver versions at once, which allows for a gradual migration from one version to the next rather than a single coordinated upgrade of every pod using GPU resources.
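In practice, this sharing is driven by the GPU operator's time-slicing configuration, which advertises one physical GPU as multiple schedulable nvidia.com/gpu resources. Below is a minimal sketch of creating such a configuration with the Kubernetes Python client; the ConfigMap name, namespace, replica count, and the "any" profile key are illustrative assumptions and would need to match your own GPU operator installation, whose ClusterPolicy must also be pointed at the ConfigMap.

```python
# Sketch: advertise each physical GPU as two schedulable devices via the
# GPU operator's time-slicing config. Names and namespace are illustrative.
from kubernetes import client, config

TIME_SLICING = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
"""

config.load_kube_config()
core_v1 = client.CoreV1Api()
core_v1.create_namespaced_config_map(
    namespace="gpu-operator",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config"),
        data={"any": TIME_SLICING},  # "any" profile applied to all GPU nodes
    ),
)
# The operator's ClusterPolicy must then reference "time-slicing-config"
# under devicePlugin.config for the profile to take effect.
```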
Performance Implications
The virtual GPU setup enables a more efficient distribution of GPU resources. For instance, running two instances of a Stable Diffusion container image on a single 4090 nearly doubles the per-image processing time, yet finishes a batch of images sooner than a single pod working alone. A given Stable Diffusion image might be generated in 10 seconds by a single pod using the full 4090, while two pods running the same model and settings would each take about 17 seconds per image. A lone user gets their first image faster (10 seconds) when a single pod owns the 4090, but the system generates more images per hour when it is continually utilized. And if two different users request an image at the same time, the second user gets their image 3 seconds sooner (17 seconds versus 20 seconds) than they would with a single pod pulling from the image model queue.
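The throughput claim is easy to sanity-check with the figures above (10 seconds per image for a dedicated pod, about 17 seconds per image for each of two sharing pods):

```python
# Back-of-the-envelope throughput for one vs. two pods sharing a 4090,
# using the per-image latencies quoted above.
SINGLE_POD_SECONDS = 10   # one pod owns the whole GPU
SHARED_POD_SECONDS = 17   # each of two pods sharing the GPU

single_pod_per_hour = 3600 / SINGLE_POD_SECONDS        # 360 images/hour
two_pods_per_hour = 2 * (3600 / SHARED_POD_SECONDS)    # ~423 images/hour

print(f"one pod:  {single_pod_per_hour:.0f} images/hour")
print(f"two pods: {two_pods_per_hour:.0f} images/hour")
```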
Advanced Queue Management in AI Infrastructure
Queue management is a critical component in our scalable AI infrastructure, enabling efficient resource allocation, prioritization of tasks based on service level agreements (SLAs), and maintaining system stability under load. The aiAgent project utilizes a sophisticated queuing system, primarily leveraging RabbitMQ, to manage the diverse and resource-intensive demands of AI workloads.
Queue Structure and Prioritization
In our system, each AI model, version, and SLA type grouping has its own dedicated queue. This segmentation allows for better resource management and prioritization of tasks. The queues are organized based on the SLA tiers of the users, ensuring that higher-tier requests are processed preferentially. This prioritization is essential for maintaining service quality and meeting the expectations of premium users.
For example, if a request for image generation comes from a high-tier user, it is queued in a high-priority queue specific to the requested AI model and version. Lower-tier requests are placed in their respective queues and processed in order of priority and arrival. This method maintains fairness while still guaranteeing the performance and responsiveness required by higher-tier SLAs.
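As a concrete illustration, declaring per-model, per-version, per-tier queues with RabbitMQ's Python client looks roughly like the sketch below. The queue naming convention, tier names, and message schema are hypothetical, not our exact production layout.

```python
import json

import pika  # RabbitMQ client library

# Hypothetical convention: one queue per model/version and SLA tier.
MODELS = ["stable-diffusion-xl.v1", "musicgen.v2"]
TIERS = ["premium", "standard", "free"]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# Declare a durable queue for every (model, version, tier) combination.
for model in MODELS:
    for tier in TIERS:
        channel.queue_declare(queue=f"{model}.{tier}", durable=True)

# A premium image request lands in the queue for its exact model, version, and tier.
channel.basic_publish(
    exchange="",
    routing_key="stable-diffusion-xl.v1.premium",
    body=json.dumps({"prompt": "a lighthouse at dawn", "steps": 30}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist across broker restarts
)
connection.close()
```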
Dynamic Work Allocation
Our queue management system is more than just a static priority list; it’s a dynamic allocation engine that responds to real-time system load and resource availability. Each worker in the infrastructure listens to a dedicated command queue, which directs it to the specific model queue from which to pull work items.
This dynamic allocation allows for efficient use of GPU resources. For instance, if a particular model is heavily requested and its queue is overloaded, the system can redirect some workers to focus on that queue, leveraging the already loaded model in GPU VRAM to maximize throughput and reduce loading times.
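A stripped-down version of that worker loop, assuming RabbitMQ and the pika client, might look like the following sketch. The worker ID, command schema, and queue names are hypothetical: each worker consumes its own command queue, and a "switch_queue" command re-points it at a different model queue.

```python
import json

import pika

WORKER_ID = "worker-gpu0-a"   # hypothetical identifier for this worker pod
consumer_tag = None           # tag of the current model-queue consumer, if any

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue=f"commands.{WORKER_ID}", durable=True)
channel.basic_qos(prefetch_count=1)  # take one work item at a time

def handle_work(ch, method, properties, body):
    """Run inference with the model already resident in VRAM, then ack."""
    request = json.loads(body)
    # ... generate the image / text / audio described by `request` ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

def handle_command(ch, method, properties, body):
    """Coordinator directs this worker to a specific model queue."""
    global consumer_tag
    command = json.loads(body)
    if command["action"] == "switch_queue":
        if consumer_tag is not None:
            ch.basic_cancel(consumer_tag)   # stop pulling from the old model queue
        consumer_tag = ch.basic_consume(
            queue=command["queue"],         # e.g. "stable-diffusion-xl.v1.premium"
            on_message_callback=handle_work,
        )
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=f"commands.{WORKER_ID}", on_message_callback=handle_command)
channel.start_consuming()
```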
Efficient Use of GPU Resources
A significant challenge in AI workloads is the efficient management of GPU VRAM. Our system ensures that once a model is loaded into VRAM, it is utilized as much as possible before unloading. This is crucial because loading and unloading models consume significant time and resources.
To optimize this, the work coordinator manages a balance between keeping models loaded (to reduce initiation time for new tasks) and unloading them (to free resources for other tasks). This balance depends on the current workload, the specific model’s resource requirements, and the overall system capacity.
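The coordinator's actual policy is more involved than we can show here, but the trade-off at the worker level can be illustrated with a small, hypothetical per-worker cache: keep the last model warm in VRAM, reload only when a different model is requested, and unload after an idle period. The loader/unloader callables and the idle threshold below are placeholders.

```python
import time

IDLE_UNLOAD_SECONDS = 300  # placeholder: keep a model warm for 5 minutes of idle time

class ModelCache:
    """Keeps at most one model loaded per worker; unloads only when idle or replaced."""

    def __init__(self, loader, unloader):
        self.loader = loader        # callable that loads a named model into VRAM
        self.unloader = unloader    # callable that frees that model's VRAM
        self.name = None
        self.model = None
        self.last_used = 0.0

    def get(self, name):
        if self.name != name:
            if self.model is not None:
                self.unloader(self.model)   # free VRAM before loading a different model
            self.model = self.loader(name)  # expensive: pulls weights into VRAM
            self.name = name
        self.last_used = time.time()
        return self.model

    def maybe_unload(self):
        """Called periodically by the worker between queue items."""
        if self.model is not None and time.time() - self.last_used > IDLE_UNLOAD_SECONDS:
            self.unloader(self.model)
            self.name, self.model = None, None
```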
Handling Overload and Resource Contention
In cases where the demand exceeds the system’s capacity, the queuing system ensures that the system remains stable and operational. Requests are queued and processed based on priority and resource availability, preventing system overload and ensuring that no single user or request can monopolize resources.
Moreover, to prevent resource contention, especially in a virtual GPU environment, our system intelligently schedules tasks, ensuring that GPU utilization is maximized without leading to resource conflicts. This involves careful monitoring of each virtual GPU’s workload and adjusting the allocation of work dynamically to avoid overloading any single physical GPU.
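One building block for that kind of scheduling, sketched here with the nvidia-ml-py (pynvml) bindings and an arbitrary headroom threshold, is checking how much VRAM is actually free on a physical GPU before the coordinator assigns it more work:

```python
import pynvml  # NVIDIA management library bindings (nvidia-ml-py)

VRAM_HEADROOM = 0.15  # placeholder: keep at least 15% of VRAM free to avoid contention

def gpu_has_headroom(index: int) -> bool:
    """Return True if the physical GPU at `index` can safely take more work."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return (mem.total - mem.used) / mem.total >= VRAM_HEADROOM
    finally:
        pynvml.nvmlShutdown()
```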
Future Enhancements in Queue Management
Looking ahead, we plan to integrate more advanced machine learning algorithms into our queue management system. These algorithms will predict workload patterns and optimize queue priorities and worker allocation in real time, further enhancing the efficiency and responsiveness of our AI infrastructure.
Additionally, integrating with cloud-based auto-scaling services will allow our system to dynamically adjust not just the workload distribution but also the actual hardware resources available, scaling up during high demand periods and scaling down during lulls to optimize cost-efficiency.
Automation and Infrastructure as Code (IaC)
The deployment and management of our Kubernetes environment are streamlined through IaC principles, utilizing bash scripts and SSH for automated setup and scaling. This approach ensures that our infrastructure is repeatable, scalable, and adaptable to both local and cloud environments. Cloud deployments integrate the bash-automated Kubernetes cluster setup with cloud-specific network configurations, such as AWS WAF and VPCs.
Conclusion
Building a scalable AI infrastructure requires careful consideration of GPU resource management, virtualization, queue structuring, and the integration of automation tools. Our journey with the aiAgent project highlights the importance of a flexible, Kubernetes-based architecture that can adapt to varying hardware resources and cloud environments. As we continue to evolve our system, integrating more sophisticated IaC practices and optimizing our use of GPU resources remains a top priority, ensuring that our AI infrastructure can meet the growing and diverse needs of our users.