
AI/ML Workloads on EKS

1. GPU Provisioning: The “Heavy Lifters”

Standard EC2 instances use CPUs, which are like smart professors: they can do anything, but only one thing at a time. GPUs (Graphics Processing Units) are like an army of 5,000 students: each one is simpler, but they can all do math at the same time.

  • GPU-Optimized AMIs: A standard Linux image won’t work for AI workloads out of the box because it lacks GPU drivers. Use the EKS-Optimized Accelerated AMI, which comes with the NVIDIA drivers and the NVIDIA container toolkit pre-installed.
  • NVIDIA Device Plugin: Kubernetes doesn’t natively “see” GPUs. You must install this plugin (usually via a DaemonSet) so that the kubelet can advertise nvidia.com/gpu to the scheduler as a trackable resource, just like CPU and memory.
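Once the device plugin is running, a pod claims GPUs through its resource limits. A minimal sketch (pod name and image tag are illustrative assumptions):

```yaml
# Sketch: a pod that requests one GPU. The NVIDIA device plugin DaemonSet
# must already be running on the node; image tag and names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]   # prints driver/GPU info if scheduling worked
      resources:
        limits:
          nvidia.com/gpu: 1     # whole GPUs only; no fractional requests
```

Note that nvidia.com/gpu is requested under limits, and GPUs cannot be split across pods the way millicores can.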

2. Hosting LLMs: vLLM vs. Ollama

In 2026, we don’t just “run” a model; we use inference engines to make it fast.

  • vLLM (The Production Giant): This is the high-performance choice. It uses a technique called PagedAttention to serve thousands of users simultaneously without running out of GPU memory. It exposes an OpenAI-compatible API, so you can drop it into existing apps easily.
  • Ollama (The Developer Friend): Great for local testing or internal tools. It packages models into a “Docker-like” format, making it incredibly easy to pull and run a model (e.g., ollama run llama3.1) with one command.
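For the production path, vLLM ships an official container image that serves its OpenAI-compatible API. A minimal Deployment sketch (the model name, image tag, and resource sizing are assumptions you would adjust for your GPU):

```yaml
# Sketch: vLLM serving an OpenAI-compatible API on port 8000.
# Model name and image tag are illustrative; size resources to your GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
          ports:
            - containerPort: 8000   # vLLM's default API port
          resources:
            limits:
              nvidia.com/gpu: 1
```

Because the endpoint speaks the OpenAI wire format, existing OpenAI client code can point at this Service with only a base-URL change.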

3. Data on EKS: The Throughput Problem

Training an AI model is like trying to drink water from a firehose. If your storage is slow (like standard EBS), your expensive $30,000 GPU will sit idle waiting for data.

  • Amazon FSx for Lustre: This is the 2026 standard for AI storage. It is a parallel file system that can deliver hundreds of gigabytes per second of aggregate throughput. It links directly to an S3 bucket, acting as a high-speed cache for your massive datasets.
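With the AWS FSx for Lustre CSI driver installed, an existing filesystem can be exposed to pods through a statically provisioned volume. A sketch, assuming the driver is present; the filesystem ID, DNS name, and mount name are placeholders for your own filesystem's values:

```yaml
# Sketch: statically provisioned FSx for Lustre volume via the AWS FSx CSI
# driver (fsx.csi.aws.com). All fs-* identifiers below are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi          # FSx for Lustre capacity scales in 1.2 TiB steps
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
      mountname: abcdefgh
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""       # bind to the static PV above, not a StorageClass
  volumeName: fsx-pv
  resources:
    requests:
      storage: 1200Gi
```

Training pods then mount the PVC like any other volume, and every pod on every node reads the same S3-backed namespace at Lustre speed.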