Calculate GPU VRAM requirements for running large language models. Estimate memory for model weights, KV cache, and activations. Find which GPUs can run your AI model with our comprehensive memory calculator.
Running AI models locally requires knowing your GPU memory requirements. Our GPU Memory Calculator estimates the VRAM needed for any large language model based on parameter count, precision, batch size, and context length. Find out if your GPU can run Llama, Mistral, or other popular models.
GPU memory (VRAM) is consumed by three main components: model weights (parameters × bytes per parameter), KV cache (scales with context length × batch size), and activation memory (temporary computation storage). The total determines which GPU can run your model.
VRAM Calculation Formula
VRAM = Model Weights + KV Cache + Activations + Overhead

Know exactly whether your RTX 3090, A100, or consumer GPU can run a specific model before you buy or rent.
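The formula above can be sketched in a few lines of Python. This is a simplified estimator, not the calculator's exact internals: the layer count, KV head count, head dimension, and 1 GB overhead figure below are illustrative assumptions (roughly matching a Llama-2-7B-style architecture), and activation memory is folded into the overhead term.

```python
# Rough VRAM estimator: weights + KV cache + overhead (all illustrative).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b, precision, n_layers, n_kv_heads, head_dim,
                     context_len, batch_size, overhead_gb=1.0):
    """Estimate total VRAM in GB for transformer inference."""
    weights = params_b * 1e9 * BYTES_PER_PARAM[precision]
    # KV cache: one K and one V tensor per layer, stored at 16-bit precision.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * 2
    return (weights + kv_cache) / 1e9 + overhead_gb

# Example: a 7B model (32 layers, 32 KV heads, head dim 128)
# at FP16 with a 4K context and batch size 1.
print(round(estimate_vram_gb(7, "fp16", 32, 32, 128, 4096, 1), 1))  # → 17.1
```

Note how the KV cache and overhead push a "14 GB" FP16 model past 16 GB at full context; this is why headroom matters when matching a model to a GPU.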
See how INT8 or INT4 quantization reduces memory requirements, enabling larger models on smaller GPUs.
KV cache grows linearly with context. Calculate if you can support 4K, 8K, or 32K context windows.
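The linear growth is easy to see by computing the cache size at several context lengths. The model dimensions below (32 layers, 32 KV heads, head dim 128, 16-bit cache) are illustrative Llama-2-7B-style assumptions:

```python
# KV cache size grows linearly with context length.
def kv_cache_gb(context_len, batch_size=1, n_layers=32,
                n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # Factor of 2 = one K tensor and one V tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem) / 1e9

for ctx in (4096, 8192, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.1f} GB")
# →  4096 tokens: 2.1 GB
# →  8192 tokens: 4.3 GB
# → 32768 tokens: 17.2 GB
```

Doubling the context doubles the cache, so a 32K window can cost more VRAM than the quantized weights themselves.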
Larger batches improve throughput but need more memory. Find your optimal batch size for available VRAM.
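One way to find that optimal batch size is to take the VRAM left over after weights and overhead and divide by the per-sequence KV cache cost. A hedged sketch, again assuming Llama-2-7B-style dimensions and a 1 GB overhead:

```python
# Largest batch that fits: free VRAM after weights / per-sequence KV cost.
def max_batch_size(gpu_gb, weights_gb, context_len, n_layers=32,
                   n_kv_heads=32, head_dim=128, overhead_gb=1.0):
    per_seq_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * 2 / 1e9
    free_gb = gpu_gb - weights_gb - overhead_gb
    return max(int(free_gb // per_seq_gb), 0)

# 24 GB GPU, 7B model at FP16 (14 GB of weights), 4K context:
print(max_batch_size(24, 14, 4096))  # → 4
```

Quantizing the weights frees VRAM that converts directly into extra batch capacity at the same context length.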
A 7B model needs approximately: 28GB at FP32, 14GB at FP16/BF16, 7GB at INT8, or 3.5GB at INT4. Add 1-4GB for KV cache depending on context length and batch size. In practice, a 16GB GPU like the RTX 4080 can run 7B models at FP16 with a 4K context.
A 70B model at FP16 needs ~140GB VRAM—far exceeding any single consumer GPU. Options: use INT4 quantization (~35GB, fits on A100 80GB), use multiple GPUs with model parallelism, or offload layers to CPU RAM (much slower).
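The weight figures in both examples are just parameter count × bytes per parameter. A quick arithmetic check, using the byte counts from the text:

```python
# Weight memory = parameters × bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion, precision):
    return params_billion * BYTES_PER_PARAM[precision]

print(weights_gb(7, "fp16"))   # → 14.0
print(weights_gb(7, "int4"))   # → 3.5
print(weights_gb(70, "fp16"))  # → 140.0
print(weights_gb(70, "int4"))  # → 35.0 (fits on an A100 80GB)
```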
FP16/BF16: Virtually no quality loss, standard for inference. INT8: 1-2% benchmark degradation, excellent for production. INT4: Noticeable quality loss on complex reasoning, but acceptable for many applications. Always benchmark on your specific use case.
Additional VRAM is used by: CUDA context and kernels (~500MB-1GB), framework overhead (PyTorch, etc.), memory fragmentation, gradient storage if training, and multiple model copies for speculative decoding. Our estimates include typical overhead, but actual usage varies by setup.
Both use 16 bits per parameter. FP16 has more mantissa bits, giving higher precision, which suits inference. BF16 has a wider exponent range, which suits training by avoiding overflow. For inference they're largely interchangeable and use the same amount of memory.