Tested 2026-06-17 - Ollama 0.30.5 - NVIDIA GPU - 8GB VRAM budget

Best Local LLM for 8GB VRAM GPU: Ollama GPU Fit Test

I tested four local Ollama models on an NVIDIA GPU and measured both tokens per second and peak GPU memory delta. The short answer: use qwen3:4b when you want a fast, safe default. Use qwen3.5:9b when you want a stronger text model and can keep the GPU mostly free. Do not treat a 12B model as a clean 8GB default: gemma4:12b peaked at 8.55GB in this run.

Fastest safe pick

qwen3:4b

319.00 tokens/s · 3.24GB peak delta

Upper text pick

qwen3.5:9b

173.48 tokens/s · 6.66GB peak delta

Boundary

gemma4:12b

8.55GB peak delta

Important method note

This is an 8GB VRAM-fit test, not a fake 8GB card review.

The machine I tested on has an RTX 5090 with 32GB VRAM. I did not pretend it was an RTX 4060. Instead, I measured the peak GPU memory delta for each model and judged whether that extra memory would fit inside an 8GB VRAM budget. This is useful for model shortlisting, but a real 8GB display card may have less practical headroom because the desktop, browser, drivers, and display output already consume VRAM.

Host GPU

NVIDIA GeForce RTX 5090, Driver 595.79, CUDA 13.2

Budget rule

Peak GPU memory delta should stay below 8GB, preferably below 7GB.

Prompt settings

num_ctx=2048, num_predict=160, temperature=0.
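For readers who want to apply the same settings programmatically, here is a minimal sketch that sends them through Ollama's local HTTP API. It assumes the default endpoint http://localhost:11434/api/generate and derives tokens/s from the eval_count and eval_duration fields of the response. The article does not state how its own tokens/s figure was computed, so treat that formula as an assumption. The function names and the example prompt are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model, prompt):
    """Build a non-streaming generate request with the settings from this test."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 2048, "num_predict": 160, "temperature": 0},
    }


def tokens_per_second(response):
    """Generated tokens divided by generation time (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model, prompt, url=OLLAMA_URL):
    """POST the request to a running Ollama server and return the parsed JSON reply."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (needs a running Ollama server with the model already pulled):
# reply = generate("qwen3:4b", "Explain what a KV cache is in two sentences.")
# print(round(tokens_per_second(reply), 2), "tokens/s")
```

Keeping the request builder separate from the network call makes it easy to confirm the options before running anything against the GPU.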

Step 1 (screenshot): nvidia-smi records the host GPU, driver, CUDA version, and the baseline desktop memory already used before the benchmark.

Step 2 (screenshot): ollama list shows the local model tags used for this test before any article copy was written.

Measured result

8GB VRAM is really a 4B to 9B comfort zone, not a 12B guarantee.

All four models generated tokens, but not all four are good 8GB GPU defaults. The practical split is simple: qwen3:4b is the no-drama model, qwen3.5:9b is the stronger text pick, qwen3-vl:8b is possible but tight, and gemma4:12b crosses the 8GB budget in this run.

Model	Ollama size	Tokens/s	Peak GPU delta	Wall time	Verdict
qwen3:4b	2.5 GB	319.00	3.24 GB	2.19s	Fastest safe pick
qwen3.5:9b	6.6 GB	173.48	6.66 GB	7.89s	Best upper text pick
qwen3-vl:8b	6.1 GB	215.64	7.25 GB	6.53s	Works, but tight
gemma4:12b	7.6 GB	108.26	8.55 GB	9.57s	Over 8GB budget

Raw benchmark JSON is available at benchmark-results.json. I report peak GPU memory delta instead of total GPU memory because the Windows desktop already used VRAM before any model was loaded.
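The budget rule can be applied to the table mechanically. The sketch below copies the four peak delta values from the table and labels each one against the 8GB limit and the 7GB preferred ceiling. The verdict wording and function name are my own shorthand, not the labels used in the table.

```python
BUDGET_GB = 8.0   # hard limit from the budget rule
COMFORT_GB = 7.0  # preferred ceiling from the budget rule

# Peak GPU memory delta per model, copied from the results table
PEAK_DELTA_GB = {
    "qwen3:4b": 3.24,
    "qwen3.5:9b": 6.66,
    "qwen3-vl:8b": 7.25,
    "gemma4:12b": 8.55,
}


def verdict(peak_delta_gb, budget_gb=BUDGET_GB, comfort_gb=COMFORT_GB):
    """Classify a peak delta against the budget and the comfort ceiling."""
    if peak_delta_gb >= budget_gb:
        return "over budget"
    if peak_delta_gb >= comfort_gb:
        return "tight"
    return "comfortable"


for model, delta in PEAK_DELTA_GB.items():
    print(f"{model:12s} {delta:5.2f} GB  {verdict(delta)}")
```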

Step 3 (screenshot): the benchmark summary records measured tokens/s, wall time, and peak GPU memory delta for every model.

Step 4 (screenshot): the method note states that this is an 8GB VRAM budget fit test on an RTX 5090 host, not a disguised RTX 4060 review.

What I would install first

Start with qwen3:4b, then decide whether 9B is worth the extra memory.

For an 8GB GPU, I would not start by hunting for the largest model that barely fits. I would install the fast 4B model first, test my real prompt, then try the 9B model only if the answer quality is meaningfully better. The 8B vision model is interesting, but its 7.25GB peak delta leaves only about 0.75GB of an 8GB budget, which is very little room on a display GPU.

Install order

Safe first run

ollama run qwen3:4b

Stronger text test

ollama run qwen3.5:9b

Vision stretch

ollama run qwen3-vl:8b

What gets people into trouble

The file size is not the same as the runtime memory budget.

The most common 8GB VRAM mistake is looking only at the downloaded model size. Runtime memory also includes KV cache, prompt context, framework overhead, and the rest of your desktop. That is why gemma4:12b, with a 7.6GB Ollama size, still peaked at 8.55GB in this run and is a bad 8GB default.
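To make the gap concrete, this small sketch subtracts the Ollama size from the measured peak delta for each model in the results table. Treat the differences as rough: the two columns may not use the same unit convention (GB versus GiB), and the function name is hypothetical.

```python
# (Ollama size in GB, peak GPU memory delta in GB), copied from the results table
RESULTS = {
    "qwen3:4b": (2.5, 3.24),
    "qwen3.5:9b": (6.6, 6.66),
    "qwen3-vl:8b": (6.1, 7.25),
    "gemma4:12b": (7.6, 8.55),
}


def runtime_overhead_gb(size_gb, peak_delta_gb):
    """Memory used at runtime beyond the listed model size, rounded to 2 decimals."""
    return round(peak_delta_gb - size_gb, 2)


for model, (size, peak) in RESULTS.items():
    extra = runtime_overhead_gb(size, peak)
    print(f"{model:12s} size {size:4.1f} GB  peak {peak:4.2f} GB  extra {extra:4.2f} GB")
```

The extra memory varies from model to model, so a fixed allowance on top of the file size is not a reliable shortcut.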

Leave real headroom

Treat 7GB peak delta as the practical comfort ceiling on a display GPU.

Keep context modest

This test used 2048 context. Larger context can change the fit quickly.

Retest your prompt

A coding prompt, vision prompt, or long chat may move the memory peak.
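To see why context length matters, here is a back-of-the-envelope KV cache estimate for a standard transformer with grouped-query attention. The layer count, KV head count, and head size below are hypothetical and do not describe any model in this test. Models that use sliding-window or hybrid attention will deviate from this formula, so it is an illustration of the scaling, not a prediction.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: keys plus values for every layer and position.

    bytes_per_value=2 assumes a 16-bit cache; quantized caches use less.
    """
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical architecture, chosen only to illustrate the scaling
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 32768):
    mib = kv_cache_bytes(context_len=ctx, **HYPOTHETICAL) / 2**20
    print(f"ctx={ctx:6d}  KV cache ~ {mib:5.0f} MiB")
```

Under these assumptions the cache grows linearly with context: 256 MiB at 2048 tokens becomes 4096 MiB at 32768 tokens, which on its own would consume half of an 8GB budget.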

Reproduce the same style of test

The simplest local command sequence

nvidia-smi
ollama list
ollama run qwen3:4b
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
ollama stop qwen3:4b

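Note that ollama run opens an interactive session, so take the memory reading from a second terminal while the model is generating or still loaded. For a more careful test, sample nvidia-smi repeatedly while the model is generating, then compare baseline memory against peak memory. That peak delta is what I used for the 8GB budget decision in this article.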
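The repeated sampling described above can be sketched as a small Python helper. This is not the script used for the article. It is a minimal version that assumes a single GPU at index 0, polls the same nvidia-smi query shown in the command sequence on a background thread, and reports the difference in MiB (the unit nvidia-smi prints). All function names are hypothetical.

```python
import subprocess
import threading
import time


def parse_used_mib(output):
    """Parse the first line of `--format=csv,noheader,nounits` output as MiB."""
    return int(output.strip().splitlines()[0])


def read_used_mib(gpu_index=0):
    """Ask nvidia-smi for the memory currently in use on one GPU (assumed index 0)."""
    result = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_used_mib(result.stdout)


def peak_delta_mib(workload, read_mib=read_used_mib, interval_s=0.1):
    """Run workload() while sampling GPU memory; return peak minus baseline in MiB."""
    baseline = read_mib()
    peak = baseline
    done = threading.Event()

    def sampler():
        nonlocal peak
        while not done.is_set():
            peak = max(peak, read_mib())
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        workload()
    finally:
        done.set()
        thread.join()
    peak = max(peak, read_mib())  # one last sample while the model is still loaded
    return peak - baseline


# Example (needs nvidia-smi and Ollama on PATH, with the model unloaded first):
# run = lambda: subprocess.run(["ollama", "run", "qwen3:4b", "Say hello."], check=True)
# print(peak_delta_mib(run), "MiB peak delta")
```

The sampling interval is a trade-off: a short interval is less likely to miss a brief spike but calls nvidia-smi more often. Unload the model first (for example with ollama stop) so the baseline does not already include it.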

FAQ

Practical 8GB VRAM local LLM questions

What is the best local LLM for an 8GB VRAM GPU?

For the safest daily text workflow in this test, qwen3:4b is the easiest recommendation because it used only about 3.24GB of peak GPU memory delta and generated 319 tokens/s. If you want a stronger text model and can keep the GPU mostly free, qwen3.5:9b was the best upper-range pick at 173.48 tokens/s and about 6.66GB peak delta.

Can an 8GB GPU run 12B local models?

Not comfortably in this test. gemma4:12b generated text, but the measured peak GPU memory delta was about 8.55GB, which is already over an 8GB VRAM budget before display memory, drivers, browser windows, or larger context settings are considered.

Why measure peak GPU memory delta instead of total GPU memory used?

This Windows desktop already had several GB of GPU memory in use before the model loaded, so total memory used would overstate what each model needs. The useful signal is the extra GPU memory the model needed during generation. The limitation still applies on real hardware: an 8GB display GPU carries its own desktop overhead, so it has less practical headroom than a clean server GPU.

Should I use the fastest model or the biggest model that fits?

Use the smallest model that handles your real prompt well. On 8GB VRAM, the biggest model that barely fits can lose to a smaller model once context length, browser memory, heat, and repeated turns are part of the workflow.
