Current: RTX 5090. Good for 14B to 32B daily work and selected 70B stretch tests.
Plan a local Llama 3.3 70B setup with RAM and VRAM requirements, 48GB workstation guidance, RTX 4090 compromise notes, and advice on when to use hosted inference instead.
Best first download: Llama 3.3 70B
Model rows: 76 local model rows
Updated: Jun 28, 2026 (metrics snapshot)
Families: 15 model families
Compare the machine you have with the machine you might buy, then reverse-check the hardware needed for a target model.
Now fits: 60
Target fits: 59
Current: Good for 14B to 32B daily work and selected 70B stretch tests.
Target: Good for strong 14B to 32B local coding and reasoning models.
Models unlocked by this upgrade
These did not fit or stretch on the current machine, but become realistic on the target.
This upgrade mostly improves speed and headroom for models that already fit. Pick a larger target GPU to unlock bigger model classes.
A common baseline for strong local text performance on large rigs.
RAM floor: 128 GB
VRAM target: 48 GB
Q4 size: 43 GB
Install hint: ollama run llama3.3:70b
Minimum comfortable hardware paths
First exact: 128 GB workstation
128 GB RAM / 48 GB VRAM / usable model memory 48 GB
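The 43 GB Q4 size and the 48 GB of usable model memory above can be related with some rough arithmetic. The sketch below is only an approximation: the bits-per-weight value is an assumption chosen to land near the 43 GB figure on this page, not a number taken from any runtime, and `quantized_size_gb` is a hypothetical helper name.

```python
# Rough size estimate for a quantized model's weights.
# ASSUMED_BITS_PER_WEIGHT is an assumption (4-bit formats usually carry
# some per-block overhead above a flat 4 bits); it is not from this page.
ASSUMED_BITS_PER_WEIGHT = 4.9


def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: parameters * bits / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    usable_gb = 48  # "usable model memory" from the hardware path above
    size_gb = quantized_size_gb(70, ASSUMED_BITS_PER_WEIGHT)
    headroom_gb = usable_gb - size_gb
    print(f"weights: about {size_gb:.1f} GB")
    print(f"left for context cache and runtime overhead: about {headroom_gb:.1f} GB")
```

Under these assumptions only a few gigabytes remain once the weights are loaded, which is why context length and runtime overhead matter so much on a 48 GB target.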
Large general open-weight assistant
Parameters: 70B
Performance: 62/100
Pulls: 4M
Fit order: #1 (Performance + adoption + fit)
Match score: 71/100
Adoption: 83/100
Install hint: ollama run llama3.3:70b
Llama 3.3 70B is a server-class or high-memory workstation target. Plan for 128GB+ RAM or 48GB+ VRAM before treating it as local daily infrastructure.
The clean target for local 70B experiments with enough headroom for practical context.
Possible only with compromise assumptions. Use 32B-class models first for daily work.
Memory may be enough to experiment, but speed and usability are the real pass/fail tests.
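As a rough illustration of how these tiers could be applied to a specific machine, the sketch below sorts a machine into one of three buckets using the 128 GB RAM and 48 GB VRAM figures from this page. The `Machine` class, the `classify_70b_fit` function, the tier labels, and the exact cutoff logic are hypothetical simplifications for illustration, not the calculator's actual rules.

```python
from dataclasses import dataclass

# Thresholds taken from this page's planning figures.
RAM_FLOOR_GB = 128
VRAM_TARGET_GB = 48


@dataclass
class Machine:
    ram_gb: float
    vram_gb: float


def classify_70b_fit(machine: Machine) -> str:
    """Hypothetical three-way split; real fit also depends on speed and context."""
    if machine.vram_gb >= VRAM_TARGET_GB:
        # Q4 weights can sit in GPU memory with some headroom.
        return "gpu-first"
    if machine.ram_gb >= RAM_FLOOR_GB:
        # Loadable with CPU offload, but expect slower responses.
        return "offload-compromise"
    # Neither threshold met: a 14B to 32B model is the safer daily choice.
    return "step-down"


if __name__ == "__main__":
    for name, m in {
        "48 GB workstation": Machine(ram_gb=128, vram_gb=48),
        "RTX 4090 + 128 GB RAM": Machine(ram_gb=128, vram_gb=24),
        "RTX 4090 + 32 GB RAM": Machine(ram_gb=32, vram_gb=24),
    }.items():
        print(f"{name}: {classify_70b_fit(m)}")
```

A memory check like this only says whether the model can load. Speed and usability still have to be tested on the actual machine.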
Install first if you are deliberately building a large local model workstation or server.
Step down if the goal is interactive chat, coding help, or repeated desktop prompts.
Use hosted fallback if you need reliability, team access, long context, or many repeated calls.
Large-model experiments where quality matters more than desktop simplicity.
48GB VRAM workstations, multi-GPU setups, or high-memory unified-memory systems.
Users deciding whether local 70B is worth the cost compared with a hosted API.
Treating a single RTX 4090 as the clean default story.
Ignoring quantization, context length, runtime support, and service reliability.
Using 70B when a faster 14B or 32B model answers the real prompt well enough.
Treat 128GB RAM as the loading floor and 192GB RAM as the more realistic starting point if you want normal apps open while the model runs.
Use 48GB VRAM as the target for a GPU-first setup. Smaller GPUs may run it with compromises, CPU offload, shorter context, or slower responses.
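The VRAM guidance above lists shorter context as one of the compromises. A minimal sketch of why: the key/value cache grows linearly with context length and competes with the weights for the same memory. The layer count, KV head count, and head dimension below are assumed values for a Llama-3-style 70B model, and the 1 GB runtime overhead is a placeholder guess. None of these come from this page, and the function names are hypothetical.

```python
# Assumed model shape (not stated on this page).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes an unquantized 16-bit cache

# Bytes of cache per token: keys and values for every layer.
KV_BYTES_PER_TOKEN = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gb(context_tokens: int) -> float:
    """Approximate KV cache size in decimal GB for a given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN / 1e9


def max_context_tokens(vram_gb: float, weights_gb: float = 43.0,
                       overhead_gb: float = 1.0) -> int:
    """Tokens of cache that fit after weights and an assumed overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0  # weights alone do not fit; offload would be required
    return int(free_bytes // KV_BYTES_PER_TOKEN)


if __name__ == "__main__":
    print(f"8192-token cache: about {kv_cache_gb(8192):.2f} GB")
    for vram in (24, 32, 48):
        print(f"{vram} GB VRAM: about {max_context_tokens(vram)} tokens fully on GPU")
```

Under these assumptions, a 24 GB or 32 GB card cannot hold the 43 GB of Q4 weights at all, so any run there depends on CPU offload, while a 48 GB target leaves room for a moderate context window.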
A 70B model is usually not the right first install. Start with a smaller model, then move up only after you know your runtime, context length, and machine comfort limits.
Related hardware guides
High-performance local LLMs for 24 GB VRAM RTX 4090 builds.
Top-end single-GPU local LLM picks for 32 GB VRAM RTX 5090 builds.
Stronger local LLMs for 32 GB RAM systems.
More target model checks
A workstation-grade RAM and VRAM guide for Qwen3 32B.
A 24GB VRAM local reasoning path for DeepSeek-R1 Distill Qwen 32B.
A practical RAM and VRAM starting point for Qwen3 8B local installs.
A 24GB RAM or 12GB VRAM starting point for Qwen3 14B.
A hardware planning page for running Gemma 3 27B locally.