Local AI is constrained by memory before anything else. Once a model fits, the best user experience usually comes from one that still leaves enough room for context, documents, tools, and repeated prompts.
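To make the memory-fit idea concrete, here is a back-of-envelope sketch in Python. It is an approximation, not a measurement. The function names, the 1.5 GB reserve, and the example model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration. Real runtimes add overhead that varies by implementation, and real quantized formats often use somewhat more than their nominal bits per weight.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence.

    Each layer stores one key and one value tensor (hence the factor 2),
    each of shape context_tokens x n_kv_heads x head_dim.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


def fits(total_memory_gb: float, params_billion: float, bits_per_weight: float,
         kv_gb: float, reserve_gb: float = 1.5) -> bool:
    """True if weights + KV cache + a reserve fit in the given memory.

    reserve_gb is an assumed allowance for runtime buffers and other
    software, not a measured figure.
    """
    needed = weight_memory_gb(params_billion, bits_per_weight) + kv_gb + reserve_gb
    return needed <= total_memory_gb


if __name__ == "__main__":
    # Hypothetical 8B model shape at an 8192-token context, 16-bit cache.
    kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
    print(f"weights at 4-bit: {weight_memory_gb(8, 4):.1f} GB")
    print(f"KV cache:         {kv:.2f} GB")
    print(f"fits in 8 GB:     {fits(8, 8, 4, kv)}")
```

Because the cache term grows linearly with `context_tokens`, halving the context window halves that part of the budget, which is why shorter contexts tend to run more smoothly on small machines.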
Start with compact 1B to 8B models, embeddings, and lightweight chat. Use shorter context windows for smoother local inference.
Look at 7B to 14B general models, coding specialists, and smaller vision models. A 12 GB to 16 GB GPU improves speed and context headroom.
This range opens up 24B to 34B models, stronger coding assistants, and larger multimodal models when paired with enough VRAM.
Large MoE and 70B+ models are best treated as infrastructure projects. Before committing to one, check total memory across GPUs, quantization options, runtime support, and license terms.
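The multi-GPU memory check can be roughed out the same way. The Python sketch below estimates how many GPUs the weights alone would need at a given quantization. It assumes the weights split evenly across devices and ignores the KV cache, activations, and inter-GPU communication buffers, so treat the result as a lower bound. The function name and the 4 GB per-GPU reserve are hypothetical.

```python
import math


def min_gpus_for_weights(params_billion: float, bits_per_weight: float,
                         gpu_memory_gb: float, reserve_gb: float = 4.0) -> int:
    """Lower bound on the GPU count needed to hold the weights.

    Assumes an even split across devices and reserves reserve_gb on each
    GPU for everything that is not weights (an assumed figure).
    """
    usable_gb = gpu_memory_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb must be smaller than gpu_memory_gb")
    weights_gb = params_billion * bits_per_weight / 8
    return math.ceil(weights_gb / usable_gb)


if __name__ == "__main__":
    # A 70B dense model on 24 GB GPUs at two precisions.
    for bits in (16, 4):
        n = min_gpus_for_weights(70, bits, gpu_memory_gb=24)
        print(f"{bits}-bit weights: at least {n} GPU(s)")
```

For MoE models, pass the total parameter count, not the active count: all experts normally have to be resident in memory even though only a few run per token.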