llama.cpp

llama.cpp is the low-level workhorse behind many local model workflows. It is useful when you need direct control over quantization, CPU fallback, GPU layers, context size, and local server behavior.

Best Fit

Developers who want precise runtime control.
CPU-first or mixed CPU/GPU machines.
Offline deployments with pre-downloaded GGUF files.
Testing quantization tradeoffs such as Q4, Q5, and Q8.

Typical Local Server Shape

llama-server -m ./models/model.gguf -c 8192 --host 127.0.0.1 --port 8080

Hardware Notes

If a model barely fits, reduce context length and GPU layers. If quality is poor, move from Q4 to Q5 or Q8 before changing model families.