LC
llama.cpp
FeaturedA lightweight local inference engine for GGUF models, CPU inference, Apple Metal, CUDA, Vulkan, and server deployments.
Local Runtimes#llama.cpp#GGUF#CPU inference#local server
llama.cpp
llama.cpp is the low-level workhorse behind many local model workflows. It is useful when you need direct control over quantization, CPU fallback, GPU layers, context size, and local server behavior.
Best Fit
- Developers who want precise runtime control.
- CPU-first or mixed CPU/GPU machines.
- Offline deployments with pre-downloaded GGUF files.
- Testing quantization tradeoffs such as Q4, Q5, and Q8.
Typical Local Server Shape
llama-server -m ./models/model.gguf -c 8192 --host 127.0.0.1 --port 8080
Hardware Notes
If a model barely fits, reduce context length and GPU layers. If quality is poor, move from Q4 to Q5 or Q8 before changing model families.