LC

llama.cpp

Featured

A lightweight local inference engine for GGUF models, CPU inference, Apple Metal, CUDA, Vulkan, and server deployments.

Local Runtimes
Open install page
#llama.cpp#GGUF#CPU inference#local server

llama.cpp

llama.cpp is the low-level workhorse behind many local model workflows. It is useful when you need direct control over quantization, CPU fallback, GPU layers, context size, and local server behavior.

Best Fit

  • Developers who want precise runtime control.
  • CPU-first or mixed CPU/GPU machines.
  • Offline deployments with pre-downloaded GGUF files.
  • Testing quantization tradeoffs such as Q4, Q5, and Q8.

Typical Local Server Shape

llama-server -m ./models/model.gguf -c 8192 --host 127.0.0.1 --port 8080

Hardware Notes

If a model barely fits, reduce context length and GPU layers. If quality is poor, move from Q4 to Q5 or Q8 before changing model families.