Advancements in Model Management with llama.cpp: Shaping Technology's Future
Local LLM deployment is no longer only about “can I run a model on my machine?” It’s about managing multiple models —small ones for quick tasks, larger ones for hard prompts, specialty models for embeddings or reranking—without turning your setup into a forest of ports and restart scripts. That’s the context for a major usability shift in llama.cpp : the project’s lightweight HTTP server ( llama-server ) introduced a native model management feature called router mode . Instead of starting a separate server process per model, router mode lets you run one server and load, unload, and switch models dynamically —including auto-discovery from your cache and LRU-based eviction when you hit a configurable limit. TL;DR Router mode in llama-server enables dynamic load/unload/switch between multiple GGUF models without restarting. It supports auto-discovery from the llama.cpp cache or a --models-dir folder, plus on-demand loading when a model is first requested....