🚀 Secret AI — Private AI Assistant on your phone
No Servers. No Tracking. No Data Uploading - 100% Offline & Secure
The ecosystem of local (personal or self-hosted) large language models (LLMs) has grown rapidly. A wide variety of open-weight models and inference tools now allow running powerful LLMs on consumer hardware or private servers. In particular, modern LLMs like Meta’s LLaMA family, Mistral AI’s models, Microsoft’s Phi-3 series, and others have been released with permissive licenses and optimized for local use. These models often come in compressed formats (e.g. quantized GGUF) and are supported by specialized runtime frameworks. Below we survey the state-of-the-art (as of 2025), covering models, software, deployment methods, optimization techniques, benchmarks, recent research, and use cases – with pointers to key open-source projects.
- Meta LLaMA series (LLaMA 2/3/4) – Meta has released multiple generations of “LLaMA” foundation models under open licenses. For example, LLaMA 3.1 was introduced with 8B, 70B, and 405B parameter variants, all with a long 128K-token context window. In early 2025 Meta further announced LLaMA 4, a Mixture-of-Experts (MoE) system with two models: Scout (~109B total, 17B active) and Maverick (~400B total, 17B active). Both are natively multimodal and support very long context (up to 10 million tokens for Scout). All LLaMA models are openly published (via Hugging Face/llama.com) under Meta’s community license, and many have been converted into local-friendly formats (GGUF). (Notably, Microsoft’s Phi-3 report uses LLaMA 3.1 as a baseline and claims its own models match or exceed it on several benchmarks.)
- Mistral AI models – Mistral AI offers both “free” (Apache-2.0-licensed) models and larger “premier” models (often under research or commercial licenses). Freely available models include Mistral 7B and Mistral Small 3 (24B), Devstral Small (24B, specialized for code), Mistral NeMo (12B multilingual, released July 2024), Codestral Mamba (a Mamba-architecture code model, July 2024), and Pixtral 12B (multimodal, Sept 2024). These are state-of-the-art in their niches. Larger premier models such as Mistral Large (123B flagship, updated Nov 2024) typically require special access. The Mistral documentation emphasizes that all free models are open-source under Apache 2.0. (See Mistral’s docs for the full catalog.)
- Microsoft Phi-3 family – In April 2024 Microsoft published the Phi-3 series of small LLMs as “open models”. The first release was Phi-3-mini (3.8B parameters); Microsoft claims it “outperforms models twice its size” and achieves strong results (69% on MMLU). Phi-3-mini is distributed via Azure, Hugging Face, and local frameworks like Ollama. Larger variants, Phi-3-small (7B) and Phi-3-medium (14B), were announced shortly thereafter. The official arXiv report states that Phi-3-small/medium reach roughly 75–78% on MMLU, significantly above the 3.8B model’s 69%. The authors later introduced Phi-3.5 (with MoE and vision-capable variants), which they report outperforms LLaMA 3.1 and Mistral models on reasoning and code tasks. Critically, Microsoft released these models for local use: Phi-3-mini is available in Ollama’s repository, and converted GGUF checkpoints (Mini/Small/Medium) are on Hugging Face alongside other open formats.
- DeepSeek – DeepSeek-AI (a Chinese startup) has open-sourced its R1 series. The first-generation DeepSeek-R1 is an enormous MoE model (671B total parameters; R1-Zero trained purely with RL, and R1 with additional fine-tuning) that the authors say “naturally emerges with powerful reasoning behaviors”. DeepSeek’s team also released six distilled dense models (1.5B, 7B, 8B, 14B, 32B, 70B) derived from R1. According to their report, these distilled models set a new state of the art among comparably sized open models. Notably, all R1 models and distillations are open-source: DeepSeek explicitly “open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models”. These are available on Hugging Face (e.g. `DeepSeek-R1-Distill-...-GGUF`), making them runnable locally. In benchmarks they rival proprietary models (the paper claims “performance comparable to OpenAI-o1-1217” on reasoning).
- Google Gemma models – Google released the Gemma family (a spin-off from Gemini research) under an open license. The first-generation Gemma ships in 2B and 7B variants (base and instruct), and Gemma 2 adds 9B and 27B sizes. The Hugging Face model card explicitly states “Gemma is a family of lightweight, state-of-the-art open models from Google… with open weights”. These decoder-only, primarily English LLMs are designed to run on personal hardware (laptop/desktop) and support typical tasks (QA, summarization, reasoning). Gemma’s small sizes make it well-suited for local use. (Google provides technical reports and example code, indicating Gemma is intended for easy on-device deployment.)
- Alibaba Qwen series – Alibaba’s “Tongyi Qianwen” (Qwen) models are openly accessible. Alibaba explicitly notes: “Alibaba Cloud provides Tongyi Qianwen (Qwen) models… to the open-source community”. Qwen3 (the latest generation) supports 119 languages, with an architecture allowing “flexible control of reasoning performance, speed, and cost”, and the family also includes multimodal (vision/audio) variants. Qwen2.5 (text-only) and Qwen3 models are released via Alibaba and Hugging Face and can be run locally. They achieve “competitive results” on coding, math, and general tasks, making them viable local LLM options, especially for Chinese and multilingual use cases.
- Other notable open models – In addition to the above, many older open models remain relevant on local devices. EleutherAI’s GPT-NeoX (20B), GPT-J (6B), and Pythia series (up to 12B) still run offline. Hugging Face hosts community models like Vicuna (a 13B LLaMA-based chat model), StarCoder (a 15B code model by BigCode), RedPajama-INCITE (3B/7B models trained on an open reproduction of the LLaMA data), and others. Cerebras-GPT (13B, from Cerebras) and MPT (by MosaicML) are also available. When discussing “local LLMs”, it’s worth noting these older open models laid the groundwork even if newer ones now lead in performance.
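Since most of the models above are redistributed as quantized GGUF checkpoints on Hugging Face, fetching one for local use can be as simple as the sketch below. The repo and file names are illustrative examples, not an endorsement of a specific release; check the model card for the quantization variants that are actually published.

```python
# Sketch: download a quantized GGUF checkpoint from Hugging Face.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # example repo; substitute your model
    filename="llama-2-7b-chat.Q4_K_M.gguf",    # 4-bit "K" quantization variant
)
print("Model saved to:", gguf_path)
```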
To run LLMs locally, a variety of specialized runtimes and tools have emerged:
- Inference Engines: The core engine for many local setups is llama.cpp (a GGML-based C/C++ inference library) by Georgi Gerganov. It efficiently runs quantized LLaMA-like models on CPU (and, increasingly, GPU). llama.cpp introduced the GGUF format in 2023 as a replacement for GGML, enabling richer metadata and quantization (see below). Several libraries wrap or extend llama.cpp functionality, including llama-cpp-python (Python bindings) and ctransformers (Python bindings over a fast C/C++ backend). For GPUs, specialized engines like ExLlama/ExLlamaV2 (by turboderp) accelerate LLaMA inference on NVIDIA cards, and frameworks like PyTorch/Accelerate, ONNX Runtime, and TensorRT can be used when FP16/BFloat16 models are available.
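A minimal llama-cpp-python sketch, assuming a GGUF file has already been downloaded; the path and parameters below are illustrative:

```python
# Sketch: run a quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # any GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available (0 = CPU only)
)

out = llm(
    "Q: Name three uses for a fully offline LLM.\nA:",
    max_tokens=128,
    temperature=0.7,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```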
- User Interfaces and Tools: Many community tools provide GUIs or CLI front-ends:
- text-generation-webui (by oobabooga) is a widely-used web interface (43.9k stars) that “supports multiple local backends” (llama.cpp, PyTorch, ExLlamaV3/V2, TensorRT, etc.). It runs offline and offers chat modes, prompt templating, file attachments, branching chats, and more.
- KoboldCpp (LostRuins) is a standalone executable with a KoboldAI-like interface. It is “an easy-to-use AI text-generation software for GGML/GGUF models” built on llama.cpp. KoboldCpp supports CPU/GPU, all GGUF models, and includes features like chat, “storywriter” mode, RAG (retrieval), and even image generation and text-to-speech integrations.
- LM Studio (by the LM Studio team) is a cross-platform local LLM app with GUI and server modes. It exposes a Python/JS SDK for building apps. The CLI/SDK and backend (the MLX engine on Apple silicon) are open-source. LM Studio makes it easy to browse/download open models (LLaMA, DeepSeek, Qwen, Phi, etc.) and run them in chats or as a local server. (Their site highlights features like “Chat with your local documents (RAG)”.)
- Ollama is an open-source CLI (with a simple UI) for macOS, Windows, and Linux. Ollama can pull models by name (e.g. `ollama pull deepseek-r1`) and run them locally; it also exposes a local REST API (see the sketch below). It integrates into other apps as well: for example, a developer demonstrated using Ollama to provide Llama 3.1 (8B) and Qwen2.5 Coder (1.5B) in a VS Code “Continue” coding-assistant extension.
- Other UIs include Open WebUI (an open-source Docker image), LoLLMS WebUI, Faraday.dev, and Gradio-based apps. Many of these aim to simplify setup (often via Docker) and provide chat or specialized modes (chatbots, writing assistants, etc.).
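A small sketch of calling a running Ollama server from Python over its REST API; it assumes `ollama serve` is running and the model has been pulled, and the model name is only an example:

```python
# Sketch: query a locally running Ollama server over its REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize the benefits of running LLMs offline.",
        "stream": False,   # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```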
- Developer Libraries: High-level libraries facilitate building local LLM applications. Hugging Face’s Transformers library can load many models (via its AutoModel/AutoTokenizer classes) and supports both PyTorch and pipeline APIs. Libraries like LangChain connect local LLM backends (via llama.cpp, ctransformers, or an API server) for chains, tools, and agents. ctransformers provides a drop-in fast backend with an OpenAI-like API. llama-cpp-python exposes llama.cpp to Python for research and deployment. There are also Rust frameworks like Candle (from Hugging Face), and tools like GPTQ-for-LLaMa and AutoGPTQ for model conversion and quantization (see next section).
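As a sketch of the LangChain route, assuming `langchain-community` and `llama-cpp-python` are installed; the model path is illustrative:

```python
# Sketch: wire a local llama.cpp backend into LangChain.
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

llm = LlamaCpp(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    temperature=0.2,
    max_tokens=256,
)

prompt = PromptTemplate.from_template(
    "You are a concise assistant.\nQuestion: {question}\nAnswer:"
)
chain = prompt | llm   # simple LCEL pipeline: prompt -> local model
print(chain.invoke({"question": "What is GGUF?"}))
```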
Deploying LLMs locally or on-prem often involves packaging and optimization:
- Containerization: Many setups use Docker or similar for reproducibility. NVIDIA offers the NIM (NVIDIA Inference Microservices) format; Microsoft notes Phi-3 can be deployed as an NVIDIA NIM microservice (with a standard API) on any server. Community projects provide Docker images: e.g. the text-generation-webui Docker image, the Open WebUI image, or Hugging Face’s TGI (Text Generation Inference) containers. These images bundle the inference runtime and a web/UI layer, so users can `docker run` a complete local LLM server. Model hubs (Hugging Face, Ollama, LM Studio catalogs) allow pulling models directly into containers or local storage.
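Many of these local servers expose an OpenAI-compatible endpoint (text-generation-webui and LM Studio do, for example). A minimal sketch of querying such a server from Python, assuming it listens on localhost port 8000 (the port and model name are assumptions, not defaults of any particular tool):

```python
# Sketch: talk to a containerized local LLM server with an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your local server's endpoint
    api_key="not-needed",                 # local servers usually ignore the key
)

reply = client.chat.completions.create(
    model="local-model",                  # model identifier as the server reports it
    messages=[{"role": "user", "content": "List two reasons to self-host an LLM."}],
    max_tokens=128,
)
print(reply.choices[0].message.content)
```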
- Quantized Models: To save memory and speed up inference, models are often quantized to lower-bit representations. Quantization schemes include 8-bit (Q8_0), 4-bit (Q4_0, Q4_K), 5-bit (Q5_K), 6-bit (Q6_K), and even 2–3-bit (Q2_K/Q3_K) formats. The llama.cpp/GGUF infrastructure applies quantization at conversion time. For example, TheBloke’s Llama 2 GGUF repository offers 2–8-bit quantizations for CPU/GPU, with separate AWQ/GPTQ 4-bit releases for GPU inference. The llama.cpp team’s GGUF format supports the newer “K” quantization types (e.g. GGML_TYPE_Q4_K uses 4.5 bits per weight). Popular quantization tools include GPTQ (post-training 4-bit quantization, via projects such as GPTQ-for-LLaMa and AutoGPTQ) and AWQ, which can produce ultra-compact 4-bit models. Quantized GGUF files can run on CPU or GPU (with llama.cpp, text-generation-webui, etc.). Users trade a small accuracy loss for large memory savings; e.g. a 7B model drops from roughly 14 GB in FP16 to about 4 GB at 4-bit, and a 13B model from ~26 GB to ~8 GB (see the estimates below).
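For intuition, the back-of-envelope estimates below use approximate effective bits-per-weight figures for common GGUF types; they cover weight storage only (the KV cache and runtime overhead add a few extra GB):

```python
# Rough memory estimates for different quantization levels (weights only).
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q5_K", 5.5), ("Q4_K", 4.5)]:
    print(f"7B  @ {label}: {weight_memory_gb(7, bits):5.1f} GB   "
          f"70B @ {label}: {weight_memory_gb(70, bits):6.1f} GB")
```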
- Formats: Beyond GGUF, models may be saved in ONNX (for inference engines), SafeTensors, PyTorch `.pt`, or Apple’s MLX format. Many local-serving tools accept GGUF natively (Ollama, text-generation-webui, LM Studio). On Apple silicon, LM Studio can use the MLX engine (open-sourced by Apple) to fully utilize the GPU/unified memory. Deployment typically involves converting a Hugging Face model to GGUF or ONNX, then loading it in llama.cpp, ctransformers, or a web UI.
- Optimization Techniques: In addition to quantization, other optimizations include weight tying, layer fusion, knowledge distillation (as in DeepSeek’s distilled models), and architecture/kernel tricks (e.g. FlashAttention for faster attention). Performance also scales with hardware: modern AVX-512 CPUs or Apple M-series chips can run small quantized models in real time. GPU inference can leverage tensor cores (FP16/BF16) or INT4 compute (with specialized kernels). Frameworks like Hugging Face Accelerate, ONNX Runtime with OpenVINO, or NVIDIA TensorRT are also used to tune speed on target hardware.
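As a sketch of combining several of these techniques, the snippet below loads an open model in 4-bit with bitsandbytes and requests FlashAttention kernels through Transformers. The model id is only an example, and recent versions of transformers, accelerate, bitsandbytes, and flash-attn are assumed to be installed:

```python
# Sketch: 4-bit GPU loading via bitsandbytes, with optional FlashAttention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"   # example Hugging Face repo

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16, store weights in 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",                        # let Accelerate place layers on the GPU
    attn_implementation="flash_attention_2",  # drop this line if flash-attn is not installed
)

inputs = tokenizer("Explain 4-bit quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```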
Several community benchmarks gauge how well local models perform:
- Phi-3 Benchmarks: According to Microsoft’s report, Phi-3-mini (3.8B) scores 69% on MMLU (a standard academic benchmark), rivaling OpenAI’s GPT-3.5 and outperforming many 7B-class models. Phi-3-small (7B) and Phi-3-medium (14B) achieve about 75% and 78% MMLU, respectively – competitive with top open models. Microsoft’s launch blog claims Phi-3-mini “outperforms models twice its size” on benchmarks. In other tests (MT-Bench, code benchmarks, etc.), Phi-3 also leads its size class. The later Phi-3.5-MoE model (6.6B active) reportedly surpasses LLaMA 3.1 and Mixtral (Mistral’s MoE) on reasoning and math.
- DeepSeek Benchmarks: DeepSeek’s R1 and its distilled models are reported to set a new state of the art for their sizes. For example, DeepSeek-R1-Distill-32B (built on a Qwen/Llama base) is reported to outperform OpenAI’s o1-mini on reasoning tasks. The DeepSeek team also highlights that their largest model matches or exceeds GPT-4o-mini, Gemini-1.5-Flash, etc. In the absence of widely published independent numbers, users nonetheless report strong reasoning and coding ability from DeepSeek models in community forums.
- Small-model Benchmarks: A detailed August 2024 analysis of small LLMs (0.5–9B) found that five of the top six performers were 2024-vintage models: Phi-3-mini (3.8B), Phi-3.5-mini (3.8B), Qwen2 7B, Mistral 7B, and 01.AI’s 9B model. Notably, Phi-3-mini (3.8B) achieved 100% on that specific test set, making it “the most accurate model”. A 1.5B Qwen model was nearly as accurate as many 7B models. These results suggest that very small (1–3B) models have improved rapidly: Phi-3 (3.8B) in particular outperformed all 6–9B models tested. In summary, recent benchmarks consistently rank the newest open models (Phi-3, Qwen, Mistral, DeepSeek) at the top for their size, whereas 2023-era models fall slightly behind.
- Latency & Throughput: On typical hardware, quantized 7–8B models generate roughly 5–20 tokens/sec on CPU (depending on the CPU, quantization level, and batch size). For example, a 13B model in 4-bit on an M2 Mac or a recent Intel/AMD CPU might do ~10 tokens/s (roughly real-time chat). Multi-GPU setups or tensor-core-optimized kernels can exceed 100 tokens/s for larger models. Exact numbers vary with model, precision, and hardware, but in practice most open models can be run interactively on modern desktops or laptops in quantized form.
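To see where a particular machine falls within these ranges, a rough throughput measurement with llama-cpp-python might look like the sketch below; the model path is illustrative, and results depend heavily on quantization and hardware:

```python
# Sketch: measure rough decode throughput (tokens/sec) for a local GGUF model.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b.Q4_K_M.gguf", n_ctx=2048, verbose=False)

start = time.perf_counter()
out = llm("Write a short paragraph about offline AI assistants.",
          max_tokens=200, temperature=0.7)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]   # tokens actually produced
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```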
Key academic/industry papers relevant to local LLMs include:
- Phi-3 Technical Report (Apr 2024) – Microsoft’s arXiv paper introduced Phi-3-mini (3.8B) and its bigger variants. It details training on 3.3T tokens and reports benchmark scores (69% MMLU for mini, 75–78% for the 7B/14B models). It also presents the Phi-3.5 series (MoE and vision models); e.g. “Phi-3.5-MoE (6.6B active) achieves superior performance… compared to Llama 3.1 and the Mixtral series”. This paper is a rich source on how to build high-quality small LLMs for on-device use.
- DeepSeek-R1 Paper (Jan 2025) – DeepSeek-AI’s arXiv submission describes training “reasoning capability” via reinforcement learning. It introduces DeepSeek-R1-Zero and R1 (a 671B MoE). Crucially, it states that the authors are open-sourcing R1-Zero, R1, and six distilled models from 1.5B up to 70B. It also reports that DeepSeek-R1 “achieves performance comparable to OpenAI-o1-1217” on reasoning tasks. This work highlights reinforcement learning and distillation techniques relevant to local LLMs.
- Mistral AI Docs/Blogs – While not formal papers, Mistral’s official docs detail their model releases (NeMo, Codestral, Mathstral, etc.) and research contributions (LLM reasoning, multimodal models). In particular, the release notes discuss Mistral NeMo (multilingual), Codestral Mamba (a Mamba-based code model), and the “Mathstral” model for math. These inform how open models are optimized (e.g. fine-tuning on math). (Mistral has also published MMLU and coding benchmarks, but the docs above mostly enumerate models.)
- Quantization & Inference Optimizations – Various works in late 2024 explore quantizing LLMs. For example, llama.cpp discussions introduced new quantization schemes (Q4_K, Q6_K) and evaluated quality via perplexity tests. NVIDIA and others also released INT8 kernels (e.g. TensorRT-LLM). Methods for 4-bit quantization with negligible accuracy loss (AWQ, GPTQ) are widely used, documented in papers as well as blog posts and GitHub notes. Overall, “LLM acceleration” is an active research area with many incremental contributions.
Local LLMs are already being applied in many domains where privacy, speed, or cost matter:
- Chatbots and Assistants: Offline chat interfaces (for personal or enterprise use) rely on local LLMs to preserve data privacy. For example, LM Studio advertises “Chat with your local documents” (a RAG setup), enabling on-prem knowledge bots. Apps on mobile or desktop can bundle an LLM for Q&A without sending data to a server. Local assistants can answer emails, schedule tasks, or act as personal tutors.
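As a rough illustration of such a private RAG chatbot (a sketch, not LM Studio’s actual implementation), one could embed documents locally with sentence-transformers and feed the best match to a GGUF model; all model names, paths, and documents below are placeholders:

```python
# Minimal local RAG sketch: embed docs, retrieve the closest one, answer with context.
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm, Monday through Friday.",
    "Enterprise plans include on-premise deployment and SSO.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # small local embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def answer(question: str, llm: Llama) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    best = docs[int(np.argmax(doc_vecs @ q_vec))]             # cosine similarity via dot product
    prompt = f"Context: {best}\nQuestion: {question}\nAnswer briefly:"
    return llm(prompt, max_tokens=96)["choices"][0]["text"]

llm = Llama(model_path="./models/phi-3-mini.Q4_K_M.gguf", n_ctx=2048, verbose=False)
print(answer("Can I get a refund after three weeks?", llm))
```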
- Coding Assistants: Several projects integrate local LLMs into IDEs and code tools. As one example, a developer used Ollama to provide Llama 3.1 (8B) and Qwen2.5-Coder (1.5B) models to the VS Code “Continue” extension, giving on-device code autocomplete/chat. Tabnine- or Copilot-like plugins can leverage local open models (e.g. StarCoder, CodeLlama) instead of the cloud. Researchers also fine-tune local models for code tasks (e.g. Mistral’s Codestral series).
- Education and Tutoring: LLMs can power personalized learning aids. A locally run LLM can quiz a student, explain concepts, or generate practice problems without exposing student data externally. Tools like Khanmigo could be re-implemented with an open model on local servers. The performance of small models (1–3B) has now reached the level where they can handle math/logic questions reasonably well, making them useful for guided learning.
- Creative Writing and World-Building: Hobbyist and professional writers use local LLMs for brainstorming, story generation, and editing. Interfaces like KoboldCpp offer an “adventure” or “storywriter” mode for interactive fiction. A writer can prompt a local LLM (e.g. Vicuna or an 8B Llama) to draft scenes or dialogue. Because the model runs on a personal machine, all drafts remain private. Generative art tools may combine LLMs with diffusion models (also run locally) for multimedia creativity.
- Game AI: Local LLMs can drive NPC dialogue and game narrative without requiring an internet connection. The Kobold series originally targeted AI Dungeon–style games. Indie game developers now embed LLMs as character AI. The extremely long context of LLaMA 4 Scout (10M tokens) hints at future games with persistent, evolving worlds powered by LLM memory.
- Edge and On-device AI: The trend toward on-device AI (e.g. smartphone LLMs) relies on models like Phi-3-mini. As Microsoft notes, Phi-3-mini is small enough to “run locally on a phone”. Specialized hardware (Apple Neural Engine, Qualcomm AI cores) can accelerate these models. We expect iOS/Android apps to ship with LLMs that perform inference offline, much as phones now run speech-to-text on-device.
- Enterprise and Sensitive Domains: Companies deploy local LLM servers for legal, medical, or financial Q&A where data confidentiality is crucial. By using open models internally, firms avoid API costs and reduce leakage risk. For instance, an internal chatbot on AWS can load a quantized GGUF Llama 3 70B via a container and answer staff questions about proprietary documents. Similarly, local LLMs can perform customer support, generate reports, or analyze internal data without sending information to outside cloud services.
A non-exhaustive list of key GitHub/Hugging Face projects enabling local LLMs:
- ggml-org/llama.cpp – Core C/C++ library for LLaMA/GPT-style inference on CPU/GPU. It introduced the GGUF format (Aug 2023) and supports the “K” quantization types. llama.cpp is widely used for local LLM serving (it “provides a powerful and efficient way to run LLMs on edge devices”).
- oobabooga/text-generation-webui – A popular web-based UI (43.9k stars) for LLM chat/interaction. Supports many backends (llama.cpp, PyTorch, ExLlama, TensorRT, etc.). Offers dark/light themes, branchable chats, PDF upload, and OpenAI-compatible API endpoints. Easiest way to run a local chatbot via Docker or standalone.
- LostRuins/koboldcpp – A single-file GUI application (with 1.6k stars) that “builds off llama.cpp”. Supports CPU/GPU, all GGUF models, and has multiple modes (chat, instruct, adventure/storywriter) and extensions. It’s designed for storytelling/RPG scenarios but works for general chat generation.
- LM Studio – Although the GUI is proprietary, the LM Studio CLI and SDK (on GitHub) are open-source (MIT). They allow running models locally via a simple Python/JS API. The project also open-sources the MLX inference engine (for Apple Silicon). LM Studio simplifies discovering and serving local models.
- Ollama – An open-source CLI tool (with a small GUI) for local LLMs. It provides a model registry so users can `ollama pull [model-name]` and then `ollama run [model]`. (See Ollama’s model library, which includes Phi-3, LLaMA, Qwen, DeepSeek, etc.) Though some proprietary parts exist, the core is MIT-licensed, as noted in blogs.
- TheBloke’s Repositories (Hugging Face) – TheBloke publishes many quantized model packs. For example, his “Llama-2-7B-GGUF” repo provides 4- to 8-bit GGUF files, AWQ/GPTQ variants, etc. It lists supported runtimes (llama.cpp, text-generation-webui, KoboldCpp, LM Studio, etc.) and provides conversion scripts. Searching Hugging Face for “TheBloke GGUF” yields quantized releases for LLaMA, LLaMA 3, Mistral, and more.
- Hugging Face Transformers and Datasets – The official Python libraries support loading many open LLMs (text and vision) for inference or fine-tuning, and include tokenizers and conversion scripts. Combined with the `accelerate` and `bitsandbytes` libraries, one can run models in 8-bit (or 4-bit) on GPU. Numerous community Hugging Face Spaces also offer quick demos of local LLMs.
- Other Tools: `llama-cpp-python` (PyPI) wraps llama.cpp in Python, `ctransformers` is a C/C++-backed Python wrapper for fast inference (see the sketch below), and frameworks like Haystack or LlamaIndex enable RAG-style question answering with local models. There are also many smaller repos: GPTQ implementations for quantization, LoLLMS WebUI (ParisNeo’s alternative UI), and more. (See the “About GGUF” section of any GGUF model card for an up-to-date list of supported clients.)
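For completeness, a minimal ctransformers call might look like the sketch below; the repo and file names are illustrative and should be checked against the actual Hugging Face listing:

```python
# Sketch: load a GGUF model directly from the Hub with ctransformers.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",          # example repo
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # example quantized file
    model_type="mistral",
    gpu_layers=0,        # CPU-only; raise to offload layers to a GPU
)
print(llm("List one advantage of running an LLM offline:", max_new_tokens=64))
```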
Sources: Recent announcements and documentation of open LLMs (Meta LLaMA, Mistral, Microsoft Phi-3); Hugging Face model cards (Google Gemma); Alibaba Cloud’s Qwen pages; open-source tool repos and readmes; technical reports and benchmarks; and official docs (LM Studio, Ollama). These confirm the models, tools, and performance claims discussed above. All cited sources are publicly accessible.