1. Why Run AI Locally?
- Privacy: Sensitive data (code, documents, client information, medical data) never leaves your machine. No need to worry about OpenAI's or Anthropic's data handling policies.
- Cost: Zero per-token API costs. A heavy API user spending $50–$200/month on GPT-4o-class models recovers the cost of a GPU within months.
- Availability: No outages, no rate limits, no API availability dependencies.
- Customisation: Fine-tune models on your own data, add custom system prompts, create custom model variants — far more control than hosted APIs typically allow.
- Latency: Local inference latency (first token) is 50–200ms. Cloud API first token latency is typically 300–1500ms.
- Air-gapped environments: Security-sensitive organisations (defence, healthcare, finance) can run AI without network connectivity requirements.
2. Hardware Guide: GPU Tiers
The key metric is VRAM (Video RAM). A model's minimum VRAM ≈ its weight size at 4-bit quantization (roughly 0.5GB per billion parameters) + 1–2GB for context (the KV cache). A 7B model at 4-bit ≈ 4.5GB VRAM minimum.
| GPU | VRAM | Max Model Size (4-bit) | 7B Speed | Price (2026) |
|---|---|---|---|---|
| RTX 3060 | 12GB | ~10B | ~15 tok/s | $250–$320 |
| RTX 3090 | 24GB | ~20B | ~40 tok/s | $500–$700 (used) |
| RTX 4090 | 24GB | ~20B (fast) | ~80 tok/s | $1,700–$2,000 |
| RTX 3090 × 2 (NVLink) | 48GB | ~40B | ~35 tok/s | $1,000–$1,400 |
| RTX 4090 × 2 | 48GB | ~40B (fast) | ~70 tok/s | $3,400–$4,000 |
| A100 40GB | 40GB | ~35B | ~80 tok/s (HBM) | $5,000–$8,000 used |
Context window matters: running a 70B model at 4-bit requires ~40GB of VRAM for the weights alone. On a 24GB GPU, a 70B model only runs by offloading layers to system RAM, which slows generation sharply — and only a small context window (2K–4K tokens) is practical: enough for Q&A but limiting for long-document tasks.
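The rule of thumb above can be expressed as a quick estimator. This is a simplified sketch — real usage also varies with the quantization format, context length (KV cache), and runtime overhead:

```python
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weights at the given bit width plus fixed overhead.

    weights_gb = parameters x (bits / 8) bytes each, expressed in GB.
    Real usage also grows with context length (KV cache).
    """
    weights_gb = params_billions * bits / 8
    return round(weights_gb + overhead_gb, 1)

# A 7B model at 4-bit: ~3.5GB of weights plus overhead -> ~5GB
print(estimate_vram_gb(7))    # 5.0
# A 70B model at 4-bit: ~35GB of weights plus overhead -> ~36.5GB
print(estimate_vram_gb(70))   # 36.5
```

The estimate lands near the figures in the table; add headroom for long contexts.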
3. Apple Silicon: The Best Value Option
Apple Silicon (M2/M3/M4) uses unified memory — the CPU and GPU share the same RAM pool. A Mac with 64GB unified memory can run a 70B model at 4-bit effectively, using the full 64GB as "VRAM." This is unique to Apple's architecture.
| Mac | Unified Memory | Max Model (4-bit) | 7B Speed |
|---|---|---|---|
| M2 MacBook Pro 16GB | 16GB | ~12B | ~20 tok/s |
| M3 Max 64GB | 64GB | ~55B | ~35 tok/s |
| M4 Max 128GB | 128GB | 100B+ | ~45 tok/s |
| Mac Pro M2 Ultra 192GB | 192GB | 100B+ effortlessly | ~50 tok/s |
For most developers who already have a Mac, upgrading to 32–64GB unified memory at purchase time is the best value path to a capable local AI workstation — no separate GPU required.
4. Ollama: Installation and Model Management
Ollama is the easiest way to run LLMs locally. It handles model download, quantization selection, GPU acceleration, and serves a local REST API compatible with OpenAI's API format:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.com/download
# Pull and run a model (downloads ~4.7GB for Llama 3.1 8B Q4)
ollama run llama3.1
# Pull without running
ollama pull llama3.1:70b
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.1
# Show model info
ollama show llama3.1
# Set a custom system prompt (from inside the interactive session)
ollama run llama3.1
>>> /set system You are an expert Python developer. Answer concisely.
# Serve API (runs by default on http://localhost:11434)
ollama serve
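For a persistent custom variant, Ollama uses a Modelfile. A minimal example (the model name `python-helper` and the prompt are placeholders):

```
# Modelfile — defines a custom variant of llama3.1
FROM llama3.1
SYSTEM "You are an expert Python developer. Answer concisely."
PARAMETER temperature 0.3
```

Build it with `ollama create python-helper -f Modelfile`, then run it like any other model with `ollama run python-helper`.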
5. Best Models to Run Locally in 2026
| Model | Size | Strengths | Ollama Pull |
|---|---|---|---|
| Llama 3.3 | 70B | Best overall quality in open-source; matches GPT-4o for most tasks | ollama pull llama3.3 |
| Llama 3.2 | 3B / 1B | Ultra-fast on any hardware; great for simple Q&A and edge devices | ollama pull llama3.2 |
| DeepSeek-R1 | 7B / 70B | Strong reasoning; chain-of-thought visible; math and coding | ollama pull deepseek-r1 |
| Gemma 3 | 4B / 27B | Google model; excellent instruction following; multilingual | ollama pull gemma3 |
| Mistral Nemo | 12B | Good balance of size/quality; fast on 24GB VRAM | ollama pull mistral-nemo |
| Qwen2.5-Coder | 7B / 32B | Specialised for code; excellent for local coding assistant | ollama pull qwen2.5-coder |
| Phi-4 | 14B | Microsoft small model; punches above weight on reasoning tasks | ollama pull phi4 |
6. Open WebUI: ChatGPT-Style Interface
Open WebUI (formerly Ollama WebUI) is a feature-rich self-hosted web interface for Ollama. Install with Docker:
# Run Open WebUI with Docker — connects to local Ollama
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
# Access at: http://localhost:3000
# Features: multi-model chat, file uploads, RAG, image generation,
# custom system prompts, chat history, model management
Open WebUI supports: conversation history, file uploads for RAG, image generation via ComfyUI, connecting to OpenAI API simultaneously (use local or cloud depending on the task), custom agents, and multi-user setup with authentication — making it a production-ready private AI assistant platform.
7. ComfyUI: Local Image Generation
ComfyUI is a node-based interface for Stable Diffusion and Flux image generation models. Run locally for unlimited, private image generation:
# Clone and set up ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
# Download a model (Flux.1-schnell for speed, SDXL for quality)
# Place in ComfyUI/models/checkpoints/
# Run
python main.py --listen 0.0.0.0 --port 8188
# Access at: http://localhost:8188
Recommended models in 2026:
- Flux.1-schnell: Black Forest Labs model; 4-step generation (1–3 seconds on RTX 4090); excellent quality; Apache 2.0 license.
- Flux.1-dev: Higher quality than schnell; 20–50 steps; non-commercial license.
- SDXL + refiner: Older but excellent for photorealism; vast community of fine-tuned variants on CivitAI.
8. Qdrant: Local Vector Database for RAG
Qdrant is an open-source vector database for storing and querying embeddings — the backbone of Retrieval-Augmented Generation (RAG) systems. Run a private knowledge base over your own documents:
# Start Qdrant locally with Docker
docker run -d \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
# Python: index documents and query
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import httpx

client = QdrantClient("localhost", port=6333)

# Create collection — nomic-embed-text produces 768-dimensional vectors
client.create_collection(
    collection_name="my_docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Embed text with a local embedding model via Ollama's embeddings API
def embed(text: str) -> list[float]:
    resp = httpx.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return resp.json()["embedding"]

# Upsert embeddings for your document chunks
documents = ["...your document chunks..."]
points = [PointStruct(id=i, vector=embed(doc), payload={"text": doc})
          for i, doc in enumerate(documents)]
client.upsert(collection_name="my_docs", points=points)

# Query: find the most similar documents to a question
query_vector = embed("What are the RAG performance benchmarks?")
results = client.search("my_docs", query_vector, limit=5)
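To close the RAG loop, the retrieved chunks are stitched into a prompt for the chat model. A minimal sketch — the function name is illustrative, and it assumes each hit carries the document text in its payload, as in the upsert above:

```python
def build_rag_prompt(question: str, hits: list[dict], max_chunks: int = 5) -> str:
    """Assemble retrieved document chunks into a grounded prompt."""
    context = "\n\n".join(
        f"[{i + 1}] {hit['payload']['text']}" for i, hit in enumerate(hits[:max_chunks])
    )
    return (
        "Answer using only the context below. Cite chunk numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Real Qdrant hits expose a .payload attribute; plain dicts stand in here:
hits = [{"payload": {"text": "Qdrant stores vectors on disk."}},
        {"payload": {"text": "RAG retrieves before generating."}}]
prompt = build_rag_prompt("How does RAG work?", hits)
```

The resulting string then goes to the chat model as the user message — for example via Ollama's OpenAI-compatible API shown in the next section.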
9. Using Ollama as a Local API
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. Any code using OpenAI's SDK works with local models by changing the base URL:
from openai import OpenAI

# Point the OpenAI client at local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in 3 sentences."},
    ],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
10. Performance Expectations
| Hardware | Model | Tokens/sec (generation) | Notes |
|---|---|---|---|
| M2 MacBook Pro 16GB | Llama 3.2 3B | ~40 tok/s | Suitable for real-time chat |
| M3 Max 64GB | Llama 3.3 70B Q4 | ~18 tok/s | Usable for chat; slower on long outputs |
| RTX 3090 (24GB) | Llama 3.1 8B Q4 | ~80 tok/s | Very fast; feels instant |
| RTX 4090 (24GB) | Llama 3.1 8B Q4 | ~140 tok/s | Noticeably faster than cloud APIs |
| CPU only (i9-13900K) | Llama 3.2 3B | ~8–12 tok/s | Workable for occasional use |
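Those throughput numbers translate directly into response time with back-of-envelope arithmetic (first-token latency ignored for simplicity):

```python
def response_seconds(output_tokens: int, tok_per_sec: float) -> float:
    """Time to stream a full answer at a given generation speed."""
    return round(output_tokens / tok_per_sec, 1)

# A 500-token answer at speeds from the table above:
print(response_seconds(500, 140))  # RTX 4090, 8B Q4: 3.6s
print(response_seconds(500, 18))   # M3 Max, 70B Q4: 27.8s
```

This is why 20+ tok/s feels instant in chat: a typical paragraph-length reply arrives in a few seconds.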
11. Frequently Asked Questions
Is local AI really comparable to ChatGPT?
For most everyday tasks — summarisation, coding assistance, Q&A, writing — Llama 3.3 70B is genuinely competitive with GPT-4o. For complex multi-step reasoning, frontier models (GPT-4o, Claude 3.7 Sonnet) still have an edge. The gap is closing fast: in 2024, local models were one generation behind; in 2026, they're within one minor model version for most tasks. For private data and unlimited usage, local AI is excellent value.
How much VRAM do I actually need?
For a great everyday experience: 24GB VRAM covers 7B–13B models comfortably and can run 20B models. For 70B models: 40–48GB VRAM (two RTX 3090s or Mac M-series with 64GB+). For a starter setup: 12–16GB VRAM runs 7B models competently. Don't underestimate Apple Silicon — the unified memory architecture makes it uniquely capable for LLMs.
12. Glossary
- Ollama
- An open-source tool that simplifies running LLMs locally, managing models, and serving an OpenAI-compatible API.
- Open WebUI
- A self-hosted web interface for Ollama providing ChatGPT-like UX with multi-model support, RAG, and history.
- Quantization
- Reducing model weight precision (e.g., from 16-bit float to 4-bit integer) to reduce VRAM requirements with minimal quality loss.
- Tokens/second
- The speed at which a model generates output. 20+ tok/s feels instant in chat; under 5 tok/s feels slow.
- Unified Memory (Apple)
- Apple Silicon's architecture where CPU and GPU share the same memory pool, allowing large models to use full RAM as VRAM.
- ComfyUI
- A node-based interface for running Stable Diffusion and Flux image generation models locally.
- Qdrant
- An open-source vector database for storing and searching embeddings, used in local RAG applications.
13. References & Further Reading
- Ollama — Run LLMs Locally
- Open WebUI on GitHub
- ComfyUI on GitHub
- Qdrant — Vector Database
- Hugging Face Model Hub
Install Ollama right now, pull llama3.1, and have your first local AI conversation. It takes under 5 minutes and the model downloads automatically. Once you see 80+ tokens/second streaming in your terminal, the cloud APIs will feel unnecessary for most tasks.