The Problem: LLMs in the Cloud Have a Cost
Using large language models through cloud APIs — OpenAI, Anthropic, Mistral, and others — is powerful and convenient. But it comes with real trade-offs:
- Cost — API tokens add up fast, especially in development and experimentation
- Privacy — Every prompt you send goes to a third-party server
- Rate limits — You hit ceilings when testing at scale
- Internet dependency — No connection, no AI
- Data sovereignty — In regulated industries (healthcare, finance, legal), sending data to external APIs can be a compliance problem
Ollama addresses these trade-offs by letting you run LLMs directly on your own machine: once a model is downloaded it works offline, with no per-token API costs and full control over your data.
What is Ollama?
Ollama is an open-source tool that lets you download, run, and manage large language models locally. Think of it as Docker for AI models — you pull a model with a single command, and Ollama handles everything else: model weights, quantization, memory management, GPU acceleration, and serving a REST API you can call from any application.
Released in 2023 and at version 0.13.x as of 2025, Ollama has become one of the most popular local LLM runtimes in the developer community, with over 112 million model pulls for Llama 3.1 alone.
It supports macOS, Linux, and Windows, and runs on both CPU and GPU (NVIDIA, AMD, and Apple Silicon).
Ollama is built on top of llama.cpp, a highly optimized C++ inference engine for running quantized LLMs on consumer hardware. Ollama wraps this with a clean CLI and REST API, abstracting all the complexity away.
Supported Models
Ollama supports a growing library of popular open-source models:
| Model | Description |
|---|---|
| Llama 3.1 / 3.2 | Meta’s open models (3.1: 8B, 70B, 405B; 3.2: 1B, 3B, plus 11B and 90B vision variants) |
| Mistral / Mixtral | Fast, efficient models from Mistral AI |
| Gemma 3 | Google’s open model family |
| Qwen3 | Alibaba’s multilingual model series |
| CodeLlama | Code-optimised version of Llama |
| DeepSeek-R1 | Strong reasoning model |
| Phi-3 / Phi-4 | Microsoft’s small but capable models |
| Nomic Embed | Embedding model for semantic search / RAG |
Models are pulled from the Ollama model library — similar to how Docker Hub works for containers.
How Ollama Works
Here’s what happens under the hood when you run a model:
- Pull: Ollama downloads the model weights from its registry and stores them locally (~/.ollama/models)
- Quantization: Models are stored in a quantized format (e.g. 4-bit or 8-bit), dramatically reducing their size and memory requirements without significant quality loss
- Runtime: Ollama starts a local server process that loads the model into memory
- GPU detection: Ollama automatically detects whether you have a compatible GPU and uses VRAM for inference; it falls back to CPU if not
- API: A REST API is exposed on http://localhost:11434, the same interface your apps, scripts, and tools use to talk to the model
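Because the server stays running and keeps a recently used model loaded in memory (for about five minutes after the last request by default), follow-up requests are fast: the model doesn’t reload from disk on every call.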
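One way to see the server and API steps in action is to ask the running server which models it has stored locally. The sketch below assumes the default address and uses the /api/tags endpoint; the helper names parse_model_names and list_local_models are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this article


def parse_model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the local Ollama server which models have already been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling list_local_models() while the server is stopped raises urllib.error.URLError, which makes it a quick check that the runtime is up before sending prompts.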
Installation
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download
Core Commands
# Pull and run a model (downloads if not already local)
ollama run llama3.2
# Pull a model without running it
ollama pull mistral
# List all locally installed models
ollama list
# Remove a model
ollama rm mistral
# See currently running models
ollama ps
# Show model details
ollama show llama3.2
Once you run ollama run llama3.2, you get an interactive terminal chat session, like a local version of ChatGPT, entirely on your machine.
The REST API
Ollama exposes a local HTTP API at http://localhost:11434. Its native endpoints live under /api (for example /api/generate and /api/chat), and it also serves OpenAI-compatible endpoints under /v1, so many tools built for OpenAI can point at Ollama instead with minimal changes.
Generate a completion:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain VLANs in one paragraph.",
"stream": false
}'
Chat completion (OpenAI-compatible):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is Docker?"}
]
}'
Using Ollama with Python
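Before the official library, it may help to see that the /api/generate request from the curl example can be sent from Python with only the standard library. This is a minimal sketch, assuming the server is running on the default port and llama3.2 has been pulled; build_generate_request and generate are hypothetical helper names for this example.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama3.2", base_url="http://localhost:11434"):
    """Build the same POST request as the curl example for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        # With "stream": false the server replies with one JSON object
        # whose "response" field holds the full completion.
        return json.load(resp)["response"]
```

A call such as generate("Explain VLANs in one paragraph.") would return the completion as a string. Building the request separately from sending it keeps the payload easy to inspect without a running server.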
# Install the official library
# pip install ollama
import ollama
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'What is a VLAN?'}
    ]
)
print(response['message']['content'])
For streaming responses:
import ollama
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a Python function for binary search'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Custom Models with Modelfile
Ollama lets you create custom models using a Modelfile — similar to a Dockerfile. You can set a system prompt, adjust parameters, and build a reusable custom persona:
# Modelfile
FROM llama3.2
SYSTEM """
You are a senior DevOps engineer. You answer questions concisely,
using real-world examples and command-line snippets where relevant.
Always ask for context before giving advice.
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Build and run it:
ollama create devops-assistant -f ./Modelfile
ollama run devops-assistant
Real-World Use Cases
🔒 Privacy-First Chatbot
Build an internal Q&A tool for your team where no data ever leaves your infrastructure. Ideal for companies with strict data policies — legal, finance, healthcare.
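A Q&A tool like this needs to carry the conversation itself, because each /api/chat request is independent and the server only sees the messages you send. The sketch below shows one possible shape, assuming the default local address; build_chat_payload and ask are hypothetical helpers, and the temperature and num_ctx defaults simply mirror the Modelfile example rather than any recommended setting.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def build_chat_payload(history, user_message, system_prompt,
                       model="llama3.2", temperature=0.7, num_ctx=4096):
    """Assemble an /api/chat request body from prior turns plus a new question."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        # Per-request options; the same knobs a Modelfile sets with PARAMETER.
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def ask(history, user_message, system_prompt, base_url=OLLAMA_URL):
    """Send one turn to the local server; return (reply, updated_history)."""
    payload = build_chat_payload(history, user_message, system_prompt)
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)["message"]["content"]
    updated = history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ]
    return reply, updated
```

Starting with an empty history list and feeding each returned history into the next ask call gives a multi-turn chat in which every prompt and reply stays on the local machine.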
🧪 Local Development & Testing
Prototype AI features during development without burning API credits. Run experiments, iterate fast, test edge cases — all offline.
📚 RAG Pipelines
Combine Ollama with a vector database (like ChromaDB or Weaviate) and an embedding model (like nomic-embed-text) to build a Retrieval Augmented Generation (RAG) pipeline entirely on-premise:
Your Docs → Embedding Model (Ollama) → Vector DB → Query → LLM (Ollama) → Answer
🤖 CI/CD AI Tools
Use Ollama in CI pipelines to run AI-powered code review, test generation, or commit message analysis — without external API dependencies.
🎓 Learning & Research
Experiment with different model architectures, compare outputs, and build intuition for how LLMs behave, all without API costs or rate limits.
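Of the use cases above, the RAG pipeline has the most moving parts, so here is a rough sketch of its retrieval step. It assumes nomic-embed-text has been pulled and uses the /api/embed endpoint; a plain in-memory list with cosine similarity stands in for the vector database, and the function names embed, cosine_similarity, and top_k are illustrative only.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def embed(texts, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Return one embedding vector per input text from the local server."""
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k document vectors most similar to the query vector."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]
```

In a full pipeline you would embed the documents once, embed each incoming query, pick the passages at the top_k indices, and paste them into the prompt sent to the chat model. A dedicated vector database takes over the storage and search once the document set outgrows a list.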
Hardware Requirements
| Model Size | Min RAM / VRAM | Recommended |
|---|---|---|
| 1–3B params (e.g., Llama 3.2 3B) | 4 GB | Any modern laptop |
| 7–8B params (e.g., Llama 3.1 8B) | 8 GB | 8–16 GB RAM / GPU |
| 13B params | 16 GB | 16 GB+ RAM / GPU |
| 70B params | 40+ GB | High-end GPU / Apple M2 Ultra |
For most developers, a 7B or 8B model gives an excellent balance of speed, quality, and hardware requirements. On Apple Silicon (M1/M2/M3), Ollama uses the Unified Memory architecture efficiently, making Macs excellent local LLM machines.
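The figures in the table follow roughly from simple arithmetic: parameter count times bits per weight, divided by eight, gives the bytes needed for the weights alone. The helper below is a back-of-the-envelope sketch with a made-up name (estimate_weight_gb); it ignores the context cache and runtime overhead, which is why practical minimums sit above its result.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Covers weights only; the KV cache and runtime overhead come on top.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (3, 8, 13, 70):
        print(f"{size}B at 4-bit: ~{estimate_weight_gb(size):.1f} GB of weights")
```

By this estimate an 8B model at 4 bits needs about 4 GB for weights and a 70B model about 35 GB, which lines up with the table once headroom for context and the operating system is added.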
Security Consideration
By default, Ollama’s API is bound to localhost only — so external machines can’t reach it. However, if you expose it on 0.0.0.0 (for Docker or remote access), the API has no built-in authentication. In that case, always place it behind a reverse proxy (like Nginx) with auth, or restrict access via firewall rules.
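A small check can catch an accidental public bind before it ships. The sketch below reads the OLLAMA_HOST environment variable, which Ollama uses to set its bind address (a detail not covered above), and reports whether the address is loopback-only. The function names are hypothetical, and the parsing handles only the common host, host:port, and URL forms.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def bind_is_loopback(host_value: str) -> bool:
    """True if the bind address is localhost or a loopback IP such as 127.0.0.1."""
    value = host_value.strip()
    if "://" not in value:
        value = "http://" + value  # lets urlsplit separate host from port
    hostname = urlsplit(value).hostname or ""
    if hostname == "localhost":
        return True
    try:
        return ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        # Unparseable or non-IP hostnames are treated as exposed, to be safe.
        return False


def check_environment() -> str:
    """Describe the exposure implied by the current OLLAMA_HOST setting."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if bind_is_loopback(host):
        return f"{host}: reachable from this machine only"
    return f"{host}: reachable from the network; add a proxy with auth or firewall rules"
```

Running check_environment() in a deployment script or container entrypoint gives an early warning when the API is bound to 0.0.0.0 without anything in front of it.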
Ollama vs Alternatives
| Tool | Best For |
|---|---|
| Ollama | Simplest CLI + API setup, Docker-like workflow |
| LM Studio | GUI-based, great for non-developers |
| llama.cpp | Maximum control, lowest-level access |
| Jan | Desktop app with conversation history |
| vLLM | High-throughput production serving at scale |
For developers who want a quick, clean local LLM setup that integrates easily into code, Ollama is the go-to in 2026.
Quick Summary
| Concept | One-liner |
|---|---|
| Ollama | Open-source local LLM runtime — Docker for AI |
| Modelfile | Config file to create custom model personas |
| REST API | Served at localhost:11434; OpenAI-compatible endpoints under /v1 |
| Quantization | Shrinks model size for consumer hardware |
| llama.cpp | The inference engine Ollama runs on top of |
| Use cases | Privacy-first AI, RAG, local dev, research, CI tools |
Get Started
# 1. Install
curl -fsSL https://ollama.com/install.sh | sh
# 2. Run your first model
ollama run llama3.2
# 3. Say hello
>>> Hello! What can you do?
That’s it. Local AI in three steps.
- Official site: ollama.com
- GitHub: github.com/ollama/ollama
- Model library: ollama.com/library
Co-authored by Vishwakarma, Deeps 2nd Brain
