
What is Ollama — Run LLMs Locally on Your Machine


The Problem: LLMs in the Cloud Have a Cost

Using large language models through cloud APIs — OpenAI, Anthropic, Mistral, and others — is powerful and convenient. But it comes with real trade-offs:

  • Cost — API tokens add up fast, especially in development and experimentation
  • Privacy — Every prompt you send goes to a third-party server
  • Rate limits — You hit ceilings when testing at scale
  • Internet dependency — No connection, no AI
  • Data sovereignty — In regulated industries (healthcare, finance, legal), sending data to external APIs can be a compliance problem

Ollama addresses each of these by letting you run LLMs directly on your own machine: offline once a model is downloaded, with no per-token API costs and full control over your data.


What is Ollama?

Ollama is an open-source tool that lets you download, run, and manage large language models locally. Think of it as Docker for AI models — you pull a model with a single command, and Ollama handles everything else: model weights, quantization, memory management, GPU acceleration, and serving a REST API you can call from any application.

Released in 2023 and at version 0.13.x as of 2025, Ollama has become one of the most popular local LLM runtimes in the developer community, with over 112 million model pulls for Llama 3.1 alone.

It supports macOS, Linux, and Windows, and runs on both CPU and GPU (NVIDIA, AMD, and Apple Silicon).

Ollama is built on top of llama.cpp, a highly optimized C++ inference engine for running quantized LLMs on consumer hardware. Ollama wraps this with a clean CLI and REST API, abstracting all the complexity away.


Supported Models

Ollama supports a growing library of popular open-source models:

| Model | Description |
| --- | --- |
| Llama 3.1 / 3.2 | Meta’s open models (3.1 in 8B, 70B, and 405B; 3.2 in 1B and 3B, plus vision variants) |
| Mistral / Mixtral | Fast, efficient models from Mistral AI |
| Gemma 3 | Google’s open model family |
| Qwen3 | Alibaba’s multilingual model series |
| CodeLlama | Code-optimised version of Llama |
| DeepSeek-R1 | Strong reasoning model |
| Phi-3 / Phi-4 | Microsoft’s small but capable models |
| Nomic Embed | Embedding model for semantic search / RAG |

Models are pulled from the Ollama model library — similar to how Docker Hub works for containers.


How Ollama Works

Here’s what happens under the hood when you run a model:

  1. Pull — Ollama downloads the model weights from its registry and stores them locally (~/.ollama/models)
  2. Quantization — Models are stored in a quantized format (e.g. 4-bit or 8-bit), dramatically reducing their size and memory requirements without significant quality loss
  3. Runtime — Ollama starts a local server process that loads the model into memory
  4. GPU detection — Ollama automatically detects whether you have a compatible GPU and uses VRAM for inference; falls back to CPU if not
  5. API — A REST API is exposed on http://localhost:11434 — the same interface your apps, scripts, and tools use to talk to the model

Because the server stays running, subsequent requests are fast — the model doesn’t reload from disk on every call.
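
To make step 2 concrete, here is a rough back-of-the-envelope sketch of how the bit width changes the footprint of the weights. The function name `estimate_weight_size_gb` is hypothetical, and the estimate covers weights only: it assumes every parameter is stored at the given bit width and ignores the context cache and runtime overhead, so real memory use will be higher.

```python
def estimate_weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes).

    Assumption: every parameter is stored at `bits_per_weight` bits.
    Real quantized files mix precisions, so treat this as a rough guide.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare the same 8B-parameter model at three bit widths.
    for bits in (16, 8, 4):
        size = estimate_weight_size_gb(8, bits)
        print(f"8B parameters at {bits}-bit: ~{size:.0f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the footprint to a quarter, which is roughly why an 8B model becomes practical on a laptop.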


Installation

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download


Core Commands

# Pull and run a model (downloads if not already local)
ollama run llama3.2

# Pull a model without running it
ollama pull mistral

# List all locally installed models
ollama list

# Remove a model
ollama rm mistral

# See currently running models
ollama ps

# Show model details
ollama show llama3.2

Once you run ollama run llama3.2, you get an interactive terminal chat session — like a local version of ChatGPT, entirely on your machine.


The REST API

Ollama exposes a local HTTP API at http://localhost:11434. Alongside its native endpoints under /api, it serves OpenAI-compatible endpoints under /v1, so many tools built for OpenAI can point at Ollama instead with minimal changes.

Generate a completion:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain VLANs in one paragraph.",
    "stream": false
  }'
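
The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server from the curl example is running on the default address. The helper names `build_generate_request` and `generate` are hypothetical, and the code reads the `response` field of the non-streaming reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the text above


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.load(reply)
    return body["response"]


# Usage (requires a running Ollama server with the model already pulled):
# print(generate("llama3.2", "Explain VLANs in one paragraph."))
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything touches the network.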

Chat completion (OpenAI-compatible):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is Docker?"}
    ]
  }'
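
The OpenAI-style endpoint returns its answer in a different shape from the native one: the text sits under `choices[0].message.content`. The sketch below builds the same request in Python and extracts the reply. The helper names and the `sample_completion` dictionary are hypothetical; the sample is a hand-written, trimmed stand-in, not real model output.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # OpenAI-compatible prefix used in the curl example


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion."""
    return completion["choices"][0]["message"]["content"]


# Hand-written, trimmed stand-in for a response body (not real model output).
sample_completion = {
    "choices": [
        {"message": {"role": "assistant", "content": "Docker packages apps into containers."}}
    ]
}
print(extract_reply(sample_completion))
```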

Using Ollama with Python

# Install the official library
# pip install ollama

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'What is a VLAN?'}
    ]
)

print(response['message']['content'])

For streaming responses:

import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a Python function for binary search'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
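
If you also need the complete text after streaming, for example to store it or pass it on, one option is to collect the pieces as they are printed. This is a sketch with a hypothetical helper, `collect_stream`. It assumes each chunk can be indexed as `chunk['message']['content']`, as in the loop above, and uses stand-in chunks so it runs without a model.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Print each streamed piece as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in chunks shaped like the ones in the loop above (not real model output).
fake_stream = [
    {"message": {"content": "def "}},
    {"message": {"content": "binary_search"}},
    {"message": {"content": "(...)"}},
]
full_text = collect_stream(fake_stream)
```

With a real model, you would pass the `stream` object from `ollama.chat(..., stream=True)` in place of `fake_stream`.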

Custom Models with Modelfile

Ollama lets you create custom models using a Modelfile — similar to a Dockerfile. You can set a system prompt, adjust parameters, and build a reusable custom persona:

# Modelfile
FROM llama3.2

SYSTEM """
You are a senior DevOps engineer. You answer questions concisely,
using real-world examples and command-line snippets where relevant.
Always ask for context before giving advice.
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Build and run it:

ollama create devops-assistant -f ./Modelfile
ollama run devops-assistant
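
If you maintain several personas, it can help to generate Modelfiles from code rather than by hand. The sketch below is one possible approach. `render_modelfile` is a hypothetical helper that emits only the three instructions shown above (FROM, SYSTEM, PARAMETER) and assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER lines."""
    lines = [
        f"FROM {base_model}",
        "",
        f'SYSTEM """\n{system_prompt.strip()}\n"""',
        "",
    ]
    # One PARAMETER line per setting, in the order given.
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "llama3.2",
    "You are a senior DevOps engineer. You answer questions concisely.",
    {"temperature": 0.7, "num_ctx": 4096},
)
print(modelfile_text)
```

Write the returned text to a file named Modelfile and build it with the `ollama create` command shown above.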

Real-World Use Cases

🔒 Privacy-First Chatbot

Build an internal Q&A tool for your team where no data ever leaves your infrastructure. Ideal for companies with strict data policies — legal, finance, healthcare.

🧪 Local Development & Testing

Prototype AI features during development without burning API credits. Run experiments, iterate fast, test edge cases — all offline.

📚 RAG Pipelines
#

Combine Ollama with a vector database (like ChromaDB or Weaviate) and an embedding model (like nomic-embed-text) to build a Retrieval-Augmented Generation (RAG) pipeline that runs entirely on-premises:

Your Docs → Embedding Model (Ollama) → Vector DB → Query → LLM (Ollama) → Answer
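
The retrieval half of that flow can be sketched in a few lines. Everything here is illustrative: a plain Python list stands in for the vector database, `toy_embed` is a keyword counter standing in for a real embedding model such as nomic-embed-text, and the function names are hypothetical. In a real pipeline, the embedder would call Ollama and the final prompt would be sent to the LLM.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: Sequence[str], embed: Embedder, top_k: int = 2) -> list:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query: str, context_docs: Sequence[str]) -> str:
    """Combine retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def toy_embed(text: str) -> list:
    """Stand-in embedder: counts a few keywords. Not a real embedding model."""
    words = text.lower().split()
    return [float(words.count(k)) for k in ("vlan", "docker", "python")]


docs = [
    "A VLAN splits a switch into separate networks",
    "Docker runs containers",
    "Python is a language",
]
best = retrieve("what is a vlan", docs, toy_embed, top_k=1)
print(build_prompt("what is a vlan", best))
```

A dedicated vector database replaces the linear scan in `retrieve` once the document set grows beyond what fits comfortably in memory.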

🤖 CI/CD AI Tools

Use Ollama in CI pipelines to run AI-powered code review, test generation, or commit message analysis — without external API dependencies.

🎓 Learning & Research

Experiment with different model architectures, compare outputs, and deepen your understanding of how LLMs behave, all without API costs or rate limits.


Hardware Requirements

| Model Size | Min RAM / VRAM | Recommended |
| --- | --- | --- |
| 1–3B params (e.g., Phi-3 mini) | 4 GB | Any modern laptop |
| 7–8B params (e.g., Llama 3.1 8B) | 8 GB | 8–16 GB RAM / GPU |
| 13B params | 16 GB | 16 GB+ RAM / GPU |
| 70B params | 40+ GB | High-end GPU / Apple M2 Ultra |

For most developers, a 7B or 8B model gives an excellent balance of speed, quality, and hardware requirements. On Apple Silicon (M1/M2/M3), Ollama uses the Unified Memory architecture efficiently, making Macs excellent local LLM machines.


Security Consideration

By default, Ollama’s API is bound to localhost only — so external machines can’t reach it. However, if you expose it on 0.0.0.0 (for Docker or remote access), the API has no built-in authentication. In that case, always place it behind a reverse proxy (like Nginx) with auth, or restrict access via firewall rules.
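
A deployment script can guard against accidental exposure by checking the bind address before starting the server. The sketch below is one way to do that, assuming the address is given as `host` or `host:port` (for instance, the value you would set in the `OLLAMA_HOST` environment variable). `is_loopback_only` is a hypothetical helper and does not handle IPv6 literals.

```python
import ipaddress


def is_loopback_only(bind_address: str) -> bool:
    """Return True if a 'host' or 'host:port' bind address is loopback-only.

    Handles 'localhost' and IPv4 literals; IPv6 literals are out of scope.
    """
    # Drop an optional scheme, then an optional port.
    host = bind_address.split("://")[-1].split(":")[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unknown hostname: treat it as potentially reachable from outside.
        return False


for address in ("127.0.0.1:11434", "0.0.0.0:11434"):
    status = "local only" if is_loopback_only(address) else "exposed, add auth or a firewall"
    print(f"{address}: {status}")
```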


Ollama vs Alternatives
#

| Tool | Best For |
| --- | --- |
| Ollama | Simplest CLI + API setup, Docker-like workflow |
| LM Studio | GUI-based, great for non-developers |
| llama.cpp | Maximum control, lowest-level access |
| Jan | Desktop app with conversation history |
| vLLM | High-throughput production serving at scale |

For developers who want a quick, clean local LLM setup that integrates easily into code — Ollama is the go-to in 2026.


Quick Summary

| Concept | One-liner |
| --- | --- |
| Ollama | Open-source local LLM runtime (Docker for AI) |
| Modelfile | Config file to create custom model personas |
| REST API | Served at localhost:11434, OpenAI-compatible |
| Quantization | Shrinks model size for consumer hardware |
| llama.cpp | The inference engine Ollama runs on top of |
| Use cases | Privacy-first AI, RAG, local dev, research, CI tools |

Get Started

# 1. Install
curl -fsSL https://ollama.com/install.sh | sh

# 2. Run your first model
ollama run llama3.2

# 3. Say hello
>>> Hello! What can you do?

That’s it. Local AI in three steps.


Co-authored by Vishwakarma, Deeps 2nd Brain

Author: Deep Jiwan
Building hacky solutions that save time and make my life easier. Not too sure about yours :)
