What is Ollama — Run LLMs Locally on Your Machine · Deep Jiwan


The Problem: LLMs in the Cloud Have a Cost

Using large language models through cloud APIs — OpenAI, Anthropic, Mistral, and others — is powerful and convenient. But it comes with real trade-offs:

Cost — API tokens add up fast, especially in development and experimentation
Privacy — Every prompt you send goes to a third-party server
Rate limits — You hit ceilings when testing at scale
Internet dependency — No connection, no AI
Data sovereignty — In regulated industries (healthcare, finance, legal), sending data to external APIs can be a compliance problem

Ollama addresses these trade-offs by letting you run LLMs directly on your own machine: offline once a model has been downloaded, with no per-token API costs and full control over your data.

What is Ollama?

Ollama is an open-source tool that lets you download, run, and manage large language models locally. Think of it as Docker for AI models — you pull a model with a single command, and Ollama handles everything else: model weights, quantization, memory management, GPU acceleration, and serving a REST API you can call from any application.

Released in 2023 and at version 0.13.x as of 2025, Ollama has become one of the most popular local LLM runtimes in the developer community, with over 112 million model pulls for Llama 3.1 alone.

It supports macOS, Linux, and Windows, and runs on both CPU and GPU (NVIDIA, AMD, and Apple Silicon).

Ollama is built on top of llama.cpp, a highly optimized C++ inference engine for running quantized LLMs on consumer hardware. Ollama wraps this with a clean CLI and REST API, abstracting all the complexity away.

Supported Models

Ollama supports a growing library of popular open-source models:

Model	Description
Llama 3.1 / 3.2	Meta’s open model families (3.1: 8B, 70B, 405B; 3.2: 1B and 3B text, 11B and 90B vision)
Mistral / Mixtral	Fast, efficient models from Mistral AI
Gemma 3	Google’s open model family
Qwen3	Alibaba’s multilingual model series
CodeLlama	Code-optimised version of Llama
DeepSeek-R1	Strong reasoning model
Phi-3 / Phi-4	Microsoft’s small but capable models
Nomic Embed	Embedding model for semantic search / RAG

Models are pulled from the Ollama model library — similar to how Docker Hub works for containers.

How Ollama Works

Here’s what happens under the hood when you run a model:

Pull — Ollama downloads the model weights from its registry and stores them locally (~/.ollama/models)
Quantization — Models are stored in a quantized format (e.g. 4-bit or 8-bit), dramatically reducing their size and memory requirements without significant quality loss
Runtime — Ollama starts a local server process that loads the model into memory
GPU detection — Ollama automatically detects whether you have a compatible GPU and uses VRAM for inference; falls back to CPU if not
API — A REST API is exposed on http://localhost:11434 — the same interface your apps, scripts, and tools use to talk to the model

Because the server stays running and keeps a model in memory between requests (by default for about five minutes after the last one), subsequent requests are fast: the model doesn’t reload from disk on every call.
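One way to observe the server from the outside is to ask it which models are installed and which are currently loaded. The following is a minimal sketch using only the Python standard library. It assumes the default address listed above and Ollama's /api/tags and /api/ps endpoints; the helper names (fetch_json, model_names, report) are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the list above


def fetch_json(path: str, base_url: str = OLLAMA_URL) -> dict:
    """GET a JSON document from the local Ollama server."""
    with urllib.request.urlopen(base_url + path, timeout=5) as resp:
        return json.load(resp)


def model_names(payload: dict) -> list:
    """Pull the model names out of an /api/tags or /api/ps style response."""
    return [model["name"] for model in payload.get("models", [])]


def report() -> None:
    """Print installed and currently loaded models (needs a running server)."""
    print("installed:", model_names(fetch_json("/api/tags")))
    print("loaded:   ", model_names(fetch_json("/api/ps")))
```

With the server running, report() prints both lists; a model that has been unloaded from memory appears only in the first.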

Installation

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download

Core Commands

# Pull and run a model (downloads if not already local)
ollama run llama3.2

# Pull a model without running it
ollama pull mistral

# List all locally installed models
ollama list

# Remove a model
ollama rm mistral

# See currently running models
ollama ps

# Show model details
ollama show llama3.2

Running ollama run llama3.2 drops you into an interactive terminal chat session, like a local version of ChatGPT running entirely on your machine.

The REST API

Ollama exposes a local HTTP API at http://localhost:11434. Alongside its native endpoints under /api, it serves OpenAI-compatible endpoints under /v1, so many tools built for OpenAI can point at Ollama instead with minimal changes.

Generate a completion:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain VLANs in one paragraph.",
    "stream": false
  }'

Chat completion (OpenAI-compatible):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is Docker?"}
    ]
  }'
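The same calls can be made from code without a third-party client. The sketch below uses Python's standard library. The names build_generate_request, generate, and chat_text are illustrative, and the response fields it reads ("response" for /api/generate, choices[0].message.content for the OpenAI-style endpoint) are assumptions worth checking against the API reference for your Ollama version.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request that the first curl command above sends."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        BASE_URL + "/api/generate",
        data=json.dumps(body).encode("utf-8"),  # supplying a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.load(resp)["response"]


def chat_text(payload: dict) -> str:
    """Extract the reply from an OpenAI-style /v1/chat/completions response."""
    return payload["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a model loaded.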

Using Ollama with Python

# Install the official library
# pip install ollama

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'What is a VLAN?'}
    ]
)

print(response['message']['content'])

For streaming responses:

import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a Python function for binary search'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
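Underneath the library call, streaming on the native /api/chat endpoint arrives as newline-delimited JSON: one object per chunk, with a final object marked "done". A rough sketch of consuming that stream by hand follows. iter_chat_text and stream_chat are hypothetical helper names, and the field layout is an assumption based on the native chat API.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed JSON line."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        event = json.loads(raw)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break  # the final object closes the stream


def stream_chat(model: str, prompt: str,
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """POST to /api/chat with streaming enabled (needs a running server)."""
    body = {"model": model, "stream": True,
            "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_chat_text(resp)  # the response iterates line by line
```

A caller would loop over stream_chat('llama3.2', '...') and print each piece as it arrives, much like the library version above.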

Custom Models with Modelfile

Ollama lets you create custom models using a Modelfile — similar to a Dockerfile. You can set a system prompt, adjust parameters, and build a reusable custom persona:

# Modelfile
FROM llama3.2

SYSTEM """
You are a senior DevOps engineer. You answer questions concisely,
using real-world examples and command-line snippets where relevant.
Always ask for context before giving advice.
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Build and run it:

ollama create devops-assistant -f ./Modelfile
ollama run devops-assistant
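If you maintain several personas, the Modelfile can be generated rather than written by hand. The sketch below renders the same three directives used above (FROM, SYSTEM, PARAMETER) from Python values. render_modelfile and create_model are illustrative names, and create_model simply shells out to the ollama create command shown above.

```python
import subprocess
from pathlib import Path


def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a Modelfile; assumes the system prompt contains no triple quotes."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str, path: str = "Modelfile") -> None:
    """Write the Modelfile and run `ollama create` (needs Ollama installed)."""
    Path(path).write_text(modelfile_text, encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", path], check=True)


text = render_modelfile(
    "llama3.2",
    "You are a senior DevOps engineer. Answer concisely.",
    {"temperature": 0.7, "num_ctx": 4096},
)
print(text)
```

Calling create_model("devops-assistant", text) would then build the model in the same way as the two commands above.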

Real-World Use Cases

🔒 Privacy-First Chatbot

Build an internal Q&A tool for your team where no data ever leaves your infrastructure. Ideal for companies with strict data policies — legal, finance, healthcare.

🧪 Local Development & Testing

Prototype AI features during development without burning API credits. Run experiments, iterate fast, test edge cases — all offline.

📚 RAG Pipelines

Combine Ollama with a vector database (like ChromaDB or Weaviate) and an embedding model (like nomic-embed-text) to build a Retrieval Augmented Generation (RAG) pipeline entirely on-premise:

Your Docs → Embedding Model (Ollama) → Vector DB → Query → LLM (Ollama) → Answer
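The retrieval half of that pipeline can be sketched in a few lines. To stay self-contained, this version swaps the vector database for a plain in-memory list ranked by cosine similarity, so it only suits a handful of documents. The function names are illustrative, and ollama_embed assumes the /api/embed endpoint accepts an "input" list and returns an "embeddings" list.

```python
import json
import math
import urllib.request


def ollama_embed(texts, model="nomic-embed-text",
                 base_url="http://localhost:11434"):
    """Embed a list of strings via Ollama (assumed /api/embed endpoint)."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/embed", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, docs, embed=ollama_embed, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    vectors = embed(list(docs) + [query])
    doc_vecs, query_vec = vectors[:-1], vectors[-1]
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question, context_docs):
    """Assemble the prompt that would be sent to the chat model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The string returned by build_prompt would go to the LLM as the final step; the embed parameter is injectable so the ranking logic can be tried without a server.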

🤖 CI/CD AI Tools

Use Ollama in CI pipelines to run AI-powered code review, test generation, or commit message analysis — without external API dependencies.

🎓 Learning & Research

Experiment with different model architectures, compare outputs, and sharpen your understanding of how LLMs behave, all without API costs or rate limits.

Hardware Requirements

Model Size	Min RAM / VRAM	Recommended
1–3B params (e.g., Llama 3.2 3B)	4 GB	Any modern laptop
7–8B params (e.g., Llama 3.1 8B)	8 GB	8–16 GB RAM / GPU
13B params	16 GB	16 GB+ RAM / GPU
70B params	40+ GB	High-end GPU / Apple M2 Ultra

For most developers, a 7B or 8B model gives an excellent balance of speed, quality, and hardware requirements. On Apple Silicon (M1/M2/M3), Ollama uses the Unified Memory architecture efficiently, making Macs excellent local LLM machines.
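A back-of-the-envelope way to sanity-check the table is to multiply parameter count by bits per weight. The sketch below does only that; estimate_weight_gb is an illustrative helper, and it covers the weights alone. Real usage runs higher because of the context cache and runtime overhead, and real quantization formats spend slightly more than their nominal bit width.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for label, params in [("3B", 3), ("8B", 8), ("13B", 13), ("70B", 70)]:
    at_4bit = estimate_weight_gb(params, 4)
    at_16bit = estimate_weight_gb(params, 16)
    print(f"{label}: ~{at_4bit:.1f} GB at 4-bit, ~{at_16bit:.1f} GB at 16-bit")
```

At 4 bits an 8B model needs roughly 4 GB for its weights and a 70B model roughly 35 GB, which lines up with the 8 GB and 40+ GB rows once overhead is added.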

Security Consideration

By default, Ollama’s API is bound to localhost only — so external machines can’t reach it. However, if you expose it on 0.0.0.0 (for Docker or remote access), the API has no built-in authentication. In that case, always place it behind a reverse proxy (like Nginx) with auth, or restrict access via firewall rules.
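A small pre-flight check can flag an exposed bind address before the server starts. This sketch assumes the address is configured through the OLLAMA_HOST environment variable; binds_beyond_loopback is a hypothetical helper, and it errs on the side of warning when it cannot parse the value.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def binds_beyond_loopback(host_value: str) -> bool:
    """Return True if the configured bind address may be reachable remotely."""
    value = host_value.strip()
    if not value:
        return False  # unset: the default is localhost-only, per the text above
    if "://" not in value:
        value = "http://" + value  # give urlsplit a scheme so it finds the host
    hostname = urlsplit(value).hostname or ""
    if hostname == "localhost":
        return False
    try:
        return not ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        return True  # not an IP literal: assume it may be reachable


if binds_beyond_loopback(os.environ.get("OLLAMA_HOST", "")):
    print("Warning: Ollama is configured to listen beyond localhost; "
          "put it behind an authenticating proxy or a firewall.")
```

Running this in a container entrypoint or deployment script gives an early reminder that the API itself will not ask for credentials.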

Ollama vs Alternatives

Tool	Best For
Ollama	Simplest CLI + API setup, Docker-like workflow
LM Studio	GUI-based, great for non-developers
llama.cpp	Maximum control, lowest-level access
Jan	Desktop app with conversation history
vLLM	High-throughput production serving at scale

For developers who want a quick, clean local LLM setup that integrates easily into code, Ollama is the go-to in 2026.

Quick Summary

Concept	One-liner
Ollama	Open-source local LLM runtime — Docker for AI
Modelfile	Config file to create custom model personas
REST API	Served at localhost:11434; OpenAI-compatible endpoints under /v1
Quantization	Shrinks model size for consumer hardware
llama.cpp	The inference engine Ollama runs on top of
Use cases	Privacy-first AI, RAG, local dev, research, CI tools

Get Started

# 1. Install
curl -fsSL https://ollama.com/install.sh | sh

# 2. Run your first model
ollama run llama3.2

# 3. Say hello
>>> Hello! What can you do?

That’s it. Local AI in three commands.

Official site: ollama.com
GitHub: github.com/ollama/ollama
Model library: ollama.com/library

Co-authored by Vishwakarma, Deeps 2nd Brain

Author

Deep Jiwan

Building hacky solutions that save time and make my life easier. Not too sure about yours :)
