
Run AI Locally for Free — No API Keys, No Cloud, No Limits
Using Docker Model Runner (DMR) | Part 1: Setup & Core Concepts
By Mayur · B.Tech CSE, 6th Semester
Series: Local AI with Docker Model Runner
I'm a CSE student. Like most of us, I wanted to experiment with AI — build things with LLMs, try different models, maybe even integrate one into a project.
Then I saw the OpenAI pricing page.
Yeah.
Here's the thing though — you don't need to pay a rupee to run a capable AI model. You don't need to send your data to anyone's server. You don't need Ollama, you don't need LM Studio. If you already have Docker installed, you're 90% there.
This series is about Docker Model Runner (DMR) — Docker's built-in way to pull, run, and talk to LLMs locally. We'll go from zero to building a real terminal-based AI chat app in TypeScript. By the end, you'll have something genuinely useful that you built yourself and can show on your portfolio.
Let's get into it.
What Even Is Docker Model Runner?
DMR is a plugin for Docker that lets you manage and run AI models the same way you manage containers. Pull a model like you'd pull an image, run it like you'd run a container, and interact with it through a local REST API that's compatible with OpenAI's format.
That last part matters a lot. It means any code you write against OpenAI's SDK works with your local model too — just swap the base URL. When you eventually want to move to the cloud, the change is literally three lines.
The models come from Docker Hub (under the ai/ namespace) or Hugging Face. They get cached locally after the first pull, so you only wait once.
Why Not Just Use Ollama?
Fair question. Ollama is great. But DMR has a specific advantage: it's built into Docker. If you're already a Docker user (and as a developer, you should be), there's nothing new to install or learn. It fits naturally into your existing workflow, your Compose files, your CI pipelines.
Also — and this is the part that got me — the API is OpenAI-compatible out of the box. No adapter, no wrapper. Just point your existing OpenAI SDK at localhost:12434 and it works.
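To make that concrete, here's a minimal sketch in TypeScript using Node's built-in fetch — no SDK needed yet. The /engines/v1 base path is an assumption based on a typical DMR install; verify it against your own setup before relying on it:

```typescript
// Hypothetical sketch: calling a local DMR model over its OpenAI-compatible API.
// Assumes DMR's OpenAI-style base path is /engines/v1 — check your install.
const baseURL = "http://localhost:12434/engines/v1";

// The same request body you'd send to OpenAI — only the URL differs.
const payload = {
  model: "ai/llama3.2:3B-Q4_0",
  messages: [{ role: "user", content: "Explain Docker in one sentence." }],
};

async function ask(): Promise<string> {
  const res = await fetch(`${baseURL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// With DMR running and the model pulled, try: ask().then(console.log);
```

Notice there's no API key anywhere — the "swap" to the cloud later is just changing baseURL and adding a key.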
Setup (Docker Engine on Linux — RPM-based)
I'm on Fedora/RHEL-based Linux. Here's the exact setup:
Step 1: Make sure Docker Engine is installed.
docker --version
If it's not installed, handle that first. DMR is a plugin for Docker Engine, not a standalone tool.
Step 2: Install the DMR plugin. On RPM-based distros (Fedora, RHEL, and friends):
sudo dnf update
sudo dnf install docker-model-plugin
Step 3: Verify the installation.
docker model version
If you see a version number, you're good. If Docker says 'model' is not a docker command, the CLI can't find the plugin binary. On Linux, the package installs it under /usr/libexec/docker/cli-plugins/ — symlink it into your user plugin directory:
mkdir -p ~/.docker/cli-plugins
ln -s /usr/libexec/docker/cli-plugins/docker-model ~/.docker/cli-plugins/docker-model
Then try again.
Step 4: Updating DMR later.
docker model uninstall-runner --images && docker model install-runner
Note: this preserves your local models. If you want to wipe models too, add --models to the uninstall command.
Your First Model
Let's pull something small and fast. I use llama3.2:3B-Q4_0 — it's 1.78 GiB on disk and runs well on a laptop with 4 GB of VRAM.
docker model pull ai/llama3.2:3B-Q4_0
First time takes a while. After that, it's cached.
To see what you have locally:
docker model ls
You'll see something like:
MODEL NAME          PARAMETERS   QUANTIZATION   ARCHITECTURE   SIZE
llama3.2:3B-Q4_0    3.21 B       Q4_0           llama          1.78 GiB
To quickly test if it works:
docker model run ai/llama3.2:3B-Q4_0 "Explain Docker in one sentence."
Key Commands You'll Use Daily
docker model ls # list all local models
docker model ps # see which models are currently loaded in memory
docker model pull <name> # download a model
docker model run <name> # run and chat interactively
docker model logs # see what's happening under the hood
docker model rm <name> # delete a model
docker model ls vs docker model ps trips people up at first.
ls → your model library (everything downloaded to disk)
ps → what's actually running right now (loaded into RAM/VRAM)
A model only loads into memory when something actively talks to it. When you stop using it, it unloads. This is smart resource management — your laptop doesn't groan under the weight of an idle AI.
Before We Go Further — Some GenAI Vocabulary
You'll see these terms everywhere. Let me explain them once so they never confuse you again.
Parameters — The Brain Size
A model's parameter count is basically how many "learned values" it has inside. More parameters = more capable, but also heavier and slower.
SmolLM2 → 361 million parameters — small, fast, good for testing
Llama 3.2 → 3.2 billion parameters — solid quality, fits on a laptop
GPT-3 → 175 billion parameters — massive, needs serious hardware
When you see 3B in a model name, that's 3 billion parameters.
Quantization — Smart Compression
A 3B parameter model at full precision takes about 12 GB of VRAM. Most laptops have 4–6 GB. So we compress.
Quantization reduces the precision of each number in the model — from 32-bit floats down to 4-bit integers, for example. Quality drops slightly, but the model becomes dramatically smaller.
F32 (original) → ~12 GB — perfect quality
Q8 → ~6 GB — barely any quality loss
Q4_0 → ~2 GB — your sweet spot on a laptop ✅
Q2 → ~1 GB — noticeable quality drop
Your model llama3.2:3B-Q4_0 breaks down as:
llama3.2 → the model family
3B → 3 billion parameters
Q4_0 → compressed to 4-bit (method 0)
Think of it like JPEG compression for AI — you lose a little, but what you gain in size is worth it.
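The arithmetic behind those sizes is worth seeing once. A parameter stored at b bits costs b/8 bytes, so a back-of-the-envelope estimate (real files run a little larger because of metadata and layers kept at higher precision) looks like this:

```typescript
// Rough model-size estimate: parameters × bits per weight, converted to GiB.
// Real files are somewhat bigger — metadata, plus some layers at higher precision.
function estimateGiB(params: number, bitsPerWeight: number): number {
  const bytes = (params * bitsPerWeight) / 8;
  return bytes / 1024 ** 3;
}

console.log(estimateGiB(3.21e9, 32).toFixed(1)); // "12.0" — full precision (F32)
console.log(estimateGiB(3.21e9, 4).toFixed(1));  // "1.5" — the Q4 ballpark
```

That 4-bit estimate of ~1.5 GiB lines up with the 1.78 GiB the model actually occupies — the gap is exactly that overhead.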
Architecture — The Blueprint
Same parameter count, different internal design. llama (Meta's design), mistral, gemma (Google), qwen (Alibaba) are all different architectures. For most purposes you just need to know which one you're using for compatibility — DMR handles the rest.
Context Window — Working Memory
How much text the model can "hold in its head" at once, measured in tokens (roughly 0.75 words each).
2,048 tokens → about 1,500 words — short conversations
4,096 tokens → about 3,000 words — default in DMR
8,192 tokens → about 6,000 words — longer code files, documents
128K tokens → entire books — newer large models
DMR defaults to 4,096 tokens for llama.cpp. You can change it:
docker model configure --context-size 8192 ai/llama3.2:3B-Q4_0
Just keep your VRAM in mind — larger context = more memory needed.
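The word counts in that list all come from the same rule of thumb (1 token ≈ 0.75 English words), which is trivial to encode:

```typescript
// Rule-of-thumb conversion: one token ≈ 0.75 English words.
// A rough planning aid, not an exact tokenizer.
function tokensToWords(tokens: number): number {
  return Math.round(tokens * 0.75);
}

console.log(tokensToWords(4096)); // 3072 — "about 3,000 words"
console.log(tokensToWords(8192)); // 6144 — "about 6,000 words"
```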
Temperature — The Creativity Dial
Controls how "random" the model's responses are.
0.0 → robotic, deterministic — same answer every time
0.7 → balanced, natural — good for general chat
1.2 → creative, unpredictable — good for writing
2.0 → chaos — genuinely random output
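Temperature isn't a separate feature to learn — in the OpenAI-style request format DMR mirrors, it's just one optional field on the request body. A hypothetical sketch of how the same request flips between modes:

```typescript
// Temperature rides along in the request body — one number, big behavioral change.
// Shape sketched from the OpenAI chat-completions format that DMR mirrors.
type ChatRequest = {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  temperature?: number; // 0.0 = deterministic, ~0.7 = balanced, 2.0 = chaos
};

const factual: ChatRequest = {
  model: "ai/llama3.2:3B-Q4_0",
  messages: [{ role: "user", content: "What year was Docker first released?" }],
  temperature: 0.0, // same answer every time
};

// Same request, looser sampling — better for creative writing.
const creative: ChatRequest = { ...factual, temperature: 1.2 };
```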
How DMR Works Under the Hood
When you pull a model, it saves to disk. When something talks to it — either your terminal or an API call — DMR loads it into memory. When you're done, it unloads.
The moment DMR is running, it exposes a local REST API at:
http://localhost:12434
This is the door to everything. In Part 2, we'll use this to build real TypeScript code that talks to your local model — no OpenAI account needed, no API key, no billing dashboard.
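As a first sanity check that the door is actually open, you can hit the API directly. This sketch assumes an OpenAI-style GET /engines/v1/models endpoint exists on your install — confirm the exact path in your DMR version's docs:

```typescript
// Quick liveness check for the local API.
// Assumes an OpenAI-style models-list endpoint — verify the path on your install.
const modelsURL = "http://localhost:12434/engines/v1/models";

async function listLocalModels(): Promise<string[]> {
  const res = await fetch(modelsURL);
  const data = await res.json();
  // OpenAI-format list responses put entries under `data`, each with an `id`.
  return data.data.map((m: { id: string }) => m.id);
}

// With DMR running: listLocalModels().then(console.log);
```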
What's Next
In Part 2, we dig into the DMR REST API — what endpoints exist, how to call them from TypeScript using the OpenAI SDK (pointed at your local model), how streaming works, and how to build multi-turn conversations with memory.
In Part 3, we build the actual CLI tool — a terminal app where you type messages and get streaming AI responses back, with model auto-detection, preset modes for code vs chat vs creative writing, and a clean TUI interface.
If you're following along, pull the model now so it's ready:
docker model pull ai/llama3.2:3B-Q4_0
Part 2 link: https://thissidemayur.me/blogs/local-ai-dmr-rest-api-typescript
Tags: docker llm ai typescript local-ai developer-tools open-source