
Talking to Your Local AI Through Code — DMR REST API + TypeScript
Using Docker Model Runner (DMR) | Part 2: REST API + TypeScript
By Mayur · B.Tech CSE, 6th Semester
Series: Local AI with Docker Model Runner
In Part 1, we set up Docker Model Runner and pulled our first model. We can run it from the terminal. That's cool — but it's not useful for building anything.
In this part, we go from "AI in the terminal" to "AI in my code." We'll use TypeScript and the OpenAI SDK to talk to our local model programmatically. No API key. No cloud. No billing.
By the end of this post you'll have three working scripts: a single-question chat, a streaming response (words appearing live like ChatGPT), and a multi-turn conversation with memory.
The Big Idea — DMR as a Local Server
The moment Docker Model Runner is running, it starts a local HTTP server on your machine:
http://localhost:12434
This server speaks the OpenAI API format. Every endpoint, every response shape — it's all designed to be a drop-in replacement. The only differences are:
- The base URL is `localhost:12434` instead of `api.openai.com`
- No API key is needed (DMR ignores the `Authorization` header entirely)
- You use your local model name instead of `gpt-4`
That's it. If you've ever written code against OpenAI, you already know how to use DMR.
The URL Map
DMR exposes several API formats from the same port:
localhost:12434/engines/v1/... → OpenAI-compatible (what we'll use)
localhost:12434/anthropic/v1/... → Anthropic-compatible
localhost:12434/api/... → Ollama-compatible
localhost:12434/models/... → DMR native (model management)
For everything in this post, we use the OpenAI-compatible path. The full URL for chat is:
http://localhost:12434/engines/v1/chat/completions
One thing that confuses people: the OpenAI SDK automatically appends /chat/completions to whatever baseURL you give it. So you set baseURL to http://localhost:12434/engines/v1 and let the SDK build the rest.
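To see how thin the compatibility layer is, here's that same endpoint hit with plain `fetch` — no SDK, no API key. This is a minimal sketch assuming DMR's default port; the request and response shapes follow the standard OpenAI chat-completions format.

```typescript
// The same request the OpenAI SDK would send, hand-rolled with fetch.
// Note what's missing: no API key, no Authorization header.
const DMR_CHAT_URL = "http://localhost:12434/engines/v1/chat/completions";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildChatBody(model: string, messages: ChatMessage[]) {
  return { model, messages };
}

async function ask(model: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch(DMR_CHAT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // no auth needed
    body: JSON.stringify(buildChatBody(model, messages)),
  });
  const data = await res.json();
  return data.choices[0].message.content; // standard OpenAI response shape
}
```

The SDK is still nicer to work with (types, streaming helpers), but it's worth knowing there's nothing magic underneath.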
A Note on Two OpenAI APIs
If you've looked at recent OpenAI docs, you've probably seen client.responses.create() — that's their newer Responses API. DMR doesn't fully support it. Use the classic client.chat.completions.create() — this is what DMR is built for, and what every local LLM tool (Ollama, LM Studio, etc.) speaks.
Quick translation table for when you see new OpenAI docs:
| OpenAI Responses API (new) | DMR equivalent |
|---|---|
| client.responses.create() | client.chat.completions.create() |
| response.output_text | response.choices[0].message.content |
| previous_response_id | your own messages[] array |
| input: "text" | messages: [{role, content}] |
Setup
mkdir dmr-playground
cd dmr-playground
bun init -y
bun add openai
Create a .env file:
LOCAL_LLM_URL=http://localhost:12434/engines/v1
LOCAL_LLM_MODEL=ai/llama3.2:3B-Q4_0
Important: The `/engines/v1` path at the end is required. If you set the base URL to just `localhost:12434`, the SDK appends `/chat/completions` directly and gets a 404.
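Bun loads `.env` automatically, so both values will be on `process.env`. A small guard (my own addition, not part of DMR or the SDK) fails fast with a clear message instead of a confusing SDK error later:

```typescript
// Fail fast if a required .env value is missing, instead of getting a
// cryptic "Invalid URL" error from the OpenAI SDK at request time.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing ${name} — add it to your .env file`);
  return value;
}

// Usage in the scripts below:
// const baseURL = requireEnv("LOCAL_LLM_URL");
// const model = requireEnv("LOCAL_LLM_MODEL");
```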
Script 1 — Single Question, Single Answer
src/chat.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed", // DMR ignores this, but the SDK requires it
});
const response = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
messages: [
{ role: "system", content: "You are a helpful assistant. Keep answers short." },
{ role: "user", content: "What is Docker in one sentence?" },
],
});
console.log("AI Response:", response.choices[0].message.content);
console.log("Tokens used:", response.usage?.total_tokens);
bun src/chat.ts
You should see a short answer about Docker and a token count. That token count is useful — it tells you how much of your context window you're using.
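Since every model has a fixed context window, a tiny helper can turn that total into a percentage. This is my own sketch — the 8192 default is an assumed context size for illustration (see the `docker model configure` command later in this post):

```typescript
// How much of the context window has this request used?
// contextSize should match your model's configured context; 8192 here
// is just an assumed default for illustration.
function contextUsedPercent(totalTokens: number, contextSize = 8192): number {
  return Math.round((totalTokens / contextSize) * 100);
}

// e.g. contextUsedPercent(response.usage?.total_tokens ?? 0)
```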
Script 2 — Streaming
This is what makes it feel like ChatGPT — words appearing one by one instead of waiting for the full response.
src/stream.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed",
});
process.stdout.write("🤖: ");
const stream = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
stream: true, // one flag, that's the whole change
messages: [
{ role: "user", content: "Tell me 3 fun facts about space." },
],
});
for await (const chunk of stream) {
const word = chunk.choices[0]?.delta?.content ?? "";
process.stdout.write(word); // no newline — prints inline as words arrive
}
console.log("\n✅ Done!");
Note: I'm using process.stdout.write() instead of console.log() here. The difference matters:
- `console.log("hello")` → prints `hello`, then jumps to a new line
- `process.stdout.write("hello")` → prints `hello` and stays on the same line
For streaming, you want each word to appear inline. console.log would put every word on its own line, which looks terrible.
Script 3 — Multi-turn Conversation (with Memory)
This is where most people hit a wall. The model has no memory between requests. Every single API call starts fresh. If you want a conversation that remembers what was said, you have to send the full history every time.
Think of it this way: you're not calling a stateful chatbot, you're calling a stateless function. You give it the entire conversation, it replies, you append that reply, and next time you give it everything again.
src/conversation.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed",
});
// This array IS the memory — you manage it yourself
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
{
role: "system",
content: "You are a helpful assistant. Keep answers short.",
},
];
async function chat(userMsg: string): Promise<void> {
// 1. Add user's message to history
messages.push({ role: "user", content: userMsg });
process.stdout.write("\n: ");
// 2. Send the full history to the model
const stream = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
messages: messages, // entire conversation, every time
stream: true,
});
// 3. Collect the streamed reply while printing it live
let fullReply = "";
for await (const chunk of stream) {
const word = chunk.choices[0]?.delta?.content ?? "";
process.stdout.write(word);
fullReply += word;
}
console.log();
// 4. Add the assistant's reply to history
messages.push({ role: "assistant", content: fullReply });
}
await chat("My name is Mayur. I'm learning Docker.");
await chat("What's my name?");
The second message works because we send [system, "My name is Mayur...", "Nice to meet you...", "What's my name?"] — the full history. The model can see everything that was said.
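One consequence of sending the full history every time: the `messages` array grows without bound and will eventually overflow the context window. A common fix — sketched here with an arbitrary turn limit, not a DMR requirement — is to keep the system prompt plus only the most recent turns:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt plus the last `maxTurns` user/assistant exchanges.
// maxTurns = 10 is an arbitrary choice — tune it to your context size.
function trimHistory(messages: Msg[], maxTurns = 10): Msg[] {
  const [system, ...rest] = messages;
  return [system, ...rest.slice(-maxTurns * 2)]; // 2 messages per turn
}
```

Call it right before each `client.chat.completions.create()` and the conversation can run indefinitely, at the cost of the model forgetting its oldest turns.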
DMR Native Endpoints — Model Management in Code
Beyond chat, DMR exposes endpoints for managing models themselves. These don't need the OpenAI SDK — just plain fetch.
const DMR_BASE = "http://localhost:12434";
// List all local models
async function listModels() {
const res = await fetch(`${DMR_BASE}/models`);
const data = await res.json();
return data.models;
}
// Check if a model exists
async function modelExists(modelName: string): Promise<boolean> {
const [namespace, name] = modelName.split("/");
const res = await fetch(`${DMR_BASE}/models/${namespace}/${name}`);
return res.ok;
}
// Pull a model programmatically
async function pullModel(modelName: string): Promise<void> {
const res = await fetch(`${DMR_BASE}/models/create`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ from: modelName }),
});
if (!res.ok) throw new Error(`Failed to pull ${modelName}: ${res.status}`);
}
const models = await listModels();
console.log("Local models:", models);
const exists = await modelExists("ai/llama3.2:3B-Q4_0");
console.log("Model exists:", exists);
This becomes very useful in a CLI app — check if the requested model exists, pull it automatically if not, then start chatting. No manual docker model pull required.
Configuring Model Behaviour
You can tune how the model responds per-request through API parameters:
const response = await client.chat.completions.create({
model: "ai/llama3.2:3B-Q4_0",
messages: messages,
temperature: 0.7, // creativity: 0 = robotic, 2 = chaotic
top_p: 0.9, // probability cutoff for token selection
max_tokens: 1024, // max length of the response
presence_penalty: 0.1, // discourages repeating topics
frequency_penalty: 0.1, // discourages repeating specific words
});
For different use cases, I use different presets:
// For code generation — deterministic, precise
const codePreset = { temperature: 0.1, top_p: 0.95, max_tokens: 2048 };
// For general chat — natural, balanced
const chatPreset = { temperature: 0.7, top_p: 0.9, max_tokens: 1024 };
// For creative writing — unpredictable, expressive
const creativePreset = { temperature: 1.2, top_p: 0.95, max_tokens: 2048 };
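Wiring a preset into a request is just an object spread. A minimal sketch, assuming a mode string chosen by the caller (this becomes the `--mode` flag in the upcoming CLI):

```typescript
type Preset = { temperature: number; top_p: number; max_tokens: number };

const presets: Record<"code" | "chat" | "creative", Preset> = {
  code: { temperature: 0.1, top_p: 0.95, max_tokens: 2048 },
  chat: { temperature: 0.7, top_p: 0.9, max_tokens: 1024 },
  creative: { temperature: 1.2, top_p: 0.95, max_tokens: 2048 },
};

// Merge the chosen preset into the base request options.
function withPreset(mode: keyof typeof presets, base: Record<string, unknown>) {
  return { ...base, ...presets[mode] };
}

// const response = await client.chat.completions.create(
//   withPreset("code", { model, messages }),
// );
```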
You can also configure defaults at the model level (persists across sessions):
docker model configure --context-size 8192 ai/llama3.2:3B-Q4_0
To reset to defaults:
docker model configure --context-size -1 ai/llama3.2:3B-Q4_0
What's Next — The CLI
In Part 3, we take everything from this post and build a proper CLI tool:
llm "what is a closure in JavaScript?" # quick question
llm --chat # interactive session
llm --mode code "write a binary search" # mode switching
llm --models # list available models
It'll have streaming output, conversation memory, auto model detection, preset modes, and a clean terminal interface. Something you can actually use daily and show in your portfolio.
Tags: docker typescript llm rest-api local-ai openai bun developer-tools