
Talking to Your Local AI Through Code — DMR REST API + TypeScript
Using Docker Model Runner (DMR) | Part 2: REST API + TypeScript
By Mayur · B.Tech CSE, 6th Semester
Series: Local AI with Docker Model Runner
In Part 1, we set up Docker Model Runner and pulled our first model. We can run it from the terminal. That's cool — but it's not useful for building anything.
In this part, we go from "AI in the terminal" to "AI in my code." We'll use TypeScript and the OpenAI SDK to talk to our local model programmatically. No API key. No cloud. No billing.
By the end of this post you'll have three working scripts: a single-question chat, a streaming response (words appearing live like ChatGPT), and a multi-turn conversation with memory.
The Big Idea — DMR as a Local Server
The moment Docker Model Runner is running, it starts a local HTTP server on your machine:
http://localhost:12434
This server speaks the OpenAI API format. Every endpoint, every response shape — it's all designed to be a drop-in replacement. The only differences are:
- The base URL is `localhost:12434` instead of `api.openai.com`
- No API key is needed (DMR ignores the `Authorization` header entirely)
- You use your local model name instead of `gpt-4`
That's it. If you've ever written code against OpenAI, you already know how to use DMR.
The URL Map
DMR exposes several API formats from the same port:
localhost:12434/engines/v1/... → OpenAI-compatible (what we'll use)
localhost:12434/anthropic/v1/... → Anthropic-compatible
localhost:12434/api/... → Ollama-compatible
localhost:12434/models/... → DMR native (model management)
For everything in this post, we use the OpenAI-compatible path. The full URL for chat is:
http://localhost:12434/engines/v1/chat/completions
One thing that confuses people: the OpenAI SDK automatically appends /chat/completions to whatever baseURL you give it. So you set baseURL to http://localhost:12434/engines/v1 and let the SDK build the rest.
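To see how thin the compatibility layer is, here's that same endpoint hit with plain `fetch` — no SDK, no API key. This is a minimal sketch assuming DMR's default port; the request and response shapes follow the standard OpenAI chat-completions format.

```typescript
// The same request the OpenAI SDK would send, hand-rolled with fetch.
// Note what's missing: no API key, no Authorization header.
const DMR_CHAT_URL = "http://localhost:12434/engines/v1/chat/completions";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildChatBody(model: string, messages: ChatMessage[]) {
  return { model, messages };
}

async function ask(model: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch(DMR_CHAT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // no auth needed
    body: JSON.stringify(buildChatBody(model, messages)),
  });
  const data = await res.json();
  return data.choices[0].message.content; // standard OpenAI response shape
}
```

The SDK is still nicer to work with (types, streaming helpers), but it's worth knowing there's nothing magic underneath.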
A Note on Two OpenAI APIs
If you've looked at recent OpenAI docs, you've probably seen client.responses.create() — that's their newer Responses API. DMR doesn't fully support it. Use the classic client.chat.completions.create() — this is what DMR is built for, and what every local LLM tool (Ollama, LM Studio, etc.) speaks.
Quick translation table for when you see new OpenAI docs:
| OpenAI Responses API (new) | DMR equivalent |
|---|---|
| client.responses.create() | client.chat.completions.create() |
| response.output_text | response.choices[0].message.content |
| previous_response_id | your own messages[] array |
| input: "text" | messages: [{role, content}] |
Setup
mkdir dmr-playground
cd dmr-playground
bun init -y
bun add openai
Create a .env file:
LOCAL_LLM_URL=http://localhost:12434/engines/v1
LOCAL_LLM_MODEL=ai/llama3.2:3B-Q4_0
Important: The `/engines/v1` path at the end is required. If you set the base URL to just `localhost:12434`, the SDK appends `/chat/completions` directly and gets a 404.
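Bun loads `.env` automatically, so both values will be on `process.env`. A small guard (my own addition, not part of DMR or the SDK) fails fast with a clear message instead of a confusing SDK error later:

```typescript
// Fail fast if a required .env value is missing, instead of getting a
// cryptic "Invalid URL" error from the OpenAI SDK at request time.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing ${name} — add it to your .env file`);
  return value;
}

// Usage in the scripts below:
// const baseURL = requireEnv("LOCAL_LLM_URL");
// const model = requireEnv("LOCAL_LLM_MODEL");
```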
Script 1 — Single Question, Single Answer
src/chat.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed", // DMR ignores this, but the SDK requires it
});
const response = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
messages: [
{ role: "system", content: "You are a helpful assistant. Keep answers short." },
{ role: "user", content: "What is Docker in one sentence?" },
],
});
console.log("AI Response:", response.choices[0].message.content);
console.log("Tokens used:", response.usage?.total_tokens);
bun src/chat.ts
You should see a short answer about Docker and a token count. That token count is useful — it tells you how much of your context window you're using.
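Since every model has a fixed context window, a tiny helper can turn that total into a percentage. This is my own sketch — the 8192 default is an assumed context size for illustration (see the `docker model configure` command later in this post):

```typescript
// How much of the context window has this request used?
// contextSize should match your model's configured context; 8192 here
// is just an assumed default for illustration.
function contextUsedPercent(totalTokens: number, contextSize = 8192): number {
  return Math.round((totalTokens / contextSize) * 100);
}

// e.g. contextUsedPercent(response.usage?.total_tokens ?? 0)
```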
Script 2 — Streaming
This is what makes it feel like ChatGPT — words appearing one by one instead of waiting for the full response.
src/stream.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed",
});
process.stdout.write("🤖: ");
const stream = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
stream: true, // one flag, that's the whole change
messages: [
{ role: "user", content: "Tell me 3 fun facts about space." },
],
});
for await (const chunk of stream) {
const word = chunk.choices[0]?.delta?.content ?? "";
process.stdout.write(word); // no newline — prints inline as words arrive
}
console.log("\n✅ Done!");
Note: I'm using process.stdout.write() instead of console.log() here. The difference matters:
- `console.log("hello")` → prints `hello`, then jumps to a new line
- `process.stdout.write("hello")` → prints `hello` and stays on the same line
For streaming, you want each word to appear inline. console.log would put every word on its own line, which looks terrible.
Script 3 — Multi-turn Conversation (with Memory)
This is where most people hit a wall. The model has no memory between requests. Every single API call starts fresh. If you want a conversation that remembers what was said, you have to send the full history every time.
Think of it this way: you're not calling a stateful chatbot, you're calling a stateless function. You give it the entire conversation, it replies, you append that reply, and next time you give it everything again.
src/conversation.ts:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LOCAL_LLM_URL,
apiKey: "not-needed",
});
// This array IS the memory — you manage it yourself
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
{
role: "system",
content: "You are a helpful assistant. Keep answers short.",
},
];
async function chat(userMsg: string): Promise<void> {
// 1. Add user's message to history
messages.push({ role: "user", content: userMsg });
process.stdout.write("\n: ");
// 2. Send the full history to the model
const stream = await client.chat.completions.create({
model: process.env.LOCAL_LLM_MODEL!,
messages: messages, // entire conversation, every time
stream: true,
});
// 3. Collect the streamed reply while printing it live
let fullReply = "";
for await (const chunk of stream) {
const word = chunk.choices[0]?.delta?.content ?? "";
process.stdout.write(word);
fullReply += word;
}
console.log();
// 4. Add the assistant's reply to history
messages.push({ role: "assistant", content: fullReply });
}
await chat("My name is Mayur. I'm learning Docker.");
await chat("What's my name?");
The second message works because we send [system, "My name is Mayur...", "Nice to meet you...", "What's my name?"] — the full history. The model can see everything that was said.
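One consequence of sending the full history every time: the `messages` array grows without bound and will eventually overflow the context window. A common fix — sketched here with an arbitrary turn limit, not a DMR requirement — is to keep the system prompt plus only the most recent turns:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt plus the last `maxTurns` user/assistant exchanges.
// maxTurns = 10 is an arbitrary choice — tune it to your context size.
function trimHistory(messages: Msg[], maxTurns = 10): Msg[] {
  const [system, ...rest] = messages;
  return [system, ...rest.slice(-maxTurns * 2)]; // 2 messages per turn
}
```

Call it right before each `client.chat.completions.create()` and the conversation can run indefinitely, at the cost of the model forgetting its oldest turns.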
DMR Native Endpoints — Model Management in Code
Beyond chat, DMR exposes endpoints for managing models themselves. These don't need the OpenAI SDK — just plain fetch.
const DMR_BASE = "http://localhost:12434";
// List all local models
async function listModels() {
const res = await fetch(`${DMR_BASE}/models`);
const data = await res.json();
return data.models;
}
// Check if a model exists
async function modelExists(modelName: string): Promise<boolean> {
const [namespace, name] = modelName.split("/");
const res = await fetch(`${DMR_BASE}/models/${namespace}/${name}`);
return res.ok;
}
// Pull a model programmatically
async function pullModel(modelName: string): Promise<void> {
const res = await fetch(`${DMR_BASE}/models/create`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ from: modelName }),
});
if (!res.ok) throw new Error(`Failed to pull ${modelName}: ${res.status}`);
}
const models = await listModels();
console.log("Local models:", models);
const exists = await modelExists("ai/llama3.2:3B-Q4_0");
console.log("Model exists:", exists);
This becomes very useful in a CLI app — check if the requested model exists, pull it automatically if not, then start chatting. No manual docker model pull required.
Configuring Model Behaviour
You can tune how the model responds per-request through API parameters:
const response = await client.chat.completions.create({
model: "ai/llama3.2:3B-Q4_0",
messages: messages,
temperature: 0.7, // creativity: 0 = robotic, 2 = chaotic
top_p: 0.9, // probability cutoff for token selection
max_tokens: 1024, // max length of the response
presence_penalty: 0.1, // discourages repeating topics
frequency_penalty: 0.1, // discourages repeating specific words
});
For different use cases, I use different presets:
// For code generation — deterministic, precise
const codePreset = { temperature: 0.1, top_p: 0.95, max_tokens: 2048 };
// For general chat — natural, balanced
const chatPreset = { temperature: 0.7, top_p: 0.9, max_tokens: 1024 };
// For creative writing — unpredictable, expressive
const creativePreset = { temperature: 1.2, top_p: 0.95, max_tokens: 2048 };
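Wiring a preset into a request is just an object spread. A minimal sketch, assuming a mode string chosen by the caller (this becomes the `--mode` flag in the upcoming CLI):

```typescript
type Preset = { temperature: number; top_p: number; max_tokens: number };

const presets: Record<"code" | "chat" | "creative", Preset> = {
  code: { temperature: 0.1, top_p: 0.95, max_tokens: 2048 },
  chat: { temperature: 0.7, top_p: 0.9, max_tokens: 1024 },
  creative: { temperature: 1.2, top_p: 0.95, max_tokens: 2048 },
};

// Merge the chosen preset into the base request options.
function withPreset(mode: keyof typeof presets, base: Record<string, unknown>) {
  return { ...base, ...presets[mode] };
}

// const response = await client.chat.completions.create(
//   withPreset("code", { model, messages }),
// );
```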
You can also configure defaults at the model level (persists across sessions):
docker model configure --context-size 8192 ai/llama3.2:3B-Q4_0
To reset to defaults:
docker model configure --context-size -1 ai/llama3.2:3B-Q4_0
What's Next — The CLI
In Part 3, we take everything from this post and build a proper CLI tool:
llm "what is a closure in JavaScript?" # quick question
llm --chat # interactive session
llm --mode code "write a binary search" # mode switching
llm --models # list available models
It'll have streaming output, conversation memory, auto model detection, preset modes, and a clean terminal interface. Something you can actually use daily and show in your portfolio.
Tags: docker typescript llm rest-api local-ai openai bun developer-tools