Ornith-1.0: Running a State-of-the-Art Coding Model Locally with Gollama

Ornith-1.0 is a new family of open-source agentic coding models that punches well above its weight. On the benchmarks reported below, the 9B variant scores between roughly 1.3× and 2× what Qwen 3.5 9B does. Here's how to run it locally using gollama, a single Go binary that makes it dead simple.


What is Ornith-1.0?

Ornith-1.0 is a family of models designed specifically for agentic coding tasks. It's built on top of Gemma 4 and Qwen 3.5, trained with a self-improving RL framework that jointly optimizes both the reasoning scaffold and the final solution. The result: the model doesn't just generate answers — it discovers better search trajectories.

The family includes four sizes:

Model              Parameters   Best For
Ornith-1.0-9B      ~9B dense    Single GPU, efficient deployment
Ornith-1.0-31B     ~31B dense   Mid-range GPUs, balanced speed/quality
Ornith-1.0-35B     ~35B MoE     High-end GPUs, top-tier coding
Ornith-1.0-397B    ~397B MoE    Multi-GPU clusters, maximum capability

The 9B model is the accessible entry point. It runs on a single 80GB GPU in full precision, or comfortably on consumer hardware with quantization. And it delivers impressive results:

Benchmark                          Ornith-1.0-9B   Qwen3.5-9B
Terminal-Bench 2.1 (Terminus-2)    43.1            21.3
SWE-Bench Verified                 69.4            53.2
SWE-Bench Pro                      42.9            31.3
NL2Repo                            27.2            16.2

That's roughly 1.3× to 2× the score of the base Qwen 3.5 9B, depending on the benchmark, with the largest gap on Terminal-Bench 2.1. And it's MIT licensed.
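To see how the gap varies by benchmark, the ratios can be computed straight from the scores reported above. This is only a small sketch: the dictionary restates the table, and the names SCORES and score_ratios are illustrative.

```python
# Scores copied from the benchmark table: (Ornith-1.0-9B, Qwen3.5-9B).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": (43.1, 21.3),
    "SWE-Bench Verified": (69.4, 53.2),
    "SWE-Bench Pro": (42.9, 31.3),
    "NL2Repo": (27.2, 16.2),
}


def score_ratios(scores):
    """Return {benchmark: ornith_score / qwen_score}."""
    return {name: ornith / qwen for name, (ornith, qwen) in scores.items()}


RATIOS = score_ratios(SCORES)

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratio runs from about 1.3× on SWE-Bench Verified to about 2.0× on Terminal-Bench 2.1.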

What is Gollama?

Gollama is a single Go binary (no Python, no Docker, no dependencies) that downloads GGUF models from HuggingFace, runs them via llama.cpp, and exposes an OpenAI-compatible API. It includes a web UI, terminal chat, multi-instance management, and full control over llama-server flags.

Key advantages over Ollama:

  • Multi-file GGUF support — Ollama can't download sharded models; gollama handles them automatically
  • Full flag control — every llama-server parameter is exposed, not hidden
  • Multi-instance — run multiple models simultaneously on different ports
  • Auto-launch — call the API with a model name and gollama starts it if needed

Prerequisites

  • A GPU (NVIDIA recommended; Vulkan works too)
  • At least 6GB VRAM for Q4 quantization of the 9B model
  • ~10GB disk space for the model file
  • Linux, macOS, or Windows
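For a rough sense of where the VRAM and disk figures come from, the weight size can be estimated as parameters × bits per weight. This is a back-of-the-envelope sketch: the bits-per-weight values are approximate assumptions (real GGUF files vary by architecture), and the estimate leaves out the KV cache and compute buffers.

```python
# Approximate bits per weight for common GGUF quantizations.
# These are rough, assumed averages; real files vary by architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weight_size_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_size_gb(9e9, quant):.1f} GB of weights")
```

Under these assumptions, a 9B model at Q4_K_M needs about 5.4 GB for weights alone, which is why 6GB of VRAM is a floor rather than a comfortable target once context is added.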

Step 1: Install Gollama

One line on Linux/macOS:

curl -fsSL https://raw.githubusercontent.com/majidkorai/gollama/main/install.sh | sh

Windows (PowerShell):

iwr -useb https://raw.githubusercontent.com/majidkorai/gollama/main/install.ps1 | iex

The installer detects your platform, downloads the right binary, and, on Linux/macOS, places it in /usr/local/bin. On first run, gollama will prompt you to install the llama-server binary with your preferred GPU backend.

Step 2: Pull Ornith-1.0-9B

The GGUF build lives at deepreinforce-ai/Ornith-1.0-9B-GGUF on HuggingFace. Gollama pulls it directly:

gollama pull hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF

Gollama auto-detects multi-file GGUF splits and downloads each part with a progress indicator. To pick a specific quantization, append a tag such as :Q4_K_M, :Q5_K_M, or :Q8_0 to the model name, for example hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M.
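If you script your setup, the pull step can be wrapped in a few lines. This is a minimal sketch that assumes gollama is on your PATH and uses only the pull syntax shown above; model_ref and pull are illustrative helper names, not part of gollama.

```python
import subprocess
from typing import Optional

REPO = "deepreinforce-ai/Ornith-1.0-9B-GGUF"


def model_ref(repo: str, quant: Optional[str] = None) -> str:
    """Build the hf.co reference, optionally with a :QUANT suffix."""
    ref = f"hf.co/{repo}"
    return f"{ref}:{quant}" if quant else ref


def pull(repo: str, quant: Optional[str] = None) -> None:
    """Run `gollama pull`; raises CalledProcessError on a non-zero exit."""
    subprocess.run(["gollama", "pull", model_ref(repo, quant)], check=True)


# Show the reference that pull(REPO, "Q4_K_M") would pass to gollama.
print(model_ref(REPO, "Q4_K_M"))
```

Calling pull(REPO, "Q4_K_M") would then run the same command as above with the quantization tag appended.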

Step 3: Chat with It

Terminal Chat (quickest)

gollama chat Ornith-1.0-9B-Q4_K_M.gguf

This launches the model on a free port and opens a streaming chat session. You'll see tokens appear in real time, with reasoning blocks displayed in italics before the final answer.

Web UI (most features)

gollama serve

Then open http://localhost:9080 in your browser. The dashboard shows running instances, a chat workspace, model management, and a settings panel where you can tweak flags like context size, GPU layers, and quantization.
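When gollama serve is started from a script, it helps to wait until the port accepts connections before sending requests. The sketch below only checks that something is listening on the port; it assumes nothing about gollama beyond the 9080 port mentioned above, and wait_for_port is an illustrative helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not up yet; retry shortly
    return False
```

For example, wait_for_port("localhost", 9080) returns True once the server is reachable and False if it is still down after 30 seconds.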

Step 4: Use It with Any OpenAI-Compatible Client

Gollama exposes an OpenAI-compatible API at http://localhost:9080/v1. Point any tool at it:

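from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local gollama endpoint.
client = OpenAI(
    base_url="http://localhost:9080/v1",
    api_key="***",  # placeholder; the client requires a non-empty string
)

# Send a single-turn chat request using the recommended sampling settings.
response = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[
        {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)

# Print the final answer.
print(response.choices[0].message.content)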

The model returns its reasoning in a separate reasoning_content field, so you can inspect the chain of thought separately from the final answer in content.
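To read both fields without extra dependencies, here is a standard-library sketch. It assumes the response follows the usual OpenAI chat completion layout, with reasoning_content sitting next to content on the message as described above. chat and split_reasoning are illustrative helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9080/v1"  # gollama endpoint from this guide


def chat(messages, model="Ornith-1.0-9B", base_url=BASE_URL):
    """POST a chat completion request and return the decoded JSON body."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def split_reasoning(completion):
    """Return (reasoning, answer) from a chat completion dict.

    reasoning is None when the server sends no reasoning_content field
    (assumed layout: both fields live on choices[0].message).
    """
    message = completion["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")
```

With the server running, split_reasoning(chat([{"role": "user", "content": "..."}])) yields the reasoning and the answer as two separate strings, so you can log one and display the other.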

Ornith-1.0 is a reasoning model. For best results:

  • Temperature: 0.6 (use 1.0 to reproduce benchmark numbers)
  • Top-p: 0.95
  • Top-k: 20
  • Context window: up to 262K tokens (adjust based on your VRAM)
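One way to keep these settings consistent across requests is to build the request body in one place. A minimal sketch: top_k is not part of the OpenAI schema, so sending it as an extra body field is an assumption about what the server accepts, and build_payload is an illustrative helper.

```python
# Recommended sampling settings from the list above.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def build_payload(prompt, model="Ornith-1.0-9B", benchmark_mode=False, **overrides):
    """Build a chat completion request body with the recommended sampling.

    top_k is not an OpenAI-standard field; including it assumes the
    server accepts llama.cpp-style sampling parameters.
    """
    sampling = dict(RECOMMENDED_SAMPLING)
    if benchmark_mode:
        sampling["temperature"] = 1.0  # setting used to reproduce benchmarks
    sampling.update(overrides)  # explicit overrides win
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }


print(build_payload("Explain binary search.", max_tokens=512))
```

With the openai Python client, the same non-standard field can be passed as extra_body={"top_k": 20} on chat.completions.create.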

In gollama's web UI, you can set these via the structured flag editor. Key flags:

Flag              Recommended Value
--n-gpu-layers    99 (offload everything to GPU)
--ctx-size        8192–32768 (balance VRAM vs context)
--flash-attn      on (reduces VRAM usage)
--temp            0.6
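If you keep these settings in a script rather than the web UI, they flatten naturally into a llama-server style argument list. This sketch only shows the command-line form of the flags above; how gollama forwards them to llama-server isn't covered here, and the 16384 context size is just one value picked from the suggested range.

```python
# Flag values from the table above; ctx-size is picked from the suggested range.
FLAGS = {
    "--n-gpu-layers": 99,
    "--ctx-size": 16384,
    "--flash-attn": "on",
    "--temp": 0.6,
}


def to_argv(flags):
    """Flatten {flag: value} into a list like ['--ctx-size', '16384', ...]."""
    argv = []
    for flag, value in flags.items():
        argv += [flag, str(value)]
    return argv


print(" ".join(to_argv(FLAGS)))
```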

What Can You Use It For?

Ornith-1.0 is optimized for agentic coding. It works with:

  • Coding CLIs — OpenCode, Aider, or any tool that talks OpenAI API
  • Agent frameworks — Hermes Agent, OpenHands, or custom MCP servers
  • OpenClaw — point it at the gollama endpoint for local agentic coding
  • Cursor / Continue.dev — use as a local model provider

Because it emits well-formed tool calls, you can connect it to shell commands, file editors, and search tools through any MCP-compatible interface.
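To make the tool-calling flow concrete, here is a sketch of the client side using plain OpenAI-style function calling (not MCP itself). The run_shell tool and the HANDLERS table are hypothetical, and a real agent would sandbox or confirm commands before running them.

```python
import json
import subprocess

# An OpenAI-style tool definition; "run_shell" is a hypothetical tool name.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]


def run_shell(command: str) -> str:
    """Execute a command and return stdout (no sandboxing in this sketch)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout


# Maps tool names the model may call to local Python functions.
HANDLERS = {"run_shell": run_shell}


def dispatch(tool_call: dict) -> dict:
    """Run one tool call and wrap the result as a `tool` role message."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    output = HANDLERS[fn["name"]](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": output}
```

In a full loop you would send TOOLS in the tools field of the request, pass each entry of the response's tool_calls through dispatch, append the resulting messages to the conversation, and ask the model to continue.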

Troubleshooting

“Out of memory” errors: Lower the context size (--ctx-size) or use a more aggressive quantization (:Q4_K_M instead of :Q8_0).
Slow on Linux with NVIDIA: llama.cpp doesn't ship pre-built CUDA binaries for Linux. Run gollama update and it will detect nvcc and compile CUDA support automatically.
Model not responding: Check the logs in the web UI. Gollama logs each instance to ~/.gollama/logs/port-NNNN.log.
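If you'd rather check the logs from a script than from the web UI, the last few lines of an instance log can be read as follows. This sketch assumes NNNN in the path above is simply the instance's port number; log_path and tail are illustrative helpers.

```python
from collections import deque
from pathlib import Path
from typing import List, Optional


def log_path(port: int, home: Optional[Path] = None) -> Path:
    """Path of the per-instance log, following ~/.gollama/logs/port-NNNN.log."""
    base = home if home is not None else Path.home()
    return base / ".gollama" / "logs" / f"port-{port}.log"


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a file, or [] if it does not exist."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming.
        return list(deque((line.rstrip("\n") for line in fh), maxlen=n))
```

For example, tail(log_path(9080)) returns up to the last 20 lines for an instance on port 9080, or an empty list if no such log exists.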

Why This Matters

Ornith-1.0-9B delivers coding performance that rivals models 3–4× its size. Combined with gollama's zero-dependency setup, that means you can run a state-of-the-art coding model on a single consumer GPU with one command. No API keys, no monthly fees, no cloud dependency. Just local inference, full control, and an MIT license.

The agentic coding space is moving fast. Ornith shows that open-source models can compete with — and beat — proprietary solutions when trained with the right approach. And gollama makes it trivially easy to be part of that experiment.
