What is Ornith-1.0?
Ornith-1.0 is a family of models designed specifically for agentic coding tasks. It's built on top of Gemma 4 and Qwen 3.5, trained with a self-improving RL framework that jointly optimizes both the reasoning scaffold and the final solution. The result: the model doesn't just generate answers — it discovers better search trajectories.
The family includes four sizes:
| Model | Parameters | Best For |
|---|---|---|
| Ornith-1.0-9B | ~9B dense | Single GPU, efficient deployment |
| Ornith-1.0-31B | ~31B dense | Mid-range GPUs, balanced speed/quality |
| Ornith-1.0-35B | ~35B MoE | High-end GPUs, top-tier coding |
| Ornith-1.0-397B | ~397B MoE | Multi-GPU clusters, maximum capability |
The 9B model is the accessible entry point. It runs on a single 80GB GPU in full precision, or comfortably on consumer hardware with quantization. And it delivers impressive results:
| Benchmark | Ornith-1.0-9B | Qwen3.5-9B |
|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 |
| SWE-Bench Verified | 69.4 | 53.2 |
| SWE-Bench Pro | 42.9 | 31.3 |
| NL2Repo | 27.2 | 16.2 |
That's between roughly 1.3× and 2× the score of the base Qwen3.5-9B, depending on the benchmark, with the largest gain on Terminal-Bench 2.1. And it's MIT licensed.
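As a quick check on the table above, the per-benchmark ratio between the two columns can be computed directly. This is a minimal sketch that only reuses the published scores; the names `SCORES` and `improvement_ratios` are illustrative and not part of any Ornith or gollama tooling.

```python
# Scores copied from the benchmark table: (Ornith-1.0-9B, Qwen3.5-9B).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": (43.1, 21.3),
    "SWE-Bench Verified": (69.4, 53.2),
    "SWE-Bench Pro": (42.9, 31.3),
    "NL2Repo": (27.2, 16.2),
}


def improvement_ratios(scores):
    """Return {benchmark: ornith_score / base_score}, rounded to 2 decimals."""
    return {name: round(ornith / base, 2) for name, (ornith, base) in scores.items()}


RATIOS = improvement_ratios(SCORES)

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio:.2f}x")
```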
What is Gollama?
Gollama is a single Go binary (no Python, no Docker, no dependencies) that downloads GGUF models from HuggingFace, runs them via llama.cpp, and exposes an OpenAI-compatible API. It includes a web UI, terminal chat, multi-instance management, and full control over llama-server flags.
Key advantages over Ollama:
- Multi-file GGUF support — Ollama can't download sharded models; gollama handles them automatically
- Full flag control — every llama-server parameter is exposed, not hidden
- Multi-instance — run multiple models simultaneously on different ports
- Auto-launch — call the API with a model name and gollama starts it if needed
Prerequisites
- A GPU (NVIDIA recommended; Vulkan works too)
- At least 6GB VRAM for Q4 quantization of the 9B model
- ~10GB disk space for the model file
- Linux, macOS, or Windows
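To get a feel for where the VRAM and disk figures come from, the size of the weights can be estimated from the parameter count and the quantization's bits per weight. The sketch below is a rough approximation: the bits-per-weight values are assumed averages for common GGUF quantization types, not numbers published for Ornith-1.0, and the KV cache and runtime overhead come on top of the result.

```python
# Approximate bits per weight for common GGUF quantizations.
# These are rough, assumed averages, not values published for Ornith-1.0.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}


def weight_size_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the weights alone, in GB (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_size_gb(9e9, quant):.1f} GB for a ~9B model")
```

Context size adds to this, which is why the usable context window depends on how much VRAM is left after the weights are loaded.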
Step 1: Install Gollama
One line on Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/majidkorai/gollama/main/install.sh | sh
Windows (PowerShell):
iwr -useb https://raw.githubusercontent.com/majidkorai/gollama/main/install.ps1 | iex
The installer detects your platform, downloads the right binary, and places it in /usr/local/bin. On first run, gollama will prompt you to install the llama-server binary with your preferred GPU backend.
Step 2: Pull Ornith-1.0-9B
The GGUF build lives at deepreinforce-ai/Ornith-1.0-9B-GGUF on HuggingFace. Gollama pulls it directly:
gollama pull hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF
Gollama auto-detects multi-file GGUF splits and downloads each part with progress indicators. Pick your preferred quantization by appending :Q4_K_M (or :Q5_K_M, :Q8_0, etc.) to the model name.
Step 3: Chat with It
Terminal Chat (quickest)
gollama chat Ornith-1.0-9B-Q4_K_M.gguf
This launches the model on a free port and opens a streaming chat session. You'll see tokens appear in real time, with reasoning blocks displayed in italics before the final answer.
Web UI (most features)
gollama serve
Then open http://localhost:9080 in your browser. The dashboard shows running instances, a chat workspace, model management, and a settings panel where you can tweak flags like context size, GPU layers, and quantization.
Step 4: Use It with Any OpenAI-Compatible Client
Gollama exposes an OpenAI-compatible API at http://localhost:9080/v1. Point any tool at it:
from openai import OpenAI

# Point the client at the local gollama endpoint instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:9080/v1",
    api_key="***"
)

response = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[
        {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024
)

# Print only the final answer.
print(response.choices[0].message.content)
The model returns reasoning in a separate reasoning_content field, which is useful if you want to inspect the chain of thought separately from the final answer.
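Separating the two fields might look like the sketch below. It works on a plain response dictionary and assumes that reasoning_content sits next to content on the first choice's message; that placement is an assumption, and both `split_reasoning` and the sample payload are made up for illustration rather than taken from real model output.

```python
import json


def split_reasoning(response: dict) -> tuple:
    """Return (reasoning, answer) from a chat completion response dict.

    Assumes reasoning_content sits next to content on the first choice's
    message; reasoning is None when the field is missing.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content") or ""


# Hand-written sample payload for illustration only, not real model output.
SAMPLE = json.loads("""
{"choices": [{"message": {"role": "assistant",
  "reasoning_content": "Check divisors up to sqrt(n).",
  "content": "def is_prime(n): ..."}}]}
""")

reasoning, answer = split_reasoning(SAMPLE)

if __name__ == "__main__":
    print("reasoning:", reasoning)
    print("answer:", answer)
```

With the openai client, calling `response.model_dump()` should give a dictionary of this shape, although whether the extra field survives depends on the client version.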
Recommended Settings
Ornith-1.0 is a reasoning model. For best results:
- Temperature: 0.6 (use 1.0 to reproduce benchmark numbers)
- Top-p: 0.95
- Top-k: 20
- Context window: up to 262K tokens (adjust based on your VRAM)
In gollama's web UI, you can set these via the structured flag editor. Key flags:
| Flag | Recommended Value |
|---|---|
| --n-gpu-layers | 99 (offload everything to GPU) |
| --ctx-size | 8192–32768 (balance VRAM vs context) |
| --flash-attn | on (reduces VRAM usage) |
| --temp | 0.6 |
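The sampling settings can also be sent per request instead of being fixed at launch. The sketch below builds, without sending, a chat completion request that carries the recommended values. Note that top_k is not part of the OpenAI schema; the example assumes gollama passes it through to llama-server unchanged, which should be verified before relying on it. The helper name `build_chat_request` is hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Ornith-1.0-9B",
                       base_url="http://localhost:9080/v1"):
    """Build (without sending) a chat request with the recommended sampling settings."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        # Assumption: top_k is not an OpenAI parameter and is expected to be
        # forwarded to llama-server as-is.
        "top_k": 20,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Write a Python function to check if a number is prime.")

# To send it against a running gollama instance:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```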
What Can You Use It For?
Ornith-1.0 is optimized for agentic coding. It works with:
- Coding CLIs — OpenCode, Aider, or any tool that talks OpenAI API
- Agent frameworks — Hermes Agent, OpenHands, or custom MCP servers
- OpenClaw — point it at the gollama endpoint for local agentic coding
- Cursor / Continue.dev — use as a local model provider
Because it emits well-formed tool calls, you can connect it to shell commands, file editors, and search tools through any MCP-compatible interface.
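As a rough illustration of what connecting a tool involves, the sketch below describes one tool in the OpenAI function-calling format and dispatches a tool call to a local Python function. It uses the OpenAI tool-call message shape rather than MCP itself, and the `read_file` tool, the `HANDLERS` table, and `dispatch_tool_call` are all hypothetical names chosen for this example.

```python
import json

# OpenAI-style tool description; "read_file" is a hypothetical tool for this sketch.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def read_file(path: str) -> str:
    """Local implementation backing the read_file tool."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


HANDLERS = {"read_file": read_file}


def dispatch_tool_call(tool_call: dict) -> dict:
    """Run one tool call and return the tool-role message to send back to the model."""
    function = tool_call["function"]
    handler = HANDLERS.get(function["name"])
    if handler is None:
        result = f"unknown tool: {function['name']}"
    else:
        # Arguments arrive as a JSON-encoded string in the OpenAI format.
        result = handler(**json.loads(function["arguments"]))
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
```

In a real agent loop, `TOOLS` would be passed as the `tools` argument of the chat completion request, and each returned message would be appended to the conversation before asking the model to continue.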
Troubleshooting
“Out of memory” errors: Lower the context size (--ctx-size) or use a more aggressive quantization (:Q4_K_M instead of :Q8_0).
Slow on Linux with NVIDIA: llama.cpp doesn't ship pre-built CUDA binaries for Linux. Run gollama update and it will detect nvcc and compile CUDA support automatically.
Model not responding: Check the logs in the web UI. Gollama logs each instance to ~/.gollama/logs/port-NNNN.log.
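If you prefer to inspect a log outside the web UI, a small helper can print the last few lines for a given port. This is a minimal sketch: it assumes NNNN in the file name is the plain port number with no extra padding, and the function names are illustrative.

```python
from collections import deque
from pathlib import Path


def instance_log_path(port: int,
                      log_dir: Path = Path.home() / ".gollama" / "logs") -> Path:
    """Path of the per-instance log, following the port-NNNN.log pattern."""
    # Assumption: NNNN is the plain port number with no extra padding.
    return log_dir / f"port-{port}.log"


def tail(path: Path, n: int = 20) -> list:
    """Return the last n lines of a text file, keeping only n lines in memory."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    log_file = instance_log_path(9081)  # 9081 is an example port
    if log_file.exists():
        print("\n".join(tail(log_file)))
    else:
        print(f"no log found at {log_file}")
```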
Why This Matters
Ornith-1.0-9B delivers coding performance that rivals models 3–4× its size. Combined with gollama's zero-dependency setup, that means you can run a state-of-the-art coding model on a single consumer GPU with one command. No API keys, no monthly fees, no cloud dependency. Just local inference, full control, and an MIT license.
The agentic coding space is moving fast. Ornith shows that open-source models can compete with — and beat — proprietary solutions when trained with the right approach. And gollama makes it trivially easy to be part of that experiment.