Coming as open source in June 2026

knaif — a local command agent that doesn't burn tokens to do simple work

knaif is a framework for building AI-enhanced applications that run entirely on local hardware, with no subscription and no cloud round-trip. Developers build skills in the framework; end users get a native, self-contained tool delivered as a normal platform installer — no Python, no model wrangling, no cloud.

The premise is one sentence: the model should only figure out what you want — not do the work. A small local LLM reads your plain-English request and emits a JSON plan. From there, deterministic code validates the plan, expands it into concrete steps, asks for confirmation when needed, and executes it. The model never runs commands, never emits shell strings, and never owns a safety decision.

That split is the whole point. Once a developer has built and tuned a skill, the end product runs the expensive part (intent extraction) once per request on a 4B model, and the deterministic part (the actual file/media/CLI work) for free. You pay the AI cost during development — write once — and ship something that runs cheaply on a laptop forever.

The name is a portmanteau of knife and AI. Think Swiss Army knife: a compact tool where every blade does one thing well and snaps cleanly into place. Each skill is a blade — self-contained, purpose-built, and composable. You carry exactly the blades you need; the handle stays the same.

The thesis, and why it's not what everyone else is building

Most agents today — premium or open source — use the LLM as a general-purpose brain. Every request, trivial or not, goes through the model. "Trim this video to 10 seconds" gets the same multi-thousand-token treatment as "refactor my auth layer." That buys you three problems:

Waste. You spend tokens (money, electricity, latency) regenerating logic that is fundamentally deterministic. The 200th "compress this video" request reasons from scratch like the first.
Non-determinism. The same prompt can produce a different command tomorrow. For a CLI wrapper, that's a bug, not a feature.
Dependency. The capability lives in someone else's data center, behind a meter.

knaif inverts it. The model is a thin intent layer; the capability lives in hand-written, tested, deterministic skills. The model picks compress_video and extracts {target_size_mb: 25}. An expander turns that into a probe → recipe → preview → confirm → batch workflow and renders the exact ffmpeg argv. Same input, same output, every time.

knaif also isn't trying to solve everything. It deliberately starts where the pain is concrete and the determinism is real: complex CLI tools nobody remembers the flags for. The first shipped skill wraps FFmpeg in plain English instead of a 15-flag incantation.

How knaif compares

	Premium agents (Claude Code, Codex, Copilot)	OSS agents (Open Interpreter, aider, OpenHands)	knaif
Inference	Cloud, metered per token	Local or cloud	Local only
Cost to end user	Subscription / per-token	Free, but you bring the model	Free at runtime
Who executes	LLM proposes commands/code, often runs them	LLM generates and runs code	Deterministic native code; LLM only plans
Determinism	No	No	Yes — same request, same command
Token cost per task	Full reasoning every time	Full reasoning every time	One small intent call; expansion is free
Setup for end user	Account + key	Install Python, pick a model, pull GGUFs, tune	Native installer — no model-picking, no deps
Focus	General coding	General coding / shell	Narrow, deterministic skills
Safety	Model-mediated	Model-mediated	Code-enforced: sandbox, dry-run, confirm gates

The end-user story is the part most OSS local agents miss. Running them means installing a dev stack, downloading models, and guessing which 7B quant does the job. knaif gives the developer a framework and the end user a product: a native app that bundles the right model for the skill (per-skill fine-tuning is planned; none ships yet) and needs zero ML knowledge to run. The framework is the factory; the shipped artifact is a self-contained binary.

How it works

A skill is a self-contained directory. The core stays completely domain-agnostic — it loads skills, validates plans, expands intents, resolves variables, enforces safety, and dispatches handlers. All domain behavior lives in the skill.

[Figure: knaif execution pipeline]

The invariant the model is held to never changes across skills:

{ "plan": [ { "tool": "compress_video", "args": { "inputs": ["clip.mp4"], "target_size_mb": 25 } } ] }

That's all the model ever produces. Everything downstream is code.
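As a rough illustration of what "everything downstream is code" can mean at the first step, here is a minimal sketch of envelope validation. The registry contents, the `PlanError` class, and the `parse_plan` name are assumptions made for this example, not knaif's actual API.

```python
import json

# Hypothetical registry: intent tool name -> argument names it accepts.
INTENT_TOOLS = {
    "compress_video": {"inputs", "target_size_mb"},
    "trim_video": {"inputs", "start", "duration"},
}


class PlanError(ValueError):
    """Raised when model output is not a valid plan envelope."""


def parse_plan(raw: str, registry: dict[str, set[str]] = INTENT_TOOLS) -> list[dict]:
    """Parse model output and reject anything the registry does not allow."""
    try:
        envelope = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise PlanError(f"not valid JSON: {exc}") from exc

    # The envelope must be exactly {"plan": [...]} and nothing else.
    if not isinstance(envelope, dict) or set(envelope) != {"plan"}:
        raise PlanError("envelope must be an object with exactly one key: 'plan'")
    steps = envelope["plan"]
    if not isinstance(steps, list) or not steps:
        raise PlanError("'plan' must be a non-empty list")

    for step in steps:
        if not isinstance(step, dict):
            raise PlanError("each step must be an object")
        tool, args = step.get("tool"), step.get("args", {})
        if not isinstance(tool, str) or tool not in registry:
            raise PlanError(f"unknown tool: {tool!r}")
        if not isinstance(args, dict):
            raise PlanError("'args' must be an object")
        unsupported = set(args) - registry[tool]
        if unsupported:
            raise PlanError(f"unsupported args for {tool}: {sorted(unsupported)}")
    return steps
```

Under these assumptions, a plan naming a tool outside the registry fails before any handler is looked up, which is the property the rest of the pipeline relies on.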

Intent → workflow expansion

The model only sees high-level intent tools. A request like "get clip.mov ready for WhatsApp" picks prepare_for_platform. An expander then rewrites that single intent into a deterministic multi-step plan with data flowing between steps via $variables:

{ "plan": [
  { "tool": "resolve_inputs",  "args": { "paths": ["clip.mov"] },        "output": "$files" },
  { "tool": "inspect_media",   "args": { "files": "$files" },            "output": "$probes" },
  { "tool": "load_platform_profile", "args": { "platform": "whatsapp" }, "output": "$pp" },
  { "tool": "build_recipes",   "args": { "probes": "$probes", "platform_profile": "$pp" }, "output": "$recipes" },
  { "tool": "run_preview",     "args": { "command": "$preview_cmd" },    "output": "$preview" },
  { "tool": "wait_for_confirmation", "args": { "prompt": "Apply to all inputs?", "preview": "$preview" } },
  { "tool": "run_batch",       "args": { "commands": "$batch_cmds" },    "output": "$outputs" },
  { "tool": "generate_report", "args": { "outputs": "$outputs" } }
] }
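The expansion and the `$variable` data flow in the plan above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the expander covers only the first four steps, the intent argument names (`inputs`, `platform`) are hypothetical, and every variable is bound by an earlier step's `output`. How `$preview_cmd` and `$batch_cmds` get bound is not shown in the plan, so the sketch leaves those steps out.

```python
from typing import Any, Callable


def expand_prepare_for_platform(args: dict) -> list[dict]:
    """Rewrite one intent into a fixed list of internal steps (no model involved)."""
    return [
        {"tool": "resolve_inputs", "args": {"paths": args["inputs"]}, "output": "$files"},
        {"tool": "inspect_media", "args": {"files": "$files"}, "output": "$probes"},
        {"tool": "load_platform_profile", "args": {"platform": args["platform"]}, "output": "$pp"},
        {"tool": "build_recipes",
         "args": {"probes": "$probes", "platform_profile": "$pp"}, "output": "$recipes"},
    ]


def resolve(value: Any, env: dict[str, Any]) -> Any:
    """Replace "$name" strings with the values earlier steps produced."""
    if isinstance(value, str) and value.startswith("$"):
        if value not in env:
            raise KeyError(f"unresolved variable: {value}")
        return env[value]
    if isinstance(value, list):
        return [resolve(item, env) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, env) for key, item in value.items()}
    return value


def run_plan(plan: list[dict], handlers: dict[str, Callable[..., Any]]) -> dict[str, Any]:
    """Run steps in order, binding each step's result to its output variable."""
    env: dict[str, Any] = {}
    for step in plan:
        args = resolve(step.get("args", {}), env)
        result = handlers[step["tool"]](**args)
        if "output" in step:
            env[step["output"]] = result
    return env
```

Because the expander is a pure function of the intent arguments, calling it twice with the same arguments yields the same step list, which is where the "same input, same output" property comes from.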

The internal tools (resolve_inputs, inspect_media, …) are hidden from the model — they're emitted only by code. The renderer turns recipes into real argv lists:

ffmpeg -y -i clip.mov -vf "scale='min(1280,iw)':-2" -c:v libx264 -crf 23 -preset medium \
  -pix_fmt yuv420p -c:a aac -b:a 128k -movflags +faststart clip_whatsapp.mp4

Commands are executed as argv lists, never as shell strings.
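A minimal sketch of what argv-list execution looks like in Python, assuming a hypothetical recipe dict with `input`, `max_width`, `vcodec`, `crf`, and `output` keys (knaif's real recipe shape is not documented here). The point it illustrates: each argument reaches the child process verbatim, so a hostile filename cannot inject a second command.

```python
import subprocess
import sys


def render_argv(recipe: dict) -> list[str]:
    """Render a (hypothetical) recipe dict into an ffmpeg argv list."""
    argv = ["ffmpeg", "-y", "-i", recipe["input"]]
    if "max_width" in recipe:
        # The single quotes are for ffmpeg's filter parser, not for a shell.
        argv += ["-vf", f"scale='min({recipe['max_width']},iw)':-2"]
    argv += ["-c:v", recipe["vcodec"], "-crf", str(recipe["crf"]), recipe["output"]]
    return argv


def run_argv(argv: list[str]) -> subprocess.CompletedProcess:
    """Execute an argv list directly; a plain string is refused outright."""
    if isinstance(argv, str):
        raise TypeError("commands must be argv lists, not shell strings")
    return subprocess.run(argv, shell=False, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Demonstrates verbatim argument passing without needing ffmpeg installed.
    done = run_argv([sys.executable, "-c", "import sys; print(sys.argv[1])", "a; echo pwned"])
    print(done.stdout.strip())
```

With `shell=False`, the `;` in the last argument is just a character in a string; no shell ever parses it.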

Safety is enforced by code, not vibes

The model emits only the JSON plan envelope. It cannot emit shell commands.
Unknown tools and unsupported args are rejected before anything runs.
Sandbox-sensitive paths are validated before execution and again after $var resolution — a variable can't smuggle a path out of the sandbox.
Dry-run mode previews the full plan and rendered commands with no side effects.
Destructive tools require explicit confirmation or dry-run.
Preview gates pause mid-workflow for a human y/n before any batch write.
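The double path check in the list above can be sketched as follows. This is an illustration under assumptions, not knaif's implementation: `SandboxError`, `ensure_inside`, and `check_step_paths` are hypothetical names, and the sandbox is modeled as a single root directory.

```python
from pathlib import Path


class SandboxError(PermissionError):
    """Raised when a path resolves outside the sandbox root."""


def ensure_inside(sandbox: Path, candidate: str) -> Path:
    """Resolve candidate against the sandbox and refuse anything that escapes it."""
    root = sandbox.resolve()
    # resolve() collapses ".." and follows symlinks; an absolute candidate
    # replaces the root entirely and is caught by the check below.
    target = (root / candidate).resolve()
    if not target.is_relative_to(root):
        raise SandboxError(f"path escapes sandbox: {candidate}")
    return target


def check_step_paths(sandbox: Path, args: dict, path_keys: list[str]) -> None:
    """Validate literal paths; "$var" references are skipped until resolved."""
    for key in path_keys:
        value = args.get(key)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, str) and not item.startswith("$"):
                ensure_inside(sandbox, item)
```

The same `check_step_paths` call runs twice per step: once on the raw args, where `$files` is skipped, and once after resolution, where `$files` has become a list of real paths and each one is checked.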

Tech stack

knaif is built in two tiers, from different materials on purpose.

The development framework — what a skill author works in. Deliberately boring and small; that's a feature.

Layer	Choice
Language	Python ≥ 3.10
Core deps	`llama-cpp-python`, `requests`, `pyyaml` — that's it
CLI	`click` (`knaif-cli run <skill> <prompt>`)
Inference backends	llama.cpp (GGUF, CUDA-optional), Ollama, and mock (no model, used in tests/dev)
Skill format	YAML manifest + YAML tool registry + handler functions
Tests	`pytest` — 631 passing across core + skills
Lint / format / types	`ruff`, `black`, `mypy`
Task runner	`just`
Eval	in-tree `knaif.evalsuite` (corpus, runner, scoring, snapshot, regression, reporting)

No web service, no database, no orchestration layer. A skill is a folder; the framework is a library.

What ships to the end user: deliberately not Python. The plan is that once a skill is authored and tuned in the framework, it is packaged as a native, self-contained application: a Rust/C++ runtime around the deterministic engine plus a bundled model (fine-tuned per skill once those models exist), delivered through a normal platform installer (Windows, macOS, Linux). No Python runtime, no dependency hunt, no model-picking: the user installs an app and it runs locally for free. This tier is still on the roadmap. The Python framework is the factory; the artifact in the user's hands is a native binary.

Anatomy of a skill

skills/<name>/
  skill.yaml        # manifest: name, recommended_model, data refs, arg value sets
  tools.yaml        # flat registry of intent + internal tools
  handlers          # HANDLERS + optional EXPANDERS, SUMMARIZERS, PREFLIGHTS
  prompt.yaml       # model-facing rules + few-shot examples
  data/*.jsonl      # train / eval / safety corpora
  profiles/         # domain config (e.g. platform & quality presets)
  tests/            # per-skill test suite

Core never imports a skill by name. Adding a domain is dropping in a folder — fork an existing skill, swap the tools and handlers, write examples and data, done. That forkability is the growth model.
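One way a core can load skills without ever importing one by name is to discover folders and load their handler module from its file path. The sketch below assumes the handlers live in a file called `handlers.py` that exposes a `HANDLERS` dict; the layout above lists `handlers` without an extension, so the file name is a guess made for illustration.

```python
import importlib.util
from pathlib import Path
from typing import Callable

HANDLERS_FILE = "handlers.py"  # assumed file name, see the note above


def load_skill_handlers(skill_dir: Path) -> dict[str, Callable]:
    """Load HANDLERS from a skill folder by file path, not by import name."""
    module_path = skill_dir / HANDLERS_FILE
    spec = importlib.util.spec_from_file_location(f"skill_{skill_dir.name}", module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load handlers from {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return dict(module.HANDLERS)


def discover_skills(root: Path) -> dict[str, dict[str, Callable]]:
    """Map each skill folder name to its handlers; folders without handlers are skipped."""
    return {
        path.name: load_skill_handlers(path)
        for path in sorted(root.iterdir())
        if (path / HANDLERS_FILE).is_file()
    }
```

With this shape, adding a domain really is dropping in a folder: nothing in the loader changes when a new skill appears under the root.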

Existing skills

`io` — sandboxed file operations

list_files, find_files, move_files, delete_files — natural language over a guarded sandbox. Destructive ops (move, delete) require confirmation or dry-run. The simplest reference skill; copy it to start a new CRUD-style domain.

`ffmpeg` — plain-English media workflows

The flagship. 13 model-visible intent tools, each expanded into a deterministic probe/recipe/render workflow:

prepare_for_platform   compress_video   convert_video   resize_video   trim_video
extract_audio          create_thumbnail batch_convert   concat_video   reverse_video
strip_audio            extract_frame    adjust_speed

Platform targets (WhatsApp, YouTube, Instagram Reels, TikTok, web, email, archive) and quality levels (small_file → lossless) are YAML profiles, not model output. The model maps "good enough" → visually_good; handlers load the profile and render the codec, CRF, scale, and container deterministically. The skill also handles requests in four additional languages through keyword aliases for Spanish, German, French, and Russian, with diacritic-normalized retrieval, at zero runtime cost.
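Diacritic-normalized alias lookup can be done with the standard library alone. The sketch below is an assumption about the general technique, not knaif's code: the alias table is invented for the example, and what the matched intents feed into (for instance, which few-shot examples are retrieved) is left open.

```python
import unicodedata

# Hypothetical alias table: keyword -> intent tool.
ALIASES = {
    "compress": "compress_video",
    "comprimir": "compress_video",     # Spanish
    "komprimieren": "compress_video",  # German
    "compresser": "compress_video",    # French
    "сжать": "compress_video",         # Russian
    "trim": "trim_video",
    "kürzen": "trim_video",            # German
    "découper": "trim_video",          # French
}


def normalize(text: str) -> str:
    """Case-fold, decompose, and drop combining marks (é -> e, ü -> u)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Keys are normalized once at load time, so lookups are plain dict hits.
INDEX = {normalize(keyword): tool for keyword, tool in ALIASES.items()}


def match_intents(utterance: str) -> list[str]:
    """Return intent tools whose aliases appear as whole words in the utterance."""
    return [INDEX[token] for token in normalize(utterance).split() if token in INDEX]
```

Normalizing both the table keys and the incoming text means "kürzen", "kurzen", and "KÜRZEN" all hit the same entry.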

A real example, end to end

knaif-cli run ffmpeg compress video.mp4 under 25 mb --model qwen3-4b

intent: 0.4s
ffmpeg › compress video.mp4 under 25 mb
  • will compress video.mp4 targeting 25 MB
    Proceed? [Y/n]: Y
    $ ffmpeg -y -i video.mp4 -c:v libx264 -crf 28 -preset medium -c:a aac -b:a 96k video_compressed.mp4
    ✓ video_compressed.mp4  (23.7 MB)
    4.1s

The model spent one short call deciding "this is compress_video, target 25 MB." The recipe, the CRF choice, the filename derivation, and the execution were all code.

Models

knaif is small-model-first. Skills declare a recommended model; the runtime resolves it at launch. Both llama.cpp (GGUF, full GPU offload) and Ollama are supported backends.

Model	Size	Role
Qwen3-4B (Q4_K_M)	4B	Default / recommended — best accuracy on the ffmpeg corpus
Qwen3-1.7B (Q8 / Q4)	1.7B	Low-footprint option
Phi-4-mini	~3.8B	Alternative 4B-class backend
Gemma3-4B	4B	Strong anchor, esp. via Ollama
Gemma3-1B, SmolLM3-3B	1–3B	Tested floor — see eval notes

There is no fine-tuned model shipped yet — the framework currently gets strong results with stock instruct quants. Fine-tuning per skill is the planned next lever.

Evaluation

knaif ships its own eval harness. A versioned JSONL corpus pairs each utterance with an expected outcome (plan / clarify / reject), the expected intent tool, a fixture video, and, crucially, a pre-recorded freeform-LLM baseline command for the same request. The scorer runs both knaif's rendered command and the baseline against the same fixture and compares the outputs with ffprobe (container, codecs, resolution, duration, file size). The harness evaluates with the same retrieved prompt the agent uses in production, and it records time-to-artifact latency (mean / p50 / p95) on plan rows.

Two verifier modes: cheap (parse the command, no binaries — used in dev/CI) and honest (actually run ffmpeg + probe the output — used before release). A snapshot + regression ratchet catches any metric that drops more than 2%.

Outcome accuracy by backend (70-case ffmpeg corpus)

qwen3-4b (llama.cpp)     ██████████████████▊  94.3%  (66/70)
gemma3-4b (Ollama)       ██████████████████▌  92.9%  (65/70)
gemma3-4b (llama.cpp)    █████████████████▋   88.6%  (62/70)
qwen3-4b (Ollama)        █████████████████    85.7%  (60/70)
qwen3-1.7b-q8            ████████████████▊    84.3%  (59/70)
qwen3-1.7b (Ollama)      ████████████████▊    84.3%  (59/70)
qwen3-1.7b-q4            ████████████████▊    84.3%  (59/70)
phi4-mini                ████████████████▌    82.9%  (58/70)
gemma3-1b                ████████             40.0%  (28/70)
smollm3-3b               ██████               30.0%  (21/70)

Backend	Variant	Outcome accuracy	Plans	Clarify	Reject	Error	Parse-error
qwen3-4b	llama.cpp	94.3%	49	13	6	2	0
gemma3-4b	Ollama	92.9%	52	10	6	2	0
gemma3-4b	llama.cpp	88.6%	50	10	6	4	0
qwen3-4b	Ollama	85.7%	43	19	7	0	1
qwen3-1.7b-q8	llama.cpp	84.3%	44	11	9	6	0
phi4-mini	llama.cpp	82.9%	40	15	8	7	0
gemma3-1b	llama.cpp	40.0%	26	14	4	12	14
smollm3-3b	llama.cpp	30.0%	14	4	4	1	47
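For readers who want to reproduce the arithmetic, here is a minimal scoring sketch. It assumes outcome accuracy is the share of corpus rows whose actual outcome equals the expected one, and that the five table columns are the possible actual outcomes; the function and field names are invented for the example.

```python
from collections import Counter

# Actual outcomes, matching the table columns above.
OUTCOMES = ("plan", "clarify", "reject", "error", "parse_error")


def score(rows: list[tuple[str, str]]) -> dict:
    """rows are (expected_outcome, actual_outcome) pairs, one per corpus case."""
    if not rows:
        raise ValueError("empty corpus")
    # Start every bucket at zero so empty buckets still show up in the report.
    buckets = Counter({name: 0 for name in OUTCOMES})
    buckets.update(actual for _, actual in rows)
    correct = sum(expected == actual for expected, actual in rows)
    return {
        "accuracy_pct": round(100 * correct / len(rows), 1),
        "buckets": dict(buckets),
    }
```

Keeping `parse_error` as its own bucket means a model that cannot emit valid JSON is counted as such rather than being folded into `clarify`; each table row's five buckets sum to the 70 corpus cases.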

What the numbers actually tell you

4B is the practical floor for this skill. Qwen3-4B and Gemma3-4B score between 85.7% and 94.3% depending on the backend, with stock weights and no fine-tuning. That's the headline: a free, local, 4B model reliably maps messy English to the right intent + args.
Below 4B it degrades, but legibly. The parse_error bucket exists precisely so the "model can't emit valid JSON" failure mode (gemma3-1b: 14, smollm3-3b: 47) doesn't hide inside "clarify." Most eval harnesses would have silently counted those as clarifications.
The cheap prompt tricks didn't matter, and we measured it. A 47% rendered-prompt reduction came out accuracy-neutral on a clean A/B. We kept the code as infra and skipped three planned optimizations the data didn't justify. The eval suite is what let us say no with confidence.

Where the real wins are next (per the data)

Constrained decoding (GBNF/JSON-schema sampling) would rescue ~40 of SmolLM3's 47 parse failures essentially for free — the small-model gap is JSON emission, not reasoning. After that: post-parse argument repair, and per-skill fine-tuning. None of these touch the architecture; they make the intent layer cheaper and more reliable, which is the only part that costs anything.

The sustainability argument

This is the part that matters beyond benchmarks. An agent that routes every request through a large model pays the inference cost per use, forever, and at data-center scale that is real electricity for work that is mostly deterministic. knaif pays the AI cost once, during development, then ships a product whose runtime cost is a single short call to a 4B model on the user's own machine. Most users run the same handful of tasks over and over; regenerating that logic from scratch every time, in the cloud, is the waste knaif is built to remove.

Write once, run free, run local, run the same way every time.

What it is not

Not a general autonomous coding agent. It won't refactor your repo.
Not a shell the model drives. The model cannot execute or emit commands.
Not a cloud service. Local-first, no account, no key.
Not magic for tiny models — sub-4B accuracy drops, and the eval says so plainly.

Scope discipline is intentional. knaif wins by being a sharp tool for narrow, deterministic domains, not a dull one for everything.

Roadmap

Constrained decoding to close the small-model JSON-emission gap.
Per-skill fine-tuned models, bundled into end-user installers.
More skills — the format is built for forking; community skills are the growth plan.
Native end-user distributions — Rust/C++ packaging of the deterministic engine plus a bundled model, shipped as per-platform installers (Windows, macOS, Linux), zero deps.

Getting started (developers)

This is the framework path, for authoring and running skills. End users don't do any of this — they get a native installer.

uv venv && source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\Activate.ps1          # Windows

just install

knaif-cli skills
knaif-cli run ffmpeg trim video.mp4 to 10 seconds --dry-run

License: Apache 2.0