Decision framework for agents

Arm your agent’s decisions with forecasting.

farness intercepts agent decisions and demands a forecast: a KPI, a numeric probability with a confidence interval, a base rate, disconfirming evidence, and a review date. It works with Codex, Claude Code, and any agent that speaks MCP.

Decision prompt:
Should we rewrite the auth layer now?
Reframed as:
P(critical auth incidents decrease by >40% in 90 days | rewrite now)
KPI:
critical_auth_incidents / 90d
Forecast:
rewrite now:
58% [42-71]
defer 60d:
31% [19-44]
Base rate:
27% of similar infra rewrites yielding material reliability gains
Disconfirming evidence:
ops fixes may solve this faster; rewrite could slip roadmap delivery; recent outage may overweight urgency
Review date:
2026-06-15
How farness works

From intuition to instrument

01

Intercept

Catch decision-language before the model hardens into advice. When a prompt sounds like 'Should we...?' or 'Which is better?', farness reframes it as a forecastable choice.

02

Reframe

Convert vague 'Should I?' into explicit, measurable outcome questions. Define the KPIs that would actually tell you whether the decision was good.

03

Anchor

Produce numeric forecasts with confidence intervals, reference classes from comparable situations, disconfirming evidence, and a review date for accountability.
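The first two steps above can be sketched as a minimal pipeline. This is an illustrative assumption, not the farness internals: the cue regex, function names, and question template are all hypothetical.

```python
import re

# Hypothetical decision-language cues (step 01); farness's real detector is not public.
DECISION_CUES = re.compile(r"\b(should (we|i)\b|which is better)", re.IGNORECASE)

def intercept(prompt: str) -> bool:
    """Step 01: catch decision-language before the model hardens into advice."""
    return bool(DECISION_CUES.search(prompt))

def reframe(kpi: str, event: str, horizon_days: int, option: str) -> str:
    """Step 02: turn a vague 'Should I?' into a measurable outcome question."""
    return f"P({kpi}: {event} in {horizon_days}d | {option})"

question = reframe("critical_auth_incidents", "decrease by >40%", 90, "rewrite now")
# -> "P(critical_auth_incidents: decrease by >40% in 90d | rewrite now)"
```

Step 03 then attaches the numeric anchors (forecast, interval, base rate, review date) to the reframed question.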

Workflow demo

Watch the packaged path end to end

The clip below shows the current Codex path exactly the way the docs describe it: install the package, register the local MCP server, use $farness in Codex, then pull the decision back out of the local store.

1. python -m pip install 'farness[mcp]'
2. farness setup codex
3. $farness inside Codex, then review the saved decision locally

Rendered from a real Codex session using the local farness skill and MCP server, then exported as a clean 4K terminal demo.

From intuition to forecast

What the framework forces into view

Diffuse prompt

“Should I refactor this module first?”

Farness output
KPI: bug_rate / 30d
Event: >25% bug reduction
Horizon: 90 days
Forecast: 44% [28-61]
Base rate: 22%
Disconfirming evidence: migration drag, auth edge cases
Review: 2026-06-15
Research

Stability-under-probing

11 Study 1 scenarios
2 studies in the paper
8 held-out validation cases

The paper introduces stability-under-probing as a way to evaluate decision prompts without waiting for outcomes. In Study 1, farness prompts held up better under the shared probe battery on Claude Opus 4.6 and GPT-5.4.

Study 2 then added held-out probes and showed that the broader claim weakens sharply off-framework. That makes the paper a methods result first, not proof that farness is universally superior.

The useful claim is narrower and better: structured decision prompts can be tested empirically, and farness is one case study.

Output primitives

What farness produces

KPI

What outcome actually matters. Defined before the analysis, not after.

Forecast

Numeric probability for each option. Not opinions — predictions you can score.

Confidence interval

The honest range around the estimate. Calibrated uncertainty, not false precision.

Base rate

What usually happens in comparable situations. The outside view as empirical anchor.

Disconfirming evidence

What counter-evidence, failure modes, or decision traps could make the leading option wrong.

Review date

When to check the forecast against reality. Accountability built in.
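Taken together, the primitives above form one record per decision. A minimal sketch of that shape, assuming a plain dataclass; the field names are illustrative, not the farness schema:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    # Hypothetical record shape for the six farness output primitives.
    kpi: str                                   # outcome metric, defined before analysis
    forecasts: dict[str, float]                # option -> numeric probability
    intervals: dict[str, tuple[float, float]]  # option -> confidence interval
    base_rate: float                           # outside view from comparable situations
    disconfirming: list[str] = field(default_factory=list)
    review_date: str = ""                      # when to score the forecast

record = DecisionRecord(
    kpi="critical_auth_incidents / 90d",
    forecasts={"rewrite now": 0.58, "defer 60d": 0.31},
    intervals={"rewrite now": (0.42, 0.71), "defer 60d": (0.19, 0.44)},
    base_rate=0.27,
    disconfirming=["ops fixes may solve this faster"],
    review_date="2026-06-15",
)
```

Note that every field is filled in before acting, so the record can be scored against reality on the review date.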

AI is often fluent about decisions before it is rigorous about them. farness adds structure before confidence hardens into action.

Agent integrations

Use it natively or from the CLI

Farness now has a package-first agent path: a local MCP server for persistence, packaged skills for Codex and Claude Code, and the same forecast structure used in the paper. The Claude plugin remains optional, and the CLI is a local store and calibration surface, not an LLM client. If setup drifts, `farness doctor --fix` repairs the local integration.

Codex

Install the package, run one setup command, then use $farness when a decision prompt shows up.

$ python -m pip install 'farness[mcp]'
$ farness setup codex
$ # restart Codex, then use $farness
Claude Code

Use the same single-command setup flow for Claude. The plugin is still available if you prefer slash-command UX.

$ python -m pip install 'farness[mcp]'
$ farness setup claude
$ # restart Claude Code
CLI / Python

Local decision log and calibration tool. No LLM API key required unless you run separate experiment code against external models.

$ python -m pip install farness
$ farness new "Should we rewrite the auth layer?"
$ farness calibration
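Once forecasts resolve, calibration means scoring predicted probabilities against outcomes. The standard metric for that is the Brier score; this sketch shows the arithmetic only and is not the farness implementation:

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between predicted probability and outcome; 0 is perfect."""
    return sum((p - float(happened)) ** 2 for p, happened in forecasts) / len(forecasts)

# Two resolved forecasts: 58% that incidents dropped (they did),
# 31% for the deferred option (they did not).
score = brier_score([(0.58, True), (0.31, False)])
# score is about 0.136; an always-50% forecaster averages 0.25.
```

Lower scores over many resolved decisions are what would indicate the forecasts are calibrated rather than merely confident.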

See further before you decide.

Start with farness