farness intercepts agent decisions and demands a forecast: a KPI, a confidence interval, a base rate, disconfirming evidence, and a review date. Works with Codex, Claude Code, and any agent that speaks MCP.
Catch decision language before the model's fluency hardens into advice. When a prompt sounds like "Should we...?" or "Which is better?", farness reframes it as a forecastable choice.
Convert vague 'Should I?' into explicit, measurable outcome questions. Define the KPIs that would actually tell you whether the decision was good.
Produce numeric forecasts with confidence intervals, reference classes from comparable situations, disconfirming evidence, and a review date for accountability.
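The first step, catching decision language, can be approximated with simple pattern matching. The sketch below is illustrative only: the patterns and the `looks_like_decision` helper are assumptions for this example, not farness's actual detector.

```python
import re

# Illustrative decision-language patterns -- assumptions for this
# sketch, not the real rules farness ships with.
DECISION_PATTERNS = [
    r"^\s*should\s+(i|we)\b",
    r"\bwhich\s+(is|one\s+is)\s+better\b",
    r"\bis\s+it\s+worth\b",
]

def looks_like_decision(prompt: str) -> bool:
    """Return True when the prompt reads as a decision question."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in DECISION_PATTERNS)
```

Anything the detector flags would then be reframed as an outcome question rather than answered directly.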
The clip below shows the current Codex path exactly the way the docs describe it: install the package, register the local MCP server, use $farness in Codex, then pull the decision back out of the local store.
Rendered from a real Codex session using the local farness skill and MCP server, then exported as a clean 4K terminal demo.
“Should I refactor this module first?”
The paper introduces stability-under-probing as a way to evaluate decision prompts without waiting for outcomes. In Study 1, farness held up better under the shared probe battery on Claude Opus 4.6 and GPT-5.4.
Study 2 then added held-out probes and showed that the broader claim weakens sharply off-framework. That makes the paper a methods result first, not proof that farness is universally superior.
The useful claim is narrower and better: structured decision prompts can be tested empirically, and farness is one case study.
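One way to picture stability-under-probing is as agreement of a forecast across paraphrased probes of the same decision. The metric below is an assumption for illustration, not the paper's exact definition: it scores the fraction of probe forecasts that stay within a tolerance of the median.

```python
from statistics import median

def stability(probe_forecasts: list[float], tolerance: float = 0.05) -> float:
    """Fraction of probe forecasts within `tolerance` of the median.

    1.0 means the forecast is fully stable under probing; a low score
    means paraphrasing the prompt moves the number around.
    """
    m = median(probe_forecasts)
    close = sum(1 for p in probe_forecasts if abs(p - m) <= tolerance)
    return close / len(probe_forecasts)
```

Under this toy definition, a prompt whose probability estimate jumps from 0.62 to 0.30 when reworded would score as unstable, which is the failure mode the probe battery is meant to surface.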
What outcome actually matters. Defined before the analysis, not after.
Numeric probability for each option. Not opinions — predictions you can score.
The honest range around the estimate. Calibrated uncertainty, not false precision.
What usually happens in comparable situations. The outside view as empirical anchor.
What counter-evidence, failure modes, or decision traps could make the leading option wrong.
When to check the forecast against reality. Accountability built in.
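The six elements above can be captured as one record per decision. This dataclass is an illustrative shape, labeled as an assumption; it is not farness's stored schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Forecast:
    # Hypothetical record shape -- not farness's actual schema.
    kpi: str                         # outcome that actually matters, defined up front
    probabilities: dict[str, float]  # numeric probability per option
    interval: tuple[float, float]    # honest range around the leading estimate
    base_rate: float                 # what usually happens in comparable situations
    disconfirming: list[str]         # evidence that would make the leading option wrong
    review_date: date                # when to check the forecast against reality

    def leading_option(self) -> str:
        """Option with the highest assigned probability."""
        return max(self.probabilities, key=self.probabilities.get)
```

Keeping all six fields in one record is what makes the review date enforceable: the forecast, its uncertainty, and its counter-evidence are scored together when the date arrives.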
AI is often fluent about decisions before it is rigorous about them. farness adds structure before confidence hardens into action.
Farness now has a package-first agent path: a local MCP server for persistence, packaged skills for Codex and Claude Code, and the same forecast structure used in the paper. The Claude plugin remains optional, and the CLI is a local store and calibration surface, not an LLM client. If setup drifts, `farness doctor --fix` repairs the local integration.
Install the package, run one setup command, then use $farness when a decision prompt shows up.
$ python -m pip install 'farness[mcp]'
$ farness setup codex
$ # restart Codex, then use $farness
Use the same single-command setup flow for Claude. The plugin is still available if you prefer slash-command UX.
$ python -m pip install 'farness[mcp]'
$ farness setup claude
$ # restart Claude Code
Local decision log and calibration tool. No LLM API key required unless you run separate experiment code against external models.
$ python -m pip install farness
$ farness new "Should we rewrite the auth layer?"
$ farness calibration
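Calibration over a decision log is commonly scored with a Brier score: the mean squared error between each predicted probability and what actually happened. This is a generic sketch of that standard metric, not the output format of `farness calibration`.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between predicted probability and outcome.

    Each item is (predicted_probability, outcome_happened).
    0.0 is perfect; 0.25 is what always guessing 50% earns.
    """
    return sum((p - float(outcome)) ** 2 for p, outcome in forecasts) / len(forecasts)
```

Reviewing forecasts on their review dates and feeding the results into a score like this is what turns the local decision log into a calibration surface.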