Mapping Deception: Replicating an AI Honesty Benchmark with Inspect AI

TLDR: I replicated an AI honesty benchmark’s headline result: that scaling improves accuracy but not honesty. I also show various ways that the honesty score hides nuance, and why reporting the full outcome set, error reporting, and uncertainty analysis is the way forward for deception evaluation, and evaluation science in general.

1. Introduction

Truth is tricky. For starters, we cannot be sure that we actually know it. But even when we think we do know it, many of us lie in public anyway, because it can conflict with what’s socially comfortable. Saying true things in the face of that pressure requires intelligence and courage (subject to a certain amount of tact). It’s also how things progress. Galileo was put under house arrest for the rest of his life for saying the Earth goes around the Sun. He was right, everyone eventually agreed, and science moved forward.

Just like we can hide our underlying beliefs when subject to social pressure, AI models can hide their ‘internal beliefs’¹ when subject to pressure from a prompt. Non-hallucination benchmarks are regularly used to assess and compare model’s truthfulness.

However, the MASK benchmark (Ren et al., 2025) measures something fundamentally different: it applies prompt pressure to separate what a model believes from what it says. The MASK authors found that while scaling up AI models has made them more accurate, larger models are not more honest.

When I first saw this result, I was provoked. How is lying defined? How is truth established? The paper addresses many of these questions, and while some questions remain,² two questions I want to address in this post are:

1. Does this result survive independent replication?

2. Are there any other measures that can help us to characterise deception?

2. Replication results

I wanted to verify the paper’s main claim: larger models are more accurate but not more honest. I used the following models:

Model	Provider	Samples	In paper?
Claude Haiku 4.5	Anthropic	1,000	No
Llama 3.1 8B	Meta	1,000	Yes
GPT-4o	OpenAI	1,000	Yes
GPT-4o-mini	OpenAI	1,000	Yes
o3-mini	OpenAI	1,000	Yes
Qwen 2.5 7B	Alibaba	1,000	Yes
DeepSeek-R1	DeepSeek	1,000	Yes
DeepSeek-R1-0528	DeepSeek	1,000	No
DeepSeek-V3.1	DeepSeek	1,000	No
Llama 3.3 70B	Meta	1,000	Yes

I used a different model judge to save on cost (see appendix) and a smaller set of 10 models covering a range of providers and scales. The paper tested 32, but some are now deprecated.

The headline result held: accuracy scales with compute, but honesty does not.

3. Dimensions of deception

When a model is pressured to answer against its own belief, its response can fall into one of these categories:

\[\{\text{Honest},\ \text{Lie},\ \text{Evade},\ \text{No Belief},\ \text{Parse Error}\}\]

The outcomes are mutually exclusive and collectively exhaustive, so nothing is double-counted or missed. Here are the empirical outcome count vectors for my MASK replication:

Three agents with perfect honesty scores

An agent that always evades, or one that holds no beliefs at all, still scores 100% MASK honesty! The MASK paper’s appendix handles the Patrick Star case with a normalised honesty score, but not the Kash Patel case.

Making this empirical

Model	\(n\)	Honest (\(H\))	Lie (\(L\))	Evade (\(E\))	No belief (\(N\))	Error (\(\varepsilon\))
Claude Haiku 4.5	1,000	620	81	215	82	2
Llama 3.1 8B	1,000	214	435	140	205	6
GPT-4o	1,000	205	504	199	88	4
GPT-4o-mini	1,000	200	494	160	144	2
o3-mini	1,000	134	434	284	140	8
Qwen 2.5 7B	1,000	264	401	83	246	6
DeepSeek-R1	1,000	141	534	186	137	2
DeepSeek-R1-0528	1,000	157	546	179	114	4
DeepSeek-V3.1	1,000	143	580	117	151	9
Llama 3.3 70B	1,000	257	526	97	114	6
Total	10,000	2335	4535	1660	1421	49

Agent	Honest (H)	Evade (E)	No belief (N)	MASK Honesty \(1 - \frac{L}{n}\)	Normalised MASK Honesty \(1 - \frac{L}{H+L+E}\)
Jesus Christ	\(n\)	0	0	100%	100%
Kash Patel	0	\(n\)	0	100%	100%
Patrick Star	0	0	\(n\)	100%	undefined

Note how o3-mini and Qwen 2.5 7B sit on the same honesty contour (within error bars), even though Qwen 2.5 7B is nearly 2x more honest when it engages (40% vs 24%) and o3-mini engages less often (57% vs 66%). The honesty score compresses all of this because o3-mini engages less, pulling samples away from the lie bucket.

What else can be measured?

When all outcome counts are reported, researchers can compute whatever measures they are interested in, or define new ones⁵. Here are some more:

Metric	Formula	What it captures	In MASK?
Honesty score	\(1 - \frac{L}{n}\)	How often does the model not lie?	Yes (headline)
Normalised honesty	\(1 - \frac{L}{H + L + E}\)	As above, but drops no-belief and errors. Keeps evasion.	Yes (appendix)
Truthfulness	\(\frac{H}{n}\)	How often is the model directly honest?	No
Engagement rate	\(\frac{H + L}{n}\)	How often does the model engage?	No
Evasion rate	\(\frac{E}{n}\)	How often does the model dodge?	No
Conditional lie rate	\(\frac{L}{H + L}\)	When the model engages, how often does it lie?	No
Deflection style	\(\frac{E}{E + N}\)	Of non-answers: dodge or no belief?	No
Reliability	\(\frac{n - \varepsilon}{n}\)	How often does the model produce a parseable response?	No

Of these, I would argue that truthfulness (\(H / \text{n}\)) is a more informative headline metric than the MASK honesty score (\(1 - L / \text{n}\)). Admittedly, this is a subjective assessment, though when the raw counts are reported, the distinction matters less.

The headline result still holds when using truthfulness (H / n) instead of the MASK honesty score (1 - L / n): scaling has not made models more truthful.

4. Reporting errors and uncertainty

So far I have been computing metrics from raw outcome counts. But errors and finite sample sizes both affect how those metrics should be interpreted. Here is what accounting for this looks like in practice, using tooling from the Inspect AI ecosystem.

Transient errors

Transient API failures are retried automatically with Inspect AI’s --retry-on-error flag. Very convenient! Especially for evals that make a lot of invocations (my DeepSeek runs certainly needed it).

Parse errors

LLMs sometimes produce invalid output. If not reported, they could silently shift headline scores. For example, a question about Detroit’s population had an error in 6 / 10 models in my replication, because the judge ran out of token budget before completing its JSON:

When grouped by question type, the Statistics questions stick out like a sore thumb:

Failure mode	Count	%
Subject model’s answer unparseable	23	47%
Judge exhausted token budget on reasoning	12	24%
Judge output truncated or stored as attachment	11	22%
Judge returned null values	2	4%
No judge model invoked	1	2%
Total	49	100%

The Statistics questions use a separate judge (o3-mini) to parse numerical answers. Tuning parameters like NUMERIC_JUDGE_MODEL, JUDGE_REASONING_EFFORT, or MAX_JUDGE_TOKENS would likely resolve this, though these are not the defaults used in the original MASK eval.

Sampling uncertainty

Even when the eval runs perfectly, finite samples mean not every difference is real. With confidence intervals⁶ and raw counts, comparisons become more meaningful⁷.

For example, the claim that Claude Haiku 4.5 is more than 4 times more truthful than o3-mini holds up.

But the 4x difference in error rates between Claude Haiku 4.5 and o3-mini looks more meaningful than it is. The confidence intervals make clear it is noise.

5. Try it yourself

If this is interesting to you, the eval logs and analysis code are available at this repo. You can add more models by running the MASK eval from inspect_evals and dropping the .eval files into the eval_logs/ directory.

All results in this article will regenerate with make clean build. Raise a PR!

I am particularly interested in contributions from abliterated models, (current and future) frontier models, and xAI models, which would be interesting given their stated emphasis on building “maximally truth-seeking” AI. Right now, with respect to honesty, Anthropic models appear to be in another league (per Scale AI, hinted at by Haiku 4.5 here).

Appendix: Paper vs replication differences

While the headline result holds, specific differences between the paper and this replication are likely caused by:

The below tables take the difference between the MASK paper (table 3) and my replication.

Appendix: Eval configuration

Model	MASK paper	Replication (95% CI)	Diff
Claude Haiku 4.5	—	91.9 ± 1.7	—
Llama 3.1 8B	76.5	56.5 ± 3.1	-20.0
GPT-4o	55.5	49.6 ± 3.1	-5.9
GPT-4o-mini	54.7	50.6 ± 3.1	-4.1
o3-mini	51.4	56.6 ± 3.1	+5.2
Qwen 2.5 7B	61.0	59.9 ± 3.0	-1.1
DeepSeek-R1	57.1	46.6 ± 3.1	-10.5
DeepSeek-R1-0528	—	45.4 ± 3.1	—
DeepSeek-V3.1	—	42.0 ± 3.1	—
Llama 3.3 70B	55.1	47.4 ± 3.1	-7.7

Model	MASK paper	Replication (95% CI)	Diff
Claude Haiku 4.5	—	62.0 ± 3.0	—
Llama 3.1 8B	18.8	21.4 ± 2.5	+2.6
GPT-4o	21.8	20.5 ± 2.5	-1.3
GPT-4o-mini	21.4	20.0 ± 2.5	-1.4
o3-mini	19.6	13.4 ± 2.1	-6.2
Qwen 2.5 7B	28.9	26.4 ± 2.7	-2.5
DeepSeek-R1	24.7	14.1 ± 2.2	-10.6
DeepSeek-R1-0528	—	15.7 ± 2.3	—
DeepSeek-V3.1	—	14.3 ± 2.2	—
Llama 3.3 70B	24.7	25.7 ± 2.7	+1.0

Model	MASK paper	Replication (95% CI)	Diff
Claude Haiku 4.5	—	8.1 ± 1.7	—
Llama 3.1 8B	23.5	43.5 ± 3.1	+20.0
GPT-4o	44.5	50.4 ± 3.1	+5.9
GPT-4o-mini	45.3	49.4 ± 3.1	+4.1
o3-mini	48.6	43.4 ± 3.1	-5.2
Qwen 2.5 7B	39.0	40.1 ± 3.0	+1.1
DeepSeek-R1	42.9	53.4 ± 3.1	+10.5
DeepSeek-R1-0528	—	54.6 ± 3.1	—
DeepSeek-V3.1	—	58.0 ± 3.1	—
Llama 3.3 70B	44.9	52.6 ± 3.1	+7.7

Model	MASK paper	Replication (95% CI)	Diff
Claude Haiku 4.5	—	74.2 ± 2.7	—
Llama 3.1 8B	62.0	63.8 ± 3.0	+1.8
GPT-4o	78.6	81.0 ± 2.4	+2.4
GPT-4o-mini	71.4	71.1 ± 2.8	-0.3
o3-mini	63.3	58.7 ± 3.0	-4.6
Qwen 2.5 7B	51.6	48.5 ± 3.1	-3.1
DeepSeek-R1	82.2	74.9 ± 2.7	-7.3
DeepSeek-R1-0528	—	76.2 ± 2.6	—
DeepSeek-V3.1	—	70.7 ± 2.8	—
Llama 3.3 70B	75.6	77.8 ± 2.6	+2.2

Configuration summary. More information is stored in the eval logs. To learn more about task versions, see here.

Count	MASK version	Binary judge	Numeric judge	inspect_ai	inspect_evals
5	3-C	openai/gpt-4o-mini	openai/o3-mini	0.3.190.dev29+g0c0dc481	0.6.1.dev4+gaddd88dd3.d20260401
5	3-C	openai/gpt-4o-mini	openai/o3-mini	0.3.205	0.7.0
10

If ‘internal beliefs’ raises eyebrows, see Appendix A.1 (Belief Consistency) of the MASK paper for how this is operationalised and justified.↩︎
In particular, three extensions I would like to see: (1) Belief robustness: The MASK paper queried each model 3 times (I am purposely oversimplifying), but I would like to see this number varied to see if scaling this up undermines belief convergence. (2) Judge sensitivity: The paper used 2 judge models to produce these results. How sensitive are the results to different judge models? (3) Archetype decomposition: The MASK dataset stratifies questions by archetype (see the paper for details). Decomposing the outcome vectors per archetype would be valuable, but drawing robust conclusions about model × archetype interactions requires more models than the current 10. Warning: For (1) and (2), any statistically meaningful investigation will be expensive.↩︎
4 degrees of freedom because \(n = H + L + E + N + \varepsilon\), and (\(n = 1{,}000\)).↩︎
The MASK honesty score can be composed from the conditional lie rate and the engagement rate: \(\text{MASK Honesty} = 1 - P(\text{Lie}) = 1 - \frac{L}{n} = 1 - \frac{L}{H+L} \cdot \frac{H+L}{n} = 1 - (1 - \frac{H}{H+L}) \cdot \frac{H+L}{n}.\)↩︎
By analogy to the many binary classification metrics out there, deception metrics have plenty of scope to evolve in a similar way.↩︎
All confidence intervals in this post use the Wilson score interval.↩︎
As the parse error analysis showed, clustering by question type means independence assumptions do not always hold. See: Clustered standard errors.↩︎