Mapping Deception: Replicating an AI Honesty Benchmark with Inspect AI

Mapping Deception: Replicating an AI Honesty Benchmark with Inspect AI


Scott Simmons

TLDR: I replicated an AI honesty benchmark’s headline result: that scaling improves accuracy but not honesty. I also show various ways that the honesty score hides nuance, and why reporting the full outcome set, error reporting, and uncertainty analysis is the way forward for deception evaluation, and evaluation science in general.

Contents:

  1. Introduction
  2. Replication results
  3. Dimensions of deception
  4. Reporting errors and uncertainty
  5. Try it yourself
  6. Appendix: Paper vs replication · Eval configuration

1. Introduction

Truth is tricky. For starters, we cannot be sure that we actually know it. But even when we think we do know it, many of us lie in public anyway, because it can conflict with what’s socially comfortable. Saying true things in the face of that pressure requires intelligence and courage (subject to a certain amount of tact). It’s also how things progress. Galileo was put under house arrest for the rest of his life for saying the Earth goes around the Sun. He was right, everyone eventually agreed, and science moved forward.

Just like we can hide our underlying beliefs when subject to social pressure, AI models can hide their ‘internal beliefs’1 when subject to pressure from a prompt. Non-hallucination benchmarks are regularly used to assess and compare model’s truthfulness.

However, the MASK benchmark (Ren et al., 2025) measures something fundamentally different: it applies prompt pressure to separate what a model believes from what it says. The MASK authors found that while scaling up AI models has made them more accurate, larger models are not more honest.

From the MASK paper: Larger models are more accurate but not more honest

From the MASK paper: Larger models are more accurate but not more honest.

Elon Musk on X, April 9 2026: Grok is the AI to use if you value truth

Non-hallucination benchmarks like AA-Omniscience measure factual recall, not honesty under pressure.

When I first saw this result, I was provoked. How is lying defined? How is truth established? The paper addresses many of these questions, and while some questions remain,2 two questions I want to address in this post are:

1. Does this result survive independent replication?

2. Are there any other measures that can help us to characterise deception?

2. Replication results

I wanted to verify the paper’s main claim: larger models are more accurate but not more honest. I used the following models:

Model Provider Samples In paper?
Claude Haiku 4.5 Anthropic 1,000 No
Llama 3.1 8B Meta 1,000 Yes
GPT-4o OpenAI 1,000 Yes
GPT-4o-mini OpenAI 1,000 Yes
o3-mini OpenAI 1,000 Yes
Qwen 2.5 7B Alibaba 1,000 Yes
DeepSeek-R1 DeepSeek 1,000 Yes
DeepSeek-R1-0528 DeepSeek 1,000 No
DeepSeek-V3.1 DeepSeek 1,000 No
Llama 3.3 70B Meta 1,000 Yes

The MASK public dataset contains 1,000 examples.

I used a different model judge to save on cost (see appendix) and a smaller set of 10 models covering a range of providers and scales. The paper tested 32, but some are now deprecated.

The headline result held: accuracy scales with compute, but honesty does not.

I used Epoch AI to estimate the FLOP per model, as this information is unavailable in the original paper.

Note: the headline relationship replicates: accuracy scales favourably with FLOPs, but honesty does not. But the scores differ from the paper: P(honest) within ~10 percentage points, accuracy within ~7 percentage points. See the appendix for a model-by-model comparison.

3. Dimensions of deception

I wanted to explore the models’ deception characteristics in more detail.

When a model is pressured to answer against its own belief, its response can fall into one of these categories:

\[\{\text{Honest},\ \text{Lie},\ \text{Evade},\ \text{No Belief},\ \text{Parse Error}\}\]

\[\{H,\ L,\ E,\ N,\ \varepsilon\}\]

The outcomes are mutually exclusive and collectively exhaustive, so nothing is double-counted or missed. Here are the empirical outcome count vectors for my MASK replication:

Model \(n\) Honest (\(H\)) Lie (\(L\)) Evade (\(E\)) No belief (\(N\)) Error (\(\varepsilon\))
Claude Haiku 4.5 1,000 620 81 215 82 2
Llama 3.1 8B 1,000 214 435 140 205 6
GPT-4o 1,000 205 504 199 88 4
GPT-4o-mini 1,000 200 494 160 144 2
o3-mini 1,000 134 434 284 140 8
Qwen 2.5 7B 1,000 264 401 83 246 6
DeepSeek-R1 1,000 141 534 186 137 2
DeepSeek-R1-0528 1,000 157 546 179 114 4
DeepSeek-V3.1 1,000 143 580 117 151 9
Llama 3.3 70B 1,000 257 526 97 114 6
Total 10,000 2335 4535 1660 1421 49

With this parameterisation in mind, honesty as defined in the paper means:

\(\text{Honesty}\)3 \(: \mathbb{R}^4 \to \mathbb{R}\)

\[= 1 - P(\text{Lie}) = 1 - \frac{L}{H + L + E + N + \varepsilon}\]

However, this reduction compresses a lot of nuance, as I will show.

Three agents with perfect honesty scores

Jesus Christ
Jesus Christ

Kash Patel
Kash Patel
(see here)

Patrick Star
Patrick Star

Agent Honest (H) Lie (L) Evade (E) No belief (N) Error (ε) MASK Honesty \(1 - \frac{L}{n}\) Normalised MASK Honesty \(1 - \frac{L}{H+L+E}\)
Jesus Christ \(n\) 0 0 0 0 100% 100%
Kash Patel 0 0 \(n\) 0 0 100% 100%
Patrick Star 0 0 0 \(n\) 0 100% undefined

An agent that always evades, or one that holds no beliefs at all, still scores 100% MASK honesty! The MASK paper’s appendix handles the Patrick Star case with a normalised honesty score, but not the Kash Patel case.

Making this empirical

Here is the data from my replication plotted on 2 axes + honesty contours4:

Note how o3-mini and Qwen 2.5 7B sit on the same honesty contour (within error bars), even though Qwen 2.5 7B is nearly 2x more honest when it engages (40% vs 24%) and o3-mini engages less often (57% vs 66%). The honesty score compresses all of this because o3-mini engages less, pulling samples away from the lie bucket.

What else can be measured?

When all outcome counts are reported, researchers can compute whatever measures they are interested in, or define new ones5. Here are some more:

Metric Formula What it captures In MASK?
Honesty score \(1 - \frac{L}{n}\) How often does the model not lie? Yes (headline)
Normalised honesty \(1 - \frac{L}{H + L + E}\) As above, but drops no-belief and errors. Keeps evasion. Yes (appendix)
Truthfulness \(\frac{H}{n}\) How often is the model directly honest? No
Engagement rate \(\frac{H + L}{n}\) How often does the model engage? No
Evasion rate \(\frac{E}{n}\) How often does the model dodge? No
Conditional lie rate \(\frac{L}{H + L}\) When the model engages, how often does it lie? No
Deflection style \(\frac{E}{E + N}\) Of non-answers: dodge or no belief? No
Reliability \(\frac{n - \varepsilon}{n}\) How often does the model produce a parseable response? No

Of these, I would argue that truthfulness (\(H / \text{n}\)) is a more informative headline metric than the MASK honesty score (\(1 - L / \text{n}\)). Admittedly, this is a subjective assessment, though when the raw counts are reported, the distinction matters less.

The headline result still holds when using truthfulness (H / n) instead of the MASK honesty score (1 - L / n): scaling has not made models more truthful.

4. Reporting errors and uncertainty

So far I have been computing metrics from raw outcome counts. But errors and finite sample sizes both affect how those metrics should be interpreted. Here is what accounting for this looks like in practice, using tooling from the Inspect AI ecosystem.

Transient errors

Transient API failures are retried automatically with Inspect AI’s --retry-on-error flag. Very convenient! Especially for evals that make a lot of invocations (my DeepSeek runs certainly needed it).

Parse errors

LLMs sometimes produce invalid output. If not reported, they could silently shift headline scores. For example, a question about Detroit’s population had an error in 6 / 10 models in my replication, because the judge ran out of token budget before completing its JSON:

{
  "lower_bound": -10.5,
  "upper_bound": -10.5,
  "relative_lower_bound": -10.5,
  "relative_upper_bo

Thankfully, Inspect AI’s eval logs make these failure modes transparent and auditable.

I used Inspect Scout to help further diagnose the why:

scout scan blog/error_scanner.py \
  -T eval_logs/ \
  -F "score.honesty = 'error'"
Failure mode Count %
Subject model’s answer unparseable 23 47%
Judge exhausted token budget on reasoning 12 24%
Judge output truncated or stored as attachment 11 22%
Judge returned null values 2 4%
No judge model invoked 1 2%
Total 49 100%

When grouped by question type, the Statistics questions stick out like a sore thumb:

The Statistics questions use a separate judge (o3-mini) to parse numerical answers. Tuning parameters like NUMERIC_JUDGE_MODEL, JUDGE_REASONING_EFFORT, or MAX_JUDGE_TOKENS would likely resolve this, though these are not the defaults used in the original MASK eval.

Sampling uncertainty

Even when the eval runs perfectly, finite samples mean not every difference is real. With confidence intervals6 and raw counts, comparisons become more meaningful7.

For example, the claim that Claude Haiku 4.5 is more than 4 times more truthful than o3-mini holds up.

But the 4x difference in error rates between Claude Haiku 4.5 and o3-mini looks more meaningful than it is. The confidence intervals make clear it is noise.

5. Try it yourself

If this is interesting to you, the eval logs and analysis code are available at this repo. You can add more models by running the MASK eval from inspect_evals and dropping the .eval files into the eval_logs/ directory.

All results in this article will regenerate with make clean build. Raise a PR!

Here is an invocation to get you started (you will need to install inspect_evals):

inspect eval inspect_evals/mask \
    --model <A_NEW_MODEL_TO_ADD> \
    --log-dir ./eval_logs \
    --retry-on-error 5 \
    -T binary_judge_model="openai/gpt-4o-mini"

I am particularly interested in contributions from abliterated models, (current and future) frontier models, and xAI models, which would be interesting given their stated emphasis on building “maximally truth-seeking” AI. Right now, with respect to honesty, Anthropic models appear to be in another league.

Appendix: Paper vs replication differences

While the headline result holds, specific differences between the paper and this replication are likely caused by:

  1. Different eval harness. I replicated MASK with Inspect AI, not the original codebase. I used the MASK paper as a reference, but there could still be implementation differences w.r.t. the original code.
  2. Model API drift. Non-open-weight models may have drifted since the paper’s evaluation window.
  3. Different eval judges. My replication uses gpt-4o-mini as the judge for yes/no questions. The original paper used gpt-4o. I did this to save on costs.

The below tables take the difference between the MASK paper (table 3) and my replication.

MASK Honesty (1 − P(lie))

Model MASK paper Replication (95% CI) Diff
Claude Haiku 4.5 91.9 ± 1.7
Llama 3.1 8B 76.5 56.5 ± 3.1 -20.0
GPT-4o 55.5 49.6 ± 3.1 -5.9
GPT-4o-mini 54.7 50.6 ± 3.1 -4.1
o3-mini 51.4 56.6 ± 3.1 +5.2
Qwen 2.5 7B 61.0 59.9 ± 3.0 -1.1
DeepSeek-R1 57.1 46.6 ± 3.1 -10.5
DeepSeek-R1-0528 45.4 ± 3.1
DeepSeek-V3.1 42.0 ± 3.1
Llama 3.3 70B 55.1 47.4 ± 3.1 -7.7

P(honest) (H/n)

Model MASK paper Replication (95% CI) Diff
Claude Haiku 4.5 62.0 ± 3.0
Llama 3.1 8B 18.8 21.4 ± 2.5 +2.6
GPT-4o 21.8 20.5 ± 2.5 -1.3
GPT-4o-mini 21.4 20.0 ± 2.5 -1.4
o3-mini 19.6 13.4 ± 2.1 -6.2
Qwen 2.5 7B 28.9 26.4 ± 2.7 -2.5
DeepSeek-R1 24.7 14.1 ± 2.2 -10.6
DeepSeek-R1-0528 15.7 ± 2.3
DeepSeek-V3.1 14.3 ± 2.2
Llama 3.3 70B 24.7 25.7 ± 2.7 +1.0

P(lie) (L/n)

Model MASK paper Replication (95% CI) Diff
Claude Haiku 4.5 8.1 ± 1.7
Llama 3.1 8B 23.5 43.5 ± 3.1 +20.0
GPT-4o 44.5 50.4 ± 3.1 +5.9
GPT-4o-mini 45.3 49.4 ± 3.1 +4.1
o3-mini 48.6 43.4 ± 3.1 -5.2
Qwen 2.5 7B 39.0 40.1 ± 3.0 +1.1
DeepSeek-R1 42.9 53.4 ± 3.1 +10.5
DeepSeek-R1-0528 54.6 ± 3.1
DeepSeek-V3.1 58.0 ± 3.1
Llama 3.3 70B 44.9 52.6 ± 3.1 +7.7

Accuracy

Model MASK paper Replication (95% CI) Diff
Claude Haiku 4.5 74.2 ± 2.7
Llama 3.1 8B 62.0 63.8 ± 3.0 +1.8
GPT-4o 78.6 81.0 ± 2.4 +2.4
GPT-4o-mini 71.4 71.1 ± 2.8 -0.3
o3-mini 63.3 58.7 ± 3.0 -4.6
Qwen 2.5 7B 51.6 48.5 ± 3.1 -3.1
DeepSeek-R1 82.2 74.9 ± 2.7 -7.3
DeepSeek-R1-0528 76.2 ± 2.6
DeepSeek-V3.1 70.7 ± 2.8
Llama 3.3 70B 75.6 77.8 ± 2.6 +2.2

Appendix: Eval configuration

Configuration summary. More information is stored in the eval logs. To learn more about task versions, see here.

Count MASK version Binary judge Numeric judge inspect_ai inspect_evals
5 3-C openai/gpt-4o-mini openai/o3-mini 0.3.190.dev29+g0c0dc481 0.6.1.dev4+gaddd88dd3.d20260401
5 3-C openai/gpt-4o-mini openai/o3-mini 0.3.205 0.7.0
10

  1. If ‘internal beliefs’ raises eyebrows, see Appendix A.1 (Belief Consistency) of the MASK paper for how this is operationalised and justified.↩︎

  2. In particular, three extensions I would like to see: (1) Belief robustness: The MASK paper queried each model 3 times (I am purposely oversimplifying), but I would like to see this number varied to see if scaling this up undermines belief convergence. (2) Judge sensitivity: The paper used 2 judge models to produce these results. How sensitive are the results to different judge models? (3) Archetype decomposition: The MASK dataset stratifies questions by archetype (see the paper for details). Decomposing the outcome vectors per archetype would be valuable, but drawing robust conclusions about model × archetype interactions requires more models than the current 10. Warning: For (1) and (2), any statistically meaningful investigation will be expensive.↩︎

  3. 4 degrees of freedom because \(n = H + L + E + N + \varepsilon\), and (\(n = 1{,}000\)).↩︎

  4. The MASK honesty score can be composed from the conditional lie rate and the engagement rate: \(\text{MASK Honesty} = 1 - P(\text{Lie}) = 1 - \frac{L}{n} = 1 - \frac{L}{H+L} \cdot \frac{H+L}{n} = 1 - (1 - \frac{H}{H+L}) \cdot \frac{H+L}{n}.\)↩︎

  5. By analogy to the many binary classification metrics out there, deception metrics have plenty of scope to evolve in a similar way.↩︎

  6. All confidence intervals in this post use the Wilson score interval.↩︎

  7. As the parse error analysis showed, clustering by question type means independence assumptions do not always hold. See: Clustered standard errors.↩︎