# Beating the benchmark — how CAPAS wins IBM's game with IBM's own numbers *2026-06-22. Honest scope. Re-derivable: `capas_quantum_physics.complete_error_budget`, `mitigation_prescription`, `estimate_kappa`; kingston data in `/tmp/xeb_results.json`.* ## The game, and where it's winnable We cannot build better qubits. But IBM's *published* numbers — RB/IRB error, CLOPS, median T1/T2 — are designed to **compare processors**, not to **predict your circuit**. The deep hardware literature is unambiguous: the RB number is an **optimistic lower bound**; for structured (repeated) circuits the real error is **3–10× higher** (Proctor et al., *Nat. Phys.* 2022). That gap is the opening. CAPAS wins three ways, **from what we already have**: - **By design** — deterministic, composable, auditable. IBM's RB number is *not* re-derivable by you; CAPAS's complete budget *is*, from IBM's own calibration fields. - **By understanding** — we encode the physics the headline averages away (dephasing, ZZ-idle, leakage, thermal SPAM, the structured-circuit amplification). - **By inference** — from published T1/T2/readout/CZ/ZZ we re-derive what IBM withholds: Γφ, thermal population, the **complete error budget**, and (by direct measurement) the coherent/incoherent split. ## The result: the headline number is optimistic, and we prove it `complete_error_budget` on the real ibm_fez pair q9–q10 (every input from IBM's panel): | | value | source | |---|---|---| | IBM headline (CZ/RB) | **1.63e-3** | published | | Re-derivable **complete floor** | **1.86e-2** (**11.4×**) | composed from published fields | | dominant omitted term | **dephasing** (q9 T2=33.9µs) | from published T1,T2 | | structured-circuit band | up to **1.86e-1** | Proctor 2022 (literature range, cited) | Every term except `leakage` (a literature estimate) and the structured band (a cited range) is **exact re-derivation from IBM's own data**. The headline says 0.16%; the honest, auditable floor is 1.9% — because the published number hides a dephasing-limited qubit. ## We measured it directly too (and corrected our own error) The kingston XEB run (edges 82-83 diamond, 6-7 moderate) gave per-layer error from the **slope of log(XEB) vs depth** — which, crucially, **separates the per-layer gate error from the depth- independent SPAM/readout prefactor**. Our first estimator took single-depth `eff_err` and folded SPAM in, inflating κ to ~10 on the diamond edge. Corrected: | edge | published CZ+SX | measured per-layer (slope) | κ | reading | |---|---|---|---|---| | 6-7 (moderate) | 1.02% | 1.08% | **1.05** | the published number is **accurate** here | | 82-83 (diamond) | 0.14% | 0.35% | **2.53** | IBM **under-states 2.5×** on its best edge | **The honest punchline:** the gap between IBM's headline and reality is *largest exactly where its number looks best* — on the diamond edge, the tiny CZ (8e-4) is swamped by a ~0.3% per-layer floor (SPAM/dephasing/crosstalk) the headline ignores. And the error is **incoherent** (speckle purity), so this is *not* the κ̂=0.401 coherent-discount hypothesis — that was an artifact of a too- pessimistic inference baseline; direct measurement shows the published number is optimistic, not pessimistic. ## Operationalized: error correction, re-derived `mitigation_prescription` turns the same calibration row into actions IBM's panel does not hand you — each with its re-derived reason: - **dynamical decoupling (XY-8/CPMG)** when Tφ < 50µs (q9: Tφ=35.9µs) — with the honest caveat that it only helps for low-frequency TLS. - **rep_delay ≥ 5·T1** (q9: 1498µs) — the default 250µs leaves ~57% relaxation, contaminating state prep. - **active reset** when P(1|prep0) ≫ thermal (residual excitation). - **explicit readout correction** (the assignment matrix is asymmetric → systematic bias). ## The honest line CAPAS does not measure the qubit better than IBM — IBM has the device. What CAPAS does is **refuse to let an optimistic summary statistic stand as the admissible error**, and re-derive the complete, auditable number from the vendor's own published fields. That is the GRIM/statcheck move, applied to a hardware benchmark: the headline is real but optimistic; the complete floor is re-derivable; we say exactly which terms are exact, which are estimated, and which are a cited literature range. We win by honesty and auditability, not by hardware.