What your calibration metric isn’t telling you
A model that looks well-calibrated globally can be catastrophically broken exactly where you need it most. Notes on local calibration, the L2 alternative to ECE, and the paper behind CalFram.
Here’s a result from one of my papers that I think about often. We took DeiT, a Vision Transformer, fine-tuned on PneumoniaMNIST (the classify-a-chest-X-ray benchmark) and computed its calibration. By the standard metric, Expected Calibration Error, the model looked fine. ECE close to zero. Calibration score 0.85 on a 0–1 scale. Anyone reporting this in a paper would write “the model is well-calibrated” and move on.
Then we broke the metric down by probability region, and the picture changed completely. Where the model was confident — both very high probabilities and very low ones — calibration was nearly perfect. But in the middle of the probability range, where the model was saying “maybe 40% pneumonia, maybe 60% normal,” the calibration score collapsed to 0.20. The model was badly broken exactly in the region where a radiologist would actually be looking at the score to decide what to do next.
This is the finding that made me think the standard way of measuring calibration is quietly lying to us. This post is about why that happens, and what to do instead.
What calibration is, in one paragraph
A model is calibrated if its predicted probabilities match observed frequencies. When the model says “70% confident,” it should be right 70% of the time. Plot predicted probability on the x-axis against observed frequency on the y-axis, and a perfectly calibrated model traces the diagonal exactly. Most don’t. Some live above it (under-confident: the model is more right than it claims). Some live below it (over-confident: the model claims more than it can deliver). Calibration matters in any setting where the probability score actually gets used: medical triage, fraud risk, retrieval ranking, autonomous-vehicle decision-making, anywhere humans or other systems consume the score downstream rather than just the argmax.
Why ECE is hiding things
Expected Calibration Error is the standard way to summarize how far off a model’s reliability curve is from the diagonal. You bin the predictions by confidence, you compute the gap between predicted and observed in each bin, you take a weighted average. One number. Lower is better. It’s in every paper that mentions calibration.
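The bin, gap, weighted-average recipe above can be sketched in a few lines. This is a minimal illustration for the binary case with equal-width bins, not the CalFram implementation; the function name `expected_calibration_error` and the choice of `n_bins` are assumptions made for the example.

```python
import numpy as np

def expected_calibration_error(y_true, p_pos, n_bins=10):
    """Binary ECE sketch: equal-width bins over the positive-class probability.

    Hypothetical helper for illustration; binning scheme is an assumption.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    # Interior edges only, so np.digitize returns bin indices 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p_pos, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between mean predicted probability and observed positive rate.
        gap = abs(p_pos[mask].mean() - y_true[mask].mean())
        # Weight each bin by the fraction of predictions it holds.
        ece += mask.mean() * gap
    return ece

# Toy data: two bins, each off by 0.1.
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.1, 0.9, 0.9])
ece = expected_calibration_error(y, p, n_bins=2)
print(round(ece, 6))
```

Note that the weight `mask.mean()` is the bin population share, which is exactly what lets crowded bins dominate the final number.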
It has three problems, and I want to be honest about them in increasing order of severity.
One. The metric is a biased estimator. The bias depends on how you choose the bins. Two researchers reporting ECE on the same model with different binning schemes will get different numbers, and there’s no canonical “right” binning. This is well known but rarely acted on.
Two. “ECE” isn’t one metric. It’s two, and they measure different things. The original formulation (now called frequency-based ECE) compares per-bin confidence to per-bin true positive frequency. The version that became popular after Guo et al.’s 2017 paper on modern neural networks (accuracy-based ECE) compares per-bin confidence to per-bin accuracy. These are not the same. They correspond to two different formal definitions of calibration (weak vs strong), and most practitioners use them interchangeably without noticing.
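To see that the two flavors really can disagree, here is a small sketch for a binary classifier. It assumes the accuracy-based variant uses the confidence of the predicted class, max(p, 1 - p), with a 0.5 decision threshold; the helper names `binned_gap`, `frequency_based_ece`, and `accuracy_based_ece` are made up for this example and are not taken from the paper's code.

```python
import numpy as np

def binned_gap(scores, outcomes, n_bins=10):
    """Population-weighted |mean(score) - mean(outcome)| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])
    total = 0.0
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        total += mask.mean() * abs(scores[mask].mean() - outcomes[mask].mean())
    return total

def frequency_based_ece(y_true, p_pos, n_bins=10):
    # Compare predicted positive-class probability to observed positive rate.
    y_true = np.asarray(y_true).astype(float)
    p_pos = np.asarray(p_pos, dtype=float)
    return binned_gap(p_pos, y_true, n_bins)

def accuracy_based_ece(y_true, p_pos, n_bins=10):
    # Compare confidence in the predicted class to per-bin accuracy.
    # Assumption: predicted class is 1 when p_pos >= 0.5.
    y_true = np.asarray(y_true).astype(int)
    p_pos = np.asarray(p_pos, dtype=float)
    confidence = np.maximum(p_pos, 1.0 - p_pos)
    correct = ((p_pos >= 0.5).astype(int) == y_true).astype(float)
    return binned_gap(confidence, correct, n_bins)

# Toy data: the low-probability bin and the high-probability bin are both off by 0.25.
y = np.array([0, 0, 0, 1])
p = np.array([0.25, 0.25, 0.75, 0.75])
freq_ece = frequency_based_ece(y, p)
acc_ece = accuracy_based_ece(y, p)
print(freq_ece, acc_ece)
```

On this toy data the frequency-based score is 0.25 while the accuracy-based score is 0: folding both sides into a single confidence bin lets the two errors cancel, so the same model gets two very different verdicts.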
Three, and this is the one that actually matters. ECE gives you one number for the whole probability space. A model with massive miscalibration in the decision-critical middle band can have a low ECE if it’s well-calibrated at the extremes, precisely because most predictions in a confident classifier sit at the extremes, and the bin weights are proportional to bin population. The middle gets averaged out of existence.
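The averaging effect is easy to reproduce on made-up data. The sketch below builds a synthetic classifier (the group sizes and probabilities are invented for illustration and are not results from the paper): 98% of predictions sit at well-calibrated extremes, and 2% sit at 0.55 where only 10% are truly positive.

```python
import numpy as np

def make_group(prob, n, n_pos):
    """n predictions at the same probability, n_pos of them truly positive."""
    labels = np.zeros(n)
    labels[:n_pos] = 1.0
    return np.full(n, prob), labels

# Synthetic, hand-picked numbers (assumption, for illustration only).
groups = [
    make_group(0.05, 4900, 245),   # confident negatives, observed rate 5%
    make_group(0.95, 4900, 4655),  # confident positives, observed rate 95%
    make_group(0.55, 200, 20),     # middle band, observed rate only 10%
]
p = np.concatenate([g[0] for g in groups])
y = np.concatenate([g[1] for g in groups])

# Global ECE with ten equal-width bins.
edges = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(p, edges[1:-1])
global_ece = 0.0
for b in np.unique(bin_ids):
    mask = bin_ids == b
    global_ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())

# Calibration gap restricted to the decision-critical band [0.3, 0.7].
middle = (p >= 0.3) & (p <= 0.7)
middle_gap = abs(p[middle].mean() - y[middle].mean())

print(round(global_ece, 4), round(middle_gap, 4))
```

The global figure comes out near 0.009 while the gap inside the middle band is 0.45: the broken bin holds only 2% of the weight, so it barely moves the average.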
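Here’s what that looks like with real numbers: the Vision Transformer from the opening scored 0.85 globally and 0.20 in the middle of the probability range.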
This is the failure mode. A model that looks fine in aggregate is broken precisely where it matters: near the decision boundary, where a radiologist or a triage system actually consults the probability score to make a call. The global metric averages out the catastrophe because the bins at the extremes, where most predictions live in any confident classifier, are well-calibrated, and those well-calibrated bins dominate the weighted average.
This isn’t an exotic edge case. Every time we’ve looked at a real model on a real medical dataset — pneumonia classification, colon pathology, ECG interpretation — we’ve found this pattern. Global metric looks fine. Local metric reveals trouble where trouble matters.
What we did instead
The paper — “Towards a Rigorous Calibration Assessment Framework”, with Andrea Campagner and Federico Cabitza, published at ECAI 2023 — proposes a different metric, called Estimated Calibration Index, ECI. Two design choices distinguish it from ECE.
First, instead of L1 distance from the diagonal (the absolute gap between predicted and observed) we use L2 distance: the perpendicular Euclidean distance from each calibration point to the bisector line of the reliability diagram. The geometric picture is cleaner. The L2 choice also weights large deviations more heavily than small ones, which is what you want when the cost of a bad probability estimate is non-linear, as it is in medicine, finance, and basically anywhere a calibration score is worth measuring in the first place.
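The geometric part of that choice can be checked directly. The sketch below computes the perpendicular distance from one calibration point to the diagonal by orthogonal projection. It covers only the geometry; how the paper turns this distance into a normalized 0-to-1 score is not reproduced here, and the function name `perpendicular_distance` is an assumption for the example.

```python
import numpy as np

def perpendicular_distance(pred, obs):
    """Euclidean distance from the point (pred, obs) to the line obs = pred."""
    point = np.array([pred, obs], dtype=float)
    # Unit vector pointing along the diagonal of the reliability diagram.
    direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
    # Foot of the perpendicular: orthogonal projection onto the diagonal.
    foot = np.dot(point, direction) * direction
    return float(np.linalg.norm(point - foot))

# A bin that predicts 0.55 on average but observes a 0.10 positive rate.
d = perpendicular_distance(0.55, 0.10)
print(round(d, 4))
```

For a single point the result equals the absolute gap divided by the square root of 2, so the vertical gap of 0.45 becomes a perpendicular distance of about 0.318.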
Second, and more importantly, we don’t reduce calibration to one number. The metric decomposes into five.
Local ECI is the bin-level score. This is the one that flagged the DeiT problem above. Compute it for each confidence bin separately and you find out where in the probability space the model is misbehaving.
Over-confidence ECI averages local ECI across bins where the model is over-confident (calibration point below the diagonal: the model claimed more than reality delivered). Under-confidence ECI averages across bins on the other side. Two numbers instead of one, telling you which kind of trouble you have. A model that’s consistently under-confident is very different operationally from one that’s consistently over-confident, even if they have identical ECE.
Balance ECI is the signed difference between the two — positive means the model leans over–confident, negative means under–confident, zero means symmetric. This is the single number I find most useful in practice for triaging what kind of recalibration the model needs.
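The over, under, and balance split can be illustrated with plain signed gaps. This is a stand-in, not the paper's metric: it uses the raw per-bin difference between mean prediction and observed rate (larger means worse) instead of normalized ECI scores, and the name `directional_summary` is hypothetical. Only the sign convention follows the text, with a positive balance meaning the model leans over-confident.

```python
import numpy as np

def directional_summary(y_true, p_pos, n_bins=10):
    """Split per-bin signed gaps into over- and under-confident parts.

    Simplified stand-in for the directional scores: raw gaps, unweighted
    averages over bins, no normalization (all assumptions).
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p_pos, edges[1:-1])
    # Positive gap: predicted more than observed (point below the diagonal).
    signed = np.array([
        p_pos[bin_ids == b].mean() - y_true[bin_ids == b].mean()
        for b in np.unique(bin_ids)
    ])
    over = signed[signed > 0]
    under = -signed[signed < 0]
    over_score = float(over.mean()) if over.size else 0.0
    under_score = float(under.mean()) if under.size else 0.0
    return {"over": over_score, "under": under_score,
            "balance": over_score - under_score}

# Toy data: low bin under-predicts by 0.25, high bin over-predicts by 0.5.
p = np.array([0.25] * 4 + [0.75] * 4)
y = np.array([1, 1, 0, 0, 1, 0, 0, 0])
summary = directional_summary(y, p)
print(summary)
```

Here the summary reports 0.5 of over-confidence, 0.25 of under-confidence, and a balance of 0.25, which a single unsigned number would have blended together.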
Global ECI is the headline number you’d compare across models, the closest analogue to ECE. The relationship is direct: in the binary case, ECE equals a weighted sum of the per-bin (1 − ECI) values, so ECI is essentially a normalized, bin-rescaled version of ECE that’s easier to interpret across different models and binning choices.
What the experiments showed
Two main results from the paper, both honest about what the metric does and doesn’t do.
On synthetic data where we know the true calibration error analytically, ECI is closer to the truth than either flavor of ECE across eight model architectures and several thousand random configurations. When ECI is wrong, it tends to be wrong in the conservative direction: overestimating the calibration error rather than underestimating it. That matters for any regulated application where you’d rather flag a model as under-performing than ship one that quietly isn’t.
On real medical benchmarks, PneumoniaMNIST and PathMNIST, the local and per-class breakdown caught failures the global score missed every time. The DeiT example I opened with isn’t a cherry-picked anecdote. ResNet152 had the same pattern. So did the multiclass models when we broke down ECI per class: class 1 had an over-confidence ECI of 0.417 (catastrophic) hidden behind a near-perfect global score above 0.98 (looks fine).
What I’d want you to take from this
Three things, in increasing order of how much I care about them.
First: don’t trust a single calibration score. ECE, ECI-global, Brier, whatever. They all average. They all average. They all average. If you ship anything where the probability score is consumed downstream, look at it locally. Bin it manually if your library doesn’t support it. The five lines of NumPy that compute per-bin calibration are the most important five lines of NumPy in your validation pipeline.
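For readers who want to bin manually, here is roughly what those few lines look like. It is a sketch under simple assumptions (binary labels, equal-width bins, raw signed gap per bin rather than an ECI score), and `per_bin_calibration` is a hypothetical helper name, not a function from CalFram.

```python
import numpy as np

def per_bin_calibration(y_true, p_pos, n_bins=10):
    """Rows of (bin index, count, mean predicted, observed rate, signed gap).

    Only non-empty bins are returned. Positive gap means the model
    predicted more than was observed in that bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    bin_ids = np.digitize(p_pos, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    rows = []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        mean_pred = float(p_pos[mask].mean())
        obs_rate = float(y_true[mask].mean())
        rows.append((int(b), int(mask.sum()), mean_pred, obs_rate,
                     mean_pred - obs_rate))
    return rows

# Toy data: extremes are fine, the 0.55 bin never sees a positive.
y = np.array([0, 0, 0, 0, 1, 1])
p = np.array([0.05, 0.05, 0.55, 0.55, 0.95, 0.95])
rows = per_bin_calibration(y, p)
for bin_index, count, mean_pred, obs_rate, gap in rows:
    print(f"bin {bin_index}: n={count} pred={mean_pred:.2f} "
          f"obs={obs_rate:.2f} gap={gap:+.2f}")
```

Printing the count next to each gap matters: it shows at a glance which bins are badly off and how little weight they carry in any averaged score.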
Second: the decision-critical region is the middle, not the extremes. A confident classifier sits at the extremes for most of its predictions. That’s where calibration is easy and where most of the bin weight is. The interesting instances, the ones a human or a downstream system actually has to think about, live near the threshold. Weight your evaluation accordingly. If your threshold is 0.5, the calibration score that should worry you is the one in [0.3, 0.7], not the global one.
Third, and most uncomfortable: over- and under-confidence are different failure modes that current literature mostly collapses together. A diagnostic system that’s under-confident at 0.7 is going to refer too many cases to a specialist; the same system over-confident at 0.7 is going to clear too many patients home. These need different fixes: different recalibration techniques, different decision thresholds, different escalation policies. An aggregate metric that gives one number can’t distinguish between them, and the literature’s habit of reporting one number per model has made it easier to lose this distinction than it should be.
The code is at github.com/lorenzofamiglini/CalFram, the paper is Famiglini, Campagner & Cabitza, ECAI 2023, and the metric is a few hundred lines of NumPy that work as a drop-in replacement for whatever calibration tooling you’re using now. If you find local miscalibrations in your own models — especially weirder patterns than the ones we found — I’d genuinely like to see them.
Calibration isn’t one number. It never was. Treating it as one number has been a convenient simplification that’s cost us a clear view of how our models actually behave where it matters.