Beyond AUC: Why "Effective" Models Can Be Profoundly Unfair
Series: Auditing the Algorithm, Part 3b: Useful Metrics
Section 1: Introduction
The Netherlands Case
The Netherlands’ child benefits fraud detection algorithm was, according to authorities, effective.
It operated without technical scrutiny for years. Internal reports considered it “operationally efficient.” The fraud problem was “under control.”
But nobody measured whether errors were distributed evenly across demographic groups.
And that’s where the disaster lay.
Between 2013 and 2019, that system destroyed 26,000 families.
I’m not exaggerating: parents who lost custody of their children, homes foreclosed, couples divorced under financial strain, individuals with suicidal ideation. All because an algorithm flagged them as “fraudulent” when they were innocent.
Parliamentary investigation revealed what technical metrics didn’t show: families with dual nationality experienced dramatically higher false positive rates than native Dutch families.
How can a system considered “effective” for years generate a humanitarian catastrophe?
The answer lies in what traditional metrics don’t measure.
Why AUC Isn’t Enough
Last week we covered how AUC/ROC works: it measures a model’s ability to rank cases by risk. It separates positives from negatives well. It’s a useful, even necessary metric.
But it has a critical problem for high-impact applications in the public sector:
AUC doesn’t tell you who pays for the model’s errors.
Consider this:
A model can have AUC = 0.85 (excellent) while simultaneously:
Generating 3x more false positives among immigrant families than native ones
Auditing 40% of self-employed women but only 12% of men with identical risk profiles
Detecting actual tax evasion in formal sectors with 85% accuracy but only 55% in informal sectors
AUC sees none of this. Because AUC measures overall ranking, not distributive justice.
The problem with aggregated metrics:
Overall false positive rate = 8% → “Low, acceptable”
But when stratified by group:
FPR for native Dutch families = 5%
FPR for families with dual nationality = 24%
The aggregated metric hides that nearly one in four innocent families in one demographic group is being wrongly flagged by systematic errors.
This isn’t hypothetical. This is what happened in Toeslagen.
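This masking effect is easy to see numerically. The sketch below uses illustrative counts (not the actual Toeslagen case files): a large majority group with a 5% false positive rate and a small minority group with a 24% rate, which average out to an innocuous-looking aggregate.

```python
# Sketch: how an aggregate FPR masks a large per-group disparity.
# Counts are illustrative, loosely modeled on the stratified rates above.

def false_positive_rate(false_positives, true_negatives):
    """FPR = FP / (FP + TN): the share of innocents wrongly flagged."""
    return false_positives / (false_positives + true_negatives)

# Hypothetical counts: a large majority group with few errors and a
# small minority group bearing a much higher error rate.
majority_fpr = false_positive_rate(false_positives=450, true_negatives=8550)  # 0.05
minority_fpr = false_positive_rate(false_positives=240, true_negatives=760)   # 0.24

# Pooled across both groups, the rate looks "low, acceptable".
overall_fpr = false_positive_rate(false_positives=450 + 240,
                                  true_negatives=8550 + 760)  # 0.069
```

The aggregate (~7%) clears a typical "under 8%" comfort threshold while the minority group's rate is almost five times the majority's, which is exactly why stratification is non-negotiable.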
The IRS Case: When “Clean” Data Perpetuates Discrimination
United States, 2010s.
The IRS used predictive models to select taxpayers for audits. Models with performance considered solid for years.
A Stanford-led study (2023) found something disturbing:
Black taxpayers were audited roughly 3 to 5 times more often than white taxpayers with the same objective risk level.
The model didn’t use “race” as a variable (that’s illegal). But it learned to use proxies:
• ZIP code (residential segregation)
• Type of tax deductions (correlate with socioeconomic status)
• Reported occupation (sectors with higher minority representation)
• Type of declared income (wages vs. investments)
The result: a system perpetuating historical discrimination, with technical metrics that never flagged the problem.
Because nobody was measuring fairness.
Until specific fairness analyses revealed the systematic inequality, the model operated without question.
Section 2: The 5 Metrics
For algorithmic systems in the public sector (taxes, benefits, licenses, sanctions), you need to measure real outcomes, not just predictive capacity.
Here are the 5 critical metrics that should be on your dashboard alongside AUC:
1. Demographic Parity Gap
What it measures:
Difference in selection rates between demographic groups.
Formula (simplified):
|% selected Group A - % selected Group B|
Example:
If 12% of men are audited but 28% of women (same observable risk profile), the gap is 16%.
Suggested thresholds:
- ✅ ≤5%: Acceptable
- ⚠️ 5-10%: Review urgently
- 🛑 >10%: STOP — You’re auditing identity, not risk
Why it matters:
If two groups have equal true risk distribution but the model selects them at very different rates, you’re not measuring fiscal risk. You’re measuring group membership.
Real case: In Toeslagen, the selection rate for families with dual nationality was more than 4x that of native families, even controlling for observable risk variables.
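As a minimal sketch, the demographic parity gap can be computed in a few lines; the counts below are the illustrative men/women audit figures from the example above, not real audit data.

```python
def demographic_parity_gap(selected_a, total_a, selected_b, total_b):
    """Absolute difference in selection rates between two groups."""
    return abs(selected_a / total_a - selected_b / total_b)

# Illustrative example from the text: 12% of men vs. 28% of women audited.
gap = demographic_parity_gap(selected_a=120, total_a=1000,
                             selected_b=280, total_b=1000)
# gap = 0.16, well above the 10% "STOP" threshold
```

Note that the comparison only makes sense after conditioning on comparable observable risk, as the example stipulates; a raw rate difference between groups with genuinely different risk profiles is not by itself evidence of unfairness.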
2. Equalized Odds Gap (TPR) - True Positive Rate
What it measures:
Difference in ability to detect true positives between groups.
Formula:
|TPR Group A - TPR Group B|
where TPR = True Positives / Total Real Positives
Example:
If the model detects 80% of actual evaders in the formal sector but only 55% in the informal sector, there’s a 25 percentage point inequity.
Suggested thresholds:
- ✅ ≤10%
- ⚠️ 10-15%
- 🛑 >15%
Why it matters:
Unequal effectiveness = double injustice. If the system is better at catching evaders in one group, the other group has more “criminals who escape” while suffering more “innocents caught.”
Practical consequence: A model with unequal TPR disproportionately punishes groups where it’s less effective, generating institutional distrust (“they always bother us but never catch the real culprits”).
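The TPR gap follows directly from the formula above; this sketch plugs in the formal/informal sector figures from the example (illustrative numbers, not a real audit dataset).

```python
def true_positive_rate(true_positives, actual_positives):
    """TPR = TP / all actual positives: share of real evaders detected."""
    return true_positives / actual_positives

# Illustrative: 80 of 100 formal-sector evaders caught vs. 55 of 100 informal.
formal_tpr = true_positive_rate(true_positives=80, actual_positives=100)
informal_tpr = true_positive_rate(true_positives=55, actual_positives=100)

tpr_gap = abs(formal_tpr - informal_tpr)
# tpr_gap = 0.25, above the 15% "STOP" threshold
```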
3. Equalized Odds Gap (FPR) - False Positive Rate
What it measures:
Difference in false positive rate (innocents flagged as guilty).
Formula:
|FPR Group A - FPR Group B|
where FPR = False Positives / Total Real Negatives
Documented example - Toeslagen:
Families with dual nationality experienced substantially higher false positive rates than native Dutch families. Parliamentary investigation confirmed that the variable “dual nationality” was used as a proxy for fraud risk.
Suggested thresholds:
- ✅ ≤5%
- ⚠️ 5-8%
- 🛑 >8%
Why it matters:
In high-impact systems (loss of benefits, sanctions), a false positive can destroy lives. If FPR is substantially higher in a vulnerable group, that group disproportionately pays the cost of the system’s “efficiency.”
The real cost: In Toeslagen, a false positive meant total benefit loss + retroactive debt + interest + foreclosures. For families in precarious economic situations, this was financially terminal.
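In practice you compute the FPR gap from record-level labels rather than pre-aggregated counts. The sketch below is a self-contained version with a tiny invented dataset; the group labels, values, and resulting rates are all hypothetical.

```python
def group_fpr(y_true, y_pred, groups, g):
    """False positive rate within group g: FP / actual negatives in g."""
    fp = sum(1 for t, p, grp in zip(y_true, y_pred, groups)
             if grp == g and t == 0 and p == 1)
    negatives = sum(1 for t, grp in zip(y_true, groups)
                    if grp == g and t == 0)
    return fp / negatives

# Toy records: 1 = flagged / actual fraud, 0 = not. Groups "A" and "B".
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

fpr_a = group_fpr(y_true, y_pred, groups, "A")  # 1 FP / 4 negatives = 0.25
fpr_b = group_fpr(y_true, y_pred, groups, "B")  # 3 FP / 4 negatives = 0.75
fpr_gap = abs(fpr_a - fpr_b)                    # 0.50
```

With real data you would do the same per-group split on your validation set, ideally with confidence intervals, since small groups make these rates noisy.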
4. Predictive Parity Gap (PPV) - Positive Predictive Value
What it measures:
Difference in precision (positive predictive value) between groups.
Formula:
|PPV Group A - PPV Group B|
where PPV = True Positives / Total Selected
Example:
Group A: Of 100 audited, 70 have findings (PPV=70%)
Group B: Of 100 audited, 40 have findings (PPV=40%)
Gap = 30%
Suggested thresholds:
- ✅ ≤15%
- ⚠️ 15-20%
- 🛑 >20%
Why it matters:
Low PPV in a group means most “alarms” are false. This generates:
1. Institutional distrust: “They always bother us for nothing”
2. Operational cost: Resources wasted on audits that find nothing
3. Reputational harm: Innocents publicly marked as suspects
Public sector consequence: If the State audits you and finds nothing, the damage to your reputation is already done. And if this happens systematically more in your demographic group, it’s institutional discrimination.
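The PPV gap calculation mirrors the others; this sketch uses the Group A / Group B audit counts from the example above (illustrative figures).

```python
def positive_predictive_value(true_positives, total_selected):
    """PPV = TP / all selected: share of audits that find real issues."""
    return true_positives / total_selected

# Illustrative: 70 of 100 audits with findings in Group A, 40 of 100 in B.
ppv_a = positive_predictive_value(true_positives=70, total_selected=100)
ppv_b = positive_predictive_value(true_positives=40, total_selected=100)

ppv_gap = abs(ppv_a - ppv_b)
# ppv_gap = 0.30, above the 20% "STOP" threshold
```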
5. Calibration - Probability Calibration
What it measures:
Whether predicted probabilities match observed real frequencies.
Simple explanation:
If the model says “this case has an 80% probability of fraud,” then of all cases with that score, approximately 80% should actually be fraud.
Miscalibration example:
Model says: “80% probability of evasion”
Reality: Only 50% evade
Miscalibration = 30 percentage points
Threshold:
- ✅ Error <10 points: Well calibrated
- ⚠️ 10-20 points: Review urgently
- 🛑 >20 points: Scores unreliable
Why it matters:
If “risk scores” don’t reflect reality, stakeholders make decisions based on false numbers. In the public sector, this destroys institutional legitimacy.
Legal consequence: In many jurisdictions, if an automated system assigns “probabilities” that aren’t calibrated, it can be considered misleading information and violate administrative transparency regulations.
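A simple way to check calibration is to bucket cases by predicted score and compare the bucket's mean score with its observed outcome frequency. This sketch checks a single bucket with invented scores and outcomes, reproducing the "model says 80%, reality is 50%" example; a full audit would repeat it across score bins and per demographic group.

```python
def bin_calibration_gap(scores, outcomes, lo, hi):
    """Gap between mean predicted score and observed frequency in [lo, hi)."""
    in_bin = [(s, o) for s, o in zip(scores, outcomes) if lo <= s < hi]
    if not in_bin:
        return 0.0
    mean_score = sum(s for s, _ in in_bin) / len(in_bin)
    observed = sum(o for _, o in in_bin) / len(in_bin)
    return abs(mean_score - observed)

# Toy example: ten cases all scored "80% fraud risk", but only half
# turn out to be actual fraud (1 = fraud, 0 = not).
scores = [0.8] * 10
outcomes = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

gap = bin_calibration_gap(scores, outcomes, lo=0.75, hi=0.85)
# gap = 0.30, i.e., 30 points: "scores unreliable" territory
```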
Section 3: What’s next
Now that you know the 5 metrics, you have the diagnosis.
But here’s the critical part: when do you use these metrics to STOP a deployment?
Because measuring is fine. But you need clear decision rules:
Is a 7% gap acceptable or catastrophic?
Does it depend on context?
Who decides?
How do you document it?
And more importantly: how do you segment your analysis?
It’s not enough to calculate these metrics “overall.” You need to stratify by the right axes. Because:
“Job tenure” is a proxy for age
“ZIP code” is a proxy for social class + ethnicity
“Type of deductions” is a proxy for economic sector + gender
If you don’t segment correctly, the metrics will lie to you as much as AUC does.
Next week I’m bringing you:
✅ The 3 Stop Rules: When to stop deployment (with specific thresholds by system type)
✅ The 2 Segmentation Axes: How to stratify your analysis to make it actionable
✅ The Outcome Fairness Pack: A free operational kit with calculator, checklist, and design canvas
The hard part isn’t measuring. The hard part is deciding.
And in the public sector, where your decisions impact real lives, lacking a decision framework isn’t negligence: it’s malpractice.
And if you know someone deploying models in public administration, share this post with them. Especially if they still think AUC >0.75 is sufficient.
Key Takeaways
AUC measures ranking, not justice: A model can discriminate systematically while maintaining high AUC
5 critical metrics: Demographic Parity, Equalized Odds (TPR+FPR), Predictive Parity, Calibration
Real cases: Toeslagen (26K families) and IRS (verified racial discrimination) operated for years without these metrics.
Invisible proxies: “Neutral” variables like ZIP code or job tenure hide discrimination
Next week: Stop Rules + Segmentation + Free operational kit