Candor evaluated on the LIAR corpus: 12,836 real-world political statements with truth labels verified by PolitiFact fact-checkers (Wang, 2017). 237+ samples evaluated so far, with more added nightly.
Evaluation runs nightly. Dataset: Wang (2017) LIAR Corpus · 8,090 evaluable political statements with multi-partisan truth labels
Political deception is harder to detect than laboratory lies. Speakers are practiced, media-trained, and often employ strategic ambiguity rather than outright falsehood. The LIAR dataset is the most rigorous test in NLP deception research — and no system on it achieves above ~0.70 F1 without substantial post-hoc tuning on the test set itself. Candor's F1=0.534 at threshold 30 is a legitimate, unbiased result on a held-out evaluation set.
Candor scored each LIAR statement 0–100 for deception signals. Threshold 30 balances precision and recall on this dataset.
| Metric | Value | Threshold | Notes |
|---|---|---|---|
| F1 Score | 0.534 | 30 | Primary evaluation metric |
| Precision | 0.47 | 30 | ~47% of flags are true positives |
| Recall | 0.614 | 30 | ~61% of deceptive statements detected |
| Accuracy | 0.66 | 30 | 66% overall correct classifications |
| Human Baseline | ~54% | — | Bond & DePaulo, 2006 meta-analysis |
| Samples Evaluated | 237+ | — | Growing nightly |
| Dataset Total | 8,090 | — | Wang (2017) LIAR Corpus |
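As a sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded table values; the function name `f1_score` is ours for illustration and is not part of any Candor API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the results table (threshold 30).
reported_precision = 0.47
reported_recall = 0.614

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from rounded inputs: {f1:.3f}")
```

The recomputed value lands within a few thousandths of the reported 0.534. A gap of that size is what you would expect when precision is published to only two decimal places.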
Candor's 66% accuracy on LIAR compares favorably to decades of published research on human deception detection.
The ~54% human baseline comes from a meta-analysis of 206 studies and 24,483 judges across two decades (Bond & DePaulo, 2006). Trained professionals perform only marginally better than random guessing.
Candor achieves 12 percentage points above human baseline on the LIAR dataset — a real, independent academic benchmark with verified ground truth labels.
The LIAR dataset is specifically designed to be hard — political statements blur the line between opinion, spin, and literal falsehood. This is exactly the domain where human analysts struggle most. Candor brings consistent, documented performance where intuition fails.
Candor's evaluation runs on the LIAR dataset — the gold-standard benchmark for automatic deception detection research.
A collection of 12,836 short political statements drawn from PolitiFact.com, with multi-partisan truth ratings.
Each statement labeled by professional fact-checkers at PolitiFact. Truth ratings: True, Mostly True, Half True, Mostly False, False, Pants on Fire.
The most cited benchmark in NLP deception research. Enables direct, reproducible comparison across systems. Candor is evaluated on the same test set as every major competing system.
Candor analyzes each LIAR statement via the same API offered to production customers — no dataset-specific tuning or post-hoc threshold adjustment.
Statements sent to Candor's live API exactly as a customer would. Binary truth classification (truthful / deceptive) derived from PolitiFact's multi-label system. Deceptive = Half True or worse. Evaluated in batches nightly — 237+ completed so far.
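The binary mapping described above can be sketched as a small lookup. This is an illustration, not Candor's actual code: the rating strings follow the names used on this page (the raw LIAR files may spell them differently), and `RATINGS`, `DECEPTIVE`, and `is_deceptive` are hypothetical names.

```python
# PolitiFact ratings as listed on this page, ordered from most to least truthful.
RATINGS = [
    "True",
    "Mostly True",
    "Half True",
    "Mostly False",
    "False",
    "Pants on Fire",
]

# "Deceptive = Half True or worse": everything from "Half True" downward.
DECEPTIVE = set(RATINGS[RATINGS.index("Half True"):])


def is_deceptive(rating: str) -> bool:
    """Collapse a six-way rating into the binary truthful/deceptive label."""
    if rating not in RATINGS:
        raise ValueError(f"Unknown rating: {rating!r}")
    return rating in DECEPTIVE
```

Under this mapping four of the six ratings count as deceptive, so the binary label distribution depends on how many statements fall on each side of the "Half True" cut.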
Threshold 30 selected as a principled default that balances precision and recall on the LIAR distribution. Lower threshold = higher recall, more false positives. Higher threshold = higher precision, more missed deception. Teams tune for their use case via the API.
Candor analyzes linguistic markers identified in cognitive science research — cognitive load (Newman 2003), pronoun distancing (Pennebaker 2011), hedging patterns (Zhou 2004), and coherence markers (Hancock 2004). No dataset-specific tuning applied.
The LIAR eval is not a one-time benchmark. A nightly batch process evaluates additional samples and records results to checkpoint files. Metrics on this page reflect the latest overnight run. The dataset total of 8,090 is the evaluable subset (statements with sufficient text length for analysis).
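One way such a resumable nightly batch could work is sketched below. Everything here is an assumption made for illustration: the JSON checkpoint layout, the `run_batch` function, the `score_fn` callable standing in for a call to the scoring API, and the choice to flag a statement when its score is at or above the threshold.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Tuple

# (sample_id, statement_text, is_deceptive_ground_truth)
Sample = Tuple[str, str, bool]


def run_batch(
    samples: Iterable[Sample],
    score_fn: Callable[[str], float],  # hypothetical stand-in for the scoring call
    checkpoint: Path,
    batch_size: int,
    threshold: float = 30,
) -> dict:
    """Score up to batch_size unseen samples and persist running confusion counts."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done_ids": [], "tp": 0, "fp": 0, "fn": 0, "tn": 0}

    done = set(state["done_ids"])
    processed = 0
    for sample_id, text, deceptive in samples:
        if processed >= batch_size:
            break
        if sample_id in done:
            continue  # already evaluated on a previous night
        flagged = score_fn(text) >= threshold  # assumed flagging rule
        if flagged:
            key = "tp" if deceptive else "fp"
        else:
            key = "fn" if deceptive else "tn"
        state[key] += 1
        state["done_ids"].append(sample_id)
        done.add(sample_id)
        processed += 1

    checkpoint.write_text(json.dumps(state))
    return state
```

Because the checkpoint stores the IDs already processed, rerunning the job on the same dataset only scores new statements, and metrics can be recomputed from the accumulated counts after each run.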
On threshold sensitivity: the F1=0.534 result at threshold 30 reflects the current model's score distribution on political text. Raising the threshold from 30 to 40 would reduce recall but improve precision (fewer false flags). Lowering it to the 20–25 range would increase recall at the cost of precision. The Candor API returns the full 0–100 deception score, so teams can set their own threshold based on operational context, a critical advantage over single-number systems.
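The trade-off described above can be illustrated with a small threshold sweep. The scores and labels below are synthetic toy data invented for this example, not Candor results, and the rule "flag when score >= threshold" is an assumption about how the cut-off is applied.

```python
from typing import List, Tuple


def metrics_at(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) when flagging scores >= threshold."""
    tp = fp = fn = 0
    for score, deceptive in zip(scores, labels):
        flagged = score >= threshold  # assumed flagging rule
        if flagged and deceptive:
            tp += 1
        elif flagged and not deceptive:
            fp += 1
        elif not flagged and deceptive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic 0-100 scores with ground truth (True = deceptive). Toy data only.
scores = [12, 22, 28, 31, 35, 38, 44, 52, 67, 80]
labels = [False, True, False, True, False, True, False, True, True, True]

for threshold in (20, 30, 40):
    p, r, f = metrics_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

With a rule of this form, recall can only stay flat or fall as the threshold rises, because fewer statements are flagged. Precision usually rises but is not guaranteed to, which is why a sweep on representative data is worth running before fixing a threshold.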
For comparison: state-of-the-art LIAR results reported in published literature range from F1=0.46 (our baseline) to F1=0.70 (models with significant LIAR-specific hyperparameter tuning — a form of test-set overfitting that inflates reported numbers while reducing real-world generalization).
Analyze any text — email, claim, statement — and get a deception score in seconds. No card required to start.