Trust & Safety Teams: How Linguistic Analysis Catches What Keyword Filters Miss

The Moderation Gap Isn't a Tooling Problem

Every trust and safety team operating at scale has the same setup: a keyword blocklist, a rule engine, a classifier or two trained on historical violations, and a human review queue that's always too long. The tooling is mature. The gap isn't tooling — it's that the threat model has outpaced the detection model.

Keyword filters catch the unsophisticated violations: slurs, explicit content, brand-name drug terms, phone numbers in formats your rules recognize. They're reactive by design. The moment you publish a rule, anyone with a reason to evade it starts evading it. Sophisticated actors — organized fraud rings, romance scammers, fake review networks, marketplace grifters — have long since learned what your filters look for and what they don't.

What they can't fully control is how they write. Deceptive intent produces measurable linguistic patterns that persist even when the author is trying to sound legitimate. This is the finding that forty years of psycholinguistics research has converged on — and it's the basis for text-based deception detection as a moderation signal.

$8.8B: consumer fraud losses (FTC, 2022)
34%: share of losses from imposter scams
~70%: share of marketplace scams that involve an off-platform redirect

The categories where this matters most for T&S teams: fake reviews written to mimic genuine customer experience, romance scams that build trust over dozens of messages before pivoting to a financial request, marketplace fraud that redirects transactions off-platform before the moment of extraction, and financial platform abuse that uses plausible-sounding narratives to social-engineer support agents or circumvent verification flows.

What Keyword Filters Can't See

A keyword filter operates on surface form. It looks for specific strings, patterns, or token sequences. Its coverage is exactly the set of things you've told it to look for — no more, no less. This makes it brittle against any actor who understands its coverage boundaries.

Consider three message types that keyword filters fail on almost by design:

Fake reviews with no policy-violating content. A coordinated fake review campaign targeting a competitor produces reviews that are grammatically correct, contain no prohibited terms, and read like genuine customer feedback. The experience is fabricated, but the text itself is clean, so no keyword rule will catch it. What does surface is the psycholinguistic signature of constructed narrative: reduced first-person agency, hedged specificity, emotional language that doesn't match the claimed experience.

Romance scams before the ask. The messages that establish trust in a romance scam — the weeks of conversation before any financial request — contain nothing a keyword filter would flag. No solicitation, no urgency language, no external links. What they do contain is a measurable pattern of emotional manipulation: high affective language density, unusually strong expressions of certainty and connection from a stranger, and a narrowing of conversational scope that pushes toward dependency. These signals exist in the text long before the financial request appears.

Marketplace off-platform redirection. The most common marketplace fraud pattern involves convincing a buyer to continue a transaction outside the protected platform environment — via a separate payment app, email, or messaging channel. The message doesn't need to contain a phone number or payment link to accomplish this. It can suggest the redirection through urgency framing, claimed platform limitations, or manufactured rapport. The linguistic fingerprint of this manipulation is detectable before any prohibited content appears.

Psycholinguistic Signals in Platform Abuse Contexts

The signals that deception detection measures aren't domain-specific — they emerge from the cognitive mechanics of deception itself. As we covered in our foundational explainer on psycholinguistic signals, deceptive text reliably differs from truthful text across five measurable dimensions. Here's how those dimensions manifest specifically in T&S contexts:

Signal	T&S Manifestation	Example Pattern
Pronoun Distancing	Fraudster avoids personal ownership of claims	"Payment is processed through a separate system" vs. "I handle payment separately"
Emotional Manipulation	Artificially inflated affective language to accelerate trust	Romance scam: unsolicited expressions of deep connection after minimal interaction
Hedging	Vague framing around transaction mechanics or identity claims	"It should be fine if you just…" / "There might be a small issue with the normal process"
Cognitive Complexity	Oversimplified causal chains; absence of the uncertainty genuine users express	Fake reviews with suspiciously clear cause-effect ("Used it once, immediately noticed the difference")
Detail Specificity	Strategic specificity gaps — concrete where easily verifiable, vague where it matters	Scam listing with precise irrelevant details but vague on shipping, returns, and verification

"Deception produces a unique linguistic profile — not because liars are careless, but because the cognitive load of fabrication is measurable in how people construct sentences." — Hancock, J. T., Curry, L. E., Goorha, S., & Woodworth, M. (2008). On lying and being lied to. Discourse Processes, 45(1), 1–23.

The advantage for T&S is that these signals operate independently of content. A romance scammer who avoids all financial keywords still produces the emotional manipulation signature. A fake review with no prohibited terms still produces the cognitive complexity gap. A marketplace fraudster using clean language still produces the pronoun distancing and hedging pattern. The signal is in the structure, not the vocabulary.

Walkthrough: Analyzing a Marketplace Fraud Message

The following is a fictional message constructed to demonstrate the analysis. It represents a common off-platform redirection pattern — a seller encouraging a buyer to complete a transaction outside the marketplace. All names and details are invented.

"Hi there! I noticed your inquiry about the item. I do want to mention there's currently a small issue with the platform's payment processing — it's been affecting a few sellers this week, totally outside my control. What tends to work better in situations like this is if we handle it directly, which actually gets things moving faster for you anyway. I've done this with several buyers and it's always gone smoothly. I can assure you the item is exactly as described and you'll be very happy. Just message me at the contact in my profile and we can sort out the details there."

Pronoun Distancing: 79
Emotional Manipulation: 85
Hedging: 74
Cognitive Complexity: 61
Detail Specificity: 68
Overall Deception Score: 81 (High Risk)

Walk through the signal drivers.

Emotional manipulation at 85: The message opens with false rapport, pivots to manufactured urgency ("affecting a few sellers this week"), then reassures with social proof ("I've done this with several buyers"). All three are classic trust-acceleration techniques deployed in a single paragraph, from a stranger, in a first message, about a financial transaction.

Pronoun distancing at 79: The platform problem is passive and external ("there's currently a small issue," "totally outside my control"). The seller avoids owning the claim or providing any verifiable detail about it.

Hedging at 74: "Tends to work better," "situations like this," "sort out the details there." The message is consistently vague about the specific mechanics of what the off-platform transaction would involve.

Detail specificity at 68: The message is specific about the alleged problem's existence but gives zero verifiable detail: no ticket number, no platform notification, no timeframe. Strategic vagueness where specifics would undermine the fabrication.

No keyword in this message would trip a standard filter. The score of 81 reflects the structural manipulation pattern — and that pattern is identifiable before any money moves.

Integrating Candor into a Moderation Pipeline

The API returns a score and signal breakdown in a single call. Integration fits into three common T&S workflow patterns:

Real-Time Messaging Analysis

For platforms with user-to-user messaging — marketplaces, dating apps, community platforms — the analysis runs inline as messages are sent. Messages above a threshold score are held for review or soft-blocked (shown as sent, queued for moderator review before delivery). This pattern catches manipulation before victim contact is completed.

# Analyze a user message before delivery
curl -X POST https://getcandor.polsia.app/api/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi there! I noticed your inquiry about the item..."
  }'

# Response
{
  "score": 81,
  "is_deceptive": true,
  "signals": {
    "pronoun_distancing": 79,
    "emotional_manipulation": 85,
    "hedging": 74,
    "cognitive_complexity": 61,
    "detail_specificity": 68
  },
  "flagged_sentences": [
    "I've done this with several buyers and it's always gone smoothly.",
    "there's currently a small issue with the platform's payment processing"
  ]
}
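The same call can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the endpoint and field names are taken from the sample above, while `Analysis`, `build_request`, `parse_analysis`, and `top_signal` are hypothetical helper names introduced here. Error handling, retries, and rate limiting are left out.

```python
import json
import urllib.request
from dataclasses import dataclass

# Endpoint copied from the curl example above.
API_URL = "https://getcandor.polsia.app/api/analyze"


@dataclass
class Analysis:
    score: int
    is_deceptive: bool
    signals: dict
    flagged_sentences: list


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def parse_analysis(raw: str) -> Analysis:
    """Parse a response body that has the fields of the sample response."""
    data = json.loads(raw)
    return Analysis(
        score=data["score"],
        is_deceptive=data["is_deceptive"],
        signals=data["signals"],
        flagged_sentences=data.get("flagged_sentences", []),
    )


def top_signal(analysis: Analysis) -> str:
    """Name of the highest-scoring signal, usable as a hint for reviewers."""
    return max(analysis.signals, key=analysis.signals.get)
```

To send the request, pass the result of `build_request` to `urllib.request.urlopen` and hand the decoded body to `parse_analysis`. For the sample response, `top_signal` would point the reviewer at `emotional_manipulation`.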

Batch Review Queue Scoring

For review queue triage, especially for user-generated content like reviews, listings, or support tickets, batch processing ranks items by deception score before they reach a human reviewer. Moderators start with the highest-risk items, so the same number of reviewers covers more ground, and fewer violations slip through because review time goes to flagged content rather than routine content.

At standard API throughput, scoring 10,000 messages takes under 30 minutes of unattended processing. The output is a ranked list — score, signal breakdown, flagged sentences — ready to import into your review tooling.
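As a rough sketch of the ranking step, assuming each item has already been scored and carries a `score` field like the sample response: `rank_for_review` and `required_rate` are hypothetical names, and the rate calculation simply restates the 10,000-in-30-minutes figure as messages per second.

```python
from typing import Iterable, List


def rank_for_review(scored_items: Iterable[dict]) -> List[dict]:
    """Sort items highest score first; items with equal scores keep input order."""
    return sorted(scored_items, key=lambda item: item["score"], reverse=True)


def required_rate(message_count: int, minutes: float) -> float:
    """Sustained messages per second needed to finish a batch in the given time."""
    return message_count / (minutes * 60)


queue = [
    {"id": "review-1", "score": 42},
    {"id": "listing-7", "score": 81},
    {"id": "ticket-3", "score": 67},
]
ranked = rank_for_review(queue)
```

Here `ranked` starts with `listing-7`, and `required_rate(10_000, 30)` works out to roughly 5.6 messages per second.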

Risk Signal in a Composite Score

The deception score works well as one signal in a composite risk model. Combine it with behavioral signals (account age, transaction velocity, device fingerprint), network signals (known fraud ring associations), and content signals (keyword flags, image hash matches). The linguistic score adds a dimension that none of the other signals provide: intent inference from text structure, independent of the content's surface form.

An account that is 90 days old, has a clean behavioral record, and sends one high-scoring message presents a different risk profile from a 2-day-old account with the same score. The combination is stronger than either signal alone.
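One way to picture the combination is a simple weighted blend. Everything in this sketch is an assumption for illustration: the `AccountContext` fields, the 90-day age decay, and the weights are placeholders, not tuned values or part of the API. A production risk model would more likely be learned from labeled outcomes.

```python
from dataclasses import dataclass


@dataclass
class AccountContext:
    account_age_days: int
    keyword_flag: bool      # content signal from an existing keyword filter
    known_ring_link: bool   # network signal from fraud-ring association data


def composite_risk(deception_score: float, ctx: AccountContext) -> float:
    """Blend the linguistic score with other signals into a 0-100 risk value.

    The weights below are illustrative placeholders.
    """
    # Newer accounts carry more risk; the effect fades linearly over 90 days.
    age_risk = max(0.0, 1.0 - ctx.account_age_days / 90.0) * 100.0
    keyword_risk = 100.0 if ctx.keyword_flag else 0.0
    network_risk = 100.0 if ctx.known_ring_link else 0.0
    risk = (
        0.5 * deception_score
        + 0.2 * age_risk
        + 0.1 * keyword_risk
        + 0.2 * network_risk
    )
    return round(risk, 1)


established = composite_risk(81, AccountContext(90, False, False))
brand_new = composite_risk(81, AccountContext(2, False, False))
```

With these placeholder weights, the same message score of 81 yields a lower composite for the 90-day-old account than for the 2-day-old one, which is the behavior the paragraph above describes.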

Threshold Calibration for T&S

Score thresholds should be calibrated to your platform's moderation capacity and risk tolerance. Starting points:

Score ≥ 75: High risk. Hold for moderator review before delivery or publication. In messaging contexts, consider soft-blocking while review completes.
Score 55–74: Elevated. Flag for downstream review; do not intervene in real time. Useful for building a second-pass review queue.
Score < 55: Normal processing. No intervention; log the score for trend monitoring.
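The three bands above translate directly into a routing function. This is a minimal sketch using the starting-point thresholds; the function name and the action labels are hypothetical, and fractional scores between 74 and 75 are treated as elevated.

```python
HOLD_THRESHOLD = 75   # starting point from the list above
FLAG_THRESHOLD = 55   # starting point from the list above


def route(score: float) -> str:
    """Map a deception score to a moderation action."""
    if score >= HOLD_THRESHOLD:
        return "hold_for_review"       # hold or soft-block until a moderator decides
    if score >= FLAG_THRESHOLD:
        return "flag_for_second_pass"  # no real-time intervention
    return "log_only"                  # normal processing, score kept for trends
```

Note that every branch ends in logging or human review; none of them bans a user on the score alone.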

These are starting points, not prescriptions. A dating platform with known romance scam exposure should run a tighter threshold than a B2B SaaS support channel. The right calibration comes from running your known-fraud cases through the API and observing where they score — then setting thresholds based on your team's review capacity.
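The calibration described above can be sketched as two small helpers: one picks the lowest hold threshold whose daily volume fits review capacity, the other reports how many known-fraud cases that threshold would catch. The function names and the sample numbers are hypothetical, and the approach assumes you have a representative day of scores plus a set of scored known-fraud cases.

```python
from typing import Optional, Sequence


def calibrate_hold_threshold(
    daily_scores: Sequence[float],
    daily_review_capacity: int,
    floor: int = 55,
) -> Optional[int]:
    """Lowest integer threshold (from `floor` up to 100) whose held volume
    fits within review capacity, or None if no such threshold exists."""
    for threshold in range(floor, 101):
        held = sum(1 for score in daily_scores if score >= threshold)
        if held <= daily_review_capacity:
            return threshold
    return None


def recall_at(threshold: float, known_fraud_scores: Sequence[float]) -> float:
    """Fraction of known-fraud cases that score at or above the threshold."""
    if not known_fraud_scores:
        return 0.0
    caught = sum(1 for score in known_fraud_scores if score >= threshold)
    return caught / len(known_fraud_scores)


threshold = calibrate_hold_threshold([30, 40, 56, 60, 70, 78, 81, 90], 3)
coverage = recall_at(threshold, [81, 76, 68, 90])
```

If the resulting recall on known-fraud cases is too low, that is a signal to add review capacity or route the missed band to a second-pass queue rather than to lower the threshold past what the team can handle.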

Full API documentation, error codes, rate limits, and authentication are at /docs. For model performance data and benchmark methodology, see /validation.

The Honest Limits

Linguistic analysis is a triage signal. It is not a fraud verdict. A high score means the text warrants attention; it does not mean the user is a bad actor. New users, non-native speakers, and people writing under stress all produce linguistic patterns that can elevate scores for reasons unrelated to deception. Any deployment that uses a score as a ban trigger without human review in the loop is misusing the tool.

The appropriate frame is: deception detection surfaces the messages that deserve a second look. The moderator makes the call. As with our insurance fraud use case, the value is in prioritization — not replacement of human judgment.

We also validate publicly. Current model performance on the LIAR benchmark (1,017 labeled statements, F1 = 0.534) is published at /validation with full methodology. The LIAR dataset is short political speech, which is a harder domain for psycholinguistic signals than longer narrative text. We expect performance on platform abuse messages, which tend to be longer and more manipulative in structure, to be materially stronger. Domain-specific validation is in progress; when it's ready, we'll publish it on the same page with the same methodology. No selective reporting. For more context on what F1 = 0.534 means and how it compares to the human baseline, read our validation explainer.
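For readers who want the definition at hand: F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts in the example are made up to show the arithmetic and are not the model's actual results on LIAR.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: precision 0.5 and recall 0.5 give F1 = 0.5.
example = f1_score(true_pos=50, false_pos=50, false_neg=50)
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 by maximizing recall while ignoring precision, or the reverse.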

Add linguistic deception detection to your moderation stack

Test the API on your own message samples. Read the integration docs. See the live benchmark results.


References

Federal Trade Commission. (2023). Consumer Sentinel Network Data Book 2022. FTC.gov.
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129(1), 74–118.
Hancock, J. T., Curry, L. E., Goorha, S., & Woodworth, M. (2008). On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes, 45(1), 1–23.
Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5), 665–675.
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive opinion spam by any stretch of the imagination. Proceedings of ACL 2011, 309–319.
Levitan, S. I., et al. (2016). Cross-cultural production and detection of deception from speech. Proceedings of the Workshop on Computational Approaches to Deception Detection.
Wang, W. Y. (2017). "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. Proceedings of ACL 2017, 422–426.