The user-generated content problem
The moment you allow users to submit content to your platform - reviews, comments, listings, forum posts, messages - you become responsible for what appears there. Harmful content that slips through damages your community, your brand, and, depending on your jurisdiction, your legal standing.
Hiring human moderators is expensive and doesn't scale. Keyword blocklists are brittle - they flag harmless words and miss creative evasions. AI classification gives you a third path: a model that understands context and can evaluate each submission against your specific content policy, at any volume, in real time.
How classification-based moderation works
The approach is straightforward: before any user-submitted content is published, you send it to a classification API. The API returns a label - safe, review_needed, or block - along with a confidence score. Your application then acts on that label: publish immediately, hold for review, or reject with an error message.
Critically, the model evaluates the full context of the submission, not just isolated words. "I'm going to kill this feature request" is classified differently from a genuine threat. "This product cured my addiction to overspending" is classified differently from content promoting substance abuse.
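The classify-then-act flow described above can be sketched as a small gate function. This is a minimal illustration, not the official SDK: it assumes the safe / review_needed / block labels from this section, and `classify` is any callable you write that wraps the API call and returns a dict with a `label` key, injected here so the routing stays testable without a live API key.

```python
def handle_submission(text, classify):
    """Pre-publish gate: classify first, then act on the label.

    `classify` stands in for the classification API call and must
    return a dict with at least a 'label' key.
    """
    actions = {
        "safe": "publish",        # show the content immediately
        "review_needed": "hold",  # queue for a human moderator
        "block": "reject",        # refuse with an error message
    }
    result = classify(text)
    # Unknown or unexpected labels fall through to human review.
    return actions.get(result["label"], "hold")

# Stubbed classifier for illustration:
stub = lambda text: {"label": "safe", "confidence": 0.97}
handle_submission("Great post, really helpful!", stub)  # returns "publish"
```

Keeping the API call behind an injected callable also makes it trivial to swap in a cached or batched classifier later without touching the gate logic.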
Designing your category set
Your categories should reflect your actual content policy, not a generic one. A few examples:
For a marketplace with product listings:
- compliant - meets listing standards
- prohibited_item - item not allowed on the platform
- misleading - title or description appears deceptive
- adult_content - needs age-gating or removal
- needs_review - uncertain, requires human judgment
For a community forum or comment section:
- safe
- spam
- harassment
- hate_speech
- misinformation
- self_harm
- needs_review
Being specific with your categories gives the model more precise signal and makes your downstream logic cleaner. "Harmful" is too broad; "harassment" and "hate_speech" are actionable.
A real moderation request
Here's a comment submitted to a community forum:
{
"input": "Nobody asked for your opinion. You're the dumbest person here and everyone agrees. Stop posting.",
"categories": ["safe", "spam", "harassment", "hate_speech", "needs_review"]
}
Full request:
curl -X POST https://api.classifaily.com/v1/classify \
-H "Authorization: Bearer cai_live_your_key_here" \
-H "Content-Type: application/json" \
-d '{
"input": "Nobody asked for your opinion. You'\''re the dumbest person here and everyone agrees. Stop posting.",
"categories": ["safe", "spam", "harassment", "hate_speech", "needs_review"],
"explain": true
}'
Response:
{
"label": "harassment",
"confidence": 0.94,
"reasoning": "Comment directly attacks a specific individual using demeaning language and attempts to silence them.",
"request_id": "req_01hz..."
}
With a high-confidence harassment label, the action is clear: block the submission, return an error to the user, and log the event. If you run a strike system, increment the user's count.
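The shell quoting in the curl example (the '\'' escape around the apostrophe) is easy to get wrong; building the same request in application code sidesteps it entirely. A sketch in Python using only the standard library, assuming the endpoint, headers, and payload fields shown in the curl example:

```python
import json
import urllib.request

API_URL = "https://api.classifaily.com/v1/classify"  # from the curl example

def build_request(text, categories, api_key, explain=True):
    """Build an urllib Request mirroring the curl call above.

    json.dumps handles apostrophes and other quoting for us.
    """
    payload = {"input": text, "categories": categories, "explain": explain}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "Nobody asked for your opinion. You're the dumbest person here "
    "and everyone agrees. Stop posting.",
    ["safe", "spam", "harassment", "hate_speech", "needs_review"],
    "cai_live_your_key_here",
)
# urllib.request.urlopen(req) would send it; omitted here.
```

The same pattern works with any HTTP client; the point is to let a JSON serializer handle escaping rather than doing it by hand in a shell string.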
The three-tier routing pattern
Not every piece of content is clearly safe or clearly harmful. A good moderation system uses three tiers based on confidence:
- Auto-publish - label is safe with confidence > 0.85. Publish immediately with no delay.
- Human review queue - confidence is below your threshold on any label, or label is needs_review. Hold the submission and notify a moderator.
- Auto-block - label is harassment, hate_speech, or self_harm with confidence > 0.80. Reject immediately and log.
Adjust the confidence thresholds to match how your community behaves and how much tolerance you have for false positives. A children's education platform will set much tighter thresholds than an adult professional network.
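The three tiers reduce to one small routing function. A sketch with the thresholds used above exposed as parameters so they can be tuned per community; the set of auto-block labels is an assumption you should replace with your own policy:

```python
# Labels that should never be auto-published (assumed policy - edit to taste).
AUTO_BLOCK_LABELS = {"harassment", "hate_speech", "self_harm"}

def route(label, confidence, publish_threshold=0.85, block_threshold=0.80):
    """Map a classification result to one of the three tiers."""
    if label in AUTO_BLOCK_LABELS and confidence > block_threshold:
        return "auto_block"
    if label == "safe" and confidence > publish_threshold:
        return "auto_publish"
    # needs_review, or any label that didn't clear its threshold.
    return "review_queue"

route("harassment", 0.94)  # returns "auto_block"
route("safe", 0.62)        # returns "review_queue"
```

Note that a low-confidence harassment label lands in the review queue rather than being auto-blocked: when the model is unsure, a human decides.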
Pre-publish vs. post-publish moderation
You have two timing options. Pre-publish moderation classifies content before it's ever stored or shown - the API call is in the hot path of the submission request. Post-publish moderation stores the content immediately and classifies asynchronously, taking it down if flagged.
Pre-publish is the safer default but adds latency to every submission. Since classifaily targets sub-300ms responses, it's fast enough for most synchronous flows. Post-publish makes sense for very high-volume scenarios where even 300ms is too much, or where content going briefly live is an acceptable trade-off for throughput.
Batch moderation for existing content
If you have a backlog of existing user content to audit - maybe you're adopting AI moderation on an existing platform - classifaily's batch endpoint lets you send up to 25 items in a single request:
{
"items": [
{ "id": "comment_001", "content": "Great post, really helpful!" },
{ "id": "comment_002", "content": "..." },
...
],
"categories": ["safe", "spam", "harassment", "hate_speech", "needs_review"]
}
Each item gets its own label and confidence score. You can run through thousands of existing records in a scheduled job without hitting rate limits.
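A backlog audit mostly comes down to chunking your records to the 25-item limit. A sketch of that step, assuming the batch payload shape shown above; `comments` is whatever iterable of (id, content) pairs your database query yields:

```python
BATCH_LIMIT = 25  # max items per batch request, per the docs above
CATEGORIES = ["safe", "spam", "harassment", "hate_speech", "needs_review"]

def chunk_backlog(comments):
    """Yield batch request payloads of at most BATCH_LIMIT items each.

    `comments` is an iterable of (id, content) pairs.
    """
    batch = []
    for comment_id, content in comments:
        batch.append({"id": comment_id, "content": content})
        if len(batch) == BATCH_LIMIT:
            yield {"items": batch, "categories": CATEGORIES}
            batch = []
    if batch:  # flush the final partial batch
        yield {"items": batch, "categories": CATEGORIES}
```

Because this is a generator, it streams through a table of any size without loading the whole backlog into memory, which suits a scheduled job.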
Appeals and false positives
AI moderation will occasionally flag legitimate content. You need a way for users to appeal decisions. The minimum viable appeals flow: show the user a message explaining their content was removed, provide a way to dispute it, and route disputes to a human moderator who can override the AI decision.
Tracking overrides over time also gives you a feedback signal: if a particular category is generating lots of false positives for your community, you can adjust your confidence thresholds or refine your category definitions.
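Turning moderator overrides into that feedback signal can be as simple as counting reversals per label. A sketch, where `decisions` is a hypothetical log of (ai_label, overridden) pairs your appeals flow would record:

```python
from collections import Counter

def override_rates(decisions):
    """Compute the human-override rate per AI label.

    `decisions` is an iterable of (ai_label, overridden) pairs, where
    `overridden` is True when a moderator reversed the AI decision.
    """
    total, overridden = Counter(), Counter()
    for label, was_overridden in decisions:
        total[label] += 1
        if was_overridden:
            overridden[label] += 1
    return {label: overridden[label] / total[label] for label in total}
```

A label whose override rate creeps up is your cue to raise its confidence threshold or tighten its category definition.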
Start with 100 free requests. No setup required.
Get started free