Mastering Generative AI with OpenAI
Moderating Prompts with the Moderation API
Moderation API
The OpenAI Moderation API helps you detect policy-violating, harmful, or unsafe content in user inputs before sending them to a language model. Integrating this check early in your pipeline ensures compliance, protects end users, and maintains the integrity of your application.
How the Moderation Endpoint Works
When you submit a prompt to the Moderation API, it returns a JSON payload whose results array contains, for each input, three primary fields:
| Field | Type | Description |
|---|---|---|
| `flagged` | boolean | `true` if any policy violation is detected; `false` otherwise |
| `categories` | object | A map of violation categories (e.g., `hate`, `self-harm`) to booleans |
| `category_scores` | object | Confidence scores (0.0–1.0) for each category |
Example Request
```bash
curl https://api.openai.com/v1/moderations \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text to check for policy violations"
  }'
```
Example Response
```json
{
  "id": "modr-XXXXX",
  "model": "text-moderation-004",
  "results": [
    {
      "flagged": false,
      "categories": {
        "hate": false,
        "harassment": false,
        "self-harm": false
      },
      "category_scores": {
        "hate": 0.01,
        "harassment": 0.02,
        "self-harm": 0.00
      }
    }
  ]
}
```
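If you prefer to issue the same call from Python rather than curl, the sketch below uses the `requests` package (an assumption; any HTTP client works) and simply prints the three fields described in the table above.

```python
# Minimal sketch of the same moderation request in Python, assuming the
# `requests` package is installed and OPENAI_API_KEY is set in the environment.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/moderations",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"input": "Your text to check for policy violations"},
)
result = resp.json()["results"][0]

print(result["flagged"])           # boolean verdict
print(result["categories"])        # per-category booleans
print(result["category_scores"])   # per-category confidence scores
```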
Note
Use the confidence values in `category_scores` to prioritize human review of borderline cases.
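For example, a lightweight triage layer might route prompts whose scores sit just below your rejection threshold to a human queue rather than allowing them silently. The sketch below is illustrative: the 0.4–0.8 review band and the returned labels are assumptions to be tuned against your own data, not part of the API.

```python
# Illustrative triage thresholds -- tune these from your own category_scores trends.
REVIEW_BAND = (0.4, 0.8)  # assumed range: scores here go to human review

def triage(result):
    """Decide what to do with a single moderation result (dict from the API)."""
    if result["flagged"]:
        return "reject"
    # Borderline: nothing was flagged, but one or more scores are uncomfortably high.
    if any(REVIEW_BAND[0] <= score < REVIEW_BAND[1]
           for score in result["category_scores"].values()):
        return "human_review"
    return "allow"
```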
Integrating Moderation into Your Application Workflow
Adopt a secure, four-step flow to vet user inputs before content generation:
- Receive the user prompt.
- Call the Moderation API.
- If `flagged` is `true`, return an error: "Your request violates our content policy and cannot be processed." If `flagged` is `false`, continue.
- Invoke the Generation API and return the generated response to the end user.
```python
# Pseudocode example: screen the prompt before generating a response
class PolicyError(Exception):
    """Raised when a prompt violates the content policy."""

def handle_user_prompt(prompt):
    mod_result = call_moderation_api(prompt)   # step 2: moderation check
    if mod_result["flagged"]:
        raise PolicyError("Content policy violation detected.")  # step 3
    return call_generation_api(prompt)          # step 4: safe to generate
```
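A concrete version of the same flow might look like the minimal sketch below, assuming the official `openai` Python SDK (v1.x) with `OPENAI_API_KEY` in the environment; the chosen chat model and the `PolicyError` class are illustrative, not requirements.

```python
# Minimal sketch of the moderate-then-generate flow,
# assuming the openai Python SDK v1.x and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class PolicyError(Exception):
    """Raised when a prompt violates the content policy."""

def handle_user_prompt(prompt: str) -> str:
    # Step 2: call the Moderation API on the raw user input.
    moderation = client.moderations.create(input=prompt)
    result = moderation.results[0]

    # Step 3: refuse flagged prompts before any generation happens.
    if result.flagged:
        raise PolicyError(
            "Your request violates our content policy and cannot be processed."
        )

    # Step 4: generate and return the response (model choice is illustrative).
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```

In production you would also log the flagged input and its `category_scores` for auditing, as noted under Best Practices below.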
Warning
Always enforce the moderation step. Skipping it may expose your system to disallowed or harmful content.
Best Practices
- Batch multiple inputs in a single moderation request to reduce per-request overhead and latency (see the sketch after this list).
- Monitor and log flagged inputs for auditing and continuous policy tuning.
- Adjust internal thresholds based on `category_scores` trends to minimize false positives.
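Batching in practice: the moderation endpoint accepts a list of strings as input and returns one result per item, in order. The sketch below assumes the v1.x `openai` Python SDK; the sample prompts are placeholders.

```python
# Minimal batching sketch, assuming the openai Python SDK v1.x.
from openai import OpenAI

client = OpenAI()

prompts = ["first user prompt", "second user prompt", "third user prompt"]

# One request covers all prompts; results come back in the same order.
moderation = client.moderations.create(input=prompts)

for prompt, result in zip(prompts, moderation.results):
    status = "flagged" if result.flagged else "ok"
    print(f"{status}: {prompt}")
```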