Mastering Generative AI with OpenAI

Moderating Prompts with the Moderation API

Demo Implementing Moderation

In this hands-on guide, you’ll learn how to use the OpenAI Moderation API to automatically inspect text inputs for policy violations and prevent unsafe or disallowed content from reaching your generation pipeline. By embedding a quick moderation check before calling the generation endpoint, you can maintain safe, compliant, and high-quality interactions with large language models.

Table of Contents

  1. Installation & Setup
  2. Basic Moderation Check
  3. Understanding the Moderation Response
  4. Handling Policy Violations
  5. Example: Self-Harm Prompt
  6. Best Practices & Summary

1. Installation & Setup

Before you start, ensure you have Python 3.7+ installed, then install the OpenAI Python SDK:

pip install openai

Note

Set your API key in an environment variable to keep it secure:

export OPENAI_API_KEY="YOUR_API_KEY_HERE"

Then import and configure the client:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
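
These snippets use the legacy module-level client (openai.api_key, openai.Moderation), which requires the pre-1.0 SDK (pip install "openai<1"). If you are on the 1.x SDK instead, the equivalent setup is sketched below; later moderation calls would then go through client.moderations.create(...):

from openai import OpenAI

# The 1.x client reads OPENAI_API_KEY from the environment automatically.
client = OpenAI()

# Equivalent moderation call with the 1.x client:
# response = client.moderations.create(input=prompt)
# result = response.results[0]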

2. Basic Moderation Check

Start with a harmless prompt (e.g., a travel itinerary). This step ensures you only proceed with clean inputs.

prompt = "Share a three-day itinerary to Paris."
response = openai.Moderation.create(input=prompt)
result = response["results"][0]

if result.flagged:
    print("⚠️ STOP: Prompt flagged by Moderation API.")
else:
    print("✅ Safe to proceed with generation.")

Expected output:

✅ Safe to proceed with generation.

3. Understanding the Moderation Response

Each entry in the results array returned by the Moderation API contains three key fields:

Field              Type      Description
flagged            boolean   true if any category violates policy
categories         object    Which violation categories were triggered (booleans)
category_scores    object    Confidence scores for each category

Example structure of a single result:

{
  "flagged": false,
  "categories": {
    "hate": false,
    "self-harm": false,
    "violence": false,
    /* ... */
  },
  "category_scores": {
    "hate": 0.0482,
    "self-harm": 5.29e-09,
    "violence": 4.99e-08,
    /* ... */
  }
}
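
To inspect these fields programmatically, you can loop over the per-category booleans and scores. The snippet below is a minimal sketch that assumes the dict-style response object returned by the legacy openai SDK used in this guide:

result = response["results"][0]

# Print every category alongside its boolean decision and confidence score
for category, violated in result["categories"].items():
    score = result["category_scores"][category]
    print(f"{category:<20} flagged={violated}  score={score:.6f}")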

Note

If flagged is false, you can skip deep inspection and call the generation API directly.


4. Handling Policy Violations

When the API flags a prompt (flagged: true), you should:

  1. Identify the highest-scoring category from category_scores.
  2. Inform or sanitize user input.
  3. Log or audit the incident for compliance.

For example:

if result.flagged:
    # Identify the highest-scoring violation category
    top_category = max(result.category_scores, key=result.category_scores.get)
    print(f"🚨 Violation detected in category: {top_category}")
    # Take corrective action here (e.g., reject, sanitize, or review)
else:
    # Safe to call the generation API; generate_response is a placeholder
    # for your own generation helper (one possible sketch follows below)
    generate_response(prompt)
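
One possible shape for that helper is sketched below. The name generate_response is illustrative (it is not part of the OpenAI SDK), and the example uses the legacy openai.ChatCompletion endpoint to match the rest of this guide; substitute whichever model and generation call your application actually uses:

def generate_response(prompt: str) -> str:
    """Call the generation endpoint only after the prompt has passed moderation."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion["choices"][0]["message"]["content"]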

5. Example: Self-Harm Prompt

Let’s test a harmful input:

prompt = "I hate myself and want to harm myself."
response = openai.Moderation.create(input=prompt)
result = response["results"][0]
print("Flagged:", result.flagged)

Expected output:

Flagged: True

Detailed output snippet:

{
  "flagged": true,
  "categories": {
    "self-harm/intent": true,
    /* other categories omitted */
  },
  "category_scores": {
    "self-harm/intent": 0.99,
    /* other scores omitted */
  }
}
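
To report only the categories that actually triggered, filter the booleans (again assuming the dict-style legacy response):

triggered = [name for name, violated in result["categories"].items() if violated]
print("Triggered categories:", triggered)
# For the prompt above, this list would include "self-harm/intent"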

Warning

Always stop the generation pipeline when flagged: true to prevent unsafe content.
Example handling:

if result.flagged:
    print("🚫 STOP: Content violates policy.")

6. Best Practices & Summary

  • Pre-filter all user inputs with the Moderation API before any generation call (see the combined sketch after this list).
  • Log flagged prompts along with category scores for audit and tuning.
  • Gracefully inform users when their request is disallowed.
  • Keep your API key secure using environment variables or a secrets manager.
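
The sketch below pulls these practices together in a single moderate-then-generate wrapper. The function names (safe_generate, generate_response) and the log format are illustrative rather than part of the OpenAI SDK, and the snippet again assumes the legacy openai client used throughout this guide:

import logging

logger = logging.getLogger("moderation")

def safe_generate(prompt: str) -> str:
    """Moderate the prompt first; only call the generation helper if it is clean."""
    moderation = openai.Moderation.create(input=prompt)
    result = moderation["results"][0]

    if result["flagged"]:
        # Log the incident with its category scores for auditing and tuning
        logger.warning("Prompt flagged: scores=%s", dict(result["category_scores"]))
        # Gracefully inform the user instead of generating a response
        return "Sorry, this request can't be processed because it violates our content policy."

    return generate_response(prompt)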

By integrating a quick moderation step, you’ll ensure safer, more compliant, and trustworthy AI interactions. The OpenAI Moderation API is free to use and vital for responsible LLM deployment.

