Welcome back. In this lesson we’ll build an asynchronous audio-to-insight pipeline that:
  • Accepts a WAV audio file
  • Transcribes the audio
  • Detects the spoken language and a prevailing emotion/tone
  • Translates the transcription into English
  • Generates a concise summary and a suggested title
This design splits each step into a small, composable async function, so you can reuse or replace individual components (for example, swapping in a different model or a custom agent).
Store your OpenAI API key in a .env file (for example OPENAI_API_KEY=<your_key>). This lesson will load environment variables via python-dotenv.
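A missing key is easier to diagnose if the script fails early with a clear message. The following is a minimal sketch, assuming the variable name OPENAI_API_KEY from above. The helper name require_env is hypothetical and is not part of python-dotenv or the OpenAI client; in this lesson you would call load_dotenv() first so that values from .env are present in the environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; call load_dotenv() before using it
    so that values from the .env file are available in os.environ.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Example usage (after load_dotenv()):
# api_key = require_env("OPENAI_API_KEY")
```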
Quick overview of the pipeline steps and their corresponding functions:

Step  Purpose                                      Function
1     Transcribe WAV audio to text                 transcribe_audio
2     Detect language and one-word emotional tone  analyze_language_and_emotion
3     Translate text to English                    translate_text
4     Produce suggested title and short summary    generate_title_and_summary

Setup and imports

Load environment variables, initialize the OpenAI client, and import utilities. This example uses the current OpenAI Python client (OpenAI()) and assumes an agents package that provides Agent and Runner.
from dotenv import load_dotenv
import os
from pathlib import Path
import asyncio
import re

# OpenAI Python client
from openai import OpenAI

# Optional display in notebooks
from IPython.display import Image, display

# Agent and Runner from the agents package (assumed to be installed)
from agents import Agent, Runner

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Be mindful of API usage and costs when using large models like gpt-4 and uploading audio files. Use lower-cost models for development and testing if desired.

Transcription (Whisper)

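We create an async helper that uploads a WAV file to the Whisper transcription model and returns the transcription text. The function validates the path and handles the common return shapes from the client. Note that the client call itself is synchronous, so it blocks the event loop while the request is in flight.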
async def transcribe_audio(file_path: str) -> str:
    """
    Transcribe a WAV file using the Whisper model and return the transcription text.
    """
    # Ensure the file exists
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")

    with open(file_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )

    # The client normally returns an object with a .text attribute;
    # fall back to dict access in case a plain dict is returned instead
    if isinstance(transcript, dict):
        return transcript.get("text", "")
    return getattr(transcript, "text", "")
Reference: Whisper docs — https://platform.openai.com/docs/models/whisper-1
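Because the client call inside transcribe_audio is synchronous, the event loop cannot do other work while the upload and transcription run. One way around that is asyncio.to_thread (Python 3.9+), sketched here with a stand-in function rather than a real API call. The names slow_transcribe and transcribe_many are hypothetical; in the lesson's code the blocking callable would be client.audio.transcriptions.create.

```python
import asyncio
import time


def slow_transcribe(name: str) -> str:
    """Stand-in for a blocking SDK call (hypothetical, no network access)."""
    time.sleep(0.1)  # simulate time spent waiting on the request
    return f"transcript of {name}"


async def transcribe_many(names: list[str]) -> list[str]:
    """Run each blocking call in a worker thread so the event loop stays free."""
    tasks = [asyncio.to_thread(slow_transcribe, name) for name in names]
    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*tasks)
```

With the real client, the equivalent call would look like await asyncio.to_thread(client.audio.transcriptions.create, model="whisper-1", file=audio_file). The openai package also provides an AsyncOpenAI client whose methods can be awaited directly, which may be the simpler option if the whole pipeline is async.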

Language and Emotion Analysis

Use a chat model to detect the language and provide a one-word emotional descriptor. The function uses a low temperature (0.3) and extracts the values with tolerant regular expressions so that slightly varied replies still parse.
async def analyze_language_and_emotion(text: str) -> dict:
    """
    Ask a chat model to detect the language and a one-word emotional tone for the given text.
    Returns: {"language": "<language>", "emotion": "<emotion>"}
    """
    system_msg = (
        "You're an AI that analyzes messages. Detect the language (e.g., English, French) "
        "and describe the emotional tone in one word (e.g., joyful, sad, angry, professional, excited, persuasive). "
        "Respond in the format:\nLanguage: <language>\nEmotion: <emotion>"
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": f"Here is the message:\n{text}"}
        ],
        temperature=0.3
    )

    content = response.choices[0].message.content.strip()

    # Tolerant regex to capture "Language: ..." and "Emotion: ..." (allow multi-word and punctuation)
    language_match = re.search(r"(?i)^\s*Language[:\-\s]*([^\r\n]+)", content, re.MULTILINE)
    emotion_match = re.search(r"(?i)^\s*Emotion[:\-\s]*([^\r\n]+)", content, re.MULTILINE)

    return {
        "language": language_match.group(1).strip() if language_match else "Unknown",
        "emotion": emotion_match.group(1).strip() if emotion_match else "Unknown"
    }
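Note: a temperature of 0.3 reduces variation in the reply, which makes parsing more reliable, but it does not make the output fully deterministic. That is why the function falls back to "Unknown" when a field cannot be found.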
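To see how the tolerant patterns behave, it can help to pull the parsing into its own function and try it on a few replies without calling the API. This is a sketch that reuses the same regular expressions as analyze_language_and_emotion; the function name parse_analysis and the sample replies are made up for illustration and are not real model output.

```python
import re

# Same tolerant patterns as in analyze_language_and_emotion
LANGUAGE_RE = re.compile(r"^\s*Language[:\-\s]*([^\r\n]+)", re.IGNORECASE | re.MULTILINE)
EMOTION_RE = re.compile(r"^\s*Emotion[:\-\s]*([^\r\n]+)", re.IGNORECASE | re.MULTILINE)


def parse_analysis(content: str) -> dict:
    """Extract language and emotion from a reply, defaulting to 'Unknown'."""
    language = LANGUAGE_RE.search(content)
    emotion = EMOTION_RE.search(content)
    return {
        "language": language.group(1).strip() if language else "Unknown",
        "emotion": emotion.group(1).strip() if emotion else "Unknown",
    }


# Hypothetical replies covering the expected format, a looser variant,
# and a reply that does not follow the format at all
samples = [
    "Language: French\nEmotion: Encouraging",
    "language - Spanish\n  EMOTION:   joyful  ",
    "I could not determine the tone.",
]
for sample in samples:
    print(parse_analysis(sample))
```

Keeping the parser separate from the API call also makes it straightforward to unit test.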

Translator Agent and translate_text

This example defines a simple Agent that translates text into English and uses Runner to execute it. Agent and Runner are assumed to come from the agents package imported earlier, and the result is assumed to expose the translated text as final_output.
translator_agent = Agent(
    name="Translator",
    instructions="Translate the input text into English. Only return the translated result."
)

async def translate_text(text: str) -> str:
    """
    Use the Agent Runner to translate text into English.
    Returns the final translated string returned by the agent.
    """
    result = await Runner.run(translator_agent, input=text)
    # Handle common return shapes: plain string, dict, or object with a final_output attribute
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        return result.get("final_output", "")
    return getattr(result, "final_output", "")
If your agents/runner implementation differs, adapt the return extraction accordingly.
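One way to isolate that adaptation is a small extraction helper that can be tested without running an agent. The sketch below is an assumption about the shapes you might encounter (a plain string, a dict, or an object with a final_output attribute); the name extract_final_output is hypothetical and not part of any agents library.

```python
from types import SimpleNamespace
from typing import Any


def extract_final_output(result: Any) -> str:
    """Return the text from a str, a dict, or an object with .final_output."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        value = result.get("final_output")
    else:
        value = getattr(result, "final_output", None)
    if value is None:
        # Fail loudly instead of silently returning None to the next step
        raise ValueError(
            f"No final_output found on result of type {type(result).__name__}"
        )
    return str(value)


# Stand-in for a runner result object (not a real agent run)
fake_result = SimpleNamespace(final_output="Learning to program is like having a superpower")
print(extract_final_output(fake_result))
```

Raising on a missing value, rather than returning None, keeps the error close to its cause instead of surfacing later in the summary step.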

Title and Summary Generation

Ask a chat model to provide a concise summary and a suggested title. Temperature is slightly higher for creativity (0.5).
async def generate_title_and_summary(text: str) -> str:
    """
    Generate a concise summary and a suggested title for the given text.
    Returns a string containing both title and summary.
    """
    system_msg = "You are a helpful AI assistant. Summarize the user's message and suggest a title for it."

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": f"Here's the text:\n\n{text}"}
        ],
        temperature=0.5
    )

    return response.choices[0].message.content.strip()
Tip: If you prefer structured outputs (e.g., JSON with title and summary), ask the model to respond in JSON and parse the result. For simple display, the free-text response above often suffices.
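A sketch of the parsing side of that approach follows. It assumes you have instructed the model to reply with a JSON object containing the keys title and summary; those key names and the function name parse_title_summary are choices made for this example, not something the API defines. The fallback keeps the free-text behavior when the reply is not valid JSON.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in


def parse_title_summary(raw: str) -> dict:
    """Parse a reply expected to be JSON with 'title' and 'summary' keys.

    Falls back to treating the whole reply as the summary if parsing fails.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"title": "", "summary": raw.strip()}
    if not isinstance(data, dict):
        return {"title": "", "summary": raw.strip()}
    return {
        "title": str(data.get("title", "")).strip(),
        "summary": str(data.get("summary", "")).strip(),
    }
```

In generate_title_and_summary you would then return parse_title_summary(response.choices[0].message.content) instead of the raw string, and update the return type accordingly.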

Full pipeline: process_audio_translation

This orchestrator composes the previous functions into a complete asynchronous flow. Each step’s output is printed; you can replace prints with logging, storage, or event emissions for production usage.
async def process_audio_translation(file_path: str):
    """
    Full pipeline for an audio file:
      1. Transcribe audio
      2. Analyze language and emotion
      3. Translate into English
      4. Generate title and summary
    Prints each result step to the console.
    """
    # 1) Transcribe
    transcript = await transcribe_audio(file_path)
    print(f"Transcript:\n{transcript}\n")

    # 2) Language & emotion analysis
    analysis = await analyze_language_and_emotion(transcript)
    print(f"Detected language: {analysis['language']}")
    print(f"Detected emotion: {analysis['emotion']}\n")

    # 3) Translate to English
    translation = await translate_text(transcript)
    print(f"Translation:\n{translation}\n")

    # 4) Title & summary
    extras = await generate_title_and_summary(translation)
    print(f"Title and Summary:\n{extras}\n")
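Steps 2 and 3 both depend only on the transcript, so they do not have to run one after the other. The sketch below shows the idea with asyncio.gather, using stub coroutines in place of the real functions so it runs without API access. The names fake_analyze, fake_translate, and analyze_and_translate are hypothetical.

```python
import asyncio


async def fake_analyze(text: str) -> dict:
    """Stub standing in for analyze_language_and_emotion (no API call)."""
    await asyncio.sleep(0.05)
    return {"language": "French", "emotion": "Encouraging"}


async def fake_translate(text: str) -> str:
    """Stub standing in for translate_text (no API call)."""
    await asyncio.sleep(0.05)
    return f"[EN] {text}"


async def analyze_and_translate(transcript: str) -> tuple[dict, str]:
    """Run the two independent steps concurrently and return both results."""
    analysis, translation = await asyncio.gather(
        fake_analyze(transcript),
        fake_translate(transcript),
    )
    return analysis, translation
```

This only saves time when the awaited functions actually yield to the event loop. With the synchronous client calls used in this lesson, the steps would still execute sequentially unless the blocking calls are moved to a worker thread or replaced with an async client.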

Run the pipeline

Pass in the full path to your WAV file. In Jupyter or other async-capable REPLs you can await the function directly.
# Replace with the path to your WAV file
audio_path = "/Users/gavinridgeway/Documents/Anaconda/AiAgent/final_fixed.wav"

await process_audio_translation(audio_path)
If you are running from a standard Python script, wrap the call in asyncio.run():
if __name__ == "__main__":
    audio_path = "/path/to/your/file.wav"
    asyncio.run(process_audio_translation(audio_path))
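If you would rather not edit the script each time, the path can be read from the command line. This is a minimal sketch using the standard-library argparse module; the function name parse_args and the argument name audio_path are choices made for this example.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the WAV path from the command line (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(
        description="Run the audio-to-insight pipeline on a WAV file."
    )
    parser.add_argument("audio_path", type=Path, help="Path to the WAV file to process")
    return parser.parse_args(argv)


# Example usage inside the __main__ guard of the lesson's script:
# args = parse_args()
# asyncio.run(process_audio_translation(str(args.audio_path)))
```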

Troubleshooting common issues

  • File not found: ensure the file_path is correct and accessible by your process.
  • UnboundLocalError, NameError, or AttributeError: double-check variable names and that you read the expected attribute from each result (for example result.final_output).
  • API key errors: confirm OPENAI_API_KEY is set and loaded via load_dotenv() or environment variables.
  • Agent/Runner differences: this lesson assumes an agents package that exposes Agent and Runner. Adapt the Runner.run() call and the result access if your agents library returns a different shape.
  • Unexpected model output format: prefer instructing the model to respond in a strict format (for example Language: <language>\nEmotion: <emotion> or JSON), then validate with regex or a JSON parser.
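Several of these problems can be caught before any API call is made. The sketch below checks that the input exists and is a readable WAV file, using the standard-library wave module. The function name check_wav is hypothetical. Note the assumption: wave reads only uncompressed PCM WAV files, so a file it rejects might still be accepted by the transcription API.

```python
import wave
from pathlib import Path


def check_wav(file_path: str) -> float:
    """Validate that file_path is a readable WAV file; return its duration in seconds."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    try:
        with wave.open(str(path), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        # wave raises these for files that are not valid PCM WAV data
        raise ValueError(f"Not a valid WAV file: {file_path}") from exc
    return frames / rate if rate else 0.0
```

Calling check_wav(file_path) at the top of process_audio_translation would surface a bad path or a mislabeled file immediately, before any upload is attempted.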

Example output (expected)

After running on a French sample, the pipeline prints something like:
  • Transcript: “Apprendre à programmer, c’est comme avoir un super-pouvoir…”
  • Detected language: French
  • Detected emotion: Encouraging
  • Translation: “Learning to program is like having a superpower…”
  • Title and Summary: (a short summary and a suggested title)
You now have a working asynchronous pipeline that transcribes audio, detects language and emotion, translates into English, and generates a title plus a short summary. To extend it, consider adding speaker diarization, punctuation normalization, or persistence of outputs to a database for downstream search and analytics.
