Demo Automatic Language Recognition and Translation
Asynchronous Python pipeline that transcribes WAV audio, detects language and emotion, translates to English, and generates concise summaries and suggested titles using OpenAI models.
Welcome back. In this lesson we’ll build an asynchronous audio-to-insight pipeline that:
Accepts a WAV audio file
Transcribes the audio
Detects the spoken language and a prevailing emotion/tone
Translates the transcription into English
Generates a concise summary and a suggested title
This design separates each step into small, composable async functions so you can reuse or replace individual components (for example swapping models or custom agents).
Store your OpenAI API key in a .env file (for example OPENAI_API_KEY=<your_key>). This lesson will load environment variables via python-dotenv.
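Quick overview of the pipeline steps and the corresponding functions:
Transcribe audio: transcribe_audio
Detect language and emotion: analyze_language_and_emotion
Translate to English: translate_text
Generate title and summary: generate_title_and_summary
Run the full pipeline: process_audio_translation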
Load environment variables, initialize the OpenAI client, and import utilities. This example uses the modern OpenAI Python client (OpenAI()), plus an assumed agents package providing Agent and Runner as used in the original material.
from dotenv import load_dotenv
import os
from pathlib import Path
import asyncio
import re

# OpenAI Python client
from openai import OpenAI

# Optional display in notebooks
from IPython.display import Image, display

# Agent & Runner (kept as in the original content)
from agents import Agent, Runner

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Be mindful of API usage and costs when using large models like gpt-4 and uploading audio files. Use lower-cost models for development and testing if desired.
We create an async helper to upload a WAV file to the Whisper transcription model and return the transcription text. The function validates the path and handles common return shapes from the client.
async def transcribe_audio(file_path: str) -> str:
    """
    Transcribe a WAV file using the Whisper model and return the transcription text.
    """
    # Ensure the file exists
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")

    def _transcribe():
        # The OpenAI client call is synchronous, so it runs in a worker thread
        # (via asyncio.to_thread below) instead of blocking the event loop
        with open(path, "rb") as audio_file:
            return client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )

    transcript = await asyncio.to_thread(_transcribe)

    # The client returns an object with a text attribute; fall back to dict access
    if isinstance(transcript, dict):
        return transcript.get("text", "")
    return getattr(transcript, "text", "")
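Checking that the path exists does not confirm the file is actually WAV audio. As a minimal sketch, assuming the input is PCM-encoded WAV (the format the standard-library wave module reads), a pre-flight check could look like the following. The helper name describe_wav is hypothetical and not part of the lesson's code.

```python
import wave
from pathlib import Path


def describe_wav(file_path: str) -> dict:
    """Return basic properties of a WAV file, or raise if it cannot be read as WAV.

    Hypothetical helper: it only inspects the header locally and never calls the API.
    """
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    try:
        with wave.open(str(path), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "sample_rate": rate,
                "duration_seconds": frames / rate if rate else 0.0,
            }
    except (wave.Error, EOFError) as exc:
        # wave.Error covers bad or unsupported headers; EOFError covers empty files
        raise ValueError(f"Not a readable WAV file: {file_path}") from exc
```

Calling describe_wav(file_path) before transcribe_audio(file_path) lets the pipeline fail early on a mislabeled file without spending an upload. Compressed or otherwise unusual WAV variants may be rejected by the wave module even when the transcription API would accept them, so treat this as an optional guard.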
Use a chat model to detect the language and provide a one-word emotional descriptor. The function uses a low temperature (0.3), which makes replies more consistent but not strictly deterministic, and extracts values using tolerant regular expressions to handle slightly varied replies.
async def analyze_language_and_emotion(text: str) -> dict:
    """
    Ask a chat model to detect the language and a one-word emotional tone for the given text.
    Returns: {"language": "<language>", "emotion": "<emotion>"}
    """
    system_msg = (
        "You're an AI that analyzes messages. Detect the language (e.g., English, French) "
        "and describe the emotional tone in one word (e.g., joyful, sad, angry, professional, excited, persuasive). "
        "Respond in the format:\nLanguage: <language>\nEmotion: <emotion>"
    )

    # The client call is synchronous, so run it in a worker thread to keep the event loop free
    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": f"Here is the message:\n{text}"}
        ],
        temperature=0.3
    )

    # message.content can be None, so fall back to an empty string before parsing
    content = (response.choices[0].message.content or "").strip()

    # Tolerant regex to capture "Language: ..." and "Emotion: ..." (allow multi-word and punctuation)
    language_match = re.search(r"(?i)^\s*Language[:\-\s]*([^\r\n]+)", content, re.MULTILINE)
    emotion_match = re.search(r"(?i)^\s*Emotion[:\-\s]*([^\r\n]+)", content, re.MULTILINE)

    return {
        "language": language_match.group(1).strip() if language_match else "Unknown",
        "emotion": emotion_match.group(1).strip() if emotion_match else "Unknown"
    }
Note: temperature is set to 0.3 to favor more deterministic outputs, which helps reliable parsing of the model response.
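To see how tolerant the parsing is, the same two regular expressions can be exercised offline against hand-written replies. This is a small sketch that needs no API call. The function name parse_analysis and the sample strings are made up for illustration.

```python
import re

# Same patterns as in analyze_language_and_emotion
LANGUAGE_RE = re.compile(r"(?i)^\s*Language[:\-\s]*([^\r\n]+)", re.MULTILINE)
EMOTION_RE = re.compile(r"(?i)^\s*Emotion[:\-\s]*([^\r\n]+)", re.MULTILINE)


def parse_analysis(content: str) -> dict:
    """Extract language and emotion from a model reply, defaulting to 'Unknown'."""
    language_match = LANGUAGE_RE.search(content)
    emotion_match = EMOTION_RE.search(content)
    return {
        "language": language_match.group(1).strip() if language_match else "Unknown",
        "emotion": emotion_match.group(1).strip() if emotion_match else "Unknown",
    }


# Hand-written sample replies (not real model output)
samples = [
    "Language: French\nEmotion: Encouraging",
    "  language - Brazilian Portuguese\nEMOTION:  joyful ",
    "I cannot tell.",
]
for sample in samples:
    print(parse_analysis(sample))
```

The first two samples parse despite differences in case, separators, and spacing. The third contains neither label, so both fields fall back to "Unknown".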
This example uses a simple Agent to translate text into English and a Runner to execute it. The Agent/Runner implementation is assumed from the original content; if your agents package returns different shapes, adapt the result extraction accordingly.
translator_agent = Agent(
    name="Translator",
    instructions="Translate the input text into English. Only return the translated result."
)

async def translate_text(text: str) -> str:
    """
    Use the Agent Runner to translate text into English.
    Returns the final translated string returned by the agent.
    """
    result = await Runner.run(translator_agent, input=text)

    # Handle common return shapes: string, object with attribute, or dict
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        return result.get("final_output")
    return getattr(result, "final_output", None)
If your agents/runner implementation differs, adapt the return extraction accordingly.
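One way to keep that adaptation in a single place is a small extraction helper that fails loudly instead of returning None. This is a sketch only. The name extract_final_output is hypothetical, and the three shapes it handles (string, dict, object with a final_output attribute) are the ones assumed in the lesson, not a statement about what any particular agents library returns.

```python
from typing import Any


def extract_final_output(result: Any) -> str:
    """Return the translated text from a runner result, or raise if none is found."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        value = result.get("final_output")
    else:
        value = getattr(result, "final_output", None)
    if value is None:
        raise ValueError(
            f"Could not find 'final_output' on result of type {type(result).__name__}"
        )
    return str(value)
```

Inside translate_text, the return statement could then become return extract_final_output(result), so an unexpected shape surfaces as a clear error at the translation step and not as a None passed on to the summary step.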
Ask a chat model to provide a concise summary and a suggested title. Temperature is slightly higher for creativity (0.5).
async def generate_title_and_summary(text: str) -> str:
    """
    Generate a concise summary and a suggested title for the given text.
    Returns a string containing both title and summary.
    """
    system_msg = "You are a helpful AI assistant. Summarize the user's message and suggest a title for it."

    # The client call is synchronous, so run it in a worker thread to keep the event loop free
    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": f"Here's the text:\n\n{text}"}
        ],
        temperature=0.5
    )

    # message.content can be None, so fall back to an empty string
    return (response.choices[0].message.content or "").strip()
Tip: If you prefer structured outputs (e.g., JSON with title and summary), ask the model to respond in JSON and parse the result. For simple display, the free-text response above often suffices.
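As a sketch of the JSON variant, assume the system message asks for a reply such as {"title": "...", "summary": "..."}. The parser below is tolerant of extra text around the object, which models sometimes add. The function name parse_title_and_summary and the two key names are assumptions for illustration.

```python
import json


def parse_title_and_summary(raw: str) -> dict:
    """Parse a model reply expected to contain a JSON object with 'title' and 'summary'."""
    # Take the outermost {...} span so surrounding chatter does not break parsing
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model reply")
    data = json.loads(raw[start:end + 1])  # raises json.JSONDecodeError (a ValueError) if malformed
    missing = {"title", "summary"} - data.keys()
    if missing:
        raise ValueError(f"Model reply is missing keys: {sorted(missing)}")
    return {
        "title": str(data["title"]).strip(),
        "summary": str(data["summary"]).strip(),
    }


# Hand-written sample reply (not real model output)
reply = 'Here you go: {"title": "Coding as a Superpower", "summary": "Learning to program is empowering."}'
print(parse_title_and_summary(reply))
```

Because every failure path raises ValueError, the caller can catch a single exception type and fall back to the free-text response when parsing fails.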
This orchestrator composes the previous functions into a complete asynchronous flow. Each step’s output is printed; you can replace prints with logging, storage, or event emissions for production usage.
async def process_audio_translation(file_path: str):
    """
    Full pipeline for an audio file:
      1. Transcribe audio
      2. Analyze language and emotion
      3. Translate into English
      4. Generate title and summary
    Prints each result step to the console.
    """
    # 1) Transcribe
    transcript = await transcribe_audio(file_path)
    print(f"Transcript:\n{transcript}\n")

    # 2) Language & emotion analysis
    analysis = await analyze_language_and_emotion(transcript)
    print(f"Detected language: {analysis['language']}")
    print(f"Detected emotion: {analysis['emotion']}\n")

    # 3) Translate to English
    translation = await translate_text(transcript)
    print(f"Translation:\n{translation}\n")

    # 4) Title & summary
    extras = await generate_title_and_summary(translation)
    print(f"Title and Summary:\n{extras}\n")
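For production use, a variant of the orchestrator can return a structured result in place of the prints and accept the step functions as parameters, which keeps components swappable. The sketch below also runs analysis and translation concurrently, since both depend only on the transcript. The names PipelineResult and run_pipeline are hypothetical, and the concurrency only helps when the step functions really await non-blocking I/O.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class PipelineResult:
    transcript: str
    language: str
    emotion: str
    translation: str
    title_and_summary: str


async def run_pipeline(
    file_path: str,
    *,
    transcribe: Callable[[str], Awaitable[str]],
    analyze: Callable[[str], Awaitable[dict]],
    translate: Callable[[str], Awaitable[str]],
    summarize: Callable[[str], Awaitable[str]],
) -> PipelineResult:
    """Run the four steps with injected async functions and return the results."""
    transcript = await transcribe(file_path)
    # Analysis and translation both need only the transcript, so schedule them together
    analysis, translation = await asyncio.gather(
        analyze(transcript),
        translate(transcript),
    )
    extras = await summarize(translation)
    return PipelineResult(
        transcript=transcript,
        language=analysis.get("language", "Unknown"),
        emotion=analysis.get("emotion", "Unknown"),
        translation=translation,
        title_and_summary=extras,
    )
```

With the lesson's functions this would be called as await run_pipeline(path, transcribe=transcribe_audio, analyze=analyze_language_and_emotion, translate=translate_text, summarize=generate_title_and_summary). In tests, each parameter can be a small stub coroutine so the flow is checked without any API calls.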
Pass in the full path to your WAV file. In Jupyter or other async-capable REPLs you can await the function directly.
# Replace with the path to your WAV file
audio_path = "/Users/gavinridgeway/Documents/Anaconda/AiAgent/final_fixed.wav"
await process_audio_translation(audio_path)
If running from a standard Python script, wrap the call in asyncio:
if __name__ == "__main__":
    audio_path = "/path/to/your/file.wav"
    asyncio.run(process_audio_translation(audio_path))
File not found: ensure the file_path is correct and accessible by your process.
UnboundLocalError or NameError: double-check variable names and that you return the expected attributes (for example result.final_output).
API key errors: confirm OPENAI_API_KEY is set and loaded via load_dotenv() or environment variables.
Agent/Runner differences: the agents package usage (Agent, Runner) is retained from the original content — adapt Runner.run() and result access if your agents library returns different shapes.
Unexpected model output format: prefer instructing the model to respond in a strict format (for example Language: <language>\nEmotion: <emotion> or JSON), then validate with regex or a JSON parser.
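For the API key item in the list above, a fail-fast check at startup can turn an opaque authentication error into a clear message. This is a minimal sketch. The helper name require_env is hypothetical, and it assumes load_dotenv() has already run if you rely on a .env file.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise with a helpful message."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

Calling require_env("OPENAI_API_KEY") right after load_dotenv() stops the script before any audio is uploaded when the key is missing or blank.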
After running on a French sample, the pipeline prints something like:
Transcript: “Apprendre à programmer, c’est comme avoir un super-pouvoir…”
Detected language: French
Detected emotion: Encouraging
Translation: “Learning to program is like having a superpower…”
Title and Summary: (a short summary and a suggested title)
You now have a working asynchronous pipeline that transcribes audio, detects language and emotion, translates into English, and generates a title plus a short summary.
If you want to extend this pipeline: consider adding speaker diarization, punctuation normalization, or persisting outputs to a database for downstream search and analytics.
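As one possible starting point for the persistence idea, the sketch below stores each run in SQLite using only the standard library. The table name, column names, and the save_result helper are all hypothetical; a real deployment would choose its own schema and database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pipeline_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL,
    transcript TEXT NOT NULL,
    language TEXT,
    emotion TEXT,
    translation TEXT,
    title_and_summary TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def save_result(
    conn: sqlite3.Connection,
    *,
    file_path: str,
    transcript: str,
    language: str,
    emotion: str,
    translation: str,
    title_and_summary: str,
) -> int:
    """Insert one pipeline run and return its row id."""
    conn.execute(SCHEMA)  # no-op when the table already exists
    cursor = conn.execute(
        "INSERT INTO pipeline_results "
        "(file_path, transcript, language, emotion, translation, title_and_summary) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (file_path, transcript, language, emotion, translation, title_and_summary),
    )
    conn.commit()
    return cursor.lastrowid
```

A connection opened with sqlite3.connect("results.db") writes to a local file, while sqlite3.connect(":memory:") is convenient for tests. The parameterized query avoids quoting problems when transcripts contain apostrophes or other special characters.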