Building a Real-Time Audio Transcription System: From Audio Streams to Text

August 19, 2025
9 min read

Introduction

Real-time audio transcription has become increasingly important in our digital world - from accessibility tools for the hearing impaired to automated meeting notes and live captioning systems. When I set out to build a real-time transcription application, I wanted to create something that could capture system audio (like from video calls, music, or presentations) and transcribe it live with minimal latency.

The result is a Python application that combines OpenAI’s Whisper for state-of-the-art transcription accuracy, sounddevice for robust audio capture, and DearPyGui for a responsive desktop interface. This post will walk through the architecture, challenges, and solutions I encountered while building this system.

The Challenge: Real-Time Audio Processing

Building a real-time transcription system presents several unique challenges:

  1. Audio Capture: Reliably capturing system audio across different platforms
  2. Buffer Management: Handling continuous audio streams without dropouts
  3. Processing Pipeline: Balancing transcription accuracy with latency
  4. UI Responsiveness: Keeping the interface smooth while processing intensive operations
  5. Resource Management: Efficiently managing memory and CPU usage

Let’s dive into how I approached each of these challenges.

Architecture Overview

The application is built around a modular architecture with clear separation of concerns:

┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│  Audio Capture   │─────▶│ Session Manager  │─────▶│    GUI Layer     │
│      System      │      │   & Processing   │      │   (DearPyGui)    │
└──────────────────┘      └──────────────────┘      └──────────────────┘
          │                         │                         │
          ▼                         ▼                         ▼
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│  Audio Buffer &  │      │     Whisper      │      │  User Controls   │
│   Queue System   │      │   Transcriber    │      │   & Callbacks    │
└──────────────────┘      └──────────────────┘      └──────────────────┘

Core Components

  1. AudioCaptureSystem: Handles real-time audio input using system loopback
  2. WhisperTranscriber: Manages Whisper model loading and transcription
  3. TranscriptionSession: Orchestrates the entire processing pipeline
  4. TranscriptionGUI: Provides the user interface and controls
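
To make the data flow concrete, here is a minimal sketch of how the session could wire these components together; everything beyond the four class names above (constructor arguments, attribute names) is an assumption for illustration.

class TranscriptionSession:
    def __init__(self, audio_config: AudioConfig, model_name: str = "base"):
        # Component names match the list above; signatures are assumed
        self.audio_capture = AudioCaptureSystem(audio_config)  # system loopback capture
        self.transcriber = WhisperTranscriber(model_name)      # model loading + inference
        self.on_update = None                                   # TranscriptionGUI registers a callback here

    def start(self):
        self.transcriber.load_model()
        self.audio_capture.start()
        # A background thread then pulls chunks from the capture queue and feeds the transcriber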

Audio Capture: The Foundation

The heart of any real-time transcription system is reliable audio capture. I chose to focus on system audio loopback, which captures whatever audio is playing through the speakers - perfect for transcribing video calls, presentations, or media.

Warning

Platform Considerations: System audio capture works differently across platforms. Windows uses WASAPI loopback devices, while macOS and Linux have their own mechanisms. This implementation focuses primarily on Windows WASAPI.

Finding the Right Audio Device

The first challenge is identifying the correct audio device for system capture:

def get_system_audio_device(self) -> Optional[int]:
    """Find system audio loopback device (Windows WASAPI)"""
    try:
        devices = sd.query_devices()
        for i, device in enumerate(devices):
            # Look for WASAPI loopback devices
            if ('wasapi' in device['name'].lower() or
                    'loopback' in device['name'].lower() or
                    'speakers' in device['name'].lower() or
                    'what u hear' in device['name'].lower()):
                if device['max_input_channels'] > 0:
                    logger.info(f"Found system audio device: {device['name']}")
                    return i
        # Fallback to the default input device
        default_device = sd.query_devices(kind='input')
        logger.info(f"Falling back to default input device: {default_device['name']}")
        return sd.default.device[0]
    except Exception as e:
        logger.error(f"Error finding audio device: {e}")
        return None

This approach searches for specific keywords that indicate system loopback capability, with fallback options for broader compatibility.
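
Once a device index is found, opening the stream with sounddevice is a single call. A minimal sketch, assuming the AudioConfig shown in the next section and the non-blocking audio_callback covered later in this post:

def start_capture(self):
    """Open a non-blocking input stream on the selected device (sketch)."""
    device_index = self.get_system_audio_device()
    self.stream = sd.InputStream(
        device=device_index,
        samplerate=self.config.sample_rate,  # 16 kHz, see AudioConfig below
        channels=self.config.channels,       # mono
        blocksize=self.config.chunk_size,    # frames delivered per callback
        callback=self.audio_callback,        # defined later in this post
    )
    self.is_recording = True
    self.stream.start()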

Audio Configuration

I standardized on specific audio parameters that balance quality with processing efficiency:

@dataclass
class AudioConfig:
    sample_rate: int = 16000      # Whisper's preferred sample rate
    channels: int = 1             # Mono audio for efficiency
    chunk_size: int = 1024        # Balance between latency and efficiency
    buffer_duration: float = 2.0  # Rolling buffer window
    device_type: str = "wasapi"   # Platform-specific audio system

The choice of 16kHz sample rate is deliberate - it’s Whisper’s native rate, avoiding unnecessary resampling. Mono audio reduces processing overhead while maintaining transcription quality.

The Threading Challenge

Real-time audio processing requires careful threading to prevent audio dropouts:

  1. Audio Callback Thread: Captures audio frames at hardware intervals
  2. Processing Thread: Handles transcription in the background
  3. GUI Thread: Maintains UI responsiveness
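
A minimal sketch of how these threads fit together (method names other than _processing_loop are assumptions; DearPyGui's render loop stays on the main thread):

import threading

def start_session(self):
    """Start capture and the background processing thread (sketch)."""
    self.stop_event = threading.Event()
    self.audio_capture.start()  # sounddevice invokes audio_callback on its own thread
    self.processing_thread = threading.Thread(
        target=self._processing_loop,  # transcription loop shown later
        daemon=True,                   # don't block application exit
    )
    self.processing_thread.start()
    # The GUI (DearPyGui render loop) keeps running on the main thread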

Queue Management Strategy

The audio callback runs at high frequency and cannot be blocked. I implemented a robust queuing system:

def audio_callback(self, indata, frames, time, status):
    """Callback for audio stream - must be non-blocking"""
    if status:
        logger.warning(f"Audio callback status: {status}")
    if self.is_recording:
        # Convert to mono if needed
        if indata.shape[1] > 1:
            audio_data = np.mean(indata, axis=1)
        else:
            audio_data = indata[:, 0]
        try:
            self.audio_queue.put_nowait(audio_data.copy())
        except queue.Full:
            # Drop oldest frames to prevent blocking
            for _ in range(min(10, self.audio_queue.qsize())):
                try:
                    self.audio_queue.get_nowait()
                except queue.Empty:
                    break
            self.audio_queue.put_nowait(audio_data.copy())
            self.dropped_frames += 1

Note

Key Insight: When the queue fills up (indicating processing can’t keep up), I drop the oldest frames rather than the newest. This maintains real-time behavior at the cost of some historical audio data.

Whisper Integration: Balancing Accuracy and Speed

OpenAI’s Whisper offers multiple model sizes with different trade-offs:

Model  | Size   | Speed     | Accuracy | Use Case
Tiny   | ~39MB  | Fastest   | Basic    | Real-time, low-end hardware
Base   | ~140MB | Fast      | Good     | Recommended for most users
Small  | ~460MB | Moderate  | Better   | Higher accuracy needs
Medium | ~1.5GB | Slow      | High     | Offline processing
Large  | ~2.9GB | Slowest   | Best     | Maximum accuracy
Turbo  | ~1.5GB | Very Fast | High     | Best balance for real-time
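
Loading and running a model with the openai-whisper package is straightforward; whisper.load_model and model.transcribe below are the library's actual calls, while the wrapper class structure is a sketch of how the WhisperTranscriber might use them:

import numpy as np
import whisper

class WhisperTranscriber:
    def __init__(self, model_name: str = "base"):
        self.model_name = model_name
        self.model = None
        self.is_loaded = False

    def load_model(self) -> bool:
        self.model = whisper.load_model(self.model_name)  # downloads weights on first use
        self.is_loaded = True
        return True

    def transcribe(self, audio: np.ndarray) -> str:
        # Whisper expects float32 mono audio sampled at 16 kHz
        result = self.model.transcribe(audio.astype(np.float32), fp16=False)
        return result["text"].strip()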

Dynamic Model Switching

I implemented hot-swapping of models without restarting the application:

def change_model(self, new_model_name: str) -> bool:
    """Change to a different Whisper model"""
    if new_model_name == self.model_name and self.is_loaded:
        return True  # Already using this model
    # Clear current model to free memory
    self.model = None
    self.is_loaded = False
    self.model_name = new_model_name
    # Load new model in background thread
    return self.load_model()

Tip

Performance Tip: Model loading can take 5-30 seconds depending on size. I perform this in a background thread with UI feedback to maintain responsiveness.
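
A minimal sketch of that pattern, assuming a DearPyGui text item tagged "status_text" and a combo callback named _on_model_selected (both illustrative):

import threading
import dearpygui.dearpygui as dpg

def _on_model_selected(self, sender, app_data):
    """Swap Whisper models without blocking the GUI thread (sketch)."""
    dpg.set_value("status_text", f"Loading {app_data} model...")

    def load_in_background():
        ok = self.transcriber.change_model(app_data)
        dpg.set_value("status_text", "Model ready" if ok else "Model load failed")

    threading.Thread(target=load_in_background, daemon=True).start()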

Processing Pipeline Optimization

The transcription pipeline balances latency with accuracy through several strategies:

  1. Buffered Processing: Accumulate 2 seconds of audio before transcription
  2. Batch Processing: Handle multiple audio chunks per iteration
  3. Smart Scheduling: Process up to 50 chunks at once when behind

def _processing_loop(self):
    """Main processing loop with optimized batch handling"""
    while not self.stop_event.is_set():
        # Process multiple chunks per iteration
        chunks_this_iteration = 0
        max_chunks_per_iteration = 50
        while chunks_this_iteration < max_chunks_per_iteration:
            audio_chunk = self.audio_capture.get_audio_chunk()
            if audio_chunk is None:
                break
            if self.state == SessionState.RUNNING_ACTIVE:
                self.audio_buffer.extend(audio_chunk)
            chunks_this_iteration += 1
        # Adaptive sleep based on processing load
        if chunks_this_iteration > 0:
            time.sleep(0.01)  # Active processing
        else:
            time.sleep(0.05)  # Idle state
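
The loop above only fills the rolling buffer. A companion step, sketched below with assumed names, converts the buffered samples to a NumPy array and hands them to Whisper once roughly two seconds of audio have accumulated:

def _maybe_transcribe(self):
    """Flush the rolling buffer to the transcriber when enough audio has accumulated (sketch)."""
    min_samples = int(self.config.sample_rate * self.config.buffer_duration)  # ~2 s at 16 kHz
    if len(self.audio_buffer) < min_samples:
        return
    audio = np.array(self.audio_buffer, dtype=np.float32)
    self.audio_buffer.clear()
    text = self.transcriber.transcribe(audio)
    if text and self.on_update:
        self.on_update(text, True)  # is_final=True; GUI callback shown in the next section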

GUI Design: Responsive User Experience

I chose DearPyGui for the interface because it offers immediate-mode rendering with native Python integration. The challenge was keeping the UI responsive while heavy transcription work runs in the background.

Thread-Safe UI Updates

GUI updates from background threads require careful coordination:

def _on_transcription_update(self, text: str, is_final: bool):
    """Thread-safe callback for transcription updates"""
    if is_final:
        timestamp = time.strftime('%H:%M:%S')
        self.transcription_text += f"\n[{timestamp}] {text}"
        # Update GUI in thread-safe manner
        dpg.set_value("transcription_display", self.transcription_text)
        # Auto-scroll with multiple fallback methods
        try:
            dpg.set_y_scroll("transcription_window", 1.0)

            # Delayed scroll for timing issues
            def delayed_scroll():
                time.sleep(0.1)
                scroll_max = dpg.get_y_scroll_max("transcription_window")
                if scroll_max > 0:
                    dpg.set_y_scroll("transcription_window", scroll_max)

            threading.Thread(target=delayed_scroll, daemon=True).start()
        except Exception as e:
            logger.debug(f"Scroll update failed: {e}")

Progressive Enhancement

The interface starts simple but provides advanced options:

  1. Basic Controls: Start/Stop, Clear, Save
  2. Audio Device Selection: Choose from available input devices
  3. Model Selection: Switch between Whisper models
  4. Real-time Status: Processing state and performance metrics
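
A minimal sketch of how these controls might be declared with DearPyGui (the gui callback object, tags, and layout are illustrative; the dpg calls themselves are standard):

import dearpygui.dearpygui as dpg

dpg.create_context()
with dpg.window(label="Real-Time Transcription", tag="main_window"):
    with dpg.group(horizontal=True):
        dpg.add_button(label="Start", callback=gui.on_start)
        dpg.add_button(label="Stop", callback=gui.on_stop)
        dpg.add_button(label="Clear", callback=gui.on_clear)
        dpg.add_button(label="Save", callback=gui.on_save)
    dpg.add_combo(("tiny", "base", "small", "turbo"), label="Model",
                  default_value="base", callback=gui.on_model_selected)
    dpg.add_text("", tag="status_text")
    with dpg.child_window(tag="transcription_window"):
        dpg.add_text("", tag="transcription_display", wrap=0)

dpg.create_viewport(title="Live Transcription", width=800, height=600)
dpg.setup_dearpygui()
dpg.show_viewport()
dpg.start_dearpygui()
dpg.destroy_context()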

Performance Considerations

Several optimizations ensure smooth real-time operation:

Memory Management

# Circular buffer prevents unbounded memory growth
self.audio_buffer = deque(maxlen=int(sample_rate * buffer_duration))

# Explicit cleanup of Whisper models
self.model = None  # Drops the reference so the model can be garbage-collected

CPU Optimization

  • Mono Audio: Halves processing load compared to stereo
  • Efficient Queuing: get_nowait() prevents thread blocking
  • Batch Processing: Reduces per-chunk overhead
  • Adaptive Sleeping: CPU usage scales with processing demand

Error Recovery

Robust error handling prevents crashes from audio glitches:

try:
    result = self.model.transcribe(audio_data, ...)
    if result and result['text'].strip():
        # Process successful transcription
        self._update_transcription(result['text'])
except Exception as e:
    logger.error(f"Transcription error: {e}")
    # Continue processing - don't crash on individual failures

Lessons Learned

Building this application taught me several important lessons about real-time audio processing:

Audio is Unforgiving

  • Unlike text or image processing, audio streams are continuous and time-sensitive. Drop a few milliseconds and users notice immediately. The callback-based architecture with careful queue management is essential.

Model Size Matters

  • Whisper’s larger models provide better accuracy but at the cost of latency. For real-time applications, the “Turbo” model offers the best balance - near large-model accuracy with much faster processing.

Threading is Critical

  • Real-time applications require careful separation of concerns across threads. The audio callback, processing loop, and GUI each need dedicated threads with appropriate priorities.

Platform Differences Are Real

  • Audio capture varies significantly between operating systems. Windows WASAPI, macOS Core Audio, and Linux ALSA/PulseAudio all have different capabilities and limitations.

User Feedback is Essential

  • Long-running operations (like model loading) need clear user feedback. Progress indicators and status messages transform a frustrating wait into an understandable process.

Future Enhancements

Several improvements could enhance the application:

  1. Multi-language Support: Leverage Whisper’s multilingual capabilities
  2. Speaker Diarization: Identify and separate different speakers
  3. Custom Models: Fine-tuned Whisper models for specific domains
  4. Cloud Integration: Option for cloud-based processing for lower-end devices
  5. Export Formats: Support for SRT subtitles, structured JSON, etc.
  6. Voice Activity Detection: Only transcribe when speech is detected
  7. Cross-platform Audio: Better support for macOS and Linux audio capture

Conclusion

Building a real-time transcription application highlighted the intricate balance between accuracy, latency, and resource usage. The combination of robust audio capture, intelligent buffering, and responsive UI design creates a system that feels natural to use while handling the complex technical challenges behind the scenes.

The modular architecture makes it easy to swap components - different transcription engines, audio sources, or UI frameworks - while maintaining the core functionality. This flexibility has proven valuable as requirements evolved and new use cases emerged.

Whether you’re building accessibility tools, productivity applications, or research systems, the patterns and techniques in this implementation provide a solid foundation for real-time audio processing applications.

Key Takeaways

  • Audio capture requires platform-specific knowledge but can be abstracted effectively
  • Threading architecture is crucial for real-time performance
  • Queue management strategies determine system stability under load
  • Model selection significantly impacts both accuracy and latency
  • User experience considerations are as important as technical performance
  • Error recovery and logging are essential for production reliability

The complete source code demonstrates these principles in action, providing a reference implementation for anyone tackling similar real-time audio processing challenges.


You can explore the complete implementation here: GitHub repo