
How to Split Audio and Video into Chunks for Gemma 3n
Why Chunking Matters for Gemma 3n Transcription and Translation
Gemma 3n AI models are designed to process short audio segments (chunks)—ideally 20–30 seconds at a time. This maximizes accuracy, speed, and stability while keeping memory usage reasonable. Processing full-length audio or video in one go often leads to errors, slowdowns, or dropped content.
Below, you'll find the up-to-date chunking strategies used in berdaflex videoscribe, each with a clear explanation.
Step 1. Extract Audio from Video and Prepare for Chunking
Extract audio from video (e.g. MP4) using FFmpeg:
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output_audio.wav
- Converts to 16kHz mono WAV: what Gemma 3n expects.
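If you would rather drive the extraction from Python, one option is to build the same FFmpeg command and run it with subprocess. This is a minimal sketch: the function names build_extract_command and extract_audio are illustrative (not part of any library), the -y overwrite flag is an addition to the command above, and it assumes ffmpeg is available on your PATH.

```python
import shutil
import subprocess


def build_extract_command(video_path, wav_path, sample_rate=16000):
    """Return the FFmpeg argument list for a mono, 16-bit PCM WAV extraction."""
    return [
        "ffmpeg", "-y",            # -y: overwrite the output file if it exists (added here)
        "-i", str(video_path),
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # resample to the target rate
        "-ac", "1",                # downmix to mono
        str(wav_path),
    ]


def extract_audio(video_path, wav_path, sample_rate=16000):
    """Run FFmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    command = build_extract_command(video_path, wav_path, sample_rate)
    subprocess.run(command, check=True, capture_output=True)
    return wav_path
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.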
Step 2. Load Audio and Make It Mono
import torchaudio
import numpy as np

waveform, sample_rate = torchaudio.load("output_audio.wav")
if waveform.shape[0] > 1:  # Stereo audio
    waveform = waveform.mean(dim=0, keepdim=True)  # Convert to mono
audio_data = waveform.squeeze().numpy()
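Before chunking, it can help to confirm that the file really is 16 kHz mono. The sketch below reads the WAV header with Python's standard wave module; describe_wav and is_mono_16k are hypothetical helper names, and the check assumes a plain PCM WAV such as the one produced by the FFmpeg command in Step 1.

```python
import wave


def describe_wav(path):
    """Read basic format information from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def is_mono_16k(info, expected_rate=16000):
    """True when the file matches the mono, 16 kHz layout used in this guide."""
    return info["channels"] == 1 and info["sample_rate"] == expected_rate
```

If the check fails, re-running the FFmpeg extraction is usually simpler than resampling in Python.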
Step 3. Smart Chunking Logic
Key Parameters
- max_chunk_size = 30.0 seconds (Gemma 3n’s hard limit)
- min_chunk_size = 0.3 seconds (to avoid too-tiny/noisy chunks)
- preferred_min_size = 15.0 seconds
- preferred_max_size = 30.0 seconds
- optimal_chunk_size = 22.0 seconds
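One way to keep these parameters together is a small configuration object that also converts seconds to sample counts. This is only a sketch: the ChunkConfig class, its to_samples and validate methods, and the 16 kHz default are illustrative assumptions, while the five size values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkConfig:
    """Chunk sizes in seconds, using the values listed in this guide."""
    max_chunk_size: float = 30.0
    min_chunk_size: float = 0.3
    preferred_min_size: float = 15.0
    preferred_max_size: float = 30.0
    optimal_chunk_size: float = 22.0
    sample_rate: int = 16000  # assumed default, matching the 16 kHz extraction

    def to_samples(self, seconds):
        """Convert a duration in seconds to a whole number of samples."""
        return int(round(seconds * self.sample_rate))

    def validate(self):
        """Check that the sizes are ordered from smallest to largest."""
        ordered = (
            self.min_chunk_size
            <= self.preferred_min_size
            <= self.optimal_chunk_size
            <= self.preferred_max_size
            <= self.max_chunk_size
        )
        if not ordered:
            raise ValueError("chunk sizes must be ordered min <= preferred <= max")
        return self
```

At 16 kHz, the 30-second limit corresponds to 480,000 samples and the 22-second target to 352,000 samples.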
Pause-Based Chunking: Natural Splits
Why do this?
Instead of blindly cutting every X seconds, we split where there are natural speech pauses (silence).
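def smart_chunk_audio(audio_data, sample_rate):
    min_silence_db = -14    # silence threshold in dB (taken here as relative to the peak level)
    min_pause_len_ms = 500  # shortest pause that counts as a split point
    seek_step_ms = 100      # step size of the silence scan
    optimal_chunk_samples = int(22 * sample_rate)
    max_chunk_samples = int(30 * sample_rate)
    min_chunk_samples = int(0.3 * sample_rate)
    overlap_samples = int(0.5 * sample_rate)  # Overlap for smoother transitions
    min_pause_samples = int(min_pause_len_ms / 1000 * sample_rate)
    step = max(1, int(seek_step_ms / 1000 * sample_rate))
    if len(audio_data) == 0:
        return []
    # NOTE: the silence scan and the boundary selection were garbled in the published
    # listing; the lines below are a reconstruction, not the exact processor.py code.
    envelope = np.abs(audio_data)
    threshold = envelope.max() * 10 ** (min_silence_db / 20)
    boundaries = []
    current_silence = None
    for pos in range(0, len(envelope) + step, step):
        # A step is silent when its loudest sample stays below the threshold
        is_silent = pos < len(envelope) and envelope[pos:pos + step].max() < threshold
        if is_silent and current_silence is None:
            current_silence = pos
        elif not is_silent and current_silence is not None:
            duration = pos - current_silence
            if duration >= min_pause_samples:
                boundary = current_silence + duration // 2  # middle of the pause
                boundaries.append(boundary)
            current_silence = None
    # Chunking loop
    chunks = []
    current_start = 0
    while len(audio_data) - current_start > max_chunk_samples:
        target = current_start + optimal_chunk_samples
        # Usable pauses: far enough ahead to make progress past the overlap
        candidates = [b for b in boundaries
                      if min_chunk_samples + overlap_samples < b - current_start <= max_chunk_samples]
        if candidates:
            best_boundary = min(candidates, key=lambda b: abs(b - target))
        else:
            best_boundary = current_start + max_chunk_samples  # no pause found: hard cut
        chunk = audio_data[current_start:best_boundary]
        chunks.append(chunk)
        current_start = best_boundary - overlap_samples  # move with overlap
    chunks.append(audio_data[current_start:])  # the remainder fits in one chunk
    return chunks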
This logic closely matches the robust logic from processor.py. The real code is more thorough (it handles energy and more edge cases), but this covers the essential mechanics.
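To see pause detection in isolation, here is a vectorized NumPy sketch that returns the midpoint of every sufficiently long pause. It is an illustration under stated assumptions, not the processor.py implementation: the function name find_pause_midpoints is hypothetical, loudness is measured as the peak of each 100 ms frame, and the -14 dB threshold is interpreted relative to the loudest sample in the clip.

```python
import numpy as np


def find_pause_midpoints(audio, sample_rate, silence_db=-14.0,
                         min_pause_ms=500, frame_ms=100):
    """Return sample indices at the middle of each pause of at least min_pause_ms."""
    audio = np.asarray(audio, dtype=np.float64)
    frame = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(audio) // frame
    if n_frames == 0:
        return []
    # Peak level of each frame (any trailing partial frame is ignored)
    peaks = np.abs(audio[:n_frames * frame]).reshape(n_frames, frame).max(axis=1)
    # Convert the dB threshold to a linear amplitude relative to the clip's peak
    threshold = np.abs(audio).max() * 10 ** (silence_db / 20)
    silent = peaks < threshold
    # Pad with False so every silent run has a detectable start and end edge
    padded = np.concatenate(([False], silent, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # first silent frame of each run
    ends = np.flatnonzero(edges == -1)    # first non-silent frame after each run
    min_frames = int(np.ceil(min_pause_ms / frame_ms))
    return [int((s + e) * frame // 2)
            for s, e in zip(starts, ends) if e - s >= min_frames]
```

For example, a clip made of one second of tone, one second of silence, and another second of tone yields a single boundary in the middle of the silent second.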
Step 4. Save Each Chunk
You can save these chunks for batch processing and easier error recovery.
import torch
import torchaudio

def save_chunks(chunks, sample_rate):
    for idx, chunk in enumerate(chunks):
        chunk_tensor = torch.tensor(chunk).unsqueeze(0)  # shape: (1, samples)
        torchaudio.save(f"chunk_{idx:03d}.wav", chunk_tensor, sample_rate)
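For error recovery, it may be useful to write a small manifest next to the chunk files so a failed batch can resume where it stopped. The following sketch uses only NumPy and the standard library instead of torchaudio; the function name save_chunks_with_manifest, the manifest.json layout, and the "done" flag are all assumptions made for illustration.

```python
import json
import wave
from pathlib import Path

import numpy as np


def save_chunks_with_manifest(chunks, sample_rate, out_dir):
    """Write each float chunk as 16-bit mono WAV and record it in manifest.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for idx, chunk in enumerate(chunks):
        name = f"chunk_{idx:03d}.wav"
        # Scale floats in [-1, 1] to little-endian 16-bit integers
        pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype("<i2")
        with wave.open(str(out_dir / name), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(sample_rate)
            wav.writeframes(pcm.tobytes())
        manifest.append({
            "index": idx,
            "file": name,
            "samples": int(len(chunk)),
            "duration_s": len(chunk) / sample_rate,
            "done": False,  # hypothetical flag: flip to True once transcribed
        })
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

A batch job can then skip every entry whose "done" flag is already set.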
Step 5. Advanced Suitability Checks
Each chunk is analyzed to ensure it's “suitable” for transcription (good energy, not mostly silence).
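def check_chunk_suitability(audio_data, sample_rate):
    if len(audio_data) == 0:
        return False  # nothing to transcribe
    duration = len(audio_data) / sample_rate
    energy = np.mean(audio_data ** 2)
    max_amplitude = np.max(np.abs(audio_data))
    # The per-sample silence threshold was lost in the published listing; 0.001 is an assumed value.
    silence_ratio = np.sum(np.abs(audio_data) < 0.001) / len(audio_data)
    if energy > 0.00001 and max_amplitude > 0.001 and silence_ratio < 0.99 and 0.1 <= duration <= 30.0:
        return True
    return False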
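When tuning these thresholds, it can help to print the underlying numbers for a few chunks rather than only a pass/fail result. The sketch below is a diagnostic helper under assumptions: chunk_stats is a hypothetical name, the 0.001 silence threshold is an assumed default, and the RMS level is reported in dB relative to full scale.

```python
import numpy as np


def chunk_stats(audio, sample_rate, silence_threshold=0.001):
    """Return duration, RMS level (dBFS), peak, and silence ratio for one chunk."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.size == 0:
        return {"duration_s": 0.0, "rms_dbfs": float("-inf"),
                "peak": 0.0, "silence_ratio": 1.0}
    rms = float(np.sqrt(np.mean(audio ** 2)))
    return {
        "duration_s": audio.size / sample_rate,
        # 0 dBFS corresponds to an RMS of 1.0; digital silence maps to -inf
        "rms_dbfs": float(20 * np.log10(rms)) if rms > 0 else float("-inf"),
        "peak": float(np.max(np.abs(audio))),
        # Fraction of samples whose magnitude is below the silence threshold
        "silence_ratio": float(np.mean(np.abs(audio) < silence_threshold)),
    }
```

Comparing these values for chunks that transcribe well against chunks that fail is a practical way to pick thresholds for your own recordings.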
Step 6. Why Not Just Cut Every X Seconds?
Simple cut-every-N-seconds methods can split words, lose context, or make transcription choppy. Smart chunking (as shown) finds pauses, natural speech boundaries, and skips untranscribable noise.
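For comparison, this is roughly what the naive approach looks like. The sketch computes fixed-size chunk boundaries with an optional overlap; fixed_chunk_bounds is an illustrative name, and the cut positions depend only on the sample count, never on what is being said at that moment.

```python
def fixed_chunk_bounds(total_samples, chunk_samples, overlap_samples=0):
    """Return (start, end) sample pairs for fixed-length chunks with overlap."""
    if chunk_samples <= overlap_samples:
        raise ValueError("chunk_samples must be larger than overlap_samples")
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # reached the end; stop before stepping back into the overlap
        start = end - overlap_samples
    return bounds
```

With 70 seconds of 16 kHz audio, 30-second chunks, and a 0.5-second overlap, the cuts land at exactly 30.0 s and 59.5 s, whether or not a word is being spoken there.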
Step 7. Full Workflow Overview
From processor.py, your workflow is:
- Extract audio
- Chunk intelligently (pause/energy based, overlap)
- Validate chunk suitability
- Save chunk files
- Transcribe/translate each chunk with Gemma 3n
- Recombine results in order
Visual Summary:
@startuml
title Gemma 3n Audio/Video Chunking & Transcription Pipeline
start
:Input Video/Audio;
:Extract Audio;
:Smart Chunking (pause/energy based, overlap);
:Save Chunks;
:Validate Suitability (energy/noise checks);
:Transcribe/Translate each chunk (Gemma 3n);
:Combine Output;
stop
@enduml
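The final step, recombining results in order, needs some care when chunks overlap, because the words spoken during the overlap can appear at the end of one transcript and again at the start of the next. The sketch below shows one simple heuristic, not the method used in processor.py: merge_transcripts is a hypothetical helper that drops the longest run of words repeated across a chunk boundary. It can misfire on punctuation differences or on speech that genuinely repeats a word.

```python
def merge_transcripts(pieces, max_overlap_words=8):
    """Join per-chunk transcripts in order, removing words repeated at the seams."""
    merged = []
    for piece in pieces:
        words = piece.split()
        limit = min(max_overlap_words, len(words), len(merged))
        drop = 0
        # Try the longest possible overlap first, then shorter ones
        for n in range(limit, 0, -1):
            tail = [w.lower() for w in merged[-n:]]
            head = [w.lower() for w in words[:n]]
            if tail == head:
                drop = n
                break
        merged.extend(words[drop:])
    return " ".join(merged)
```

Pass the transcripts in chunk order; chunks that were skipped as unsuitable can simply be left out or passed as empty strings.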
Best Practice Tips
- Use 30s maximum chunk for Gemma 3n.
- Add 0.5s overlap for completeness.
- Always convert to mono, 16kHz.
- Use pause-based chunking where possible.
- Check chunk energy: skip silent/noisy chunks.
- Automate with Python for speed and reliability.
Final Thoughts
Chunking is key for accurate, high-speed transcription and translation with Gemma 3n. The logic in processor.py is carefully designed for real-world use: robust, efficient, and proven in production.
With these steps and sample code, anyone—regardless of background—can split audio/video into smart chunks and unlock powerful AI transcription and translation capabilities with Gemma 3n.
Published on 8/16/2025