Day 2 - Lightweight Podcast Archive Indexing
I built a complete pipeline for downloading, transcribing, and semantically searching podcast archives. This is a first step toward extracting clips from videos and building a system that can answer questions about content.
What I did
Created a toolkit to process podcast archives: download all episodes from RSS, transcribe with whisper.cpp, store embeddings in SQLite, and search with LLM-powered re-ranking. The end-to-end workflow lets me find specific moments across hundreds of hours of audio.
Key Technical Decisions
Chunked Transcription: whisper.cpp performs best on shorter audio, so I split files into 10-minute chunks with 30-second overlaps. The overlap preserves context at boundaries and prevents cutting mid-sentence; deduplication then drops any segment whose start time falls within 0.5 seconds of one already kept.
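The chunking and dedup logic can be sketched roughly like this (a minimal illustration; `chunkWindows`, `dedupe`, and the `Segment` shape are my own names, not the toolkit's actual API):

```typescript
interface Segment {
  start: number; // seconds, relative to the full file
  text: string;
}

const CHUNK_SECONDS = 600;  // 10-minute chunks
const OVERLAP_SECONDS = 30; // context at chunk boundaries

// Compute [start, end] windows covering a file of the given duration,
// each window overlapping the next by OVERLAP_SECONDS.
function chunkWindows(durationSec: number): Array<[number, number]> {
  const windows: Array<[number, number]> = [];
  for (let start = 0; start < durationSec; start += CHUNK_SECONDS - OVERLAP_SECONDS) {
    windows.push([start, Math.min(start + CHUNK_SECONDS, durationSec)]);
  }
  return windows;
}

// Merge per-chunk segments, dropping any whose start time falls within
// 0.5 s of a segment already kept (duplicates from the overlap region).
function dedupe(segments: Segment[]): Segment[] {
  const sorted = [...segments].sort((a, b) => a.start - b.start);
  const kept: Segment[] = [];
  for (const seg of sorted) {
    const last = kept[kept.length - 1];
    if (!last || seg.start - last.start > 0.5) kept.push(seg);
  }
  return kept;
}
```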
Vector Storage: I chose sqlite-vector over a dedicated vector database. The 4096-dimensional embeddings from Ollama’s qwen3-embedding model are stored as BLOBs and compared with cosine distance. Segments under 20 characters or 3 words are filtered out to skip filler content, and inserts are committed every 100 segments for performance.
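The filter predicate and the distance metric are simple enough to show directly. This is an illustrative sketch, not the toolkit's code; `isIndexable` and `cosineDistance` are assumed names:

```typescript
// Skip filler: index only segments with at least 20 characters and 3 words.
function isIndexable(text: string): boolean {
  const trimmed = text.trim();
  return trimmed.length >= 20 && trimmed.split(/\s+/).length >= 3;
}

// Embeddings are stored as raw Float32 BLOBs; cosine distance is
// 1 minus the cosine similarity of the two vectors.
function cosineDistance(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```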
Two-Stage Search: Vector similarity alone returns too many false positives. Stage 1 retrieves top 50 candidates via cosine distance, then Stage 2 uses llama.cpp to re-rank and select top 5 with reasoning. The LLM explains why each result matches, which helps validate relevance.
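In outline, the two stages compose like this. The `RerankFn` type stands in for the llama.cpp call and is an assumption for illustration, not the toolkit's real interface:

```typescript
interface Candidate { id: number; text: string; distance: number }
interface Ranked extends Candidate { reason: string }

// Placeholder for the LLM re-ranker (llama.cpp in the real pipeline).
type RerankFn = (query: string, candidates: Candidate[]) => Promise<Ranked[]>;

// Stage 1: keep the 50 nearest segments by cosine distance.
function stage1(candidates: Candidate[], k = 50): Candidate[] {
  return [...candidates].sort((a, b) => a.distance - b.distance).slice(0, k);
}

// Stage 2: the LLM re-ranks the shortlist and returns at most 5 results,
// each carrying a short explanation of why it matches.
async function search(query: string, candidates: Candidate[], rerank: RerankFn): Promise<Ranked[]> {
  const shortlist = stage1(candidates);
  const ranked = await rerank(query, shortlist);
  return ranked.slice(0, 5);
}
```

Keeping stage 1 cheap and wide (50 candidates) gives the re-ranker enough material to recover results that vector distance alone ranks poorly.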
Audio Stitching: Search results pipe directly to an ffmpeg-based stitcher. It extracts each segment, re-encodes for clean cuts (libx264 for video, codec copy for audio), and concatenates via ffmpeg’s concat demuxer.
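A sketch of how the stitcher assembles its ffmpeg invocations. The `Clip` shape and function names are illustrative; only the flags mirror the approach described above:

```typescript
interface Clip { src: string; start: number; end: number }

// Arguments to extract one clip. Video is re-encoded with libx264 so cuts
// land on fresh keyframes; audio-only clips use stream copy.
function extractArgs(clip: Clip, out: string, video: boolean): string[] {
  const codec = video ? ["-c:v", "libx264", "-c:a", "aac"] : ["-c", "copy"];
  return ["-i", clip.src, "-ss", String(clip.start), "-to", String(clip.end), ...codec, out];
}

// The concat demuxer reads a list file with one `file '<path>'` line per
// clip; single quotes inside paths must be escaped.
function concatList(paths: string[]): string {
  return paths.map(p => `file '${p.replace(/'/g, "'\\''")}'`).join("\n") + "\n";
}

// Final join: ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp3
function concatArgs(listPath: string, out: string): string[] {
  return ["-f", "concat", "-safe", "0", "-i", listPath, "-c", "copy", out];
}
```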
Results
Processed 200+ episodes of Cal Newport’s Deep Life podcast. Full indexing takes about 2 hours on a 4090 GPU running whisper.cpp. Searches return in under a second. The pipeline handles arbitrary RSS feeds with minor config changes.
The complete workflow:

```sh
# Download → Transcribe → Index → Search → Stitch
bun run search media/podcasts "deep work strategies" --json | bun run stitch - output.mp3
```