
Getting Started with LLMs for Call Transcript Analysis

An exploratory field note: prototyping LLM-powered fraud-relevant extraction from call transcripts, comparing local Mistral and Qwen against Gemini.

Bugni Labs

Summary

This field note documents an exploratory effort to determine how quickly we could prototype an LLM-powered approach for extracting structured, fraud-relevant insights from call transcripts. Using a combination of local open-source models and hosted Gemini models, we focused on speed, output quality, and practical trade-offs rather than production readiness. Overall, smaller local models proved sufficient for most transcripts, with limited incremental gains from larger models or vector storage at this stage.

Context

This experiment explored whether we could rapidly prototype LLM-based transcript analysis using test transcripts, open-source ML tooling, and Gemini models.

The primary objective was to generate fast, reliable, and queryable transcript intelligence capable of identifying fraud-relevant signals such as vulnerability and financial resilience. A secondary goal was to improve our understanding of model behaviour, performance trade-offs, and the potential impact of agent-based approaches when combining multiple models.

What we did

  1. Defined the goal and approach. Focused on transcript analysis rather than live transcription. Prioritised speed of experimentation and learning over optimisation.
  2. Set up a local LLM environment. Installed Ollama and ran two local models — mistral:7b and qwen2.5:14b-instruct — and created a local Python environment using LangChain.
  3. Implemented a multi-agent analysis flow. Built two agent personas: a Classifier that categorises transcripts based on predefined properties, and a Quality Analyst that evaluates transcript handling against predefined criteria and assigns a score from 1–10. Used LangChain and Pydantic schemas to pass structured outputs between agents (a minimal sketch of this flow follows the list).
  4. Introduced vector storage. Deployed Milvus via Docker to store transcript embeddings, and compared vector retrieval performance against local disk reads (see the Milvus sketch below).
  5. Benchmarked against Gemini models. Ran a subset of transcripts through Gemini models and used those outputs as a quality reference for team validation (see the comparison sketch below).
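
To make the structured hand-off between the two personas in step 3 concrete, here is a minimal sketch of the kind of chain we prototyped. It assumes Ollama is serving mistral:7b locally and that the langchain-ollama package is installed; the schemas, prompts, and sample transcript are simplified placeholders rather than our actual classification properties and quality criteria.

```python
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama  # assumes the langchain-ollama package


# Hypothetical output schemas; the real property lists and scoring criteria
# lived in our prompt configuration and are simplified here.
class Classification(BaseModel):
    category: str = Field(description="e.g. 'vulnerability' or 'financial resilience'")
    fraud_relevant: bool = Field(description="Whether fraud-relevant signals are present")


class QualityAssessment(BaseModel):
    score: int = Field(ge=1, le=10, description="Call-handling score from 1 to 10")
    rationale: str = Field(description="Short justification for the score")


llm = ChatOllama(model="mistral:7b", temperature=0)  # assumes Ollama is running locally


def run_agent(system_prompt: str, transcript: str, schema) -> BaseModel:
    """Run one persona: prompt the local model and parse its structured output."""
    parser = PydanticOutputParser(pydantic_object=schema)
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt + "\n{format_instructions}"),
        ("human", "Transcript:\n{transcript}"),
    ]).partial(format_instructions=parser.get_format_instructions())
    return (prompt | llm | parser).invoke({"transcript": transcript})


transcript = "Caller: I don't recognise this transfer, and I've just lost my job..."

# The Classifier's structured output is injected into the Quality Analyst's prompt.
classification = run_agent(
    "You are a Classifier. Categorise the call against the predefined properties.",
    transcript,
    Classification,
)
quality = run_agent(
    "You are a Quality Analyst. The Classifier marked this call as "
    f"category={classification.category}, fraud_relevant={classification.fraud_relevant}. "
    "Score how the call was handled from 1 to 10.",
    transcript,
    QualityAssessment,
)
print(classification, quality)
```

The same pattern extends to more personas; the key design choice is that each agent only ever receives validated, typed output from the previous one rather than free-form text.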
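
Step 4's vector storage was similarly lightweight. The sketch below assumes a Milvus standalone instance exposed on localhost via Docker and an embedding model served by Ollama (nomic-embed-text, a 768-dimension assumption); the collection layout and timing code are illustrative, not our benchmark harness.

```python
import time

from langchain_ollama import OllamaEmbeddings
from pymilvus import MilvusClient

embedder = OllamaEmbeddings(model="nomic-embed-text")  # assumed embedding model
client = MilvusClient(uri="http://localhost:19530")    # Milvus standalone via Docker

COLLECTION = "transcripts"
if not client.has_collection(COLLECTION):
    client.create_collection(collection_name=COLLECTION, dimension=768)

# Index transcript chunks alongside their embeddings.
chunks = [
    "Caller: I did not authorise that payment...",
    "Agent: Let me check the recent account activity...",
]
vectors = embedder.embed_documents(chunks)
client.insert(
    collection_name=COLLECTION,
    data=[{"id": i, "vector": v, "text": t} for i, (v, t) in enumerate(zip(vectors, chunks))],
)

# Time a similarity search so it can be compared against reading the same
# transcripts straight from local disk.
start = time.perf_counter()
hits = client.search(
    collection_name=COLLECTION,
    data=[embedder.embed_query("signs of financial vulnerability")],
    limit=3,
    output_fields=["text"],
)
print(f"Milvus retrieval took {time.perf_counter() - start:.2f}s: {hits}")
```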
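
For step 5, the Gemini comparison can be as simple as running the same prompt through both providers and reviewing the outputs side by side. The sketch assumes GOOGLE_API_KEY is set in the environment; gemini-1.5-flash and the prompt wording are illustrative choices, not fixed decisions.

```python
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

local = ChatOllama(model="mistral:7b", temperature=0)
hosted = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)  # needs GOOGLE_API_KEY

PROMPT = (
    "You are a Classifier. Identify fraud-relevant signals (vulnerability, "
    "financial resilience) in the following call transcript:\n\n{transcript}"
)


def compare(transcript: str) -> None:
    """Run the same prompt through the local and hosted model for side-by-side review."""
    for name, model in [("mistral:7b", local), ("gemini-1.5-flash", hosted)]:
        reply = model.invoke(PROMPT.format(transcript=transcript))
        print(f"--- {name} ---\n{reply.content}\n")


compare("Caller: I think someone has been using my card while I was in hospital...")
```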

What we observed

Running LLMs locally is resource-intensive; smaller models respond significantly faster. Larger models generate marginally richer outputs, but with longer inference times. In approximately 8 out of 10 transcripts, quality differences between models were negligible. Using Milvus for retrieval yielded minimal performance improvement, with gains of under two to three seconds compared to reading from local disk.

Decisions made

We chose to run only one model at a time due to local device hardware constraints, and selected Mistral 7B for bulk local processing because of its speed. Gemini models were retained for limited use on a subset of transcripts for quality assurance and comparison. We avoided simultaneous multi-model execution to prioritise stability and performance.

Reflections and learnings

System prompts play a critical role in grounding agent behaviour and task execution. Passing structured outputs between agents using LangChain and Pydantic enables rapid prototyping. Multi-agent setups are conceptually useful but add overhead without clear quality gains at this scale. Hardware limitations have a larger impact on experimentation speed than model selection alone.

Risks and concerns

The transcripts used were synthetic, which may limit the applicability of results. Output consistency is not guaranteed when running the same transcript multiple times. Hosted LLM usage introduces cost considerations — reducing hallucinations often requires larger system and user prompts, and increased prompt size leads to higher token usage and ongoing context costs.
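
To illustrate why prompt growth matters, a back-of-the-envelope sketch is shown below. Every figure is a placeholder rather than a measured value or a real provider price, but it shows how a long grounding prompt multiplies across transcript volume.

```python
# Rough, illustrative arithmetic only: placeholder sizes, volumes, and prices.
TOKENS_PER_CHAR = 1 / 4             # common rule of thumb; real tokenisers vary
PRICE_PER_1K_INPUT_TOKENS = 0.001   # placeholder rate, not a real provider price

system_prompt_chars = 6_000         # a long, hallucination-reducing system prompt
transcript_chars = 12_000           # an average call transcript
transcripts_per_day = 5_000

tokens_per_call = (system_prompt_chars + transcript_chars) * TOKENS_PER_CHAR
daily_cost = tokens_per_call / 1_000 * PRICE_PER_1K_INPUT_TOKENS * transcripts_per_day
print(f"~{tokens_per_call:,.0f} input tokens per transcript, ~${daily_cost:,.2f}/day")
```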

Open questions

How can contextual memory be retained effectively across multiple transcript interactions? How can Milvus, or any other vector database, be better leveraged for contextual grounding rather than simple retrieval? Will adopting RAG reduce hallucinations overall, or increase token consumption and cost?

These are the questions we expect to address in the next iteration of this work.

Tags: llms, prototyping, fraud-detection, emerging-tech, financial-services