Topic Intelligence
Automatically discover what your team is talking about, surface unusual messages, and trace topics across channels — all powered by semantic embeddings.
Overview
Topic Intelligence uses vector embeddings — numerical representations of meaning — to analyse your voice transcripts. Every time a message is transcribed, an embedding is automatically generated in the background. This page then uses those embeddings to:
- Cluster messages into conversation topics.
- Detect anomalies — messages that don't fit any typical pattern.
- Find similar messages — given any message, find others with closely related content.
Unlike AI Insights, Topic Intelligence does not send transcripts to an AI service. All analysis runs as native SQL and pgvector operations inside your database. There are no token costs and no transcript text leaves your infrastructure.
Prerequisites
For Topic Intelligence to return results, three conditions must be met:
- Transcription must be enabled — At least some of your channels need transcription turned on so that voice messages produce text transcripts.
- Embeddings must be generating — The embedding Edge Function runs automatically after each transcript is saved. This requires the edge_function_url_generate_embedding and service_role_key secrets to be configured in Supabase Vault, and an OPENAI_API_KEY set for the Edge Function.
- Sufficient data volume — Clustering works best with at least 50–100 messages with embeddings in the selected time range. More data produces more meaningful clusters.
If you click "Discover" and see no results, the most likely cause is that your messages don't have embeddings yet. Check that the prerequisites above are in place, then wait for new messages to be transcribed and embedded.
How It Works
Topic Intelligence operates in three steps:
- Voice → Transcript — When someone speaks in a channel, the audio is transcribed to text in real time by whoot.'s transcription engine.
- Transcript → Embedding — A database trigger fires an Edge Function that converts the transcript into a 1,536-dimension vector (using OpenAI's embedding model). This vector captures the meaning of the text, not just the exact words — so "the equipment is broken" and "the machine stopped working" produce similar vectors.
- Embedding → Analysis — The three analysis modes on this page query those vectors using pgvector's cosine distance operator to group, rank, and compare messages by meaning.
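The distance measure behind all three analysis modes can be sketched in a few lines. The following Python snippet is illustrative only: the real vectors are 1,536-dimensional and the comparison runs as SQL inside the database, but the toy vectors show how cosine distance puts messages with similar meaning close together even when they share no words.

```python
import math

def cosine_distance(a, b):
    # The same quantity pgvector's <=> operator computes: 1 minus cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for the real 1,536-dimension embeddings.
equipment_broken = [0.9, 0.2, 0.1]   # "the equipment is broken"
machine_stopped  = [0.8, 0.3, 0.1]   # "the machine stopped working"
lunch_plans      = [0.1, 0.1, 0.9]   # an unrelated topic

# Messages with similar meaning sit closer together than unrelated ones.
assert cosine_distance(equipment_broken, machine_stopped) < \
       cosine_distance(equipment_broken, lunch_plans)
```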
Topic Clusters
The primary mode. Clustering groups your transcripts into conversation topics automatically — you don't need to define the topics in advance.
What Is a Cluster?
A cluster is a group of messages that are semantically similar — they talk about the same thing, even if they use different words. For example:
- "The forklift in warehouse B needs maintenance"
- "Can someone check the hydraulic lift? It's making a weird noise"
- "Equipment issue in the loading dock area"
These three messages would likely end up in the same cluster — "equipment / maintenance issues" — even though they share almost no exact words.
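How messages end up grouped can be sketched in Python. This is not the actual clustering algorithm (which isn't documented here); it is a minimal seed-based grouping consistent with the "representative message" idea: pick a handful of well-spread seed messages, then attach every other message to its nearest seed.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster_by_seeds(embeddings, k):
    """Pick k well-spread seeds (farthest-point seeding), then assign every
    message to its nearest seed. Each seed plays the role of the cluster's
    representative message. Illustrative, not the production algorithm."""
    seeds = [0]
    while len(seeds) < min(k, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in seeds]
        # Next seed: the message farthest from every existing seed.
        seeds.append(max(candidates, key=lambda i: min(
            cosine_distance(embeddings[i], embeddings[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for i, e in enumerate(embeddings):
        nearest = min(seeds, key=lambda s: cosine_distance(e, embeddings[s]))
        clusters[nearest].append(i)
    return clusters  # seed index -> indices of member messages
```

With four toy embeddings forming two obvious groups, `cluster_by_seeds(embeddings, 2)` splits them along the meaning boundary rather than by exact values.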
Controls
- Time Range — Choose Last 24 hours, 7 days, or 30 days. Longer ranges give more data to cluster but may mix topics from different periods.
- Number of Clusters — Choose how many topics to produce (3, 5, 8, 10, or 15). A lower number creates broader groupings; a higher number creates more specific sub-topics.
- Discover button — Runs the clustering algorithm. Results appear within a few seconds.
Reading Cluster Results
Each cluster card shows:
- Topic number — An auto-assigned label (Topic 1, Topic 2, etc.).
- Message count — How many messages belong to this topic.
- Cohesion score — A percentage indicating how tightly related the messages are. "Very tight (85%)" means the messages are very similar; "Loose (35%)" means the grouping is broad.
- Representative message — The "seed" message chosen as the centre of the cluster. It's the most typical example of the topic.
Click a cluster card to expand it and see all the messages in that topic. Each message shows the speaker, channel, timestamp, and a match percentage showing how closely it relates to the representative message. You can click Similar on any message to jump to the Similarity Search with that message pre-loaded.
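One plausible reading of the cohesion score, sketched below, is the average similarity of a cluster's members to its representative message. The exact formula is an assumption; the documentation describes the score only qualitatively.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def cohesion_pct(member_embeddings, representative):
    """Mean cosine similarity of members to the representative, as a
    percentage. Hypothetical formula, for intuition only."""
    sims = [cosine_similarity(e, representative) for e in member_embeddings]
    return round(100 * sum(sims) / len(sims))
```

Under this reading, a cluster whose members all point the same way as the representative scores near 100%, matching the "Very tight" label, while a mixed bag scores low.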
Anomaly Detection
The Anomaly Detection mode finds messages that are semantically unusual — they don't fit the normal conversation patterns of your workspace.
What Is an Anomaly?
An anomaly is a message whose embedding is far from its nearest neighbours. In plain terms: nobody else said anything similar in the same time period. This might be:
- An urgent or unusual event — "There's flooding in the server room" when everything else is about normal operations.
- An off-topic message — Personal conversation in a work channel.
- A rare topic — Something discussed only once, like a new policy announcement.
- A false transcript — Garbled or incorrect transcription that produces an unusual embedding.
Controls
- Time Range — Choose the analysis window. Shorter ranges may produce more anomalies because there's less "normal" context for comparison.
- Detect button — Runs the anomaly scoring algorithm.
Reading Anomaly Results
Results are ranked by anomaly score — the most unusual messages appear first. Each card shows:
- Rank — Colour-coded position (red = most anomalous, orange = moderately anomalous, yellow = mildly anomalous).
- Speaker & channel — Who said it and where.
- Transcript text — The full message content.
- Score badge — The raw anomaly score. Higher values mean more unusual.
Click the search icon on any anomaly to jump to the Similarity Search and see what (if anything) is close to it.
Similarity Search
The Similarity Search mode lets you take any single message and find other messages with closely related content across your entire workspace — any channel, any speaker, any time period.
How to Search
You need a message ID (a UUID) to search. There are three ways to get one:
- From cluster results — Click "Similar" on any message inside an expanded cluster. The ID is filled in automatically.
- From anomaly results — Click the search icon on any anomaly card.
- From the Replay page — Click a transcript message in Session Replay to see its details, which include the message ID.
Paste the UUID into the filter bar and click Find Similar. The algorithm uses pgvector's HNSW index to perform a fast approximate nearest-neighbour search.
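Conceptually, the search is just an ordering by cosine distance. A brute-force Python equivalent is shown below; the table and column names in the SQL comment are illustrative, and the HNSW index answers the same ordering approximately but much faster.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def find_similar(target, embeddings, limit=10):
    """Exact brute-force version of a pgvector query like:
        SELECT id FROM messages ORDER BY embedding <=> :target LIMIT :limit;
    (table and column names are illustrative)."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine_distance(target, embeddings[i]))
    return ranked[:limit]
```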
Reading Similarity Results
Results are ranked by similarity percentage:
- 90%+ (green badge) — Very closely related. Likely discussing the same topic or event.
- 70–89% (blue badge) — Related. Similar topic area but may differ in specifics.
- Below 70% — Loosely related. Some thematic overlap but different conversations.
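The tiers above can be encoded directly. In this sketch the label for the lowest tier is an assumption, since the documentation names no badge colour below 70%.

```python
def similarity_badge(pct):
    """Map a similarity percentage to the badge tiers described above.
    'none' for sub-70% results is an assumption."""
    if pct >= 90:
        return "green"
    if pct >= 70:
        return "blue"
    return "none"
```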
Each result shows the speaker, channel, timestamp, transcript, and a copy button to grab the message ID if you want to chain searches.
Privacy & Cost
All three analysis modes run entirely inside your database using pgvector. No transcript text is sent to any external service during analysis. The only external call is the one-time embedding generation (via OpenAI) when a transcript is first created — this happens before you ever visit this page.
There are no per-query AI costs for Topic Intelligence. Once embeddings exist, clustering, anomaly detection, and similarity search are pure SQL operations.
Permissions
Access to Topic Intelligence requires the analytics.clustering permission. By default this is granted to any role that has analytics.insights access. Tenant administrators can configure this in the Roles & Permissions page.
Tips & Best Practices
- Start with 5 clusters on 7 days — This is a good default that gives meaningful groupings without being too granular. Increase the cluster count if you see topics that seem too broad.
- Use anomalies as an early warning system — Review anomalies weekly. They often surface things that slip through normal monitoring — unusual incidents, one-off requests, or process deviations.
- Chain searches — When you find an interesting message in a cluster, click "Similar" to see related messages. Then click "Similar" again on one of those results to explore further. This lets you trace a topic across time and channels.
- Combine with Keyword Watchdog — If clustering reveals a recurring topic you want to monitor, create a Watchdog rule with key phrases from that cluster to get automated alerts going forward.
- Understand cohesion scores — A "Very tight" cluster with many messages is a strong, recurring topic. A "Loose" cluster with few messages might just be noise. Focus your attention on tight, high-count clusters.
- Try different numbers of clusters — If all your clusters look the same, reduce the count to 3 to get broader themes. If topics seem mixed, increase to 10 or 15 for finer granularity.
Troubleshooting
- "No clusters found" — There aren't enough messages with embeddings in the selected time range. Try a longer time range, or verify that transcription and embedding generation are enabled.
- "No anomalies detected" — Either there aren't enough embedded messages, or all messages are semantically consistent (nothing unusual was said).
- "No similar messages found" — The target message may not have an embedding yet (embeddings are generated a few seconds after the transcript). Wait a moment and try again.
- Results seem random — With very few messages, the clustering algorithm doesn't have enough data to form meaningful groups. More data (100+ messages) produces significantly better results.