Voice interruption¶
When a user speaks for at least min_interruption_s while the familiar is
generating a response or playing one, it counts as an interruption. This
applies only to voice channels — text conversations are unaffected.
Back-channel sounds (brief "mm-hm", laughter) are filtered by the minimum-duration threshold — no audio classification is needed.
Multiple people speaking in the same window are merged into one event.
Configuration¶
character.toml¶
[voice.interruption]
# Five named tiers or a raw float 0.0–1.0.
# very_meek = 0.10, meek = 0.20, average = 0.30 (default),
# stubborn = 0.45, very_stubborn = 0.60
interrupt_tolerance = "average"
# Minimum seconds of continuous user speech before it counts as an interruption.
# Shorter bursts ("mm-hm") are discarded.
min_interruption_s = 2.0
# Duration threshold separating short from long interruptions.
# Must exceed min_interruption_s.
short_long_boundary_s = 30.0
All three values live on CharacterConfig and are loaded at startup.
They describe the familiar's personality, not a channel's tuning, so they
are intentionally not per-channel.
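For illustration, a minimal sketch of how the tier names could resolve to a base probability (the real InterruptTolerance enum lives in config.py; resolve_tolerance is a hypothetical helper):

```python
from enum import Enum

# Sketch only: tier name → base keep-talking probability, per the table below.
class InterruptTolerance(float, Enum):
    VERY_MEEK = 0.10
    MEEK = 0.20
    AVERAGE = 0.30
    STUBBORN = 0.45
    VERY_STUBBORN = 0.60

def resolve_tolerance(value: str | float) -> float:
    """Accept a named tier or a raw float 0.0–1.0 (hypothetical helper)."""
    if isinstance(value, str):
        return InterruptTolerance[value.upper()].value
    return max(0.0, min(1.0, float(value)))
```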
Tolerance tiers¶
| Tier | Base probability |
|---|---|
| very_meek | 0.10 |
| meek | 0.20 |
| average | 0.30 (default) |
| stubborn | 0.45 |
| very_stubborn | 0.60 |
The probability is the chance the familiar keeps talking when interrupted while speaking. It is combined with the mood modifier and an unsolicited bias before the roll:
effective = clamp(base + mood_modifier + unsolicited_bias, 0, 1)
keep_talking = random() < effective
unsolicited_bias = +0.35 when the response was initiated as an
interjection (the familiar barged in without being addressed). Lulls and
direct addresses carry no bias. average tolerance + unsolicited bias
yields a 65% push-through probability.
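A minimal sketch of the roll, using only the formula and the +0.35 bias described above (keep_talking here is an illustrative name):

```python
import random

def keep_talking(base: float, mood_modifier: float, unsolicited: bool) -> bool:
    """Moment-1 roll: True → push through, False → yield."""
    unsolicited_bias = 0.35 if unsolicited else 0.0
    effective = min(1.0, max(0.0, base + mood_modifier + unsolicited_bias))
    return random.random() < effective

# average tolerance, unsolicited interjection: 0.30 + 0.35 = 0.65 → 65% push-through
```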
Response state machine¶
A per-guild ResponseTracker tracks the familiar's current phase:
IDLE ──→ GENERATING ──→ SPEAKING ──→ IDLE
              │              │
        (interrupted) (interrupted)
ResponseState enum¶
| State | Meaning |
|---|---|
| IDLE | Not generating or speaking. Interruption detection is off. |
| GENERATING | LLM call is in-flight. TTS has not started. |
| SPEAKING | TTS audio is playing via vc.play(). |
The tracker holds:
- generation_task — cancellable asyncio.Task wrapping the LLM call
- response_text — full reply text (set after generation completes)
- timestamps — list[WordTimestamp] from Cartesia (for word-boundary splits)
- playback_start_time — time.monotonic() at the SPEAKING transition
- vc — reference to the Discord VoiceClient
- is_unsolicited — whether the reply was initiated by the chattiness path
- mood_modifier — per-response float cached from MoodEvaluator at GENERATING entry
- interruption_elapsed_ms — ms since playback started when the familiar yielded
- interrupt_event — asyncio.Event set at burst finalization on a SPEAKING yield
- interrupt_classification — InterruptionClass populated at finalization on yield
- interrupt_transcript — interrupter's words (from burst transcript)
- interrupt_starter_name — display name of the user whose burst triggered the yield
- pending_interrupter_turns — (name, text) pairs from short@GENERATING bursts, flushed to history after the original buffer write
- short_yield_pending — True while a short@SPEAKING resume task is in flight; suppresses the concurrent voice endpointing lull
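A rough dataclass sketch of these fields (abridged and illustrative; the real ResponseTracker lives in voice/interruption.py):

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto

class ResponseState(Enum):
    IDLE = auto()
    GENERATING = auto()
    SPEAKING = auto()

@dataclass
class ResponseTrackerSketch:
    state: ResponseState = ResponseState.IDLE
    generation_task: asyncio.Task | None = None
    response_text: str = ""
    timestamps: list = field(default_factory=list)       # list[WordTimestamp] from Cartesia
    playback_start_time: float | None = None              # time.monotonic() at SPEAKING entry
    is_unsolicited: bool = False
    mood_modifier: float = 0.0
    interruption_elapsed_ms: int | None = None
    interrupt_event: asyncio.Event = field(default_factory=asyncio.Event)
    short_yield_pending: bool = False
```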
Every transition is logged at INFO:
tracker guild=999 state: IDLE→GENERATING (unsolicited=False)
tracker guild=999 state: GENERATING→SPEAKING (unsolicited=False)
tracker guild=999 state: SPEAKING→IDLE (unsolicited=False)
Interruption detection¶
Signal source¶
Deepgram interim and final Results messages feed
DeepgramVoiceActivityDetector (src/familiar_connect/voice/deepgram_vad.py),
which derives per-speaker speech-start / speech-end edges from recognized-word
arrivals. These callbacks drive VoiceLullMonitor, which re-exposes them via
on_voice_activity:
- speech_started(user_id) — first non-empty Deepgram interim or final for this speaker (DGVAD speech-start edge).
- speech_ended(user_id) — 700 ms silence watchdog expiry with no new Deepgram results (DGVAD watchdog), or a Deepgram endpointed final for a speaking user.
InterruptionDetector subscribes to this hook. It ignores all events when
the tracker is IDLE.
Burst accumulation¶
The detector accumulates a burst from the first speech_started while
state ≠ IDLE until all users have been quiet for lull_timeout_s. Burst
duration is wall-clock time from first-started to last-ended.
voice activity user=12345 event=started
interruption: min threshold crossed by user=12345 during GENERATING
voice activity user=12345 event=ended
interruption: short (3.20s) by user=12345 during GENERATING
The burst transcript is accumulated from Deepgram finals that arrive during
the active burst. InterruptionDetector.on_transcript is wired from
_route_transcript_to_monitor alongside the existing VoiceLullMonitor call.
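A hedged sketch of the accumulation bookkeeping (class and method names are illustrative; the lull timer itself is not modeled here):

```python
import time

class BurstSketch:
    """One burst: wall-clock duration from first-started to last-ended, plus transcript pieces."""

    def __init__(self) -> None:
        self.started_at: float | None = None     # monotonic time of the first speech_started
        self.last_ended_at: float | None = None
        self.active_users: set[int] = set()
        self.transcript_parts: list[str] = []

    def on_started(self, user_id: int) -> None:
        if self.started_at is None:
            self.started_at = time.monotonic()
        self.active_users.add(user_id)

    def on_ended(self, user_id: int) -> None:
        self.active_users.discard(user_id)
        self.last_ended_at = time.monotonic()

    def on_final(self, text: str) -> None:
        self.transcript_parts.append(text)

    @property
    def duration_s(self) -> float:
        if self.started_at is None or self.last_ended_at is None:
            return 0.0
        return self.last_ended_at - self.started_at
```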
Classification¶
| Duration | Class | Behaviour |
|---|---|---|
| < min_interruption_s | discarded | Drop. No dispatch. |
| ≥ min_interruption_s, < short_long_boundary_s | short | See dispatch matrix below |
| ≥ short_long_boundary_s | long | See dispatch matrix below |
State-upgrade rule¶
If a burst starts while state is GENERATING and the tracker transitions to
SPEAKING before the burst ends, the burst is upgraded to SPEAKING at
classification time.
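A minimal sketch of the thresholding, assuming these InterruptionClass member names (classification runs at finalize time, after the state-upgrade rule above):

```python
from enum import Enum, auto

class InterruptionClass(Enum):   # member names assumed; the real enum lives in voice/interruption.py
    DISCARDED = auto()
    SHORT = auto()
    LONG = auto()

def classify(duration_s: float, min_interruption_s: float,
             short_long_boundary_s: float) -> InterruptionClass:
    """Map burst duration onto the three classes from the table above."""
    if duration_s < min_interruption_s:
        return InterruptionClass.DISCARDED
    if duration_s < short_long_boundary_s:
        return InterruptionClass.SHORT
    return InterruptionClass.LONG
```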
Dispatch matrix¶
Five paths, keyed on (classification × state-at-burst-start):
| Classification | State | Path | Log |
|---|---|---|---|
| discarded | any | Drop | interruption: discarded (…s) |
| short | GENERATING | Polite wait (see below) | dispatch: short@GENERATING → polite-wait speaker=… |
| long | GENERATING | Cancel + regen (see below) | dispatch: long@GENERATING → cancel+regen user=… |
| short | SPEAKING | Yield+resume or push-through (see below) | dispatch: short@SPEAKING → yield+resume speaker=… |
| long | SPEAKING | Yield + regen with context (see below) | dispatch: long@SPEAKING → regen speaker=… |
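The matrix reads naturally as a lookup keyed on the (classification, state) pair. A sketch with string keys and hypothetical handler names (the real code keys on the InterruptionClass and ResponseState enums):

```python
# Hypothetical handler names standing in for the paths documented below.
def handle_polite_wait(): ...
def handle_cancel_and_regen(): ...
def handle_yield_resume_or_push_through(): ...
def handle_yield_and_regen(): ...

DISPATCH = {
    ("short", "GENERATING"): handle_polite_wait,
    ("long", "GENERATING"): handle_cancel_and_regen,
    ("short", "SPEAKING"): handle_yield_resume_or_push_through,
    ("long", "SPEAKING"): handle_yield_and_regen,
}

def dispatch(classification: str, state_at_start: str) -> None:
    if classification == "discarded":
        return  # drop: no dispatch, only the discard log line
    DISPATCH[(classification, state_at_start)]()
```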
Short@GENERATING — polite wait¶
The familiar keeps generating. A delivery gate (asyncio.Event) is
cleared when any burst starts and opened at finalize. _run_voice_response
awaits detector.wait_for_lull() between TTS synthesis and playback; it
returns short and playback proceeds normally.
The interrupter's transcript is stashed on
tracker.pending_interrupter_turns at finalize time. _run_voice_response
flushes them to history after the original buffer write so chronology
reads: original user turn → interrupter turn → assistant reply.
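A sketch of the gate pattern, assuming illustrative method names (the real gate is InterruptionDetector._delivery_gate, and the real wait_for_lull also returns the classification):

```python
import asyncio

class DeliveryGateSketch:
    """Playback waits on an asyncio.Event that burst starts close and finalization reopens."""

    def __init__(self) -> None:
        self._gate = asyncio.Event()
        self._gate.set()                  # open by default

    def on_burst_started(self) -> None:
        self._gate.clear()                # hold playback while someone is speaking

    def on_finalized(self) -> None:
        self._gate.set()                  # lull reached: release playback

    async def wait_for_lull(self) -> None:
        await self._gate.wait()
```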
An early long-boundary timer also fires if the burst crosses
short_long_boundary_s during GENERATING before the lull — see the next
section.
Long@GENERATING — cancel + regen¶
The long-boundary timer fires as soon as accumulated burst duration crosses
short_long_boundary_s. _on_long_boundary_crossed cancels
tracker.generation_task immediately — before the lull — so no tokens are
wasted on a reply that will be discarded.
At lull finalization, on_long_during_generating fires. It:
- Schedules dispatch_interruption_regen as an asyncio.Task.
- Sets extras["_regen_pending"] synchronously so the concurrent voice endpointing lull's _deliver_to_monitor call is suppressed.
dispatch_interruption_regen calls _run_voice_response with an
interruption_context note:
{speaker} interrupted while you were forming a response.
They said: "{transcript}"
History is not written for the cancelled generation.
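A hedged sketch of the two callbacks (signatures and the tracker/extras shapes are assumptions; only the ordering and the context note come from this section):

```python
import asyncio

async def dispatch_interruption_regen(tracker, interruption_context: str) -> None:
    ...  # stub: the real function re-enters _run_voice_response with the note

def on_long_boundary_crossed(tracker) -> None:
    # Timer callback: cancel the in-flight LLM call as soon as the burst goes long.
    if tracker.generation_task is not None:
        tracker.generation_task.cancel()

def on_long_during_generating(tracker, extras: dict, speaker: str, transcript: str) -> None:
    # Lull finalization: suppress the concurrent endpointing lull, then schedule the regen.
    extras["_regen_pending"] = True
    note = (f"{speaker} interrupted while you were forming a response. "
            f'They said: "{transcript}"')
    asyncio.create_task(dispatch_interruption_regen(tracker, note))
```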
Moment 1 — tolerance roll (SPEAKING)¶
When a burst crosses min_interruption_s while the familiar is SPEAKING:
effective = clamp(base + mood_modifier + unsolicited_bias, 0, 1)
keep_talking = random() < effective
- keep talking → audio continues (push-through path)
- yield → vc.stop() called immediately; interruption_elapsed_ms and _remaining_timestamps captured; interrupt_event armed for post-playback coordination
roll: base=0.30 mood=+0.00 unsolicited=+0.00 effective=0.30 roll=0.71 → yield
Short@SPEAKING — yield+resume or push-through¶
Yield path (keep_talking = False):
- vc.stop() at Moment 1; interrupt_event set.
- At finalize: _on_short_yield_resume(remaining) created as an asyncio.Task. tracker.short_yield_pending = True is set first so the concurrent voice endpointing lull is suppressed while the resume is pending.
- Resume task: re-synthesizes remaining words and plays.
- History records the full tracker.response_text (not just the delivered portion).
Push-through path (keep_talking = True):
- Audio continues uninterrupted.
- At finalize: _on_push_through_transcript writes the interrupter's transcript to history immediately so it is not lost.
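A sketch of the resume task on the yield path, assuming the project's split_at_elapsed plus a TTS client with a synthesize coroutine (both call shapes are assumptions):

```python
async def on_short_yield_resume(tracker, vc, tts) -> None:
    """Re-synthesize only the undelivered words and play them."""
    _, remaining = split_at_elapsed(tracker.timestamps, tracker.interruption_elapsed_ms)
    remaining_text = " ".join(w.word for w in remaining)
    result = await tts.synthesize(remaining_text)   # assumed TTS client call
    vc.play(result.audio_source)                    # assumed: TTSResult exposes a playable source
    tracker.short_yield_pending = False
    # History still records the full tracker.response_text, not just the delivered portion.
```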
Long@SPEAKING — yield + regen with context¶
- vc.stop() at Moment 1.
- _run_voice_response awaits tracker.interrupt_event post-playback.
- tracker.interrupt_classification is long → compute delivered_text from split_at_elapsed(tracker.timestamps, elapsed_ms).
- Write delivered_text to history as the interrupted assistant turn.
- Re-generate with interruption_context:
You were speaking and said: "{delivered_text}". {name} interrupted you. They said: "{transcript}"
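A hedged sketch of this path's post-playback handling (write_assistant_history is a hypothetical stand-in for the history write):

```python
def write_assistant_history(text: str) -> None:
    ...  # stub: persist the interrupted assistant turn

async def handle_long_speaking_yield(tracker, interrupter_name: str) -> str:
    """Wait for burst finalization, persist the delivered portion, build the regen note."""
    await tracker.interrupt_event.wait()
    delivered, _ = split_at_elapsed(tracker.timestamps, tracker.interruption_elapsed_ms)
    delivered_text = " ".join(w.word for w in delivered)
    write_assistant_history(delivered_text)
    return (f'You were speaking and said: "{delivered_text}". '
            f"{interrupter_name} interrupted you. "
            f'They said: "{tracker.interrupt_transcript}"')
```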
Word-boundary split¶
split_at_elapsed(timestamps, elapsed_ms) partitions the list[WordTimestamp]
returned by Cartesia. A word is treated as delivered once its start_ms
has passed — even if playback was cut mid-word — so resumption always begins
on a clean boundary with no stutter.
delivered, remaining = split_at_elapsed(tracker.timestamps, elapsed_ms)
delivered_text = " ".join(w.word for w in delivered)
remaining_text = " ".join(w.word for w in remaining)
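A minimal implementation consistent with that rule, with a stand-in WordTimestamp (the real one is in tts.py):

```python
from dataclasses import dataclass

@dataclass
class WordTimestamp:              # stand-in: the real class lives in tts.py
    word: str
    start_ms: int

def split_at_elapsed(timestamps: list[WordTimestamp],
                     elapsed_ms: float) -> tuple[list[WordTimestamp], list[WordTimestamp]]:
    """Words whose start_ms has already passed count as delivered; the rest remain."""
    delivered = [w for w in timestamps if w.start_ms <= elapsed_ms]
    remaining = [w for w in timestamps if w.start_ms > elapsed_ms]
    return delivered, remaining
```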
Mood modifier¶
MoodEvaluator.evaluate() issues a short LLM prompt inspecting the last
six history turns and returns a float in [−0.5, +0.5], cached on
tracker.mood_modifier at GENERATING entry.
Examples:
- Excited about a topic → positive modifier (talks over people more)
- Just got corrected → negative modifier (yields more readily)
- Asking a question → negative modifier (wants the answer)
Uses the mood_eval LLM slot (configured under [llm.mood_eval] in
character.toml).
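A hedged sketch of the evaluator's shape; only the six-turn window and the [−0.5, +0.5] range come from this section, while the prompt text and llm.complete call are assumptions:

```python
async def evaluate_mood(llm, history: list[str]) -> float:
    """Ask the mood_eval slot for a single float and clamp it to [-0.5, +0.5]."""
    prompt = (
        "On a scale from -0.5 (yields readily) to +0.5 (talks over people), "
        "how eager is the character to keep talking given these recent turns?\n"
        + "\n".join(history[-6:])
    )
    raw = await llm.complete(prompt)        # assumed LLM client call
    try:
        value = float(raw.strip())
    except ValueError:
        return 0.0                          # unparsable output falls back to neutral
    return max(-0.5, min(0.5, value))
```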
Key components¶
| Component | Location |
|---|---|
| ResponseState enum | voice/interruption.py |
| ResponseTracker | voice/interruption.py |
| ResponseTrackerRegistry | voice/interruption.py |
| InterruptionDetector | voice/interruption.py |
| InterruptionClass enum | voice/interruption.py |
| split_at_elapsed | voice/interruption.py |
| MoodEvaluator | mood.py |
| InterruptTolerance enum | config.py |
| Cancellable generation task | bot.py — tracker.generation_task |
| Delivery gate | InterruptionDetector._delivery_gate |
| ContextRequest.interruption_context | context/types.py |
| Interruption-context renderer | context/render.py |
| WordTimestamp, TTSResult | tts.py |
Sequence diagrams¶
Short@GENERATING — polite wait¶
User speaks (<boundary) ── detector ── short@GENERATING
│
delivery gate cleared on burst start
(bot.py awaits gate before playback)
│
User quiet ─────────────────────────────────►
│
gate opens → interrupter turns stashed
playback proceeds as normal
Long@GENERATING — cancel + regen¶
User speaks (>boundary) ── long-boundary timer fires
│
generation_task.cancel()
│
User quiet ─── capture transcript ──────────►
│
build ContextRequest (interruption_context set)
│
new LLM call → TTS + play
Short@SPEAKING — yield + resume¶
Familiar playing ─── vc.play(audio) ──────────────────────────►
│
User speaks (>min, <boundary) ─── roll: yield ─────────────────►
│
vc.stop()
elapsed_ms captured
short_yield_pending = True
│
User quiet ─────────────────────────────────────────────────────►
│
split_at_elapsed → remaining
re-synthesize + vc.play(remaining)
full response_text → history
Long@SPEAKING — yield + regen¶
Familiar playing ─── vc.play(audio) ──────────────────────────►
│
User speaks (>boundary) ─── roll: yield ──────────────────────►
│
vc.stop()
interrupt_event armed
│
User quiet ─── capture transcript ──────────────────────────────►
│
delivered_text → history
build ContextRequest (delivered + transcript)
│
LLM + TTS + play