# Architecture overview
An AI familiar that joins Discord voice channels, listens, understands speech, and talks back using real AI voices.
## Goals
- **Single runtime.** The entire backend runs as one Python process using asyncio. No separate worker scripts, no external message broker.
- **Unified entry point.** One `familiar-connect run` starts everything — Discord gateway, voice capture, transcription, LLM, TTS, and Twitch listener all run as concurrent tasks under a single `asyncio.TaskGroup`.
- **Local-first.** The context layer makes no calls to third-party state stores. All context state lives in-process, in the filesystem next to the bot, or in the bot's own SQLite. The only network calls in the context layer are to the LLM endpoints we're already using for generation.
- **Single operator, one active familiar per process.** Familiar-Connect is run by a single admin on their own machine — there is no multi-user / multi-tenant ambition. Multiple character folders may coexist under `data/familiars/`, but exactly one is active at a time. See Configuration model for the detailed ownership rules.
## Target architecture
All components run as coroutines within a single asyncio event loop, scoped by `asyncio.TaskGroup` for structured concurrency and clean cancellation (Python 3.13+):
```mermaid
flowchart TB
dv([Discord voice channel]) --> cap[Audio capture]
cap --> aq[(asyncio.Queue: PCM)]
aq --> stt[Transcription<br/>Deepgram streaming]
stt --> tq[(asyncio.Queue: text)]
tw([Twitch EventSub]) --> tq
tq --> cm[ConversationMonitor<br/>chattiness & interjection]
cm --> cp[Context pipeline]
cp --> or[OpenRouter<br/>streaming completion]
or --> tts[TTS<br/>Cartesia / Azure]
tts --> oq[(asyncio.Queue: audio)]
oq --> dvp([Discord voice playback])
classDef queue fill:#eee,stroke:#888,stroke-dasharray:3 3;
class aq,tq,oq queue;
```
Every box runs under one root `asyncio.TaskGroup` — a crash anywhere cancels the whole reply path.
Development uses red/green TDD throughout.
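The shape of that root task group is simple enough to sketch. The snippet below is illustrative only — the stage coroutines are stand-ins, not the real `familiar_connect` entry point — but it shows the structured-concurrency property the diagram relies on: every component is a sibling task of one group, and a crash in any of them cancels the rest.

```python
import asyncio


async def produce(out: asyncio.Queue, item: str) -> None:
    """Stand-in for a source stage (audio capture, Twitch listener, ...)."""
    await out.put(item)


async def consume(inbox: asyncio.Queue) -> None:
    """Stand-in for a sink stage (Discord voice playback)."""
    print(await inbox.get())


async def main() -> None:
    # One queue per hop, exactly as in the diagram above.
    text_q: asyncio.Queue[str] = asyncio.Queue()

    # The root TaskGroup: every long-lived component is a sibling task.
    # If any task raises, the group cancels the others and re-raises,
    # so a failure anywhere tears down the whole reply path cleanly.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(produce(text_q, "transcript line"))
        tg.create_task(consume(text_q))


asyncio.run(main())
```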
## External services
The runtime talks to five outside services. Two are required; the rest are optional and the bot degrades gracefully without them.
```mermaid
flowchart LR
subgraph required [Required]
direction TB
discord[Discord Gateway]
openrouter[OpenRouter LLM]
end
bot([familiar-connect<br/>runtime])
subgraph optional [Optional]
direction TB
cartesia[Cartesia TTS]
azure[Azure Speech]
deepgram[Deepgram STT]
twitch[Twitch EventSub]
end
discord <--> bot
openrouter <--> bot
bot --> cartesia
bot --> azure
deepgram --> bot
twitch --> bot
```
- **Discord Gateway** (required) — `DISCORD_BOT`. The bot has nothing to listen to or speak into without it.
- **OpenRouter** (required) — `OPENROUTER_API_KEY`. The reply generation call. Model selectable per-familiar.
- **Cartesia / Azure Speech** (optional) — TTS providers. Without either, the bot still replies in text channels it is subscribed to.
- **Deepgram** (optional) — streaming STT for voice input. See Voice input for the wiring.
- **Twitch EventSub** (optional) — only needed if a familiar uses the Twitch commentary features in the Twitch guide.

See Installation for the exact env-var names, the minimal "just text replies" configuration (sketched below), and how to turn each optional service on.
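As a rough orientation, a minimal environment for the "just text replies" mode needs only the two required services. The names below are the ones quoted on this page (assuming `DISCORD_BOT` is the bot-token variable); the Installation page remains the authoritative reference.

```
# Minimal configuration: text replies only.
DISCORD_BOT=...              # Discord bot token
OPENROUTER_API_KEY=...       # OpenRouter key for reply generation

# Optional services, enable as needed:
# AZURE_SPEECH_KEY=...       # Azure TTS
# AZURE_SPEECH_REGION=...
# CARTESIA_API_KEY=...       # Cartesia TTS
```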
## Core components
### Discord bot
Built with `py-cord`. Voice send/receive uses `davey` to handle Discord's DAVE (Audio/Video E2E Encryption) protocol. The subscription surface and channel-mode slash commands are documented in Slash commands.
### Transcription
**Primary: Deepgram (Nova-2, streaming)**

- Native WebSocket streaming API, ~300 ms latency, strong accuracy
- Handles raw PCM streams directly — maps well to Discord's audio pipeline
- Good Python SDK (`deepgram-sdk`)

**Fallback: faster-whisper (local)**

- Zero cost, no rate limits, no external dependency
- Requires a GPU for real-time performance
- Good offline / privacy-preserving option

Pipeline: Discord 48 kHz Opus → decode to PCM → resample to 16 kHz → stream to Deepgram WebSocket (or feed chunks to faster-whisper).
Incoming voice audio is fed into the `ConversationMonitor` via `VoiceLullMonitor` debouncing — see Voice input for the full wiring.
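The decode-and-resample hop maps naturally onto a small coroutine between the PCM queue and Deepgram. The sketch below assumes the third-party `websockets` package and Deepgram's raw WebSocket endpoint rather than `deepgram-sdk`, and the `DEEPGRAM_API_KEY` name is an assumption. The stride-3 decimation is for illustration only — a real resampler should low-pass filter first — and reading the transcript JSON back off the socket is omitted.

```python
import asyncio
import os

import numpy as np
import websockets

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=16000&model=nova-2"
)


def downsample_48k_to_16k(pcm48: bytes) -> bytes:
    """Mono 16-bit 48 kHz PCM -> 16 kHz by keeping every third sample."""
    samples = np.frombuffer(pcm48, dtype=np.int16)
    return samples[::3].tobytes()


async def stream_to_deepgram(pcm_q: asyncio.Queue[bytes]) -> None:
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # `additional_headers` on current websockets; `extra_headers` on older versions.
    async with websockets.connect(DEEPGRAM_URL, additional_headers=headers) as ws:
        while True:
            frame = await pcm_q.get()  # decoded, already-mono PCM from Discord
            await ws.send(downsample_48k_to_16k(frame))
```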
### AI response (OpenRouter)
The LLM call is the core of the bot's reply path. Its inputs — system prompt, retrieved knowledge, conversation history, per-user notes — are assembled by the Context pipeline, not inline in the bot loop. The LLM client (`familiar_connect.llm`) only speaks to OpenRouter; it is deliberately unaware of where its messages came from, so the pipeline can be tested and extended in isolation.
- **Provider:** OpenRouter. Model selection is per call site and per familiar — set individually in `character.toml` under `[llm.main_prose]`, `[llm.post_process_style]`, `[llm.reasoning_context]`, `[llm.history_summary]`, `[llm.memory_search]`, `[llm.interjection_decision]`, and `[llm.mood_eval]` (a fragment is sketched after this list).
- **Streaming:** responses are streamed so the TTS path can start speaking before the full reply arrives.
- **Per-call-site slots:** each provider/processor holds its own `LLMClient` drawn from the slot it owns, so a familiar can pin a cheap model (e.g. `openai/gpt-4o-mini`) on the cheap slots while still using a heavyweight model on `main_prose`. The process-wide rate-limit semaphore is shared across every slot.
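A hypothetical `character.toml` fragment makes the slot layout concrete. The table names are the ones listed above; the `model` key and the model ids are illustrative assumptions, not the project's actual schema.

```toml
[llm.main_prose]
model = "openai/gpt-4o"            # heavyweight model for the reply itself

[llm.interjection_decision]
model = "openai/gpt-4o-mini"       # cheap slot: quick yes/no decisions

[llm.memory_search]
model = "openai/gpt-4o-mini"       # cheap slot: retrieval queries
```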
### Context pipeline
Everything upstream of the OpenRouter call — character cards, system prompt assembly, memory retrieval, conversation history, and the cheap side calls each call site makes from its own `LLMClient` slot — is assembled by a single context pipeline that runs as a scoped `asyncio.TaskGroup` on every reply. The pipeline is the architectural backbone for all "AI behaviour knobs" in the bot.
See Context pipeline for the full design and step-by-step implementation history, and Memory for the on-disk memory directory the pipeline reads and writes.
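The fan-out shape is worth sketching. The snippet below is a hypothetical reduction, not the real `ContextPipeline` API: each provider runs as a task in a scoped `asyncio.TaskGroup` with its own deadline, and a miss becomes a recorded outcome rather than a failed turn (matching the provider rules under Resilience below). `ProviderError` stands in for whatever Protocol-declared failure type the real providers raise.

```python
import asyncio


class ProviderError(Exception):
    """Hypothetical Protocol-declared failure type for providers."""


async def _run(name: str, coro, deadline_s: float, outcomes: dict) -> None:
    try:
        async with asyncio.timeout(deadline_s):   # per-provider deadline
            outcomes[name] = ("ok", await coro)
    except TimeoutError:
        outcomes[name] = ("timeout", None)        # deadline miss, recorded
    except ProviderError:
        outcomes[name] = ("error", None)          # declared failure, recorded


async def gather_providers(providers: dict, deadline_s: float = 2.0) -> dict:
    """providers maps a name to an awaitable producing context text."""
    outcomes: dict = {}
    async with asyncio.TaskGroup() as tg:         # scoped to this one reply
        for name, coro in providers.items():
            tg.create_task(_run(name, coro, deadline_s, outcomes))
    return outcomes
```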
### Text-to-speech
Both providers are implemented; the active one is set via `[tts].provider` in `character.toml`. The default is `"azure"`.

**Azure Speech (default, `provider = "azure"`)**

- Default voice: `en-US-AmberNeural` (set `azure_voice` to change)
- Requires the `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` env vars
- Runs via the `azure-cognitiveservices-speech` SDK in a thread executor
- Outputs `Raw48Khz16BitMonoPcm` — no resampling needed for Discord
- Word-boundary events feed per-word timestamps for interruption detection (see the sketch below)

**Cartesia Sonic (`provider = "cartesia"`)**

- Purpose-built for real-time conversational AI; sub-100 ms TTFB
- Native WebSocket streaming
- Requires the `CARTESIA_API_KEY` env var; set `voice_id` + `model` in `[tts]`

Pipeline: LLM text → TTS (Azure SDK / Cartesia WebSocket) → PCM → encode to 48 kHz Opus → Discord voice playback.
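The Azure path reduces to a short sketch: the blocking SDK call lives in a thread executor, the output format is the Discord-ready `Raw48Khz16BitMonoPcm`, and the word-boundary event collects the per-word timestamps the interruption detector consumes. This is a sketch against the public `azure-cognitiveservices-speech` API, not the project's actual client.

```python
import asyncio
import os

import azure.cognitiveservices.speech as speechsdk


def _synthesize_blocking(text: str) -> tuple[bytes, list[tuple[str, int]]]:
    cfg = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"],
    )
    cfg.speech_synthesis_voice_name = "en-US-AmberNeural"
    cfg.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Raw48Khz16BitMonoPcm
    )
    # audio_config=None keeps the audio in memory instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=cfg, audio_config=None)

    words: list[tuple[str, int]] = []  # (word, audio offset in 100-ns ticks)
    synthesizer.synthesis_word_boundary.connect(
        lambda evt: words.append((evt.text, evt.audio_offset))
    )

    result = synthesizer.speak_text_async(text).get()
    return result.audio_data, words


async def synthesize(text: str) -> tuple[bytes, list[tuple[str, int]]]:
    # Keep the event loop free while the SDK blocks.
    return await asyncio.to_thread(_synthesize_blocking, text)
```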
### Voice interruption
When a user speaks during an active voice response, `InterruptionDetector` classifies the burst and dispatches one of five paths:
| State | Classification | Action |
|---|---|---|
| `GENERATING` | discarded | Drop. Generation continues. |
| `GENERATING` | short | Polite wait — delivery gate holds playback until user is quiet. Interrupter transcript flushed to history after original buffer. |
| `GENERATING` | long | Cancel `generation_task` immediately on boundary crossing. Re-generate with interruption context. |
| `SPEAKING` | short | Tolerance roll decides yield or push-through. Yield: `vc.stop()`, re-synthesize remaining words, resume. Push-through: audio continues; interrupter transcript written to history. |
| `SPEAKING` | long | Tolerance roll → yield → `vc.stop()`, write delivered portion to history, re-generate with delivered + interrupter transcript as context. |
`ResponseTracker` holds per-guild state (`IDLE` / `GENERATING` / `SPEAKING`), the cancellable LLM task, word timestamps for word-boundary splits, and scratch fields consumed by the post-playback dispatch logic.
See Voice interruption for the full design, configuration reference, and sequence diagrams.
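A hypothetical reduction of the tracker's shape — field names are illustrative, but the three states and the scratch fields mirror the dispatch table above:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto


class ResponseState(Enum):
    IDLE = auto()
    GENERATING = auto()
    SPEAKING = auto()


@dataclass
class ResponseTracker:
    state: ResponseState = ResponseState.IDLE
    generation_task: asyncio.Task | None = None        # cancellable LLM task
    word_timestamps: list[tuple[str, float]] = field(default_factory=list)
    interrupter_transcript: str = ""                   # scratch for dispatch
    delivered_words: int = 0                           # word-boundary split point
```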
### Twitch integration
Connects to the Twitch EventSub WebSocket as a task in the root `asyncio.TaskGroup`. See the Twitch guide for the event catalogue and slash command surface.
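Sketching the listener shape, assuming the third-party `websockets` package: connect to the public EventSub endpoint, read the `session_welcome` frame, then forward notification events into the shared text queue. Creating the actual subscriptions against the Helix API (using the session id) is left out.

```python
import asyncio
import json

import websockets

EVENTSUB_URL = "wss://eventsub.wss.twitch.tv/ws"


async def twitch_listener(text_q: asyncio.Queue[dict]) -> None:
    async with websockets.connect(EVENTSUB_URL) as ws:
        welcome = json.loads(await ws.recv())
        session_id = welcome["payload"]["session"]["id"]
        # ... register subscriptions with the Helix API using session_id ...
        async for raw in ws:
            msg = json.loads(raw)
            if msg["metadata"]["message_type"] == "notification":
                await text_q.put(msg["payload"]["event"])
```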
### Monitoring dashboard
Starlette + Hypercorn (asyncio-native web dashboard):

- Hypercorn runs on asyncio and mounts as a task in the bot's root `asyncio.TaskGroup`
- Routes:
    - `/health` — JSON status of each service (Discord, Twitch, transcription, TTS, LLM)
    - `/events` — recent event log via SSE or WebSocket
    - `/context` — per-turn, per-provider latency and token metrics from the context pipeline, so provider/processor enable/disable decisions can be made from real measurements
> **Dashboard not yet shipped.** The `PipelineOutput.outcomes` data is already captured per turn and `bot.py` logs a structured line per outcome; the web dashboard itself is a separate work item. See the Context pipeline page for the full list of deferred items.
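For orientation, the mounting pattern is the standard Starlette-on-Hypercorn one, sketched below with a stub `/health` route; the real route bodies are part of the deferred dashboard work.

```python
import asyncio

from hypercorn.asyncio import serve
from hypercorn.config import Config
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


async def health(request) -> JSONResponse:
    # Real statuses would come from the live service clients.
    return JSONResponse({"discord": "ok", "llm": "ok", "tts": "ok"})


app = Starlette(routes=[Route("/health", health)])


async def main() -> None:
    config = Config()
    config.bind = ["127.0.0.1:8080"]
    async with asyncio.TaskGroup() as tg:
        tg.create_task(serve(app, config))  # dashboard as one more sibling task
        # ... tg.create_task(...) for the bot's other components ...


asyncio.run(main())
```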
## Resilience
**Third-party service calls.** Service clients (`LLMClient`, `CartesiaTTSClient`, the Deepgram transcriber) raise on failure; callers decide the fallback. Every `except` clause on a service call enumerates its exception types — either directly (e.g. `(httpx.HTTPError, ValueError, KeyError)` for `LLMClient.chat`) or via a Protocol-declared type (e.g. `PreProcessorError` for pre-processors). No catch-all `except Exception:` is used in new code; contract violations surface loudly.
- **Main reply failure (`bot.py`).** The main-prose `LLMClient.chat` call in both the text and voice reply paths catches the closed raise set `(httpx.HTTPError, ValueError, KeyError)`, logs a warning, and returns silently. No apology text, no reaction, no history write for the failed turn — the user sees nothing and can simply retry. `LLMClient`'s own 120 s httpx timeout is the ceiling; no extra `asyncio.timeout` wrapper is added because the main reply is the one call for which a long wait is preferable to a fallback.
- **LLM retry policy.** `LLMClient` retries only on HTTP 429 with exponential backoff (honouring `Retry-After` when present, up to `_MAX_DELAY_S`); this is sketched after this list. Every other failure — transport error, non-2xx response, malformed payload — is the caller's responsibility.
- **TTS failure.** `CartesiaTTSClient.synthesize` raises on every non-2xx response and on any transport error. The `bot.py` call sites swallow TTS exceptions with `_logger.exception` so a missing voice clip never blocks a text reply. The client itself has no retry logic.
- **Context pipeline pre-processors.** The pre-processor loop in `ContextPipeline.assemble` catches the Protocol-declared `PreProcessorError` only, logs a warning, and passes the last successful request on to the next stage. Any other exception escaping `PreProcessor.process` is a contract violation and propagates out of the pipeline — this is intentional so bugs surface loudly rather than being silently hidden.
- **Context pipeline providers and post-processors.** Providers run under a scoped `asyncio.TaskGroup` with per-provider deadlines; misses are recorded as `"timeout"` / `"error"` outcomes. Post-processors are each wrapped in a pass-through `try`/`except` so a failing cleanup pass degrades to a no-op for just that stage.
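The retry policy reduces to a few lines. The sketch below mirrors the described behaviour — retry only on HTTP 429, honour `Retry-After`, cap at `_MAX_DELAY_S` — but the function name, the cap's value, and the loop structure are assumptions, not `LLMClient` internals.

```python
import asyncio

import httpx

_MAX_DELAY_S = 60.0  # cap value assumed for illustration


async def post_with_retry(
    client: httpx.AsyncClient, url: str, payload: dict
) -> httpx.Response:
    delay = 1.0
    while True:
        response = await client.post(url, json=payload)
        if response.status_code != 429:
            response.raise_for_status()  # non-2xx: caller's responsibility
            return response
        # Honour Retry-After when present (assumes the seconds form,
        # not the HTTP-date form), otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay
        await asyncio.sleep(min(wait, _MAX_DELAY_S))
        delay = min(delay * 2, _MAX_DELAY_S)
```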
## Persistence
- Raw transcripts of every conversation are stored verbatim in SQLite (`familiar_connect.history.store.HistoryStore`).
- The memory directory contains the distilled, human-readable form of everything the familiar "knows." It is the model's view of the world.
- Per-turn performance traces (latency, token counts, provider outcomes, A/B tags) are persisted in a separate SQLite DB — `data/familiars/<id>/metrics.db` — via `familiar_connect.metrics.SQLiteCollector`. See Metrics and profiling for the data model, the logging-vs-metrics boundary, and CLI usage.
- Derived artefacts — rolling summaries, future vector indices, tag caches — are rebuildable from the raw transcript store and the memory directory. Losing them is annoying but not destructive.
- Original imported character cards are kept verbatim alongside the unpacked `self/` files (`memory/self/.original.png`), so a future change to the unpacking logic can re-run against the originals.