Architecture overview

An AI familiar that joins Discord voice channels, listens, understands speech, and talks back using real AI voices.

Goals

  • Single runtime. The entire backend runs as one Python process using asyncio. No separate worker scripts, no external message broker.
  • Unified entry point. One familiar-connect run starts everything — Discord gateway, voice capture, transcription, LLM, TTS, and Twitch listener all run as concurrent tasks under a single asyncio.TaskGroup.
  • Local-first. The context layer makes no calls to third-party state stores. All context state lives in-process, in the filesystem next to the bot, or in the bot's own SQLite. The only network calls in the context layer are to the LLM endpoints we're already using for generation.
  • Single operator, one active familiar per process. Familiar-Connect is run by a single admin on their own machine — there is no multi-user / multi-tenant ambition. Multiple character folders may coexist under data/familiars/, but exactly one is active at a time. See Configuration model for the detailed ownership rules.

Target architecture

All components run as coroutines within a single asyncio event loop, scoped by asyncio.TaskGroup for structured concurrency and clean cancellation (Python 3.13+):

flowchart TB
    dv([Discord voice channel]) --> cap[Audio capture]
    cap --> aq[(asyncio.Queue&#58; PCM)]
    aq --> stt[Transcription<br/>Deepgram streaming]
    stt --> tq[(asyncio.Queue&#58; text)]
    tw([Twitch EventSub]) --> tq
    tq --> cm[ConversationMonitor<br/>chattiness &amp; interjection]
    cm --> cp[Context pipeline]
    cp --> or[OpenRouter<br/>streaming completion]
    or --> tts[TTS<br/>Cartesia / Azure]
    tts --> oq[(asyncio.Queue&#58; audio)]
    oq --> dvp([Discord voice playback])

    classDef queue fill:#eee,stroke:#888,stroke-dasharray:3 3;
    class aq,tq,oq queue;

Every box runs under one root asyncio.TaskGroup — a crash anywhere cancels the whole reply path.
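
A minimal sketch of that single-process wiring, with illustrative stage and queue names rather than the real familiar_connect module layout:

    import asyncio

    async def stage(name: str, q: asyncio.Queue | None = None) -> None:
        """Placeholder for a real pipeline stage (capture, STT, LLM, TTS, ...)."""
        while True:
            await asyncio.sleep(1)  # real stages await queue items or network I/O

    async def main() -> None:
        # Queues are the only coupling between stages.
        pcm_q: asyncio.Queue[bytes] = asyncio.Queue()
        text_q: asyncio.Queue[str] = asyncio.Queue()
        audio_q: asyncio.Queue[bytes] = asyncio.Queue()

        # One root TaskGroup: if any stage raises, every sibling task is
        # cancelled with it and the process exits -- no orphaned workers,
        # no external broker.
        async with asyncio.TaskGroup() as tg:
            tg.create_task(stage("discord-gateway", pcm_q))
            tg.create_task(stage("transcription", pcm_q))
            tg.create_task(stage("conversation-monitor", text_q))
            tg.create_task(stage("twitch-listener", text_q))
            tg.create_task(stage("voice-playback", audio_q))

    asyncio.run(main())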

Development uses red/green TDD throughout.

External services

The runtime talks to six outside services (Cartesia and Azure fill the same TTS role). Two are required; the rest are optional and the bot degrades gracefully without them.

flowchart LR
    subgraph required [Required]
        direction TB
        discord[Discord Gateway]
        openrouter[OpenRouter LLM]
    end

    bot([familiar-connect<br/>runtime])

    subgraph optional [Optional]
        direction TB
        cartesia[Cartesia TTS]
        azure[Azure Speech]
        deepgram[Deepgram STT]
        twitch[Twitch EventSub]
    end

    discord <--> bot
    openrouter <--> bot
    bot --> cartesia
    bot --> azure
    deepgram --> bot
    twitch --> bot

  • Discord Gateway (required) — DISCORD_BOT. The bot has nothing to listen to or speak into without it.
  • OpenRouter (required) — OPENROUTER_API_KEY. The reply generation call. Model selectable per-familiar.
  • Cartesia / Azure Speech (optional) — TTS providers. Without either, the bot still replies in text channels it is subscribed to.
  • Deepgram (optional) — streaming STT for voice input. See Voice input for the wiring.
  • Twitch EventSub (optional) — only needed if a familiar uses the Twitch commentary features in the Twitch guide.

See Installation for the exact env-var names, the minimal "just text replies" configuration, and how to turn each optional service on.

Core components

Discord bot

Built with py-cord. Voice send/receive uses davey to handle Discord's DAVE (Audio/Video E2E Encryption) protocol. The subscription surface and channel-mode slash commands are documented in Slash commands.

Transcription

Primary: Deepgram (Nova-2, streaming)

  • Native WebSocket streaming API, ~300ms latency, strong accuracy
  • Handles raw PCM streams directly — maps well to Discord's audio pipeline
  • Good Python SDK (deepgram-sdk)

Fallback: faster-whisper (local)

  • Zero cost, no rate limits, no external dependency
  • Requires a GPU for real-time performance
  • Good offline / privacy-preserving option

Pipeline: Discord 48kHz Opus → decode to PCM → resample to 16kHz → stream to Deepgram WebSocket (or feed chunks to faster-whisper).
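
A rough sketch of that resample step, assuming the Opus frames have already been decoded to 48 kHz 16-bit stereo PCM; the naive decimate-by-3 below stands in for a proper polyphase resampler (e.g. scipy.signal.resample_poly):

    import numpy as np

    def to_stt_pcm(pcm_48k_stereo: bytes) -> bytes:
        """Convert 48 kHz 16-bit stereo PCM (Discord) to 16 kHz mono for STT.

        Naive version: average the two channels, then keep every 3rd sample
        (48000 / 16000 == 3). A production path should low-pass filter before
        decimating.
        """
        samples = np.frombuffer(pcm_48k_stereo, dtype=np.int16).reshape(-1, 2)
        mono = samples.mean(axis=1).astype(np.int16)  # stereo -> mono
        return mono[::3].tobytes()                    # 48 kHz -> 16 kHz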

Incoming voice audio is fed into the ConversationMonitor via VoiceLullMonitor debouncing — see Voice input for the full wiring.

AI response (OpenRouter)

The LLM call is the core of the bot's reply path. Its inputs — system prompt, retrieved knowledge, conversation history, per-user notes — are assembled by the Context pipeline, not inline in the bot loop. The LLM client (familiar_connect.llm) only speaks to OpenRouter; it is deliberately unaware of where its messages came from so the pipeline can be tested and extended in isolation.

  • Provider: OpenRouter. Model selection is per-call-site and per familiar — set individually in character.toml under [llm.main_prose], [llm.post_process_style], [llm.reasoning_context], [llm.history_summary], [llm.memory_search], [llm.interjection_decision], and [llm.mood_eval].
  • Streaming: Responses are streamed so the TTS path can start speaking before the full reply arrives.
  • Per-call-site slots: Each provider/processor holds its own LLMClient drawn from the slot it owns, so a familiar can pin a cheap model (e.g. openai/gpt-4o-mini) on the cheap slots while still using a heavyweight model on main_prose. The process-wide rate-limit semaphore is shared across every slot.
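
A sketch of what the slot model implies; the dataclass, the model key, and the semaphore size are illustrative, not the real familiar_connect.llm API:

    import asyncio
    import tomllib
    from dataclasses import dataclass

    # One process-wide semaphore shared by every slot, so a burst of cheap
    # side calls cannot starve main_prose of rate-limit headroom.
    RATE_LIMIT = asyncio.Semaphore(4)

    @dataclass
    class LLMSlot:
        name: str   # e.g. "main_prose", "memory_search"
        model: str  # e.g. "openai/gpt-4o-mini"

    def load_slots(path: str) -> dict[str, LLMSlot]:
        """Read the [llm.*] tables from character.toml into slot objects."""
        with open(path, "rb") as f:
            llm_cfg = tomllib.load(f)["llm"]
        return {name: LLMSlot(name=name, model=table["model"])
                for name, table in llm_cfg.items()}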

Context pipeline

Everything upstream of the OpenRouter call — character cards, system prompt assembly, memory retrieval, conversation history, and the cheap side calls each call site makes from its own LLMClient slot — is assembled by a single context pipeline that runs as a scoped asyncio.TaskGroup on every reply. The pipeline is the architectural backbone for all "AI behaviour knobs" in the bot.

See Context pipeline for the full design and step-by-step implementation history, and Memory for the on-disk memory directory the pipeline reads and writes.
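
The fan-out has roughly this shape; the deadline value, the ProviderError type, and the callable signature are illustrative stand-ins for the real pipeline types:

    import asyncio
    from collections.abc import Awaitable, Callable

    class ProviderError(Exception):
        """Stand-in for a Protocol-declared provider error type."""

    async def gather_providers(
        providers: dict[str, Callable[[], Awaitable[str]]],
        deadline_s: float = 1.5,
    ) -> dict[str, str]:
        """Run context providers concurrently; a slow or failing provider is
        recorded as an outcome instead of failing the whole reply."""
        outcomes: dict[str, str] = {}

        async def run_one(name: str, fetch: Callable[[], Awaitable[str]]) -> None:
            try:
                async with asyncio.timeout(deadline_s):  # per-provider deadline
                    outcomes[name] = await fetch()
            except TimeoutError:
                outcomes[name] = "timeout"
            except ProviderError:
                outcomes[name] = "error"
            # Anything else is a contract violation and escapes the TaskGroup.

        async with asyncio.TaskGroup() as tg:  # scoped per reply
            for name, fetch in providers.items():
                tg.create_task(run_one(name, fetch))
        return outcomes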

Text-to-speech

Both providers are implemented; the active one is set via [tts].provider in character.toml. Default is "azure".

Azure Speech (default, provider = "azure")

  • Default voice: en-US-AmberNeural (set azure_voice to change)
  • Requires AZURE_SPEECH_KEY + AZURE_SPEECH_REGION env vars
  • Runs via the azure-cognitiveservices-speech SDK in a thread executor
  • Outputs Raw48Khz16BitMonoPcm — no resampling needed for Discord
  • Word-boundary events feed per-word timestamps for interruption detection

Cartesia Sonic (provider = "cartesia")

  • Purpose-built for real-time conversational AI; sub-100ms TTFB
  • Native WebSocket streaming
  • Requires CARTESIA_API_KEY env var; set voice_id + model in [tts]

Pipeline: LLM text → TTS (Azure SDK / Cartesia WebSocket) → PCM → resample to 48kHz Opus → Discord voice playback.
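
For the Azure leg, the thread-executor call and the per-word timestamps could look roughly like this sketch (the real client class, error handling, and the Opus encode are omitted):

    import asyncio
    import os
    import azure.cognitiveservices.speech as speechsdk

    def _synthesize_blocking(text: str) -> tuple[bytes, list[tuple[str, float]]]:
        """Blocking Azure SDK call; run via asyncio.to_thread to keep the loop free."""
        cfg = speechsdk.SpeechConfig(
            subscription=os.environ["AZURE_SPEECH_KEY"],
            region=os.environ["AZURE_SPEECH_REGION"],
        )
        cfg.speech_synthesis_voice_name = "en-US-AmberNeural"
        cfg.set_speech_synthesis_output_format(
            speechsdk.SpeechSynthesisOutputFormat.Raw48Khz16BitMonoPcm
        )
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=cfg, audio_config=None)

        words: list[tuple[str, float]] = []
        # audio_offset arrives in 100-ns ticks; convert to seconds so the
        # interruption detector gets per-word timestamps.
        synthesizer.synthesis_word_boundary.connect(
            lambda evt: words.append((evt.text, evt.audio_offset / 10_000_000))
        )
        result = synthesizer.speak_text_async(text).get()
        return result.audio_data, words

    async def synthesize(text: str) -> tuple[bytes, list[tuple[str, float]]]:
        return await asyncio.to_thread(_synthesize_blocking, text)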

Voice interruption

When a user speaks during an active voice response, InterruptionDetector classifies the burst and dispatches one of five paths:

State | Classification | Action
GENERATING | discarded | Drop. Generation continues.
GENERATING | short | Polite wait — delivery gate holds playback until user is quiet. Interrupter transcript flushed to history after original buffer.
GENERATING | long | Cancel generation_task immediately on boundary crossing. Re-generate with interruption context.
SPEAKING | short | Tolerance roll decides yield or push-through. Yield: vc.stop(), re-synthesize remaining words, resume. Push-through: audio continues; interrupter transcript written to history.
SPEAKING | long | Tolerance roll → yield → vc.stop(), write delivered portion to history, re-generate with delivered + interrupter transcript as context.

ResponseTracker holds per-guild state (IDLE / GENERATING / SPEAKING), the cancellable LLM task, word timestamps for word-boundary splits, and scratch fields consumed by the post-playback dispatch logic.
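
An illustrative shape for that per-guild record; the field names below are guesses from the description above, not the real ResponseTracker attributes:

    import asyncio
    import enum
    from dataclasses import dataclass, field

    class ResponseState(enum.Enum):
        IDLE = "idle"
        GENERATING = "generating"
        SPEAKING = "speaking"

    @dataclass
    class GuildResponse:
        state: ResponseState = ResponseState.IDLE
        generation_task: asyncio.Task | None = None      # cancelled on a long interruption
        word_timestamps: list[tuple[str, float]] = field(default_factory=list)
        delivered_words: int = 0                         # scratch: how much was actually spoken
        interrupter_transcript: str = ""                 # scratch: flushed to history post-playback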

See Voice interruption for the full design, configuration reference, and sequence diagrams.

Twitch integration

Connects to Twitch EventSub WebSocket as a task in the root asyncio.TaskGroup. See the Twitch guide for the event catalogue and slash command surface.

Monitoring dashboard

Starlette + Hypercorn (asyncio-native web dashboard):

  • Hypercorn runs on asyncio and mounts as a task in the bot's root asyncio.TaskGroup
  • Routes:
    • /health — JSON status of each service (Discord, Twitch, transcription, TTS, LLM)
    • /events — Recent event log via SSE or WebSocket
    • /context — Per-turn, per-provider latency and token metrics from the context pipeline, so provider/processor enable/disable decisions can be made from real measurements

Dashboard not yet shipped

The PipelineOutput.outcomes data is already captured per turn and bot.py logs a structured line per outcome; the web dashboard itself is a separate work item. See the Context pipeline page for the full list of deferred items.
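
When the dashboard does land, mounting it inside the root asyncio.TaskGroup could look roughly like this sketch; the routes, port, and names are illustrative:

    from hypercorn.asyncio import serve
    from hypercorn.config import Config
    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Route

    async def health(request):
        # The real handler would report per-service status
        # (Discord, Twitch, transcription, TTS, LLM).
        return JSONResponse({"status": "ok"})

    app = Starlette(routes=[Route("/health", health)])

    async def run_dashboard() -> None:
        config = Config()
        config.bind = ["127.0.0.1:8080"]
        await serve(app, config)  # awaited as a task in the root TaskGroup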

Resilience

Third-party service calls. Service clients (LLMClient, CartesiaTTSClient, the Deepgram transcriber) raise on failure; callers decide the fallback. Every except clause on a service call enumerates its exception types — either directly (e.g. (httpx.HTTPError, ValueError, KeyError) for LLMClient.chat) or via a Protocol-declared type (e.g. PreProcessorError for pre-processors). No catch-all except Exception: is used in new code; contract violations surface loudly.

  • Main reply failure (bot.py). The main-prose LLMClient.chat call in both the text and voice reply paths catches the closed raise set (httpx.HTTPError, ValueError, KeyError), logs a warning, and returns silently. No apology text, no reaction, no history write for the failed turn — the user sees nothing and can simply retry. LLMClient's own 120 s httpx timeout is the ceiling; no extra asyncio.timeout wrapper is added because the main reply is the one call for which a long wait is preferable to a fallback.
  • LLM retry policy. LLMClient retries only on HTTP 429 with exponential backoff (honouring Retry-After when present, up to _MAX_DELAY_S). Every other failure — transport error, non-2xx response, malformed payload — is the caller's responsibility. A sketch of this retry loop follows the list.
  • TTS failure. CartesiaTTSClient.synthesize raises on every non-2xx response and on any transport error. The bot.py call sites swallow TTS exceptions with _logger.exception so a missing voice clip never blocks a text reply. The client itself has no retry logic.
  • Context pipeline pre-processors. The pre-processor loop in ContextPipeline.assemble catches the Protocol-declared PreProcessorError only, logs a warning, and passes the last successful request on to the next stage. Any other exception escaping PreProcessor.process is a contract violation and propagates out of the pipeline — this is intentional so bugs surface loudly rather than being silently hidden.
  • Context pipeline providers and post-processors. Providers run under a scoped asyncio.TaskGroup with per-provider deadlines; misses are recorded as "timeout" / "error" outcomes. Post-processors are each wrapped in a pass-through try/except so a failing cleanup pass degrades to a no-op for just that stage.
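
A sketch of the 429-only retry described in the LLM retry policy bullet above; the helper name, attempt cap, and _MAX_DELAY_S value are illustrative:

    import asyncio
    import random
    import httpx

    _MAX_DELAY_S = 30.0  # backoff ceiling (value illustrative)

    async def post_with_429_retry(
        client: httpx.AsyncClient, url: str, payload: dict, max_attempts: int = 5
    ) -> httpx.Response:
        """Retry only on HTTP 429; every other failure propagates to the caller."""
        for attempt in range(max_attempts):
            response = await client.post(url, json=payload)
            if response.status_code != 429:
                response.raise_for_status()  # non-2xx other than 429 is the caller's problem
                return response
            # Honour Retry-After when present (assumed to be seconds),
            # otherwise back off exponentially with jitter.
            retry_after = response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** attempt + random.random()
            await asyncio.sleep(min(delay, _MAX_DELAY_S))
        response.raise_for_status()  # still 429 after the final attempt
        return response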

Persistence

  • Raw transcripts of every conversation are stored verbatim in SQLite (familiar_connect.history.store.HistoryStore).
  • The memory directory contains the distilled, human-readable form of everything the familiar "knows." It is the model's view of the world.
  • Per-turn performance traces (latency, token counts, provider outcomes, A/B tags) are persisted in a separate SQLite DB — data/familiars/<id>/metrics.db — via familiar_connect.metrics.SQLiteCollector. See Metrics and profiling for the data model, the logging-vs-metrics boundary, and CLI usage.
  • Derived artefacts — rolling summaries, future vector indices, tag caches — are rebuildable from the raw transcript store and the memory directory. Losing them is annoying but not destructive.
  • Original imported character cards are kept verbatim alongside the unpacked self/ files (memory/self/.original.png), so a future change to the unpacking logic can re-run against the originals.
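
Putting those pieces together, the per-familiar data directory looks roughly like the layout below; the placement of character.toml and the history filename are assumptions, the other paths are the ones named above:

    data/familiars/<id>/
        character.toml       # per-familiar config ([llm.*], [tts], ...) -- placement assumed
        metrics.db           # per-turn traces (familiar_connect.metrics.SQLiteCollector)
        history.db           # raw verbatim transcripts (HistoryStore) -- filename illustrative
        memory/
            self/
                .original.png    # imported character card kept verbatim
            ...                  # distilled, human-readable knowledge the pipeline reads and writes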