Stage 3: Extractor¶
For each planned claim, the extractor locates the verbatim span in the raw transcript and produces a proposal with full metadata. No paraphrasing — extraction only.
Test this stage in isolation with
auto-lorebook seed-ingest --at=plan, thenreplan <sid>(which runs the planner then this extractor). See QA seeding.
Purpose¶
Narrow by design: locate and snip. Cleanup happens upstream (reading- stage corrections) and downstream (summarizer prose). Narrowness buys the mechanical substring guarantee, which a paraphrasing extractor can't provide.
Input¶
- Approved plan, including per-claim
locator_hintranges. - Approved reading, including its
name_correctionsmap. - Raw transcript, accessed per-proposal via the hint window — not fed whole.
- Reduced preamble — transcription corrections and entity aliases only. See context pipeline.
Output¶
One YAML file per proposed fact at
pending/<ingest_id>/proposals/<proposal_id>.yaml:
schema_version: 1
proposal_type: new_fact # new_fact | new_entity_with_facts
target_entity: Aldara
proposed_id: aldara-f004
claim_group_id: cg-001 # shared with siblings routing the same claim elsewhere
claim_group_siblings: # other targets of the same claim, informational
- entity: Theron
proposed_id: theron-f011
- entity: Second Age
proposed_id: second-age-f002
text: "Theron's grandfather founded Aldara in the Second Age."
raw_transcript_span: "Fair-on's grandfather founded all-dara in the Second Age."
text_corrects_transcript: true
corrections_applied:
- from: "Fair-on"
to: "Theron"
source: global-transcription-correction
- from: "all-dara"
to: "Aldara"
source: reading-name-correction
source_id: yt-abc123
locator: "0:04:32-0:04:41"
speaker: DM
status: authoritative
status_reason: null
session_date: 2026-01-15
section: founding
reading_section: "[4:30-8:00] Founding of Aldara"
reading_bullet_index: 0
context_before: "So let's talk about the founding of Aldara."
context_after: "And that's why the Theron name matters so much now."
Extraction rules¶
raw_transcript_spanmust be a literal substring of the transcript between the given timestamps. The tool verifies this after the LLM responds; if verification fails, retry or flag.textdiffers fromraw_transcript_spanonly through appliedcorrections_applied. No rewriting, cleanup, or filler removal at this stage.- Windowed search. Each proposal's prompt is fed only the
transcript slice covering its
locator_hintwindow, not the whole transcript or the whole segment. This is the primary lever keeping Stage 3 prompts small and uniform in size across proposals. - Fallback on miss. If substring verification fails within the
hint window, the extractor retries once with the span widened to
the full parent segment from 1a. If that also fails, flag with
extractor_flagged: true. A "widened to segment" retry is logged on the proposal (hint_widened: true) so systematic anchor drift in 1b is visible. - Hints are advisory; the authoritative locator is produced here.
The final
locatoron the proposal is the precise range where the span actually lands in the transcript, derived during extraction — not copied fromlocator_hint. The hint only narrows the search space. - If the claim cannot be found in a single contiguous span, flag with
extractor_flagged: trueand an explanation. Do not synthesize across non-adjacent spans. - Parallelizable: each proposal is independent.
Claim-group deduplication¶
Sibling proposals within a claim_group_id share the same
raw_transcript_span, locator, text, and corrections_applied.
The extractor runs once per claim group — not once per proposal — then
copies the result to each sibling.
Each sibling still gets its own proposal file, its own proposed_id,
and its own target_entity and section; only the span-location
fields are shared. If extraction fails for a claim group, all sibling
proposals inherit the same extractor_flagged state.
Next stage: human fact review.