Bug: YouTube source_url timestamps exceed video duration #205

@juananpe

Summary

Generated YouTube links in assistant responses sometimes include t= timestamps that exceed the video duration (invalid timestamps). Example: for https://www.youtube.com/watch?v=xAgTmSPVDGs (duration 201s) we observed a generated link ...&t=270s.

Steps to reproduce

  1. Query the assistant with a question that triggers RAG results from YouTube transcripts (e.g., "Quiero saber si puedo crear cuestionarios con Moodle", i.e. "I want to know whether I can create quizzes with Moodle").
  2. Inspect the assistant response for YouTube links with t= parameters.
  3. Observe timestamps that are greater than the actual video duration.
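Step 3 can be checked mechanically. The following is a minimal sketch, not code from the repository: the helper names are hypothetical, it only handles plain-seconds t= values on youtube.com/watch links, and the duration table is assumed to be supplied by hand (here with the 201s value from the summary).

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches youtube.com/watch links only; youtu.be short links are out of scope here.
YOUTUBE_LINK = re.compile(r"https?://(?:www\.)?youtube\.com/watch\?[^\s)>\]]+")


def parse_seconds(value):
    """Parse a plain-seconds t= value such as '270' or '270s'; return None otherwise."""
    match = re.fullmatch(r"(\d+)s?", value)
    return int(match.group(1)) if match else None


def find_out_of_range_links(text, durations):
    """Return (url, t, duration) for each link whose t= exceeds the known duration.

    `durations` maps video id -> duration in seconds (hypothetical input,
    provided by the caller; this sketch does not fetch it).
    """
    problems = []
    for raw in YOUTUBE_LINK.findall(text):
        url = raw.rstrip(".,;")  # drop sentence punctuation glued to the link
        query = parse_qs(urlparse(url).query)
        video_id = query.get("v", [None])[0]
        t = parse_seconds(query.get("t", [""])[0])
        duration = durations.get(video_id)
        if t is not None and duration is not None and t > duration:
            problems.append((url, t, duration))
    return problems


response = "See https://www.youtube.com/watch?v=xAgTmSPVDGs&t=270s for details."
print(find_out_of_range_links(response, {"xAgTmSPVDGs": 201}))
```

Run against a saved assistant response, this lists every link whose offset falls past the end of its video; links to videos missing from the duration table are skipped rather than flagged.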

Observed behavior

  • The source_url present in some stored RAG metadata points to YouTube with &t=<start_seconds>.
  • Some generated links contain timestamps beyond the video duration (e.g., t=270s for a 201s video).

Expected behavior

  • Generated YouTube links must reference valid timestamps within the video duration.
  • Timestamps should be derived from chunk start_time/end_time and validated against the actual video duration.

Evidence / Notes

  • ChromaDB chunks for the video xAgTmSPVDGs include correct start_time/end_time values (e.g., 4.309-65.59, 65.59-126.749, 126.749-186.789).
  • The ingest code currently sets source_url as "source_url": f"{url}&t={int(c['start'])}" (see lamb-kb-server-stable/backend/plugins/youtube_transcript_ingest.py). That code does not validate the offset against the video duration.
  • The RAG code that formats sources (backend/lamb/completions/rag/simple_rag.py and context_aware_rag.py) does not currently surface source_url in some contexts (it uses file_url / original_file_url).
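To make the evidence concrete, the sketch below applies the ingest expression quoted above to the three chunk ranges listed for xAgTmSPVDGs. It assumes those three ranges are representative of what is stored for this video; the dictionary layout is illustrative, not the actual ChromaDB schema.

```python
# Chunk boundaries quoted above for video xAgTmSPVDGs, in seconds.
chunks = [
    {"start": 4.309, "end": 65.59},
    {"start": 65.59, "end": 126.749},
    {"start": 126.749, "end": 186.789},
]
url = "https://www.youtube.com/watch?v=xAgTmSPVDGs"
duration = 201  # seconds, as reported in the summary

# Same expression the ingest plugin is reported to use for source_url.
source_urls = [f"{url}&t={int(c['start'])}" for c in chunks]
offsets = [int(c["start"]) for c in chunks]

for link in source_urls:
    print(link)
print(offsets)                               # [4, 65, 126]
print(all(t <= duration for t in offsets))   # True
```

For these chunks the expression yields offsets 4, 65 and 126, all inside the 201s video, so the observed t=270s does not match any of them.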

Suspected causes

  1. Timestamps may be computed incorrectly somewhere downstream of ingestion (for example, by summing chunk times or reading the wrong metadata field).
  2. The ingest plugin does not validate start against the video's duration (it never fetches the total duration), and the LLM or the formatting code may be altering timestamps.

Files to inspect

  • lamb-kb-server-stable/backend/plugins/youtube_transcript_ingest.py (chunking and source_url generation)
  • backend/lamb/completions/rag/simple_rag.py and context_aware_rag.py (how sources/metadata are formatted)
  • backend/lamb/completions/pps/simple_augment.py (how rag_context is injected into prompts)

Suggested fix ideas

  • Ensure youtube_transcript_ingest.py fetches and stores video duration, and validate start/end when generating source_url.
  • Ensure RAG formatting surfaces source_url (not just file_url) and avoid any transformations that can produce invalid timestamps.
  • Add unit/integration tests reproducing the issue (e.g., ingest a known video and validate generated links do not exceed duration).
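The first and third ideas could look roughly like this. It is a sketch under stated assumptions: build_source_url is a hypothetical helper, not an existing function in youtube_transcript_ingest.py, and the duration is passed in as a plain number because how the plugin would obtain it is not covered here.

```python
def build_source_url(url, start, duration=None):
    """Hypothetical helper: append a t= offset, clamped into the video when
    the duration is known. With duration=None it behaves like the current
    f"{url}&t={int(start)}" expression."""
    t = max(0, int(start))
    if duration is not None:
        # Clamp to the last whole second so the link never points past the end.
        t = min(t, max(0, int(duration) - 1))
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}t={t}"


def test_offsets_stay_within_duration():
    """Test sketch using the video and chunk starts quoted in this issue."""
    url = "https://www.youtube.com/watch?v=xAgTmSPVDGs"
    duration = 201
    for start in (4.309, 65.59, 126.749, 270):
        link = build_source_url(url, start, duration)
        offset = int(link.rsplit("t=", 1)[1])
        assert 0 <= offset < duration


test_offsets_stay_within_duration()
print(build_source_url("https://www.youtube.com/watch?v=xAgTmSPVDGs", 270, 201))
```

Clamping is only one option. Dropping the t= parameter, or rejecting the chunk at ingest time, may be preferable if an out-of-range start indicates corrupted metadata rather than a rounding problem.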

Severity

  • Medium: content is correct but links are broken, harming user experience and credibility.

Please add any additional reproduction steps or assign the issue to a reviewer.

Labels: bug (Something isn't working), help wanted (Extra attention is needed)
