Closed
Labels
bug (Something isn't working), help wanted (Extra attention is needed)
Description
Summary
Generated YouTube links in assistant responses sometimes include t= timestamps that exceed the video duration (invalid timestamps). Example: for https://www.youtube.com/watch?v=xAgTmSPVDGs (duration 201s) we observed a generated link ...&t=270s.
Steps to reproduce
- Query the assistant with a question that triggers RAG results from YouTube transcripts (e.g., "Quiero saber si puedo crear cuestionarios con Moodle", i.e. "I want to know whether I can create quizzes with Moodle").
- Inspect the assistant response for YouTube links with `t=` parameters.
- Observe timestamps that are greater than the actual video duration.
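The last step can be checked mechanically. The following is a minimal sketch, not code from the repository: the helper names `timestamp_seconds` and `exceeds_duration` are hypothetical, and the video duration is assumed to be known to the caller (here, the 201s reported in the summary).

```python
import re
from urllib.parse import urlparse, parse_qs


def timestamp_seconds(link):
    """Return the t= offset of a YouTube link in seconds, or None if absent.

    Only plain-second forms such as t=270 or t=270s are handled here.
    """
    values = parse_qs(urlparse(link).query).get("t")
    if not values:
        return None
    match = re.fullmatch(r"(\d+)s?", values[0])
    return int(match.group(1)) if match else None


def exceeds_duration(link, duration_seconds):
    """True if the link carries a timestamp beyond the given duration."""
    t = timestamp_seconds(link)
    return t is not None and t > duration_seconds


link = "https://www.youtube.com/watch?v=xAgTmSPVDGs&t=270s"
print(exceeds_duration(link, 201))
```

A check like this could be run over the links in a captured assistant response to confirm the reproduction.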
Observed behavior
- The `source_url` present in some stored RAG metadata points to YouTube with `&t=<start_seconds>`.
- Some generated links contain timestamps beyond the video duration (e.g., `t=270s` for a 201s video).
Expected behavior
- Generated YouTube links must reference valid timestamps within the video duration.
- Timestamps should be derived from chunk `start_time`/`end_time` and validated against the actual video duration.
Evidence / Notes
- ChromaDB chunks for the video `xAgTmSPVDGs` include correct `start_time`/`end_time` values (e.g., `4.309-65.59`, `65.59-126.749`, `126.749-186.789`).
- The ingest code currently sets `source_url` as: `"source_url": f"{url}&t={int(c['start'])}",` (see `lamb-kb-server-stable/backend/plugins/youtube_transcript_ingest.py`). That code does not validate against video duration.
- The RAG code that formats sources (`backend/lamb/completions/rag/simple_rag.py` and `context_aware_rag.py`) does not currently surface `source_url` in some contexts (it uses `file_url`/`original_file_url`).
Suspected causes
- Timestamps may be incorrectly computed somewhere downstream (summing chunk times or using wrong fields).
- The ingest plugin doesn't validate `start` against the video's duration (no call to fetch total duration), and LLM or formatting code may be altering timestamps.
Files to inspect
- `lamb-kb-server-stable/backend/plugins/youtube_transcript_ingest.py` (chunking and `source_url` generation)
- `backend/lamb/completions/rag/simple_rag.py` and `context_aware_rag.py` (how sources/metadata are formatted)
- `backend/lamb/completions/pps/simple_augment.py` (how `rag_context` is injected into prompts)
Suggested fix ideas
- Ensure `youtube_transcript_ingest.py` fetches and stores video duration, and validates `start`/`end` when generating `source_url`.
- Ensure RAG formatting surfaces `source_url` (not just `file_url`) and avoids any transformations that can produce invalid timestamps.
- Add unit/integration tests reproducing the issue (e.g., ingest a known video and validate generated links do not exceed duration).
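As a rough illustration of the first idea, the sketch below validates the chunk start before appending it to the URL. The function `build_source_url` and its `duration_seconds` parameter are hypothetical, not existing code in the plugin. How the duration is fetched is left open, and dropping the timestamp when it is out of range is one possible policy among several (clamping would be another).

```python
def build_source_url(url, start_seconds, duration_seconds=None):
    """Append t=<start> to a YouTube URL only if it lies within the video.

    duration_seconds is assumed to be supplied by the caller; if it is
    unknown (None), the timestamp is emitted without validation.
    """
    start = max(0, int(start_seconds))
    if duration_seconds is not None and start >= int(duration_seconds):
        # Out of range: return the plain URL rather than a broken deep link.
        return url
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}t={start}"


base = "https://www.youtube.com/watch?v=xAgTmSPVDGs"
print(build_source_url(base, 126.749, 201))  # timestamp kept
print(build_source_url(base, 270, 201))      # timestamp dropped
```

The same function can back the suggested unit test: feed it the chunk start values stored for a known video and assert that no returned link carries a `t=` value at or beyond the duration.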
Severity
- Medium: content is correct but links are broken, harming user experience and credibility.
Please add any additional reproduction steps or assign the issue to a reviewer.