Need help implementing scalable youtube transcript grabbing feature #628
Alexandre-Chapelle
started this conversation in
General
Replies: 0 comments
Hey!
For the past few days I've been thinking about how to implement the missing YouTube transcript-grabbing functionality. It seems quite easy at first, but the devil is in the details.
IMPORTANT: Yes, we could just do everything in memory and not save anything to the DB, or find other workarounds, but I would like to implement a scalable and maintainable solution.
Current Challenges & Key Points
My initial approach
4.1. Check if the video duration is < 1 hour.
4.2. Check if this video already exists in the main database; if not, proceed to the following steps:
4.3. Create a record in the main database (pk id, videoId, duration, createdAt, updatedAt).
4.4. Fetch the transcript in English.
4.5. Split the transcript into chunks using RecursiveCharacterTextSplitter (~chunkSize = 10, separators: ['\n\n', '\n', ' ', '']).
4.6. Split the transcript into semantic units (200 words each; if fewer than 200 words remain, use what's left).
4.7. Generate embeddings for each semantic unit.
4.8. Store the embeddings, semantic units, and video metadata in a vector DB (let's say Milvus).
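To make step 4.1 concrete, here's a small sketch of the duration gate, assuming the duration arrives as an ISO 8601 string like `PT1H2M10S` (the format the YouTube Data API returns in `contentDetails.duration`). The function names and the one-hour cutoff parameter are my own, not anything from the codebase:

```python
import re

# Matches ISO 8601 durations of the form PT#H#M#S; each component is optional.
_DURATION_RE = re.compile(r"PT(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?")

def duration_seconds(iso: str) -> int:
    """Convert an ISO 8601 duration (e.g. 'PT1H2M10S') to total seconds."""
    m = _DURATION_RE.fullmatch(iso)
    if m is None:
        raise ValueError(f"unrecognized duration: {iso!r}")
    h, mi, s = (int(g) if g else 0 for g in m.group("h", "m", "s"))
    return h * 3600 + mi * 60 + s

def is_transcribable(iso: str, max_seconds: int = 3600) -> bool:
    """Step 4.1: only videos strictly under one hour pass the gate."""
    return duration_seconds(iso) < max_seconds
```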
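For steps 4.2–4.3, the existence check and record creation can be collapsed into one idempotent upsert-style insert, so two workers racing on the same video can't create duplicates. This is a sketch only: sqlite3 stands in for the main database, and the table/column names mirror my proposed schema from step 4.3, not anything that exists in the project today:

```python
import sqlite3
from datetime import datetime, timezone

def ensure_video(conn: sqlite3.Connection, video_id: str, duration: int) -> bool:
    """Create a record for video_id unless one already exists.

    Returns True if a new record was created (i.e. the pipeline should
    proceed with steps 4.4+), False if the video was already known.
    """
    # Hypothetical schema from step 4.3; UNIQUE on videoId enforces the
    # "already exists" check at the database level.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS videos (
               id INTEGER PRIMARY KEY,
               videoId TEXT UNIQUE NOT NULL,
               duration INTEGER NOT NULL,
               createdAt TEXT NOT NULL,
               updatedAt TEXT NOT NULL
           )"""
    )
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT OR IGNORE INTO videos (videoId, duration, createdAt, updatedAt) "
        "VALUES (?, ?, ?, ?)",
        (video_id, duration, now, now),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the video was already present
```

The nice property here is that "check then create" is one atomic statement, which matters once transcript jobs run concurrently.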
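Step 4.6 is simple enough to sketch directly; a greedy word-based split where every unit holds 200 words except the last, which keeps whatever is left. The 200-word size is the figure from the step above, exposed as a parameter:

```python
def semantic_units(transcript: str, size: int = 200) -> list[str]:
    """Split a transcript into fixed-size word chunks (step 4.6).

    Every unit contains `size` words; the final unit holds the remainder
    when the transcript length is not an exact multiple of `size`.
    """
    words = transcript.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

A later refinement could snap unit boundaries to sentence ends so embeddings don't cut thoughts mid-sentence, but the fixed-size version is the baseline the step describes.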
Issues with this approach
Final remarks:
I’m here to learn and improve, so if you have any feedback, critiques, or ideas, please share them. I might’ve missed something or made mistakes, since I’m still getting familiar with the codebase, so feel free to correct me or suggest solutions.