Need help implementing scalable youtube transcript grabbing feature #628
Alexandre-Chapelle
started this conversation in
General
Replies: 0 comments
Hey!
For the past few days I've been thinking about how to implement the missing YouTube transcript-grabbing functionality. It seems quite easy at first, but the devil is in the details.
IMPORTANT: Yes, we could just do everything in memory and not save anything to the DB, or find other workarounds, but I would like to implement a scalable and maintainable solution.
Current Challenges & Key Points
My initial approach
4.1. Check if the video duration is < 1 hour.
4.2. Check if this video already exists in the main database; if not, proceed to the following steps:
4.3. Create a record in the main database (pk id, videoId, duration, createdAt, updatedAt).
4.4. Fetch the transcript in English.
4.5. Split the transcript into chunks using RecursiveCharacterTextSplitter (~chunkSize = 10, separators: ['\n\n', '\n', ' ', '']).
4.6. Split the transcript into semantic units (200 words each; if fewer than 200 words remain, use what's left).
4.7. Generate embeddings for each semantic unit.
4.8. Store the embeddings, semantic units, and video metadata in a vector DB (let's say Milvus).
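To make step 4.1 concrete, here's a small sketch of the duration gate, assuming the duration arrives as an ISO 8601 string like `PT1H2M10S` (the format the YouTube Data API returns in `contentDetails.duration`). The function names and the one-hour cutoff parameter are my own, not anything from the codebase:

```python
import re

# Matches ISO 8601 durations of the form PT#H#M#S; each component is optional.
_DURATION_RE = re.compile(r"PT(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?")

def duration_seconds(iso: str) -> int:
    """Convert an ISO 8601 duration (e.g. 'PT1H2M10S') to total seconds."""
    m = _DURATION_RE.fullmatch(iso)
    if m is None:
        raise ValueError(f"unrecognized duration: {iso!r}")
    h, mi, s = (int(g) if g else 0 for g in m.group("h", "m", "s"))
    return h * 3600 + mi * 60 + s

def is_transcribable(iso: str, max_seconds: int = 3600) -> bool:
    """Step 4.1: only videos strictly under one hour pass the gate."""
    return duration_seconds(iso) < max_seconds
```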
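For steps 4.2–4.3, the existence check and record creation can be collapsed into one idempotent upsert-style insert, so two workers racing on the same video can't create duplicates. This is a sketch only: sqlite3 stands in for the main database, and the table/column names mirror my proposed schema from step 4.3, not anything that exists in the project today:

```python
import sqlite3
from datetime import datetime, timezone

def ensure_video(conn: sqlite3.Connection, video_id: str, duration: int) -> bool:
    """Create a record for video_id unless one already exists.

    Returns True if a new record was created (i.e. the pipeline should
    proceed with steps 4.4+), False if the video was already known.
    """
    # Hypothetical schema from step 4.3; UNIQUE on videoId enforces the
    # "already exists" check at the database level.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS videos (
               id INTEGER PRIMARY KEY,
               videoId TEXT UNIQUE NOT NULL,
               duration INTEGER NOT NULL,
               createdAt TEXT NOT NULL,
               updatedAt TEXT NOT NULL
           )"""
    )
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT OR IGNORE INTO videos (videoId, duration, createdAt, updatedAt) "
        "VALUES (?, ?, ?, ?)",
        (video_id, duration, now, now),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the video was already present
```

The nice property here is that "check then create" is one atomic statement, which matters once transcript jobs run concurrently.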
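Step 4.6 is simple enough to sketch directly; a greedy word-based split where every unit holds 200 words except the last, which keeps whatever is left. The 200-word size is the figure from the step above, exposed as a parameter:

```python
def semantic_units(transcript: str, size: int = 200) -> list[str]:
    """Split a transcript into fixed-size word chunks (step 4.6).

    Every unit contains `size` words; the final unit holds the remainder
    when the transcript length is not an exact multiple of `size`.
    """
    words = transcript.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

A later refinement could snap unit boundaries to sentence ends so embeddings don't cut thoughts mid-sentence, but the fixed-size version is the baseline the step describes.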
Issues with this approach
Final remarks:
I’m here to learn and improve, so if you have any feedback, critiques, or ideas, please share them. I might’ve missed something or made mistakes, since I’m still getting familiar with the codebase, so feel free to correct me or suggest solutions.