A multilabel text classifier with an integrated transcriber and summarizer

Just Transcribe and Summarize

A multipurpose text classification model that transcribes speech from video/audio sources, then summarizes, classifies, and extracts important topics from the text.

Want to see the transcription and a summarized version of your favorite video or text? Be it a lecture, a vlog, a personal video with conversations in it, or a long text/paragraph, this is the place. You can have your video transcribed and summarized. Not only that, you will also learn the type of the video or text and the important topics in it.

The model can classify text into 316 different tags.
The keys of json_files/tag_types_encoded.json list the tags it can classify.
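A minimal sketch of inspecting those tags, assuming the JSON maps each tag name to an encoded index:

```python
import json

# Load the tag encoding file shipped with the repo; its keys are the 316 tags
# the classifier can predict (the values are assumed to be encoded indices).
with open("json_files/tag_types_encoded.json") as f:
    tag_types = json.load(f)

tags = list(tag_types.keys())
print(len(tags))   # expected: 316
print(tags[:10])   # peek at the first few tags
```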

Data Collection

The training data were collected from two sources:

  • A Kaggle dataset
  • YouTube, via web scraping

Data from these sources were merged to create the final dataset used for training. Check out csv_files for the dataset scraped from YouTube. Also, check out csv_files/link.md to find out about the Kaggle dataset and the final merged dataset.

Data from YouTube were collected through web scraping, using Selenium 4.9.0. The scraping script can be found in the scrapers directory.
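The sketch below only illustrates the idea with Selenium 4.x; the search URL and the a#video-title selector are assumptions, not the repo's actual script:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical example: collect video links from a YouTube search results page.
# The selector "a#video-title" is an assumption about YouTube's markup.
driver = webdriver.Chrome()
driver.get("https://www.youtube.com/results?search_query=python+tutorial")

links = [
    el.get_attribute("href")
    for el in driver.find_elements(By.CSS_SELECTOR, "a#video-title")
]
driver.quit()
print(links[:5])
```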

Data Preparation

The data preparation part was crucial for this project. The Kaggle dataset contains 190K+ rows, so scaling the data down was necessary. I took 35% of it and merged that with the data scraped from YouTube, giving a final dataset of 68,275 rows.
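A rough sketch of that scaling-down step with pandas; the CSV file names are hypothetical stand-ins (see csv_files/link.md for the real datasets):

```python
import pandas as pd

# File names are hypothetical; see csv_files/ for the actual data.
kaggle_df = pd.read_csv("kaggle_dataset.csv")     # 190K+ rows
youtube_df = pd.read_csv("youtube_scraped.csv")   # scraped links + tags

# Take 35% of the Kaggle data, then merge with the YouTube data.
kaggle_sample = kaggle_df.sample(frac=0.35, random_state=42)
final_df = pd.concat([kaggle_sample, youtube_df], ignore_index=True)
print(len(final_df))  # ~68,275 in the actual project
```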

Data scraped from YouTube contained only the video links and the corresponding tags. WhisperX was used to transcribe the speech from the videos, and the transcribing part took a lot of time. The notebook/video_transcribing.ipynb file contains the whole video-transcribing process, and notebook/speech_text_preprocessing contains the merging of the two datasets.
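A minimal sketch of the WhisperX step; the model size, batch size, and file name are illustrative, not the notebook's exact settings:

```python
import whisperx

device = "cuda"  # a GPU makes batch transcription far faster

# Model size and compute_type are illustrative choices.
model = whisperx.load_model("small", device, compute_type="float16")

audio = whisperx.load_audio("video.mp4")
result = model.transcribe(audio, batch_size=16)

# Join the segment texts into one transcript string.
transcript = " ".join(seg["text"].strip() for seg in result["segments"])
print(transcript[:200])
```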

Model Training

For training the model, distilroberta-base from Hugging Face was used along with blurr and fastai. The training consists of two phases: first, training with frozen layers, then with unfrozen layers. The model achieved 99% accuracy after training for 5 epochs. The model-training notebook can be found in the notebook directory.

Technical Details 📐

The training was done in a Google Colaboratory environment on a T4 GPU with 15 GB VRAM. For the frozen-layer phase, a learning rate of 3.8e-3 was used, determined with fastai's lr_find function; this phase ran for 2 epochs. In the unfrozen state, 3 more epochs were trained using learning rates between 2.5e-6 and 1.5e-4.
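In fastai terms, the two-phase schedule looks roughly like the sketch below; build_learner is a hypothetical stand-in for the blurr/distilroberta learner construction done in the training notebook:

```python
from fastai.text.all import *

# Hypothetical helper: in the real notebook the Learner is built with blurr
# around distilroberta-base; the construction is omitted here.
learn = build_learner()

# Phase 1: train with the pretrained body frozen.
learn.freeze()
learn.lr_find()                  # suggested ~3.8e-3 in the actual run
learn.fit_one_cycle(2, 3.8e-3)

# Phase 2: unfreeze and train 3 more epochs with discriminative learning rates.
learn.unfreeze()
learn.fit_one_cycle(3, lr_max=slice(2.5e-6, 1.5e-4))
```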

Used Models:

  • distilroberta-base from Hugging Face models for training
  • WhisperX for transcribing videos. Check out WhisperX
  • faster-whisper for transcribing audio/video in production. Check out Faster-Whisper
  • led-large-book-summary from Hugging Face models for summarizing
  • KeyBERT for keyword extraction. Check out KeyBERT
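Putting the production pieces together might look like the sketch below; the model sizes and the pszemraj/led-large-book-summary Hub id are assumptions, not settings confirmed by this repo:

```python
from faster_whisper import WhisperModel
from transformers import pipeline
from keybert import KeyBERT

# 1. Transcribe with faster-whisper (model size / compute_type are illustrative).
asr = WhisperModel("small", device="cpu", compute_type="int8")
segments, _info = asr.transcribe("video.mp4")
transcript = " ".join(seg.text.strip() for seg in segments)

# 2. Summarize with LED; the Hub id below is an assumed checkpoint for the
#    led-large-book-summary model mentioned above.
summarizer = pipeline("summarization", model="pszemraj/led-large-book-summary")
summary = summarizer(transcript, max_length=256, min_length=32)[0]["summary_text"]

# 3. Extract important topics with KeyBERT.
kw_model = KeyBERT()
topics = kw_model.extract_keywords(transcript, keyphrase_ngram_range=(1, 2), top_n=5)

print(summary)
print(topics)  # list of (phrase, relevance score) pairs
```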

Model Compression & ONNX Inference

The trained model takes 317 MB on disk. Compressing it with ONNX quantization brought it under 80 MB.
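A minimal sketch of dynamic quantization with ONNX Runtime, assuming the classifier has already been exported to an ONNX file (file names are illustrative):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported model's weights to int8; file names are illustrative.
quantize_dynamic(
    model_input="model.onnx",         # the exported ~317 MB model
    model_output="model_quant.onnx",  # the quantized, much smaller model
    weight_type=QuantType.QInt8,
)
```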

F1 Scores

F1 Score (Micro) = 0.42

F1 Score (Macro) = 0.32

Demos 🚀

HF spaces 🤗

A live demo for classifying a blog post or text is hosted on Hugging Face Spaces. Check out the live demo here

The text used to produce the result in the demo image can be found here

Web Deployment

A Flask app is deployed on Render. Check the deploy branch. The website is live here

Either a video link, a media file (video/audio), or text can be submitted. The app outputs the transcription, a summary of the transcribed text, the types of the text, and the important topics extracted from it. After filling in one of the form fields, click the submit button to submit the form; a toy sketch of such a route follows the caution below.

Caution: UPLOAD ONLY ONE ITEM
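A toy sketch of what such a Flask route could look like; the form field names and the run_pipeline helper are hypothetical, not the deploy branch's actual code:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # Hypothetical field names: exactly one of these should be filled in.
        link = request.form.get("video_link")
        text = request.form.get("raw_text")
        media = request.files.get("media_file")

        source = link or text or media
        # Hypothetical helper: transcribe -> summarize -> classify -> topics.
        result = run_pipeline(source)
        return render_template("result.html", **result)
    return render_template("index.html")
```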



The output page might look like this (depending on what you upload).



The output shown was generated by submitting this YouTube video to the page.

Note: this output was generated on localhost. Producing the output takes a lot of memory; the web app on Render can't handle that and runs out of memory, which crashes the webpage.

Limitations ⚠️

  • Currently, on Render, the app cannot show the result page because of the memory limit, as the models take up a lot of memory. More memory, or a GPU, would probably solve the problem.
  • The classifier can only classify 316 tags. As the full dataset was huge, I couldn't afford to train on more data than was used.
  • The micro and macro average F1 scores are still not that high.

Future Work 📆

  • Developing a system that can run and show results even on a low-memory machine.
  • Using more data to train the model, and trying different model architectures.
  • Improving the macro and micro average scores.
  • Implementing my own summarizer.

Contributions 🙋

I'm open to contributions. If you know how to add more to this project, take it to another level, or optimize the memory problem, you are more than welcome to contribute. Pull requests are always open.