Support object detection #146
I've been looking into this; a couple of notes below.

A number of object detection models use a "backbone" model internally, and for the most part the backbone model is loaded from an external source. One model that doesn't rely on an external backbone is YOLOS; however, it works on dynamic shapes. Specifically, it accepts input of any height/width, and the featurizer doesn't target a specific shape either; it only resizes images to fit within certain boundaries.

One related model I noticed is OWL-ViT, which supports zero-shot object detection (i.e. labels are given on the fly) and image-guided object detection (looking for objects similar to a reference image). Looking at its featurizer, it always resizes to a specific size.
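To make the "resizes images to fit in certain boundaries" behavior concrete, here is a minimal sketch of the shortest-edge/longest-edge rule that DETR-family featurizers (YOLOS included) use: scale so the shorter side reaches a target, but cap the longer side. The default values 800/1333 are the usual DETR-family defaults; treat them as an assumption here, not a quote from the YOLOS config.

```python
def target_size(height, width, shortest_edge=800, longest_edge=1333):
    """Compute the resized (height, width) under the shortest/longest-edge rule.

    Scale so min(height, width) becomes `shortest_edge`, unless that would
    push max(height, width) past `longest_edge`, in which case scale down
    so the longer side lands exactly on `longest_edge`.
    """
    scale = shortest_edge / min(height, width)
    if max(height, width) * scale > longest_edge:
        scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)

# A 480x640 frame scales up until the short side hits 800.
print(target_size(480, 640))   # (800, 1067)
# A very wide 500x2000 frame is instead capped by the long side.
print(target_size(500, 2000))  # (333, 1333)
```

The point for this issue: the model sees a different input shape per image, which is why the featurizer can't be pinned to one static shape.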
Note: I'm posting this issue at Sean Moriarity's (emailed) suggestion. However, I'm not at all sure what needs to be done here, let alone how. So, I'll just summarize the use case.
As discussed here, I'd like there to be a way to scan videos of technical presentations, extract the text and layout of slides, and generate corresponding Markdown files. Aside from making it possible to search the slides, this could help to make the presentations more accessible to blind and visually impaired users.
Let's assume that a video has been downloaded from a web site (e.g., via VLC media player) and that we can extract individual images from the resulting file (e.g., via Membrane).
On edited videos, these images will often contain regions showing the presenter, a slide, and assorted fill. Before we can process the slide (e.g., via Tesseract OCR), we need to extract it from the surrounding image. And, before we can do that, we need to determine its boundaries. According to Sean:
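Once a detector returns the slide's bounding box, extracting the region is mechanical. Detection models typically emit corner coordinates, often normalized to [0, 1]; a minimal sketch (assuming that normalized `(x0, y0, x1, y1)` corner format, which is an assumption about whichever model ends up being used):

```python
def box_to_pixels(box, width, height):
    """Convert a normalized (x0, y0, x1, y1) box to integer pixel
    coordinates, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    px0 = max(0, min(width,  round(x0 * width)))
    py0 = max(0, min(height, round(y0 * height)))
    px1 = max(0, min(width,  round(x1 * width)))
    py1 = max(0, min(height, round(y1 * height)))
    return (px0, py0, px1, py1)

# A box hanging slightly past the bottom edge gets clamped.
print(box_to_pixels((0.1, 0.2, 0.9, 1.2), 1000, 500))  # (100, 100, 900, 500)
```

The resulting tuple can be fed directly to a cropping call (e.g. Pillow's `Image.crop`) before handing the slide region to OCR.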
As a side note, various related tasks will need to be addressed. For example, a production system should identify and handle duplicate images, dynamic content, embedded graphics, etc. It would also be nice to generate and incorporate transcriptions from the audio track (and a pony...).
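On the duplicate-frame point: a common lightweight approach is a perceptual hash such as dHash, where near-identical slides produce hashes within a small Hamming distance. A minimal sketch, assuming frames have already been downscaled to a small grayscale grid (real pipelines typically resize to 9×8 first):

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as a 2D list of rows.

    Emits one bit per horizontally adjacent pixel pair (1 if brightness
    increases left-to-right). Near-duplicate frames yield near-identical
    bit patterns.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two 2x2 grids differing in one pixel hash one bit apart.
a = dhash([[10, 20], [30, 10]])
b = dhash([[10, 20], [30, 40]])
print(hamming(a, b))  # 1
```

Frames whose hashes fall under a small distance threshold (e.g. ≤ 4 bits for 9×8 inputs) can be treated as the same slide and collapsed.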