
Support object detection #146

Open · RichMorin opened this issue Jan 7, 2023 · 1 comment
Labels: kind:feature New feature or request

@RichMorin commented Jan 7, 2023

Note: I'm posting this issue at Sean Moriarity's (emailed) suggestion. However, I'm not at all sure what needs to be done here, let alone how. So, I'll just summarize the use case.

As discussed here, I'd like there to be a way to scan videos of technical presentations, extract the text and layout of slides, and generate corresponding Markdown files. Aside from making it possible to search the slides, this could help to make the presentations more accessible to blind and visually impaired users.

Let's assume that a video has been downloaded from a web site (e.g., via VLC media player) and that we can extract individual images from the resulting file (e.g., via Membrane).

On edited videos, these images will often contain regions showing the presenter, a slide, and assorted fill. Before we can process the slide (e.g., via Tesseract OCR), we need to extract it from the surrounding image. And, before we can do that, we need to determine its boundaries. According to Sean:

This is an object segmentation task. It's a task available in pre-trained models on HuggingFace like DETR -- which means we can certainly build the same functionality into Bumblebee. Object segmentation outlines the boundary region of an image as you describe, and then you can use that boundary region to do whatever you want.
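For context, bounding-box detection with a pre-trained DETR checkpoint looks roughly like this in the Python transformers API (the checkpoint, file name, and score threshold are just for illustration); the detected slide region could then be used to crop the frame before handing it to Tesseract:

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Illustrative only: "frame.png" stands in for a frame extracted from the video.
image = Image.open("frame.png")

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw predictions into (label, score, bounding box) triples
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```

Note that a stock COCO-trained checkpoint like this one has no "slide" class, so in practice some fine-tuning, or a zero-shot model, would likely be needed.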

As a side note, various related tasks will need to be addressed. For example, a production system should identify and handle duplicate images, dynamic content, embedded graphics, etc. It would also be nice to generate and incorporate transcriptions from the audio track (and a pony...).

@jonatanklosko jonatanklosko added the kind:feature New feature or request label Jan 9, 2023
@jonatanklosko jonatanklosko changed the title Support image region extraction as an object segmentation task Support object detection Mar 31, 2023
@jonatanklosko (Member)
I've been looking into this; a couple of notes below.

A number of object detection models use a "backbone" model internally, and in most cases the backbone model is loaded from the timm Python package, which rules those models out for us. This is the case for DETR.
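As a quick illustration of that dependency, the stock DETR config on the Hub declares a timm-provided backbone (a minimal check, assuming the facebook/detr-resnet-50 checkpoint):

```python
from transformers import DetrConfig

# The standard DETR checkpoints declare a backbone that is instantiated via timm,
# which is the part with no direct counterpart on the Elixir side.
config = DetrConfig.from_pretrained("facebook/detr-resnet-50")
print(config.use_timm_backbone)  # True
print(config.backbone)           # "resnet50"
```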

One model that doesn't rely on an external backbone is YOLOS; however, it works with dynamic shapes. Specifically, it accepts input of any height/width, and the featurizer doesn't target a specific shape either; it only resizes images to fit within certain bounds.
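To illustrate the dynamic shapes, the Python image processor for YOLOS produces differently sized tensors for inputs with different aspect ratios (a sketch, assuming the hustvl/yolos-tiny checkpoint; the dummy image sizes are arbitrary):

```python
from PIL import Image
from transformers import YolosImageProcessor

processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")

# The processor rescales the shorter edge up to a bound rather than resizing to one
# fixed resolution, so frames with different aspect ratios yield different tensor
# shapes, which is awkward for ahead-of-time compilation with Nx/EXLA.
wide = Image.new("RGB", (1280, 720))
tall = Image.new("RGB", (720, 1280))

print(processor(images=wide, return_tensors="pt")["pixel_values"].shape)
print(processor(images=tall, return_tensors="pt")["pixel_values"].shape)
```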

One related model I noticed is OWL-ViT, which supports zero-shot object detection (i.e., labels are given on the fly) and image-guided object detection (looking for objects similar to a reference image). Its featurizer always resizes to a specific size.
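For reference, zero-shot detection with OWL-ViT looks roughly like this in the Python transformers API (the checkpoint, text queries, and threshold are illustrative):

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Illustrative only: "frame.png" stands in for a frame extracted from the video.
image = Image.open("frame.png")
texts = [["a presentation slide", "a person"]]  # labels supplied on the fly

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw outputs to scored bounding boxes in original image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][label], round(score.item(), 3), box.tolist())
```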
