Support object detection #146
I've been looking into this; a couple of notes below.

A number of object detection models use a "backbone" model internally, and for the most part the backbone model is loaded from an external source. One model that doesn't rely on an external backbone is YOLOS; however, it works on dynamic shapes. Specifically, it accepts input of any height/width, and the featurizer doesn't target a specific shape either; it only resizes images to fit within certain boundaries.

One related model I noticed is OWL-ViT, which supports zero-shot object detection (i.e. labels are given on the fly) and image-guided object detection (looking for objects similar to a reference image). Looking at its featurizer, it always resizes to a specific size.
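To make the "resizes images to fit in certain boundaries" behavior concrete, here is a minimal sketch of the shortest-edge/longest-edge rule that DETR-family featurizers (YOLOS included) use: scale so the shorter side reaches a target, but cap the longer side. The default values 800/1333 are the usual DETR-family defaults; treat them as an assumption here, not a quote from the YOLOS config.

```python
def target_size(height, width, shortest_edge=800, longest_edge=1333):
    """Compute the resized (height, width) under the shortest/longest-edge rule.

    Scale so min(height, width) becomes `shortest_edge`, unless that would
    push max(height, width) past `longest_edge`, in which case scale down
    so the longer side lands exactly on `longest_edge`.
    """
    scale = shortest_edge / min(height, width)
    if max(height, width) * scale > longest_edge:
        scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)

# A 480x640 frame scales up until the short side hits 800.
print(target_size(480, 640))   # (800, 1067)
# A very wide 500x2000 frame is instead capped by the long side.
print(target_size(500, 2000))  # (333, 1333)
```

The point for this issue: the model sees a different input shape per image, which is why the featurizer can't be pinned to one static shape.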
Note: I'm posting this issue at Sean Moriarity's (emailed) suggestion. However, I'm not at all sure what needs to be done here, let alone how. So, I'll just summarize the use case.
As discussed here, I'd like there to be a way to scan videos of technical presentations, extract the text and layout of slides, and generate corresponding Markdown files. Aside from making it possible to search the slides, this could help to make the presentations more accessible to blind and visually impaired users.
Let's assume that a video has been downloaded from a web site (e.g., via VLC media player) and that we can extract individual images from the resulting file (e.g., via Membrane).
On edited videos, these images will often contain regions showing the presenter, a slide, and assorted fill. Before we can process the slide (e.g., via Tesseract OCR), we need to extract it from the surrounding image. And, before we can do that, we need to determine its boundaries. According to Sean:
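Once a detector returns the slide's bounding box, extracting the region is mechanical. Detection models typically emit corner coordinates, often normalized to [0, 1]; a minimal sketch (assuming that normalized `(x0, y0, x1, y1)` corner format, which is an assumption about whichever model ends up being used):

```python
def box_to_pixels(box, width, height):
    """Convert a normalized (x0, y0, x1, y1) box to integer pixel
    coordinates, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    px0 = max(0, min(width,  round(x0 * width)))
    py0 = max(0, min(height, round(y0 * height)))
    px1 = max(0, min(width,  round(x1 * width)))
    py1 = max(0, min(height, round(y1 * height)))
    return (px0, py0, px1, py1)

# A box hanging slightly past the bottom edge gets clamped.
print(box_to_pixels((0.1, 0.2, 0.9, 1.2), 1000, 500))  # (100, 100, 900, 500)
```

The resulting tuple can be fed directly to a cropping call (e.g. Pillow's `Image.crop`) before handing the slide region to OCR.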
As a side note, various related tasks will need to be addressed. For example, a production system should identify and handle duplicate images, dynamic content, embedded graphics, etc. It would also be nice to generate and incorporate transcriptions from the audio track (and a pony...).
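On the duplicate-frame point: a common lightweight approach is a perceptual hash such as dHash, where near-identical slides produce hashes within a small Hamming distance. A minimal sketch, assuming frames have already been downscaled to a small grayscale grid (real pipelines typically resize to 9×8 first):

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as a 2D list of rows.

    Emits one bit per horizontally adjacent pixel pair (1 if brightness
    increases left-to-right). Near-duplicate frames yield near-identical
    bit patterns.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two 2x2 grids differing in one pixel hash one bit apart.
a = dhash([[10, 20], [30, 10]])
b = dhash([[10, 20], [30, 40]])
print(hamming(a, b))  # 1
```

Frames whose hashes fall under a small distance threshold (e.g. ≤ 4 bits for 9×8 inputs) can be treated as the same slide and collapsed.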