
Add object detection pipeline #243

Open
wants to merge 29 commits into base: main
Conversation

RUFFY-369 (Collaborator) commented Oct 28, 2024

What does this PR do?

Towards closing /livepeer/bounties/#61

This PR adds a new object-detection pipeline for real real-time 😉 detection and tracking with reduced latency and high accuracy, tested locally with the PekingU/rtdetr_r50vd model. After building the Docker image, the setup was tested by running it locally on a Uvicorn server.
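For context, a minimal sketch of running RT-DETR on a single decoded frame with the Hugging Face transformers API is below. This is not the PR's actual pipeline code; the frame path and the 0.5 threshold are illustrative assumptions, only the model id comes from the description above.

```python
# Illustrative sketch only: single-frame RT-DETR inference with transformers.
import torch
from PIL import Image
from transformers import AutoImageProcessor, RTDetrForObjectDetection

processor = AutoImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

image = Image.open("frame.jpg")  # one decoded video frame (assumed path)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into labelled detections for this frame.
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```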

Expected Results:

Original video: football (video attachment)

Expected merged pipeline output: football_out (video attachment)

Bonus expected merged pipeline output for a high speed chase 😉 🥂: chase (1) (1) (video attachment)

cc @rickstaa

@RUFFY-369 RUFFY-369 requested a review from rickstaa as a code owner October 28, 2024 12:28
@RUFFY-369 RUFFY-369 marked this pull request as draft October 28, 2024 12:30
@RUFFY-369 RUFFY-369 marked this pull request as ready for review October 29, 2024 16:02
ad-astra-video (Collaborator) commented Nov 26, 2024

I guess my initial comments did not post here! Not sure what happened, sorry about that.

Overall this PR looks good, but I would like to make a few adjustments to return more data that may assist with analysis of the detected objects and to speed up the processing.

1. The conversion of frames to base64-encoded strings takes a long time. The timing results from the test video posted in the go-livepeer ObjectDetection PR are below. I think it would be better to return a base64-encoded video file from the ai-runner, which the ai-worker (likely on the same machine, so bandwidth delays should be negligible) converts to a binary URL download when it posts the results back to the Orchestrator.
2024-11-26 21:58:04,036 - app.routes.object_detection - INFO - Decoded video in 3.14 seconds
2024-11-26 21:58:18,198 - app.routes.object_detection - INFO - Detections processed in 14.16 seconds
2024-11-26 22:00:07,278 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 109.08 seconds, frame count: 266
2024-11-26 22:00:08,900 INFO:     172.17.0.1:52420 - "POST /object-detection HTTP/1.1" 200 OK
2. Please add a request parameter that makes returning the video encoded with the annotated frames optional. Annotating the frames and re-encoding takes the most time, so I think it should be opt-in, with the default being not to return the annotated frames. Naming can be something like return_annotated_video, accepting true/false as values (one possible response shape is sketched after this list).

3. Add the detection boxes to the data returned, along with the frame pts (I think it is available in frame.pts when decoding).

4. Why were the confidence scores and detection labels returned in separate lists instead of a single list of detections like the example below? I am not sure what value this provides other than just counting the objects detected in the segment. EDIT: this is kind of vague, apologies; I was curious why separate lists were chosen instead of one list, and the second sentence was driving at the fact that without time information on the detections it is not possible to know where the detections happened in a segment that can be 2-30+ seconds long.

"detections": [
     {
        "label": "ball",
        "confidence": .95
     },
     {
        "label": "person",
        "confidence": .98
     }
     ...
]
5. I had some test videos fail, which I think may be PyAV related. I will test again tomorrow morning and add notes in a separate comment. Found in testing that adding mode="r" to av.open fixed my test videos.
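To make points 2-3 concrete, here is an illustrative sketch of a response shape that carries per-frame boxes and pts and makes the annotated video opt-in. This is not the PR's actual schema; all field and class names here are hypothetical, and pydantic is only assumed to match the project's existing route models.

```python
# Hypothetical response models combining the suggestions above.
from typing import List, Optional
from pydantic import BaseModel

class Detection(BaseModel):
    label: str          # e.g. "ball"
    confidence: float   # e.g. 0.95
    box: List[float]    # [x_min, y_min, x_max, y_max] in pixels

class FrameDetections(BaseModel):
    pts: int                      # presentation timestamp from the decoded frame
    detections: List[Detection]   # all detections found in this frame

class ObjectDetectionResponse(BaseModel):
    frames: List[FrameDetections]
    # base64-encoded video, populated only when return_annotated_video=true
    annotated_video: Optional[str] = None
```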

RUFFY-369 (Collaborator, Author) commented
> 1. The conversion of frames to base64-encoded strings takes a long time. [...] I think it would be better to return a base64-encoded video file from the ai-runner, which the ai-worker (likely on the same machine, so bandwidth delays should be negligible) converts to a binary URL download when it posts the results back to the Orchestrator.
> 2. Please add a request parameter that makes returning the video encoded with the annotated frames optional. [...] Naming can be something like return_annotated_video, accepting true/false as values.
> 3. Add the detection boxes to the data returned, along with the frame pts (I think it is available in frame.pts when decoding).
For points 1-3, I have pushed the commits.

> 4. Why were the confidence scores and detection labels returned in separate lists instead of a single list of detections like the example above? I am not sure what value this provides other than just counting the objects detected in the segment. EDIT: this is kind of vague, apologies; I was curious why separate lists were chosen instead of one list, and the second sentence was driving at the fact that without time information on the detections it is not possible to know where the detections happened in a segment that can be 2-30+ seconds long.

Regarding point 4, you are correct that it will be difficult to associate individual detections with their counterpart labels and bounding boxes, but I think returning them as separate lists may make batch processing simpler for users. It may also reduce the overhead of the JSON payload and the memory consumption, since the data structures are simple lists instead of key-value pairs.
What do you think @ad-astra-video?
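For illustration, a tiny sketch of how the two shapes being weighed here convert into each other; the variable names are hypothetical, not the pipeline's actual fields.

```python
# Parallel lists (current approach) vs. a single list of detection dicts.
labels = ["ball", "person"]
confidences = [0.95, 0.98]

# parallel lists -> list of detection dicts
detections = [{"label": l, "confidence": c} for l, c in zip(labels, confidences)]

# list of detection dicts -> parallel lists
labels_back = [d["label"] for d in detections]
confidences_back = [d["confidence"] for d in detections]
```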

> 5. I had some test videos fail, which I think may be PyAV related. [...] Found in testing that adding mode="r" to av.open fixed my test videos.

I have added it to this code as well in the latest commit to avoid possible failures.
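For reference, a minimal sketch of what that decode path can look like with the explicit read mode; the input file name is illustrative and this is not the PR's actual code.

```python
# PyAV decode loop using an explicit read mode, also exposing frame.pts
# (the timestamp requested in the review) for each decoded frame.
import av

with av.open("input_segment.mp4", mode="r") as container:
    stream = container.streams.video[0]
    for frame in container.decode(stream):
        image = frame.to_image()   # PIL image to run detection on
        pts = frame.pts            # timestamp to attach to this frame's detections
```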
