Add management command for deleting duplicate captions from YouTube #2409

pt2302 · 2025-02-04T02:27:07Z

What are the relevant tickets?

Final part of https://github.com/mitodl/hq/issues/1770.

Description (What does it do?)

This PR adds a management command for deleting the legacy (duplicate) captions track from YouTube videos.

How can this be tested?

First, use RC credentials to upload a sample video to YouTube. Then, open Django shell by running docker compose exec web ./manage.py shell, and execute something like the following, optionally changing the captions content and/or order of caption track uploads.

from io import BytesIO
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseUpload

from videos.youtube import YouTubeApi

LEGACY_CAPTIONS_NAME = "ocw_studio_upload"
CC_ENGLISH_CAPTIONS_NAME = "CC (English)"

HARD_CODED_VTT = b"""WEBVTT

00:00:00.000 --> 00:00:01.000
10

00:00:01.000 --> 00:00:02.000
9

00:00:02.000 --> 00:00:03.000
8

00:00:03.000 --> 00:00:04.000
7

00:00:04.000 --> 00:00:05.000
6

00:00:05.000 --> 00:00:06.000
5

00:00:06.000 --> 00:00:07.000
4

00:00:07.000 --> 00:00:08.000
3

00:00:08.000 --> 00:00:09.000
2

00:00:09.000 --> 00:00:10.000
1
"""

def upload_captions(video_id: str):
    yt = YouTubeApi()

    try:
        legacy_media = MediaIoBaseUpload(
            BytesIO(HARD_CODED_VTT),
            mimetype="text/vtt",
            chunksize=-1,
            resumable=True
        )
        legacy_response = yt.client.captions().insert(
            part="snippet",
            sync=False,
            body={
                "snippet": {
                    "language": "en",
                    "name": LEGACY_CAPTIONS_NAME,
                    "videoId": video_id,
                }
            },
            media_body=legacy_media,
        ).execute()

        legacy_id = legacy_response["id"]
        
        downloaded_data = yt.client.captions().download(id=legacy_id).execute()

        list_response = yt.client.captions().list(
            part="snippet", videoId=video_id
        ).execute()
        cc_track = [
            item for item in list_response.get("items", [])
            if item["snippet"].get("name") == CC_ENGLISH_CAPTIONS_NAME
        ]

        media_copy = MediaIoBaseUpload(
            BytesIO(downloaded_data),
            mimetype="text/vtt",
            chunksize=-1,
            resumable=True
        )
        if cc_track:
            yt.client.captions().update(
                part="snippet",
                body={"id": cc_track[0]["id"]},
                media_body=media_copy,
            ).execute()
        else:
            yt.client.captions().insert(
                part="snippet",
                sync=False,
                body={
                    "snippet": {
                        "language": "en",
                        "name": CC_ENGLISH_CAPTIONS_NAME,
                        "videoId": video_id,
                    }
                },
                media_body=media_copy,
            ).execute()

    except HttpError as http_err:
        print("YouTube API error:", http_err)

Then, run upload_captions(<youtube video id>). There should now be two captions tracks for the YouTube video, allowing for testing the management command. Next, run docker compose exec web ./manage.py delete_duplicate_captions_youtube --filter <website short id> to test the management command, and verify that the legacy captions track has been deleted on YouTube.

…ideos

umar8hassan · 2025-02-25T13:57:33Z

videos/management/commands/delete_duplicate_captions_youtube.py

+                if not items:
+                    continue
+
+                newest = items[0]


We should get the the newest item that matches either of the two concerned names.

Maybe it's because the setting (or the sample video itself) but we get an auto-translation in Russian as well. in my testing i got the the newest item as below. Since, it did not match any of the expected values in the if-else that followed, it did not work for me.

pt2302 · 2025-02-25T19:12:30Z

/Gemini review

gemini-code-assist

Code Review

This pull request introduces a management command for deleting duplicate captions from YouTube videos. The command iterates through video files associated with specified websites, identifies duplicate caption tracks (specifically 'ocw_studio_upload'), and removes them. The test suite covers scenarios where the newest track is either 'ocw_studio_upload' or 'CC (English)'.

Summary of Findings

Error Handling: The except HttpError block in the management command could benefit from more specific error handling or logging to provide better insights into the nature of the errors encountered during YouTube API interactions.
Clarity of Variable Names: Consider renaming variables like vf to something more descriptive like video_file to improve code readability.
Redundancy in Test Assertions: In the test suite, there are multiple assertions related to API calls. Consider consolidating these assertions to improve test readability and maintainability.

Assessment

The pull request introduces a management command to delete duplicate captions from YouTube videos, which addresses a specific issue related to legacy caption tracks. The code appears well-structured and includes a test suite to verify the command's functionality. Overall, the changes seem reasonable and necessary. I've provided some feedback to further improve the code quality and clarity. Please address these comments before merging, and ensure that others review and approve this code as well.

gemini-code-assist · 2025-02-25T19:12:43Z

videos/management/commands/delete_duplicate_captions_youtube.py

+            except HttpError:
+                log.exception("Error processing video %s", video_id)


Consider logging the specific HttpError that occurred, or handling different types of HttpError differently. This could provide more context for debugging and monitoring. For example, a 404 error might indicate a video was deleted, while a quota error might require a different response.

Suggested change

except HttpError:

log.exception("Error processing video %s", video_id)

except HttpError as e:

log.exception("Error processing video %s: %s", video_id, e)

gemini-code-assist · 2025-02-25T19:12:43Z

videos/management/commands/delete_duplicate_captions_youtube_test.py

+    mock_youtube_api.client.captions.return_value.list.assert_called_with(
+        part="snippet", videoId="dummy_youtube_id"
+    )
+    mock_youtube_api.client.captions.return_value.download.assert_called_once()
+
+    mock_youtube_api.client.captions.return_value.delete.assert_called_once_with(
+        id="caption_id_legacy"
+    )


Consider consolidating the assertions into a single block or using assert_has_calls to verify the sequence of calls. This can improve test readability and maintainability.

calls = [ mocker.call.client.captions.return_value.list(part="snippet", videoId="dummy_youtube_id"), mocker.call.client.captions.return_value.download(), mocker.call.client.captions.return_value.delete(id="caption_id_legacy") ] mock_youtube_api.assert_has_calls(calls)

pt2302 added 3 commits February 3, 2025 21:24

Add management command for deleting duplicate captions from YouTube v…

d06f985

…ideos

Adding test for deleting duplicate captions

e6770b2

Fixing mocking YouTube API in test

a9e7290

pt2302 marked this pull request as ready for review February 20, 2025 06:51

pt2302 added the Needs Review label Feb 20, 2025

umar8hassan self-assigned this Feb 25, 2025

umar8hassan self-requested a review February 25, 2025 10:28

umar8hassan requested changes Feb 25, 2025

View reviewed changes

umar8hassan added Waiting on Author and removed Needs Review labels Feb 25, 2025

umar8hassan assigned pt2302 Feb 25, 2025

gemini-code-assist bot reviewed Feb 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add management command for deleting duplicate captions from YouTube #2409

Add management command for deleting duplicate captions from YouTube #2409

pt2302 commented Feb 4, 2025 •

edited

Loading

umar8hassan Feb 25, 2025

pt2302 commented Feb 25, 2025

gemini-code-assist bot left a comment

gemini-code-assist bot Feb 25, 2025

gemini-code-assist bot Feb 25, 2025

		except HttpError:
		log.exception("Error processing video %s", video_id)

Add management command for deleting duplicate captions from YouTube #2409

Are you sure you want to change the base?

Add management command for deleting duplicate captions from YouTube #2409

Conversation

pt2302 commented Feb 4, 2025 • edited Loading

What are the relevant tickets?

Description (What does it do?)

How can this be tested?

umar8hassan Feb 25, 2025

Choose a reason for hiding this comment

pt2302 commented Feb 25, 2025

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Summary of Findings

Assessment

gemini-code-assist bot Feb 25, 2025

Choose a reason for hiding this comment

gemini-code-assist bot Feb 25, 2025

Choose a reason for hiding this comment

pt2302 commented Feb 4, 2025 •

edited

Loading