Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add management command for deleting duplicate captions from YouTube #2409

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pt2302
Copy link
Contributor

@pt2302 pt2302 commented Feb 4, 2025

What are the relevant tickets?

Final part of https://github.com/mitodl/hq/issues/1770.

Description (What does it do?)

This PR adds a management command for deleting the legacy (duplicate) captions track from YouTube videos.

How can this be tested?

First, use RC credentials to upload a sample video to YouTube. Then, open Django shell by running docker compose exec web ./manage.py shell, and execute something like the following, optionally changing the captions content and/or order of caption track uploads.

from io import BytesIO
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseUpload

from videos.youtube import YouTubeApi

LEGACY_CAPTIONS_NAME = "ocw_studio_upload"
CC_ENGLISH_CAPTIONS_NAME = "CC (English)"

HARD_CODED_VTT = b"""WEBVTT

00:00:00.000 --> 00:00:01.000
10

00:00:01.000 --> 00:00:02.000
9

00:00:02.000 --> 00:00:03.000
8

00:00:03.000 --> 00:00:04.000
7

00:00:04.000 --> 00:00:05.000
6

00:00:05.000 --> 00:00:06.000
5

00:00:06.000 --> 00:00:07.000
4

00:00:07.000 --> 00:00:08.000
3

00:00:08.000 --> 00:00:09.000
2

00:00:09.000 --> 00:00:10.000
1
"""

def upload_captions(video_id: str):
    yt = YouTubeApi()

    try:
        legacy_media = MediaIoBaseUpload(
            BytesIO(HARD_CODED_VTT),
            mimetype="text/vtt",
            chunksize=-1,
            resumable=True
        )
        legacy_response = yt.client.captions().insert(
            part="snippet",
            sync=False,
            body={
                "snippet": {
                    "language": "en",
                    "name": LEGACY_CAPTIONS_NAME,
                    "videoId": video_id,
                }
            },
            media_body=legacy_media,
        ).execute()

        legacy_id = legacy_response["id"]
        
        downloaded_data = yt.client.captions().download(id=legacy_id).execute()

        list_response = yt.client.captions().list(
            part="snippet", videoId=video_id
        ).execute()
        cc_track = [
            item for item in list_response.get("items", [])
            if item["snippet"].get("name") == CC_ENGLISH_CAPTIONS_NAME
        ]

        media_copy = MediaIoBaseUpload(
            BytesIO(downloaded_data),
            mimetype="text/vtt",
            chunksize=-1,
            resumable=True
        )
        if cc_track:
            yt.client.captions().update(
                part="snippet",
                body={"id": cc_track[0]["id"]},
                media_body=media_copy,
            ).execute()
        else:
            yt.client.captions().insert(
                part="snippet",
                sync=False,
                body={
                    "snippet": {
                        "language": "en",
                        "name": CC_ENGLISH_CAPTIONS_NAME,
                        "videoId": video_id,
                    }
                },
                media_body=media_copy,
            ).execute()

    except HttpError as http_err:
        print("YouTube API error:", http_err)

Then, run upload_captions(<youtube video id>). There should now be two captions tracks for the YouTube video, allowing for testing the management command. Next, run docker compose exec web ./manage.py delete_duplicate_captions_youtube --filter <website short id> to test the management command, and verify that the legacy captions track has been deleted on YouTube.

@pt2302 pt2302 marked this pull request as ready for review February 20, 2025 06:51
@umar8hassan umar8hassan self-assigned this Feb 25, 2025
@umar8hassan umar8hassan self-requested a review February 25, 2025 10:28
if not items:
continue

newest = items[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get the the newest item that matches either of the two concerned names.

Maybe it's because the setting (or the sample video itself) but we get an auto-translation in Russian as well. in my testing i got the the newest item as below. Since, it did not match any of the expected values in the if-else that followed, it did not work for me.
Screenshot 2025-02-25 at 5 24 52 PM
Screenshot 2025-02-25 at 6 39 46 PM

@pt2302
Copy link
Contributor Author

pt2302 commented Feb 25, 2025

/Gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a management command for deleting duplicate captions from YouTube videos. The command iterates through video files associated with specified websites, identifies duplicate caption tracks (specifically 'ocw_studio_upload'), and removes them. The test suite covers scenarios where the newest track is either 'ocw_studio_upload' or 'CC (English)'.

Summary of Findings

  • Error Handling: The except HttpError block in the management command could benefit from more specific error handling or logging to provide better insights into the nature of the errors encountered during YouTube API interactions.
  • Clarity of Variable Names: Consider renaming variables like vf to something more descriptive like video_file to improve code readability.
  • Redundancy in Test Assertions: In the test suite, there are multiple assertions related to API calls. Consider consolidating these assertions to improve test readability and maintainability.

Assessment

The pull request introduces a management command to delete duplicate captions from YouTube videos, which addresses a specific issue related to legacy caption tracks. The code appears well-structured and includes a test suite to verify the command's functionality. Overall, the changes seem reasonable and necessary. I've provided some feedback to further improve the code quality and clarity. Please address these comments before merging, and ensure that others review and approve this code as well.

Comment on lines +121 to +122
except HttpError:
log.exception("Error processing video %s", video_id)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider logging the specific HttpError that occurred, or handling different types of HttpError differently. This could provide more context for debugging and monitoring. For example, a 404 error might indicate a video was deleted, while a quota error might require a different response.

Suggested change
except HttpError:
log.exception("Error processing video %s", video_id)
except HttpError as e:
log.exception("Error processing video %s: %s", video_id, e)

Comment on lines +66 to +73
mock_youtube_api.client.captions.return_value.list.assert_called_with(
part="snippet", videoId="dummy_youtube_id"
)
mock_youtube_api.client.captions.return_value.download.assert_called_once()

mock_youtube_api.client.captions.return_value.delete.assert_called_once_with(
id="caption_id_legacy"
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider consolidating the assertions into a single block or using assert_has_calls to verify the sequence of calls. This can improve test readability and maintainability.

    calls = [
        mocker.call.client.captions.return_value.list(part="snippet", videoId="dummy_youtube_id"),
        mocker.call.client.captions.return_value.download(),
        mocker.call.client.captions.return_value.delete(id="caption_id_legacy")
    ]
    mock_youtube_api.assert_has_calls(calls)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants