[ie/youtube] Extract comments with or without new format #9775

Open
wants to merge 26 commits into base: master

Conversation

@jakeogh (Contributor) commented Apr 24, 2024

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

YouTube's comment format changed; this PR attempts to handle the new case where the key frameworkUpdates is present, as well as the case where it is not. The original patch is from @minamotorin: #9358 (comment)

Fixes #9358

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

@pukkandan (Member)

cc @shoxie007

@pukkandan pukkandan added the site-bug Issue with a specific website label Apr 24, 2024
@shoxie007 commented Apr 24, 2024

Thanks for the effort. I had wanted to do this but was pressed for time. But what you've posted is quite elegant and effective, so it saves me the trouble. I'd like to propose some changes:

  1. Consider obtaining the like_count from the key likeCountA11y instead of likeCountNotliked, because likeCountNotliked is rather ambiguous. What does "like count not liked" mean? Also cast the result to an int; otherwise, it will show up as a string with quotes around it in the final comments dict. (If a comment has no likes, the like count shows up as " ".) So maybe this:
    'like_count': str_to_int(self._search_regex(r'^([\d]+)', toolbar_dict.get('likeCountA11y'), 'like_count', fatal=False)),
  2. Also, I believe there are certain coding conventions for yt-dlp which are intended to minimize errors, or at least make them non-fatal: https://github.com/yt-dlp/yt-dlp/blob/master/CONTRIBUTING.md#developer-instructions
    One of these is to avoid addressing a path in a dict directly. Instead, it's best to use intermediary functions like try_get or traverse_obj:
    uploader = traverse_obj(meta, ('user', 'name')) # correct
    uploader = meta['user']['name'] # incorrect
    So in _extract_comment, this line for example:
    'text': entity_payload['properties']['content']['content'],
    could be rewritten in two ways that I know of (see the short sketch after this list):
    'text': try_get(entity_payload, lambda x: x['properties']['content']['content'], str),
    'text': traverse_obj(entity_payload, ('properties', 'content', 'content', {str}), get_all=False),
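For reference, here's a minimal sketch (with a hypothetical payload, not taken from a real response) of why the helpers are preferred: direct indexing raises on a missing key, while try_get and traverse_obj simply return None:

from yt_dlp.utils import traverse_obj, try_get

# Hypothetical entity payload that is missing the nested 'content' key
entity_payload = {'properties': {'commentId': 'abc'}}

# Direct indexing would raise KeyError and abort extraction:
# entity_payload['properties']['content']['content']

# Both helpers degrade gracefully to None instead:
text = try_get(entity_payload, lambda x: x['properties']['content']['content'], str)             # -> None
text = traverse_obj(entity_payload, ('properties', 'content', 'content', {str}), get_all=False)  # -> None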

@jakeogh (Contributor, Author) commented Apr 24, 2024

No problem, thanks for the review. The original patch author @minamotorin also made a comment about the confusingly named likeCountNotliked: #9358 (comment). AFAICT it's the same info (for now, haha), except one does not need string parsing beyond the conversion to int.

I replaced the dict access with (I think) correctly typed calls to try_get().

@shoxie007

I read and re-read @minamotorin's comment and found it intriguing:

likeCountLiked is used when the user click “like button”, and otherwise likeCountNotLiked is used.

I put this definition to the test. I loaded a video with a logged-in Youtube account, as this is the only way that field likeCountLiked can mean anything. I hit the like button for one or more comments, then re-loaded the comments, then studied the JSON response. Here is what the fields in the JSON mean:

  • likeCountLiked = 1 if you've liked ANY comment (not just the one particular comment in question) in the entire comment section, and no one else has liked the comment.
    • But then, likeCountLiked = (1 + likeCountNotLiked) if others have liked the comment. To reiterate, even if you didn't like the particular comment, still 1 is added. The 1 denotes that you liked at least one comment in the entire comments section.
    • What a nonsensical data value! I can't think of a scenario in which it would be meaningful and useful.
  • likeCountNotLiked = total number of likes by other users besides you. So if you've liked a comment, and 9 others have also liked the comment, bringing the total to 10 likes, the value of likeCountNotLiked is actually 9. However, if you're logged out of the Youtube account, likeCountNotLiked will equal the total number of likes, in this case 10.

I tested your extractor as it currently is and it reflected these values.

So I'll repeat what I wrote in my first comment: please obtain the like_count from the key likeCountA11y instead of likeCountNotliked. If likeCountNotliked is used, then an erroneous value will be returned when the user passes a cookie to yt-dlp (i.e. a logged-in session) and yt-dlp obtains the comments data while logged in. When logged in, the like_count can be one less than it should be. I tested and verified this myself with the extractor as it currently is.

My suggestion:
'like_count': str_to_int(self._search_regex(r'^([\d]+)', entity_payload.get('likeCountA11y'), 'like_count', fatal=False)) or 0,
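To illustrate the off-by-one problem with a made-up example (the field values below are hypothetical, not copied from a real response):

import re
from yt_dlp.utils import str_to_int

# A comment with 10 total likes, one of which is your own (logged-in session)
toolbar = {
    'likeCountLiked': '10',       # others' likes plus your own
    'likeCountNotLiked': '9',     # others' likes only, so one short while logged in
    'likeCountA11y': '10 likes',  # accessibility string carrying the true total
}

# Plain-re equivalent of the suggested _search_regex + str_to_int combination
match = re.search(r'^(\d+)', toolbar.get('likeCountA11y') or '')
like_count = str_to_int(match.group(1)) if match else 0
print(like_count)  # 10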

Also suggest expressing this:
entity_payload = entity['payload']['commentEntityPayload']
differently, in either of these ways:
entity_payload = traverse_obj(entity, ('payload', 'commentEntityPayload', {dict}))
entity_payload = try_get(entity, lambda x: x['payload']['commentEntityPayload'], dict)

@jakeogh (Contributor, Author) commented Apr 24, 2024

Aha! Thanks for sorting that out. I fixed both issues.

@minamotorin commented Apr 25, 2024

I think the current implementation does not work with the old format.
In such cases, it may stop at the following point, because the old format sometimes (not always) has neither commentThreadRenderer nor commentViewModel but does have commentRenderer.
This code might need some more work.

if not is_forced_continuation and not (tracker['est_total'] == 0 and tracker['running_total'] == 0):
    check_get_keys = [[*continuation_items_path, ..., (
        'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel'))]]
try:
    response = self._extract_response(
        item_id=None, query=continuation,
        ep='next', ytcfg=ytcfg, headers=headers, note=note_prefix,
        check_get_keys=check_get_keys)

However, I can't currently get an old-format response, so I'm unable to test this.

@shoxie007

I think the current implementation does not work with the old format. In such cases, it may stop at the following point, because the old format sometimes (not always) has neither commentThreadRenderer nor commentViewModel. This code might need some more work.

Now that you bring it up, I did get the "Incomplete data received..." warning a few times when downloading comments for some videos, even with the commentViewModel JSON response. I wonder if it had anything to do with this bit of code. I'll study and test it further myself and see what the issue is...

@minamotorin

This fix may work.

-          'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel'))]]
+          'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))]]

@bbilly1 (Contributor) commented Apr 25, 2024

I have been testing this. I wasn't able to access is_favorited. If I'm understanding this correctly, engagementToolbarStateEntityPayload is located in a different entity linked by the toolbarStateKey, which is unfortunately somewhere in the entity_payloads list.

I've been playing around; something like this should do it:

diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index b4f33e7f7..6b87ab55d 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -3306,7 +3306,7 @@ def _extract_heatmap(self, data):
                 'value': ('intensityScoreNormalized', {float_or_none}),
             })) or None
 
-    def _extract_comment(self, view_model, entity, parent=None):
+    def _extract_comment(self, view_model, entity, entity_payloads, parent=None):
         entity_payload = traverse_obj(entity, ('payload', 'commentEntityPayload', {dict}))
         comment_id = entity_payload.get('properties').get('commentId')
 
@@ -3344,10 +3344,12 @@ def _extract_comment(self, view_model, entity, parent=None):
         if author_is_uploader is not None:
             info['author_is_uploader'] = author_is_uploader
 
-        comment_abr = traverse_obj(
-            entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'), expected_type=str)
-        if comment_abr is not None:
-            info['is_favorited'] = comment_abr == 'TOOLBAR_HEART_STATE_HEARTED'
+        toolbar_state_key = entity_payload.get('properties', {}).get('toolbarStateKey')
+        if toolbar_state_key:
+            tool_bar_entity = next((d for d in entity_payloads if d.get('entityKey') == toolbar_state_key), None)
+            if tool_bar_entity:
+                heart_state = traverse_obj(tool_bar_entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'))
+                info['is_favorited'] = heart_state == 'TOOLBAR_HEART_STATE_HEARTED'
 
         info['author_is_verified'] = traverse_obj(entity_payload, ('author', 'isVerified')) == 'true'
 
@@ -3470,7 +3472,7 @@ def extract_thread(contents, entity_payloads):
                             entity = entity
                             break
 
-                    comment = self._extract_comment(view_model, entity, parent)
+                    comment = self._extract_comment(view_model, entity, entity_payloads, parent)
 
                 if comment.get('is_pinned'):
                     tracker['pinned_comment_ids'].add(comment_id)

But unfortunately this requires looping through the array for each comment to find the matching EngagementToolbar.

@shoxie007 commented Apr 26, 2024

But unfortunately this requires looping through the array for each comment to find the matching EngagementToolbar.

In such trying situations, look to our Lord and Savior: traverse_obj

How I would re-write everything:

# In _comment_entries >> extract_thread:
# --------------------------------------
# .... Existing code remains as is

comment_key = view_model.get("commentKey")
toolbar_state_key = view_model.get("toolbarStateKey")

# This usage of traverse_obj returns a list of relevant entities
# - NOTE: This:            v["entityKey"] in [comment_key, toolbar_state_key]
#   is shorthand for:      v["entityKey"] == comment_key or v["entityKey"] == toolbar_state_key
entities = traverse_obj(entity_payloads, lambda _, v: v["entityKey"] in [comment_key, toolbar_state_key])

# Call _extract_comment using "entities" instead of the former "entity"
comment = self._extract_comment(view_model, entities, parent)

# Then in _extract_comment
# ------------------------
def _extract_comment(self, view_model, entities, parent=None):   # change "entity" to "entities"
    comment_entity_payload = traverse_obj(entities, (..., 'payload', 'commentEntityPayload', {dict}), get_all=False)
    toolbar_entity_payload = traverse_obj(entities, (..., 'payload', 'engagementToolbarStateEntityPayload', {dict}), get_all=False)

    # ....
    # NOTE: "entity_payload" should be changed to "comment_entity_payload" in existing code
    # ....

    if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
        info['is_favorited'] = True

    # ....
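For anyone following along, this is roughly how that filtering behaves on a toy entity_payloads list (the entity keys and payload shapes below are made up for illustration):

from yt_dlp.utils import traverse_obj

entity_payloads = [
    {'entityKey': 'comment-key-1',
     'payload': {'commentEntityPayload': {'properties': {'commentId': 'abc'}}}},
    {'entityKey': 'toolbar-key-1',
     'payload': {'engagementToolbarStateEntityPayload': {'heartState': 'TOOLBAR_HEART_STATE_HEARTED'}}},
    {'entityKey': 'comment-key-2',
     'payload': {'commentEntityPayload': {'properties': {'commentId': 'def'}}}},
]
comment_key, toolbar_state_key = 'comment-key-1', 'toolbar-key-1'

# The lambda path filters the list, so only the two entities belonging to this comment remain
entities = traverse_obj(entity_payloads, lambda _, v: v['entityKey'] in [comment_key, toolbar_state_key])
assert len(entities) == 2

# `...` branches over the filtered list; get_all=False keeps the first match of each payload type
toolbar = traverse_obj(entities, (..., 'payload', 'engagementToolbarStateEntityPayload', {dict}), get_all=False)
assert toolbar['heartState'] == 'TOOLBAR_HEART_STATE_HEARTED'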

@shoxie007 commented May 5, 2024

Are you still with us, @jakeogh? Would you kindly integrate the proposed changes?

  1. Add commentRenderer in code for check_get_keys
  2. Modify extract_thread and _extract_comment to take account of the fact that the heartState key is in a different entity

I've tested the code on numerous videos. It's working. Let's get this pull request merged ASAP. People have been asking and wondering about the broken comments extraction.

@jakeogh (Contributor, Author) commented May 5, 2024

@shoxie007

Thanks for the ping; I have been short on time this week, but I'll be able to dig back into this tonight. I have tested this on ~1k videos (some very old) and haven't been able to trigger an issue as far as I can tell, but I haven't verified this yet and need to go back over my results. Is there a way to find an ID that triggers the situation where the old format sometimes (not always) has neither commentThreadRenderer nor commentViewModel? What would the result be? Clearly it's not throwing an exception or I would have hit it by now, so I assume the result would be 0 comments on a video that has comments? That's what I intend to look for when I double-check my own test results. If you have a moment before I get to it, it would help if you converted your rewrite into a diff against the current pull.

@shoxie007

@jakeogh

I have let my scripts test this on thousands of videos (some very old) and haven't been able to trigger an issue as far as I can tell.

Youtube isn't consistent in which videos it uses the commentRenderer model for. It uses it at random. My own experience is that about 10-20% of videos will use the commentRenderer model in the comments-response JSON when downloading an entire channel for example. I think people's experience may vary depending on geo-location. I'm not sure.

Without @minamotorin's fix, when I obtain the comments for an entire channel with hundreds of videos, I routinely get:
WARNING: [youtube] Incomplete data received. Giving up after 3 retries
for the aforementioned 10-20% of videos. And not just for one or two comment threads; it fails for almost every single comment thread for the video in question. In the end, few or no comments are downloaded.

An example of a terminal output for such a video:

[youtube] Extracting URL: https://www.youtube.com/watch?v=NX7cJD58sUE
[youtube] NX7cJD58sUE: Downloading webpage
[youtube] NX7cJD58sUE: Downloading ios player API JSON
[youtube] NX7cJD58sUE: Downloading android player API JSON
WARNING: [youtube] Skipping player responses from android clients (got player responses for video "aQvGIIdgFDM" instead of "NX7cJD58sUE")
[youtube] NX7cJD58sUE: Downloading m3u8 information
[youtube] Downloading comment section API JSON
[youtube] Downloading ~99 comments
[youtube] Sorting comments by newest first
[youtube] Downloading comment API JSON page 1 (0/~99)
[youtube]     Downloading comment API JSON reply thread 1 (1/~99)
WARNING: [youtube] Incomplete data received. Retrying (1/3)...
[youtube]     Downloading comment API JSON reply thread 1 (1/~99)
WARNING: [youtube] Incomplete data received. Retrying (2/3)...
[youtube]     Downloading comment API JSON reply thread 1 (1/~99)
WARNING: [youtube] Incomplete data received. Retrying (3/3)...
[youtube]     Downloading comment API JSON reply thread 1 (1/~99)
WARNING: [youtube] Incomplete data received. Giving up after 3 retries
[youtube]     Downloading comment API JSON reply thread 2 (6/~99)
WARNING: [youtube] Incomplete data received. Retrying (1/3)...
[youtube]     Downloading comment API JSON reply thread 2 (6/~99)
WARNING: [youtube] Incomplete data received. Retrying (2/3)...
[youtube]     Downloading comment API JSON reply thread 2 (6/~99)
WARNING: [youtube] Incomplete data received. Retrying (3/3)...
[youtube]     Downloading comment API JSON reply thread 2 (6/~99)
WARNING: [youtube] Incomplete data received. Giving up after 3 retries
....
....
WARNING: [youtube] Incomplete data received. Retrying (1/3)...
[youtube]     Downloading comment API JSON reply thread 7 (31/~99)
WARNING: [youtube] Incomplete data received. Retrying (2/3)...
[youtube]     Downloading comment API JSON reply thread 7 (31/~99)
WARNING: [youtube] Incomplete data received. Retrying (3/3)...
[youtube]     Downloading comment API JSON reply thread 7 (31/~99)
WARNING: [youtube] Incomplete data received. Giving up after 3 retries
[youtube] Extracted 32 comments
[info] NX7cJD58sUE: Downloading 1 format(s): 247+251
[info] Writing '%(comments)j' to: 04__Video Comments (JSONs)/2023.11.18 - Variables in Russia's Downfall.comments.json

I think you may be able to replicate the issue. Run this command (using a build which doesn't have the fix suggested above) on Unix to download comments for the first 100 videos of the Whatifalthist channel. It formats the output so that it's more readable and tees the terminal output to a log file, since the download might take a while and the log can be studied afterwards:
yt-dlp --no-overwrites --ignore-errors --compat-options filename-sanitization --abort-on-unavailable-fragment --write-comments --no-write-info-json --skip-download --playlist-items 1-100 --print-to-file "%(comments)j" "%(upload_date>%Y.%m.%d)s - %(title)s.comments.json" "https://www.youtube.com/playlist?list=UU5Dw9TFdbPJoTDMSiJdIQTA" 2>&1 | sed -uE 's/^(\[download\] Downloading item [0-9]+ of [0-9]+)/\n\1/' | tee "DL_Log_for_Comments_[$(date +'%Y.%m.%d_%H-%M')].txt"

As for the second fix which I suggested in response to the issue which @bbilly1 raised, it should be obvious why it's necessary. The heartState key (which reveals whether or not a comment has been hearted) is in another entity (containing engagementToolbarStateEntityPayload), not in the entity containing commentEntityPayload. So two entities need to be extracted and passed to _extract_comment.

@jakeogh (Contributor, Author) commented May 6, 2024

@shoxie007 I really appreciate the detailed reproducer and explanation. I've added the fix from @minamotorin; however, before that I ran the suggested tests and was unable to reproduce the issue. My IP geolocates to AZ, USA:

$ yt-dlp https://www.youtube.com/watch?v=NX7cJD58sUE --write-info-json --write-comments
[youtube] Extracting URL: https://www.youtube.com/watch?v=NX7cJD58sUE
[youtube] NX7cJD58sUE: Downloading webpage
[youtube] NX7cJD58sUE: Downloading ios player API JSON
[youtube] NX7cJD58sUE: Downloading android player API JSON
WARNING: [youtube] Skipping player responses from android clients (got player responses for video "aQvGIIdgFDM" instead of "NX7cJD58sUE")
[youtube] NX7cJD58sUE: Downloading m3u8 information
[youtube] Downloading comment section API JSON
[youtube] Downloading ~99 comments
[youtube] Sorting comments by newest first
[youtube] Downloading comment API JSON page 1 (0/~99)
[youtube]     Downloading comment API JSON reply thread 1 (1/~99)
[youtube]     Downloading comment API JSON reply thread 2 (7/~99)
[youtube]     Downloading comment API JSON reply thread 3 (10/~99)
[youtube]     Downloading comment API JSON reply thread 4 (19/~99)
[youtube]     Downloading comment API JSON reply thread 5 (23/~99)
[youtube]     Downloading comment API JSON reply thread 6 (28/~99)
[youtube]     Downloading comment API JSON reply thread 7 (32/~99)
[youtube]     Downloading comment API JSON reply thread 8 (34/~99)
[youtube]     Downloading comment API JSON reply thread 9 (40/~99)
[youtube] Downloading comment API JSON page 2 (44/~99)
[youtube]     Downloading comment API JSON reply thread 1 (45/~99)
[youtube]     Downloading comment API JSON reply thread 2 (50/~99)
[youtube]     Downloading comment API JSON reply thread 3 (61/~99)
[youtube]     Downloading comment API JSON reply thread 4 (66/~99)
[youtube]     Downloading comment API JSON reply thread 5 (71/~99)
[youtube]     Downloading comment API JSON reply thread 6 (75/~99)
[youtube]     Downloading comment API JSON reply thread 7 (86/~99)
[youtube]        Downloading comment replies API JSON page 1 (96/~99)
[youtube] Extracted 99 comments
[info] NX7cJD58sUE: Downloading 1 format(s): 247+251
[info] Writing video metadata as JSON to: Variables in Russia's Downfall [NX7cJD58sUE].info.json
[download] Destination: Variables in Russia's Downfall [NX7cJD58sUE].f247.webm
[download] 100% of  996.75KiB in 00:00:00 at 1.64MiB/s
[download] Destination: Variables in Russia's Downfall [NX7cJD58sUE].f251.webm
[download] 100% of  721.74KiB in 00:00:00 at 1.17MiB/s
[Merger] Merging formats into "Variables in Russia's Downfall [NX7cJD58sUE].webm"
Deleting original file Variables in Russia's Downfall [NX7cJD58sUE].f251.webm (pass -k to keep)
Deleting original file Variables in Russia's Downfall [NX7cJD58sUE].f247.webm (pass -k to keep)
$ yt-dlp --no-overwrites --ignore-errors --compat-options filename-sanitization --abort-on-unavailable-fragment --write-comments --no-write-info-json --skip-download --playlist-items 1-100 --print-to-file "%(comments)j" "%(upload_date>%Y.%m.%d)s - %(title)s.comments.json" "https://www.youtube.com/playlist?list=UU5Dw9TFdbPJoTDMSiJdIQTA" 2>&1 | sed -uE 's/^(\[download\] Downloading item [0-9]+ of [0-9]+)/\n\1/' | tee "DL_Log_for_Comments_[$(date +'%Y.%m.%d_%H-%M')].txt"

<snip>

[download] Finished downloading playlist: Uploads from Whatifalthist

$ echo $?
0

$ ls *.comments.json | wc -l
100

$ ls DL_Log_for_Comments_*
1.1M -rw-r--r-- 1 user user 1.1M 2024-05-05 21:30:14 'DL_Log_for_Comments_[2024.05.05_20-12].txt'

$ grep WARNING DL_Log_for_Comments_\[2024.05.05_20-12\].txt | grep -v "Skipping player responses from android clients" | wc -l
0

A manual check of the 100 *.comments.json files shows they have the expected comment counts, and the DL_Log_for_Comments log has no unexpected WARNING: messages.

Checking my previous tests, I found no anomalies, so YT does not appear to be serving commentRenderer to my IP. I'll be able to fix the heartState key issue tomorrow.

@shoxie007

@jakeogh I'll venture a guess as to why you have such a seamless experience: You're close to Youtube's main servers and have a fast or well-networked connection. Therefore, Youtube delivers the response-JSONs perfectly the first time. You don't get any incomplete responses. Or it could be that Youtube is sending commentRenderer responses to certain geo-locations only, whereas in the US, it only uses commentViewModel. If you're determined to replicate the issue, maybe try a proxy server in another geo-location if you have access to one.

In any case, there is no harm in applying minamotorin's fix. It makes the data-integrity check of the response-JSON more thorough. Honestly, I don't know WHY it works, just that it does. When it's:
check_get_keys = ...... else ('commentThreadRenderer', 'commentViewModel') .....
and _extract_response is called, it checks that either the commentThreadRenderer or the commentViewModel key is present. But when it's changed to:
check_get_keys = ...... else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer') .....
it checks that any one of the three keys is present. However, if commentThreadRenderer is present, does it even check that commentRenderer is present too? It seems redundant to add it, but apparently it does help somehow. @minamotorin, if you're around, would you kindly explain how/why adding it improves the robustness of data extraction? I can't quite make sense of the chain of code which is executed in the extractor. It gets confusing to me at this line:

if not traverse_obj(response, *variadic(check_get_keys)):

@minamotorin commented May 6, 2024

would you kindly explain how/why adding it improves robustness of data extraction?

Okay, the simple reason is

the old format sometimes (not always) has neither commentThreadRenderer nor commentViewModel but has commentRenderer

In such cases,

check_get_keys = ...... else ('commentThreadRenderer', 'commentViewModel') ..... 

treats the response as incomplete, because the response has neither commentThreadRenderer nor commentViewModel.

To be more specific, here is which keys are used.

                      reply threads      other than replies
old format response   commentRenderer    commentThreadRenderer
new format response   commentViewModel   commentThreadRenderer

The reply threads of an old-format response are the case where commentRenderer is required in check_get_keys.
If the response has commentRenderer, it won't have commentThreadRenderer.

If I understand correctly, self._extract_response treats the response as incomplete if the response does not have the keys given by check_get_keys, and ('commentThreadRenderer', 'commentViewModel', 'commentRenderer') means that the response should have at least one of these keys.

The code

if not traverse_obj(response, *variadic(check_get_keys)):

tests whether traverse_obj(response, *variadic(check_get_keys)) succeeds for the response.
In other words, check_get_keys uses the same path syntax as traverse_obj to check whether the response has the keys.
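A minimal sketch of that mechanism, assuming heavily simplified, hypothetical response shapes (not the real innertube layout): a tuple inside a traverse_obj path branches over alternative keys, so the check passes as soon as any one of them matches.

from yt_dlp.utils import traverse_obj

old_format = {'continuationItems': [{'commentRenderer': {'commentId': 'abc'}}]}
new_format = {'continuationItems': [{'commentViewModel': {'commentKey': 'key1'}}]}

before = ('continuationItems', ..., ('commentThreadRenderer', 'commentViewModel'))
after = ('continuationItems', ..., ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))

print(bool(traverse_obj(old_format, before)))  # False -> treated as "Incomplete data received"
print(bool(traverse_obj(old_format, after)))   # True  -> old-format reply threads now pass the check
print(bool(traverse_obj(new_format, after)))   # True  -> new format is still accepted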

@coletdjnz coletdjnz self-requested a review May 7, 2024 10:08
@minamotorin

@bbilly1

Did you encounter any problems when using yt-dlp or is that just an observation in the browser?

This is just an observation in the browser.
My comment was just meant to share information, not to report a bug.

@coletdjnz

Extracting it in another lang would be complex/difficult, so I wouldn't worry about it in this PR.

I agree. This PR seems to work fine now.

@githb123

@pukkandan When is this going to be released? It's been over 2 months now since the problem became known. A working pull request has been sitting for a week. I bet many of us see this feature as second in importance only to downloading the videos themselves.

@themodfather360

@pukkandan When is this going to be released? It's been over 2 months now since the problem became known. A working pull request has been sitting for a week. I bet many of us see this feature as second in importance only to downloading the videos themselves.

Seconded. It says 1 pending reviewer. I don't know how to compile the binary myself, otherwise I would.

@shoxie007 commented May 14, 2024

Seconded. It says 1 pending reviewer. I don't know how to compile the binary myself, otherwise I would.

If you're on UNIX or have WSL for Windows:

Download the source code:
https://github.com/yt-dlp/yt-dlp-nightly-builds/archive/refs/tags/2024.05.13.232704.zip

Unzip it. Enter the folder. Open the terminal. Run command:
make yt-dlp

It should compile to a binary "yt-dlp", which you can then copy and run in another folder using:
./yt-dlp ...

You could even run yt-dlp without compiling. Just enter the above-mentioned folder (with the source code) and run:
./yt-dlp.sh ....
though obviously this means running yt-dlp in the same folder as the source code.

EDIT: I forgot to mention that you have to replace youtube.py in the folder yt_dlp/extractor with jakeogh's extractor, so that you have the revised comments-extraction code:
https://github.com/yt-dlp/yt-dlp/blob/5a3a4f13d4e55b077bae31674d21a596dbfcd4b9/yt_dlp/extractor/youtube.py

@bashonly (Member) left a comment

Thanks for the PR, here's an initial review. There are a lot of comments and some are auto-hidden by GitHub, so please make sure to address all of them.

yt_dlp/extractor/youtube.py: 10 inline review comments (outdated, resolved)
@jakeogh (Contributor, Author) commented May 15, 2024

Whew! @bashonly, I sincerely appreciate the detailed and easy-to-follow review. All of the requested changes have been made; the only outstanding issues are:

  1. When the old comment extraction code is used (youtube served me the old format a few times in my testing), I noticed if the like_count is 0, the like_count key is omitted from the .info.json file. Should the like_count == 0 key be omitted in the new code path as well?
  2. My note on the author_is_verified commit above. Should a key be omitted if it is false? author_is_uploader is included in the json output whether it's true or false.
  3. As I noted in [ie/youtube] Extract comments with or without new format #9775 (comment) I am unable to test the commentRenderer code path, so hopefully @shoxie007 and @minamotorin can give it a test.

@shoxie007 commented May 15, 2024

@jakeogh Thanks, will look into this. I'm also wondering if @bashonly truly intends that all those keys - author_is_verified, author_is_uploader, like_count etc - should be omitted if false or 0. In the old extractor, like_count and author_is_uploader were present at all times, even if value was 0 or false. But upon further reflection, perhaps it's good to omit certain keys if their value is null, especially if their value is null almost all the time. I still think like_count should stay though, even if 0.

@pukkandan (Member) commented May 15, 2024

  1. When the old comment extraction code is used (youtube served me the old format a few times in my testing), I noticed if the like_count is 0, the like_count key is omitted from the .info.json file. Should the like_count == 0 key be omitted in the new code path as well?

No. Sounds like a bug in old code. 0 should be returned. Omitting a key or setting it to None means that the field was not extracted. Setting like_count=0 on the other hand means "there are no likes" and is a separate case.

  2. My note on the author_is_verified commit above. Should a key be omitted if it is false? author_is_uploader is included in the json output whether it's true or false.

Similarly to the above, False is different from None (which is the same as unset) and should be returned.
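To put the convention in code (an illustrative sketch, not yt-dlp code): a key should be present whenever the value was actually extracted, even if that value is falsy; only a field that could not be extracted is left out or set to None.

info = {}

like_count = 0               # extracted: the comment simply has no likes
if like_count is not None:
    info['like_count'] = like_count                    # keep 0, it is real data

author_is_verified = False   # extracted: the author is not verified
if author_is_verified is not None:
    info['author_is_verified'] = author_is_verified    # keep False, it is real data

missing_field = None         # extraction failed, so leave the key out entirely
if missing_field is not None:
    info['missing_field'] = missing_field

print(info)  # {'like_count': 0, 'author_is_verified': False}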

Comment on lines 3311 to 3358
comment_entity_payload = get_first(entities, ('payload', 'commentEntityPayload', {dict}))
toolbar_entity_payload = get_first(entities, ('payload', 'engagementToolbarStateEntityPayload', {dict})) or {}
comment_id = traverse_obj(comment_entity_payload, ('properties', 'commentId', {str}))
if not comment_id:
    return

info = {
    'id': comment_id,
    'parent': parent or 'root',
    **traverse_obj(comment_entity_payload, {
        'text': ('properties', 'content', 'content', {str}),
        'like_count': ('toolbar', 'likeCountA11y', {parse_count}),
        'author_id': ('author', 'channelId', {self.ucid_or_none}),
        'author': ('author', 'displayName', {str}),
        'author_thumbnail': ('author', 'avatarThumbnailUrl', {url_or_none}),
    }),
}

# Timestamp is an estimate calculated from the current time and time_text
time_text = traverse_obj(comment_entity_payload, ('properties', 'publishedTime', {str})) or ''
timestamp = self._parse_time_text(time_text)

info.update({
    # FIXME: non-standard, but we need a way of showing that it is an estimate.
    '_time_text': time_text,
    'timestamp': timestamp,
})

info['author_url'] = urljoin(
    'https://www.youtube.com',
    traverse_obj(comment_entity_payload, ('author', 'channelCommand', 'innertubeCommand', (
        ('browseEndpoint', 'canonicalBaseUrl'), ('commandMetadata', 'webCommandMetadata', 'url'))),
        expected_type=str, get_all=False))

author_is_uploader = traverse_obj(comment_entity_payload, ('author', 'isCreator', {bool}))
if author_is_uploader is not None:
    info['author_is_uploader'] = author_is_uploader

if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
    info['is_favorited'] = True

if traverse_obj(comment_entity_payload, ('author', 'isVerified', {bool})):
    info['author_is_verified'] = True

if traverse_obj(view_model, 'pinnedText'):
    info['is_pinned'] = True

return info
@pukkandan (Member) May 15, 2024

Suggested change
comment_entity_payload = get_first(entities, ('payload', 'commentEntityPayload', {dict}))
if not (comment_id := traverse_obj(comment_entity_payload, ('properties', 'commentId', {str}))):
    return
toolbar_entity_payload = get_first(entities, ('payload', 'engagementToolbarStateEntityPayload', {dict}))
time_text = traverse_obj(comment_entity_payload, ('properties', 'publishedTime', {str})) or ''
return {
    'id': comment_id,
    'parent': parent or 'root',
    **traverse_obj(comment_entity_payload, {
        'text': ('properties', 'content', 'content', {str}),
        'like_count': ('toolbar', 'likeCountA11y', {parse_count}),
        'author_id': ('author', 'channelId', {self.ucid_or_none}),
        'author': ('author', 'displayName', {str}),
        'author_thumbnail': ('author', 'avatarThumbnailUrl', {url_or_none}),
        'author_is_uploader': ('author', 'isCreator', {bool}),
        'author_is_verified': ('author', 'isVerified', {bool}),
        'author_url': ('author', 'channelCommand', 'innertubeCommand', (
            ('browseEndpoint', 'canonicalBaseUrl'), ('commandMetadata', 'webCommandMetadata', 'url')
        ), {lambda x: urljoin('https://www.youtube.com', x)}),
    }, get_all=False),
    'is_favorited': (None if toolbar_entity_payload is None else  # TODO(before merge): Is this logic correct?
                     toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED'),
    'is_pinned': traverse_obj(view_model, ('pinnedText', {bool})),
    '_time_text': time_text,  # FIXME: non-standard, but we need a way of showing that it is an estimate.
    'timestamp': self._parse_time_text(time_text),
}

Member

per comments above, do we need this?

                'like_count': ('toolbar', 'likeCountA11y', {parse_count}, {lambda x: x or 0}),

Member

Can you explain? parse_count returns 0 correctly:

>>> parse_count('0')
    0

@bashonly (Member) May 15, 2024

I was basing this off the feedback from @shoxie007, who implied that the current code omits like_count instead of returning a 0 value

Member

            'is_pinned': view_model.get('pinnedText'),

someone should confirm this is always a boolean value in the youtube response, or else bool()/bool_or_none() this

Member

someone should confirm this is always a boolean value in the youtube response, or else bool()/bool_or_none() this

You are right. Even if it's always bool, we should do the type check anyway. Updated my suggestion to

-            'is_pinned': view_model.get('pinnedText'),
+            'is_pinned': traverse_obj(view_model, ('pinnedText', {bool})),

On a side note, view_model is being passed in here only for this one value. It may be better to compute this in the parent function instead.
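For reference, a tiny sketch of what the {bool} filter does in a traverse_obj path (toy values, relying on the documented set-as-type-filter behaviour):

from yt_dlp.utils import traverse_obj

print(traverse_obj({'pinnedText': True}, ('pinnedText', {bool})))      # True -> kept, it is a bool
print(traverse_obj({'pinnedText': 'Pinned'}, ('pinnedText', {bool})))  # None -> filtered out, not a bool
print(traverse_obj({}, ('pinnedText', {bool})))                        # None -> key missing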

@bashonly (Member) commented May 15, 2024

I'm also wondering if @bashonly truly intends that all those keys - author_is_verified, author_is_uploader, like_count etc - should be omitted if false or 0. In the old extractor, like_count and author_is_uploader were present at all times, even if value was 0 or false. But upon further reflection, perhaps it's good to omit certain keys if their value is null, especially if their value is null almost all the time. I still think like_count should stay though, even if 0.

@shoxie007

re: like_count, I wasn't aware that the current code would omit the field if we didn't manually fallback to 0 -- can you confirm this is the case?

re: author_is_verified and is_pinned, I was just mirroring old code, which only set them to True or else omitted them.

author_is_uploader and is_favorited were oversights on my part, my bad.

pukkandan's suggestion above should fix everything (except maybe like_count?)

@shoxie007

@bashonly

re: like_count, I wasn't aware that the current code would omit the field if we didn't manually fallback to 0 -- can you confirm this is the case?

re: author_is_verified and is_pinned, I was just mirroring old code, which only set them to True or else omitted them.

author_is_uploader and is_favorited were oversights on my part, my bad.

pukkandan's suggestion above should fix everything (except maybe like_count?)

I had downloaded comments in January, before YouTube made changes and the extractor broke. I had used yt-dlp v2024.01.05.232702. This is what a yt-dlp-generated comment dict looked like:

{
        "id": "Ugyv3aJcyIlwjWWWvPp4AaABAg",
        "text": "Super video!!!!!!! Me gust\u00f3!",
        "like_count": null,
        "author_id": "UCYbXPcWIN9fxGKxCELWjapQ",
        "author": "@fcp1955",
        "author_thumbnail": "https://yt3.ggpht.com/ytc/AIf8zZQBkz1GQX7uX41M8pmfiimGylV2P8rw6ctq15tMGQ=s176-c-k-c0x00ffffff-no-rj",
        "parent": "root",
        "_time_text": "1 month ago",
        "timestamp": 1703376000,
        "author_url": "https://www.youtube.com/channel/UCYbXPcWIN9fxGKxCELWjapQ",
        "author_is_uploader": false,
        "is_favorited": false
}

@bashonly (Member)

so your comment about like_count was only in reference to old code?

Comment on lines +3333 to +3334
'is_favorited': (None if toolbar_entity_payload is None else  # TODO(before merge): Is this logic correct?
                 toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED'),
Member

This todo is to be addressed

@shoxie007 commented May 15, 2024

so your comment about like_count was only in reference to old code?

Yes, I meant that the old extractor used to display like_count at all times in the dict, whereas I thought you meant that the new extractor should omit keys which resolve to 0 or false.

Labels
site-bug Issue with a specific website
Projects
Status: Release blocker
Development

Successfully merging this pull request may close these issues.

[YouTube] comments are not downloading
9 participants