[ie/youtube] Extract comments with or without new format #9775
base: master
Conversation
cc @shoxie007

Thanks for the effort. I had wanted to do this but was pressed for time. What you've posted is quite elegant and effective, so it saves me the trouble. I'd like to propose some changes:

No problem, thanks for the review. Original patch author @minamotorin also made a comment about the confusing naming; I replaced the dict access with (I think) correctly typed calls.
I read and re-read @minamotorin's comment and found it intriguing:
I put this definition to the test. I loaded a video with a logged-in Youtube account, as this is the only way the likeCountLiked field can mean anything. I hit the like button for one or more comments, re-loaded the comments, then studied the JSON response. Here is what the fields in the JSON mean:
I tested your extractor as it currently stands and it reflected these values. So I'll repeat what I wrote in my first comment: please obtain the like_count from the key likeCountA11y instead of likeCountNotliked. If likeCountNotliked is used, an erroneous value will be returned when the user passes a cookie to yt-dlp (i.e. a logged-in session) and yt-dlp obtains the comments data while logged in: the like_count can be one less than it should be. I tested and verified this myself with the extractor as it currently stands. My suggestion: Also suggest expressing this:
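To illustrate the off-by-one, a minimal sketch; the field names come from the JSON discussed above, but the sample values are made up:

```python
# Hypothetical toolbar fragment for a comment the logged-in user has liked.
# Only the field names come from the real JSON; the values are invented.
toolbar = {
    'likeCountLiked': '5',       # total shown while the user's like is active
    'likeCountNotliked': '4',    # total excluding the user's own like
    'likeCountA11y': '5 likes',  # accessibility label: the true total
}

# Reading likeCountNotliked under-reports by one in this logged-in state:
print(int(toolbar['likeCountNotliked']))         # 4
# Parsing likeCountA11y gives the correct count:
print(int(toolbar['likeCountA11y'].split()[0]))  # 5
```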
Aha! Thanks for sorting that out. I fixed both issues.
I think the current implementation does not work with the old format. See yt_dlp/extractor/youtube.py, lines 3565 to 3572 at 3ef6517.
However, I can't get an old-format response now, so I can't test it.
Now that you bring it up, I did get the "Incomplete data received..." error a few times when downloading comments for some videos, even with the commentViewModel JSON response. I wonder if it had anything to do with this bit of code. I'll study and test it further myself and see what the issue is...
This fix may work:

```diff
-                'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel'))]]
+                'commentsHeaderRenderer' if is_first_continuation else ('commentThreadRenderer', 'commentViewModel', 'commentRenderer'))]]
```
I have been testing this. I wasn't able to access it directly, so I've been playing around; something like this should do it:

```diff
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index b4f33e7f7..6b87ab55d 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -3306,7 +3306,7 @@ def _extract_heatmap(self, data):
                 'value': ('intensityScoreNormalized', {float_or_none}),
             })) or None

-    def _extract_comment(self, view_model, entity, parent=None):
+    def _extract_comment(self, view_model, entity, entity_payloads, parent=None):
         entity_payload = traverse_obj(entity, ('payload', 'commentEntityPayload', {dict}))
         comment_id = entity_payload.get('properties').get('commentId')

@@ -3344,10 +3344,12 @@ def _extract_comment(self, view_model, entity, parent=None):
         if author_is_uploader is not None:
             info['author_is_uploader'] = author_is_uploader

-        comment_abr = traverse_obj(
-            entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'), expected_type=str)
-        if comment_abr is not None:
-            info['is_favorited'] = comment_abr == 'TOOLBAR_HEART_STATE_HEARTED'
+        toolbar_state_key = entity_payload.get('properties', {}).get('toolbarStateKey')
+        if toolbar_state_key:
+            tool_bar_entity = next((d for d in entity_payloads if d.get('entityKey') == toolbar_state_key), None)
+            if tool_bar_entity:
+                heart_state = traverse_obj(tool_bar_entity, ('payload', 'engagementToolbarStateEntityPayload', 'heartState'))
+                info['is_favorited'] = heart_state == 'TOOLBAR_HEART_STATE_HEARTED'

         info['author_is_verified'] = traverse_obj(entity_payload, ('author', 'isVerified')) == 'true'

@@ -3470,7 +3472,7 @@ def extract_thread(contents, entity_payloads):
                     entity = entity
                     break

-                comment = self._extract_comment(view_model, entity, parent)
+                comment = self._extract_comment(view_model, entity, entity_payloads, parent)
                 if comment.get('is_pinned'):
                     tracker['pinned_comment_ids'].add(comment_id)
```

But unfortunately this requires some looping through the array for each comment to find the matching entity.
In such trying situations, look to our Lord and Savior: traverse_obj. How I would re-write everything:

```python
# In _comment_entries >> extract_thread:
# --------------------------------------
# .... Existing code remains as is
comment_key = view_model.get("commentKey")
toolbar_state_key = view_model.get("toolbarStateKey")

# This usage of traverse_obj returns a list of relevant entities
# - NOTE: v["entityKey"] in [comment_key, toolbar_state_key]
#   is shorthand for: v["entityKey"] == comment_key or v["entityKey"] == toolbar_state_key
entities = traverse_obj(entity_payloads, lambda _, v: v["entityKey"] in [comment_key, toolbar_state_key])

# Call _extract_comment using "entities" instead of the former "entity"
comment = self._extract_comment(view_model, entities, parent)


# Then in _extract_comment
# ------------------------
def _extract_comment(self, view_model, entities, parent=None):  # change "entity" to "entities"
    comment_entity_payload = traverse_obj(entities, (..., 'payload', 'commentEntityPayload', {dict}), get_all=False)
    toolbar_entity_payload = traverse_obj(entities, (..., 'payload', 'engagementToolbarStateEntityPayload', {dict}), get_all=False)
    # ....
    # NOTE: "entity_payload" should be changed to "comment_entity_payload" in existing code
    # ....
    if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
        info['is_favorited'] = True
    # ....
```
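The filtering step above can be sketched as a runnable snippet; a plain list comprehension stands in for traverse_obj's lambda form, and the entity shapes here are invented for illustration:

```python
# Invented sample of the frameworkUpdates entity list (shapes are assumptions)
entity_payloads = [
    {'entityKey': 'comment-abc', 'payload': {'commentEntityPayload': {'properties': {'commentId': 'abc'}}}},
    {'entityKey': 'toolbar-abc', 'payload': {'engagementToolbarStateEntityPayload': {'heartState': 'TOOLBAR_HEART_STATE_HEARTED'}}},
    {'entityKey': 'comment-xyz', 'payload': {'commentEntityPayload': {'properties': {'commentId': 'xyz'}}}},
]
comment_key, toolbar_state_key = 'comment-abc', 'toolbar-abc'

# Equivalent of: traverse_obj(entity_payloads, lambda _, v: v['entityKey'] in [...])
entities = [v for v in entity_payloads if v['entityKey'] in (comment_key, toolbar_state_key)]
print(len(entities))  # 2

# Equivalent of the get_all=False lookups: the first matching payload wins
toolbar_payload = next(
    (e['payload']['engagementToolbarStateEntityPayload']
     for e in entities if 'engagementToolbarStateEntityPayload' in e['payload']), {})
print(toolbar_payload.get('heartState'))  # TOOLBAR_HEART_STATE_HEARTED
```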
Are you still with us @jakeogh? Would you kindly integrate the changes proposed?

I've tested the code on numerous videos. It's working. Let's get this pull request merged ASAP. People have been asking and wondering about the broken comments extraction.
Thanks for the ping; I have been without time this week, but I'll be able to dig back into this tonight. I have tested this on ~1k videos (some very old) and haven't been able to trigger an issue as far as I can tell, but I haven't verified this. I need to go back over my results. Is there a way to find an ID that triggers the situation where the old commentRenderer format is returned?
Youtube isn't consistent in which videos it uses the commentRenderer model for; it uses it at random. My own experience is that about 10-20% of videos will use the commentRenderer model in the comments-response JSON when downloading an entire channel, for example. I think people's experience may vary depending on geo-location; I'm not sure. Without @minamotorin's fix, when I obtain the comments for an entire channel with hundreds of videos, I routinely get the "Incomplete data received" error. An example of terminal output for such a video:
I think you may be able to replicate the issue. Here, run this command (using the build which doesn't have the fix suggested above) in Unix to download comments for the first 100 videos of the Whatifalthist channel. This command will format the output, so that it's more readable, and will print the terminal output to a log file for later examination. It might take a while to download everything, for which reason I included a log file which can be studied later. As for the second fix which I suggested in response to the issue which @bbilly1 raised, it should be obvious why it's necessary: the heartState key (which reveals whether or not a comment has been hearted) is in another entity (containing engagementToolbarStateEntityPayload), not in the entity containing commentEntityPayload. So two entities need to be extracted and passed to _extract_comment.
@shoxie007 I really appreciate the detailed reproducer and explanation. I've added the fix from @minamotorin; before that, however, I ran the suggested tests and was unable to reproduce the issue. My IP geolocates to AZ, USA:

A manual check of the 100 downloads turned up nothing. Checking my previous tests, I found no anomalies, so YT does not appear to be serving the old format to me.
@jakeogh I'll venture a guess as to why you have such a seamless experience: you're close to Youtube's main servers and have a fast or well-networked connection, so Youtube delivers the response JSONs perfectly the first time and you don't get any incomplete responses. Or it could be that Youtube is sending commentRenderer responses to certain geo-locations only, whereas in the US it only uses commentViewModel. If you're determined to replicate the issue, maybe try a proxy server in another geo-location if you have access to one. In any case, there is no harm in applying minamotorin's fix. It makes the data-integrity check of the response JSON more thorough. Honestly, I don't know WHY it works, just that it does, when the check is: `if not traverse_obj(response, *variadic(check_get_keys)):`
Okay, the simple reason is as follows.

In such cases, `check_get_keys = ... else ('commentThreadRenderer', 'commentViewModel') ...` treats the response as incomplete because the response does not have both keys. To be more specific, here is which keys are used.

The reply threads of an old-format response appear when commentRenderer is used. If I understand correctly, the code `if not traverse_obj(response, *variadic(check_get_keys)):` tests whether the response contains any of the expected keys.
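To make the key check concrete, here is a self-contained sketch; `response_complete` is a hypothetical stand-in for the real traverse_obj call over the full innertube path, and the response shapes are invented:

```python
# Each item in `contents` carries exactly one renderer-type key.
# New-format reply continuation:
new_format = {'contents': [{'commentViewModel': {}}]}
# Old-format reply continuation, which the two-key check rejected:
old_format = {'contents': [{'commentRenderer': {}}]}

def response_complete(response, allowed_keys):
    # Stand-in for: traverse_obj(response, *variadic(check_get_keys)) --
    # the response counts as complete if any item matches any allowed key
    return any(key in item for item in response['contents'] for key in allowed_keys)

old_keys = ('commentThreadRenderer', 'commentViewModel')
new_keys = ('commentThreadRenderer', 'commentViewModel', 'commentRenderer')

print(response_complete(old_format, old_keys))  # False: wrongly flagged incomplete
print(response_complete(old_format, new_keys))  # True: accepted after the fix
print(response_complete(new_format, new_keys))  # True
```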
This is just an observation in the browser.
I agree. This PR seems to work fine now.
@pukkandan When is this going to be released? It's been over two months since the problem became known, and a working pull request has been sitting for a week. I bet many of us see this feature as second in criticality only to downloading the videos themselves.

Seconded. It says 1 pending reviewer. I don't know how to compile the binary myself, otherwise I would.
If you're on UNIX or have WSL for Windows: download the source code, unzip it, enter the folder, open the terminal, and run the build command. It should compile to a binary "yt-dlp", which you can then copy and run in another folder. You could even run yt-dlp without compiling: just enter the above-mentioned folder (with the source code) and run it from there. EDIT: I forgot to include that you have to replace youtube.py in yt_dlp/extractor with jakeogh's extractor, so that you have the revised comments-extraction code.
Thanks for the PR, here's an initial review. There are a lot of comments and some are auto-hidden by github; please make sure to address all of them.
Whew! @bashonly I sincerely appreciate the detailed and easy-to-follow review. All of the requested changes were made; the only outstanding issues are:
@jakeogh Thanks, will look into this. I'm also wondering if @bashonly truly intends that all those keys -
No. Sounds like a bug in old code.
Similar to above,
yt_dlp/extractor/youtube.py (outdated):

```python
comment_entity_payload = get_first(entities, ('payload', 'commentEntityPayload', {dict}))
toolbar_entity_payload = get_first(entities, ('payload', 'engagementToolbarStateEntityPayload', {dict})) or {}

comment_id = traverse_obj(comment_entity_payload, ('properties', 'commentId', {str}))
if not comment_id:
    return

info = {
    'id': comment_id,
    'parent': parent or 'root',
    **traverse_obj(comment_entity_payload, {
        'text': ('properties', 'content', 'content', {str}),
        'like_count': ('toolbar', 'likeCountA11y', {parse_count}),
        'author_id': ('author', 'channelId', {self.ucid_or_none}),
        'author': ('author', 'displayName', {str}),
        'author_thumbnail': ('author', 'avatarThumbnailUrl', {url_or_none}),
    }),
}

# Timestamp is an estimate calculated from the current time and time_text
time_text = traverse_obj(comment_entity_payload, ('properties', 'publishedTime', {str})) or ''
timestamp = self._parse_time_text(time_text)

info.update({
    # FIXME: non-standard, but we need a way of showing that it is an estimate.
    '_time_text': time_text,
    'timestamp': timestamp,
})

info['author_url'] = urljoin(
    'https://www.youtube.com',
    traverse_obj(comment_entity_payload, ('author', 'channelCommand', 'innertubeCommand', (
        ('browseEndpoint', 'canonicalBaseUrl'), ('commandMetadata', 'webCommandMetadata', 'url'))),
        expected_type=str, get_all=False))

author_is_uploader = traverse_obj(comment_entity_payload, ('author', 'isCreator', {bool}))
if author_is_uploader is not None:
    info['author_is_uploader'] = author_is_uploader

if toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED':
    info['is_favorited'] = True

if traverse_obj(comment_entity_payload, ('author', 'isVerified', {bool})):
    info['author_is_verified'] = True

if traverse_obj(view_model, 'pinnedText'):
    info['is_pinned'] = True

return info
```
Suggested change:

```python
comment_entity_payload = get_first(entities, ('payload', 'commentEntityPayload', {dict}))
if not (comment_id := traverse_obj(comment_entity_payload, ('properties', 'commentId', {str}))):
    return

toolbar_entity_payload = get_first(entities, ('payload', 'engagementToolbarStateEntityPayload', {dict}))
time_text = traverse_obj(comment_entity_payload, ('properties', 'publishedTime', {str})) or ''

return {
    'id': comment_id,
    'parent': parent or 'root',
    **traverse_obj(comment_entity_payload, {
        'text': ('properties', 'content', 'content', {str}),
        'like_count': ('toolbar', 'likeCountA11y', {parse_count}),
        'author_id': ('author', 'channelId', {self.ucid_or_none}),
        'author': ('author', 'displayName', {str}),
        'author_thumbnail': ('author', 'avatarThumbnailUrl', {url_or_none}),
        'author_is_uploader': ('author', 'isCreator', {bool}),
        'author_is_verified': ('author', 'isVerified', {bool}),
        'author_url': ('author', 'channelCommand', 'innertubeCommand', (
            ('browseEndpoint', 'canonicalBaseUrl'), ('commandMetadata', 'webCommandMetadata', 'url')
        ), {lambda x: urljoin('https://www.youtube.com', x)}),
    }, get_all=False),
    'is_favorited': (None if toolbar_entity_payload is None else  # TODO(before merge): Is this logic correct?
                     toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED'),
    'is_pinned': traverse_obj(view_model, ('pinnedText', {bool})),
    '_time_text': time_text,  # FIXME: non-standard, but we need a way of showing that it is an estimate.
    'timestamp': self._parse_time_text(time_text),
}
```
per comments above, do we need this?

```python
'like_count': ('toolbar', 'likeCountA11y', {parse_count}, {lambda x: x or 0}),
```
Can you explain? `parse_count` returns `0` correctly:

```python
>>> parse_count('0')
0
```
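For reference, a simplified sketch of how such strings parse; `parse_count_sketch` is a hypothetical approximation, not the real `yt_dlp.utils.parse_count` (which also handles localized strings):

```python
import re

_SUFFIXES = {'k': 10 ** 3, 'm': 10 ** 6, 'b': 10 ** 9}

def parse_count_sketch(s):
    # Approximation of parse_count: '0' -> 0, '1.2K' -> 1200, garbage -> None
    if s is None:
        return None
    match = re.fullmatch(r'([\d.,]+)\s*([kmb]?)', s.strip().lower())
    if not match:
        return None
    number = float(match.group(1).replace(',', ''))
    return int(number * _SUFFIXES.get(match.group(2), 1))

print(parse_count_sketch('0'))     # 0
print(parse_count_sketch('1.2K'))  # 1200
```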
I was basing this off the feedback from @shoxie007, who implied that the current code omits `like_count` instead of returning a `0` value.
```python
'is_pinned': view_model.get('pinnedText'),
```

Someone should confirm this is always a boolean value in the youtube response, or else `bool()`/`bool_or_none()` this.
> Someone should confirm this is always a boolean value in the youtube response, or else `bool()`/`bool_or_none()` this.

You are right. Even if it's always bool, we should do the type check anyway. Updated my suggestion to:

```diff
- 'is_pinned': view_model.get('pinnedText'),
+ 'is_pinned': traverse_obj(view_model, ('pinnedText', {bool})),
```

On a side note, `view_model` is being passed in here only for this one value. It may be better to compute this in the parent function instead.
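The `{bool}` type filter can be sketched as follows; `get_typed` is a hypothetical mini version of traverse_obj's type-filter form, and the sample dicts are made up:

```python
# Hypothetical mini version of traverse_obj's ('key', {type}) form:
# keep the value only if it already has the expected type, else None
def get_typed(d, key, expected_type):
    value = d.get(key)
    return value if isinstance(value, expected_type) else None

view_model_bool = {'pinnedText': True}                 # well-behaved response
view_model_str = {'pinnedText': 'Pinned by Author'}    # a non-bool sneaks in

print(get_typed(view_model_bool, 'pinnedText', bool))  # True
print(get_typed(view_model_str, 'pinnedText', bool))   # prints None
```

The type check guarantees `is_pinned` is either a real boolean or absent, rather than leaking an arbitrary value into the output dict.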
Re: the points above, pukkandan's suggestion should fix everything, except maybe one open item.
I had downloaded comments in January, before Youtube made changes and the extractor broke. I had used yt-dlp v2024.01.05.232702. This is what a yt-dlp-generated comment dict looked like:

```json
{
    "id": "Ugyv3aJcyIlwjWWWvPp4AaABAg",
    "text": "Super video!!!!!!! Me gust\u00f3!",
    "like_count": null,
    "author_id": "UCYbXPcWIN9fxGKxCELWjapQ",
    "author": "@fcp1955",
    "author_thumbnail": "https://yt3.ggpht.com/ytc/AIf8zZQBkz1GQX7uX41M8pmfiimGylV2P8rw6ctq15tMGQ=s176-c-k-c0x00ffffff-no-rj",
    "parent": "root",
    "_time_text": "1 month ago",
    "timestamp": 1703376000,
    "author_url": "https://www.youtube.com/channel/UCYbXPcWIN9fxGKxCELWjapQ",
    "author_is_uploader": false,
    "is_favorited": false
}
```
so your comment about
```python
'is_favorited': (None if toolbar_entity_payload is None else  # TODO(before merge): Is this logic correct?
                 toolbar_entity_payload.get('heartState') == 'TOOLBAR_HEART_STATE_HEARTED'),
```
This todo is to be addressed
Yes, I meant that the old extractor used to display like_count at all times in the dict, whereas I thought you meant that the new extractor should omit keys which resolve to 0 or false.
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
Youtube's comment format changed; this pull request attempts to handle the new case where the key frameworkUpdates is present, as well as the case without it. The original patch is from @minamotorin, #9358 (comment). Fixes #9358.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?