[ie/taptap] Add new extractor for taptap.cn #9776

Open
wants to merge 3 commits into
base: master

c-basalt
Contributor

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Add support for video extraction from taptap.cn posts and game topic pages

Fixes #9643

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

seproDev added the site-request label (Request to support a new website) on Apr 27, 2024
Collaborator

seproDev left a comment

Not a complete review, just some initial feedback.

Comment on lines 123 to 128
webpage = self._download_webpage(url, list_id)
nuxt_data = self._deserialize_nuxt_data(self._search_json(
    r'<script[^>]+\bid=["\']__NUXT_DATA__["\'][^>]*>', webpage,
    'nuxt data', list_id, contains_pattern=r'\[(?s:.+)\]'))[1]
x_ua = traverse_obj(nuxt_data, (
    'state', '$sbff', ..., {lambda x: parse_qs(x)['X-UA']}, ...), get_all=False)

I think it would be better to call the API directly instead of parsing the hydration data:
https://www.taptap.cn/webapiv2/app/v4/detail
https://www.taptap.cn/webapiv2/moment/v3/detail
I haven't looked closely at the structure of the returned data, but it could make sense to move _extract_video into a base class and have separate extractors for moment and app content.
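To illustrate the base-class idea, here is a minimal standalone sketch. It does not use yt-dlp's InfoExtractor; the class names, the `_api_url` helper, and the `id` query parameter are hypothetical. Only the two endpoint paths and the `X-UA` key come from this thread.

```python
from urllib.parse import urlencode

# Base of the two endpoints mentioned in the comment above.
API_BASE = 'https://www.taptap.cn/webapiv2'


class TapTapBase:
    """Hypothetical shared base; a real extractor would subclass InfoExtractor."""

    _API_PATH = None  # each subclass points at its own detail endpoint

    def _api_url(self, item_id, x_ua):
        # ASSUMPTION: the 'id' parameter name is a guess, not a confirmed API detail.
        # 'X-UA' is the key the PR currently reads out of the hydration data.
        query = urlencode({'id': item_id, 'X-UA': x_ua})
        return f'{API_BASE}/{self._API_PATH}?{query}'


class TapTapMoment(TapTapBase):
    _API_PATH = 'moment/v3/detail'


class TapTapApp(TapTapBase):
    _API_PATH = 'app/v4/detail'
```

With this layout, shared logic such as _extract_video would live on the base class, and each subclass would only differ in its endpoint and in how it walks the response.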

Comment on lines 115 to 116
if '.m3u8' in video['url']:
    video['formats'] = self._extract_m3u8_formats(video.pop('url'), video_id)

I would extract both the AVC and HEVC formats every time. I think the additional request is worth it. Or is there a reason not to, such as excessive rate limiting?
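A rough sketch of what "extract both every time" could look like, written as a standalone function. The `manifest_urls` mapping and the `collect_formats` name are hypothetical, and how the site actually exposes the second manifest is not established in this thread. In the extractor, the `extract_formats` callable would be played by something like self._extract_m3u8_formats.

```python
def collect_formats(manifest_urls, extract_formats):
    """Merge the formats from every available manifest into one list.

    manifest_urls: dict mapping a codec label to a manifest URL
        (ASSUMPTION: this shape is illustrative, not the site's real response).
    extract_formats: callable(url) -> list of format dicts.
    """
    formats = []
    for codec, url in manifest_urls.items():
        if not url:
            continue  # skip codecs the site did not return a manifest for
        for fmt in extract_formats(url):
            # Prefix the format_id so AVC and HEVC variants stay distinguishable.
            formats.append({**fmt, 'format_id': f"{codec}-{fmt['format_id']}"})
    return formats
```

The cost is one extra manifest request per video, which is the trade-off the question above is asking about.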

yt_dlp/extractor/taptap.py (outdated)


class TapTapIE(InfoExtractor):
    _VALID_URL = r'https?://www\.taptap\.cn/(?P<section>moment|app)/(?P<id>\d+)'

Could we also support taptap.io? The site structure is very similar. I think this should be doable with only minimal adjustments.
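One possible shape for the pattern, as a standalone sketch: it assumes taptap.io uses the same moment/app path layout as taptap.cn, which would need to be checked against real URLs. The `domain` group and the `parse_taptap_url` helper are illustrative names only.

```python
import re

# ASSUMPTION: taptap.io shares the /moment/<id> and /app/<id> layout of taptap.cn.
VALID_URL = r'https?://www\.taptap\.(?P<domain>cn|io)/(?P<section>moment|app)/(?P<id>\d+)'


def parse_taptap_url(url):
    """Return the domain, section and id of a matching URL, or None."""
    mobj = re.match(VALID_URL, url)
    return mobj.groupdict() if mobj else None
```

Capturing the domain as a named group would let the API host be picked per site without duplicating the extractor.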

seproDev added the pending-fixes label (PR has had changes requested) on May 9, 2024
Labels
pending-fixes (PR has had changes requested), site-request (Request to support a new website)
Development

Successfully merging this pull request may close these issues.

Site request: taptap.cn
2 participants