Add new __NEXT_DATA__ parsing for videos on BBC #9701

kylegustavo · 2024-04-16T21:15:52Z

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

I'm requesting a site-specific feature
I've verified that I have updated yt-dlp to nightly or master (update instructions)
I've checked that all provided URLs are playable in a browser with the same IP and same login details
I've searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
I've read the guidelines for opening an issue
I've read about sharing account credentials and I'm willing to share it if required

Region

United States

Example URLs

https://www.bbc.com/news/uk-68546268
https://www.bbc.com/news/world-middle-east-68778149
https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness

Provide a description that is worded well enough to be understood

The BBC extractor (bbc.py) needs a new clause to capture video from pages that carry their video data in a structure called __NEXT_DATA__ in the page source. The current regex/JSON parsers do not find these videos. In the example below, extraction fails after the last regex parser finds no match.

I have a fix for this, which I will add. It uses an added regex parser to capture a new tag in the page source that looks like
<script id="__NEXT_DATA__" type="application/json">. All the parameters for a video can be found within this tag.
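For illustration only, here is a minimal standard-library sketch of the idea described above: find the __NEXT_DATA__ script tag and decode its JSON body. The function name extract_next_data, the sample page, and the props/pageProps/videoId layout are assumptions made up for this example and are not taken from the real BBC payload or from the proposed fix.

```python
import json
import re

# Matches the body of <script id="__NEXT_DATA__" ...>...</script>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]+id=["\']__NEXT_DATA__["\'][^>]*>(?P<json>.*?)</script>',
    re.DOTALL,
)


def extract_next_data(webpage):
    """Return the decoded __NEXT_DATA__ JSON, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(webpage)
    if match is None:
        return None
    try:
        return json.loads(match.group('json'))
    except json.JSONDecodeError:
        return None


# Hypothetical page; the key layout below is an assumption for the demo.
sample_page = '''<html><body>
<script id="__NEXT_DATA__" type="application/json">{"props": {"pageProps": {"videoId": "p07c6sb6"}}}</script>
</body></html>'''

data = extract_next_data(sample_page)
print(data['props']['pageProps']['videoId'])
```

Returning None instead of raising lets a caller fall through to other extraction strategies when the tag is missing.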

Provide verbose output that clearly demonstrates the problem

Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
If using API, add 'verbose': True to YoutubeDL params instead
Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', 'https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [ff0779267] (source)
[debug] Lazy loading extractors is disabled
[debug] Git HEAD: 315b35442
[debug] Python 3.11.6 (CPython arm64 64bit) - macOS-14.4.1-arm64-arm-64bit (OpenSSL 3.1.4 24 Oct 2023)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.05.07, mutagen-1.47.0, requests-2.31.0, sqlite3-3.44.0, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1810 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: [email protected] from yt-dlp/yt-dlp
yt-dlp is up to date ([email protected] from yt-dlp/yt-dlp)
[bbc] Extracting URL: https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness
[bbc] how-positive-thinking-is-harming-your-happiness: Downloading webpage
ERROR: [bbc] how-positive-thinking-is-harming-your-happiness: Unable to extract playlist data; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/Users/kgonsalves/Repos/yt-dlp/yt_dlp/extractor/common.py", line 734, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kgonsalves/Repos/yt-dlp/yt_dlp/extractor/bbc.py", line 1291, in _real_extract
    self._search_regex(
  File "/Users/kgonsalves/Repos/yt-dlp/yt_dlp/extractor/common.py", line 1321, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)


bashonly · 2024-04-16T21:21:48Z

IIRC, the nextjs data is only present when requesting the webpages from a US IP address. If you are going to open a PR, make sure that none of the current extraction strategies are removed/replaced -- this should only be done in addition to them. And use the _search_nextjs_data helper method.
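The "in addition to" requirement can be sketched roughly as follows, using only the standard library. This is not the extractor's real code: existing_strategy, nextjs_strategy, first_successful, and the data-existing attribute are hypothetical stand-ins, and inside yt-dlp the regex here would give way to the _search_nextjs_data helper named above.

```python
import json
import re


def existing_strategy(webpage):
    # Hypothetical stand-in for the extraction paths already in the extractor.
    match = re.search(r'data-existing="([^"]+)"', webpage)
    return match.group(1) if match else None


def nextjs_strategy(webpage):
    # New path: only succeeds when the page embeds a __NEXT_DATA__ script tag.
    match = re.search(
        r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>', webpage, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# The new strategy is appended; nothing existing is removed or reordered.
STRATEGIES = [
    ('existing', existing_strategy),
    ('nextjs', nextjs_strategy),
]


def first_successful(webpage, strategies):
    """Try each strategy in order and return (name, result) for the first hit."""
    for name, strategy in strategies:
        result = strategy(webpage)
        if result is not None:
            return name, result
    return None, None


us_page = '<script id="__NEXT_DATA__" type="application/json">{"id": "us"}</script>'
uk_page = '<div data-existing="uk"></div>'

print(first_successful(us_page, STRATEGIES))
print(first_successful(uk_page, STRATEGIES))
```

Because the new path is tried last, pages that already extract correctly keep using the same code path as before.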

dirkf · 2024-04-17T13:35:44Z

Indeed, no NEXT_DATA in the page as seen from the UK. The page has a giant __INITIAL_DATA__.

As BBC content shown outside UK can be ad-supported, there may be a different front-end to the asset store, perhaps even developed separately from the one seen at home.

It might be helpful for OP to post an example of the NEXT_DATA, as these differences could affect CI download tests.

kylegustavo added site-enhancement Feature request for some website triage Untriaged issue labels Apr 16, 2024

bashonly added site-bug Issue with a specific website and removed site-enhancement Feature request for some website triage Untriaged issue labels Apr 16, 2024

kylegustavo linked a pull request Apr 17, 2024 that will close this issue

Issue 9701: NEXT_DATA field video extraction for bbc US website #9705

Open

9 tasks
