
Issue 9701: NEXT_DATA field video extraction for BBC US website #9705

Open · wants to merge 23 commits into master

Conversation

@kylegustavo (Contributor) commented Apr 17, 2024:

IMPORTANT: PRs without the template will be CLOSED

NEXT_DATA field video extraction for BBC

Some BBC articles with an embedded video carry the video data in a JSON structure tagged __NEXT_DATA__. Other websites use the same structure, so we can reuse the existing helper function _search_nextjs_data to retrieve it. For BBC, this path applies when the article is accessed from the US. Add a parser for this case.
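
For context, the __NEXT_DATA__ payload is just the JSON that Next.js embeds in a <script id="__NEXT_DATA__" type="application/json"> tag. Below is a minimal standalone sketch of pulling it out of a downloaded page, for illustration only: inside an extractor, yt-dlp's existing _search_nextjs_data helper does this job, and the sample traversal path shown in the comment is hypothetical, not the exact one this PR uses.

import json
import re


def extract_next_data(webpage):
    """Return the parsed __NEXT_DATA__ JSON from a Next.js page, or None."""
    match = re.search(
        r'<script[^>]+id=["\']__NEXT_DATA__["\'][^>]*>\s*(\{.+?\})\s*</script>',
        webpage, re.DOTALL)
    return json.loads(match.group(1)) if match else None


# Hypothetical usage: dig into the page props where the article/video blocks live.
# next_data = extract_next_data(webpage)
# page_props = (next_data or {}).get('props', {}).get('pageProps', {})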

Also patches an old test case's title so it matches the current page.

Fixes #9701

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Some BBC articles with embedded video carry the data in a JSON structure tagged __NEXT_DATA__. Add a parser for this case.

Links tested:
https://www.bbc.com/news/uk-68546268
https://www.bbc.com/news/world-middle-east-68778149
https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness
@bashonly added the pending-fixes (PR has had changes requested) and site-bug (Issue with a specific website) labels on Apr 17, 2024
@Grub4K added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label on Apr 19, 2024
@bashonly added the pending-fixes (PR has had changes requested) label and removed the pending-review (PR needs a review) label on Apr 20, 2024
@pukkandan removed the pending-fixes (PR has had changes requested) label on Apr 22, 2024
@dirkf (Contributor) commented Apr 22, 2024:

Transcript of a successful TestDownload.test_BBC_all run from the UK:
$ python3 test/test_download.py TestDownload.test_BBC_all
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/news/world-europe-32668511
[bbc] world-europe-32668511: Downloading webpage
[bbc] p02r0327: Downloading media selection JSON
[bbc] p02r0327: Downloading media selection JSON
[bbc] p02r0l9y: Downloading media selection JSON
[bbc] p02r0l9y: Downloading media selection JSON
[download] Downloading playlist: Russia stages massive WW2 parade despite Western boycott
[bbc] Playlist Russia stages massive WW2 parade despite Western boycott: Downloading 2 items of 2
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 2
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02r0327: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_p02r0327.info.json
[download] Downloading item 2 of 2
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02r0l9y: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_p02r0l9y.info.json
[download] Finished downloading playlist: Russia stages massive WW2 parade despite Western boycott
Skipping BBC: Save time
Skipped test_BBC_1
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/blogs/adamcurtis/entries/3662a707-0af9-3149-963f-47bea720b460
[bbc] 3662a707-0af9-3149-963f-47bea720b460: Downloading webpage
[download] Downloading playlist: BUGGER
[bbc] Playlist BUGGER: Downloading 18 items of 18
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 18
[download] Downloading item 2 of 18
[download] Downloading item 3 of 18
[download] Downloading item 4 of 18
[download] Downloading item 5 of 18
[download] Downloading item 6 of 18
[download] Downloading item 7 of 18
[download] Downloading item 8 of 18
[download] Downloading item 9 of 18
[download] Downloading item 10 of 18
[download] Downloading item 11 of 18
[download] Downloading item 12 of 18
[download] Downloading item 13 of 18
[download] Downloading item 14 of 18
[download] Downloading item 15 of 18
[download] Downloading item 16 of 18
[download] Downloading item 17 of 18
[download] Downloading item 18 of 18
[download] Finished downloading playlist: BUGGER
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/news/world-europe-32041533
[bbc] world-europe-32041533: Downloading webpage
[bbc] p02mprgb: Downloading media selection JSON
[bbc] p02mprgb: Downloading media selection JSON
[download] Downloading playlist: Germanwings plane crash site in aerial video
[bbc] Playlist Germanwings plane crash site in aerial video: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02mprgb: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_3_p02mprgb.info.json
[download] Finished downloading playlist: Germanwings plane crash site in aerial video
Skipping BBC: now SIMORGH_DATA with no video
Skipped test_BBC_4
Skipping BBC: video extraction failed
Skipped test_BBC_5
Skipping BBC: 404 Not Found
Skipped test_BBC_6
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/travel/story/20150625-sri-lankas-spicy-secret
[bbc] 20150625-sri-lankas-spicy-secret: Downloading webpage
[bbc] p02q6gc4: Downloading media selection JSON
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading media selection JSON
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02q6gc4: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_7_p02q6gc4.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/876d80-p02q6gc4/vf_p02q6gc4_3dcec2b6-f4c6-4c2f-8b6a-f7a324c2c9c1.ism/vf_p02q6gc4_3dcec2b6-f4c6-4c2f-8b6a-f7a324c2c9c1-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 34
[download] Destination: test_BBC_7_p02q6gc4.mp4
[download] 100% of    1.58MiB in 00:00:01 at 1.08MiB/s
Skipping BBC: redirects to TopGear home page
Skipped test_BBC_8
Skipping BBC: Video no longer in page
Skipped test_BBC_9
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/33653409
[bbc] 33653409: Downloading webpage
[bbc] p02xycnp: Downloading media selection JSON
[bbc] p02xycnp: Downloading media selection JSON
[download] Downloading playlist: Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?
[bbc] Playlist Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02xycnp: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_10_p02xycnp.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/sport/football/1182000/1182292_h264_1500k.mp4?__gda__=1713830178_3637759492b847bdefa74e131c40b760"
[download] Destination: test_BBC_10_p02xycnp.mp4
[download] 100% of   10.00KiB in 00:00:00 at 30.64KiB/s
[download] Finished downloading playlist: Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/34475836
[bbc] 34475836: Downloading webpage
[bbc] p034ppnv: Downloading media selection JSON
[bbc] p034ppnv: Downloading media selection JSON
[bbc] Downloading playlist 34475836 - add --no-playlist to download just the video p034ppnv
[bbc] p034pzyp: Downloading media selection JSON
[bbc] p034pzyp: Downloading media selection JSON
[bbc] p018lnzb: Downloading media selection JSON
[bbc] p018lnzb: Downloading media selection JSON
[download] Downloading playlist: Jurgen Klopp: Furious football from a witty and winning coach
[bbc] Playlist Jurgen Klopp: Furious football from a witty and winning coach: Downloading 3 items of 3
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034ppnv: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p034ppnv.info.json
[download] Downloading item 2 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034pzyp: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p034pzyp.info.json
[download] Downloading item 3 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p018lnzb: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p018lnzb.info.json
[download] Finished downloading playlist: Jurgen Klopp: Furious football from a witty and winning coach
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/34475836
[bbc] 34475836: Downloading webpage
[bbc] p034ppnv: Downloading media selection JSON
[bbc] p034ppnv: Downloading media selection JSON
[bbc] Downloading just the video p034ppnv because of --no-playlist
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034ppnv: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_12_p034ppnv.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/sport/football/1199000/1199036_h264_1500k.mp4?__gda__=1713830185_c7da95063abc356f9c4b85dfe06e9307"
[download] Destination: test_BBC_12_p034ppnv.mp4
[download] 100% of   10.00KiB in 00:00:00 at 88.56KiB/s
Skipping BBC: redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt
Skipped test_BBC_13
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/news/science-environment-33661876
[bbc] science-environment-33661876: Downloading webpage
[bbc] p02xzws1: Downloading media selection JSON
[bbc] p02xzws1: Downloading media selection JSON
[download] Downloading playlist: New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa
[bbc] Playlist New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02xzws1: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_14_p02xzws1.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/news/science_environment/1182000/1182405_h264_1500k.mp4?__gda__=1713830186_96b6f09383f6d3bec1e5c613f4ec4eeb"
[download] Destination: test_BBC_14_p02xzws1.mp4
[download] 100% of   10.00KiB in 00:00:00 at 76.69KiB/s
[download] Finished downloading playlist: New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/news/av/world-europe-59468682
[bbc] world-europe-59468682: Downloading webpage
[bbc] p0b779gc: Downloading media selection JSON
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading media selection JSON
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[download] Downloading playlist: Why France is declaring Josephine Baker a national hero
[bbc] Playlist Why France is declaring Josephine Baker a national hero: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p0b779gc: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_15_p0b779gc.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_hd/77c6b7-p0b779gc/vf_p0b779gc_dcca2be6-4e96-49cf-a657-b913b0dee50b.ism/vf_p0b779gc_dcca2be6-4e96-49cf-a657-b913b0dee50b-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 17
[download] Destination: test_BBC_15_p0b779gc.mp4
[download] 100% of    1.43MiB in 00:00:01 at 1.22MiB/s
[download] Finished downloading playlist: Why France is declaring Josephine Baker a national hero
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/news/uk-68546268
[bbc] uk-68546268: Downloading webpage
[bbc] p0hj0lq7: Downloading media selection JSON
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading media selection JSON
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[download] Downloading playlist: Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing'
[bbc] Playlist Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing': Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p0hj0lq7: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_16_p0hj0lq7.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_hd/c6a059-p0hj0lq7/vf_p0hj0lq7_899c8d45-62be-44e3-ad46-adf717b33d81.ism/vf_p0hj0lq7_899c8d45-62be-44e3-ad46-adf717b33d81-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 14
[download] Destination: test_BBC_16_p0hj0lq7.mp4
[download] 100% of    1.66MiB in 00:00:01 at 1.05MiB/s
[download] Finished downloading playlist: Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing'
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.co.uk/bbcthree/clip/73d0bbd0-abc3-4cea-b3c0-cdae21905eb1
[bbc] 73d0bbd0-abc3-4cea-b3c0-cdae21905eb1: Downloading webpage
[bbc] p06556y7: Downloading media selection JSON
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading media selection JSON
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p06556y7: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_17_p06556y7.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/9e2770-p06556y7/vf_p06556y7_56902990-bf2e-4b2d-becf-2f71f5ce9fcb.ism/vf_p06556y7_56902990-bf2e-4b2d-becf-2f71f5ce9fcb-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 44
[download] Destination: test_BBC_17_p06556y7.mp4
[download] 100% of    1.74MiB in 00:00:01 at 1.21MiB/s
Skipping BBC: 404 Not Found
Skipped test_BBC_18
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/learningenglish/chinese/features/lingohack/ep-181227
[bbc] ep-181227: Downloading webpage
[bbc.co.uk] Extracting URL: https://www.bbc.co.uk/programmes/p06w9twp
[bbc.co.uk] p06w9twp: Downloading video page
[bbc.co.uk] p06w9tws: Downloading media selection JSON
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading media selection JSON
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p06w9tws: Downloading 1 format(s): mf_cloudfront-1802-1
[info] Writing video metadata as JSON to: test_BBC_19_p06w9tws.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/877dad-p06w9tws/vf_p06w9tws_591fefab-58d0-4f81-8e5e-63483172ff76.ism/vf_p06w9tws_591fefab-58d0-4f81-8e5e-63483172ff76-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 24
[download] Destination: test_BBC_19_p06w9tws.mp4
[download] 100% of    1.56MiB in 00:00:01 at 1.09MiB/s
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness
[bbc] how-positive-thinking-is-harming-your-happiness: Downloading webpage
[bbc] p07c6sb9: Downloading media selection JSON
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading media selection JSON
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p07c6sb9: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_20_p07c6sb9.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/bd909e-p07c6sb9/vf_p07c6sb9_82310bda-114a-4e05-b503-8dc0985cde66.ism/vf_p07c6sb9_82310bda-114a-4e05-b503-8dc0985cde66-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 31
[download] Destination: test_BBC_20_p07c6sb9.mp4
[download] 100% of    1.50MiB in 00:00:01 at 1.19MiB/s
Skipping BBC: 404 Not Found
Skipped test_BBC_21
.
----------------------------------------------------------------------
Ran 1 test in 79.212s

OK
$

Skipped tests:

  • test_BBC_4: now SIMORGH_DATA with no video
  • test_BBC_5: video extraction failed (now in .pageData.promo.media of SIMORGH_DATA)
  • test_BBC_6: 404 Not Found
  • test_BBC_8: redirects to TopGear home page
  • test_BBC_9: Video no longer in page
  • test_BBC_13: redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt
  • test_BBC_18: 404 Not Found
  • test_BBC_21: 404 Not Found
Patch for the above extractor version:
--- 6ef8990/yt-dlp/yt_dlp/extractor/bbc.py
+++ new/yt-dlp/yt_dlp/extractor/bbc.py
@@ -17,6 +17,7 @@
     int_or_none,
     join_nonempty,
     js_to_json,
+    merge_dicts,
     parse_duration,
     parse_iso8601,
     parse_qs,
@@ -43,6 +44,7 @@
                             iplayer(?:/[^/]+)?/(?:episode/|playlist/)|
                             music/(?:clips|audiovideo/popular)[/#]|
                             radio/player/|
+                            sounds/play/|
                             events/[^/]+/play/[^/]+/
                         )
                         (?P<id>%s)(?!/(?:episodes|broadcasts|clips))
@@ -623,6 +625,7 @@
         'info_dict': {
             'id': '3662a707-0af9-3149-963f-47bea720b460',
             'title': 'BUGGER',
+            'description': r're:BUGGER  The recent revelations by the whistleblower Edward Snowden were fascinating. .{211}\.{3}$',
         },
         'playlist_count': 18,
     }, {
@@ -631,14 +634,14 @@
         'info_dict': {
             'id': 'p02mprgb',
             'ext': 'mp4',
-            'title': 'Aerial footage showed the site of the crash in the Alps - courtesy BFM TV',
-            'description': 'md5:2868290467291b37feda7863f7a83f54',
-            'duration': 47,
+            'title': 'Germanwings crash site aerial video',
+            'description': r're:(?s)Aerial video showed the site where the Germanwings flight 4U 9525, .{156} BFM TV\.$',
+            'duration': None,  # 47,
             'timestamp': 1427219242,
             'upload_date': '20150324',
+            'thumbnail': 'https://ichef.bbci.co.uk/news/1024/media/images/81879000/jpg/_81879090_81879089.jpg',
         },
         'params': {
-            # rtmp download
             'skip_download': True,
         }
     }, {
@@ -656,7 +659,8 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        'skip': 'now SIMORGH_DATA with no video',
     }, {
         # single video embedded with data-playable containing XML playlists (regional section)
         'url': 'http://www.bbc.com/mundo/video_fotos/2015/06/150619_video_honduras_militares_hospitales_corrupcion_aw',
@@ -670,7 +674,9 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        # TODO: now in .pageData.promo.media of SIMORGH_DATA
+        'skip': 'video extraction failed',
     }, {
         # single video from video playlist embedded with vxp-playlist-data JSON
         'url': 'http://www.bbc.com/news/video_and_audio/must_see/33376376',
@@ -683,22 +689,22 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        'skip': '404 Not Found',
     }, {
         # single video story with digitalData
         'url': 'http://www.bbc.com/travel/story/20150625-sri-lankas-spicy-secret',
         'info_dict': {
             'id': 'p02q6gc4',
-            'ext': 'flv',
-            'title': 'Sri Lanka’s spicy secret',
-            'description': 'As a new train line to Jaffna opens up the country’s north, travellers can experience a truly distinct slice of Tamil culture.',
-            'timestamp': 1437674293,
-            'upload_date': '20150723',
-        },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
+            'ext': 'mp4',
+            # page title: 'Sri Lanka’s spicy secret',
+            'title': 'Tasting the spice of life in Jaffna',
+            # page description: 'As a new train line to Jaffna opens up the country’s north, travellers can experience a truly distinct slice of Tamil culture.',
+            'description': r're:(?s)BBC Travel Show’s Henry Golding explores the city of Jaffna .{149} aftertaste\.$',
+            'timestamp': 1437935638,   # was: 1437674293,
+            'upload_date': '20150726',
+            'duration': 255,
+        },
     }, {
         # single video story without digitalData
         'url': 'http://www.bbc.com/autos/story/20130513-hyundais-rock-star',
@@ -710,12 +716,10 @@
             'timestamp': 1415867444,
             'upload_date': '20141113',
         },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
+        'skip': 'redirects to TopGear home page',
     }, {
         # single video embedded with Morph
+        # TODO: replacement test page
         'url': 'http://www.bbc.co.uk/sport/live/olympics/36895975',
         'info_dict': {
             'id': 'p041vhd0',
@@ -726,27 +730,22 @@
             'uploader': 'BBC Sport',
             'uploader_id': 'bbc_sport',
         },
-        'params': {
-            # m3u8 download
-            'skip_download': True,
-        },
-        'skip': 'Georestricted to UK',
-    }, {
-        # single video with playlist.sxml URL in playlist param
+        'skip': 'Video no longer in page',
+    }, {
+        # single video in __INITIAL_DATA__ (was: playlist.sxml URL in playlist param)
         'url': 'http://www.bbc.com/sport/0/football/33653409',
         'info_dict': {
             'id': 'p02xycnp',
             'ext': 'mp4',
-            'title': 'Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?',
-            'description': 'BBC Sport\'s David Ornstein has the latest transfer gossip, including rumours of a Manchester United return for Cristiano Ronaldo.',
-            'duration': 140,
-        },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
-    }, {
-        # article with multiple videos embedded with playlist.sxml in playlist param
+            'title': 'Ronaldo to Man Utd, Arsenal to spend?',
+            'description': r'''re:(?s)BBC Sport's David Ornstein rounds up the latest transfer reports, .{359} here\.$''',
+            'timestamp': 1437750175,
+            'upload_date': '20150724',
+            'thumbnail': 'https://news.bbcimg.co.uk/media/images/69320000/png/_69320754_mmgossipcolumnextraaugust18.png',
+            'duration': None,  # 140,
+        },
+    }, {
+        # article with multiple videos embedded with Morph.setPayload
         'url': 'http://www.bbc.com/sport/0/football/34475836',
         'info_dict': {
             'id': '34475836',
@@ -755,6 +754,21 @@
         },
         'playlist_count': 3,
     }, {
+        # lead item from above playlist
+        'url': 'http://www.bbc.com/sport/0/football/34475836',
+        'info_dict': {
+            'id': 'p034ppnv',
+            'ext': 'mp4',
+            'title': 'All you need to know about Jurgen Klopp',
+            'timestamp': 1444335081,
+            'upload_date': '20151008',
+            'duration': 122.0,
+            'thumbnail': 'https://ichef.bbci.co.uk/onesport/cps/976/cpsprodpb/7542/production/_85981003_klopp.jpg',
+        },
+        'params': {
+            'noplaylist': True,
+        },
+    }, {
         # school report article with single video
         'url': 'http://www.bbc.co.uk/schoolreport/35744779',
         'info_dict': {
@@ -762,6 +776,7 @@
             'title': 'School which breaks down barriers in Jerusalem',
         },
         'playlist_count': 1,
+        'skip': 'redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt',
     }, {
         # single video with playlist URL from weather section
         'url': 'http://www.bbc.com/weather/features/33601775',
@@ -783,10 +798,10 @@
         # video with window.__INITIAL_DATA__ and value as JSON string
         'url': 'https://www.bbc.com/news/av/world-europe-59468682',
         'info_dict': {
-            'id': 'p0b71qth',
+            'id': 'p0b779gc',  # was 'p0b71qth',
             'ext': 'mp4',
             'title': 'Why France is making this woman a national hero',
-            'description': 'md5:7affdfab80e9c3a1f976230a1ff4d5e4',
+            'description': r're:(?s)France is honouring the US-born 20th Century singer and activist Josephine .{291} Casseville$',
             'thumbnail': r're:https?://.+/.+\.jpg',
             'timestamp': 1638230731,
             'upload_date': '20211130',
@@ -830,6 +845,7 @@
             'uploader': 'Radio 3',
             'uploader_id': 'bbc_radio_three',
         },
+        'skip': '404 Not Found',
     }, {
         'url': 'http://www.bbc.co.uk/learningenglish/chinese/features/lingohack/ep-181227',
         'info_dict': {
@@ -837,6 +853,7 @@
             'ext': 'mp4',
             'title': 'md5:2fabf12a726603193a2879a055f72514',
             'description': 'Learn English words and phrases from this story',
+            'thumbnail': 'https://ichef.bbci.co.uk/images/ic/1200x675/p06pq9gk.jpg',
         },
         'add_ie': [BBCCoUkIE.ie_key()],
     }, {
@@ -849,7 +866,7 @@
             'alt_title': 'The downsides of positive thinking',
             'description': 'md5:fad74b31da60d83b8265954ee42d85b4',
             'duration': 235,
-            'thumbnail': r're:https?://.+/p07c9dsr.jpg',
+            'thumbnail': r're:https?://.+/p07c9dsr\.(?:jpg|webp|png)',
             'upload_date': '20190604',
             'categories': ['Psychology'],
         },
@@ -867,6 +884,7 @@
             'duration': 1800,
             'uploader_id': 'bbc_radio_three',
         },
+        'skip': '404 Not Found',
     }, {  # onion routes
         'url': 'https://www.bbcnewsd73hkzno2ini43t4gblxvycyac5aw4gnv7t2rccijh7745uqd.onion/news/av/world-europe-63208576',
         'only_matching': True,
@@ -1082,83 +1100,141 @@
                 }
 
         # Morph based embed (e.g. http://www.bbc.co.uk/sport/live/olympics/36895975)
-        # There are several setPayload calls may be present but the video
-        # seems to be always related to the first one
-        morph_payload = self._parse_json(
-            self._search_regex(
-                r'Morph\.setPayload\([^,]+,\s*({.+?})\);',
-                webpage, 'morph payload', default='{}'),
-            playlist_id, fatal=False)
+        # Several setPayload calls may be present but the video(s)
+        # should be in one that mentions leadMedia or videoData
+        morph_payload = self._search_json(
+            r'\bMorph\s*\.\s*setPayload\s*\([^,]+,', webpage, 'morph payload', playlist_id,
+            contains_pattern=r'\{(?:(?!</script>)[\s\S])+?(?:"leadMedia"|\\"videoData\\")\s*:(?:(?!</script>)[\s\S])+\}',
+            default={})
         if morph_payload:
-            components = try_get(morph_payload, lambda x: x['body']['components'], list) or []
-            for component in components:
-                if not isinstance(component, dict):
-                    continue
-                lead_media = try_get(component, lambda x: x['props']['leadMedia'], dict)
-                if not lead_media:
-                    continue
-                identifiers = lead_media.get('identifiers')
-                if not identifiers or not isinstance(identifiers, dict):
-                    continue
-                programme_id = identifiers.get('vpid') or identifiers.get('playablePid')
+            for component in traverse_obj(morph_payload, (
+                    'body', 'components', lambda _, v: v['props']['leadMedia']['identifiers'])):
+                lead_media = component['props']['leadMedia']
+                programme_id = traverse_obj(lead_media['identifiers'], 'vpid', 'playablePid', expected_type=str)
                 if not programme_id:
                     continue
                 title = lead_media.get('title') or self._og_search_title(webpage)
                 formats, subtitles = self._download_media_selector(programme_id)
-                description = lead_media.get('summary')
-                uploader = lead_media.get('masterBrand')
-                uploader_id = lead_media.get('mid')
-                duration = None
-                duration_d = lead_media.get('duration')
-                if isinstance(duration_d, dict):
-                    duration = parse_duration(dict_get(
-                        duration_d, ('rawDuration', 'formattedDuration', 'spokenDuration')))
                 return {
                     'id': programme_id,
                     'title': title,
-                    'description': description,
-                    'duration': duration,
-                    'uploader': uploader,
-                    'uploader_id': uploader_id,
+                    **traverse_obj(lead_media, {
+                        'description': ('summary', {str}),
+                        'duration': ('duration', ('rawDuration', 'formattedDuration', 'spokenDuration'), {parse_duration}),
+                        'uploader': ('masterBrand', {str}),
+                        'uploader_id': ('mid', {str}),
+                    }),
                     'formats': formats,
                     'subtitles': subtitles,
                 }
-
-        preload_state = self._parse_json(self._search_regex(
-            r'window\.__PRELOADED_STATE__\s*=\s*({.+?});', webpage,
-            'preload state', default='{}'), playlist_id, fatal=False)
-        if preload_state:
-            current_programme = preload_state.get('programmes', {}).get('current') or {}
-            programme_id = current_programme.get('id')
-            if current_programme and programme_id and current_programme.get('type') == 'playable_item':
-                title = current_programme.get('titles', {}).get('tertiary') or playlist_title
+            body = traverse_obj(morph_payload, (
+                'body', 'content', 'article', 'body',
+                {lambda s: self._parse_json(s, playlist_id, fatal=False)}))
+            added = False
+            for video_data in traverse_obj(body, (Ellipsis, 'videoData', {lambda v: v.get('pid') and v})):
+                if video_data.get('vpid'):
+                    video_id = video_data['vpid']
+                    formats, subtitles = self._download_media_selector(video_id)
+                    entry = {
+                        'id': video_id,
+                        'formats': formats,
+                        'subtitles': subtitles,
+                    }
+                else:
+                    video_id = video_data['pid']
+                    entry = self.url_result(
+                        'https://www.bbc.co.uk/programmes/%s' % video_id, BBCCoUkIE.ie_key(),
+                        video_id, url_transparent=True)
+                entry = merge_dicts(
+                    traverse_obj(morph_payload, (
+                        'body', 'content', 'article', {
+                            'timestamp': ('dateTimeInfo', 'dateTime', {parse_iso8601}),
+                        })), traverse_obj(video_data, {
+                            'thumbnail': (('iChefImage', 'image'), {url_or_none}, any),
+                            'title': (('title', 'caption'), {str}, any),
+                            'duration': ('duration', {parse_duration}),
+                        }), entry)
+                if video_data.get('isLead') and not self._yes_playlist(playlist_id, video_id):
+                    return entry
+                entries.append(entry)
+                added = True
+            if added:
+                playlist_title = traverse_obj(morph_payload, (
+                    'body', 'content', 'article', 'headline', {str})) or playlist_title
+                return self.playlist_result(
+                    entries, playlist_id, playlist_title, playlist_description)
+
+        # various PRELOADED_STATE JSON
+        preload_state = self._search_json(
+            r'window\.__(?:PWA_)?PRELOADED_STATE__\s*=', webpage,
+            'preload state', playlist_id, transform_source=js_to_json, default={})
+        # PRELOADED_STATE with current programmme
+        current_programme = traverse_obj(preload_state, (
+            'programmes', 'current', {dict}))
+        if current_programme:
+            programme_id = traverse_obj(current_programme, ('id', {str}))
+            if programme_id and current_programme.get('type') == 'playable_item':
+                title = traverse_obj(current_programme, ('titles', 'tertiary', {str})) or playlist_title
                 formats, subtitles = self._download_media_selector(programme_id)
-                synopses = current_programme.get('synopses') or {}
-                network = current_programme.get('network') or {}
-                duration = int_or_none(
-                    current_programme.get('duration', {}).get('value'))
-                thumbnail = None
-                image_url = current_programme.get('image_url')
-                if image_url:
-                    thumbnail = image_url.replace('{recipe}', 'raw')
                 return {
                     'id': programme_id,
                     'title': title,
-                    'description': dict_get(synopses, ('long', 'medium', 'short')),
-                    'thumbnail': thumbnail,
-                    'duration': duration,
-                    'uploader': network.get('short_title'),
-                    'uploader_id': network.get('id'),
+                    **traverse_obj(current_programme, {
+                        'description': ('synopses', ('long', 'medium', 'short'), {str}, any),
+                        'thumbnail': ('image_url', {lambda u: url_or_none(u.replace('{recipe}', 'raw'))}),
+                        'duration': ('duration', 'value', {int_or_none}),
+                        'uploader': ('network', 'short_title', {str}),
+                        'uploader_id': ('network', 'id', {str}),
+                    }),
                     'formats': formats,
                     'subtitles': subtitles,
-                    'chapters': traverse_obj(preload_state, (
-                        'tracklist', 'tracks', lambda _, v: float_or_none(v['offset']['start']), {
-                            'title': ('titles', {lambda x: join_nonempty(
-                                'primary', 'secondary', 'tertiary', delim=' - ', from_dict=x)}),
-                            'start_time': ('offset', 'start', {float_or_none}),
-                            'end_time': ('offset', 'end', {float_or_none}),
-                        })) or None,
+                    **traverse_obj(preload_state, {
+                        'chapters': (
+                            'tracklist', 'tracks', lambda _, v: float_or_none(v['offset']['start']), {
+                                'title': ('titles', {lambda x: join_nonempty(
+                                    'primary', 'secondary', 'tertiary', delim=' - ', from_dict=x)}),
+                                'start_time': ('offset', 'start', {float_or_none}),
+                                'end_time': ('offset', 'end', {float_or_none}),
+                            }
+                        )
+                    }),
                 }
+
+        # PWA_PRELOADED_STATE with article video asset
+        asset_id = traverse_obj(preload_state, (
+            'entities', 'articles', lambda k, _: k.rsplit('/', 1)[-1] == playlist_id,
+            'assetVideo', 0, {str}, any))
+        if asset_id:
+            video_id = traverse_obj(preload_state, ('entities', 'videos', asset_id, 'vpid', {str}))
+            if video_id:
+                article = traverse_obj(preload_state, (
+                    'entities', 'articles', lambda _, v: v['assetVideo'][0] == asset_id, any))
+
+                def image_url(image_id):
+                    return traverse_obj(preload_state, (
+                        'entities', 'images', image_id, 'url',
+                        {lambda u: url_or_none(u.replace('$recipe', 'raw'))}))
+
+                formats, subtitles = self._download_media_selector(video_id)
+                return {
+                    'id': video_id,
+                    **traverse_obj(preload_state, ('entities', 'videos', asset_id, {
+                        'title': ('title', {str}),
+                        'description': (('synopsisLong', 'synopsisMedium', 'synopsisShort'), {str}, any),
+                        'thumbnail': (0, {image_url}),
+                        'duration': ('duration', {int_or_none}),
+                    })),
+                    'formats': formats,
+                    'subtitles': subtitles,
+                    **traverse_obj(article, {
+                        'timestamp': ('displayDate', {parse_iso8601}),
+                    }),
+                }
+            else:
+                return self.url_result(
+                    'https://www.bbc.co.uk/programmes/%s' % asset_id, BBCCoUkIE.ie_key(),
+                    asset_id, playlist_title, display_id=playlist_id,
+                    description=playlist_description)
 
         bbc3_config = self._parse_json(
             self._search_regex(
@@ -1204,6 +1280,28 @@
                 return self.playlist_result(
                     entries, playlist_id, playlist_title, playlist_description)
 
+        k_int_or_none = functools.partial(int_or_none, scale=1000)
+
+        def parse_model(model):
+            '''Extract single video from model structure'''
+            item_id = traverse_obj(model, ('versions', 0, 'versionId', {str}))
+            if not item_id:
+                return
+            formats, subtitles = self._download_media_selector(item_id)
+            return {
+                'id': item_id,
+                'formats': formats,
+                'subtitles': subtitles,
+                **traverse_obj(model, {
+                    'title': ('title', {str}),
+                    'thumbnail': ('imageUrl', {lambda u: urljoin(url, u.replace('$recipe', 'raw'))}),
+                    'description': (
+                        'synopses', ('long', 'medium', 'short'), {str}, any),
+                    'duration': ('versions', 0, 'duration', {int}),
+                    'timestamp': ('versions', 0, 'availableFrom', {k_int_or_none}),
+                })
+            }
+
         initial_data = self._search_regex(
             r'window\.__INITIAL_DATA__\s*=\s*("{.+?}")\s*;', webpage,
             'quoted preload state', default=None)
@@ -1215,6 +1313,21 @@
             initial_data = self._parse_json(initial_data or '"{}"', playlist_id, fatal=False)
         initial_data = self._parse_json(initial_data, playlist_id, fatal=False)
         if initial_data:
+            added = False
+            for video_data in traverse_obj(initial_data, (
+                    'stores', 'article', 'articleBodyContent', lambda _, v: v['type'] == 'video')):
+                model = traverse_obj(video_data, (
+                    'model', 'blocks', lambda _, v: v['type'] == 'aresMedia',
+                    'model', 'blocks', lambda _, v: v['type'] == 'aresMediaMetadata',
+                    'model', {dict}, any))
+                entry = parse_model(model)
+                if entry:
+                    entries.append(entry)
+                    added = True
+            if added:
+                return self.playlist_result(
+                    entries, playlist_id, playlist_title, playlist_description)
+
             def parse_media(media):
                 if not media:
                     return
@@ -1248,18 +1361,17 @@
                         'timestamp': item_time,
                         'description': strip_or_none(item_desc),
                     })
-            for resp in (initial_data.get('data') or {}).values():
-                name = resp.get('name')
+
+            for resp in traverse_obj(initial_data, ('data', lambda _, v: v.get('name'))):
+                name = resp['name']
                 if name == 'media-experience':
                     parse_media(try_get(resp, lambda x: x['data']['initialItem']['mediaItem'], dict))
                 elif name == 'article':
-                    for block in (try_get(resp,
-                                          (lambda x: x['data']['blocks'],
-                                           lambda x: x['data']['content']['model']['blocks'],),
-                                          list) or []):
-                        if block.get('type') not in ['media', 'video']:
-                            continue
-                        parse_media(block.get('model'))
+                    for block in traverse_obj(resp, ('data', (
+                            None, ('content', 'model')), 'blocks',
+                            lambda _, v: v.get('type') in {'media', 'video'},
+                            'model', {dict})):
+                        parse_media(block)
             return self.playlist_result(
                 entries, playlist_id, playlist_title, playlist_description)
 
@@ -1267,26 +1379,6 @@
             return list(filter(None, map(
                 lambda s: self._parse_json(s, playlist_id, fatal=False),
                 re.findall(pattern, webpage))))
-
-        def parse_model(model):
-            '''Extract single video from model structure'''
-            item_id = traverse_obj(model, ('versions', 0, 'versionId', {str}))
-            if not item_id:
-                return
-            formats, subtitles = self._download_media_selector(item_id)
-            return {
-                'id': item_id,
-                'formats': formats,
-                'subtitles': subtitles,
-                **traverse_obj(model, {
-                    'title': ('title', {str}),
-                    'thumbnail': ('imageUrl', {lambda u: urljoin(url, u.replace('$recipe', 'raw'))}),
-                    'description': (
-                        'synopses', ('long', 'medium', 'short'), {str}, any),
-                    'duration': ('versions', 0, 'duration', {int}),
-                    'timestamp': ('versions', 0, 'availableFrom', {lambda x: int_or_none(x, scale=1000)}),
-                })
-            }
 
         # US accessed article with single embedded video (e.g.
         # https://www.bbc.com/news/uk-68546268)
@@ -1303,7 +1395,7 @@
                 if entry.get('timestamp') is None:
                     entry['timestamp'] = traverse_obj(next_data, (
                         ..., 'contents', lambda _, v: v['type'] == 'timestamp',
-                        'model', 'timestamp', {functools.partial(int_or_none, scale=1000)}, any))
+                        'model', 'timestamp', {k_int_or_none}, any))
                 entries.append(entry)
                 return self.playlist_result(
                     entries, playlist_id, playlist_title, playlist_description)
The yt-dlp branch where I ran this has the old networking and doesn't have the `all`, `any` traversal keys or the updated `_search_nextjs_data()`, but I believe I've reverted any resulting tweaks that were in the test version.
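
Many hunks above replace manual dict digging with traverse_obj. A small illustration of the traversal idioms the patch leans on (branching lambdas, `...`, and {callable} transforms), using made-up sample data rather than a real BBC payload:

from yt_dlp.utils import int_or_none, traverse_obj

data = {'media': [
    {'type': 'video', 'vpid': 'p0hj0lq7', 'duration': '47'},
    {'type': 'image', 'url': 'thumb.jpg'},
]}

# A lambda branches over list items and keeps only those matching the predicate;
# branching always yields a list of results.
vpids = traverse_obj(data, ('media', lambda _, v: v['type'] == 'video', 'vpid'))
# -> ['p0hj0lq7']

# `...` branches over everything; {callable} applies a transform, and keys that
# are missing (or whose transform returns None) are silently dropped.
durations = traverse_obj(data, ('media', ..., 'duration', {int_or_none}))
# -> [47]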

@kylegustavo (Contributor Author) commented Apr 23, 2024:

Looks like in the failing test, BBCIE and BBCCoUkIE match the same URL. We can modify the suitable() function to avoid the duplicate.
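
For reference, the usual yt-dlp idiom here is for the broader extractor's suitable() to defer to the more specific one(s). A rough sketch of that pattern follows; the real BBCIE in bbc.py defines its own _VALID_URL and excludes several more BBC extractors than shown.

from yt_dlp.extractor.bbc import BBCCoUkIE


class BBCIE(BBCCoUkIE):  # sketch only, mirroring the shape of the real class
    @classmethod
    def suitable(cls, url):
        # Let the more specific extractor claim its URLs first, so the same
        # link is never matched by both BBCCoUkIE and BBCIE.
        EXCLUDE_IE = (BBCCoUkIE,)  # illustrative; the real exclusion tuple is longer
        return False if any(ie.suitable(url) for ie in EXCLUDE_IE) else super().suitable(url)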

@kylegustavo (Contributor Author) commented:

We do not need the 'sound' matcher in the BBCCoUkIE. It creates a duplicate match that fails the duplicate-match test, and the corresponding test in BBCIE is currently skipped anyway. If we want that change as part of a BBC improvement for UK links with 'sound', it can go in a subsequent PR.

@dirkf (Contributor) commented Apr 24, 2024:

We do not need the 'sound' matcher in the BBCCoUkIE. ...

Quite right. The Sounds pages have the __PRELOADED_STATE__ hydration JSON and BBCIE works fine in the UK. You can check a current show https://www.bbc.co.uk/sounds/play/w3ct5rgx. This could replace the broken test.

BBCCoUkIE also happens to work, but it doesn't get as much metadata (using _download_playlist()).
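
A hedged skeleton of what such a replacement _TESTS entry could look like; every info_dict value below is a placeholder to be filled in from the live page, not verified data.

{
    'url': 'https://www.bbc.co.uk/sounds/play/w3ct5rgx',
    'info_dict': {
        'id': 'w3ct5rgx',      # placeholder: the extractor may report a playable pid instead
        'ext': 'mp4',          # placeholder
        'title': str,          # fill in the real title once confirmed
        'description': str,    # likewise
    },
    'params': {'skip_download': True},
},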

@dirkf (Contributor) left a review comment:

Nice work from kylegustavo!

@pukkandan (Member) left a review comment:

Pls don't mark comments as resolved. The reviewer will decide whether the conversation is resolved

Kyle Gonsalves added 2 commits April 26, 2024 12:47
@pukkandan (Member) commented:

Everything LGTM now. Pls make sure to test all the code branches one last time

@kylegustavo (Contributor Author) commented:

Checking in: what's the process for getting this change merged now? Do I need approval from @bashonly, and who has write privileges to squash and merge it?

@kylegustavo (Contributor Author) commented:

Hello, just checking in again. @pukkandan do you have write privileges?

@bashonly (Member) commented May 6, 2024:

@kylegustavo please be patient, I would like to give this a final review, but have not found the time yet. I will do so as soon as I can. Also, pukkandan is the owner of the repo

@kylegustavo (Contributor Author) commented:

Ok, thanks @bashonly, take your time. Just checking to see that this review is in the pipeline and not going stale.

Labels: site-bug (Issue with a specific website)
Projects: none yet
Successfully merging this pull request may close: Add new __NEXT_DATA__ parsing for videos on BBC (#9701)
5 participants