
Issue 9701: NEXT_DATA field video extraction for BBC US website #9705

Open · wants to merge 23 commits into master

Conversation

@kylegustavo (Contributor) commented Apr 17, 2024:

IMPORTANT: PRs without the template will be CLOSED

NEXT_DATA field video extraction for BBC

Some BBC articles with an embedded video carry the video data in a JSON structure tagged __NEXT_DATA__. Other websites use the same structure, so we can reuse the existing helper function _search_nextjs_data to retrieve it. For BBC, this path applies when the article is accessed from the US. Add a parser for this case.
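
For context, the __NEXT_DATA__ payload is just the JSON that Next.js embeds in a <script id="__NEXT_DATA__" type="application/json"> tag. Below is a minimal standalone sketch of pulling it out of a downloaded page, for illustration only: inside an extractor, yt-dlp's existing _search_nextjs_data helper does this job, and the sample traversal path shown in the comment is hypothetical, not the exact one this PR uses.

import json
import re


def extract_next_data(webpage):
    """Return the parsed __NEXT_DATA__ JSON from a Next.js page, or None."""
    match = re.search(
        r'<script[^>]+id=["\']__NEXT_DATA__["\'][^>]*>\s*(\{.+?\})\s*</script>',
        webpage, re.DOTALL)
    return json.loads(match.group(1)) if match else None


# Hypothetical usage: dig into the page props where the article/video blocks live.
# next_data = extract_next_data(webpage)
# page_props = (next_data or {}).get('props', {}).get('pageProps', {})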

Also patches an old test case's title so it matches the current page.

Fixes #9701

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Some BBC articles with embedded video carry the data in a JSON structure tagged __NEXT_DATA__. Add a parser for this case.

Links tested:
https://www.bbc.com/news/uk-68546268
https://www.bbc.com/news/world-middle-east-68778149
https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness
@bashonly added the pending-fixes (PR has had changes requested) and site-bug (Issue with a specific website) labels on Apr 17, 2024
@Grub4K added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label on Apr 19, 2024
@bashonly added the pending-fixes (PR has had changes requested) label and removed the pending-review (PR needs a review) label on Apr 20, 2024
@pukkandan removed the pending-fixes (PR has had changes requested) label on Apr 22, 2024
@dirkf (Contributor) commented Apr 22, 2024:

Transcript of a successful TestDownload.test_BBC_all run from the UK:
$ python3 test/test_download.py TestDownload.test_BBC_all
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/news/world-europe-32668511
[bbc] world-europe-32668511: Downloading webpage
[bbc] p02r0327: Downloading media selection JSON
[bbc] p02r0327: Downloading media selection JSON
[bbc] p02r0l9y: Downloading media selection JSON
[bbc] p02r0l9y: Downloading media selection JSON
[download] Downloading playlist: Russia stages massive WW2 parade despite Western boycott
[bbc] Playlist Russia stages massive WW2 parade despite Western boycott: Downloading 2 items of 2
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 2
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02r0327: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_p02r0327.info.json
[download] Downloading item 2 of 2
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02r0l9y: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_p02r0l9y.info.json
[download] Finished downloading playlist: Russia stages massive WW2 parade despite Western boycott
Skipping BBC: Save time
Skipped test_BBC_1
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/blogs/adamcurtis/entries/3662a707-0af9-3149-963f-47bea720b460
[bbc] 3662a707-0af9-3149-963f-47bea720b460: Downloading webpage
[download] Downloading playlist: BUGGER
[bbc] Playlist BUGGER: Downloading 18 items of 18
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 18
[download] Downloading item 2 of 18
[download] Downloading item 3 of 18
[download] Downloading item 4 of 18
[download] Downloading item 5 of 18
[download] Downloading item 6 of 18
[download] Downloading item 7 of 18
[download] Downloading item 8 of 18
[download] Downloading item 9 of 18
[download] Downloading item 10 of 18
[download] Downloading item 11 of 18
[download] Downloading item 12 of 18
[download] Downloading item 13 of 18
[download] Downloading item 14 of 18
[download] Downloading item 15 of 18
[download] Downloading item 16 of 18
[download] Downloading item 17 of 18
[download] Downloading item 18 of 18
[download] Finished downloading playlist: BUGGER
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/news/world-europe-32041533
[bbc] world-europe-32041533: Downloading webpage
[bbc] p02mprgb: Downloading media selection JSON
[bbc] p02mprgb: Downloading media selection JSON
[download] Downloading playlist: Germanwings plane crash site in aerial video
[bbc] Playlist Germanwings plane crash site in aerial video: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02mprgb: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_3_p02mprgb.info.json
[download] Finished downloading playlist: Germanwings plane crash site in aerial video
Skipping BBC: now SIMORGH_DATA with no video
Skipped test_BBC_4
Skipping BBC: video extraction failed
Skipped test_BBC_5
Skipping BBC: 404 Not Found
Skipped test_BBC_6
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/travel/story/20150625-sri-lankas-spicy-secret
[bbc] 20150625-sri-lankas-spicy-secret: Downloading webpage
[bbc] p02q6gc4: Downloading media selection JSON
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading media selection JSON
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading m3u8 information
[bbc] p02q6gc4: Downloading MPD manifest
[bbc] p02q6gc4: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02q6gc4: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_7_p02q6gc4.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/876d80-p02q6gc4/vf_p02q6gc4_3dcec2b6-f4c6-4c2f-8b6a-f7a324c2c9c1.ism/vf_p02q6gc4_3dcec2b6-f4c6-4c2f-8b6a-f7a324c2c9c1-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 34
[download] Destination: test_BBC_7_p02q6gc4.mp4
[download] 100% of    1.58MiB in 00:00:01 at 1.08MiB/s
Skipping BBC: redirects to TopGear home page
Skipped test_BBC_8
Skipping BBC: Video no longer in page
Skipped test_BBC_9
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/33653409
[bbc] 33653409: Downloading webpage
[bbc] p02xycnp: Downloading media selection JSON
[bbc] p02xycnp: Downloading media selection JSON
[download] Downloading playlist: Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?
[bbc] Playlist Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02xycnp: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_10_p02xycnp.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/sport/football/1182000/1182292_h264_1500k.mp4?__gda__=1713830178_3637759492b847bdefa74e131c40b760"
[download] Destination: test_BBC_10_p02xycnp.mp4
[download] 100% of   10.00KiB in 00:00:00 at 30.64KiB/s
[download] Finished downloading playlist: Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/34475836
[bbc] 34475836: Downloading webpage
[bbc] p034ppnv: Downloading media selection JSON
[bbc] p034ppnv: Downloading media selection JSON
[bbc] Downloading playlist 34475836 - add --no-playlist to download just the video p034ppnv
[bbc] p034pzyp: Downloading media selection JSON
[bbc] p034pzyp: Downloading media selection JSON
[bbc] p018lnzb: Downloading media selection JSON
[bbc] p018lnzb: Downloading media selection JSON
[download] Downloading playlist: Jurgen Klopp: Furious football from a witty and winning coach
[bbc] Playlist Jurgen Klopp: Furious football from a witty and winning coach: Downloading 3 items of 3
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034ppnv: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p034ppnv.info.json
[download] Downloading item 2 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034pzyp: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p034pzyp.info.json
[download] Downloading item 3 of 3
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p018lnzb: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_11_p018lnzb.info.json
[download] Finished downloading playlist: Jurgen Klopp: Furious football from a witty and winning coach
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.com/sport/0/football/34475836
[bbc] 34475836: Downloading webpage
[bbc] p034ppnv: Downloading media selection JSON
[bbc] p034ppnv: Downloading media selection JSON
[bbc] Downloading just the video p034ppnv because of --no-playlist
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p034ppnv: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_12_p034ppnv.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/sport/football/1199000/1199036_h264_1500k.mp4?__gda__=1713830185_c7da95063abc356f9c4b85dfe06e9307"
[download] Destination: test_BBC_12_p034ppnv.mp4
[download] 100% of   10.00KiB in 00:00:00 at 88.56KiB/s
Skipping BBC: redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt
Skipped test_BBC_13
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/news/science-environment-33661876
[bbc] science-environment-33661876: Downloading webpage
[bbc] p02xzws1: Downloading media selection JSON
[bbc] p02xzws1: Downloading media selection JSON
[download] Downloading playlist: New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa
[bbc] Playlist New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p02xzws1: Downloading 1 format(s): mf_akamai_world_plain_https-1
[info] Writing video metadata as JSON to: test_BBC_14_p02xzws1.info.json
[debug] Invoking http downloader on "https://vod-pro-ww-live.akamaized.net/mps_h264_hi/public/news/science_environment/1182000/1182405_h264_1500k.mp4?__gda__=1713830186_96b6f09383f6d3bec1e5c613f4ec4eeb"
[download] Destination: test_BBC_14_p02xzws1.mp4
[download] 100% of   10.00KiB in 00:00:00 at 76.69KiB/s
[download] Finished downloading playlist: New Horizons probe: Pluto may have 'nitrogen glaciers' - Nasa
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/news/av/world-europe-59468682
[bbc] world-europe-59468682: Downloading webpage
[bbc] p0b779gc: Downloading media selection JSON
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading media selection JSON
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading m3u8 information
[bbc] p0b779gc: Downloading MPD manifest
[bbc] p0b779gc: Downloading MPD manifest
[download] Downloading playlist: Why France is declaring Josephine Baker a national hero
[bbc] Playlist Why France is declaring Josephine Baker a national hero: Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p0b779gc: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_15_p0b779gc.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_hd/77c6b7-p0b779gc/vf_p0b779gc_dcca2be6-4e96-49cf-a657-b913b0dee50b.ism/vf_p0b779gc_dcca2be6-4e96-49cf-a657-b913b0dee50b-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 17
[download] Destination: test_BBC_15_p0b779gc.mp4
[download] 100% of    1.43MiB in 00:00:01 at 1.22MiB/s
[download] Finished downloading playlist: Why France is declaring Josephine Baker a national hero
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/news/uk-68546268
[bbc] uk-68546268: Downloading webpage
[bbc] p0hj0lq7: Downloading media selection JSON
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading media selection JSON
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading m3u8 information
[bbc] p0hj0lq7: Downloading MPD manifest
[bbc] p0hj0lq7: Downloading MPD manifest
[download] Downloading playlist: Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing'
[bbc] Playlist Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing': Downloading 1 items of 1
[debug] The information of all playlist entries will be held in memory
[download] Downloading item 1 of 1
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p0hj0lq7: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_16_p0hj0lq7.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_hd/c6a059-p0hj0lq7/vf_p0hj0lq7_899c8d45-62be-44e3-ad46-adf717b33d81.ism/vf_p0hj0lq7_899c8d45-62be-44e3-ad46-adf717b33d81-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 14
[download] Destination: test_BBC_16_p0hj0lq7.mp4
[download] 100% of    1.66MiB in 00:00:01 at 1.05MiB/s
[download] Finished downloading playlist: Israel-Gaza war: David Cameron says BBC report into Nasser hospital raid 'very disturbing'
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.co.uk/bbcthree/clip/73d0bbd0-abc3-4cea-b3c0-cdae21905eb1
[bbc] 73d0bbd0-abc3-4cea-b3c0-cdae21905eb1: Downloading webpage
[bbc] p06556y7: Downloading media selection JSON
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading media selection JSON
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading m3u8 information
[bbc] p06556y7: Downloading MPD manifest
[bbc] p06556y7: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p06556y7: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_17_p06556y7.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/9e2770-p06556y7/vf_p06556y7_56902990-bf2e-4b2d-becf-2f71f5ce9fcb.ism/vf_p06556y7_56902990-bf2e-4b2d-becf-2f71f5ce9fcb-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 44
[download] Destination: test_BBC_17_p06556y7.mp4
[download] 100% of    1.74MiB in 00:00:01 at 1.21MiB/s
Skipping BBC: 404 Not Found
Skipped test_BBC_18
[debug] Loaded 1855 extractors
[bbc] Extracting URL: http://www.bbc.co.uk/learningenglish/chinese/features/lingohack/ep-181227
[bbc] ep-181227: Downloading webpage
[bbc.co.uk] Extracting URL: https://www.bbc.co.uk/programmes/p06w9twp
[bbc.co.uk] p06w9twp: Downloading video page
[bbc.co.uk] p06w9tws: Downloading media selection JSON
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading media selection JSON
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading m3u8 information
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[bbc.co.uk] p06w9tws: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p06w9tws: Downloading 1 format(s): mf_cloudfront-1802-1
[info] Writing video metadata as JSON to: test_BBC_19_p06w9tws.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/877dad-p06w9tws/vf_p06w9tws_591fefab-58d0-4f81-8e5e-63483172ff76.ism/vf_p06w9tws_591fefab-58d0-4f81-8e5e-63483172ff76-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 24
[download] Destination: test_BBC_19_p06w9tws.mp4
[download] 100% of    1.56MiB in 00:00:01 at 1.09MiB/s
[debug] Loaded 1855 extractors
[bbc] Extracting URL: https://www.bbc.com/reel/video/p07c6sb6/how-positive-thinking-is-harming-your-happiness
[bbc] how-positive-thinking-is-harming-your-happiness: Downloading webpage
[bbc] p07c6sb9: Downloading media selection JSON
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading media selection JSON
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading m3u8 information
[bbc] p07c6sb9: Downloading MPD manifest
[bbc] p07c6sb9: Downloading MPD manifest
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] p07c6sb9: Downloading 1 format(s): mf_cloudfront-1802-3
[info] Writing video metadata as JSON to: test_BBC_20_p07c6sb9.info.json
[debug] Invoking hlsnative downloader on "https://vod-hls-uk.live.cf.md.bbci.co.uk/usp/auth/vod/piff_abr_full_sd/bd909e-p07c6sb9/vf_p07c6sb9_82310bda-114a-4e05-b503-8dc0985cde66.ism/vf_p07c6sb9_82310bda-114a-4e05-b503-8dc0985cde66-audio_eng=96000-video=1604000.m3u8"
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 31
[download] Destination: test_BBC_20_p07c6sb9.mp4
[download] 100% of    1.50MiB in 00:00:01 at 1.19MiB/s
Skipping BBC: 404 Not Found
Skipped test_BBC_21
.
----------------------------------------------------------------------
Ran 1 test in 79.212s

OK
$

Skipped tests:

  • test_BBC_4: now SIMORGH_DATA with no video
  • test_BBC_5: video extraction failed (now in .pageData.promo.media of SIMORGH_DATA)
  • test_BBC_6: 404 Not Found
  • test_BBC_8: redirects to TopGear home page
  • test_BBC_9: Video no longer in page
  • test_BBC_13: redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt
  • test_BBC_18: 404 Not Found
  • test_BBC_21: 404 Not Found
Patch for the above extractor version:
--- 6ef8990/yt-dlp/yt_dlp/extractor/bbc.py
+++ new/yt-dlp/yt_dlp/extractor/bbc.py
@@ -17,6 +17,7 @@
     int_or_none,
     join_nonempty,
     js_to_json,
+    merge_dicts,
     parse_duration,
     parse_iso8601,
     parse_qs,
@@ -43,6 +44,7 @@
                             iplayer(?:/[^/]+)?/(?:episode/|playlist/)|
                             music/(?:clips|audiovideo/popular)[/#]|
                             radio/player/|
+                            sounds/play/|
                             events/[^/]+/play/[^/]+/
                         )
                         (?P<id>%s)(?!/(?:episodes|broadcasts|clips))
@@ -623,6 +625,7 @@
         'info_dict': {
             'id': '3662a707-0af9-3149-963f-47bea720b460',
             'title': 'BUGGER',
+            'description': r're:BUGGER  The recent revelations by the whistleblower Edward Snowden were fascinating. .{211}\.{3}$',
         },
         'playlist_count': 18,
     }, {
@@ -631,14 +634,14 @@
         'info_dict': {
             'id': 'p02mprgb',
             'ext': 'mp4',
-            'title': 'Aerial footage showed the site of the crash in the Alps - courtesy BFM TV',
-            'description': 'md5:2868290467291b37feda7863f7a83f54',
-            'duration': 47,
+            'title': 'Germanwings crash site aerial video',
+            'description': r're:(?s)Aerial video showed the site where the Germanwings flight 4U 9525, .{156} BFM TV\.$',
+            'duration': None,  # 47,
             'timestamp': 1427219242,
             'upload_date': '20150324',
+            'thumbnail': 'https://ichef.bbci.co.uk/news/1024/media/images/81879000/jpg/_81879090_81879089.jpg',
         },
         'params': {
-            # rtmp download
             'skip_download': True,
         }
     }, {
@@ -656,7 +659,8 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        'skip': 'now SIMORGH_DATA with no video',
     }, {
         # single video embedded with data-playable containing XML playlists (regional section)
         'url': 'http://www.bbc.com/mundo/video_fotos/2015/06/150619_video_honduras_militares_hospitales_corrupcion_aw',
@@ -670,7 +674,9 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        # TODO: now in .pageData.promo.media of SIMORGH_DATA
+        'skip': 'video extraction failed',
     }, {
         # single video from video playlist embedded with vxp-playlist-data JSON
         'url': 'http://www.bbc.com/news/video_and_audio/must_see/33376376',
@@ -683,22 +689,22 @@
         },
         'params': {
             'skip_download': True,
-        }
+        },
+        'skip': '404 Not Found',
     }, {
         # single video story with digitalData
         'url': 'http://www.bbc.com/travel/story/20150625-sri-lankas-spicy-secret',
         'info_dict': {
             'id': 'p02q6gc4',
-            'ext': 'flv',
-            'title': 'Sri Lanka’s spicy secret',
-            'description': 'As a new train line to Jaffna opens up the country’s north, travellers can experience a truly distinct slice of Tamil culture.',
-            'timestamp': 1437674293,
-            'upload_date': '20150723',
-        },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
+            'ext': 'mp4',
+            # page title: 'Sri Lanka’s spicy secret',
+            'title': 'Tasting the spice of life in Jaffna',
+            # page description: 'As a new train line to Jaffna opens up the country’s north, travellers can experience a truly distinct slice of Tamil culture.',
+            'description': r're:(?s)BBC Travel Show’s Henry Golding explores the city of Jaffna .{149} aftertaste\.$',
+            'timestamp': 1437935638,   # was: 1437674293,
+            'upload_date': '20150726',
+            'duration': 255,
+        },
     }, {
         # single video story without digitalData
         'url': 'http://www.bbc.com/autos/story/20130513-hyundais-rock-star',
@@ -710,12 +716,10 @@
             'timestamp': 1415867444,
             'upload_date': '20141113',
         },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
+        'skip': 'redirects to TopGear home page',
     }, {
         # single video embedded with Morph
+        # TODO: replacement test page
         'url': 'http://www.bbc.co.uk/sport/live/olympics/36895975',
         'info_dict': {
             'id': 'p041vhd0',
@@ -726,27 +730,22 @@
             'uploader': 'BBC Sport',
             'uploader_id': 'bbc_sport',
         },
-        'params': {
-            # m3u8 download
-            'skip_download': True,
-        },
-        'skip': 'Georestricted to UK',
-    }, {
-        # single video with playlist.sxml URL in playlist param
+        'skip': 'Video no longer in page',
+    }, {
+        # single video in __INITIAL_DATA__ (was: playlist.sxml URL in playlist param)
         'url': 'http://www.bbc.com/sport/0/football/33653409',
         'info_dict': {
             'id': 'p02xycnp',
             'ext': 'mp4',
-            'title': 'Transfers: Cristiano Ronaldo to Man Utd, Arsenal to spend?',
-            'description': 'BBC Sport\'s David Ornstein has the latest transfer gossip, including rumours of a Manchester United return for Cristiano Ronaldo.',
-            'duration': 140,
-        },
-        'params': {
-            # rtmp download
-            'skip_download': True,
-        }
-    }, {
-        # article with multiple videos embedded with playlist.sxml in playlist param
+            'title': 'Ronaldo to Man Utd, Arsenal to spend?',
+            'description': r'''re:(?s)BBC Sport's David Ornstein rounds up the latest transfer reports, .{359} here\.$''',
+            'timestamp': 1437750175,
+            'upload_date': '20150724',
+            'thumbnail': 'https://news.bbcimg.co.uk/media/images/69320000/png/_69320754_mmgossipcolumnextraaugust18.png',
+            'duration': None,  # 140,
+        },
+    }, {
+        # article with multiple videos embedded with Morph.setPayload
         'url': 'http://www.bbc.com/sport/0/football/34475836',
         'info_dict': {
             'id': '34475836',
@@ -755,6 +754,21 @@
         },
         'playlist_count': 3,
     }, {
+        # lead item from above playlist
+        'url': 'http://www.bbc.com/sport/0/football/34475836',
+        'info_dict': {
+            'id': 'p034ppnv',
+            'ext': 'mp4',
+            'title': 'All you need to know about Jurgen Klopp',
+            'timestamp': 1444335081,
+            'upload_date': '20151008',
+            'duration': 122.0,
+            'thumbnail': 'https://ichef.bbci.co.uk/onesport/cps/976/cpsprodpb/7542/production/_85981003_klopp.jpg',
+        },
+        'params': {
+            'noplaylist': True,
+        },
+    }, {
         # school report article with single video
         'url': 'http://www.bbc.co.uk/schoolreport/35744779',
         'info_dict': {
@@ -762,6 +776,7 @@
             'title': 'School which breaks down barriers in Jerusalem',
         },
         'playlist_count': 1,
+        'skip': 'redirects to Young Reporter home page https://www.bbc.co.uk/news/topics/cg41ylwv43pt',
     }, {
         # single video with playlist URL from weather section
         'url': 'http://www.bbc.com/weather/features/33601775',
@@ -783,10 +798,10 @@
         # video with window.__INITIAL_DATA__ and value as JSON string
         'url': 'https://www.bbc.com/news/av/world-europe-59468682',
         'info_dict': {
-            'id': 'p0b71qth',
+            'id': 'p0b779gc',  # was 'p0b71qth',
             'ext': 'mp4',
             'title': 'Why France is making this woman a national hero',
-            'description': 'md5:7affdfab80e9c3a1f976230a1ff4d5e4',
+            'description': r're:(?s)France is honouring the US-born 20th Century singer and activist Josephine .{291} Casseville$',
             'thumbnail': r're:https?://.+/.+\.jpg',
             'timestamp': 1638230731,
             'upload_date': '20211130',
@@ -830,6 +845,7 @@
             'uploader': 'Radio 3',
             'uploader_id': 'bbc_radio_three',
         },
+        'skip': '404 Not Found',
     }, {
         'url': 'http://www.bbc.co.uk/learningenglish/chinese/features/lingohack/ep-181227',
         'info_dict': {
@@ -837,6 +853,7 @@
             'ext': 'mp4',
             'title': 'md5:2fabf12a726603193a2879a055f72514',
             'description': 'Learn English words and phrases from this story',
+            'thumbnail': 'https://ichef.bbci.co.uk/images/ic/1200x675/p06pq9gk.jpg',
         },
         'add_ie': [BBCCoUkIE.ie_key()],
     }, {
@@ -849,7 +866,7 @@
             'alt_title': 'The downsides of positive thinking',
             'description': 'md5:fad74b31da60d83b8265954ee42d85b4',
             'duration': 235,
-            'thumbnail': r're:https?://.+/p07c9dsr.jpg',
+            'thumbnail': r're:https?://.+/p07c9dsr\.(?:jpg|webp|png)',
             'upload_date': '20190604',
             'categories': ['Psychology'],
         },
@@ -867,6 +884,7 @@
             'duration': 1800,
             'uploader_id': 'bbc_radio_three',
         },
+        'skip': '404 Not Found',
     }, {  # onion routes
         'url': 'https://www.bbcnewsd73hkzno2ini43t4gblxvycyac5aw4gnv7t2rccijh7745uqd.onion/news/av/world-europe-63208576',
         'only_matching': True,
@@ -1082,83 +1100,141 @@
                 }
 
         # Morph based embed (e.g. http://www.bbc.co.uk/sport/live/olympics/36895975)
-        # There are several setPayload calls may be present but the video
-        # seems to be always related to the first one
-        morph_payload = self._parse_json(
-            self._search_regex(
-                r'Morph\.setPayload\([^,]+,\s*({.+?})\);',
-                webpage, 'morph payload', default='{}'),
-            playlist_id, fatal=False)
+        # Several setPayload calls may be present but the video(s)
+        # should be in one that mentions leadMedia or videoData
+        morph_payload = self._search_json(
+            r'\bMorph\s*\.\s*setPayload\s*\([^,]+,', webpage, 'morph payload', playlist_id,
+            contains_pattern=r'\{(?:(?!</script>)[\s\S])+?(?:"leadMedia"|\\"videoData\\")\s*:(?:(?!</script>)[\s\S])+\}',
+            default={})
         if morph_payload:
-            components = try_get(morph_payload, lambda x: x['body']['components'], list) or []
-            for component in components:
-                if not isinstance(component, dict):
-                    continue
-                lead_media = try_get(component, lambda x: x['props']['leadMedia'], dict)
-                if not lead_media:
-                    continue
-                identifiers = lead_media.get('identifiers')
-                if not identifiers or not isinstance(identifiers, dict):
-                    continue
-                programme_id = identifiers.get('vpid') or identifiers.get('playablePid')
+            for component in traverse_obj(morph_payload, (
+                    'body', 'components', lambda _, v: v['props']['leadMedia']['identifiers'])):
+                lead_media = component['props']['leadMedia']
+                programme_id = traverse_obj(lead_media['identifiers'], 'vpid', 'playablePid', expected_type=str)
                 if not programme_id:
                     continue
                 title = lead_media.get('title') or self._og_search_title(webpage)
                 formats, subtitles = self._download_media_selector(programme_id)
-                description = lead_media.get('summary')
-                uploader = lead_media.get('masterBrand')
-                uploader_id = lead_media.get('mid')
-                duration = None
-                duration_d = lead_media.get('duration')
-                if isinstance(duration_d, dict):
-                    duration = parse_duration(dict_get(
-                        duration_d, ('rawDuration', 'formattedDuration', 'spokenDuration')))
                 return {
                     'id': programme_id,
                     'title': title,
-                    'description': description,
-                    'duration': duration,
-                    'uploader': uploader,
-                    'uploader_id': uploader_id,
+                    **traverse_obj(lead_media, {
+                        'description': ('summary', {str}),
+                        'duration': ('duration', ('rawDuration', 'formattedDuration', 'spokenDuration'), {parse_duration}),
+                        'uploader': ('masterBrand', {str}),
+                        'uploader_id': ('mid', {str}),
+                    }),
                     'formats': formats,
                     'subtitles': subtitles,
                 }
-
-        preload_state = self._parse_json(self._search_regex(
-            r'window\.__PRELOADED_STATE__\s*=\s*({.+?});', webpage,
-            'preload state', default='{}'), playlist_id, fatal=False)
-        if preload_state:
-            current_programme = preload_state.get('programmes', {}).get('current') or {}
-            programme_id = current_programme.get('id')
-            if current_programme and programme_id and current_programme.get('type') == 'playable_item':
-                title = current_programme.get('titles', {}).get('tertiary') or playlist_title
+            body = traverse_obj(morph_payload, (
+                'body', 'content', 'article', 'body',
+                {lambda s: self._parse_json(s, playlist_id, fatal=False)}))
+            added = False
+            for video_data in traverse_obj(body, (Ellipsis, 'videoData', {lambda v: v.get('pid') and v})):
+                if video_data.get('vpid'):
+                    video_id = video_data['vpid']
+                    formats, subtitles = self._download_media_selector(video_id)
+                    entry = {
+                        'id': video_id,
+                        'formats': formats,
+                        'subtitles': subtitles,
+                    }
+                else:
+                    video_id = video_data['pid']
+                    entry = self.url_result(
+                        'https://www.bbc.co.uk/programmes/%s' % video_id, BBCCoUkIE.ie_key(),
+                        video_id, url_transparent=True)
+                entry = merge_dicts(
+                    traverse_obj(morph_payload, (
+                        'body', 'content', 'article', {
+                            'timestamp': ('dateTimeInfo', 'dateTime', {parse_iso8601}),
+                        })), traverse_obj(video_data, {
+                            'thumbnail': (('iChefImage', 'image'), {url_or_none}, any),
+                            'title': (('title', 'caption'), {str}, any),
+                            'duration': ('duration', {parse_duration}),
+                        }), entry)
+                if video_data.get('isLead') and not self._yes_playlist(playlist_id, video_id):
+                    return entry
+                entries.append(entry)
+                added = True
+            if added:
+                playlist_title = traverse_obj(morph_payload, (
+                    'body', 'content', 'article', 'headline', {str})) or playlist_title
+                return self.playlist_result(
+                    entries, playlist_id, playlist_title, playlist_description)
+
+        # various PRELOADED_STATE JSON
+        preload_state = self._search_json(
+            r'window\.__(?:PWA_)?PRELOADED_STATE__\s*=', webpage,
+            'preload state', playlist_id, transform_source=js_to_json, default={})
+        # PRELOADED_STATE with current programmme
+        current_programme = traverse_obj(preload_state, (
+            'programmes', 'current', {dict}))
+        if current_programme:
+            programme_id = traverse_obj(current_programme, ('id', {str}))
+            if programme_id and current_programme.get('type') == 'playable_item':
+                title = traverse_obj(current_programme, ('titles', 'tertiary', {str})) or playlist_title
                 formats, subtitles = self._download_media_selector(programme_id)
-                synopses = current_programme.get('synopses') or {}
-                network = current_programme.get('network') or {}
-                duration = int_or_none(
-                    current_programme.get('duration', {}).get('value'))
-                thumbnail = None
-                image_url = current_programme.get('image_url')
-                if image_url:
-                    thumbnail = image_url.replace('{recipe}', 'raw')
                 return {
                     'id': programme_id,
                     'title': title,
-                    'description': dict_get(synopses, ('long', 'medium', 'short')),
-                    'thumbnail': thumbnail,
-                    'duration': duration,
-                    'uploader': network.get('short_title'),
-                    'uploader_id': network.get('id'),
+                    **traverse_obj(current_programme, {
+                        'description': ('synopses', ('long', 'medium', 'short'), {str}, any),
+                        'thumbnail': ('image_url', {lambda u: url_or_none(u.replace('{recipe}', 'raw'))}),
+                        'duration': ('duration', 'value', {int_or_none}),
+                        'uploader': ('network', 'short_title', {str}),
+                        'uploader_id': ('network', 'id', {str}),
+                    }),
                     'formats': formats,
                     'subtitles': subtitles,
-                    'chapters': traverse_obj(preload_state, (
-                        'tracklist', 'tracks', lambda _, v: float_or_none(v['offset']['start']), {
-                            'title': ('titles', {lambda x: join_nonempty(
-                                'primary', 'secondary', 'tertiary', delim=' - ', from_dict=x)}),
-                            'start_time': ('offset', 'start', {float_or_none}),
-                            'end_time': ('offset', 'end', {float_or_none}),
-                        })) or None,
+                    **traverse_obj(preload_state, {
+                        'chapters': (
+                            'tracklist', 'tracks', lambda _, v: float_or_none(v['offset']['start']), {
+                                'title': ('titles', {lambda x: join_nonempty(
+                                    'primary', 'secondary', 'tertiary', delim=' - ', from_dict=x)}),
+                                'start_time': ('offset', 'start', {float_or_none}),
+                                'end_time': ('offset', 'end', {float_or_none}),
+                            }
+                        )
+                    }),
                 }
+
+        # PWA_PRELOADED_STATE with article video asset
+        asset_id = traverse_obj(preload_state, (
+            'entities', 'articles', lambda k, _: k.rsplit('/', 1)[-1] == playlist_id,
+            'assetVideo', 0, {str}, any))
+        if asset_id:
+            video_id = traverse_obj(preload_state, ('entities', 'videos', asset_id, 'vpid', {str}))
+            if video_id:
+                article = traverse_obj(preload_state, (
+                    'entities', 'articles', lambda _, v: v['assetVideo'][0] == asset_id, any))
+
+                def image_url(image_id):
+                    return traverse_obj(preload_state, (
+                        'entities', 'images', image_id, 'url',
+                        {lambda u: url_or_none(u.replace('$recipe', 'raw'))}))
+
+                formats, subtitles = self._download_media_selector(video_id)
+                return {
+                    'id': video_id,
+                    **traverse_obj(preload_state, ('entities', 'videos', asset_id, {
+                        'title': ('title', {str}),
+                        'description': (('synopsisLong', 'synopsisMedium', 'synopsisShort'), {str}, any),
+                        'thumbnail': (0, {image_url}),
+                        'duration': ('duration', {int_or_none}),
+                    })),
+                    'formats': formats,
+                    'subtitles': subtitles,
+                    **traverse_obj(article, {
+                        'timestamp': ('displayDate', {parse_iso8601}),
+                    }),
+                }
+            else:
+                return self.url_result(
+                    'https://www.bbc.co.uk/programmes/%s' % asset_id, BBCCoUkIE.ie_key(),
+                    asset_id, playlist_title, display_id=playlist_id,
+                    description=playlist_description)
 
         bbc3_config = self._parse_json(
             self._search_regex(
@@ -1204,6 +1280,28 @@
                 return self.playlist_result(
                     entries, playlist_id, playlist_title, playlist_description)
 
+        k_int_or_none = functools.partial(int_or_none, scale=1000)
+
+        def parse_model(model):
+            '''Extract single video from model structure'''
+            item_id = traverse_obj(model, ('versions', 0, 'versionId', {str}))
+            if not item_id:
+                return
+            formats, subtitles = self._download_media_selector(item_id)
+            return {
+                'id': item_id,
+                'formats': formats,
+                'subtitles': subtitles,
+                **traverse_obj(model, {
+                    'title': ('title', {str}),
+                    'thumbnail': ('imageUrl', {lambda u: urljoin(url, u.replace('$recipe', 'raw'))}),
+                    'description': (
+                        'synopses', ('long', 'medium', 'short'), {str}, any),
+                    'duration': ('versions', 0, 'duration', {int}),
+                    'timestamp': ('versions', 0, 'availableFrom', {k_int_or_none}),
+                })
+            }
+
         initial_data = self._search_regex(
             r'window\.__INITIAL_DATA__\s*=\s*("{.+?}")\s*;', webpage,
             'quoted preload state', default=None)
@@ -1215,6 +1313,21 @@
             initial_data = self._parse_json(initial_data or '"{}"', playlist_id, fatal=False)
         initial_data = self._parse_json(initial_data, playlist_id, fatal=False)
         if initial_data:
+            added = False
+            for video_data in traverse_obj(initial_data, (
+                    'stores', 'article', 'articleBodyContent', lambda _, v: v['type'] == 'video')):
+                model = traverse_obj(video_data, (
+                    'model', 'blocks', lambda _, v: v['type'] == 'aresMedia',
+                    'model', 'blocks', lambda _, v: v['type'] == 'aresMediaMetadata',
+                    'model', {dict}, any))
+                entry = parse_model(model)
+                if entry:
+                    entries.append(entry)
+                    added = True
+            if added:
+                return self.playlist_result(
+                    entries, playlist_id, playlist_title, playlist_description)
+
             def parse_media(media):
                 if not media:
                     return
@@ -1248,18 +1361,17 @@
                         'timestamp': item_time,
                         'description': strip_or_none(item_desc),
                     })
-            for resp in (initial_data.get('data') or {}).values():
-                name = resp.get('name')
+
+            for resp in traverse_obj(initial_data, ('data', lambda _, v: v.get('name'))):
+                name = resp['name']
                 if name == 'media-experience':
                     parse_media(try_get(resp, lambda x: x['data']['initialItem']['mediaItem'], dict))
                 elif name == 'article':
-                    for block in (try_get(resp,
-                                          (lambda x: x['data']['blocks'],
-                                           lambda x: x['data']['content']['model']['blocks'],),
-                                          list) or []):
-                        if block.get('type') not in ['media', 'video']:
-                            continue
-                        parse_media(block.get('model'))
+                    for block in traverse_obj(resp, ('data', (
+                            None, ('content', 'model')), 'blocks',
+                            lambda _, v: v.get('type') in {'media', 'video'},
+                            'model', {dict})):
+                        parse_media(block)
             return self.playlist_result(
                 entries, playlist_id, playlist_title, playlist_description)
 
@@ -1267,26 +1379,6 @@
             return list(filter(None, map(
                 lambda s: self._parse_json(s, playlist_id, fatal=False),
                 re.findall(pattern, webpage))))
-
-        def parse_model(model):
-            '''Extract single video from model structure'''
-            item_id = traverse_obj(model, ('versions', 0, 'versionId', {str}))
-            if not item_id:
-                return
-            formats, subtitles = self._download_media_selector(item_id)
-            return {
-                'id': item_id,
-                'formats': formats,
-                'subtitles': subtitles,
-                **traverse_obj(model, {
-                    'title': ('title', {str}),
-                    'thumbnail': ('imageUrl', {lambda u: urljoin(url, u.replace('$recipe', 'raw'))}),
-                    'description': (
-                        'synopses', ('long', 'medium', 'short'), {str}, any),
-                    'duration': ('versions', 0, 'duration', {int}),
-                    'timestamp': ('versions', 0, 'availableFrom', {lambda x: int_or_none(x, scale=1000)}),
-                })
-            }
 
         # US accessed article with single embedded video (e.g.
         # https://www.bbc.com/news/uk-68546268)
@@ -1303,7 +1395,7 @@
                 if entry.get('timestamp') is None:
                     entry['timestamp'] = traverse_obj(next_data, (
                         ..., 'contents', lambda _, v: v['type'] == 'timestamp',
-                        'model', 'timestamp', {functools.partial(int_or_none, scale=1000)}, any))
+                        'model', 'timestamp', {k_int_or_none}, any))
                 entries.append(entry)
                 return self.playlist_result(
                     entries, playlist_id, playlist_title, playlist_description)
The yt-dlp branch where I ran this has the old networking and doesn't have the `all`, `any` traversal keys or the updated `_search_nextjs_data()`, but I believe I've reverted any resulting tweaks that were in the test version.
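
Many hunks above replace manual dict digging with traverse_obj. A small illustration of the traversal idioms the patch leans on (branching lambdas, `...`, and {callable} transforms), using made-up sample data rather than a real BBC payload:

from yt_dlp.utils import int_or_none, traverse_obj

data = {'media': [
    {'type': 'video', 'vpid': 'p0hj0lq7', 'duration': '47'},
    {'type': 'image', 'url': 'thumb.jpg'},
]}

# A lambda branches over list items and keeps only those matching the predicate;
# branching always yields a list of results.
vpids = traverse_obj(data, ('media', lambda _, v: v['type'] == 'video', 'vpid'))
# -> ['p0hj0lq7']

# `...` branches over everything; {callable} applies a transform, and keys that
# are missing (or whose transform returns None) are silently dropped.
durations = traverse_obj(data, ('media', ..., 'duration', {int_or_none}))
# -> [47]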

@kylegustavo (Contributor Author) commented Apr 23, 2024:

Looks like in the failing test, BBCIE and BBCCoUkIE match the same URL. We can modify the suitable() function to avoid the duplicate.
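
For reference, the usual yt-dlp idiom here is for the broader extractor's suitable() to defer to the more specific one(s). A rough sketch of that pattern follows; the real BBCIE in bbc.py defines its own _VALID_URL and excludes several more BBC extractors than shown.

from yt_dlp.extractor.bbc import BBCCoUkIE


class BBCIE(BBCCoUkIE):  # sketch only, mirroring the shape of the real class
    @classmethod
    def suitable(cls, url):
        # Let the more specific extractor claim its URLs first, so the same
        # link is never matched by both BBCCoUkIE and BBCIE.
        EXCLUDE_IE = (BBCCoUkIE,)  # illustrative; the real exclusion tuple is longer
        return False if any(ie.suitable(url) for ie in EXCLUDE_IE) else super().suitable(url)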

@kylegustavo (Contributor Author) commented:

We do not need the 'sound' matcher in the BBCCoUkIE. It creates a duplicate match that fails the duplicate-match test, and the corresponding test in BBCIE is currently skipped anyway. If we want that change as part of a BBC improvement for UK links with 'sound', it can go in a subsequent PR.

@dirkf (Contributor) commented Apr 24, 2024:

We do not need the 'sound' matcher in the BBCCoUkIE. ...

Quite right. The Sounds pages have the __PRELOADED_STATE__ hydration JSON and BBCIE works fine in the UK. You can check a current show https://www.bbc.co.uk/sounds/play/w3ct5rgx. This could replace the broken test.

BBCCoUkIE also happens to work, but it doesn't get as much metadata (using _download_playlist()).
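
A hedged skeleton of what such a replacement _TESTS entry could look like; every info_dict value below is a placeholder to be filled in from the live page, not verified data.

{
    'url': 'https://www.bbc.co.uk/sounds/play/w3ct5rgx',
    'info_dict': {
        'id': 'w3ct5rgx',      # placeholder: the extractor may report a playable pid instead
        'ext': 'mp4',          # placeholder
        'title': str,          # fill in the real title once confirmed
        'description': str,    # likewise
    },
    'params': {'skip_download': True},
},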

@dirkf (Contributor) left a review comment:

Nice work from kylegustavo!

@pukkandan (Member) left a review comment:

Pls don't mark comments as resolved. The reviewer will decide whether the conversation is resolved

Kyle Gonsalves added 2 commits April 26, 2024 12:47
@pukkandan (Member) commented:

Everything LGTM now. Pls make sure to test all the code branches one last time

@kylegustavo (Contributor Author) commented:

Checking in: what's the process for getting this change merged now? Do I need approval from @bashonly, and who has write privileges to squash and merge it?

@kylegustavo (Contributor Author) commented:

Hello, just checking in again. @pukkandan do you have write privileges?

@bashonly (Member) commented May 6, 2024:

@kylegustavo please be patient, I would like to give this a final review, but have not found the time yet. I will do so as soon as I can. Also, pukkandan is the owner of the repo

@kylegustavo (Contributor Author) commented:

Ok, thanks @bashonly, take your time. Just checking to see that this review is in the pipeline and not going stale.

Labels: site-bug (Issue with a specific website)
Projects: none yet
Successfully merging this pull request may close: Add new __NEXT_DATA__ parsing for videos on BBC (#9701)
5 participants