CNN Extractor CNNArticleIE is not working as expected #9719

kylegustavo · 2024-04-18T22:34:04Z

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

I'm reporting that yt-dlp is broken on a supported site
I've verified that I have updated yt-dlp to nightly or master (update instructions)
I've checked that all provided URLs are playable in a browser with the same IP and same login details
I've checked that all URLs and arguments with special characters are properly quoted or escaped
I've searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
I've read the guidelines for opening an issue
I've read about sharing account credentials and I'm willing to share it if required

Region

United States

Provide a description that is worded well enough to be understood

In the cnn.py extractor, the CNNArticleIE InfoExtractor is not working. Extraction of the example test URL, 'https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/', fails while trying to match the regex. The same happens for the CNN landing page, https://www.cnn.com/, which has a live video on the page that could be captured. The extractor currently searches the whole webpage for the pattern r"video:\s*'([^']+)'" to find the video URL.
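
To illustrate the mismatch, here is a minimal standalone sketch (plain `re`, not yt-dlp code). The `legacy_page` string is a hypothetical example of markup the old pattern would accept; the real legacy markup is not shown in this report. The `current_page` string reuses the `data-video-id` attribute from the page markup quoted later in this report.

```python
import re

# Pattern quoted in the description above (used by CNNArticleIE).
OLD_PATTERN = r"video:\s*'([^']+)'"

# Hypothetical snippet in the shape the old pattern expects (assumption).
legacy_page = "var config = { video: 'bestoftv/2014/12/21/ip-north-korea-obama.cnn' };"

# Attribute copied from the current page markup quoted in this report.
current_page = '<div data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn">'


def find_video(page):
    """Return the captured video path, or None when the pattern does not match."""
    match = re.search(OLD_PATTERN, page)
    return match.group(1) if match else None


print(find_video(legacy_page))   # the path is captured
print(find_video(current_page))  # None: no "video: '...'" text on the page
```

If the current page really contains no `video: '...'` text at all, that would explain why the search raises RegexNotFoundError instead of returning a URL.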

I'm currently looking through the raw web data, as captured by the line
webpage = self._download_webpage(url, url_basename(url))
to see where the appropriate data can be found. The page does seem to contain data for the video URL; we still have to work out how to extract it. The video_resource fields appear multiple times in the web data, including in one long JSON line that begins with <script type="application/ld+json">{"@type":"NewsArticle". Various video resources are also available in tags like the one below.

<div data-uri="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930-h_e3a8724294bf41bdabb86bcb98d64f58-pageTop@published"
  data-component-name="video-resource"
  data-editable="settings"
  class="video-resource"
  data-fixed-ratio="16x9"
  data-index="idx-0"  
  data-canonical-url-path="/videos/bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-video-instance="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930-h_e3a8724294bf41bdabb86bcb98d64f58-pageTop@published"
  data-parent-uri="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930@published"
  data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-live=""        
  data-headline="Obama: Cyberattack not an act of war"
  data-description="John King, Margaret Talev, Jonathan Martin, Robert Costa &amp; Molly Ball on Pres. Obama&#x27;s response to the Sony cyberattack."
  data-duration="02:19"
  data-source-html="<span class=&quot;video-resource__source&quot;> - Source:
                <a href=&quot;https://www.cnn.com/&quot; class=&quot;video-resource__source-url&quot;>CNN</a>
    </span>"          
  data-fave-thumbnails="{&quot;big&quot;: { &quot;uri&quot;: &quot;https://media.cnn.com/api/v1/images/stellar/prod/141221092147-ip-north-korea-obama-00003823.jpg?q&#x3D;x_0,y_0,h_720,w_1280,c_fill/h_540,w_960&quot; }, &quot;small&quot;: { &quot;uri&quot;: &quot;https://media.cnn.com/api/v1/images/stellar/prod/141221092147-ip-north-korea-obama-00003823.jpg?q&#x3D;x_0,y_0,h_720,w_1280,c_fill/h_540,w_960&quot; }  }"
  data-vr-video="false"
  data-show-html="<!-- unable to render partial show without a supplied context -->"
  data-byline-html="<div data-uri=&quot;archive.cms.cnn.com/_components/byline/instances/byline_h_6d28ba6b29615817f3e0a7ad032f1930-video-resource@published&quot; class=&quot;byline&quot; data-editable=&quot;settings&quot;>
</div>"
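
One possible way to pull these fields out is sketched below, using only the standard library. This is an assumption-laden illustration, not the extractor's actual fix: `VideoResourceParser` is a hypothetical helper name, the `sample` page is trimmed from the markup above, and the JSON-LD body is shortened to the one key visible in this report (its other fields are not known here).

```python
import json
from html.parser import HTMLParser


class VideoResourceParser(HTMLParser):
    """Collect video-resource data-* attributes and JSON-LD blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.resources = []  # one dict per video-resource element
        self.json_ld = []    # parsed application/ld+json payloads
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("data-component-name") == "video-resource":
            # Attribute values arrive already HTML-unescaped.
            self.resources.append({
                "id": attrs.get("data-video-id"),
                "path": attrs.get("data-canonical-url-path"),
                "title": attrs.get("data-headline"),
                "duration": attrs.get("data-duration"),
            })
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


# Trimmed sample based on the markup quoted above (assumption: shortened for brevity).
sample = '''
<script type="application/ld+json">{"@type":"NewsArticle"}</script>
<div data-component-name="video-resource"
  class="video-resource"
  data-canonical-url-path="/videos/bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-headline="Obama: Cyberattack not an act of war"
  data-duration="02:19"></div>
'''

parser = VideoResourceParser()
parser.feed(sample)
parser.close()
print(parser.resources[0]["id"])
print(parser.json_ld[0]["@type"])
```

Inside yt-dlp itself, the existing regex and JSON helpers on InfoExtractor would probably be a better fit than a separate parser class; the sketch only shows that the needed identifiers are present in the page attributes.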

Provide verbose output that clearly demonstrates the problem

Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
If using API, add 'verbose': True to YoutubeDL params instead
Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', 'https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [ff0779267] (pip)
[debug] Python 3.11.6 (CPython arm64 64bit) - macOS-14.4.1-arm64-arm-64bit (OpenSSL 3.1.4 24 Oct 2023)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.05.07, mutagen-1.47.0, requests-2.31.0, sqlite3-3.44.0, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1810 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: [email protected] from yt-dlp/yt-dlp
yt-dlp is up to date ([email protected] from yt-dlp/yt-dlp)
[CNNArticle] Extracting URL: https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/
[CNNArticle] obama-north-koreas-hack-not-war-but-cyber-vandalism: Downloading webpage
ERROR: [CNNArticle] Unable to extract cnn url; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 734, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/cnn.py", line 142, in _real_extract
    cnn_url = self._html_search_regex(r"video:\s*'([^']+)'", webpage, 'cnn url')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1357, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1321, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)


kylegustavo added the site-bug (Issue with a specific website) and triage (Untriaged issue) labels on Apr 18, 2024
