Provide a description that is worded well enough to be understood
In the cnn.py extractor, the CNNArticleIE InfoExtractor is not working. Extraction of the example test URL, 'https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/', fails when the extractor tries to match its regex against the page. The same failure occurs for the CNN landing page, https://www.cnn.com/ , which has a live video that could be captured. The extractor currently searches the whole webpage for the pattern r"video:\s*'([^']+)'" to find the video URL.
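To illustrate why the match fails, here is a minimal sketch using only the standard library. The old_style string is a hypothetical example of the inline-script form the pattern seems to have been written for; it is not taken from any real CNN page. The new_style string is trimmed from the markup quoted later in this report.

```python
import re

# The pattern currently used by CNNArticleIE, as shown in the traceback.
PATTERN = r"video:\s*'([^']+)'"

# Hypothetical (assumed) inline-script snippet of the kind the pattern expects.
old_style = "var config = { video: 'bestoftv/2014/12/21/ip-north-korea-obama.cnn' };"

# Attribute trimmed from the current page markup quoted in this report.
new_style = '<div data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn"></div>'

old_match = re.search(PATTERN, old_style)
new_match = re.search(PATTERN, new_style)

# The pattern captures the ID from the assumed old form only.
print(old_match.group(1) if old_match else None)
print(new_match)
```

Since the current markup carries the video ID in a data-video-id attribute rather than after a video: key, the pattern has nothing to match there.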
I am currently inspecting the raw page data, as returned by the line webpage = self._download_webpage(url, url_basename(url))
to find where the relevant data now lives. The page does appear to contain video URL data; the open question is how to extract it. These video-resource fields appear multiple times in the page, including in one long JSON line that begins with <script type="application/ld+json">{"@type":"NewsArticle". Various video resources are also available in tags like the one below.
<div data-uri="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930-h_e3a8724294bf41bdabb86bcb98d64f58-pageTop@published"
data-component-name="video-resource"
data-editable="settings"
class="video-resource"
data-fixed-ratio="16x9"
data-index="idx-0"
data-canonical-url-path="/videos/bestoftv/2014/12/21/ip-north-korea-obama.cnn"
data-video-instance="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930-h_e3a8724294bf41bdabb86bcb98d64f58-pageTop@published"
data-parent-uri="archive.cms.cnn.com/_components/video-resource/instances/h_6d28ba6b29615817f3e0a7ad032f1930@published"
data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn"
data-live=""
data-headline="Obama: Cyberattack not an act of war"
data-description="John King, Margaret Talev, Jonathan Martin, Robert Costa & Molly Ball on Pres. Obama's response to the Sony cyberattack."
data-duration="02:19"
data-source-html="<span class="video-resource__source"> - Source:
<a href="https://www.cnn.com/" class="video-resource__source-url">CNN</a>
</span>"
data-fave-thumbnails="{"big": { "uri": "https://media.cnn.com/api/v1/images/stellar/prod/141221092147-ip-north-korea-obama-00003823.jpg?q=x_0,y_0,h_720,w_1280,c_fill/h_540,w_960" }, "small": { "uri": "https://media.cnn.com/api/v1/images/stellar/prod/141221092147-ip-north-korea-obama-00003823.jpg?q=x_0,y_0,h_720,w_1280,c_fill/h_540,w_960" } }"
data-vr-video="false"
data-show-html="<!-- unable to render partial show without a supplied context -->"
data-byline-html="<div data-uri="archive.cms.cnn.com/_components/byline/instances/byline_h_6d28ba6b29615817f3e0a7ad032f1930-video-resource@published" class="byline" data-editable="settings">
</div>"
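One possible way to pull these fields out of the page is sketched below, using only the standard library rather than yt-dlp's own helpers. The function name find_video_resources and the SAMPLE string are illustrative assumptions, with SAMPLE trimmed from the tag above. The sketch also assumes that the raw page entity-escapes any markup nested inside attribute values, so that no literal > or " appears before the tag closes. How to turn a data-video-id value into a downloadable URL is not established in this report and is left open.

```python
import re
from html import unescape

# Trimmed from the tag quoted above; a real page carries many more attributes.
SAMPLE = '''<div data-component-name="video-resource"
  class="video-resource"
  data-canonical-url-path="/videos/bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-video-id="bestoftv/2014/12/21/ip-north-korea-obama.cnn"
  data-headline="Obama: Cyberattack not an act of war"
  data-duration="02:19">
</div>'''


def find_video_resources(webpage):
    """Return one dict of data-* attributes per video-resource opening tag.

    Hypothetical helper: keys are attribute names with the "data-" prefix
    removed, values are entity-decoded attribute values.
    """
    results = []
    # Find each opening <div ...> tag that declares the video-resource component.
    tag_re = r'<div\b[^>]*\bdata-component-name="video-resource"[^>]*>'
    for tag in re.findall(tag_re, webpage):
        # Collect every data-* attribute on that tag.
        pairs = re.findall(r'\bdata-([\w-]+)="([^"]*)"', tag)
        results.append({name: unescape(value) for name, value in pairs})
    return results


resources = find_video_resources(SAMPLE)
for resource in resources:
    print(resource.get('video-id'), resource.get('headline'))
```

A regex over attributes is fragile against markup changes; an HTML parser would be sturdier, but the sketch keeps to the same regex-based style the extractor already uses.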
Provide verbose output that clearly demonstrates the problem
Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
If using API, add 'verbose': True to YoutubeDL params instead
Copy the WHOLE output (starting with [debug] Command-line config) and insert it below
Complete Verbose Output
[debug] Command-line config: ['-vU', 'https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [ff0779267] (pip)
[debug] Python 3.11.6 (CPython arm64 64bit) - macOS-14.4.1-arm64-arm-64bit (OpenSSL 3.1.4 24 Oct 2023)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.05.07, mutagen-1.47.0, requests-2.31.0, sqlite3-3.44.0, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1810 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: [email protected] from yt-dlp/yt-dlp
yt-dlp is up to date ([email protected] from yt-dlp/yt-dlp)
[CNNArticle] Extracting URL: https://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/
[CNNArticle] obama-north-koreas-hack-not-war-but-cyber-vandalism: Downloading webpage
ERROR: [CNNArticle] Unable to extract cnn url; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 734, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/cnn.py", line 142, in _real_extract
    cnn_url = self._html_search_regex(r"video:\s*'([^']+)'", webpage, 'cnn url')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1357, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 1321, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
Region
United States