v1.21.3-at.20231213.01
Wget-AT 20231213.01 (Wget 1.21.3-at.20231213.01) Release Notes
This release adds recording of minimal SSL/TLS information in the WARC about the connection used to send and receive data. In addition, the release lets Wget-AT keep track of the protocols used.
WARC-Cipher-Suite header
Before this release, no details of the SSL/TLS connection in use were recorded. The only information available was the URI, which would start with either https (indicating a secure connection with the SSL or TLS protocol) or http. Websites may (and it is confirmed that some will) return different HTTP responses depending on the details of the secure connection, which makes it important to store such information in WARCs.
The WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) does not specify how to store secure connection information. Adding support is being discussed in the issue at iipc/warc-specifications#86. This issue originally suggested the use of a WARC-TLS-Cipher-Suite WARC header, but also notes WARC-Cipher-Suite as an option.
WARC records
The cipher suite WARC header is written on records with WARC-Type value request and response when request and response data for an HTTPS URL is recorded. It is also written on a revisit record when that record is written because a response record was deduplicated.
A resource record may also carry this header, for example when an FTPS connection is used. Currently this is not done, as the FTP WARC record writing functionality in Wget-AT is not optimal. A major overhaul of it is being worked on, at which point this header will be written for data transferred over an FTPS connection.
This header is not currently written on a metadata record in Wget-AT, although it is supported on that record type according to iipc/warc-specifications#86.
Allowed header values
The value of this field is the IANA defined cipher suite name (https://www.iana.org/assignments/tls-parameters/tls-parameters.txt) for TLS, or for SSL a name as defined in RFC 6101 for SSLv3 or "The SSL Protocol" (https://www.ietf.org/archive/id/draft-hickman-netscape-ssl-00.txt) for SSLv2 (SSL version 0.2).
The recorded value should be the cipher suite that is in use for the connection. It should not be the cipher suite presented in the client hello step of the handshake, or any other value before a cipher suite is agreed on and application data is being transferred.
Using header name WARC-Cipher-Suite over WARC-TLS-Cipher-Suite
Having TLS in the WARC header name restricts it to storing only TLS cipher suites. The obsolete SSLv3 protocol uses SSL cipher suites whose defined names start with SSL_*. If a TLS connection is used, the cipher suite name starts with TLS_*. Leaving TLS out of the WARC header name allows both SSL and TLS cipher suites to be stored, while the first three characters of the header value still make clear whether SSL or TLS was used.
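The prefix check described above can be sketched in a few lines of Python. This is only an illustration for readers of a WARC, not code from Wget-AT; the function name `cipher_suite_family` is hypothetical.

```python
def cipher_suite_family(header_value: str) -> str:
    """Classify a WARC-Cipher-Suite value as 'ssl' or 'tls' by its name prefix.

    Hypothetical helper: it only inspects the first three characters,
    as the release notes describe, and does not validate the full name.
    """
    prefix = header_value.strip()[:3].upper()
    if prefix == "SSL":
        return "ssl"
    if prefix == "TLS":
        return "tls"
    raise ValueError(f"not an SSL_* or TLS_* cipher suite name: {header_value!r}")


print(cipher_suite_family("SSL_FORTEZZA_KEA_WITH_NULL_SHA"))
print(cipher_suite_family("TLS_AES_128_GCM_SHA256"))
```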
RFC 5246 notes that "cipher suite values { 0x00, 0x1C } and { 0x00, 0x1D } are reserved to avoid collision with Fortezza-based cipher suites in SSL 3." These two cipher suite values are defined in RFC 6101 as SSL_FORTEZZA_KEA_WITH_NULL_SHA and SSL_FORTEZZA_KEA_WITH_FORTEZZA_CBC_SHA, respectively. While most SSL_* cipher suites have been assigned a similar TLS_* name for further use in the TLS protocol (for example, SSL_DHE_DSS_WITH_DES_CBC_SHA was named TLS_DHE_DSS_WITH_DES_CBC_SHA), some have not and are only known by their SSL_* defined names.
Value { 0x00, 0x1E } was defined with the name TLS_KRB5_WITH_DES_CBC_SHA in RFC 2712, while it holds an entirely different name and definition in RFC 6101: SSL_FORTEZZA_KEA_WITH_RC4_128_SHA. Some cipher suites, like SSL_CK_RC4_128_WITH_MD5 (used in SSL version 0.2, see "The SSL Protocol"), do not have a TLS_* defined name at all. These examples show that not all SSL_* cipher suites can simply be represented by a TLS_* cipher suite, and why cipher suites should be written with their SSL_* defined name when the SSL protocol is used. While the SSL protocol is obsolete, it can technically still be used, and should be accounted for in the WARC headers.
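To make the collision concrete, the following Python sketch keys cipher suite names by protocol family as well as by value. The table holds only the four assignments mentioned in this section. The names `CIPHER_SUITE_NAMES` and `cipher_suite_name` are hypothetical and not part of Wget-AT.

```python
# Only the assignments quoted in these release notes (RFC 6101 and RFC 2712).
# Keys are (protocol family, two-byte cipher suite value).
CIPHER_SUITE_NAMES = {
    ("ssl", (0x00, 0x1C)): "SSL_FORTEZZA_KEA_WITH_NULL_SHA",
    ("ssl", (0x00, 0x1D)): "SSL_FORTEZZA_KEA_WITH_FORTEZZA_CBC_SHA",
    ("ssl", (0x00, 0x1E)): "SSL_FORTEZZA_KEA_WITH_RC4_128_SHA",
    ("tls", (0x00, 0x1E)): "TLS_KRB5_WITH_DES_CBC_SHA",
}


def cipher_suite_name(family, value):
    """Return the name for a cipher suite value under the given protocol family.

    Hypothetical helper: returns None when the pair is not in the small table above.
    """
    return CIPHER_SUITE_NAMES.get((family, value))


# The same two bytes resolve to different names depending on the protocol.
print(cipher_suite_name("ssl", (0x00, 0x1E)))
print(cipher_suite_name("tls", (0x00, 0x1E)))
```

Looking a value up without the protocol family would be ambiguous for { 0x00, 0x1E }, which is why the name written to the header has to follow the protocol actually in use.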
In addition, a "cipher suite" is well defined and widely accepted as being either a TLS or SSL cipher suite, making WARC-Cipher-Suite a more minimal representation of the type of data to store than WARC-TLS-Cipher-Suite. Any future set of cipher suites that are neither SSL nor TLS cipher suites can also be written under the WARC-Cipher-Suite header.
WARC-Protocol header
In addition to the cipher suite, the SSL/TLS version should be recorded, for the same reason as given in the previous section. The WARC format currently does not define a way to store this version, but the issue at iipc/warc-specifications#42 discusses this. The definition proposed in that issue is implemented in this release.
The WARC-Protocol header is allowed on all records the WARC-Cipher-Suite header is allowed on.
Allowed header values
The allowed header values are as defined in the issue at iipc/warc-specifications#42 of which a subset is used in Wget-AT as of this release:
http/0.9
http/1.0
http/1.1
ssl/2
ssl/3
tls/1.0
tls/1.1
tls/1.2
tls/1.3
The value ftp is not currently in use, for the same reasons the WARC-Cipher-Suite header is not yet written on WARC records for FTPS URLs, as explained in the previous section.
As with WARC-Cipher-Suite, the value should be that of the connection actually used to transfer data, not anything used during negotiation.
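As an illustration of recording the negotiated version rather than an offered one, the sketch below maps the strings that Python's `ssl.SSLSocket.version()` returns after a handshake to the WARC-Protocol values listed above. It assumes a Python client and is not how Wget-AT (written in C) does it. The names `SSL_VERSION_TO_WARC_PROTOCOL` and `warc_protocol_for` are hypothetical.

```python
# Assumption: keys are the version strings Python's ssl module reports for an
# established connection; values are the WARC-Protocol values from the list above.
SSL_VERSION_TO_WARC_PROTOCOL = {
    "SSLv2": "ssl/2",
    "SSLv3": "ssl/3",
    "TLSv1": "tls/1.0",
    "TLSv1.1": "tls/1.1",
    "TLSv1.2": "tls/1.2",
    "TLSv1.3": "tls/1.3",
}


def warc_protocol_for(negotiated_version):
    """Map a negotiated SSL/TLS version string to a WARC-Protocol value.

    Hypothetical helper. `negotiated_version` is expected to come from the
    connection after the handshake has completed, never from the client hello.
    """
    try:
        return SSL_VERSION_TO_WARC_PROTOCOL[negotiated_version]
    except KeyError:
        raise ValueError(f"unknown SSL/TLS version: {negotiated_version!r}") from None


print(warc_protocol_for("TLSv1.3"))
```

With a connected `ssl.SSLSocket` named `sock`, the call would be `warc_protocol_for(sock.version())`.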
Exactly one of the http/* values is always written on the request and response records of an HTTP(S) URL, while one of the ssl/* or tls/* values is written only on a record of an HTTPS URL. An example of the WARC-Protocol headers written on a record with an HTTP/1.1 payload transferred over a TLSv1.3 connection is:
WARC-Protocol: http/1.1
WARC-Protocol: tls/1.3
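Because the header can appear more than once on a record, a reader has to collect all occurrences rather than keep only the last one. The following Python sketch does this for a raw WARC header block. The sample record and the function name `warc_protocols` are made up for illustration, and header continuation lines are not handled.

```python
def warc_protocols(header_block: str) -> list[str]:
    """Collect every WARC-Protocol value from a WARC record header block.

    Hypothetical helper: field names are matched case-insensitively and
    the values are returned in the order they appear.
    """
    values = []
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-protocol":
            values.append(value.strip())
    return values


# Made-up header block for illustration only.
EXAMPLE_HEADERS = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Protocol: http/1.1\r\n"
    "WARC-Protocol: tls/1.3\r\n"
    "WARC-Cipher-Suite: TLS_AES_256_GCM_SHA384\r\n"
)

print(warc_protocols(EXAMPLE_HEADERS))
```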
Minor features
One minor feature was added in this release:
- The Dockerfile now uses Debian bookworm instead of Debian bullseye.
Bug fixes
Three bugs have been fixed in this release:
- A bug is fixed that prevented the use of option --warc-cdx.
- The manual of Wget states that specifying a protocol of SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, or TLSv1_3 to option --secure-protocol forces the use of this protocol. In practice this was not the case: the protocol would be set as the minimum version. If --secure-protocol=TLSv1_1 was given, one of TLSv1_1, TLSv1_2, or TLSv1_3 would be used after negotiation. This is now fixed to follow the manual.
- If a URL was transformed from an HTTP to an HTTPS URL due to HSTS, the HTTP version of the URL would still be written in the WARC headers, while the HTTPS URL was used for data transfer. This is now fixed.
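The difference between "minimum version" and "forced version" in the --secure-protocol fix can be illustrated with Python's `ssl` module. This is an analogy, not Wget-AT's implementation. The function names are hypothetical, and TLSv1_2 is used only because older versions may be disabled on current systems.

```python
import ssl


def context_with_minimum(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Old behaviour in the analogy: only a lower bound, negotiation may go higher."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    return ctx


def context_forced_to(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Fixed behaviour in the analogy: lower and upper bound are the same version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


forced = context_forced_to(ssl.TLSVersion.TLSv1_2)
print(forced.minimum_version, forced.maximum_version)
```

A context with only a lower bound can still negotiate a newer version with the server, which is the behaviour the second bug fix removes.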