Handle non-unicode payload in Logstash. #16072
Conversation
I've left a number of comments in-line, filed a formal issue upstream (guyboertje/jrjackson#95), and provided an alternative based strongly on the original proof-of-concept utf8-coerce script from the linked issue.
It is regrettable that we need to walk and effectively deep-clone every object we are serializing, but without the upstream issue being fixed, I can see no other way.

We can avoid some copies by chaining non-destructive methods, and we don't need to create our own `Encoding::Converter` instances since Ruby's `String#encode` handles things nicely with pre-defined conversions.

We also need to be very careful not to mutate input during a serialization operation, and it is possible to achieve what we are looking to do without relying on exceptions for flow control.
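As a quick illustration of both points, here is a minimal sketch (not code from this PR; the `latin1` example string is invented) showing `String#encode` handling a pre-defined conversion and non-destructive calls chaining without touching the input:

```ruby
# Sketch: String#encode uses Ruby's built-in transcoders, so no explicit
# Encoding::Converter is needed, and the non-destructive encode/scrub
# calls chain without mutating the caller's string.
latin1 = "caf\xE9".force_encoding(Encoding::ISO_8859_1)

utf8 = latin1.encode(Encoding::UTF_8, invalid: :replace, undef: :replace).scrub

utf8             # => "café"
latin1.encoding  # => #<Encoding:ISO-8859-1> (input left untouched)
```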
```ruby
begin
  # non expensive `force_encoding` operation which changes the encoding metadata
  # otherwise unicode normalization rejects
  input_string = input_string.force_encoding(Encoding::UTF_8)
```
If the source `string_data` was not frozen, then `input_string` is a reference to the same object, and this will mutate the object we were given.
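A minimal sketch of the hazard being described (invented input; variable names mirror the PR's):

```ruby
# Sketch: String#force_encoding changes the receiver's encoding metadata
# in place and returns self, so an unfrozen caller-owned string is mutated.
string_data  = "caf\xE9".dup   # unfrozen input owned by the caller
input_string = string_data     # same object, not a copy

input_string.force_encoding(Encoding::UTF_8)

string_data.encoding  # => #<Encoding:UTF-8> -- the caller's object changed too
```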
```ruby
input_string = input_string.force_encoding(Encoding::UTF_8)
# force UTF-8 encoding as data might also have invalid bytes
# we try to normalize first, use replacement char with `scrub` if invalid bytes found
input_string.unicode_normalize! # use default :NFC normalization since decompositions may result multiple characters
```
The unicode-normalized form is not a requirement of the upstream issue. We only need the value to be valid unicode, and do not need to do the extra work to make the forms consistent.

We also don't want to mutate the object we were given.
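For illustration, a sketch (invented example strings) showing that both the composed and decomposed forms are already valid UTF-8, so normalization is not needed for validity:

```ruby
# Sketch: NFC ("é" as one code point) and NFD ("e" + combining acute)
# are byte-different but both valid UTF-8, so serialization does not
# require normalizing one into the other.
nfc = "\u00E9"   # é, composed
nfd = "e\u0301"  # e + COMBINING ACUTE ACCENT

nfc.valid_encoding?  # => true
nfd.valid_encoding?  # => true
nfc == nfd           # => false (different byte sequences)

nfd.unicode_normalize(:nfc) == nfc  # => true
```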
```ruby
rescue => e
  logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
  logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
  input_string.scrub!
```
This is a destructive change and in most cases will mutate the object we were given
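A sketch of the difference (invented example string): `scrub!` rewrites the receiver in place, while `scrub` returns a cleaned copy:

```ruby
# Sketch: scrub! replaces invalid bytes in the receiver itself (mutation),
# while scrub returns a cleaned copy and leaves the original untouched.
original = "ab\xFFcd".dup.force_encoding(Encoding::UTF_8)

copy = original.scrub      # => "ab\uFFFDcd"
original.valid_encoding?   # => false -- original still has the bad byte

original.scrub!            # mutates the receiver
original.valid_encoding?   # => true
```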
```ruby
def self.normalize_string_encoding(string_data)
  input_string = string_data.dup if string_data.frozen?
  input_string = string_data unless string_data.frozen?

  if input_string.encoding != Encoding::UTF_8
    encoding_converter = Encoding::Converter.new(input_string.encoding, Encoding::UTF_8)
    conversion_error, utf8_string = false, nil
    begin
      utf8_string = encoding_converter.convert(input_string).freeze
    rescue => e
      # we mostly get Encoding::UndefinedConversionError but let's do not expect surprise crashes
      logger.trace? && logger.trace("Could not convert, #{e.inspect}")
      conversion_error = true
    ensure
      # if we cannot convert with a standard way
      # we let normalize and replace invalid unicode bytes
      return utf8_string unless conversion_error
    end
  end

  begin
    # non expensive `force_encoding` operation which changes the encoding metadata
    # otherwise unicode normalization rejects
    input_string = input_string.force_encoding(Encoding::UTF_8)
    # force UTF-8 encoding as data might also have invalid bytes
    # we try to normalize first, use replacement char with `scrub` if invalid bytes found
    input_string.unicode_normalize! # use default :NFC normalization since decompositions may result multiple characters
  rescue => e
    logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
    logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
    input_string.scrub!
  end
  input_string
end
```
I applied the basic premise of my proof-of-concept utf8-coerce script, with the only change being that when an incoming string is flagged as `BINARY` we operate on a copy that has been force-encoded UTF-8, so that we avoid turning valid UTF-8 sequences in binary-flagged strings into mojibake.

Notably, this:

- avoids mutating the incoming `string_data`
- avoids copy operations on all valid unicode input
- avoids using exceptions for flow control
- passes all tests
```ruby
def self.normalize_string_encoding(string_data)
  intermediate = string_data
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  intermediate = intermediate.dup.force_encoding(Encoding::UTF_8) if intermediate.encoding == Encoding::BINARY

  lossy_conversion = nil
  replace_and_flag = ->(_) { lossy_conversion = true; UTF8_REPLACEMENT_CHAR }

  normalized = intermediate.scrub(&replace_and_flag)
                           .encode(Encoding::UTF_8, fallback: replace_and_flag)

  if lossy_conversion && logger.trace?
    inspection = {
      encoding:       string_data.encoding,
      valid_encoding: string_data.valid_encoding?,
      bytes:          string_data.bytes,
    }
    logger.trace("LOSSY UTF-8 CONVERSION: #{inspection}")
  end

  return normalized
end
```
Where:

```ruby
UTF8_REPLACEMENT_CHAR = "\u{FFFD}".force_encoding('UTF-8').freeze
```
Notably, the bulk of the above is to enable trace-level logging, which I believe to be overkill. The simplified form is:
```ruby
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  string_data = string_data.dup.force_encoding(Encoding::UTF_8) if string_data.encoding == Encoding::BINARY

  string_data.scrub
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```
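If it helps to see the suggested logic in action, here is a standalone sketch (logger and module context omitted; example inputs invented) exercising the BINARY-guessing path:

```ruby
# Standalone sketch of the suggested behavior; the method name comes from
# the PR, but this stripped-down version omits the enclosing module.
def normalize_string_encoding(string_data)
  # guess UTF-8 for BINARY-flagged input so valid sequences survive
  string_data = string_data.dup.force_encoding(Encoding::UTF_8) if string_data.encoding == Encoding::BINARY
  string_data.scrub
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

binary_utf8 = "caf\xC3\xA9".b            # BINARY-flagged, but valid UTF-8 bytes
normalize_string_encoding(binary_utf8)   # => "café" (sequence preserved)

latin1 = "caf\xE9".force_encoding(Encoding::ISO_8859_1)
normalize_string_encoding(latin1)        # => "café" (transcoded)
```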
The simplified form without any copy operations, using `String#encode`'s ability to take a source encoding instead of assuming the one the string is flagged with:
```ruby
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  source_encoding = string_data.encoding
  source_encoding = Encoding::UTF_8 if source_encoding == Encoding::BINARY

  string_data.encode(Encoding::UTF_8, source_encoding, invalid: :replace, undef: :replace).scrub
end
```
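A sketch (invented input) of what passing the source encoding buys: the BINARY-flagged string is decoded as UTF-8 without a `dup`/`force_encoding` copy:

```ruby
# Sketch: the two-argument form String#encode(dst, src, **opts) overrides
# the string's own encoding flag, so BINARY bytes can be read as UTF-8
# without first copying the string.
binary = "caf\xC3\xA9".b   # valid UTF-8 bytes, flagged ASCII-8BIT

binary.encode(Encoding::UTF_8, Encoding::UTF_8,
              invalid: :replace, undef: :replace).scrub
# => "café" -- multibyte sequence preserved, input flag untouched
```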
- Nice, TIL that the `force_encoding` -> `unicode_normalize` behavior can be achieved with `String#encode` 👍 Thank you!
- One thing I am not getting is forcing to UTF-8 when the source encoding is a binary (`ASCII-8BIT`) format.
- It seems we don't need `.scrub` since we used the `invalid: :replace, undef: :replace` options.

Let me know if I missed anything.
`String#encode` does not normalize the representation of code points unless the source and destination encodings differ from each other, and then only does so as defined by the specific transcoder between those encodings. We don't need to normalize the different ways of representing a code point, because doing so unnecessarily changes bytes whose values are already equivalent. Normalization is usually useful to make binary comparisons possible, such as when using the sequence as a unique identifier.

When the encoding is labeled as `BINARY`/`ASCII-8BIT`, any sequence of bytes is considered "valid", and when that is converted to UTF-8, any single byte above the 7-bit ASCII plane cannot be converted to UTF-8 because there is no mapping, so each of those bytes will be replaced with the unicode replacement character. If we "guess" that a `BINARY`-flagged sequence might actually be UTF-8, then when the conversion is done all 7-bit characters will translate exactly the same, but any actually-valid multibyte UTF-8 sequences will be preserved. It is never more lossy, and can be less lossy in some circumstances.
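A sketch (invented byte sequence) of the "never more lossy" claim:

```ruby
# Sketch: converting BINARY-flagged bytes straight to UTF-8 replaces every
# byte above 0x7F, while "guessing" UTF-8 first preserves any sequences
# that really were valid UTF-8.
bytes = "caf\xC3\xA9".b   # "café" encoded as UTF-8, but flagged BINARY

bytes.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
# => "caf\uFFFD\uFFFD" -- both bytes of the é sequence are lost

bytes.encode(Encoding::UTF_8, Encoding::UTF_8, invalid: :replace, undef: :replace)
# => "café" -- the valid multibyte sequence survives
```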
The `String#scrub` seems silly, but is necessary because `String#encode` is a no-op when the source and destination encodings are identical; a string that is flagged as `UTF-8` but is invalid will remain invalid when encoded to `UTF-8`.
Agree with all points mentioned here! Thanks so much for explaining in detail. ❤️
The suggestion is no longer active (probably after my rebase and force push). I have added this change with my recent commit and added you to the co-authors list. Thank you so much!
Simplification came through code review. Co-authored-by: Ry Biesemeyer <[email protected]>
…ce with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>
cc @mashhurs
LGTM.
Can you backport to 8.14 too?
@logstashmachine backport 8.14
* A logic to handle non-unicode payload in Logstash.
* Well tested and code organized version of the logic. Co-authored-by: Ry Biesemeyer <[email protected]>
* Upgrade jrjackson to 0.4.20
* Code review: simplify the logic with a standard String#encode interface with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)

* A logic to handle non-unicode payload in Logstash. Co-authored-by: Ry Biesemeyer <[email protected]>
* Upgrade jrjackson to 0.4.20
* Code review: simplify the logic with a standard String#encode interface with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)
Release notes
[rn:skip]
What does this PR do?

Logstash's source pieces/tools that handle invalid unicode payloads (passed through `LogStash::Json.dump(invalid_unicode_payload)`) don't deal with input encoding, and don't normalize the unicode bytestream if it was force-encoded (metadata vs actual bytes may mismatch in a Ruby String). This PR tries to properly:

- convert with `Encoding::Converter` to make a correct representation
- change the encoding metadata (`force_encoding(Encoding::UTF_8)`)
- normalize (`unicode_normalize`) to make sure the payload is valid unicode
- replace invalid bytes with the replacement char (`\uFFFD`)

Why is it important/What is the impact to the user?

Users who are ingesting data as a non-unicode stream may see strange encoding behavior, or if they are using `elasticsearch-output` <= 11.22.2 versions, ES rejects the events.

Checklist

- [ ] I have made corresponding change to the default configuration files (and/or docker env variables)

Author's Checklist

- `scrub` (applying replacement char)

How to test this PR locally

See the unit tests, or pull this change and feed it any invalid/valid unicode payloads (for now only _String_s please). Run `bin/logstash -f your-config.conf --log.level=trace` to see the pipeline unicode handling outputs.

Related issues