Handle non-unicode payload in Logstash. #16072

Merged
mashhurs merged 6 commits into elastic:main on May 16, 2024

Conversation

@mashhurs (Contributor) commented Apr 8, 2024

Release notes

[rn:skip]

What does this PR do?

Logstash's serialization path (e.g. LogStash::Json.dump(invalid_unicode_payload)) does not handle invalid unicode payloads: it neither deals with the input encoding nor normalizes the unicode bytestream when an encoding has been forced (in a Ruby String, the encoding metadata and the actual bytes may mismatch).
This PR tries to properly (a minimal sketch of the flow follows this list):

  • treat the input payload's encoding and use Ruby's Encoding::Converter to produce a correct representation
  • keep the bytestream as received and treat it as unicode (force_encoding(Encoding::UTF_8)) when the encoding converters do not understand the input
  • validate the claimed-unicode payload with Ruby's unicode normalization (unicode_normalize)
  • replace invalid unicode bytes with the replacement char (\uFFFD)
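
A minimal Ruby sketch of this flow (illustrative only; the method name and error handling here are assumptions, not the final implementation):

def to_valid_utf8(input)
  if input.encoding != Encoding::UTF_8
    begin
      # 1. try a proper transcode from the declared source encoding
      return Encoding::Converter.new(input.encoding, Encoding::UTF_8).convert(input)
    rescue EncodingError
      # 2. converters do not understand the input: keep the bytestream and treat it as UTF-8
    end
  end
  utf8 = input.dup.force_encoding(Encoding::UTF_8)
  begin
    utf8.unicode_normalize       # 3. validates (and normalizes) the claimed-unicode payload
  rescue ArgumentError           # "invalid byte sequence in UTF-8"
    utf8.scrub                   # 4. replace invalid bytes with "\uFFFD"
  end
end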

Why is it important/What is the impact to the user?

Users who ingest data as a non-unicode stream may see strange encoding behavior, and if they use elasticsearch output versions <= 11.22.2, ES rejects the events.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • Currently, only String is considered (for testing purposes); other data structures such as Hash also need handling.
  • Pseudo source applied for now. Separate the logic into two steps, conversion + normalization, and think about where best to place the scrub (applying the replacement char).
  • Tests with Hash & Array data structure use cases

How to test this PR locally

See the unit tests, or pull this change and feed it any invalid/valid unicode payloads (for now, Strings only, please).

  • The easy way is to emit invalid unicode events with the ruby filter, for example by using the following config (a quick irb check after this list shows why "\xAC" is invalid):
input { generator { count => 1 } }
filter { ruby { code => 'str = "\xAC"; event.set("message", str)' } }
output {
  elasticsearch {
    cloud_id => "your_cloud_id"
    cloud_auth => "elastic:PWD"
  }
  stdout { }
}
  • Run Logstash with trace mode, bin/logstash -f your-config.conf --log.level=trace, to see the pipeline's unicode-handling output
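
For reference, a quick irb check (hypothetical session) shows why "\xAC" makes an invalid payload:

"\xAC".valid_encoding?   # => false (flagged UTF-8, but 0xAC is not a valid UTF-8 sequence)
"\xAC".scrub             # => "\uFFFD" (the replacement char seen in the output below)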

Logs

[2024-04-10T13:16:15,103][TRACE][logstash.unicodenormalizer][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Could not normalize to unicode, #<ArgumentError: invalid byte sequence in UTF-8>
[2024-04-10T13:16:15,103][TRACE][logstash.unicodenormalizer][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Replacing invalid non-utf bytes with replacement char.
[2024-04-10T13:16:15,106][DEBUG][logstash.outputs.elasticsearch][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Sending final bulk request for batch. {:action_count=>1, :payload_size=>313, :content_length=>253, :batch_offset=>0}
{
      "@version" => "1",
          "host" => {
        "name" => "Mashhurs.local.host"
    },
    "@timestamp" => 2024-04-10T20:16:14.946669Z,
       "message" => "�",
         "event" => {
        "original" => "Hello world!",
        "sequence" => 0
    }
}

@mashhurs mashhurs added the >bug label Apr 8, 2024
@mashhurs mashhurs self-assigned this Apr 8, 2024
@mashhurs mashhurs changed the title from "Very basic plain logic to handle non-unicode payload in Logstash." to "Pseudo logic to handle non-unicode payload in Logstash." Apr 8, 2024
@mashhurs mashhurs changed the title from "Pseudo logic to handle non-unicode payload in Logstash." to "Handle non-unicode payload in Logstash." Apr 10, 2024
@mashhurs mashhurs marked this pull request as ready for review April 10, 2024 20:30

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@yaauie (Member) left a comment

I've left a number of comments in-line, filed a formal issue upstream (guyboertje/jrjackson#95), and provided an alternative based strongly on the original proof-of-concept utf8-coerce script from the linked issue.

It is regrettable that we need to walk and effectively create a deep clone of every object we are serializing, but without the upstream issue being fixed, I can see no other way.

We can avoid some copies by chaining non-destructive methods, and don't need to create our own Encoding::Converter instances since Ruby's String#encode handles things nicely with pre-defined conversions.

We also need to be very careful not to mutate the input during a serialization operation, and it is possible to achieve what we are looking to do without relying on exceptions for flow control.

logstash-core/lib/logstash/json.rb (outdated; resolved)
begin
  # inexpensive `force_encoding` operation which changes the encoding metadata,
  # which unicode normalization otherwise rejects
  input_string = input_string.force_encoding(Encoding::UTF_8)
Member:

if the source string_data was not frozen, then input_string is a reference to the same object and this will mutate the object we were given
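
A small illustration of the hazard (hypothetical values):

s = "caf\xC3\xA9".b                 # caller's string, flagged BINARY
t = s                               # a reference, not a copy
t.force_encoding(Encoding::UTF_8)   # returns self and mutates its encoding metadata
s.encoding                          # => #<Encoding:UTF-8>, i.e. the caller's object changed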

input_string = input_string.force_encoding(Encoding::UTF_8)
# force UTF-8 encoding as the data might also have invalid bytes;
# we try to normalize first, and use the replacement char with `scrub` if invalid bytes are found
input_string.unicode_normalize! # use default :NFC normalization since decompositions may result in multiple characters
Member:

The unicode normalized form is not a requirement of the upstream issue. We only need the value to be valid unicode, and do not need to do the extra work of making the forms consistent.

We also don't want to mutate the object we were given.

rescue => e
logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
input_string.scrub!
Member:

This is a destructive change and in most cases will mutate the object we were given

logstash-core/lib/logstash/util/unicode_normalizer.rb (outdated; resolved)
Comment on lines 31 to 59
def self.normalize_string_encoding(string_data)
  input_string = string_data.dup if string_data.frozen?
  input_string = string_data unless string_data.frozen?

  if input_string.encoding != Encoding::UTF_8
    encoding_converter = Encoding::Converter.new(input_string.encoding, Encoding::UTF_8)
    conversion_error, utf8_string = false, nil
    begin
      utf8_string = encoding_converter.convert(input_string).freeze
    rescue => e
      # we mostly get Encoding::UndefinedConversionError, but let's not expect surprise crashes
      logger.trace? && logger.trace("Could not convert, #{e.inspect}")
      conversion_error = true
    ensure
      # if we cannot convert in a standard way,
      # we normalize and replace invalid unicode bytes
      return utf8_string unless conversion_error
    end
  end

  begin
    # inexpensive `force_encoding` operation which changes the encoding metadata,
    # which unicode normalization otherwise rejects
    input_string = input_string.force_encoding(Encoding::UTF_8)
    # force UTF-8 encoding as the data might also have invalid bytes;
    # we try to normalize first, and use the replacement char with `scrub` if invalid bytes are found
    input_string.unicode_normalize! # use default :NFC normalization since decompositions may result in multiple characters
  rescue => e
    logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
    logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
    input_string.scrub!
  end
  input_string
end
Member:

I applied the basic premise of my proof-of-concept utf8-coerce script, with the only change being that when an incoming string is flagged as BINARY we operate on a copy that has been force-encoded to UTF-8, so that we avoid moji-baking valid utf-8 sequences in binary-flagged strings.

Notably, this:

  • avoids mutating the incoming string_data
  • avoids copy operations on all valid unicode input
  • avoids using exceptions for flow control
  • passes all tests
Suggested change (replacing the method quoted above):
def self.normalize_string_encoding(string_data)
  intermediate = string_data
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  intermediate = intermediate.dup.force_encoding(Encoding::UTF_8) if intermediate.encoding == Encoding::BINARY

  lossy_conversion = nil
  replace_and_flag = ->(_) { lossy_conversion = true; UTF8_REPLACEMENT_CHAR }

  normalized = intermediate.scrub(&replace_and_flag)
                           .encode(Encoding::UTF_8, fallback: replace_and_flag)

  if lossy_conversion && logger.trace?
    inspection = {
      encoding: string_data.encoding,
      valid_encoding: string_data.valid_encoding?,
      bytes: string_data.bytes,
    }
    logger.trace("LOSSY UTF-8 CONVERSION: #{inspection}")
  end

  return normalized
end

Where:

    UTF8_REPLACEMENT_CHAR = "\u{FFFD}".force_encoding('UTF-8').freeze

Notably, the bulk of the above is to enable trace-level logging, which I believe to be overkill. The simplified form is:

Suggested change (replacing the method quoted above):
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  string_data = string_data.dup.force_encoding(Encoding::UTF_8) if string_data.encoding == Encoding::BINARY

  string_data.scrub
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
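
For illustration, assuming the simplified method above is defined, the expected outcomes for a few hypothetical inputs would be:

normalize_string_encoding("héllo")          # already valid UTF-8: content returned unchanged
normalize_string_encoding("h\xACllo")       # invalid UTF-8: scrubbed to "h\uFFFDllo"
normalize_string_encoding("caf\xC3\xA9".b)  # BINARY holding valid UTF-8 bytes: "café"
normalize_string_encoding("ol\xE9".force_encoding(Encoding::ISO_8859_1))  # transcoded to "olé"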

Member:

The simplified form without any copy operations using String#encode's ability to take a source encoding instead of assuming the one that the string is flagged as:

Suggested change (replacing the method quoted above):
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  source_encoding = string_data.encoding
  source_encoding = Encoding::UTF_8 if source_encoding == Encoding::BINARY

  string_data.encode(Encoding::UTF_8, source_encoding, invalid: :replace, undef: :replace).scrub
end

Contributor Author:

  • Nice, TIL that the force_encoding -> unicode_normalize behavior can be achieved with String#encode 👍 Thank you!
  • One thing I am not getting is why we enforce UTF-8 when the source encoding is a binary (ASCII-8BIT) format.
  • It seems we don't need .scrub since we use the invalid: :replace, undef: :replace options. Let me know if I missed anything.

Member:

The String#encode does not normalize the representation of code points unless the source and destination encodings differ from each other, and even then it only does so as defined by the specific transcoder between those encodings. We don't need to normalize the different ways of representing a code point, because doing so unnecessarily changes bytes whose values are already equivalent. Normalization is usually useful for making binary comparisons possible, such as when using the sequence as a unique identifier.
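
For example (hypothetical irb session), two equivalent representations of "é" stay byte-distinct through String#encode; only explicit normalization unifies them:

composed   = "\u00E9"    # "é" as a single code point (NFC form)
decomposed = "e\u0301"   # "é" as "e" plus a combining acute accent (NFD form)

composed == decomposed                                                   # => false
composed.encode(Encoding::UTF_8) == decomposed.encode(Encoding::UTF_8)   # => false (no-op)
composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)   # => true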

When the encoding is labeled as BINARY/ASCII-8BIT, any sequence of bytes is considered "valid", and when that is converted to UTF-8 any single byte above the 7-bit ASCII plane cannot be converted to UTF-8 because there is no mapping, so each of those bytes will be replaced with the unicode replacement character. If we "guess" that a BINARY-flagged sequence might actually be UTF-8, then after conversion all 7-bit characters translate exactly the same, but any actually-valid multibyte UTF-8 sequences are preserved. It is never more lossy, and can be less lossy in some circumstances.
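
A concrete illustration (hypothetical values) of why the BINARY guess is never more lossy:

bytes = "caf\xC3\xA9".b   # the valid UTF-8 bytes for "café", flagged BINARY

# converting straight from BINARY: each byte above the 7-bit ASCII plane has no
# mapping, so both bytes of the multibyte sequence become U+FFFD
bytes.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)   # => "caf\uFFFD\uFFFD"

# guessing UTF-8 first preserves the valid multibyte sequence
bytes.dup.force_encoding(Encoding::UTF_8).scrub                     # => "café"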

The String#scrub seems silly, but is necessary because String#encode is a no-op when the source and destination encodings are identical; a string that is flagged as UTF-8 but is invalid will remain invalid when encoded to UTF-8.
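
In irb terms (hypothetical session):

bad = "abc\xACdef".force_encoding(Encoding::UTF_8)

bad.valid_encoding?                           # => false
# encode is a no-op when source and destination encodings match,
# so the invalid byte survives:
bad.encode(Encoding::UTF_8).valid_encoding?   # => false
# scrub is what actually repairs it:
bad.scrub                                     # => "abc\uFFFDdef"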

Contributor Author:

Agree with all points mentioned here! Thanks so so much for explaining in detail. ❤️

Contributor Author:

The suggestion is no longer active (probably after my rebase and force-push); I have added this change in my recent commit and added you to the co-authors list. Thank you so much!


Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@elasticmachine (Collaborator)

💚 Build Succeeded

History

cc @mashhurs

@mashhurs mashhurs requested a review from yaauie May 14, 2024 15:34
@yaauie (Member) left a comment

LGTM.

Can you backport to 8.14 too?

@mashhurs mashhurs merged commit 979d30d into elastic:main May 16, 2024
6 checks passed
@mashhurs (Contributor Author):

@logstashmachine backport 8.14

mashhurs added a commit to mashhurs/logstash that referenced this pull request May 16, 2024
* A logic to handle non-unicode payload in Logstash.

* Well tested and code organized version of the logic.

Co-authored-by: Ry Biesemeyer <[email protected]>

* Upgrade jrjackson to 0.4.20

* Code review: simplify the logic with a standard String#encode interface with replace option.

Co-authored-by: Ry Biesemeyer <[email protected]>

---------

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)
mashhurs added a commit that referenced this pull request May 21, 2024
* A logic to handle non-unicode payload in Logstash.

Co-authored-by: Ry Biesemeyer <[email protected]>

* Upgrade jrjackson to 0.4.20

* Code review: simplify the logic with a standard String#encode interface with replace option.

Co-authored-by: Ry Biesemeyer <[email protected]>

---------

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)
@yaauie yaauie added the v8.15.0 label May 24, 2024
Development

Successfully merging this pull request may close these issues.

Logstash::Json#dump writes invalid JSON when source contains non-UTF8 strings
4 participants