Handle non-unicode payload in Logstash. #16072
Conversation
I've left a number of comments in-line, filed a formal issue upstream (guyboertje/jrjackson#95), and provided an alternative based strongly on the original proof-of-concept utf8-coerce script from the linked issue.
It is regrettable that we need to walk and effectively deep-clone every object we are serializing, but without the upstream issue being fixed, I can see no other way.

We can avoid some copies by chaining non-destructive methods, and we don't need to create our own `Encoding::Converter` instances since Ruby's `String#encode` handles things nicely with pre-defined conversions.

We also need to be very careful not to mutate input during a serialization operation, and it is possible to achieve what we are looking to do without relying on exceptions for flow control.
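As a quick illustration of both points, here is a minimal sketch (not code from this PR; the `latin1` example string is invented) showing `String#encode` handling a pre-defined conversion and non-destructive calls chaining without touching the input:

```ruby
# Sketch: String#encode uses Ruby's built-in transcoders, so no explicit
# Encoding::Converter is needed, and the non-destructive encode/scrub
# calls chain without mutating the caller's string.
latin1 = "caf\xE9".force_encoding(Encoding::ISO_8859_1)

utf8 = latin1.encode(Encoding::UTF_8, invalid: :replace, undef: :replace).scrub

utf8             # => "café"
latin1.encoding  # => #<Encoding:ISO-8859-1> (input left untouched)
```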
```ruby
begin
  # non expensive `force_encoding` operation which changes the encoding metadata
  # otherwise unicode normalization rejects
  input_string = input_string.force_encoding(Encoding::UTF_8)
```
If the source `string_data` was not frozen, then `input_string` is a reference to the same object, and this will mutate the object we were given.
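A minimal sketch of the hazard being described (invented input; variable names mirror the PR's):

```ruby
# Sketch: String#force_encoding changes the receiver's encoding metadata
# in place and returns self, so an unfrozen caller-owned string is mutated.
string_data  = "caf\xE9".dup   # unfrozen input owned by the caller
input_string = string_data     # same object, not a copy

input_string.force_encoding(Encoding::UTF_8)

string_data.encoding  # => #<Encoding:UTF-8> -- the caller's object changed too
```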
```ruby
input_string = input_string.force_encoding(Encoding::UTF_8)
# force UTF-8 encoding as data might also have invalid bytes
# we try to normalize first, use replacement char with `scrub` if invalid bytes found
input_string.unicode_normalize! # use default :NFC normalization since decompositions may result multiple characters
```
The unicode-normalized form is not a requirement of the upstream issue. We only need the value to be valid unicode, and do not need to do the extra work to make the forms consistent.

We also don't want to mutate the object we were given.
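For illustration, a sketch (invented example strings) showing that both the composed and decomposed forms are already valid UTF-8, so normalization is not needed for validity:

```ruby
# Sketch: NFC ("é" as one code point) and NFD ("e" + combining acute)
# are byte-different but both valid UTF-8, so serialization does not
# require normalizing one into the other.
nfc = "\u00E9"   # é, composed
nfd = "e\u0301"  # e + COMBINING ACUTE ACCENT

nfc.valid_encoding?  # => true
nfd.valid_encoding?  # => true
nfc == nfd           # => false (different byte sequences)

nfd.unicode_normalize(:nfc) == nfc  # => true
```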
```ruby
rescue => e
  logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
  logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
  input_string.scrub!
```
This is a destructive change and in most cases will mutate the object we were given
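A sketch of the difference (invented example string): `scrub!` rewrites the receiver in place, while `scrub` returns a cleaned copy:

```ruby
# Sketch: scrub! replaces invalid bytes in the receiver itself (mutation),
# while scrub returns a cleaned copy and leaves the original untouched.
original = "ab\xFFcd".dup.force_encoding(Encoding::UTF_8)

copy = original.scrub      # => "ab\uFFFDcd"
original.valid_encoding?   # => false -- original still has the bad byte

original.scrub!            # mutates the receiver
original.valid_encoding?   # => true
```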
```ruby
def self.normalize_string_encoding(string_data)
  input_string = string_data.dup if string_data.frozen?
  input_string = string_data unless string_data.frozen?

  if input_string.encoding != Encoding::UTF_8
    encoding_converter = Encoding::Converter.new(input_string.encoding, Encoding::UTF_8)
    conversion_error, utf8_string = false, nil
    begin
      utf8_string = encoding_converter.convert(input_string).freeze
    rescue => e
      # we mostly get Encoding::UndefinedConversionError but let's do not expect surprise crashes
      logger.trace? && logger.trace("Could not convert, #{e.inspect}")
      conversion_error = true
    ensure
      # if we cannot convert with a standard way
      # we let normalize and replace invalid unicode bytes
      return utf8_string unless conversion_error
    end
  end

  begin
    # non expensive `force_encoding` operation which changes the encoding metadata
    # otherwise unicode normalization rejects
    input_string = input_string.force_encoding(Encoding::UTF_8)
    # force UTF-8 encoding as data might also have invalid bytes
    # we try to normalize first, use replacement char with `scrub` if invalid bytes found
    input_string.unicode_normalize! # use default :NFC normalization since decompositions may result multiple characters
  rescue => e
    logger.trace? && logger.trace("Could not normalize to unicode, #{e.inspect}")
    logger.trace? && logger.trace("Replacing invalid non-utf bytes with replacement char.")
    input_string.scrub!
  end
  input_string
end
```
I applied the basic premise of my proof-of-concept utf8-coerce script, with the only change being that when an incoming string is flagged as `BINARY` we operate on a copy that has been force-encoded UTF-8, so that we avoid turning valid UTF-8 sequences in binary-flagged strings into mojibake.

Notably, this:

- avoids mutating the incoming `string_data`
- avoids copy operations on all valid unicode input
- avoids using exceptions for flow control
- passes all tests
```ruby
def self.normalize_string_encoding(string_data)
  intermediate = string_data
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  intermediate = intermediate.dup.force_encoding(Encoding::UTF_8) if intermediate.encoding == Encoding::BINARY

  lossy_conversion = nil
  replace_and_flag = ->(_) { lossy_conversion = true; UTF8_REPLACEMENT_CHAR }

  normalized = intermediate.scrub(&replace_and_flag)
                           .encode(Encoding::UTF_8, fallback: replace_and_flag)

  if lossy_conversion && logger.trace?
    inspection = {
      encoding:       string_data.encoding,
      valid_encoding: string_data.valid_encoding?,
      bytes:          string_data.bytes,
    }
    logger.trace("LOSSY UTF-8 CONVERSION: #{inspection}")
  end

  return normalized
end
```
Where:

```ruby
UTF8_REPLACEMENT_CHAR = "\u{FFFD}".force_encoding('UTF-8').freeze
```
Notably, the bulk of the above is to enable trace-level logging, which I believe to be overkill. The simplified form is:
```ruby
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  string_data = string_data.dup.force_encoding(Encoding::UTF_8) if string_data.encoding == Encoding::BINARY

  string_data.scrub
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```
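If it helps to see the suggested logic in action, here is a standalone sketch (logger and module context omitted; example inputs invented) exercising the BINARY-guessing path:

```ruby
# Standalone sketch of the suggested behavior; the method name comes from
# the PR, but this stripped-down version omits the enclosing module.
def normalize_string_encoding(string_data)
  # guess UTF-8 for BINARY-flagged input so valid sequences survive
  string_data = string_data.dup.force_encoding(Encoding::UTF_8) if string_data.encoding == Encoding::BINARY
  string_data.scrub
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

binary_utf8 = "caf\xC3\xA9".b            # BINARY-flagged, but valid UTF-8 bytes
normalize_string_encoding(binary_utf8)   # => "café" (sequence preserved)

latin1 = "caf\xE9".force_encoding(Encoding::ISO_8859_1)
normalize_string_encoding(latin1)        # => "café" (transcoded)
```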
The simplified form without any copy operations, using `String#encode`'s ability to take a source encoding instead of assuming the one the string is flagged with:
```ruby
def self.normalize_string_encoding(string_data)
  # when given BINARY-flagged string, assume it is UTF-8 so that
  # subsequent cleanup retains valid UTF-8 sequences
  source_encoding = string_data.encoding
  source_encoding = Encoding::UTF_8 if source_encoding == Encoding::BINARY

  string_data.encode(Encoding::UTF_8, source_encoding, invalid: :replace, undef: :replace).scrub
end
```
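A sketch (invented input) of what passing the source encoding buys: the BINARY-flagged string is decoded as UTF-8 without a `dup`/`force_encoding` copy:

```ruby
# Sketch: the two-argument form String#encode(dst, src, **opts) overrides
# the string's own encoding flag, so BINARY bytes can be read as UTF-8
# without first copying the string.
binary = "caf\xC3\xA9".b   # valid UTF-8 bytes, flagged ASCII-8BIT

binary.encode(Encoding::UTF_8, Encoding::UTF_8,
              invalid: :replace, undef: :replace).scrub
# => "café" -- multibyte sequence preserved, input flag untouched
```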
- Nice, TIL that the `force_encoding` -> `unicode_normalize` behavior can be achieved with `String#encode` 👍 Thank you!
- One thing I am not getting is forcing to UTF-8 when the source encoding is a binary (`ASCII-8BIT`) format.
- It seems we don't need `.scrub` since we used the `invalid: :replace, undef: :replace` options.

Let me know if I missed anything.
`String#encode` does not normalize the representation of code points unless the source and destination encodings differ from each other, and then only does so as defined by the specific transcoder between those encodings. We don't need to normalize the different ways of representing a code point, because doing so unnecessarily changes bytes whose values are already equivalent. Normalization is usually useful to make binary comparisons possible, such as when using the sequence as a unique identifier.

When the encoding is labeled as `BINARY`/`ASCII-8BIT`, any sequence of bytes is considered "valid", and when that is converted to UTF-8, any single byte above the 7-bit ASCII plane cannot be converted to UTF-8 because there is no mapping, so each of those bytes will be replaced with the unicode replacement character. If we "guess" that a `BINARY`-flagged sequence might actually be UTF-8, then when the conversion is done all 7-bit characters will translate exactly the same, but any actually-valid multibyte UTF-8 sequences will be preserved. It is never more lossy, and can be less lossy in some circumstances.
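A sketch (invented byte sequence) of the "never more lossy" claim:

```ruby
# Sketch: converting BINARY-flagged bytes straight to UTF-8 replaces every
# byte above 0x7F, while "guessing" UTF-8 first preserves any sequences
# that really were valid UTF-8.
bytes = "caf\xC3\xA9".b   # "café" encoded as UTF-8, but flagged BINARY

bytes.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
# => "caf\uFFFD\uFFFD" -- both bytes of the é sequence are lost

bytes.encode(Encoding::UTF_8, Encoding::UTF_8, invalid: :replace, undef: :replace)
# => "café" -- the valid multibyte sequence survives
```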
The `String#scrub` seems silly, but is necessary because `String#encode` is a no-op when the source and destination encodings are identical; a string that is flagged as `UTF-8` but is invalid will remain invalid when encoded to `UTF-8`.
Agree with all points mentioned here! Thanks so much for explaining in detail. ❤️
The suggestion is no longer active (probably after my rebase and force push). I have added this change with my recent commit and added you to the co-authors list. Thank you so much!
Simplification came through code review. Co-authored-by: Ry Biesemeyer <[email protected]>
…ce with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>
cc @mashhurs
LGTM.
Can you backport to 8.14 too?
@logstashmachine backport 8.14
* A logic to handle non-unicode payload in Logstash.
* Well tested and code organized version of the logic. Co-authored-by: Ry Biesemeyer <[email protected]>
* Upgrade jrjackson to 0.4.20
* Code review: simplify the logic with a standard String#encode interface with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)

* A logic to handle non-unicode payload in Logstash. Co-authored-by: Ry Biesemeyer <[email protected]>
* Upgrade jrjackson to 0.4.20
* Code review: simplify the logic with a standard String#encode interface with replace option. Co-authored-by: Ry Biesemeyer <[email protected]>

Co-authored-by: Ry Biesemeyer <[email protected]>
Co-authored-by: Ry Biesemeyer <[email protected]>
(cherry picked from commit 979d30d)
Release notes
[rn:skip]
What does this PR do?

Logstash's source pieces/tools that handle invalid unicode payloads (passed through `LogStash::Json.dump(invalid_unicode_payload)`) don't deal with input encoding, and don't normalize the unicode bytestream if it was force-encoded (metadata vs actual bytes may mismatch in a Ruby String). This PR tries to properly:

- convert with `Encoding::Converter` to make a correct representation
- change the encoding metadata (`force_encoding(Encoding::UTF_8)`)
- normalize (`unicode_normalize`) to make sure the payload is valid unicode
- replace invalid bytes with the replacement char (`\uFFFD`)

Why is it important/What is the impact to the user?

Users who are ingesting data as a non-unicode stream may see strange encoding behavior, or if they are using `elasticsearch-output` <= 11.22.2 versions, ES rejects the events.

Checklist

- [ ] I have made corresponding change to the default configuration files (and/or docker env variables)

Author's Checklist

- `scrub` (applying replacement char)

How to test this PR locally

See the unit tests, or pull this change and feed it any invalid/valid unicode payloads (for now only _String_s please). Run `bin/logstash -f your-config.conf --log.level=trace` to see the pipeline unicode handling outputs.

Related issues