Unicode pipeline and plugin ids #15971

yaauie · 2024-02-22T19:09:53Z

Release notes

Resolves a regression introduced in the 8.7 series in which the use of characters outside of the lower-ASCII plane in pipeline.id's caused the pipeline to fail to start up.
Fixes a serialization issue in the monitoring API that caused unicode characters in plugin id's to print incorrectly

What does this PR do?

JRuby's Ruby#newSymbol(String) throws an exception when provided a String that contains characters outside of lower-ASCII because JRuby internals expect "the incoming String to be one of our mangled ISO-8859-1 strings" as noted in a comment on jruby/jruby#6217.

Instead, we use Ruby#newString(String) to create a new RubyString (which works properly), and then rely on RubyString#intern to get our RubySymbol.

This fixes a regression introduced in the 8.7 series in which pipeline id's are consistently represented as ruby symbols in the metrics store, and ensures similar issue does not exist when specifying a plugin id that contains characters above the lower-ASCII plane.

This patch fixes related serialization issues when either plugin id's or pipeline id's contain characters outside the lower-ASCII plane.

Why is it important/What is the impact to the user?

Accented characters like umlauts have historically been supported in pipeline.id, and are useful in locales where they are meaningful.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files (and/or docker env variables)~~
I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

Run a pipeline whose pipeline.id has unicode characters:

bin/logstash --config.string 'input { heartbeat { id => "심장-박동" } }' --pipeline.id="Übér¬PîpéLiñé"

Observe that the pipeline starts, and log messages include the pipeline id correctly:

[2024-02-22T18:57:06,036][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Starting pipeline {:pipeline_id=>"Übér¬PîpéLiñé", "pipeline.workers"=>10, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1250, "pipeline.sources"=>["config string"], :thread=>"#<Thread:0x408db9f5 /Users/yaauie/src/elastic/ls/logstash-core/lib/logstash/java_pipeline.rb:134 run>"}
[2024-02-22T18:57:06,332][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Pipeline Java execution initialization time {"seconds"=>0.29}
[2024-02-22T18:57:06,336][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Pipeline started {"pipeline.id"=>"Übér¬PîpéLiñé"}
[2024-02-22T18:57:06,351][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:Übér¬PîpéLiñé], :non_running_pipelines=>[]}

Observe that the API responses contain the pipeline and plugin names correctly-encoded:

╭─{ yaauie@maybe:~ }
╰─○ curl --silent -XGET 'localhost:9600/_node/stats' | jq '.pipelines | map_values(.plugins | map_values( . | map(.id)))'
{
  "Übér¬PîpéLiñé": {
    "inputs": [
      "심장-박동"
    ],
    "codecs": [
      "rubydebug_59ffc710-a1e6-4a62-b4ad-6773c5a67cf2",
      "plain_e4a7bb04-9d4d-48fe-9c15-7979adfbf77b"
    ],
    "filters": [],
    "outputs": [
      "09763aaeed1274ec476c4bd9ce5c556937f3afc1f41ac1061d21f011eb5b8245"
    ]
  }
}
[success]

Related issues

Closes: #15968

yaauie · 2024-02-22T23:08:43Z

Sigh.

These tests all work on my local machine, which has a LANG=en_US.UTF-8 and the complete complement of locales, but fail when executed in the docker container based on jammy that merely has C, POSIX, and C.UTF-8.

I have not yet been able to isolate why (since the integration test themselves clearly hold onto the UTF-8 bytes), but suspect it has to do with the process being spawned for the integration test interpreting those bytes as something other than UTF-8, and/or the bytes being returned from its HTTP API being interpreted by the integration test's process as something other than UTF-8.

elastic-sonarqube · 2024-02-23T00:58:03Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

JRuby's `Ruby#newSymbol(String)` throws an exception when provided a `String` that contains characters outside of lower-ASCII because JRuby internals expect "the incoming String to be one of our mangled ISO-8859-1 strings" as noted in a comment on jruby/jruby#6217. Instead, we use `Ruby#newString(String)` to create a new `RubyString` (which works properly), and then rely on `RubyString#intern` to get our `RubySymbol`. This fixes a regression introduced in the 8.7 series in which pipeline id's are consistently represented as ruby symbols in the metrics store, and ensures similar issue does not exist when specifying a plugin id that contains characters above the lower-ASCII plane.

We cannot rely on `RubySymbol#toString` to produce a properly-encoded `String` whe the string contains characters above the lower-ASCII plane because the result is effectively a binary ruby-internal marshal of the bytes that only holds when the symbol contains lower-ASCII. Instead, we can use the internally-memoizing `RubySymbol#name` to get a properly-encoded `RubyString`, and `RubyString#asJavaString()` to get a properly-encoded java-`String`.

Jackson's JSON serializer leaks the JRuby-internal byte structure of Symbols, which only aligns with the byte-structure of the symbol's actual string when that string is wholly-comprised of lower-ASCII characters. By pre-converting Symbols to Strings, we ensure that the result is readable and useful.

elasticmachine · 2024-05-16T19:50:09Z

💔 Build Failed

Buildkite Build
Commit: 1a59b4e

Failed CI Steps

History

💔 Build #916 failed 49ca706
💔 Build #915 failed d6546b7

elastic-sonarqube · 2024-05-16T19:51:32Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

yaauie added the bug label Feb 22, 2024

yaauie force-pushed the unicode-pipeline-and-plugin-ids branch from 96842dd to d6546b7 Compare February 22, 2024 19:57

roaksoax added the status:needs-review label Feb 22, 2024

yaauie added 4 commits May 16, 2024 19:36

spike: add log message with encoding info

1a59b4e

yaauie force-pushed the unicode-pipeline-and-plugin-ids branch from 49ca706 to 1a59b4e Compare May 16, 2024 19:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode pipeline and plugin ids #15971

Unicode pipeline and plugin ids #15971

yaauie commented Feb 22, 2024 •

edited

yaauie commented Feb 22, 2024

elastic-sonarqube bot commented Feb 23, 2024

elasticmachine commented May 16, 2024 •

edited

elastic-sonarqube bot commented May 16, 2024

Unicode pipeline and plugin ids #15971

Are you sure you want to change the base?

Unicode pipeline and plugin ids #15971

Conversation

yaauie commented Feb 22, 2024 • edited

Release notes

What does this PR do?

Why is it important/What is the impact to the user?

Checklist

How to test this PR locally

Related issues

yaauie commented Feb 22, 2024

elastic-sonarqube bot commented Feb 23, 2024

Quality Gate passed

elasticmachine commented May 16, 2024 • edited

💔 Build Failed

Failed CI Steps

History

elastic-sonarqube bot commented May 16, 2024

Quality Gate passed

yaauie commented Feb 22, 2024 •

edited

elasticmachine commented May 16, 2024 •

edited