Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode pipeline and plugin ids #15971

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

yaauie
Copy link
Member

@yaauie yaauie commented Feb 22, 2024

Release notes

  • Resolves a regression introduced in the 8.7 series in which the use of characters outside of the lower-ASCII plane in pipeline.id's caused the pipeline to fail to start up.
  • Fixes a serialization issue in the monitoring API that caused unicode characters in plugin id's to print incorrectly

What does this PR do?

JRuby's Ruby#newSymbol(String) throws an exception when provided a String that contains characters outside of lower-ASCII because JRuby internals expect "the incoming String to be one of our mangled ISO-8859-1 strings" as noted in a comment on jruby/jruby#6217.

Instead, we use Ruby#newString(String) to create a new RubyString (which works properly), and then rely on RubyString#intern to get our RubySymbol.

This fixes a regression introduced in the 8.7 series in which pipeline id's are consistently represented as ruby symbols in the metrics store, and ensures similar issue does not exist when specifying a plugin id that contains characters above the lower-ASCII plane.

This patch fixes related serialization issues when either plugin id's or pipeline id's contain characters outside the lower-ASCII plane.

Why is it important/What is the impact to the user?

Accented characters like umlauts have historically been supported in pipeline.id, and are useful in locales where they are meaningful.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

Run a pipeline whose pipeline.id has unicode characters:

bin/logstash --config.string 'input { heartbeat { id => "심장-박동" } }' --pipeline.id="Übér¬PîpéLiñé"

Observe that the pipeline starts, and log messages include the pipeline id correctly:

[2024-02-22T18:57:06,036][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Starting pipeline {:pipeline_id=>"Übér¬PîpéLiñé", "pipeline.workers"=>10, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1250, "pipeline.sources"=>["config string"], :thread=>"#<Thread:0x408db9f5 /Users/yaauie/src/elastic/ls/logstash-core/lib/logstash/java_pipeline.rb:134 run>"}
[2024-02-22T18:57:06,332][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Pipeline Java execution initialization time {"seconds"=>0.29}
[2024-02-22T18:57:06,336][INFO ][logstash.javapipeline    ][Übér¬PîpéLiñé] Pipeline started {"pipeline.id"=>"Übér¬PîpéLiñé"}
[2024-02-22T18:57:06,351][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:Übér¬PîpéLiñé], :non_running_pipelines=>[]}

Observe that the API responses contain the pipeline and plugin names correctly-encoded:

╭─{ yaauie@maybe:~ }
╰─○ curl --silent -XGET 'localhost:9600/_node/stats' | jq '.pipelines | map_values(.plugins | map_values( . | map(.id)))'
{
  "Übér¬PîpéLiñé": {
    "inputs": [
      "심장-박동"
    ],
    "codecs": [
      "rubydebug_59ffc710-a1e6-4a62-b4ad-6773c5a67cf2",
      "plain_e4a7bb04-9d4d-48fe-9c15-7979adfbf77b"
    ],
    "filters": [],
    "outputs": [
      "09763aaeed1274ec476c4bd9ce5c556937f3afc1f41ac1061d21f011eb5b8245"
    ]
  }
}
[success]

Related issues

Closes: #15968

@yaauie
Copy link
Member Author

yaauie commented Feb 22, 2024

Sigh.

These tests all work on my local machine, which has a LANG=en_US.UTF-8 and the complete complement of locales, but fail when executed in the docker container based on jammy that merely has C, POSIX, and C.UTF-8.

I have not yet been able to isolate why (since the integration test themselves clearly hold onto the UTF-8 bytes), but suspect it has to do with the process being spawned for the integration test interpreting those bytes as something other than UTF-8, and/or the bytes being returned from its HTTP API being interpreted by the integration test's process as something other than UTF-8.

Copy link

Quality Gate passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No Coverage information No data about Coverage
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

JRuby's `Ruby#newSymbol(String)` throws an exception when provided a `String`
that contains characters outside of lower-ASCII because JRuby internals expect
"the incoming String to be one of our mangled ISO-8859-1 strings" as noted in
a comment on jruby/jruby#6217.

Instead, we use `Ruby#newString(String)` to create a new `RubyString` (which
works properly), and then rely on `RubyString#intern` to get our `RubySymbol`.

This fixes a regression introduced in the 8.7 series in which pipeline id's
are consistently represented as ruby symbols in the metrics store, and ensures
similar issue does not exist when specifying a plugin id that contains
characters above the lower-ASCII plane.
We cannot rely on `RubySymbol#toString` to produce a properly-encoded `String`
whe the string contains characters above the lower-ASCII plane because the
result is effectively a binary ruby-internal marshal of the bytes that only
holds when the symbol contains lower-ASCII.

Instead, we can use the internally-memoizing `RubySymbol#name` to get a
properly-encoded `RubyString`, and `RubyString#asJavaString()` to get a
properly-encoded java-`String`.
Jackson's JSON serializer leaks the JRuby-internal byte structure of Symbols,
which only aligns with the byte-structure of the symbol's actual string when
that string is wholly-comprised of lower-ASCII characters.

By pre-converting Symbols to Strings, we ensure that the result is readable
and useful.
@yaauie yaauie force-pushed the unicode-pipeline-and-plugin-ids branch from 49ca706 to 1a59b4e Compare May 16, 2024 19:37
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

crash with non-ASCII characters in pipeline id
3 participants