string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of Unicode code points. It's therefore necessary to adjust string offsets when communicating across programming language boundaries. StringOffsets does these adjustments.
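To make the mismatch concrete, here is a minimal sketch that converts a UTF-8 byte offset by rescanning the prefix of the string with the standard library. The helper names `naive_utf8_to_utf16` and `naive_utf8_to_chars` are hypothetical and not part of this crate; they only illustrate what a conversion has to compute, at O(n) cost per call.

```rust
/// Hypothetical helper (not part of string-offsets): counts the UTF-16 code
/// units in the prefix `s[..byte_offset]`. Runs in O(n) per call and panics
/// if `byte_offset` is not on a char boundary.
fn naive_utf8_to_utf16(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Hypothetical helper (not part of string-offsets): counts the Unicode
/// code points in the prefix `s[..byte_offset]`.
fn naive_utf8_to_chars(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

fn main() {
    // Same text as the usage example, written with escapes so the
    // variation selectors (U+FE0F) are explicit.
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";

    // Byte 12 is where the second line starts.
    assert_eq!(naive_utf8_to_utf16(s, 12), 8);
    assert_eq!(naive_utf8_to_chars(s, 12), 8);

    // Byte 19 is just past the map emoji: 7 bytes, 3 UTF-16 units, 2 code points.
    assert_eq!(naive_utf8_to_utf16(s, 19), 11);
    assert_eq!(naive_utf8_to_chars(s, 19), 10);
}
```

Rescanning like this is fine for a one-off lookup, but it becomes expensive when many offsets in the same string need converting.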

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, where n is the length of the string; after that, most conversions are O(1).
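The build-once, query-many trade-off can be sketched for the simplest case, line lookups. The `LineStarts` type below is a hypothetical illustration using a plain vector of line starts; it is an assumption made for clarity and not how the crate stores its data (the blog post mentioned below describes the actual approach).

```rust
use std::ops::Range;

/// Hypothetical index (not the crate's data structure): the byte offset at
/// which each line begins, followed by a final entry equal to the string length.
struct LineStarts {
    starts: Vec<usize>,
}

impl LineStarts {
    /// One O(n) pass over the bytes. The byte 0x0A never occurs inside a
    /// multi-byte UTF-8 sequence, so scanning raw bytes is safe.
    fn new(s: &str) -> Self {
        let mut starts = vec![0];
        for (i, b) in s.bytes().enumerate() {
            if b == b'\n' {
                starts.push(i + 1);
            }
        }
        // Close a final line that has no trailing newline.
        if *starts.last().unwrap() != s.len() {
            starts.push(s.len());
        }
        Self { starts }
    }

    /// O(1) lookup of the UTF-8 byte range of a 0-based line, including its
    /// trailing newline. Returns `None` if the line does not exist.
    fn line_range(&self, line: usize) -> Option<Range<usize>> {
        let start = *self.starts.get(line)?;
        let end = *self.starts.get(line + 1)?;
        Some(start..end)
    }
}

fn main() {
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    let index = LineStarts::new(s);
    assert_eq!(index.line_range(0), Some(0..12));
    assert_eq!(index.line_range(1), Some(12..25));
    assert_eq!(index.line_range(2), None);
}
```

A table like this costs memory proportional to the number of lines; extending the same idea to UTF-16 and code point offsets needs a more compact representation to stay within O(n) memory with O(1) queries.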

"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.

Usage

Add this to your Cargo.toml:

[dependencies]
string-offsets = "0.1"

Then:

use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode code points (Rust `char`s).
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);

See the documentation for more.