cometkim/unicode-segmenter

unicode-segmenter

A lightweight and fast, pure JavaScript library for Unicode segmentation.

Features

unicode-segmenter includes utilities to deal with:

  • Emojis and pictographs
  • Extended grapheme clusters
  • Non-Latin alphabets and numbers
  • UTF-8 characters and UTF-16 surrogates
  • Intl.Segmenter Polyfill

It has no dependencies, so you can use it even in places where built-in Unicode libraries aren't available, such as old browsers, edge runtimes, and embedded environments.

Unicode® version

Unicode® 15.1.0 Standard Annex #29 Revision 43 (2023-08-16)

Runtime compatibility

unicode-segmenter uses only basic ES6+ features such as generators, modules, and String.prototype.codePointAt().

Those are available in lightweight JS runtimes like QuickJS as well as (not very) modern browsers. You can still use the library even in IE11 by transpiling/polyfilling them using Babel, regenerator, etc.

Usage

Using TypeScript

No worries. The project is fully type-checked and provides *.d.ts files for you 😉

Export unicode-segmenter/emoji

Utilities for matching emoji-like characters

Example: Use Unicode emoji property matchers

import {
  isEmoji,             // match \p{Extended_Pictographic}
  isEmojiPresentation, // match \p{Emoji_Presentation}
} from 'unicode-segmenter/emoji';

isEmoji('😍'.codePointAt(0));
// => true
isEmoji('♡'.codePointAt(0));
// => true

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

Export unicode-segmenter/general

Utilities for matching alphanumeric characters

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
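
The import comments above map each matcher to a Unicode property. As a rough illustration of how those properties differ, the sketch below uses the built-in Unicode RegExp rather than the library itself. The `propertyPatterns` table and the `classify` helper are hypothetical names introduced only for this example.

```javascript
// Built-in RegExp counterparts of the properties named in the import comments.
// These are NOT the library's matchers, only an illustration of the properties.
const propertyPatterns = {
  letter: /^\p{L}$/u,
  numeric: /^\p{N}$/u,
  alphabetic: /^\p{Alphabetic}$/u,
  alphanumeric: /^[\p{N}\p{Alphabetic}]$/u,
};

// Hypothetical helper: report which of the properties a single character has.
function classify(char) {
  const result = {};
  for (const [name, pattern] of Object.entries(propertyPatterns)) {
    result[name] = pattern.test(char);
  }
  return result;
}

console.log(classify('\uAC00')); // Hangul syllable GA
console.log(classify('\u0663')); // Arabic-Indic digit three
console.log(classify('\u2167')); // Roman numeral eight
```

The Roman numeral is a letter-number (Nl), so it counts as Alphabetic and as a number but not as a letter, which is one reason to keep separate matchers for \p{L} and \p{Alphabetic}.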

Export unicode-segmenter/grapheme

Utilities for text segmentation by extended grapheme cluster rules

Example: Count graphemes

import { countGrapheme } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3
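
To see why a dedicated grapheme counter is needed, it can help to compare the two counts JavaScript offers out of the box. The sketch below is only an illustration; `describeLength` is a hypothetical helper, and the second string is spelled with escapes on the assumption that the example above uses decomposed combining marks.

```javascript
// Hypothetical helper: compare UTF-16 code units with Unicode code points.
function describeLength(text) {
  return {
    codeUnits: text.length,        // what String.prototype.length reports
    codePoints: [...text].length,  // string iteration walks code points
  };
}

// The waving hand is one code point but two UTF-16 code units.
console.log(describeLength('👋 안녕!'));

// Base letters followed by combining marks: every mark is its own code point,
// so neither count matches the 3 user-perceived characters.
console.log(describeLength('a\u0310e\u0301o\u0308\u0332'));
```

Counting code points fixes the surrogate-pair case but still overcounts combining sequences, which is the gap extended grapheme cluster rules are meant to close.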

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Build an advanced grapheme matcher

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(str) {
  // internal field `_cat` is GraphemeCategory value of the match index
  for (const { segment, _cat } of graphemeSegments(str)) {
    if (_cat === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export unicode-segmenter/intl-adapter

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();

Export unicode-segmenter/intl-polyfill

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();
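
For readers unfamiliar with the Intl.Segmenter API, here is a minimal sketch of iterating grapheme segments with it. It runs against the runtime's built-in `Intl.Segmenter`. That the adapter and polyfill accept the same constructor arguments is an assumption based on the "same API" statement above, and `listSegments` is a hypothetical helper.

```javascript
// Uses the runtime's built-in Intl.Segmenter; with the polyfill applied,
// the same code is expected (not verified here) to work where it is missing.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: collect each segment together with its start index.
function listSegments(text) {
  const result = [];
  for (const { segment, index } of graphemeSegmenter.segment(text)) {
    result.push({ segment, index });
  }
  return result;
}

// Decomposed letters with combining marks, followed by CRLF.
const sample = 'a\u0310e\u0301o\u0308\u0332\r\n';
console.log(listSegments(sample));
```

Each yielded item also carries an `input` field, which matches the shape shown in the graphemeSegments example earlier.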

Export unicode-segmenter/utils

You can access some internal utilities to deal with UTF-16 in JavaScript

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}
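
The arithmetic behind combining a surrogate pair is fixed by the UTF-16 encoding, so it can be sketched independently of the library. The functions `combineSurrogates` and `splitCodePoint` below are hypothetical stand-ins written for illustration, not the library's implementation.

```javascript
// Hypothetical stand-in: decode a UTF-16 surrogate pair into a code point.
// Each surrogate carries 10 bits of (codePoint - 0x10000).
function combineSurrogates(hi, lo) {
  return ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;
}

// Hypothetical inverse: encode a supplementary code point as [hi, lo].
function splitCodePoint(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

const face = '😍';
const decoded = combineSurrogates(face.charCodeAt(0), face.charCodeAt(1));
console.log(decoded === face.codePointAt(0));
console.log(splitCodePoint(decoded).map((unit) => unit.toString(16)));
```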

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

char.length === (isBMP(cp) ? 1 : 2);
// => true
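
The same idea extends to whole strings: summing one or two code units per code point reproduces String.prototype.length. The sketch below assumes a BMP check is simply a comparison against U+FFFF. `isBmpCodePoint`, `utf16Length`, and `codeUnitLength` are hypothetical names, not the library's exports.

```javascript
// Hypothetical stand-in for a BMP check: BMP code points fit in one code unit.
const isBmpCodePoint = (cp) => cp >= 0 && cp <= 0xffff;

// Number of UTF-16 code units needed to encode a single code point.
const utf16Length = (cp) => (isBmpCodePoint(cp) ? 1 : 2);

// Sum the per-code-point lengths; string iteration yields whole code points.
function codeUnitLength(text) {
  let total = 0;
  for (const char of text) {
    total += utf16Length(char.codePointAt(0));
  }
  return total;
}

const greeting = '👋 안녕!';
console.log(codeUnitLength(greeting) === greeting.length);
```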

Benchmarks

unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries.

See the benchmark to learn how it works.

unicode-segmenter/emoji vs

  • built-in Unicode RegExp
  • emoji-regex@10.3.0 (101M+ weekly downloads on NPM)

Package stats

Name                     Unicode®       ESM?  Size    Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/emoji  15.1.0         ✔️    3,058   2,611       1,041            751
emoji-regex*             15.1.0 (vary)  ✔️    12,946  12,859      2,180            1,746
RegExp w/ u*             -              -     0       0           0                0
  • emoji-regex only supports the Emoji_Presentation property, not Extended_Pictographic.
  • You can build your own emoji-regex using emoji-test-regex-pattern.
  • RegExp Unicode data is kept up to date by the runtime itself.
  • RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

The runtime performance of unicode-segmenter/emoji is sufficient for testing the presence of emoji in a text.

It is ~2.5x slower than RegExp w/ u in the match-all benchmark, but that comparison has little real-world value because the alternatives do not take grapheme clusters into account.
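
For context, the two benchmark tasks can be expressed with the built-in Unicode RegExp roughly as follows. This is a sketch of the general idea, not the benchmark's actual code. The function names and the choice of \p{Extended_Pictographic} are assumptions made for illustration.

```javascript
// Separate patterns so the stateful global flag never affects the presence test.
const emojiPattern = /\p{Extended_Pictographic}/u;
const emojiPatternGlobal = /\p{Extended_Pictographic}/gu;

// "checking if any emoji": stops at the first match.
function hasAnyEmoji(text) {
  return emojiPattern.test(text);
}

// "match all emoji": collects every matching code point.
function matchAllEmoji(text) {
  return text.match(emojiPatternGlobal) || [];
}

console.log(hasAnyEmoji('1🌷2🎁3💩4😜5👍'));
console.log(matchAllEmoji('1🌷2🎁3💩4😜5👍').length);

// A family emoji is three pictographs joined by zero-width joiners.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(matchAllEmoji(family).length);
```

Because the pattern matches one code point at a time, the family sequence is reported as three separate matches rather than one user-perceived emoji, which illustrates what ignoring grapheme clusters means in practice.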

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                    time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------------- -----------------------------
• checking if any emoji
--------------------------------------------------------------- -----------------------------
unicode-segmenter/emoji   16.11 ns/iter     (15.28 ns … 339 ns)  16.32 ns  18.66 ns  43.42 ns
RegExp w/ unicode         19.03 ns/iter     (16.52 ns … 185 ns)   17.9 ns  46.28 ns  74.85 ns
emoji-regex               43.15 ns/iter   (41.54 ns … 73.51 ns)  43.58 ns  47.93 ns  65.73 ns

summary for checking if any emoji
  unicode-segmenter/emoji
   1.18x faster than RegExp w/ unicode
   2.68x faster than emoji-regex

• match all emoji
--------------------------------------------------------------- -----------------------------
unicode-segmenter/emoji   3'215 ns/iter     (2'958 ns … 189 µs)  3'208 ns  3'708 ns 11'833 ns
RegExp w/ unicode         1'285 ns/iter   (1'221 ns … 1'509 ns)  1'299 ns  1'449 ns  1'509 ns
emoji-regex              11'696 ns/iter    (11'125 ns … 239 µs) 11'667 ns 16'125 ns 20'375 ns

summary for match all emoji
  unicode-segmenter/emoji
   2.5x slower than RegExp w/ unicode
   3.64x faster than emoji-regex

unicode-segmenter/general vs

  • built-in unicode RegExp
  • XRegExp@5.1.1 (2.8M+ weekly downloads on NPM)

Package stats

Name                       Unicode®  ESM?  Size     Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/general  15.1.0    ✔️    21,505   20,972      5,792            3,564
XRegExp                    14.0.0    ✖️    383,156  194,202     62,986           39,871
RegExp w/ u*               -         -     0        0           0                0
  • RegExp Unicode data is kept up to date by the runtime itself.
  • RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

unicode-segmenter/general performs almost on par with RegExp w/ u.

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                      time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------- -----------------------------
• checking any alphanumeric
----------------------------------------------------------------- -----------------------------
unicode-segmenter/general     236 ns/iter       (229 ns … 579 ns)    233 ns    304 ns    552 ns
XRegExp                       243 ns/iter       (239 ns … 319 ns)    242 ns    285 ns    317 ns
RegExp w/ unicode             236 ns/iter       (233 ns … 312 ns)    237 ns    263 ns    299 ns

summary for checking any alphanumeric
  unicode-segmenter/general
   1x faster than RegExp w/ unicode
   1.03x faster than XRegExp

• match all alphanumeric
----------------------------------------------------------------- -----------------------------
unicode-segmenter/general   1'883 ns/iter   (1'851 ns … 2'105 ns)  1'880 ns  2'027 ns  2'105 ns
XRegExp                     3'135 ns/iter   (3'109 ns … 3'300 ns)  3'137 ns  3'273 ns  3'300 ns
RegExp w/ unicode           1'540 ns/iter   (1'520 ns … 1'655 ns)  1'544 ns  1'643 ns  1'655 ns

summary for match all alphanumeric
  RegExp w/ unicode
   1.22x faster than unicode-segmenter/general
   2.04x faster than XRegExp

unicode-segmenter/grapheme vs

Package stats

Name                        Unicode®  ESM?  Size     Size (min)  Size (min+gzip)  Size (min+br)
unicode-segmenter/grapheme  15.1.0    ✔️    33,045   29,667      9,343            5,658
graphemer                   15.0.0    ✖️    410,424  95,104      15,752           10,660
grapheme-splitter           10.0.0    ✖️    122,241  23,680      7,852            4,841
unicode-segmentation*       15.0.0    ✔️    51,251   51,251      22,545           16,614
Intl.Segmenter*             -         -     0        0           0                0
  • unicode-segmentation size contains only the minimum WASM binary. It will be larger by adding more bindings.
  • Intl.Segmenter's Unicode data is kept up to date by the runtime itself.
  • Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

unicode-segmenter/grapheme is 7~18x faster than other JS alternatives, 3~8x faster than the native Intl.Segmenter, and 1.5~3x faster than the WASM build of the Rust unicode-segmentation library.

The gap may increase depending on the environment. Bindings for browsers generally appear to perform worse. In most environments, unicode-segmenter/grapheme is over 6x faster than graphemer.

Details
cpu: Apple M1 Pro
runtime: node v20.13.1 (arm64-darwin)

benchmark                                        time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------------- -----------------------------
• Lorem ipsum (ascii)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             5'307 ns/iter     (4'708 ns … 252 µs)  5'125 ns  6'250 ns 68'625 ns
Intl.Segmenter                               51'373 ns/iter    (47'000 ns … 467 µs) 50'709 ns 58'583 ns    397 µs
graphemer                                    49'735 ns/iter  (46'416 ns … 1'739 µs) 47'042 ns    123 µs    342 µs
grapheme-splitter                            74'459 ns/iter    (73'292 ns … 211 µs) 73'834 ns 81'334 ns    169 µs
unicode-rs/unicode-segmentation (wasm-pack)  16'422 ns/iter    (15'625 ns … 325 µs) 16'375 ns 19'416 ns 89'125 ns

summary for Lorem ipsum (ascii)
  unicode-segmenter
   3.09x faster than unicode-rs/unicode-segmentation (wasm-pack)
   9.37x faster than graphemer
   9.68x faster than Intl.Segmenter
   14.03x faster than grapheme-splitter

• Emojis
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             1'820 ns/iter   (1'730 ns … 2'428 ns)  1'853 ns  2'228 ns  2'428 ns
Intl.Segmenter                               14'743 ns/iter  (12'166 ns … 2'454 µs) 13'875 ns 18'000 ns 39'834 ns
graphemer                                    13'406 ns/iter  (12'625 ns … 1'243 µs) 13'292 ns 15'208 ns    117 µs
grapheme-splitter                            27'827 ns/iter    (26'625 ns … 513 µs) 27'709 ns 32'208 ns 82'958 ns
unicode-rs/unicode-segmentation (wasm-pack)   5'591 ns/iter   (5'462 ns … 5'916 ns)  5'655 ns  5'845 ns  5'916 ns

summary for Emojis
  unicode-segmenter
   3.07x faster than unicode-rs/unicode-segmentation (wasm-pack)
   7.37x faster than graphemer
   8.1x faster than Intl.Segmenter
   15.29x faster than grapheme-splitter

• Demonic characters
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             1'789 ns/iter   (1'728 ns … 1'945 ns)  1'812 ns  1'905 ns  1'945 ns
Intl.Segmenter                                5'083 ns/iter   (3'505 ns … 9'451 ns)  7'867 ns  9'238 ns  9'451 ns
graphemer                                    27'906 ns/iter    (26'375 ns … 284 µs) 27'750 ns 30'917 ns    168 µs
grapheme-splitter                            20'428 ns/iter    (19'042 ns … 373 µs) 20'125 ns 23'833 ns    287 µs
unicode-rs/unicode-segmentation (wasm-pack)   2'513 ns/iter   (2'426 ns … 2'728 ns)  2'542 ns  2'693 ns  2'728 ns

summary for Demonic characters
  unicode-segmenter
   1.4x faster than unicode-rs/unicode-segmentation (wasm-pack)
   2.84x faster than Intl.Segmenter
   11.42x faster than grapheme-splitter
   15.6x faster than graphemer

• Tweet text (combined)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                             8'170 ns/iter     (7'666 ns … 370 µs)  8'042 ns  9'125 ns    110 µs
Intl.Segmenter                               67'951 ns/iter    (63'542 ns … 664 µs) 67'875 ns 73'667 ns    359 µs
graphemer                                    68'831 ns/iter    (66'542 ns … 349 µs) 69'083 ns 78'500 ns    197 µs
grapheme-splitter                               151 µs/iter       (145 µs … 624 µs)    150 µs    183 µs    445 µs
unicode-rs/unicode-segmentation (wasm-pack)  24'231 ns/iter    (23'625 ns … 252 µs) 24'000 ns 26'250 ns    133 µs

summary for Tweet text (combined)
  unicode-segmenter
   2.97x faster than unicode-rs/unicode-segmentation (wasm-pack)
   8.32x faster than Intl.Segmenter
   8.42x faster than graphemer
   18.43x faster than grapheme-splitter

• Code snippet (combined)
----------------------------------------------------------------------------------- -----------------------------
unicode-segmenter                            19'604 ns/iter    (18'291 ns … 239 µs) 19'375 ns 27'709 ns    139 µs
Intl.Segmenter                                  160 µs/iter       (148 µs … 406 µs)    159 µs    309 µs    385 µs
graphemer                                       165 µs/iter       (159 µs … 377 µs)    165 µs    267 µs    351 µs
grapheme-splitter                               353 µs/iter     (340 µs … 1'264 µs)    354 µs    541 µs  1'136 µs
unicode-rs/unicode-segmentation (wasm-pack)  58'236 ns/iter    (55'958 ns … 905 µs) 58'333 ns 65'125 ns    199 µs

summary for Code snippet (combined)
  unicode-segmenter
   2.97x faster than unicode-rs/unicode-segmentation (wasm-pack)
   8.16x faster than Intl.Segmenter
   8.39x faster than graphemer
   17.98x faster than grapheme-splitter

LICENSE

MIT

Note

The initial implementation was ported manually from Rust's unicode-segmentation library, which is licensed under the MIT license.