unicode-segmenter

A lightweight implementation of Unicode Text Segmentation (UAX #29).

Zero dependencies: It doesn't bloat node_modules or the network tab.
Excellent compatibility: It works well on older browsers, edge runtimes, and embedded JavaScript runtimes like Hermes and QuickJS.
Small bundle size: It compresses Unicode data effectively and ships in a tree-shakeable format, so unused code can be eliminated.
Extremely efficient: It's carefully optimized for performance. In the project's benchmarks it is the fastest in the ecosystem, outperforming even the built-in Intl.Segmenter in Node.js.
TypeScript: It's fully type-checked, and provides definitions with JSDoc.
ESM-first: It natively supports ES Modules and also supports CommonJS.

Unicode® Version

Unicode® 15.1.0

Unicode® Standard Annex #29 - Revision 43 (2023-08-16)

APIs

The package provides several entry points for text segmentation.

unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
unicode-segmenter/intl-adapter: Intl.Segmenter adapter
unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill

And extra utilities for combined use cases.

unicode-segmenter/emoji: Matches single codepoint emojis
unicode-segmenter/general: Matches single codepoint alphanumerics
unicode-segmenter/utils: Handles UTF-8 and UTF-16 surrogates

Export `unicode-segmenter/grapheme`

Utilities for text segmentation by extended grapheme cluster rules.

Example: Count graphemes

import { countGrapheme } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
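
One common use of segment iteration is truncating text without cutting a cluster in half. The sketch below is not part of the library: `truncateByGraphemes` is a hypothetical helper that accepts any function returning an iterable of `{ segment }` objects, the shape shown above. The built-in `Intl.Segmenter` serves as a stand-in so the snippet runs without installing anything; `graphemeSegments` should be usable in its place, assuming the output shape above.

```javascript
// Hypothetical helper (not part of unicode-segmenter): keep at most `max`
// grapheme clusters. `segments` is any function that returns an iterable of
// objects with a `segment` string.
function truncateByGraphemes(text, max, segments) {
  let result = '';
  let count = 0;
  for (const { segment } of segments(text)) {
    if (count >= max) break;
    result += segment;
    count += 1;
  }
  return result;
}

// Stand-in segmenter based on the built-in Intl.Segmenter.
const builtin = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const builtinSegments = (text) => builtin.segment(text);

truncateByGraphemes('👋 안녕!', 3, builtinSegments);
// => '👋 안'
```

Passing the segmenter in as an argument keeps the helper independent of which implementation is available at runtime.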

Example: Build an advanced grapheme matcher

graphemeSegments() also exposes some information identified during segmentation, which supports more advanced use cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(input) {
  for (const { segment, _catBegin } of graphemeSegments(input)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is an emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export `unicode-segmenter/intl-adapter`

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();

Export `unicode-segmenter/intl-polyfill`

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();
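
If the polyfill should load only where the host lacks `Intl.Segmenter`, a feature check with a dynamic import is one option. This is a sketch rather than a documented recipe: `needsSegmenterPolyfill` and `ensureSegmenter` are hypothetical names, and it assumes the runtime (or bundler) supports dynamic `import()`.

```javascript
// Hypothetical helper: true when the host has no usable Intl.Segmenter.
function needsSegmenterPolyfill() {
  return typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function';
}

// Hypothetical helper: load the polyfill on demand, then create a segmenter.
async function ensureSegmenter() {
  if (needsSegmenterPolyfill()) {
    // Only fetched where the built-in is missing.
    await import('unicode-segmenter/intl-polyfill');
  }
  return new Intl.Segmenter(undefined, { granularity: 'grapheme' });
}
```

Hosts that already ship `Intl.Segmenter` skip the import entirely, so they never download the polyfill.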

Export `unicode-segmenter/emoji`

Utilities for matching emoji-like characters.

Example: Use Unicode emoji property matchers

import {
  isEmojiPresentation,    // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true

Export `unicode-segmenter/general`

Utilities for matching alphanumeric characters.

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
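
The import comments describe each matcher by its regular-expression equivalent. Where Unicode property escapes are available, those expressions can serve as a reference to cross-check against. The sketch below uses only built-ins; `patterns` and `matchesProperty` are illustrative names, not library exports.

```javascript
// Regex equivalents of the properties named in the import comments above.
// The `u` flag enables Unicode property escapes (ES2018).
const patterns = {
  letter: /^\p{L}$/u,
  numeric: /^\p{N}$/u,
  alphabetic: /^\p{Alphabetic}$/u,
  alphanumeric: /^[\p{N}\p{Alphabetic}]$/u,
};

// Hypothetical helper: test a single code point against one of the patterns.
function matchesProperty(name, codePoint) {
  return patterns[name].test(String.fromCodePoint(codePoint));
}

matchesProperty('letter', '가'.codePointAt(0));      // => true
matchesProperty('numeric', '٣'.codePointAt(0));      // => true
matchesProperty('alphabetic', '٣'.codePointAt(0));   // => false
matchesProperty('alphanumeric', '٣'.codePointAt(0)); // => true
```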

Export `unicode-segmenter/utils`

You can access some internal utilities to deal with JavaScript strings.

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}
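
For reference, the arithmetic behind combining a surrogate pair is defined by UTF-16 itself. The sketch below re-implements it with local helpers; the names `isHigh`, `isLow`, and `combine` are illustrative, and this is not the library's source.

```javascript
// High (lead) surrogates occupy U+D800..U+DBFF.
function isHigh(codeUnit) {
  return codeUnit >= 0xd800 && codeUnit <= 0xdbff;
}

// Low (trail) surrogates occupy U+DC00..U+DFFF.
function isLow(codeUnit) {
  return codeUnit >= 0xdc00 && codeUnit <= 0xdfff;
}

// Each surrogate carries 10 bits; the result is offset by 0x10000 because
// surrogate pairs only encode code points above the BMP.
function combine(hi, lo) {
  return ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;
}

const text = '😍';
const hi = text.charCodeAt(0); // 0xd83d
const lo = text.charCodeAt(1); // 0xde0d

if (isHigh(hi) && isLow(lo)) {
  combine(hi, lo) === text.codePointAt(0);
  // => true (both are 0x1f60d)
}
```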

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

// Parentheses are required: `===` binds tighter than the conditional operator.
char.length === (isBMP(cp) ? 1 : 2);
// => true
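
The same check lets a loop advance by the right number of code units. The sketch below uses a local stand-in for `isBMP`, assuming the library's function reports whether a code point lies in the Basic Multilingual Plane (U+0000..U+FFFF); `toCodePoints` is a hypothetical helper.

```javascript
// Local stand-in for the library's isBMP (assumed semantics: cp <= U+FFFF).
const isBmpCodePoint = (cp) => cp <= 0xffff;

// Hypothetical helper: collect the code points of a string, advancing by one
// code unit for BMP characters and by two for everything else.
function toCodePoints(text) {
  const result = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    result.push(cp);
    i += isBmpCodePoint(cp) ? 1 : 2;
  }
  return result;
}

toCodePoints('a😍b').map((cp) => cp.toString(16));
// => [ '61', '1f60d', '62' ]
```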

Runtime Compatibility

unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.

To ensure compatibility, the runtime should support:

String.prototype.codePointAt()
Generators
Modules

If the runtime doesn't support these features, they can be provided with tools like Babel.
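
To find out whether a given runtime needs such tooling, the first two requirements can be probed at runtime. This is a rough sketch with a hypothetical `detectFeatures` helper. Module support cannot be probed this way, and the generator probe relies on `new Function`, which a strict Content Security Policy may block.

```javascript
// Hypothetical helper: report which of the required features are present.
function detectFeatures() {
  const codePointAt = typeof String.prototype.codePointAt === 'function';

  let generators = true;
  try {
    // Compiled at call time so that engines without generator syntax can
    // still parse the rest of this file.
    new Function('return function* () {}');
  } catch (error) {
    // Either the syntax is unsupported or dynamic code evaluation is blocked.
    generators = false;
  }

  return { codePointAt, generators };
}

console.log(detectFeatures());
```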

React Native Support

Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.

unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.

Comparison

unicode-segmenter aims to be lighter and faster than the alternatives in the ecosystem while remaining fully spec-compliant. To that end, the benchmark tracks the performance, bundle size, and Unicode version compliance of several libraries.

`unicode-segmenter/grapheme` vs

graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
@formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
WebAssembly binding of Rust's unicode-segmentation library
Built-in Intl.Segmenter API

JS Bundle Stats

| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
|---|---|---|---|---|---|---|
| `unicode-segmenter/grapheme` | 15.1.0 | ✔️ | 28,270 | 24,291 | 6,347 | 4,273 |
| `graphemer` | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
| `grapheme-splitter` | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
| `@formatjs/intl-segmenter`* | 15.0.0 | ✖️ | 491,043 | 318,721 | 54,248 | 34,380 |
| `unicode-segmentation`* | 15.0.0 | ✔️ | 45,803 | 41,717 | 19,687 | 13,477 |
| `Intl.Segmenter`* | - | - | 0 | 0 | 0 | 0 |

@formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
unicode-segmentation size contains only the minimum WASM binary and bindings. It will be larger by adding more features.
Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Hermes Bytecode Stats

| Name | Unicode® | Bytecode size | Bytecode size (gzip)* |
|---|---|---|---|
| `unicode-segmenter/grapheme` | 15.1.0 | 35,014 | 13,326 |
| `graphemer` | 15.0.0 | 133,949 | 31,710 |
| `grapheme-splitter` | 10.0.0 | 63,810 | 19,125 |
| `@formatjs/intl-segmenter`* | 15.0.0 | 315,865 | 99,063 |

* The bytecode is compressed when it is included as an app asset.

Runtime Performance

The following is a brief summary; see the archived benchmark results for details.

Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.

7~18x faster than other JavaScript libraries
1.5~3x faster than the WASM binding of Rust's unicode-segmentation
3~8x faster than built-in Intl.Segmenter

Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.

Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.

Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 2~4x faster than graphemer and 18~25x faster than grapheme-splitter, with the performance gap increasing with input size.

Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build a benchmark yourself.

LICENSE

MIT

Note

The initial implementation was ported manually from Rust's unicode-segmentation library, which is licensed under the MIT license.
