This module provides support for loading, manipulating, and comparing Unicode text data.
It works by storing each character as its Unicode `codepoint` value. In practice, this means that every character is a 64-bit integer, so a `text` value will use substantially more memory than the equivalent encoded `string` value.
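To make the storage model concrete, here is a small sketch in plain Lua 5.3+. It uses the standard `utf8` library rather than `cp.text`, and `toCodepoints` is a hypothetical helper named for illustration only. The 8-bytes-per-codepoint figure is an assumption that counts just the integer payload and ignores table overhead.

```lua
-- Hypothetical helper: decode a UTF-8 string into an array of integer codepoints.
local function toCodepoints(s)
    local codepoints = {}
    for _, cp in utf8.codes(s) do
        codepoints[#codepoints + 1] = cp
    end
    return codepoints
end

local encoded = "a\u{4E3D}\u{10437}" -- the UTF-8 string "a丽𐐷"
local codepoints = toCodepoints(encoded)

print(#encoded)        -- 8  (bytes in the encoded string)
print(#codepoints)     -- 3  (one integer per character)
print(#codepoints * 8) -- 24 (bytes of 64-bit integer payload, assumed lower bound)
```

For plain ASCII text the gap is wider still, since every one-byte character becomes a full 64-bit integer.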
The advantages of `text` over `string` representations for Unicode are:
- comparisons, equality checks, etc. actually work for Unicode text and are not encoding-dependent.
- direct access to codepoint values.
The advantages of `string` representations for Unicode are:
- compactness.
- reading/writing to files via the standard `io` library.
Lua has limited built-in support for Unicode text. `string` values are "8-bit clean", which means each one is an array of 8-bit characters. This is also how binary data from files is usually loaded, as 8-bit 'bytes'. Unicode characters can need up to 32 bits, so there are several standard ways to represent them using 8-bit characters. Without going into detail, the most common encodings are 'UTF-8' and 'UTF-16'. UTF-16 has two variations, depending on byte order (historically tied to the hardware architecture), known as 'big-endian' and 'little-endian'.
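The difference between these encodings is easiest to see at the byte level. The following is a minimal sketch in plain Lua 5.3+, not the way `cp.text` encodes internally. `hex` and `encodeUtf16` are hypothetical helpers, and `encodeUtf16` does no validation of its input.

```lua
-- Render each byte of a string as two hex digits.
local function hex(s)
    local out = {}
    for i = 1, #s do
        out[i] = string.format("%02X", s:byte(i))
    end
    return table.concat(out, " ")
end

-- Hypothetical helper: encode one codepoint as UTF-16.
-- Codepoints above U+FFFF are split into a surrogate pair.
local function encodeUtf16(codepoint, bigEndian)
    local fmt = bigEndian and ">I2" or "<I2"
    if codepoint < 0x10000 then
        return string.pack(fmt, codepoint)
    end
    local offset = codepoint - 0x10000
    local high = 0xD800 + (offset >> 10)
    local low = 0xDC00 + (offset & 0x3FF)
    return string.pack(fmt, high) .. string.pack(fmt, low)
end

print(hex(utf8.char(0x4E3D)))           -- E4 B8 BD    (UTF-8)
print(hex(encodeUtf16(0x4E3D, true)))   -- 4E 3D       (UTF-16 big-endian)
print(hex(encodeUtf16(0x4E3D, false)))  -- 3D 4E       (UTF-16 little-endian)
print(hex(encodeUtf16(0x10437, true)))  -- D8 01 DC 37 (surrogate pair)
print(hex(encodeUtf16(0x10437, false))) -- 01 D8 37 DC
```

The same codepoint produces different byte sequences in each encoding, which is why encoded strings cannot be compared reliably unless their encodings match.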
The built-in functions for `string`, such as `match`, `gsub` and even `len`, will not work as expected when a string contains Unicode text. As such, this library fills some of the gaps for common operations when working with Unicode text.
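A short plain-Lua illustration of the problem, assuming Lua 5.3 or later for the `\u{...}` escapes and the standard `utf8` library:

```lua
local s = "a\u{4E3D}\u{10437}" -- the UTF-8 string "a丽𐐷"

-- The length operator and string.len count bytes, not characters.
print(#s)          -- 8
print(utf8.len(s)) -- 3

-- string.sub also works on bytes, so it can cut a character in half.
print(s:sub(2, 2) == "\u{4E3D}") -- false (only the first byte of 丽)
print(s:sub(2, 4) == "\u{4E3D}") -- true  (all three bytes of 丽)
```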
You can convert to and from `string` and `text` values like so:
local text = require("cp.text")
local simpleString = "foobar"
local simpleText = text(simpleString)
local utf8String = "a丽𐐷" -- contains non-ASCII characters, defaults to UTF-8.
local unicodeText = text "a丽𐐷" -- contains non-ASCII characters, converts from a UTF-8 string.
local utf8Encoded = tostring(unicodeText) -- `tostring` will default to UTF-8 encoding
local utf16leString = unicodeText:encode(text.encoding.utf16le) -- or you can be more specific
Note that `text` values are not in any specific encoding, since they are stored as 64-bit integer codepoints rather than 8-bit characters.
- Constants - Useful values which cannot be changed
  - encoding
- Functions - API calls offered directly by the extension
  - is
- Constructors - API calls which return an object, typically one that offers API methods
  - char
  - fromCodepoints
  - fromFile
  - fromString
- Methods - API calls which can only be made on an object returned by a constructor
  - encode
  - find
  - len
  - match
  - sub
| Signature | `cp.text.encoding` |
| --- | --- |
| Type | Constant |
| Description | The list of supported encoding formats: `utf8`, `utf16le`, `utf16be`. |
| Signature | `cp.text.is(value) -> boolean` |
| --- | --- |
| Type | Function |
| Description | Checks if the provided value is a `text` instance. |
| Parameters | `value` - The value to check. |
| Returns | `true` if the value is a `text` instance. |
| Signature | `cp.text.char(...) -> text` |
| --- | --- |
| Type | Constructor |
| Description | Converts one or more codepoint integers into a `text` value, concatenating the results. |
| Parameters | `...` - The list of codepoint integers. |
| Returns | The `cp.text` value for the list of codepoint values. |
| Signature | `cp.text.fromCodepoints(codepoints[, i[, j]]) -> text` |
| --- | --- |
| Type | Constructor |
| Description | Returns a new `text` instance representing the specified array of codepoints. Since `i` and `j` default to the first and last items, passing only `codepoints` converts the whole array. |
| Parameters | `codepoints` - The array of codepoint integers.<br>`i` - The starting index to read from `codepoints`. Defaults to `1`.<br>`j` - The ending index to read from `codepoints`. Defaults to `-1`. |
| Returns | A new `text` instance. |
| Notes | You can use a negative value for `i` and `j`. If so, it will count back from the end of the `codepoints` array.<br>If the codepoint array begins with a Byte-Order Marker (BOM), the BOM is skipped in the resulting text. |
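The index and BOM rules for `fromCodepoints` can be sketched in plain Lua as follows. This is an illustration of the documented behaviour, not the library's implementation: `sliceCodepoints` is a hypothetical helper, and it assumes the BOM (U+FEFF) is only skipped when it sits at index 1 of the array.

```lua
local BOM = 0xFEFF

-- Hypothetical helper: select codepoints[i..j], resolving negative
-- indices from the end and dropping a leading BOM.
local function sliceCodepoints(codepoints, i, j)
    local n = #codepoints
    i = i or 1
    j = j or -1
    if i < 0 then i = n + i + 1 end
    if j < 0 then j = n + j + 1 end
    if i < 1 then i = 1 end
    if j > n then j = n end

    local result = {}
    for index = i, j do
        local cp = codepoints[index]
        -- Assumption: only a BOM at the very start of the array is skipped.
        if not (index == 1 and cp == BOM) then
            result[#result + 1] = cp
        end
    end
    return result
end

local all = sliceCodepoints({ 0xFEFF, 0x61, 0x62, 0x63 }) -- { 0x61, 0x62, 0x63 }
local tail = sliceCodepoints({ 0x61, 0x62, 0x63, 0x64 }, -2) -- { 0x63, 0x64 }
local middle = sliceCodepoints({ 0x61, 0x62, 0x63, 0x64 }, 2, -2) -- { 0x62, 0x63 }
print(#all, #tail, #middle) -- 3 2 2
```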
| Signature | `cp.text.fromFile(path[, encoding]) -> text` |
| --- | --- |
| Type | Constructor |
| Description | Returns a new `text` instance representing the text loaded from the specified path. If no encoding is specified, UTF-8 is assumed. |
| Parameters | `path` - The path of the file to load.<br>`encoding` - One of the values from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`. |
| Returns | A new `text` instance. |
| Signature | `cp.text.fromString(value[, encoding]) -> text` |
| --- | --- |
| Type | Constructor |
| Description | Returns a new `text` instance representing the string value of the specified value. If no encoding is specified, UTF-8 is assumed. |
| Parameters | `value` - The value to turn into a Unicode `text` instance.<br>`encoding` - One of the values from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`. |
| Returns | A new `text` instance. |
| Notes | Calling `text(value)` is the same as calling `text.fromString(value, text.encoding.utf8)`, so simple text can be initialized via `local x = text "foo"` when the `.lua` file's encoding is UTF-8. |
| Signature | `cp.text:encode([encoding]) -> string` |
| --- | --- |
| Type | Method |
| Description | Returns the text as an encoded `string` value. |
| Parameters | `encoding` - The encoding to use when converting. Defaults to `cp.text.encoding.utf8`. |
| Signature | `cp.text:find(pattern [, init [, plain]])` |
| --- | --- |
| Type | Method |
| Description | Looks for the first match of `pattern` in the text value. If it finds a match, then `find` returns the indices of the text where this occurrence starts and ends; otherwise, it returns `nil`. The optional numerical argument `init` specifies where to start the search; its default value is `1` and it can be negative. Passing `true` for the optional argument `plain` turns off the pattern matching facilities, so the function does a plain "find substring" operation, with no characters in `pattern` being considered "magic". Note that if `plain` is given, then `init` must be given as well. |
| Returns | The start index, the end index, followed by any captures. |
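For contrast, the built-in `string.find` reports byte positions, which stop lining up with character positions as soon as the string contains multi-byte characters. A minimal plain-Lua sketch (Lua 5.3+, standard library only, no `cp.text` calls):

```lua
local s = "a\u{4E3D}b" -- the UTF-8 string "a丽b"

-- Plain find for "b": the result is a byte offset.
local startByte, endByte = s:find("b", 1, true)
print(startByte, endByte) -- 5 5

-- Counting the characters that start at or before that byte gives
-- the position a codepoint-based index would report.
local codepointIndex = utf8.len(s, 1, startByte)
print(codepointIndex) -- 3
```

Indices that count codepoints rather than bytes avoid this conversion step.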
| Signature | `cp.text:len() -> number` |
| --- | --- |
| Type | Method |
| Description | Returns the number of codepoints in the text. |
| Parameters | None |
| Returns | The number of codepoints. |
| Signature | `cp.text:match(pattern[, start]) -> ...` |
| --- | --- |
| Type | Method |
| Description | Looks for the first match of the pattern in the text value. If it finds one, then `match` returns the captures from the pattern; otherwise it returns `nil`. If `pattern` specifies no captures, then the whole match is returned. The optional numerical argument `start` specifies where to start the search; its default value is `1` and it can be negative. |
| Parameters | `pattern` - The text pattern to process.<br>`start` - If specified, indicates the starting position to process from. Defaults to `1`. |
| Returns | The capture results, the whole match, or `nil`. |
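The byte-oriented behaviour of the built-in `string.match` shows why a codepoint-aware version is useful. In this plain-Lua sketch (Lua 5.3+), the `.` pattern item matches one byte, while the standard `utf8.charpattern` matches one complete UTF-8 sequence:

```lua
local s = "\u{4E3D}" -- "丽", three bytes in UTF-8

local firstByte = s:match(".")              -- "." matches a single byte
local firstChar = s:match(utf8.charpattern) -- matches one whole UTF-8 sequence

print(#firstByte)     -- 1
print(firstChar == s) -- true
```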
| Signature | `cp.text:sub(i [, j]) -> cp.text` |
| --- | --- |
| Type | Method |
| Description | Returns the substring of this text that starts at `i` and continues until `j`; `i` and `j` can be negative. |
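
As a rough usage sketch, the calls documented in this module can be combined as shown here. It assumes the code runs inside CommandPost, where `require("cp.text")` resolves; the `pcall` guard is only there so the snippet exits quietly elsewhere. It also assumes that omitting `j` in `sub` reads through to the end of the text, as `string.sub` does. The comments describe what the documentation implies rather than verified output.

```lua
-- Guarded require: cp.text only exists inside CommandPost.
local ok, text = pcall(require, "cp.text")
if not ok then
    print("cp.text is not available in this environment")
    return
end

-- Build a text value from codepoints: U+0061 'a', U+4E3D '丽', U+10437 '𐐷'.
local value = text.char(0x61, 0x4E3D, 0x10437)

print(text.is(value)) -- documented to be true for text instances
print(value:len())    -- number of codepoints, so 3 here

-- Indices count codepoints, not bytes, and may be negative.
local last = value:sub(-1)
print(last:len())

-- Encode back to strings, for example before writing with the io library.
local asUtf8 = value:encode()                       -- defaults to UTF-8
local asUtf16le = value:encode(text.encoding.utf16le)
print(#asUtf8, #asUtf16le) -- byte lengths of the two encoded strings
```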