Vintage Vectors takes a straightforward yet unconventional approach to tokenization by representing words as sequences of Unicode code points. Words are treated as channels/steps (the row dimension), with each word encoded as a sequence of Unicode code points in the feature (column) dimension. This approach is simple and reliable to reproduce.
```bash
pip install vintage-vectors
```
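As a minimal sketch of the representation (not necessarily the package's actual API; the function name, array layout, and padding scheme here are illustrative assumptions):

```python
import numpy as np

def encode_words(words, max_len):
    """Encode words as a (num_words, max_len) array of Unicode code points.

    Rows are words (the channel/step dimension); columns hold each word's
    code points (the feature dimension), zero-padded out to max_len.
    """
    arr = np.zeros((len(words), max_len), dtype=np.int32)
    for row, word in enumerate(words):
        for col, char in enumerate(word[:max_len]):
            arr[row, col] = ord(char)  # Unicode code point of the character
    return arr

tokens = encode_words("vintage vectors keeps it simple".split(), max_len=8)
print(tokens.shape)  # (5, 8)
print(tokens[0])     # "vintage" -> [118 105 110 116  97 103 101   0]
```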
The simplicity of this approach could offer benefits in terms of:

- Memory Efficiency: By leveraging the channel dimension to store words and the feature dimension to represent each word itself, Vintage Vectors may provide a memory-friendly alternative to traditional tokenization methods.
- Preservation of Word Structure: Encoding words as sequences of characters preserves their internal structure, which could be advantageous for tasks requiring character-level understanding. This retains the benefits of character-based tokenization schemes while keeping a lower token count than popular methods like Byte Pair Encoding (BPE); see the round-trip sketch after this list.
- Handle any language out of the box (in theory): Using code points allows us to model any character within Unicode, although the current processing is only compatible with Latin-based languages.
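To make the structure-preservation and any-language points concrete, here is a hedged round-trip sketch. `decode_words` is a hypothetical helper, not necessarily part of the package, and it reuses `encode_words` from the sketch above:

```python
def decode_words(arr):
    """Invert encode_words: map each row of code points back to a word,
    dropping the zero padding. chr() is the inverse of ord()."""
    return ["".join(chr(cp) for cp in row if cp != 0) for row in arr]

# The round trip is lossless, so character-level word structure is preserved.
# Any script with Unicode code points can be encoded, though the package's
# current processing targets Latin-based languages.
words = ["tokeniz", "токен", "令牌"]  # Latin, Cyrillic, CJK
restored = decode_words(encode_words(words, max_len=8))
assert restored == words
```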
This vintage, back-to-basics method draws inspiration from two sources:

- Audio and DSP: Coming from a background in hip-hop production and working with raw audio data, representing words as sequences of Unicode code points feels akin to working with raw samples, the building blocks of rich compositions.
- CANINE: The CANINE paper, which tokenizes text at the Unicode code point level, served as a direct inspiration for this character-level tokenization scheme.
TODO
- Make this repo a pip package
- Upload text models trained with this tokenization
- Upload audio models when done training
No claims are made about its effectiveness or superiority. Vintage Vectors is an exploration of a simple, character-level approach to tokenization, drawing parallels from the world of audio and from existing methods while revisiting the foundations of text representation.