Skip to content

Minimal implementation of a Byte Pair Encoding (BPE) tokenizer in Zig

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

alvarobartt/bpe.zig

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bpe.zig

bpe.zig is a minimal implementation of a Byte Pair Encoding (BPE) tokenizer in Zig.

Warning

This implementation is currently an educational project for exploring Zig and tokenizer internals (particularly BPE used in models like e.g. GPT-2).

Usage

First you need to download the tokenizer.json file from the Hugging Face Hub at openai-community/gpt2.

const std = @import("std");
const Tokenizer = @import("bpe.Tokenizer");

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    // https://huggingface.co/openai-community/gpt2/tree/main/tokenizer.json
    var tokenizer = try Tokenizer.init("tokenizer.json", allocator);
    defer tokenizer.deinit();

    const text = "Hello, I'm a test string with numbers 123 and symbols @#$!<|endoftext|>";
    const encoding = try tokenizer.encode(text);
    defer allocator.free(encoding);

    std.debug.print("Encoded tokens: {any}\n", .{encoding});
}

License

This project is licensed under either of the following licenses, at your option:

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

About

Minimal implementation of a Byte Pair Encoding (BPE) tokenizer in Zig

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages