Make tokenize CLI tool have nicer command line arguments. #6188

Open · wants to merge 3 commits into master

Conversation


@Noeda Noeda commented Mar 20, 2024

The tokenize CLI tool is one of the tools in examples/*. It's a short and simple tool that takes arguments like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

It loads the model, reads the prompt, and then prints the list of tokens it interpreted, or, if --ids is given, just the integer token IDs.
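
For context, the whole core of such a tool is small. Below is a minimal sketch of the tokenize-and-print loop against the llama.cpp C API (llama_tokenize, llama_token_to_piece); those functions exist, but their exact signatures have shifted between llama.cpp versions, so treat this as an illustration rather than the PR's actual code.

    // Illustrative sketch only; not the PR's code. The llama.cpp C API
    // signatures for llama_tokenize / llama_token_to_piece have changed
    // across versions.
    #include <cstdio>
    #include <string>
    #include <vector>
    #include "llama.h"

    static void show_tokens(const llama_model * model, const std::string & prompt, bool ids_only) {
        // A prompt can never produce more tokens than bytes; +1 leaves room for BOS.
        std::vector<llama_token> tokens(prompt.size() + 1);
        const int n = llama_tokenize(model, prompt.c_str(), (int) prompt.size(),
                                     tokens.data(), (int) tokens.size(),
                                     /*add_bos*/ true, /*special*/ false);
        if (n < 0) {
            fprintf(stderr, "tokenization failed\n");
            return;
        }
        tokens.resize(n);
        for (const llama_token tok : tokens) {
            if (ids_only) {
                printf("%d\n", tok);
            } else {
                char piece[128];
                const int len = llama_token_to_piece(model, tok, piece, (int) sizeof(piece));
                printf("%6d -> '%.*s'\n", tok, len > 0 ? len : 0, piece);
            }
        }
    }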

This changeset makes the command a bit more sophisticated with more options:

mikkojuola@Mikkos-Mac-Studio ~/llama.cpp> ./build/bin/tokenize
usage: ./build/bin/tokenize [options]

The tokenize program tokenizes a prompt using a given model,
and prints the resulting tokens to standard output.

It needs a model file, a prompt, and optionally other flags
to control the behavior of the tokenizer.

Invoke './build/bin/tokenize' like this:

    ./build/bin/tokenize MODEL_FNAME PROMPT [--ids]

  or this:

    ./build/bin/tokenize [options], where options are:

    -h, --help                           print this help and exit
    -m MODEL_PATH, --model MODEL_PATH    path to model.
    --ids                                if given, only print numerical token IDs, and not token strings.
    -f PROMPT_FNAME, --file PROMPT_FNAME read prompt from a file.
    -p PROMPT, --prompt PROMPT           read prompt from the argument.
    --stdin                              read prompt from standard input.
    --no-bos                             do not ever add a BOS token to the prompt, even if normally the model uses a BOS token.
    --log-disable

It still recognizes the old form (i.e. simple positional arguments) so as not to surprise people, although I would like to remove it entirely to simplify the code. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?

Motivation: I've been using this tool for my own tests in tokenization-divergence investigations. I find it useful for quick ad-hoc tests of text tokenization and comparisons. In particular, I wanted it to behave nicely when you give it a filename or pipe into it from stdin.

I took my hacks and cleaned them up into nicer-looking command line arguments, following the style and argument names of some other CLI tools I saw. I also added some error checking, so you are more likely to get a readable error than a segfault if you do something wrong.

This is a draft because I need to test some of the argument combinations, and Windows, and I want to see the CI results on GitHub here. I think the stdin reading as written might be sketchy on Windows if you try to physically type characters, which would now become a feature of tokenize.

(std::cin does not have .is_open()? I got really confused trying to write code that checks whether we read from stdin without syscall failures, and trying to figure out whether the checking is waterproof. I'm a C programmer, not a C++ one, dammit.)
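
Indeed, std::cin has no .is_open(); the stream is simply always there. A minimal sketch of one way to do the check (not necessarily what the PR does): slurp the stream, then inspect its state bits, where badbit means the underlying read actually failed and eofbit alone is just a normal end of input.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Read all of standard input; returns false on an actual I/O error.
    static bool read_stdin(std::string & out) {
        std::ostringstream ss;
        ss << std::cin.rdbuf();   // slurp everything up to EOF
        if (std::cin.bad()) {     // badbit: the underlying read failed
            return false;
        }
        out = ss.str();           // eofbit alone just means end of input
        return true;
    }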


Noeda commented Mar 20, 2024

Just noticed the CI doesn't run... maybe because it's my first PR and I'm not on an allowlist? Is there a way to run the compilation tests myself?

Edit: Just as I wrote this, I saw things start building. Never mind.

Before this commit, tokenize was a simple CLI tool like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

This simple tool loads the model, takes the prompt, and shows the tokens
llama.cpp is interpreting.

This changeset makes tokenize more sophisticated, and more useful
for debugging and troubleshooting:

  tokenize [-m, --model MODEL_FILENAME]
           [--ids]
           [--stdin]
           [--prompt]
           [-f, --file]
           [--no-bos]
           [--log-disable]

It also behaves nicer on Windows now, interpreting and rendering Unicode
from command line arguments and pipes no matter what code page the user
has set on their terminal.
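
(For reference, the usual Win32 recipe for this is sketched below; the helper names are hypothetical and the commit's actual implementation may differ. The idea: fetch the arguments as UTF-16 via GetCommandLineW/CommandLineToArgvW, convert them to UTF-8 with WideCharToMultiByte, and switch the console output code page to UTF-8 so printed bytes are interpreted correctly regardless of the user's code page.)

    // Hedged sketch of the standard Win32 approach; helper names here are
    // hypothetical, not the commit's actual functions.
    #if defined(_WIN32)
    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>
    #include <shellapi.h>   // CommandLineToArgvW (link with Shell32)
    #include <string>
    #include <vector>

    // Convert one NUL-terminated UTF-16 string to UTF-8.
    static std::string utf16_to_utf8(const wchar_t * w) {
        const int n = WideCharToMultiByte(CP_UTF8, 0, w, -1, nullptr, 0, nullptr, nullptr);
        std::string s(n > 0 ? n - 1 : 0, '\0');
        if (n > 1) {
            WideCharToMultiByte(CP_UTF8, 0, w, -1, &s[0], n, nullptr, nullptr);
        }
        return s;
    }

    // Fetch argv as UTF-16 (immune to the ANSI code page), convert to UTF-8,
    // and make the console interpret our output bytes as UTF-8.
    static std::vector<std::string> get_utf8_argv() {
        int argc = 0;
        wchar_t ** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
        std::vector<std::string> argv;
        for (int i = 0; wargv && i < argc; i++) {
            argv.push_back(utf16_to_utf8(wargv[i]));
        }
        if (wargv) {
            LocalFree(wargv);
        }
        SetConsoleOutputCP(CP_UTF8);
        return argv;
    }
    #endif
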
@Noeda Noeda marked this pull request as ready for review March 26, 2024 21:13

Noeda commented Mar 26, 2024

Added a bunch of stuff since the last commit, as part of wrestling with the Windows cmd.exe console:

  • A lot of new code to interpret and render characters properly on Windows. Good lord, Windows is annoying. But you now get correctly interpreted and rendered text out of the box, without setting any code pages (although you may have to set a font depending on what text you use).
  • --ids now prints in a format that parses directly as Python or JSON (useful for sketchy pipe shenanigans); see the sketch after this list.
  • --log-disable silences stderr (consistent with main).
  • Fixed the style, e.g. the * pointer placement, to be more consistent with the rest of the codebase.
  • Prints "failed utf-8 decode" for tokens that don't parse as UTF-8. That seems fairly common with modern models, where individual tokens don't decode to valid UTF-8; in that case it prints hex codes instead, so you can see the bytes the token decodes to even when we can't render them properly.
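
As referenced above, here is a sketch of those two printing behaviors. Illustrative only, not the PR's exact code, and the UTF-8 check is simplified (it validates byte structure but does not reject overlong encodings or surrogates):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Print token IDs so that the output parses as a Python/JSON list.
    static void print_ids(const std::vector<int> & ids) {
        printf("[");
        for (size_t i = 0; i < ids.size(); i++) {
            printf("%s%d", i ? ", " : "", ids[i]);
        }
        printf("]\n");
    }

    // Simplified structural UTF-8 check.
    static bool is_valid_utf8(const std::string & s) {
        size_t i = 0;
        while (i < s.size()) {
            const unsigned char c = (unsigned char) s[i];
            int extra;
            if      (c < 0x80)           extra = 0;
            else if ((c & 0xE0) == 0xC0) extra = 1;
            else if ((c & 0xF0) == 0xE0) extra = 2;
            else if ((c & 0xF8) == 0xF0) extra = 3;
            else return false;
            if (i + extra >= s.size()) return false;
            for (int k = 1; k <= extra; k++) {
                if (((unsigned char) s[i + k] & 0xC0) != 0x80) return false;
            }
            i += extra + 1;
        }
        return true;
    }

    // Render a token piece, falling back to hex bytes when not valid UTF-8.
    static void print_piece(const std::string & piece) {
        if (is_valid_utf8(piece)) {
            printf("'%s'\n", piece.c_str());
        } else {
            printf("failed utf-8 decode:");
            for (const char ch : piece) {
                printf(" %02x", (unsigned char) ch);
            }
            printf("\n");
        }
    }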

I noticed midway that we have similar Windows-handling code in common/console.cpp. It's not exactly what I needed for tokenize, but I added a TODO comment about it and made the Windows bits a bit more general, so a later contribution may have an easier time moving them into common code.


--help looks like this now:

shannon@junko ~/llama.cpp/build/bin> ./tokenize --help
usage: ./tokenize [options]

The tokenize program tokenizes a prompt using a given model,
and prints the resulting tokens to standard output.

It needs a model file, a prompt, and optionally other flags
to control the behavior of the tokenizer.

Invoke './tokenize' like this:

    ./tokenize MODEL_FNAME PROMPT [--ids]

  or this:

    ./tokenize [options], where options are:

    -h, --help                           print this help and exit
    -m MODEL_PATH, --model MODEL_PATH    path to model.
    --ids                                if given, only print numerical token IDs, and not token strings.
                                         The output format looks like [1, 2, 3], i.e. parseable by Python.
    -f PROMPT_FNAME, --file PROMPT_FNAME read prompt from a file.
    -p PROMPT, --prompt PROMPT           read prompt from the argument.
    --stdin                              read prompt from standard input.
    --no-bos                             do not ever add a BOS token to the prompt, even if normally the model uses a BOS token.
    --log-disable                        disable logs. Makes stderr quiet when loading the model.

Some checks that everything is good

Verified that tokens are interpreted the same way in cmd.exe and on Linux, and that Windows renders tokens correctly (when they are valid UTF-8):

Windows cmd.exe, with --prompt こんにちは

[screenshot]

Reading from a file looks fine:

[screenshot]

Piping on Windows works too:

[screenshot]

Checked that we get the same IDs for こんにちは on Mac (got the same on Linux too):

[screenshot]

If the CI doesn't complain and there's no other feedback to fix I'm done with the PR.

@Noeda Noeda requested a review from ggerganov March 26, 2024 21:31

@ggerganov ggerganov left a comment


> It still recognizes the old form (i.e. simple positional arguments) so as not to surprise people, although I would like to remove it entirely to simplify the code. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?

Yes, let's remove the old style arguments to simplify

…guments.

It must now be invoked with long --model, --prompt etc. arguments only.
Shortens the code.