llama : greatly reduce output buffer memory usage #6122

Merged

merged 26 commits on Mar 26, 2024

Commits on Mar 17, 2024

  1. Commit 1fd1918
  2. Commit 98914c0
  3. Commit 705d393
  4. Commit 25981fc
  5. perplexity : fix Winogrande, use correct logits for second choice start

    The first logits used to evaluate the second choice were not from
    the end of the common prefix; instead, they were the logits from the end
    of the first choice. This has been corrected.

    The previous implementation sometimes produced outlier scores for the
    choices of some tasks, and the logic to skip choice words in the
    log-likelihood evaluation was probably an attempt to reduce those,
    but it was complex and didn't quite seem to be the right thing.

    This is simpler now, and the outlier scores are gone.

    compilade committed Mar 17, 2024 · 17b45c9
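    To illustrate the corrected behavior, here is a minimal sketch (not the
    actual perplexity.cpp code) of scoring a Winogrande choice that shares a
    common prefix with the other choice; the helper name and data layout are
    assumptions:

    ```cpp
    // Sketch: log-likelihood of the choice tokens that follow the common
    // prefix. logprobs[i][tok] = log P(tok | tokens[0..i]), i.e. the
    // log-probabilities computed from the logits at position i.
    #include <cstddef>
    #include <vector>

    float score_choice(const std::vector<std::vector<float>> & logprobs,
                       const std::vector<int> & tokens, // prefix + choice
                       size_t n_prefix) {               // prefix length (>= 1)
        float sum = 0.0f;
        for (size_t i = n_prefix; i < tokens.size(); ++i) {
            // For i == n_prefix this reads logprobs[n_prefix - 1]: the
            // logits from the end of the common prefix, not from the end
            // of the other choice.
            sum += logprobs[i - 1][tokens[i]];
        }
        return sum;
    }
    ```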
  6. Commit d0129e8
  7. Commit 487f89e
  8. Commit 408fcb0
  9. llama : fix wrong n_outputs in llama_set_inputs

    A mismatch happened when using a smaller n_ubatch than n_batch and then
    using llama_batch_get_one(). The decision of what n_outputs should be now
    depends almost entirely on how lctx.n_outputs is set in
    llama_decode_internal. The conditions are simpler this way.

    * llama : when saving the state, recalculate n_outputs

    This ensures the correct number of outputs for the entire previous batch
    is stored in the session file, even when n_ubatch is smaller than n_batch.

    compilade committed Mar 17, 2024 · e19cb3a
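    A hedged sketch of how the number of outputs for a batch can be
    recalculated; the function name is illustrative, and it assumes the
    llama_batch convention where a null per-token logits array (as produced
    by llama_batch_get_one()) means only the last token produces output,
    ignoring the logits_all case:

    ```cpp
    // Count how many tokens of a batch request an output row.
    #include <cstdint>

    int32_t count_outputs(const int8_t * logits, int32_t n_tokens) {
        if (logits == nullptr) {
            return 1; // e.g. llama_batch_get_one(): only the last token
        }
        int32_t n_outputs = 0;
        for (int32_t i = 0; i < n_tokens; ++i) {
            n_outputs += logits[i] != 0; // one output per set flag
        }
        return n_outputs;
    }
    ```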

Commits on Mar 18, 2024

  1. Commit a57fa7f
  2. llama : fix running a batch with n_outputs == 0

    It previously worked because lctx.inp_out_ids was not initialized, so it
    pointed to a garbage address which happened to still be valid when I ran
    my tests.

    compilade committed Mar 18, 2024 · 711b0bc
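    A sketch of the failure mode described above, with illustrative names
    (not llama.cpp's actual code): an input that is only written when there
    are outputs stays uninitialized for an empty batch, and reading it is
    undefined behavior that can accidentally "work":

    ```cpp
    #include <cstdint>
    #include <vector>

    struct ctx_sketch {
        std::vector<int32_t> inp_out_ids; // batch rows that produce output
    };

    void set_inputs(ctx_sketch & ctx, const std::vector<int32_t> & out_ids) {
        // Buggy pattern: `if (!out_ids.empty()) ctx.inp_out_ids = out_ids;`
        // leaves stale contents behind when n_outputs == 0.
        ctx.inp_out_ids = out_ids; // assign unconditionally, even when empty
    }
    ```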
  3. Commit d100502
  4. ggml : saner ggml_can_repeat with empty tensors

    * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1

    compilade committed Mar 18, 2024 · 99c37cc
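    A minimal sketch of what "saner ggml_can_repeat with empty tensors"
    plausibly means; ggml_tensor::ne and GGML_MAX_DIMS are real ggml
    definitions, but the bodies below are an illustration rather than the
    upstream code:

    ```cpp
    #include "ggml.h"

    // A tensor is empty if any of its dimensions has zero elements.
    static bool is_empty_sketch(const struct ggml_tensor * t) {
        for (int i = 0; i < GGML_MAX_DIMS; ++i) {
            if (t->ne[i] == 0) {
                return true;
            }
        }
        return false;
    }

    // An empty tensor can only repeat into another empty tensor; otherwise
    // every dimension of t1 must be a multiple of the matching one in t0.
    static bool can_repeat_sketch(const struct ggml_tensor * t0,
                                  const struct ggml_tensor * t1) {
        return is_empty_sketch(t0) ? is_empty_sketch(t1) :
            (t1->ne[0] % t0->ne[0] == 0) &&
            (t1->ne[1] % t0->ne[1] == 0) &&
            (t1->ne[2] % t0->ne[2] == 0) &&
            (t1->ne[3] % t0->ne[3] == 0);
    }
    ```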
  5. Commit 6bf7f3f

Commits on Mar 19, 2024

  1. Commit 09bb15a
  2. llama : use a vector for ctx->output_ids

    * llama : rework reallocation logic for llama_output_reserve

    The actual size of the output buffer is now compared with the new total
    size, to allow more efficient enabling and disabling of the embeddings
    and/or logits output in the future.

    compilade committed Mar 19, 2024 · 4551e7e
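    As an illustration of that reallocation rule, a grow-only reserve sketch
    with hypothetical names (the real llama_output_reserve also lays out
    logits and embeddings within the buffer):

    ```cpp
    #include <cstddef>
    #include <vector>

    struct output_buffer_sketch {
        std::vector<float> buf; // storage for logits and/or embeddings

        // n_outputs rows of n_floats values each (n_vocab for logits,
        // n_embd for embeddings, or their sum when both are enabled).
        void reserve(size_t n_outputs, size_t n_floats) {
            const size_t new_total = n_outputs * n_floats;
            if (new_total > buf.size()) { // compare against the actual size
                buf.resize(new_total);    // grow only when actually needed
            }
        }
    };
    ```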
  3. Commit 8b826c5
  4. Commit d04cfaf
  5. perplexity : make Winogrande work as it does on master

    The problems with the Winogrande implementation will
    need to be fixed in a separate PR to ease review.

    compilade committed Mar 19, 2024 · 8f70dcb
  6. llama : clearer error messages for invalid logits or embeddings ids

    * llama : assert all models that can have inp_out_ids

    Since the graph topology is now constant, this presence check can be done
    even when there are no outputs.

    * llama : assert logits and embd buffers exist before writing to them

    compilade committed Mar 19, 2024 · 615a3a4
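    A sketch of the kind of bounds check these error messages imply, loosely
    modeled on llama_get_logits_ith; the field names and message wording
    here are illustrative:

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Map a batch position i to its row in the output buffer via output_ids,
    // failing with a descriptive message instead of reading out of bounds.
    const float * get_logits_ith_sketch(const std::vector<int32_t> & output_ids,
                                        const std::vector<float> & logits,
                                        int32_t n_vocab, int32_t i) {
        if (i < 0 || (size_t) i >= output_ids.size()) {
            std::fprintf(stderr, "out of range logits id %d\n", i);
            return nullptr;
        }
        const int32_t row = output_ids[i];
        if (row < 0) {
            // No logits were computed for this token (its flag was not set).
            std::fprintf(stderr, "no logits were computed for token %d\n", i);
            return nullptr;
        }
        return logits.data() + (size_t) row * n_vocab;
    }
    ```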

Commits on Mar 21, 2024

  1. Commit 7d8d6b5
  2. perplexity : make hellaswag and multiple-choice outputs identical to
    master

    Due to how the KV cache is updated, the logprobs for tokens in a batch
    are very slightly affected by the other tokens present in the batch, so
    to make hellaswag and multiple-choice return exactly the same results as
    on master, the last token of each sequence needs to be evaluated even
    though its output is not used at all.

    This will probably be changed back in the future to make these benchmarks
    a tiny bit faster.

    * perplexity : fix division by zero when using fewer than 100
    multiple-choice tasks

    compilade committed Mar 21, 2024 · 5f33a67
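    A hedged sketch of the division-by-zero guard; the actual fix lives in
    examples/perplexity/perplexity.cpp and may differ in detail:

    ```cpp
    #include <cstdio>

    // Only report running accuracy once at least one task has been scored,
    // avoiding a zero denominator when fewer tasks than the reporting
    // granularity are run.
    void print_running_accuracy(int n_correct, int n_done) {
        if (n_done > 0) {
            std::printf("%d/%d correct (%.4f%%)\n",
                        n_correct, n_done, 100.0 * n_correct / n_done);
        }
    }
    ```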

Commits on Mar 25, 2024

  1. Merge branch 'master' into compilade/smaller-output-buffer

    Notably includes the new repetition penalty default, support for grok-1,
    and support for split GGUF.

    compilade committed Mar 25, 2024 · ffa9abd

Commits on Mar 26, 2024

  1. llama : allow loading state saved with a different ctx size

    When loading a session file, the context size is now only required to be
    at least large enough to hold the KV cells contained in that session
    file, instead of having to match exactly the context size used when
    saving.

    Doing this enables the use case of extending or shrinking the context
    size of a saved session.

    This breaks existing session files because the meaning of kv_buf_size
    has changed slightly (previously it was the size of the whole KV cache;
    now it's only the size of the saved part of it). This allows for
    finer-grained sanity checks when loading, in an effort to keep
    kv_buf_size useful even when kv_size is changed.

    compilade committed Mar 26, 2024 · e9095ac
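    A minimal sketch of the relaxed check, with hypothetical names: the
    saved KV cells must fit into the current context, but the two context
    sizes no longer need to be equal:

    ```cpp
    #include <cstdint>
    #include <cstdio>

    bool session_fits(uint32_t n_ctx, uint32_t n_saved_kv_cells) {
        if (n_saved_kv_cells > n_ctx) {
            std::fprintf(stderr,
                "session has %u KV cells, but the context holds only %u\n",
                n_saved_kv_cells, n_ctx);
            return false; // the saved cells don't fit
        }
        return true; // n_ctx may be larger or smaller than at save time
    }
    ```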
  2. llama : minor

    ggml-ci

    ggerganov committed Mar 26, 2024 · 5027d81
  3. Commit 20248e8