Upgrade to 2.3.0 #225

Draft: yuanwu2017 wants to merge 293 commits into habana-main

Conversation

yuanwu2017 (Author) commented:

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

McPatate and others added 30 commits September 24, 2024 03:42
* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint
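
A quick way to exercise the response_format support added in the commit above is through the OpenAI-compatible chat endpoint. The payload shape below (a JSON-schema constrained response) and the local URL are illustrative assumptions, not necessarily the exact contract of this version:

    import requests

    # Illustrative sketch: ask the chat endpoint for schema-constrained JSON output.
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": "Name one animal and its color."}],
        "response_format": {
            "type": "json_object",
            "value": {  # assumed schema shape
                "type": "object",
                "properties": {"animal": {"type": "string"}, "color": {"type": "string"}},
                "required": ["animal", "color"],
            },
        },
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
    print(resp.json()["choices"][0]["message"]["content"])
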
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
* Use minijinja's pycompat mode for python methods

* fix: cargo fmt lint for pre commit

---------

Co-authored-by: Armin Ronacher <[email protected]>
* feat: add kserve feature and basic routes

* feat: implement infer endpoint wrapper around generate

* fix: refactor and improve types

* fix: improve infer and simplify

* fix: cleanup and improve api docs

* fix: refactor and encapsulate kserve feat in file

* fix: remove typos after rebase
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
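
As a rough illustration of the configurations listed above, a compatibility check along these lines (a hypothetical helper, not the actual repacking code) captures when the GPTQ Marlin path applies:

    # Hypothetical helper mirroring the supported GPTQ Marlin configurations above.
    def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
        supported_bits = {4, 8}
        supported_groupsizes = {-1, 32, 64, 128}
        # desc_act is supported either way; it only changes activation ordering.
        return bits in supported_bits and groupsize in supported_groupsizes

    # Example: a typical 4-bit, groupsize-128 GPTQ checkpoint qualifies.
    assert can_use_gptq_marlin(bits=4, groupsize=128, desc_act=True)
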
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <[email protected]>
* doc: adding architecture document

* doc: add architecture to toctree

* fix: avoid cargo lock changes

* fix: avoid cargo lock tweak

---------

Co-authored-by: drbh <[email protected]>
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. huggingface#2056). Images were processed separately and
then concatenated, which can fail when the images have different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
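
The fix described above amounts to handing the whole image list to the processor in a single call rather than preprocessing images one by one and concatenating. A minimal sketch (the checkpoint name and file paths are illustrative assumptions):

    from PIL import Image
    from transformers import AutoImageProcessor

    # Illustrative: one call over the whole batch, so the image processor can
    # return tensors with compatible shapes.
    image_processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # assumed checkpoint
    images = [Image.open(p).convert("RGB") for p in ["small.jpg", "large.png"]]  # different sizes

    batch = image_processor(images, return_tensors="pt")
    print(batch["pixel_values"].shape)  # one stacked tensor for the whole batch
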
* Contributing guide & Code of Conduct

* Redirect to GitHub's tutorial on PRs
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
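
For reference, the 2 GiB gRPC receive limit mentioned a few lines above is typically configured through channel/server options like the sketch below; the values and placement here are illustrative, not the exact lines in this PR:

    import grpc

    # Illustrative: raise the receive limit from the 4 MiB default to ~2 GiB so
    # large multi-modal payloads fit in a single message.
    MAX_MESSAGE = 2 * 1024 * 1024 * 1024 - 1  # 2 GiB - 1, the int32 ceiling for gRPC

    server = grpc.aio.server(
        options=[
            ("grpc.max_receive_message_length", MAX_MESSAGE),
            ("grpc.max_send_message_length", MAX_MESSAGE),
        ]
    )
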
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
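
Conceptually, sharding a packed QKV tensor means splitting the packed dimension into its q/k/v blocks, taking the current rank's slice of each block, and re-packing. A simplified sketch with equal-sized blocks (shapes and names are illustrative, not the actual Weights.get_packed_sharded implementation):

    import torch

    def shard_packed_qkv(packed: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # Illustrative: packed has shape [3 * hidden]; split into q/k/v,
        # slice each block for this rank, then re-pack.
        hidden = packed.shape[0] // 3
        assert hidden % world_size == 0
        shard = hidden // world_size
        q, k, v = packed[:hidden], packed[hidden:2 * hidden], packed[2 * hidden:]
        return torch.cat([t[rank * shard:(rank + 1) * shard] for t in (q, k, v)], dim=0)

    # Example: rank 1 of 2 gets the second half of each of q, k and v.
    bias = torch.arange(12.0)  # pretend hidden = 4, so the packed size is 12
    print(shard_packed_qkv(bias, rank=1, world_size=2))  # tensor([2., 3., 6., 7., 10., 11.])
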
The subcommand did not work due to some broken imports.
* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79
* Fix cargo-chef prepare

In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will first generate a new one, which
might vary a lot and invalidate Docker build caches.

* Fix Dockerfile_amd and Dockerfile_intel
* Support HF_TOKEN environment variable

* Load test.

---------

Co-authored-by: Nicolas Patry <[email protected]>
* Adding Service Name Environment variable for huggingface#2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
* corrected Pydantic warning.

* Update clients/python/text_generation/types.py

Co-authored-by: Daniël de Kok <[email protected]>

---------

Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Daniël de Kok <[email protected]>
* use xpu-smi to dump used memory
XPU uses "ZE_AFFINITY_MASK" to select the card; usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <[email protected]>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <[email protected]>

---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Daniël de Kok <[email protected]>
* add CPU tgi support

Signed-off-by: Wang, Yi A <[email protected]>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <[email protected]>

---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Funtowicz Morgan <[email protected]>
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is exactly the same
except for a very few spots, the biggest being the kv-cache layout and the
flash_xxx.py files.
Since those files should be removed and factored away soon, we should not
need the separate paths.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
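
The marker/flag pairing described above is usually wired up in conftest.py roughly as follows; this is a sketch of the standard pytest pattern, not necessarily the exact code in integration-tests:

    import pytest

    def pytest_addoption(parser):
        # Adds `--release` so release-marked tests can be opted into explicitly.
        parser.addoption("--release", action="store_true", default=False,
                         help="run tests marked as release")

    def pytest_configure(config):
        config.addinivalue_line("markers", "release: long-running release tests")

    def pytest_collection_modifyitems(config, items):
        if config.getoption("--release"):
            return  # --release given: run everything, including release tests
        skip_release = pytest.mark.skip(reason="needs --release to run")
        for item in items:
            if "release" in item.keywords:
                item.add_marker(skip_release)
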
Narsil and others added 20 commits September 25, 2024 06:15
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.

Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
fix: pass missing revision arg for lora adapter when loading multiple adapters (huggingface#2510)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <[email protected]>
* use ratatui not archived tui

* bump ratatui all the way with options
Disable by default because CI runners do not have enough GPUs.
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
Runs the tests in a Nix build sandbox.
* Move to moe-kernels package and switch to common MoE layer (huggingface#2511)

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
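
To make the "common MoE layer" idea concrete: sparse MoE routing picks the top-k experts per token and mixes their outputs by the router weights. The sketch below is a plain PyTorch illustration of that idea, not the moe-kernels SparseMoELayer API:

    import torch, torch.nn as nn, torch.nn.functional as F

    class TinySparseMoE(nn.Module):
        # Conceptual sketch of top-k expert routing (not the moe-kernels API).
        def __init__(self, dim=64, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts, bias=False)
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):  # x: [tokens, dim]
            weights, idx = torch.topk(self.router(x), self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    print(TinySparseMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
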
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
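
The "only send the usage when asked for" behavior corresponds to the OpenAI-style stream_options field on the chat endpoint. An illustrative streaming request; the field names follow the OpenAI spec and are assumed (not verified) to match this server's implementation:

    import json, requests

    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": "Say hi"}],
        "stream": True,
        # Without this, streaming chunks should not carry a usage block.
        "stream_options": {"include_usage": True},
    }
    with requests.post("http://localhost:8080/v1/chat/completions",
                       json=payload, stream=True, timeout=60) as resp:
        for line in resp.iter_lines():
            if line and line.startswith(b"data:") and b"[DONE]" not in line:
                chunk = json.loads(line[len(b"data:"):])
                if chunk.get("usage"):
                    print("usage:", chunk["usage"])
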
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
* fix: wrap python basic logs in debug assertion in launcher (huggingface#2539)

* use level filters instead
* Preparing for release.

* Upgrade version in docs.
yuanwu2017 marked this pull request as draft on September 26, 2024
mandy-li (Collaborator) commented on Oct 1, 2024:

@yuanwu2017, for testing, please add the torch.compile model to your test cases.

# CHUNK_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
# LAZY_MODE = int(os.environ.get('PT_HPU_LAZY_MODE', 1))
MAX_TOTAL_TOKENS = int(os.environ.get('MAX_TOTAL_TOKENS', 8192))
MAX_BATCH_TOTAL_TOKENS = int(os.environ.get('MAX_BATCH_TOTAL_TOKENS', 65536))
tthakkal (Collaborator) commented on Oct 3, 2024:

@yuanwu2017 Looks like you are not reading the --max-batch-total-tokens param coming from the command line and are instead always setting it to MAX_BATCH_TOTAL_TOKENS=65536; this causes out of memory at "Decode warmup with bigger batch_size".
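
Something along these lines would honor the CLI value before falling back to the environment default; this is a hypothetical sketch, with names and plumbing that are illustrative rather than the fork's actual code:

    import os

    def resolve_max_batch_total_tokens(cli_value=None):
        # Hypothetical helper: prefer the value passed via --max-batch-total-tokens,
        # fall back to the MAX_BATCH_TOTAL_TOKENS env var, then to 65536.
        if cli_value is not None:
            return int(cli_value)
        return int(os.environ.get("MAX_BATCH_TOTAL_TOKENS", 65536))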

yuanwu2017 (Author) replied:

It is only a draft. I am running the performance benchmark and fixing some issues I found.

batch = self.generate_warmup_batch(request, PREFILL_WARMUP_SEQLEN_LIST[0], DECODE_WARMUP_BATCH_SIZE_LIST[-1], is_warmup)
_, prefill_batch, _ = self.generate_token([batch], is_warmup)
batches.append(prefill_batch)
while batch_size <= max_decode_batch_size:
A collaborator commented:

Warming up multiple decode batches is causing out of memory. For example, the command below crashes in this loop with out of memory; it used to work in the last release.

docker run -it --rm -p 8056:80 \
    --runtime=habana \
    -v /data2/models:/data \
    -v /data2/models:/data/hub \
    -v /data2/models:/root/.cache/huggingface/hub \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e HABANA_VISIBLE_DEVICES=1 \
    -e MAX_TOTAL_TOKENS=2048 \
    -e PREFILL_BATCH_BUCKET_SIZE=2 \
    -e BATCH_BUCKET_SIZE=32 \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e ENABLE_HPU_GRAPH=true \
    -e LIMIT_HPU_GRAPH=true \
    -e USE_FLASH_ATTENTION=true \
    -e FLASH_ATTENTION_RECOMPUTE=true \
    --entrypoint bash \
    --cap-add=sys_nice --ipc=host <PR_225_image> \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-batch-prefill-tokens 2048 \
    --max-batch-total-tokens 65536 \
    --max-waiting-tokens 7 \
    --waiting-served-ratio 1.2 \
    --max-concurrent-requests 64

yuanwu2017 (Author) replied:

Ok. I will fix it.

yuanwu2017 (Author), replying to the earlier request to add the torch.compile model to the test cases:

Ok.
