Upgrade to 2.3.0 #225
base: habana-main
Conversation
update commit
* feat: support response_format in chat * fix: adjust typos * fix: add trufflehog lint
* fix(layers): fix SuRotaryEmbedding * change arange * remove logs
* Use minijinja's pycompat mode for python methods * fix: cargo fmt lint for pre commit --------- Co-authored-by: Armin Ronacher <[email protected]>
* feat: add kserve feature and basic routes * feat: implement infer endpoint wrapper around generate * fix: refactor and improve types * fix: improve infer and simplify * fix: cleanup and improve api docs * fix: refactor and encapsulate kserve feat in file * fix: remove typos after rebase
Add support for GPTQ Marlin kernels GPTQ Marlin extends the Marlin kernels to support common GPTQ configurations: - bits: 4 or 8 - groupsize: -1, 32, 64, or 128 - desc_act: true/false Using the GPTQ Marlin kernels requires repacking the parameters in the Marlin quantizer format. The kernels were contributed by Neural Magic to VLLM. We vendor them here for convenience.
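The supported-configuration constraints above can be expressed as a small validity check. This is an illustrative sketch, not the actual quantizer API; the real repacking work happens inside the vendored Marlin kernels.

```python
# Hypothetical helper mirroring the constraints listed above: GPTQ Marlin
# accepts 4- or 8-bit weights with groupsize -1, 32, 64, or 128, and
# either setting of desc_act.
def is_marlin_compatible(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Return True if a GPTQ config can be repacked for the Marlin kernels."""
    # desc_act is accepted as both True and False, so it needs no check.
    return bits in (4, 8) and groupsize in (-1, 32, 64, 128)
```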
* Update the link for qwen2 * Fix Qwen2 model URL in model table * Fix too eager staging --------- Co-authored-by: Daniël de Kok <[email protected]>
* doc: adding architecture document * doc: add architecture to toctree * fix: avoid cargo lock changes * fix: avoid cargo lock tweak --------- Co-authored-by: drbh <[email protected]>
When a batch contained images of different sizes during prefill, the server would fail (see e.g. huggingface#2056). Images were processed separately and then concatenated. However, this can fail for images with different sizes. Fix this by preprocessing all images in the batch together, so that the image processor can ensure that all image tensors have compatible sizes.
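The fix described above boils down to padding every image in the batch to a common size before stacking. A minimal sketch of the idea, assuming NumPy HxWxC arrays (this is not the actual TGI preprocessing code):

```python
# Illustrative sketch: pad all images in a batch to the largest
# height/width so the resulting tensors can be stacked into one batch,
# instead of processing each image separately and concatenating.
import numpy as np

def batch_preprocess(images):
    """Pad HxWxC images to a common size and stack them along a batch axis."""
    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    padded = []
    for img in images:
        pad_h = max_h - img.shape[0]
        pad_w = max_w - img.shape[1]
        # Zero-pad on the bottom/right; channels are left untouched.
        padded.append(np.pad(img, ((0, pad_h), (0, pad_w), (0, 0))))
    return np.stack(padded)
```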
* Contributing guide & Code of Conduct * Redirect to GitHub's tutorial on PRs
* Set maximum grpc message receive size to 2GiB The previous default was 4MiB, which doesn't really work well for multi-modal models. * Update to Rust 1.79.0 * Fixup formatting to make PR pass
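Raising the gRPC receive limit from the 4 MiB default toward 2 GiB looks roughly like the following on the Python client side. The option names are the standard gRPC channel arguments; the channel creation itself is commented out since it needs a running server and the `grpcio` package.

```python
# Sketch: configure a gRPC channel to accept messages up to the
# ~2 GiB int32 ceiling instead of the 4 MiB default, which is too
# small for multi-modal payloads.
GIB = 1024 ** 3
options = [
    ("grpc.max_receive_message_length", 2 * GIB - 1),  # int32 max
    ("grpc.max_send_message_length", 2 * GIB - 1),
]
# import grpc
# channel = grpc.insecure_channel("localhost:50051", options=options)
```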
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
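The packed-sharding idea above is that Q, K, and V live back to back in one tensor, and each rank must take its slice from each block rather than a contiguous slice of the whole. A toy NumPy sketch under that assumption (this is not the real `Weights.get_packed_sharded` implementation):

```python
# Illustrative sketch: shard a packed tensor of `blocks` equal-size
# blocks (e.g. Q, K, V) so each rank gets its slice of every block.
import numpy as np

def get_packed_sharded(packed, blocks, rank, world_size):
    """packed: 1-D array of `blocks` equal-size blocks laid out back to back."""
    block = packed.shape[0] // blocks
    shard = block // world_size
    parts = [
        packed[b * block + rank * shard : b * block + (rank + 1) * shard]
        for b in range(blocks)
    ]
    # A rank's shard concatenates its slice of Q, K, and V.
    return np.concatenate(parts)
```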
The subcommand did not work due to some broken imports.
* New runner. Manual squash. * Network host. * Put back trufflehog with proper extension. * No network host ? * Moving buildx install after tailscale ? * 1.79
* Fix cargo-chef prepare In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will generate a new one first, which might vary a lot and invalidate docker build caches. * Fix Dockerfile_amd and Dockerfile_intel
* Support HF_TOKEN environment variable * Load test. --------- Co-authored-by: Nicolas Patry <[email protected]>
* Adding Service Name Environment variable for huggingface#2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option
* corrected Pydantic warning. * Update clients/python/text_generation/types.py Co-authored-by: Daniël de Kok <[email protected]> --------- Co-authored-by: Nicolas Patry <[email protected]> Co-authored-by: Daniël de Kok <[email protected]>
* use xpu-smi to dump used memory xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES Signed-off-by: Wang, Yi A <[email protected]> * Update server/text_generation_server/utils/import_utils.py Co-authored-by: Daniël de Kok <[email protected]> --------- Signed-off-by: Wang, Yi A <[email protected]> Co-authored-by: Daniël de Kok <[email protected]>
…le with standard openai api (huggingface#2089) Co-authored-by: sunxichen <[email protected]>
* add CPU tgi support Signed-off-by: Wang, Yi A <[email protected]> * ipex distributed ops support Signed-off-by: Wang, Yi A <[email protected]> --------- Signed-off-by: Wang, Yi A <[email protected]> Co-authored-by: Funtowicz Morgan <[email protected]>
* feat: add simple tests for weights * fix: adjust types and add tests * fix: adjust so all tests pass * feat: improve weight tests * fix: add missing tests and renames * fix: tweak shapes
* Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is exactly similar except for a very few spots. The biggest number of spots is the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN
Signed-off-by: Wang, Yi A <[email protected]>
* Add pytest release marker Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`. * Mark many models as `release` to speed up CI
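The `@pytest.mark.release` gating described above is usually wired up with a custom CLI flag in `conftest.py`. The flag handling below is the standard pytest idiom, not necessarily TGI's exact code:

```python
# conftest.py sketch: skip tests marked `release` unless the suite is
# invoked with `pytest --release`.
import pytest

def pytest_addoption(parser):
    parser.addoption("--release", action="store_true", help="run release tests")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # run everything, including release-marked tests
    skip = pytest.mark.skip(reason="needs --release")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip)
```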
* Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?
* Attempting to discard the trufflehog warning. * Attempt to fix trufflehog.
* Add nix test. * Modifying yourself means you need to rerun. * Fixing the test + adding click (needed for pre-commit hooks). * Try thuis. * Our runner + pure test (not written) * Reemove server. * Root user. * Different user ? * Add the actual test target. * Forgot this modification. * Add a formatter. * Add the secrets. * Fixed the auth token ? * Adding the other tests. * Missing pre-commit. * Test requires cargo for cargo fmt. * Update it a bit. * Up. * Attempting to use a cache location for the models. * Ignore the cache for now.
huggingface#2510) fix: pass missing revision arg for lora adapter when loading multiple adapters
enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <[email protected]>
* use ratatui not archived tui * bump ratatui all the way with options
Disable by default because CI runners do not have enough GPUs.
* Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.
Runs the tests in a Nix build sandbox.
…ce#2511) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner
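The routing idea behind a `SparseMoELayer` can be shown with a toy example: each token is sent to its top-k experts and the outputs are combined with renormalized router weights. All names and shapes below are illustrative and unrelated to the `moe-kernels` API:

```python
# Toy sparse-MoE dispatch: softmax router, top-k expert selection,
# weighted combination of expert outputs per token.
import numpy as np

def sparse_moe(x, router_w, experts, top_k=2):
    """x: (tokens, dim); router_w: (dim, n_experts); experts: list of callables."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]          # indices of top-k experts
        weight = probs[t, top] / probs[t, top].sum()  # renormalize over top-k
        for w, e in zip(weight, top):
            out[t] += w * experts[e](x[t])
    return out
```

With identity experts the weighted combination reduces to the input itself, which makes the renormalization easy to sanity-check.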
* Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow
* Update to moe-kenels 0.3.1 * Attempt to fix apt failure
…e#2532) Signed-off-by: Wang, Yi A <[email protected]>
…ce#2539) * fix: wrap python basic logs in debug assertion in launcher * use level filters instead
* Preparing for release. * Upgrade version in docs.
Signed-off-by: yuanwu <[email protected]>
Signed-off-by: yuanwu <[email protected]>
@yuanwu2017, for the testing, please add the torch.compile model to your test cases.
```python
# CHUNK_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
# LAZY_MODE = int(os.environ.get('PT_HPU_LAZY_MODE', 1))
MAX_TOTAL_TOKENS = int(os.environ.get('MAX_TOTAL_TOKENS', 8192))
MAX_BATCH_TOTAL_TOKENS = int(os.environ.get('MAX_BATCH_TOTAL_TOKENS', 65536))
```
@yuanwu2017 it looks like you are not reading the --max-batch-total-tokens parameter coming from the command line; instead it is always set to MAX_BATCH_TOTAL_TOKENS=65536, which causes out-of-memory at "Decode warmup with bigger batch_size".
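A fix along the lines the reviewer is asking for would prefer the command-line value and use the environment variable only as a fallback. The function name and signature here are illustrative, not the actual TGI code:

```python
# Hypothetical sketch: resolve the batch-token budget from the CLI first,
# falling back to the MAX_BATCH_TOTAL_TOKENS environment variable.
import os

def resolve_max_batch_total_tokens(cli_value=None):
    if cli_value is not None:
        return int(cli_value)  # --max-batch-total-tokens wins when given
    return int(os.environ.get("MAX_BATCH_TOTAL_TOKENS", 65536))
```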
It is only a draft. I am running the performance benchmark and fixing some issues I found.
```python
batch = self.generate_warmup_batch(request, PREFILL_WARMUP_SEQLEN_LIST[0], DECODE_WARMUP_BATCH_SIZE_LIST[-1], is_warmup)
_, prefill_batch, _ = self.generate_token([batch], is_warmup)
batches.append(prefill_batch)
while batch_size <= max_decode_batch_size:
```
Warming up multiple decode batches causes out-of-memory. For example, the command below crashes in this loop with out-of-memory, although it used to work in the last release:
```shell
docker run -it --rm -p 8056:80 \
  --runtime=habana \
  -v /data2/models:/data \
  -v /data2/models:/data/hub \
  -v /data2/models:/root/.cache/huggingface/hub \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e HABANA_VISIBLE_DEVICES=1 \
  -e MAX_TOTAL_TOKENS=2048 \
  -e PREFILL_BATCH_BUCKET_SIZE=2 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --entrypoint bash \
  --cap-add=sys_nice --ipc=host <PR_225_image> \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 2048 \
  --max-batch-total-tokens 65536 \
  --max-waiting-tokens 7 \
  --waiting-served-ratio 1.2 \
  --max-concurrent-requests 64
```
Ok. I will fix it.
refine the warmup Signed-off-by: yuanwu <[email protected]>
What does this PR do?

Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.