v0.1.7
What's Changed
- Build older CUDA wheels by @casper-hansen in #158
- Exclude download of CUDA wheels by @casper-hansen in #159
- New benchmarks in README by @casper-hansen in #160
- Fix typo in benchmark command by @casper-hansen in #161
- Yi support by @casper-hansen in #167
- Make sure to delete dummy model by @casper-hansen in #180
- Fix CUDA error: invalid argument by @casper-hansen in #179
- New logic for passing past_key_value by @younesbelkada in #177
- Reset cache on new generation by @casper-hansen in #178
- Adaptive batch sizing by @casper-hansen in #181
- Pass arguments to AutoConfig by @s4rduk4r in #97
- Fix cache util logic by @casper-hansen in #186
- Fix multi-GPU loading and inference by @casper-hansen in #190
- [`core`] Replace `QuantLlamaMLP` with `QuantFusedMLP` by @younesbelkada in #188
- [`core`] Add `is_hf_transformers` flag by @younesbelkada in #195
- Fixed multi-GPU quantization by @casper-hansen in #196
Full Changelog: v0.1.6...v0.1.7