Open
Description
- Fp8 kv-cache
- Kv-cache prefix reuse
- Grammar constrained speedup
-
torch.compile
like speedups - Simple one-liner
pip install
- Multi lora support (lorax kind of)
- Marlin quantization
- Exl2 quants (non gptq 2, 3, 4.5 bpw)
- Adding more documentation/help regarding getting production grafana dashboards easily setup.