Fix Llama-3.2-Vision #163

Merged 1 commit from pc/fix-mllama into pc/refactor-utils-1 on Dec 30, 2024

Conversation

@Blaizzy (Owner) commented Dec 30, 2024

This PR introduces language-only support and significant improvements in generation speed and memory usage for Llama-3.2-11B-Vision-Instruct-4bit on MLX-VLM. Below are the key metrics comparing performance before and after, along with the percentage changes:

| Metric | Before | After | % Diff |
| --- | --- | --- | --- |
| Prompt Tokens/sec | 2.807 | 2.825 | +0.64% |
| Generation Tokens/sec | 0.362 | 6.692 | +1749% |
| Peak Memory (GB) | 65.453 | 16.252 | -75.2% |
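
For reference, the headline generation number reconciles like this:

```python
# Deriving the generation-speed deltas from the table above
before, after = 0.362, 6.692                  # generation tokens/sec
pct_change = (after - before) / before * 100  # ~= 1748.6 -> reported as +1749%
speedup = after / before                      # ~= 18.5x
```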

Key Improvements

  1. Generation Speed: We achieved roughly an 18.5x increase in tokens generated per second.
  2. Memory Efficiency: Reduced peak memory usage by approximately 75%.
  3. Slightly Faster Prompt Handling: Prompt ingestion speed improved by about 0.64%.

These optimizations should enable more efficient inference and more stable performance overall.

Please review the changes for correctness and let me know if you have any questions or concerns!
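
If you want to exercise the new language-only path while reviewing, a minimal sketch along these lines should do it (the exact `mlx_vlm` entry points and signatures here are assumptions and may differ between versions):

```python
# Minimal language-only (no image) generation sketch for MLX-VLM.
# NOTE: load/generate signatures are assumed and may vary across versions.
from mlx_vlm import load, generate

model, processor = load("mlx-community/Llama-3.2-11B-Vision-Instruct-4bit")

# No image is passed, so the vision tower is never run; that is the path
# whose speed and memory usage this PR improves.
output = generate(
    model,
    processor,
    prompt="Summarize the difference between a VLM and an LLM in one sentence.",
    max_tokens=256,  # the new default, per the commit list below
)
print(output)
```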

[Figure: llama-3.2-vision performance improvement, Language Only vs VLM]

Closes #100 (Very High Memory Usage for Llama-3.2-11B-Vision-Instruct-4bit)

@Blaizzy Blaizzy changed the base branch from main to pc/refactor-utils-1 December 30, 2024 01:05
@Blaizzy Blaizzy merged commit 8740c0a into pc/refactor-utils-1 Dec 30, 2024
1 check passed
@Blaizzy Blaizzy deleted the pc/fix-mllama branch December 30, 2024 01:06
Blaizzy added a commit that referenced this pull request Dec 30, 2024
* remove unused

* add default layer_norm

* remove unused

* remove llava_bunny and idefics2 custom configs

* refactor molmo and qwen2 config

* add deprecation warning

* refactor update model configs

* refactor sanitize weights

* refactor class_predicate (see the quantization sketch after this list)

* move custom config logic to from_dict

* uncomment

* fix config name

* rename aligner to projector

* fix tests

* remove module from update list

* add trusted remote as kwargs

* update baseImageProcessor

* refactor image processor

* pin latest transformers

* bump version

* refactor prepare inputs

* simplify image loading

* fix load_image and refactor load_config

* make skip_non_divisible a default

* skip non divisible default and rename model inputs

* refactor condition

* fix language input only

* add fetch KV

* Increase default max tokens to 256

* refactor generate, generate step and stream

* fix high usage and add language only support (#163)
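
To illustrate the `class_predicate` / `skip_non_divisible` items above: a common pattern is to quantize only the layers whose trailing weight dimension divides the quantization group size. The sketch below uses mlx's `nn.quantize` with a hypothetical predicate and a toy model; it is not the repo's actual implementation, and the predicate's `(path, module)` signature is an assumption about the mlx version in use.

```python
# Illustrative sketch (not the repo's actual code): quantize linear-like
# layers, skipping any whose trailing weight dim is not divisible by the
# quantization group size.
import mlx.nn as nn

GROUP_SIZE = 64

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(128, 128)  # 128 % 64 == 0 -> quantized
        self.head = nn.Linear(100, 10)   # 100 % 64 != 0 -> skipped

def class_predicate(path, module):
    # Only quantize layer types nn.quantize supports, and only when the
    # input dimension is divisible by the group size.
    return (
        isinstance(module, (nn.Linear, nn.Embedding))
        and module.weight.shape[-1] % GROUP_SIZE == 0
    )

model = Tiny()
nn.quantize(model, group_size=GROUP_SIZE, bits=4, class_predicate=class_predicate)
```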