
Fixed LoRA fails with "list index out of range" error when fewer prompts and adapters than the full batch size are passed #251

Closed
quic-jouachen opened this issue Jan 29, 2025 · 1 comment

Comments

@quic-jouachen
Contributor

Describe the bug
Finite (fixed) LoRA inference with continuous batching fails with "IndexError: list index out of range" when fewer prompts and prompt-to-adapter mappings than the full batch size are passed to generate().

To Reproduce
Steps to reproduce the behavior:

  1. Command / script used
  • Install the apps and platform SDKs on the setup.
  • Clone the qefficient repo: git clone --branch release/v1.19 https://github.com/quic/efficient-transformers
  • Create a virtual environment and install all dependencies with pip install -e .
  • Install the transformers and qefficient libraries, then run the script below.

Script used:

# Imports, assuming the top-level QEfficient exports; adjust to your install if they differ
from QEfficient import QEffAutoPeftModelForCausalLM
from QEfficient.utils import load_hf_tokenizer

base_model_name = "mistralai/Mistral-7B-v0.1"
seq_len = 128 
ctx_len = 256 
full_batch_size = 2 
device_group = [0] 
 
qeff_model = QEffAutoPeftModelForCausalLM.from_pretrained( 
    "predibase/gsm8k", "gsm8k", continuous_batching=True, finite_adapters=True 
) 
 
qeff_model.load_adapter("predibase/tldr_content_gen", "tldr_content_gen") 
qeff_model.load_adapter("predibase/dbpedia", "dbpedia") 
 
#export & compile model 
qpc_path = qeff_model.compile( 
    batch_size=1, 
    full_batch_size=full_batch_size, 
    prefill_seq_len=seq_len, 
    ctx_len=ctx_len, 
    num_devices=len(device_group), 
    num_cores=16, 
    mxfp6_matmul=True, 
    mxint8_kv_cache=True, 
) 
 
#run inference on the generate function 
prompts = [ 
    """Please answer the following question: James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\n\nAnswer:""", 
] 
 
qeff_model.generate( 
    tokenizer=load_hf_tokenizer(pretrained_model_name_or_path=base_model_name), 
    prompts=prompts, 
    device_id=device_group, 
    prompt_to_adapter_mapping=[ 
        "gsm8k", 
    ] 
) 
  2. Error details

Error log:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.55s/it] 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 69759.73it/s] 
WARNING - QEfficient - Number of prompts are less than batch size/full batch size, repeating to required batch size 
WARNING - QEfficient - Streamer is currently unavailable for continuous batch execution. 
--------------------------------------------------------------------------- 
IndexError                                Traceback (most recent call last) 
Cell In[14], line 32
     27 # run inference on the generate function
     28 prompts = [
     29     """Please answer the following question: James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\n\nAnswer:""",
     30 ]
---> 32 qeff_model.generate(
     33     tokenizer=load_hf_tokenizer(pretrained_model_name_or_path=base_model_name),
     34     prompts=prompts,
     35     device_id=device_group,
     36     prompt_to_adapter_mapping=[
     37         "gsm8k",
     38     ]
     39 )

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/peft/lora/auto.py:378, in QEffAutoLoraModelForCausalLM.generate(self, tokenizer, prompts, prompt_to_adapter_mapping, device_id, runtime, **kwargs)
    373 if len(prompt_to_adapter_mapping) != len(prompts):
    374     raise RuntimeError(
    375         f"Number of prompts should match number of prompt_to_adapter_mapping, got len(prompts) = {len(prompts)}, len(prompt_to_adapter_mapping) = {len(prompt_to_adapter_mapping)}"
    376     )
--> 378 return QEfficient.cloud_ai_100_exec_kv(
    379     tokenizer,
    380     self.qpc_path,
    381     prompt=prompts,
    382     device_id=device_id,
    383     generation_len=generation_len,
    384     prompt_to_lora_id_mapping=[
    385         self.active_adapter_to_id[name] if name != "base" else 0 for name in prompt_to_adapter_mapping
    386     ],
    387 )

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:344, in cloud_ai_100_exec_kv(tokenizer, qpc_path, prompt, prompts_txt_file_path, device_id, generation_len, enable_debug_logs, stream, write_io_dir, automation, prompt_to_lora_id_mapping, is_tlm)
    337     exec_info = CloudAI100ExecInfo(
    338         batch_size=batch_size,
    339         generated_texts=generated_texts,
    340         generated_ids=generated_ids,
    341         perf_metrics=PerfMetrics(prefill_time, decode_perf, total_perf, total_time),
    342     )
    343 else:
--> 344     exec_info = generate_text.generate(
    345         prompt=prompt, generation_len=generation_len, prompt_to_lora_id_mapping=prompt_to_lora_id_mapping
    346     )
    348 print_latency_stats_kv(prompt, exec_info=exec_info, automation=automation)
    349 return exec_info

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:1050, in TextGeneration.generate(self, prompt, generation_len, stream, prompt_to_lora_id_mapping)
   1048 if self._full_batch_size is not None:
   1049     logger.warning("Streamer is currently unavailable for continuous batch execution.")
-> 1050     perf_metrics, generated_texts = self._continuous_batching_execution(
   1051         prompt, generation_len, prompt_to_lora_id_mapping
   1052     )
   1053 else:
   1054     if stream:

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:967, in TextGeneration._continuous_batching_execution(self, prompt, generation_len, prompt_to_lora_id_mapping)
    964 self._qaic_model.run_prefill_for_all_inputs(self._prompt_queue, generation_len)
    966 loop_start = perf_counter()  # Start decode loop timer
--> 967 decode_pause_time = self._qaic_model.run_continuous_batching_decode(self._prompt_queue, generation_len)
    968 end = perf_counter()
    970 generated_texts = self._tokenizer.batch_decode(self._qaic_model.generated_ids, skip_special_tokens=True)

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:721, in QEffTextGenerationBase.run_continuous_batching_decode(self, prompt_queue, generation_len)
    719 decode_pause_time = 0
    720 # Prepare decode inputs inputs.
--> 721 decode_inputs = self.prepare_decode_inputs()
    723 while prompt_queue or current_decode_ongoing.any():
    724     outputs = self._session.run(decode_inputs)

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:529, in QEffTextGenerationBase.prepare_decode_inputs(self)
    527 if self._prompt_to_lora_id_mapping_decode:
    528     if self.full_batch_size:
--> 529         first_batch_lora_ids = [self._prompt_to_lora_id_mapping_decode[i] for i in range(self.full_batch_size)]
    530         decode_inputs["lora_ids"] = np.array(first_batch_lora_ids, dtype=np.int64).reshape(
    531             self.full_batch_size, 1
    532         )
    533     else:

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:529, in <listcomp>(.0)
    527 if self._prompt_to_lora_id_mapping_decode:
    528     if self.full_batch_size:
--> 529         first_batch_lora_ids = [self._prompt_to_lora_id_mapping_decode[i] for i in range(self.full_batch_size)]
    530         decode_inputs["lora_ids"] = np.array(first_batch_lora_ids, dtype=np.int64).reshape(
    531             self.full_batch_size, 1
    532         )
    533     else:

IndexError: list index out of range
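
For reference, the failing frame reads a decode-time LoRA id for every slot in the full batch, even though only one prompt and one adapter mapping were supplied. A minimal plain-Python sketch of that indexing (no QEfficient or hardware required; the names mirror the traceback but the snippet is only illustrative):

full_batch_size = 2                      # compiled full batch size from the script above
prompt_to_lora_id_mapping_decode = [1]   # a single adapter id, for the single "gsm8k" prompt

try:
    # Mirrors prepare_decode_inputs(): one lora id is read for each slot in the full batch.
    first_batch_lora_ids = [prompt_to_lora_id_mapping_decode[i] for i in range(full_batch_size)]
except IndexError as err:
    print(err)  # "list index out of range": the mapping has 1 entry, but 2 slots are read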

Expected behavior
When fewer prompts and adapter mappings than the full batch size are passed, fixed LoRA inference should either behave like base-model execution (with the prompts duplicated up to the full batch size) or fail with a clear error.

Screenshots
N/A

Environment (please complete the following information):

  • Branch: release/v1.19

Additional context
N/A

quic-rishinr pushed a commit that referenced this issue Jan 31, 2025
…v1.19 (#250)

Same as
[PR#242](#242)

This is regarding the issue reported in
[issue#251](#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.
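
For illustration, a rough sketch of that adjustment strategy (a hypothetical helper, not the exact code merged in the PR): the per-prompt LoRA-id mapping is repeated until it covers every padded prompt slot, mirroring how `fix_prompts()` repeats prompts up to the full batch size.

def pad_prompt_to_lora_id_mapping(mapping, padded_len):
    # Hypothetical sketch: repeat the mapping so every padded prompt slot has a lora id.
    if not mapping:
        return mapping
    repeated = mapping * (padded_len // len(mapping) + 1)
    return repeated[:padded_len]

# Example matching this report: one mapping entry, full_batch_size = 2
print(pad_prompt_to_lora_id_mapping([1], 2))  # -> [1, 1]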

---------

Signed-off-by: Jou-An Chen <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Onkar Chougule <[email protected]>
quic-rishinr pushed a commit that referenced this issue Feb 19, 2025
This is regarding the issue reported in
[issue#251](#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.

Signed-off-by: Jou-An Chen <[email protected]>
@quic-rishinr
Contributor

Fix merged in mainline.

quic-hemagnih pushed a commit to quic-hemagnih/efficient-transformers that referenced this issue Mar 12, 2025
This is regarding the issue reported in
[issue#251](quic#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.

Signed-off-by: Jou-An Chen <[email protected]>
Signed-off-by: Hem Agnihotri <[email protected]>