
Fixed LoRA fails with "list index out of range" error when fewer prompts and adapters than the full batch size are passed #251

Closed
quic-jouachen opened this issue Jan 29, 2025 · 1 comment

Comments

@quic-jouachen
Contributor

Describe the bug
Finite (fixed) LoRA inference with continuous batching fails with "IndexError: list index out of range" when fewer prompts and prompt-to-adapter mappings than the full batch size are passed to generate().

To Reproduce
Steps to reproduce the behavior:

  1. Command / script used
  • Install the apps and platform SDKs on the setup.
  • Clone the qefficient repo: git clone --branch release/v1.19 https://github.com/quic/efficient-transformers
  • Create a virtual environment and install all dependencies with pip install -e .
  • Install the transformers and qefficient libraries, then run the script below.

Script used:

# Imports, assuming the top-level QEfficient exports; adjust to your install if they differ
from QEfficient import QEffAutoPeftModelForCausalLM
from QEfficient.utils import load_hf_tokenizer

base_model_name = "mistralai/Mistral-7B-v0.1"
seq_len = 128 
ctx_len = 256 
full_batch_size = 2 
device_group = [0] 
 
qeff_model = QEffAutoPeftModelForCausalLM.from_pretrained( 
    "predibase/gsm8k", "gsm8k", continuous_batching=True, finite_adapters=True 
) 
 
qeff_model.load_adapter("predibase/tldr_content_gen", "tldr_content_gen") 
qeff_model.load_adapter("predibase/dbpedia", "dbpedia") 
 
#export & compile model 
qpc_path = qeff_model.compile( 
    batch_size=1, 
    full_batch_size=full_batch_size, 
    prefill_seq_len=seq_len, 
    ctx_len=ctx_len, 
    num_devices=len(device_group), 
    num_cores=16, 
    mxfp6_matmul=True, 
    mxint8_kv_cache=True, 
) 
 
#run inference on the generate function 
prompts = [ 
    """Please answer the following question: James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\n\nAnswer:""", 
] 
 
qeff_model.generate( 
    tokenizer=load_hf_tokenizer(pretrained_model_name_or_path=base_model_name), 
    prompts=prompts, 
    device_id=device_group, 
    prompt_to_adapter_mapping=[ 
        "gsm8k", 
    ] 
) 
  2. Error details

Error log:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.55s/it] 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 69759.73it/s] 
WARNING - QEfficient - Number of prompts are less than batch size/full batch size, repeating to required batch size 
WARNING - QEfficient - Streamer is currently unavailable for continuous batch execution. 
--------------------------------------------------------------------------- 
IndexError                                Traceback (most recent call last) 
Cell In[14], line 32
     27 # run inference on the generate function
     28 prompts = [
     29     """Please answer the following question: James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\n\nAnswer:""",
     30 ]
---> 32 qeff_model.generate(
     33     tokenizer=load_hf_tokenizer(pretrained_model_name_or_path=base_model_name),
     34     prompts=prompts,
     35     device_id=device_group,
     36     prompt_to_adapter_mapping=[
     37         "gsm8k",
     38     ]
     39 )

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/peft/lora/auto.py:378, in QEffAutoLoraModelForCausalLM.generate(self, tokenizer, prompts, prompt_to_adapter_mapping, device_id, runtime, **kwargs)
    373 if len(prompt_to_adapter_mapping) != len(prompts):
    374     raise RuntimeError(
    375         f"Number of prompts should match number of prompt_to_adapter_mapping, got len(prompts) = {len(prompts)}, len(prompt_to_adapter_mapping) = {len(prompt_to_adapter_mapping)}"
    376     )
--> 378 return QEfficient.cloud_ai_100_exec_kv(
    379     tokenizer,
    380     self.qpc_path,
    381     prompt=prompts,
    382     device_id=device_id,
    383     generation_len=generation_len,
    384     prompt_to_lora_id_mapping=[
    385         self.active_adapter_to_id[name] if name != "base" else 0 for name in prompt_to_adapter_mapping
    386     ],
    387 )

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:344, in cloud_ai_100_exec_kv(tokenizer, qpc_path, prompt, prompts_txt_file_path, device_id, generation_len, enable_debug_logs, stream, write_io_dir, automation, prompt_to_lora_id_mapping, is_tlm)
    337     exec_info = CloudAI100ExecInfo(
    338         batch_size=batch_size,
    339         generated_texts=generated_texts,
    340         generated_ids=generated_ids,
    341         perf_metrics=PerfMetrics(prefill_time, decode_perf, total_perf, total_time),
    342     )
    343 else:
--> 344     exec_info = generate_text.generate(
    345         prompt=prompt, generation_len=generation_len, prompt_to_lora_id_mapping=prompt_to_lora_id_mapping
    346     )
    348 print_latency_stats_kv(prompt, exec_info=exec_info, automation=automation)
    349 return exec_info

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:1050, in TextGeneration.generate(self, prompt, generation_len, stream, prompt_to_lora_id_mapping)
   1048 if self._full_batch_size is not None:
   1049     logger.warning("Streamer is currently unavailable for continuous batch execution.")
-> 1050     perf_metrics, generated_texts = self._continuous_batching_execution(
   1051         prompt, generation_len, prompt_to_lora_id_mapping
   1052     )
   1053 else:
   1054     if stream:

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:967, in TextGeneration._continuous_batching_execution(self, prompt, generation_len, prompt_to_lora_id_mapping)
    964 self._qaic_model.run_prefill_for_all_inputs(self._prompt_queue, generation_len)
    966 loop_start = perf_counter()  # Start decode loop timer
--> 967 decode_pause_time = self._qaic_model.run_continuous_batching_decode(self._prompt_queue, generation_len)
    968 end = perf_counter()
    970 generated_texts = self._tokenizer.batch_decode(self._qaic_model.generated_ids, skip_special_tokens=True)

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:721, in QEffTextGenerationBase.run_continuous_batching_decode(self, prompt_queue, generation_len)
    719 decode_pause_time = 0
    720 # Prepare decode inputs inputs.
--> 721 decode_inputs = self.prepare_decode_inputs()
    723 while prompt_queue or current_decode_ongoing.any():
    724     outputs = self._session.run(decode_inputs)

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:529, in QEffTextGenerationBase.prepare_decode_inputs(self)
    527 if self._prompt_to_lora_id_mapping_decode:
    528     if self.full_batch_size:
--> 529         first_batch_lora_ids = [self._prompt_to_lora_id_mapping_decode[i] for i in range(self.full_batch_size)]
    530         decode_inputs["lora_ids"] = np.array(first_batch_lora_ids, dtype=np.int64).reshape(
    531             self.full_batch_size, 1
    532         )
    533     else:

File /home/qraniumtest/suprvisw/fixed_lora/efficient-transformers/QEfficient/generation/text_generation_inference.py:529, in <listcomp>(.0)
    527 if self._prompt_to_lora_id_mapping_decode:
    528     if self.full_batch_size:
--> 529         first_batch_lora_ids = [self._prompt_to_lora_id_mapping_decode[i] for i in range(self.full_batch_size)]
    530         decode_inputs["lora_ids"] = np.array(first_batch_lora_ids, dtype=np.int64).reshape(
    531             self.full_batch_size, 1
    532         )
    533     else:

IndexError: list index out of range
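
For reference, the failing frame reads a decode-time LoRA id for every slot in the full batch, even though only one prompt and one adapter mapping were supplied. A minimal plain-Python sketch of that indexing (no QEfficient or hardware required; the names mirror the traceback but the snippet is only illustrative):

full_batch_size = 2                      # compiled full batch size from the script above
prompt_to_lora_id_mapping_decode = [1]   # a single adapter id, for the single "gsm8k" prompt

try:
    # Mirrors prepare_decode_inputs(): one lora id is read for each slot in the full batch.
    first_batch_lora_ids = [prompt_to_lora_id_mapping_decode[i] for i in range(full_batch_size)]
except IndexError as err:
    print(err)  # "list index out of range": the mapping has 1 entry, but 2 slots are read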

Expected behavior
When fewer prompts and adapter mappings than the full batch size are passed, fixed LoRA inference should either behave like base-model execution (with the prompts duplicated up to the full batch size) or fail with a clear error.

Screenshots
N/A

Environment (please complete the following information):

  • Branch: release/v1.19

Additional context
N/A

quic-rishinr pushed a commit that referenced this issue Jan 31, 2025
…v1.19 (#250)

Same as
[PR#242](#242)

This is regarding the issue reported in
[issue#251](#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.
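
For illustration, a rough sketch of that adjustment strategy (a hypothetical helper, not the exact code merged in the PR): the per-prompt LoRA-id mapping is repeated until it covers every padded prompt slot, mirroring how `fix_prompts()` repeats prompts up to the full batch size.

def pad_prompt_to_lora_id_mapping(mapping, padded_len):
    # Hypothetical sketch: repeat the mapping so every padded prompt slot has a lora id.
    if not mapping:
        return mapping
    repeated = mapping * (padded_len // len(mapping) + 1)
    return repeated[:padded_len]

# Example matching this report: one mapping entry, full_batch_size = 2
print(pad_prompt_to_lora_id_mapping([1], 2))  # -> [1, 1]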

---------

Signed-off-by: Jou-An Chen <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Onkar Chougule <[email protected]>
quic-rishinr pushed a commit that referenced this issue Feb 19, 2025
This is regarding the issue reported in
[issue#251](#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.

Signed-off-by: Jou-An Chen <[email protected]>
@quic-rishinr
Contributor

Fix merged in mainline.

quic-hemagnih pushed a commit to quic-hemagnih/efficient-transformers that referenced this issue Mar 12, 2025
This is regarding the issue reported in
[issue#251](quic#251)

The finite lorax feature failed to execute when the number of prompts
provided is less than the full batch size. The solution involves
applying the same adjustment strategy for `prompt_to_lora_id_mapping` as
used for `prompt` in the `fix_prompts()` function located in
`QEfficient/generation/text_generation_inference.py`.

Signed-off-by: Jou-An Chen <[email protected]>
Signed-off-by: Hem Agnihotri <[email protected]>