Unable to deploy mistralai/Mistral-Nemo-Instruct-2407 #88

Open
TheMindExpansionNetwork opened this issue Jul 30, 2024 · 5 comments

@TheMindExpansionNetwork

Hello all, I keep scratching my head over why I can deploy everything else on the list, but some models I find run into issues.

Anyway, here are my logs from trying to use this repo: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

I think I got the same error with Llama 3.1 8B too; not sure about the quantized ones.

I know it's noob stuff, but thanks for the help. Here are the logs:

2024-07-30T07:21:51.526398877Z /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
2024-07-30T07:21:51.526440887Z warnings.warn(
2024-07-30T07:21:51.915487979Z INFO 07-30 07:21:51 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-2407', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024000, download_dir='/runpod-volume/huggingface-cache/hub', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-Nemo-Instruct-2407)
2024-07-30T07:21:52.759581288Z INFO 07-30 07:21:52 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-30T07:21:53.481568872Z INFO 07-30 07:21:53 selector.py:27] Using FlashAttention-2 backend.
2024-07-30T07:21:53.845809406Z engine.py :110 2024-07-30 07:21:53,845 Error initializing vLLM engine: Head size 160 is not supported by PagedAttention. Supported head sizes are: [64, 80, 96, 112, 128, 256].
2024-07-30T07:21:53.854397126Z [rank0]: Traceback (most recent call last):
2024-07-30T07:21:53.854422216Z [rank0]: File "/src/handler.py", line 6, in
2024-07-30T07:21:53.854425386Z [rank0]: vllm_engine = vLLMEngine()
2024-07-30T07:21:53.854427736Z [rank0]: File "/src/engine.py", line 25, in init
2024-07-30T07:21:53.854429746Z [rank0]: self.llm = self._initialize_llm() if engine is None else engine
2024-07-30T07:21:53.854432376Z [rank0]: File "/src/engine.py", line 111, in _initialize_llm
2024-07-30T07:21:53.854434466Z [rank0]: raise e
2024-07-30T07:21:53.854437186Z [rank0]: File "/src/engine.py", line 105, in _initialize_llm
2024-07-30T07:21:53.854439236Z [rank0]: engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-07-30T07:21:53.854441546Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
2024-07-30T07:21:53.854444006Z [rank0]: engine = cls(
2024-07-30T07:21:53.854446326Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 324, in init
2024-07-30T07:21:53.854448836Z [rank0]: self.engine = self._init_engine(*args, **kwargs)
2024-07-30T07:21:53.854450826Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
2024-07-30T07:21:53.854452766Z [rank0]: return engine_class(*args, **kwargs)
2024-07-30T07:21:53.854454676Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in init
2024-07-30T07:21:53.854456636Z [rank0]: self.model_executor = executor_class(
2024-07-30T07:21:53.854458526Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in init
2024-07-30T07:21:53.854460446Z [rank0]: self._init_executor()
2024-07-30T07:21:53.854462396Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
2024-07-30T07:21:53.854464276Z [rank0]: self._init_non_spec_worker()
2024-07-30T07:21:53.854466236Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
2024-07-30T07:21:53.854468156Z [rank0]: self.driver_worker.load_model()
2024-07-30T07:21:53.854470026Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 118, in load_model
2024-07-30T07:21:53.854471956Z [rank0]: self.model_runner.load_model()
2024-07-30T07:21:53.854473856Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 164, in load_model
2024-07-30T07:21:53.854490726Z [rank0]: self.model = get_model(
2024-07-30T07:21:53.854492966Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
2024-07-30T07:21:53.854495946Z [rank0]: return loader.load_model(model_config=model_config,
2024-07-30T07:21:53.854498016Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
2024-07-30T07:21:53.854500036Z [rank0]: model = _initialize_model(model_config, self.load_config,
2024-07-30T07:21:53.854502056Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 88, in _initialize_model
2024-07-30T07:21:53.854504536Z [rank0]: return model_class(config=model_config.hf_config,
2024-07-30T07:21:53.854506526Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 338, in init
2024-07-30T07:21:53.854508526Z [rank0]: self.model = LlamaModel(config, quant_config, lora_config=lora_config)
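For context on where the "Head size 160" in that error comes from: older vLLM builds derive the attention head size from the model config, and Mistral-Nemo's hidden_size divided by num_attention_heads works out to 160 (the config also ships a separate head_dim field, which newer vLLM versions respect, if I understand the fix correctly). A minimal check, just a sketch assuming the standard HF config field names and that you can pull the config from the Hub:

# Rough diagnostic, run anywhere with transformers installed.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

# Older vLLM releases compute the head size this way, which is the value
# PagedAttention then has to support:
print("derived head size:", cfg.hidden_size // cfg.num_attention_heads)  # 160 per the error above

# The config may also declare an explicit head_dim that newer vLLM honors instead.
print("explicit head_dim:", getattr(cfg, "head_dim", None))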

@TheMindExpansionNetwork

2024-07-30T07:27:23.496500082Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 768, in from_dict
2024-07-30T07:27:23.496505705Z config = cls(**config_dict)
2024-07-30T07:27:23.496516212Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 161, in init
2024-07-30T07:27:23.496750253Z self._rope_scaling_validation()
2024-07-30T07:27:23.496770229Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 182, in _rope_scaling_validation
2024-07-30T07:27:23.496776450Z raise ValueError(
2024-07-30T07:27:23.496792960Z ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
2024-07-30T07:27:38.388007495Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-30T07:27:38.390861478Z /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
2024-07-30T07:27:38.390897675Z warnings.warn(
2024-07-30T07:27:38.435671350Z engine.py :110 2024-07-30 07:27:38,434 Error initializing vLLM engine: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
2024-07-30T07:27:38.435749229Z Traceback (most recent call last):
2024-07-30T07:27:38.435755539Z File "/src/handler.py", line 6, in
2024-07-30T07:27:38.435760036Z vllm_engine = vLLMEngine()
2024-07-30T07:27:38.435765005Z File "/src/engine.py", line 25, in init
2024-07-30T07:27:38.435770196Z self.llm = self._initialize_llm() if engine is None else engine
2024-07-30T07:27:38.435786726Z File "/src/engine.py", line 111, in _initialize_llm
2024-07-30T07:27:38.435791799Z raise e
2024-07-30T07:27:38.435796432Z File "/src/engine.py", line 105, in _initialize_llm
2024-07-30T07:27:38.435800735Z engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-07-30T07:27:38.435805046Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 346, in from_engine_args
2024-07-30T07:27:38.435864074Z engine_config = engine_args.create_engine_config()
2024-07-30T07:27:38.435870006Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
2024-07-30T07:27:38.435875168Z model_config = ModelConfig(
2024-07-30T07:27:38.435879946Z File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 119, in init
2024-07-30T07:27:38.435884668Z self.hf_config = get_config(self.model, trust_remote_code, revision,
2024-07-30T07:27:38.435889614Z File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 38, in get_config
2024-07-30T07:27:38.435893901Z raise e
2024-07-30T07:27:38.435898595Z File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 23, in get_config
2024-07-30T07:27:38.435902844Z config = AutoConfig.from_pretrained(
2024-07-30T07:27:38.435907398Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 958, in from_pretrained
2024-07-30T07:27:38.436363428Z return config_class.from_dict(config_dict, **unused_kwargs)
2024-07-30T07:27:38.436383585Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 768, in from_dict
2024-07-30T07:27:38.436408462Z config = cls(**config_dict)
2024-07-30T07:27:38.436413777Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 161, in init
2024-07-30T07:27:38.436418385Z self._rope_scaling_validation()
2024-07-30T07:27:38.436423064Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/configuration_llama.py", line 182, in _rope_scaling_validation
2024-07-30T07:27:38.436428017Z raise ValueError(
2024-07-30T07:27:38.436433264Z ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'

Guess it's a different error when trying meta-llama/Meta-Llama-3.1-8B-Instruct.
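If it helps to pin down the cause locally, that ValueError comes straight from transformers when an older release parses the new Llama 3.1 rope_scaling format. A rough check, just a sketch: the meta-llama repo is gated so an HF token is needed, and the exact minimum transformers version is an assumption (roughly 4.43+).

import transformers
from transformers import AutoConfig

print("transformers version:", transformers.__version__)

# With an older transformers this raises the same rope_scaling ValueError as in
# the logs above; with a recent enough release it parses the llama3-style dict.
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print("rope_scaling:", cfg.rope_scaling)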

@TheMindExpansionNetwork

2024-07-30T20:39:41.413020519Z ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

I am not sure what the issue is; it seems to happen every time I try something Llama-based.

@originade

This is actually an error with the vLLM base image. The issue is fixed in the newest version of vLLM (v0.5.3), but this project has not been updated yet.
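If you want to confirm what the running image actually ships, a quick check (just a sketch; the v0.5.3 threshold is taken from the comment above):

# Run inside the worker container/image.
import vllm
import transformers

print("vllm:", vllm.__version__)                   # fix reportedly landed in v0.5.3
print("transformers:", transformers.__version__)   # Llama 3.1 configs need a recent release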

@TheMindExpansionNetwork

Awesome, thanks for letting me know. I'll just keep an eye out for updates.

This is a super awesome project, thank you all.

@hypnocapybara

I tried it with this Dockerfile and got it deployed:

# Start from the stable RunPod vLLM worker image.
FROM runpod/worker-vllm:stable-cuda12.1.0

# Reinstall the worker's requirements, then pull in newer vllm/transformers
# than the base image ships with.
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install --upgrade pip && \
    python3 -m pip install --upgrade -r /requirements.txt

RUN python3 -m pip install --upgrade vllm transformers

CMD ["python3", "/src/handler.py"]

But the worker code also needs to be updated for the new version of vLLM, since OpenAIServingChat accepts the config param.
