How to set dtype=half? #32

Open
Hermi-Mire opened this issue Feb 16, 2025 · 1 comment

Comments

@Hermi-Mire
Hi, I tried to reproduce this on a 2080 Ti, but got this error message:

  File "/workspace/dialogue/deepscaler/verl/verl/trainer/main_ppo.py", line 114, in main
    ray.get(main_task.remote(config))
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::main_task() (pid=6802, ip=172.17.0.4)
  File "/workspace/dialogue/deepscaler/verl/verl/trainer/main_ppo.py", line 199, in main_task
    trainer.init_workers()
  File "/workspace/dialogue/deepscaler/verl/verl/trainer/ppo/ray_trainer.py", line 530, in init_workers
    self.actor_rollout_wg.init_model()
  File "/workspace/dialogue/deepscaler/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_init_model() (pid=7224, ip=172.17.0.4, actor_id=a1b97f929b7bcffade92906d01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f60c7f6b040>)
  File "/workspace/dialogue/deepscaler/verl/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/dialogue/deepscaler/verl/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/workspace/dialogue/deepscaler/verl/verl/workers/fsdp_workers.py", line 358, in init_model
    self.rollout, self.rollout_sharding_manager = self._build_rollout()
  File "/workspace/dialogue/deepscaler/verl/verl/workers/fsdp_workers.py", line 293, in _build_rollout
    rollout = vLLMRollout(actor_module=self.actor_module_fsdp,
  File "/workspace/dialogue/deepscaler/verl/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 98, in __init__
    self.inference_engine = LLM(actor_module,
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/llm.py", line 147, in __init__
    self.llm_engine = LLMEngine.from_engine_args(model, tokenizer, engine_args)  # TODO: check usagecontext
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py", line 393, in from_engine_args
    engine = cls(
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/llm_engine_sp.py", line 212, in __init__
    self.model_executor = executor_class(
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 71, in __init__
    self._init_executor(model, distributed_init_method)
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 78, in _init_executor
    self._init_workers_sp(model, distributed_init_method)
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 111, in _init_workers_sp
    self.worker.init_device()
  File "/workspace/dialogue/deepscaler/verl/verl/third_party/vllm/vllm_v_0_6_3/worker.py", line 163, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/root/miniconda3/envs/py310/lib/python3.10/site-packages/vllm/worker/worker.py", line 473, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
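The check that raises this error (in vLLM's `_check_if_gpu_supports_dtype`) boils down to comparing the GPU's CUDA compute capability against 8.0. A minimal sketch of that logic, using a hypothetical helper name (in a real program you would obtain the capability pair from `torch.cuda.get_device_capability()`):

```python
def gpu_supports_bf16(major: int, minor: int) -> bool:
    """Return True if a GPU with compute capability (major, minor) can run bfloat16.

    Bfloat16 requires compute capability >= 8.0 (Ampere or newer).
    """
    return (major, minor) >= (8, 0)

# A 2080 Ti reports compute capability 7.5, so bfloat16 is unavailable
# and the engine must fall back to float16 ("half"):
print(gpu_supports_bf16(7, 5))  # False
print(gpu_supports_bf16(8, 0))  # True
```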
@michaelzhiluo
Contributor

Set this: https://github.com/agentica-project/deepscaler/blob/main/verl/verl/trainer/config/ppo_trainer.yaml#L69 to float16 or half.

Bfloat16 is only supported on GPUs with compute capability of at least 8.0 (A100/H100/... and other Ampere-or-newer GPUs).
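For reference, the change in the linked config would look roughly like this (a sketch only; the exact nesting and key names should be checked against the ppo_trainer.yaml file linked above):

```yaml
actor_rollout_ref:
  rollout:
    # 2080 Ti (compute capability 7.5) cannot run bfloat16; use half precision.
    dtype: float16  # was: bfloat16
```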
