
More granular selection of which hardware to use for what task #48

Open
@DoomOfMax97

Would it be possible to add a way to launch/use a model with the following configuration? (A rough sketch of what this could look like follows the list.)

  • Run Main Model inference on CPU (RAM)
  • Run Speculative Model on CPU (RAM)
  • Run initial context (prompt) processing on the GPU, then move the resulting context to CPU (RAM) as well (or leave only this part on the GPU if moving it wouldn’t work)
  • Do the same for the speculative model; however, its context size and the potential computation speedup are probably negligible.
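
Purely for illustration, since the issue does not name a configuration format and none of these field names exist in the product, a per-component device setting along these lines could look like the following Python sketch:

```python
# Hypothetical schema: every field name here is invented for illustration and does
# not correspond to any existing option in this project.
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    main_model_device: str = "cpu"          # run main-model inference in RAM
    draft_model_device: str = "cpu"         # run the speculative (draft) model in RAM
    prompt_processing_device: str = "gpu"   # do the initial context/prompt pass on the GPU
    keep_prompt_cache_on_gpu: bool = False  # False = move the resulting context back to RAM


config = DeviceConfig()
print(config)
```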

I am asking because, in my testing, the speed is fine when running everything on CPU (RAM), including speculative decoding, except for the initial context processing. That step requires relatively little VRAM but is orders of magnitude faster when computed on the GPU.

Also, as a somewhat unrelated side note: I am achieving much higher speeds with these speculative decoding settings (using llama 3.3 70B Q4_K_M with llama 3.2 1B Q8_0 as the draft model; similar results for other main model sizes). A rough reading of the numbers follows the list:

  • Probability: 0.9
  • Min: 0
  • Max: 5
  • Observed tokens matched: ≥ ~35%
  • Observed speedup: ~1.33x
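
As a back-of-the-envelope reading of those numbers (not part of the original report): assuming "tokens matched" means the fraction of drafted tokens accepted per verification step, and that the main model contributes one additional token per step, the figures relate roughly like this:

```python
# Back-of-the-envelope only; the meaning of "tokens matched" is an assumption and
# may not match how the application actually counts accepted tokens.
draft_max = 5              # "Max" from the settings above
accept_fraction = 0.35     # assumed meaning of "Observed tokens matched: >= ~35%"
observed_speedup = 1.33    # "Observed speedup: ~1.33x"

# Tokens produced per main-model verification pass: accepted draft tokens plus the
# one token the main model generates itself while verifying.
tokens_per_pass = 1.0 + draft_max * accept_fraction       # = 2.75

# If drafting and batched verification were free, this would also be the speedup.
ideal_speedup = tokens_per_pass

# The measured speedup implies each speculative step costs roughly this many
# plain (non-speculative) main-model steps once all overhead is included.
implied_cost_per_pass = tokens_per_pass / observed_speedup  # ~2.07

print(f"ideal speedup ~{ideal_speedup:.2f}x")
print(f"implied per-pass cost ~{implied_cost_per_pass:.2f}x of a plain step")
```

Under these assumptions, the gap between the ~2.75x ideal and the measured ~1.33x would be the combined cost of running the draft model and the batched verification pass, roughly 2x a plain main-model step.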

It would also be handy to display all speculative-decoding tokens, both those accepted and those wasted, along with their respective probabilities, in some kind of debug view to help fine-tune the process. Alternatively, an adaptive algorithm could determine the best settings, either at runtime or in an initial evaluation pass.
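
A minimal sketch of what such an adaptive approach could look like, assuming only that the runtime can report how many tokens were drafted and accepted per step (the class, window size, and thresholds below are invented for illustration):

```python
# Illustrative sketch of runtime tuning for the draft length; not tied to any existing API.
# The draft length is nudged up when recent drafts are mostly accepted and nudged down
# when most drafted tokens are being thrown away.
class AdaptiveDraftLength:
    def __init__(self, min_draft=0, max_draft=8, start=4, window=32):
        self.min_draft = min_draft
        self.max_draft = max_draft
        self.current = start
        self.window = window
        self.history = []          # per-step acceptance fractions

    def record(self, drafted: int, accepted: int) -> None:
        """Call once per speculative step with how many tokens were drafted/accepted."""
        if drafted > 0:
            self.history.append(accepted / drafted)
            self.history = self.history[-self.window:]

    def next_draft_length(self) -> int:
        if len(self.history) < self.window:
            return self.current    # not enough data yet, keep the current setting
        rate = sum(self.history) / len(self.history)
        if rate > 0.6 and self.current < self.max_draft:
            self.current += 1      # drafts are mostly accepted: speculate further ahead
        elif rate < 0.3 and self.current > self.min_draft:
            self.current -= 1      # most drafted tokens are wasted: speculate less
        return self.current
```

A windowed average keeps such a controller from overreacting to a single unlucky draft while still adapting within a few dozen steps.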
