
More granular selection of which hardware to use for what task #48

Open
@DoomOfMax97

Would it be possible to add a way to launch/use a model with the following configuration? (A rough sketch of what this could look like follows the list.)

  • Run Main Model inference on CPU (RAM)
  • Run Speculative Model on CPU (RAM)
  • Run initial context (prompt) processing on the GPU, then move the resulting context to CPU (RAM) as well (or leave only this part on the GPU if moving it wouldn’t work)
  • Do the same for the speculative model; however, its context size and the potential computation speedup are probably negligible.
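
Purely for illustration, since the issue does not name a configuration format and none of these field names exist in the product, a per-component device setting along these lines could look like the following Python sketch:

```python
# Hypothetical schema: every field name here is invented for illustration and does
# not correspond to any existing option in this project.
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    main_model_device: str = "cpu"          # run main-model inference in RAM
    draft_model_device: str = "cpu"         # run the speculative (draft) model in RAM
    prompt_processing_device: str = "gpu"   # do the initial context/prompt pass on the GPU
    keep_prompt_cache_on_gpu: bool = False  # False = move the resulting context back to RAM


config = DeviceConfig()
print(config)
```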

I am asking because, in my testing, the speed is fine when running everything on CPU (RAM), including speculative decoding, except for the initial context processing. That step requires relatively little VRAM but is orders of magnitude faster when computed on the GPU.

Also, as a somewhat unrelated side note: I am achieving much higher speeds with these speculative decoding settings (using llama 3.3 70B Q4_K_M with llama 3.2 1B Q8_0 as the draft model; similar results for other main model sizes). A rough reading of the numbers follows the list:

  • Probability: 0.9
  • Min: 0
  • Max: 5
  • Observed tokens matched: ≥ ~35%
  • Observed speedup: ~1.33x
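
As a back-of-the-envelope reading of those numbers (not part of the original report): assuming "tokens matched" means the fraction of drafted tokens accepted per verification step, and that the main model contributes one additional token per step, the figures relate roughly like this:

```python
# Back-of-the-envelope only; the meaning of "tokens matched" is an assumption and
# may not match how the application actually counts accepted tokens.
draft_max = 5              # "Max" from the settings above
accept_fraction = 0.35     # assumed meaning of "Observed tokens matched: >= ~35%"
observed_speedup = 1.33    # "Observed speedup: ~1.33x"

# Tokens produced per main-model verification pass: accepted draft tokens plus the
# one token the main model generates itself while verifying.
tokens_per_pass = 1.0 + draft_max * accept_fraction       # = 2.75

# If drafting and batched verification were free, this would also be the speedup.
ideal_speedup = tokens_per_pass

# The measured speedup implies each speculative step costs roughly this many
# plain (non-speculative) main-model steps once all overhead is included.
implied_cost_per_pass = tokens_per_pass / observed_speedup  # ~2.07

print(f"ideal speedup ~{ideal_speedup:.2f}x")
print(f"implied per-pass cost ~{implied_cost_per_pass:.2f}x of a plain step")
```

Under these assumptions, the gap between the ~2.75x ideal and the measured ~1.33x would be the combined cost of running the draft model and the batched verification pass, roughly 2x a plain main-model step.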

It would also be handy to display all speculative-decoding tokens, both those accepted and those wasted, along with their respective probabilities, in some kind of debug view to help fine-tune the process. Alternatively, an adaptive algorithm could determine the best settings, either at runtime or in an initial evaluation pass.
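
A minimal sketch of what such an adaptive approach could look like, assuming only that the runtime can report how many tokens were drafted and accepted per step (the class, window size, and thresholds below are invented for illustration):

```python
# Illustrative sketch of runtime tuning for the draft length; not tied to any existing API.
# The draft length is nudged up when recent drafts are mostly accepted and nudged down
# when most drafted tokens are being thrown away.
class AdaptiveDraftLength:
    def __init__(self, min_draft=0, max_draft=8, start=4, window=32):
        self.min_draft = min_draft
        self.max_draft = max_draft
        self.current = start
        self.window = window
        self.history = []          # per-step acceptance fractions

    def record(self, drafted: int, accepted: int) -> None:
        """Call once per speculative step with how many tokens were drafted/accepted."""
        if drafted > 0:
            self.history.append(accepted / drafted)
            self.history = self.history[-self.window:]

    def next_draft_length(self) -> int:
        if len(self.history) < self.window:
            return self.current    # not enough data yet, keep the current setting
        rate = sum(self.history) / len(self.history)
        if rate > 0.6 and self.current < self.max_draft:
            self.current += 1      # drafts are mostly accepted: speculate further ahead
        elif rate < 0.3 and self.current > self.min_draft:
            self.current -= 1      # most drafted tokens are wasted: speculate less
        return self.current
```

A windowed average keeps such a controller from overreacting to a single unlucky draft while still adapting within a few dozen steps.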
