
Allow Non-Incremental vllm.RegexLogitsProcessor (enable fast-forward) #900

Closed
wants to merge 2 commits

Conversation


@lapp0 lapp0 commented May 17, 2024

Fixes #855

Problem:

If multiple new tokens are introduced to vllm.RegexLogitsProcessor without the previous sequence (input_ids[:-1]) having been processed, the state silently defaults to 0 in

self._fsm_state[seq_id] = self.fsm.get_next_state(
    state=self._fsm_state[last_seq_id],   # DefaultDict, if last_seq_id doesn't exist it assumes initial state 0!
    token_id=last_token
)
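A minimal self-contained sketch of the pitfall (the defaultdict here stands in for self._fsm_state; the keys are hypothetical prefix hashes):

```python
from collections import defaultdict

# Stand-in for self._fsm_state: missing keys default to 0,
# which the FSM interprets as its initial state.
fsm_state = defaultdict(int)

fsm_state[hash((1,))] = 7  # state recorded after processing token 1

# Looking up an unseen prefix does NOT raise; defaultdict silently
# creates the entry with value 0, i.e. the initial state.
state = fsm_state[hash((1, 2, 3, 4))]
print(state)  # 0, not the state actually reached after tokens 1..4
```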

Example of problematic calls:

lp = RegexLogitsProcessor(...)

biased_scores = lp([], scores)               # initial call, state 0
biased_scores = lp([1], scores)              # one new token: handled correctly
biased_scores = lp([1, 2, 3, 4, 5], scores)  # four new tokens at once: the prefix
                                             # [1, 2, 3, 4] was never processed,
                                             # so its state silently defaults to 0

Solution

Calculate all missing FSM states sequentially, enabling fast-forward.
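A simplified sketch of the idea, with a toy FSM standing in for the outlines guide (function and helper names are hypothetical, not the PR's actual code):

```python
class ToyFSM:
    """Toy stand-in for the outlines FSM: the transition is just addition,
    so the state after tokens [1, 2, 3] is 6."""
    def get_next_state(self, state, token_id):
        return state + token_id

def fast_forward_state(fsm, fsm_state, input_ids):
    """Walk back to the longest prefix whose state is already cached,
    then replay the missing tokens one by one so every intermediate
    state is computed and recorded (a plain dict avoids the silent
    defaultdict fallback to 0)."""
    prefix_len = len(input_ids)
    while prefix_len > 0 and hash(tuple(input_ids[:prefix_len])) not in fsm_state:
        prefix_len -= 1
    state = fsm_state[hash(tuple(input_ids[:prefix_len]))] if prefix_len else 0
    # Replay the missing suffix sequentially, caching each state.
    for i in range(prefix_len, len(input_ids)):
        state = fsm.get_next_state(state=state, token_id=input_ids[i])
        fsm_state[hash(tuple(input_ids[:i + 1]))] = state
    return state
```

With this, the problematic call sequence above works: the third call replays tokens 2, 3, 4 before handling token 5, instead of assuming the prefix is in state 0.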

Would appreciate a review @br3no, I'd like to ensure this is compliant with expectations within vLLM :)


br3no commented May 18, 2024

@lapp0 I think there is a misunderstanding here.

In #855 the issue has to do with the new type Instruction = Union[Write, Generate] in https://github.com/outlines-dev/outlines/blob/main/outlines/fsm/guide.py and the usage of the get_next_instruction method of Guide in the LogitsProcessor. The thing is that both instructions Write and Generate declare a member variable called tokens, but the semantics of this member variable are not the same.

In Write, tokens means the sequence of next tokens that Outlines wants to "write" to the sequence. These are the ff-tokens I was talking about. tokens is to be interpreted as the sorted list of next tokens for the sequence being generated.

In Generate, tokens means the set of tokens that Outlines says are valid (i.e. transitions in the state machine) for the next generation step. Only one of these tokens will eventually be generated by the LLM.
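Paraphrasing the two instruction types as a sketch (simplified from outlines' guide.py; the field types are assumptions, only the names and semantics are from the discussion above):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Write:
    # Ordered sequence of ff-tokens: ALL of them will be appended
    # to the generated sequence, in this order.
    tokens: List[int]

@dataclass
class Generate:
    # Set of legal next tokens: exactly ONE of them will be
    # sampled by the LLM at the next step.
    tokens: List[int]

Instruction = Union[Write, Generate]
```

Same field name, two different contracts, which is exactly the source of the confusion.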

You can see how I will make sure the bug doesn't happen in vLLM, once the update of Outlines is merged here: https://github.com/br3no/vllm/blob/e6ffb1af2e904436473568f9d43131c998e55063/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L59

Note that this way of solving the issue makes using ff-tokens from outlines impossible... So it can only be considered an interim solution.


Regarding the integration with vLLM: there is ongoing work in vLLM to allow for ff-tokens and the design of how they will work is still unclear (at least to me). So I'm not sure the LogitsProcessor interface will change in the way you assume here, i.e. that the input_ids parameter to __call__ could contain n (n > 1) newly generated tokens.

@lapp0 lapp0 marked this pull request as draft May 18, 2024 23:30

lapp0 commented May 19, 2024

Thanks so much for your detailed response @br3no

Here are my thoughts:

  1. The current PR changes should remain, to ensure that if multiple tokens are added to input_ids the logits processor gets the correct mask for the resulting state.

  2. The fact that Write.tokens and Generate.tokens mean different things is confusing. IMO we should rename them to Write.tokens and Generate.legal_next_tokens.

  3. Application of Write instructions is out of scope for a logits processor. We probably need two separate components in vLLM: a guide logits processor, and a guide fast forward handler.

You have more context, so I'd love to hear any corrections / guidance / thoughts. But here's what I'm thinking: the logits processor should behave as it does now, creating a mask that allows only a single token when given a Write instruction. The fast-forward handler would share the Guide object with the logits processor; it would return an empty sequence for Generate instructions and Write.tokens for a Write instruction. vLLM's infrastructure for speculative decoding might be applicable to the fast-forward handler.
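A rough sketch of that split (all class and method names are hypothetical, and the instruction types are self-contained stand-ins for outlines' Write/Generate):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for outlines' instruction types.
@dataclass
class Write:
    tokens: List[int]

@dataclass
class Generate:
    tokens: List[int]

class FastForwardHandler:
    """Shares the Guide object with the logits processor. Returns
    Write.tokens when the guide's next step is deterministic, and an
    empty list on Generate so the LLM samples normally."""
    def __init__(self, guide):
        self.guide = guide
        self.state = 0  # assume the guide starts in its initial state

    def get_ff_tokens(self):
        instruction = self.guide.get_next_instruction(self.state)
        if isinstance(instruction, Write):
            return list(instruction.tokens)  # deterministic continuation
        return []  # Generate: nothing to fast-forward
```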

What are your thoughts on the direction we should take and how can I help?


br3no commented May 21, 2024

I'm not sure there ever will be multiple tokens added to input_ids, so I'd rather wait for vLLM to define the way things are supposed to work before implementing this.

  1. ...

Yes, I agree there is room for reducing confusion. Maybe you want to propose the change in a separate PR and discuss this with the Outlines maintainers?

Regarding the LogitsProcessor's API, this is something that is being worked on in vLLM. I think there is not much one can do right now but wait and see what comes out of that work.


lapp0 commented May 21, 2024

Thanks for the clarification @br3no!

@lapp0 lapp0 closed this May 21, 2024
Successfully merging this pull request may close these issues.

Logits Processors Guide integration will be buggy when len(tokens) > 1 in a Write instruction