
Allow Non-Incremental vllm.RegexLogitsProcessor (enable fast-forward) #900

Closed
wants to merge 2 commits

Conversation


@lapp0 lapp0 commented May 17, 2024

Fixes #855

Problem:

If multiple new tokens are introduced to vllm.RegexLogitsProcessor without the previous sequence (input_ids[:-1]) having been processed, the state silently defaults to 0 in

self._fsm_state[seq_id] = self.fsm.get_next_state(
    state=self._fsm_state[last_seq_id],   # DefaultDict, if last_seq_id doesn't exist it assumes initial state 0!
    token_id=last_token
)
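A minimal self-contained sketch of the pitfall (the defaultdict here stands in for self._fsm_state; the keys are hypothetical prefix hashes):

```python
from collections import defaultdict

# Stand-in for self._fsm_state: missing keys default to 0,
# which the FSM interprets as its initial state.
fsm_state = defaultdict(int)

fsm_state[hash((1,))] = 7  # state recorded after processing token 1

# Looking up an unseen prefix does NOT raise; defaultdict silently
# creates the entry with value 0, i.e. the initial state.
state = fsm_state[hash((1, 2, 3, 4))]
print(state)  # 0, not the state actually reached after tokens 1..4
```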

Example of problematic calls:

lp = RegexLogitsProcessor(...)

biased_scores = lp([], scores)               # initial call, state 0
biased_scores = lp([1], scores)              # one new token: handled correctly
biased_scores = lp([1, 2, 3, 4, 5], scores)  # four new tokens at once: the prefix
                                             # [1, 2, 3, 4] was never processed,
                                             # so its state silently defaults to 0

Solution

Calculate all missing FSM states sequentially, enabling fast-forward.
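A simplified sketch of the idea, with a toy FSM standing in for the outlines guide (function and helper names are hypothetical, not the PR's actual code):

```python
class ToyFSM:
    """Toy stand-in for the outlines FSM: the transition is just addition,
    so the state after tokens [1, 2, 3] is 6."""
    def get_next_state(self, state, token_id):
        return state + token_id

def fast_forward_state(fsm, fsm_state, input_ids):
    """Walk back to the longest prefix whose state is already cached,
    then replay the missing tokens one by one so every intermediate
    state is computed and recorded (a plain dict avoids the silent
    defaultdict fallback to 0)."""
    prefix_len = len(input_ids)
    while prefix_len > 0 and hash(tuple(input_ids[:prefix_len])) not in fsm_state:
        prefix_len -= 1
    state = fsm_state[hash(tuple(input_ids[:prefix_len]))] if prefix_len else 0
    # Replay the missing suffix sequentially, caching each state.
    for i in range(prefix_len, len(input_ids)):
        state = fsm.get_next_state(state=state, token_id=input_ids[i])
        fsm_state[hash(tuple(input_ids[:i + 1]))] = state
    return state
```

With this, the problematic call sequence above works: the third call replays tokens 2, 3, 4 before handling token 5, instead of assuming the prefix is in state 0.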

Would appreciate a review @br3no, I'd like to ensure this is compliant with expectations within vLLM :)


br3no commented May 18, 2024

@lapp0 I think there is a misunderstanding here.

In #855 the issue has to do with the new type Instruction = Union[Write, Generate] in https://github.com/outlines-dev/outlines/blob/main/outlines/fsm/guide.py and the usage of the get_next_instruction method of Guide in the LogitsProcessor. The thing is that both instructions Write and Generate declare a member variable called tokens, but the semantics of this member variable are not the same.

In Write, tokens means the sequence of next tokens that Outlines wants to "write" to the sequence. These are the ff-tokens I was talking about. tokens is to be interpreted as the sorted list of next tokens for the sequence being generated.

In Generate, tokens means the set of tokens that Outlines says are valid (i.e. transitions in the state machine) for the next generation step. Only one of these tokens will eventually be generated by the LLM.
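Paraphrasing the two instruction types as a sketch (simplified from outlines' guide.py; the field types are assumptions, only the names and semantics are from the discussion above):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Write:
    # Ordered sequence of ff-tokens: ALL of them will be appended
    # to the generated sequence, in this order.
    tokens: List[int]

@dataclass
class Generate:
    # Set of legal next tokens: exactly ONE of them will be
    # sampled by the LLM at the next step.
    tokens: List[int]

Instruction = Union[Write, Generate]
```

Same field name, two different contracts, which is exactly the source of the confusion.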

You can see how I will make sure the bug doesn't happen in vLLM, once the update of Outlines is merged here: https://github.com/br3no/vllm/blob/e6ffb1af2e904436473568f9d43131c998e55063/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L59

Note that this way of solving the issue makes using ff-tokens from outlines impossible... So it can only be considered an interim solution.


Regarding the integration with vLLM: there is ongoing work in vLLM to allow for ff-tokens and the design of how they will work is still unclear (at least to me). So I'm not sure the LogitsProcessor interface will change in the way you assume here, i.e. that the input_ids parameter to __call__ could contain n (n > 1) newly generated tokens.

@lapp0 lapp0 marked this pull request as draft May 18, 2024 23:30

lapp0 commented May 19, 2024

Thanks so much for your detailed response @br3no

Here are my thoughts:

  1. The current PR changes should remain, to ensure that if multiple tokens are added to input_ids the logits processor gets the correct mask for the resulting state.

  2. The fact that Write.tokens and Generate.tokens mean different things is confusing. IMO we should rename them to Write.tokens and Generate.legal_next_tokens.

  3. Application of Write instructions is out of scope for a logits processor. We probably need two separate components in vLLM: a guide logits processor, and a guide fast forward handler.

You have more context, so I'd love to hear any corrections / guidance / thoughts. But here's what I'm thinking: the logits processor should behave as it does now, creating a mask that allows only a single token when given a Write instruction. The fast-forward handler would share the Guide object with the logits processor; it would return an empty sequence for Generate instructions and Write.tokens for a Write instruction. vLLM's infrastructure for speculative decoding might be applicable to the fast-forward handler.
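A rough sketch of that split (all class and method names are hypothetical, and the instruction types are self-contained stand-ins for outlines' Write/Generate):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for outlines' instruction types.
@dataclass
class Write:
    tokens: List[int]

@dataclass
class Generate:
    tokens: List[int]

class FastForwardHandler:
    """Shares the Guide object with the logits processor. Returns
    Write.tokens when the guide's next step is deterministic, and an
    empty list on Generate so the LLM samples normally."""
    def __init__(self, guide):
        self.guide = guide
        self.state = 0  # assume the guide starts in its initial state

    def get_ff_tokens(self):
        instruction = self.guide.get_next_instruction(self.state)
        if isinstance(instruction, Write):
            return list(instruction.tokens)  # deterministic continuation
        return []  # Generate: nothing to fast-forward
```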

What are your thoughts on the direction we should take and how can I help?


br3no commented May 21, 2024

I'm not sure there ever will be multiple tokens added to input_ids, so I'd rather wait for vLLM to define the way things are supposed to work before implementing this.

  1. ...

Yes, I agree there is room for reducing confusion. Maybe you want to propose the change in a separate PR and discuss this with the Outlines maintainers?

Regarding the LogitsProcessor's API, this is something that is being worked on in vLLM. I think there is not much one can do right now but wait and see what comes out of that work.


lapp0 commented May 21, 2024

Thanks for the clarification @br3no!

@lapp0 lapp0 closed this May 21, 2024
Successfully merging this pull request may close these issues.

Logits Processors Guide integration will be buggy when len(tokens) > 1 in a Write instruction