Prepare `outlines` for `outlines-core` v0.2 #1386

yvan-sraka · 2025-01-22T10:38:03Z

This is a very early half-baked draft aimed at addressing #1380, Currently, I’m struggling to make it work, so far, my development environment is just replacing the Python interfaces in outlines-core with those from the fsm subfolder:

outlines-core on git main [xr!+?] is pkg v0.0.0 via py v3.12.8 via rs v1.84.0  
[OCI] > git status  
On branch main  
Your branch is up to date with 'outline-core/main'.  
Changes to be committed:  
  (use "git restore --staged <file>..." to unstage)  
        modified:   python/outlines_core/__init__.py  
        deleted:    python/outlines_core/fsm/__init__.py  
        renamed:    python/outlines_core/fsm/json_schema.py -> python/outlines_core/json_schema.py  
        renamed:    python/outlines_core/fsm/outlines_core_rs.pyi -> python/outlines_core/outlines_core_rs.pyi

I also edited my outlines Conda environment to use the local version of outlines-core:

diff --git a/environment.yml b/environment.yml  
index 6fc980f..6553e74 100644  
--- a/environment.yml  
+++ b/environment.yml  
@@ -20,4 +20,4 @@ dependencies:  
   - transformers  
   - pip  
   - pip:  
-    - -e ".[test]"  
+    - -e ".[test]" -e ../outlines_core  
diff --git a/pyproject.toml b/pyproject.toml  
index 28060f7..42da6b0 100644  
--- a/pyproject.toml  
+++ b/pyproject.toml  
@@ -40,7 +40,7 @@ dependencies = [  
    "pycountry",  
    "airportsdata",  
    "torch",  
-   "outlines_core==0.1.26",  
+   "outlines_core",  
 ]  
 dynamic = ["version"]

Overall, I try to update the outlines.fsm.guide module to match the logic in the example from the issue. Are we planning to keep supporting all the Guide variants? For example, IIUC, outlines-core no longer provides RegexGuide, right? So I integrated some code from https://github.com/dottxt-ai/outlines-core/blob/0.1.27/python/outlines_core/fsm/guide.py, as per @torymur’s advice.

yvan-sraka · 2025-01-22T10:46:03Z

It doesn't yet use the new Index and Vocabulary interfaces.

I'll have to make it look more like this?

import json
from outlines_core.json_schema import build_regex_from_schema
from outlines_core.fsm.guide import Guide, Index, Vocabulary

class GuideWrapper:
    """Wrapper for the Guide class to handle different guide types."""

    def __init__(self, schema: dict, tokenizer_model: str):
        self.regex = build_regex_from_schema(json.dumps(schema))
        self.vocabulary = Vocabulary.from_pretrained(tokenizer_model)
        self.index = Index(self.regex, self.vocabulary)
        self.guide = Guide(self.index)

    def get_current_state(self):
        """Get current state of the Guide."""
        return self.guide.get_state()

    def get_allowed_tokens(self):
        """Get allowed tokens for the current state of the Guide."""
        return self.guide.get_tokens()

    def advance(self, token_id: int):
        """Advance Guide to the next state via some token_id and return allowed tokens for that new state."""
        return self.guide.advance(token_id)

    def is_finished(self):
        """Check if Guide is finished."""
        return self.guide.is_finished()

    def get_eos_token_id(self):
        """Get the EOS token ID from the vocabulary."""
        return self.vocabulary.get_eos_token_id()

# Example usage
schema = {
    "title": "Foo",
    "type": "object",
    "properties": {"date": {"type": "string", "format": "date"}}
}
tokenizer_model = "openai-community/gpt2"
guide_wrapper = GuideWrapper(schema, tokenizer_model)

# Get current state of the Guide:
current_state = guide_wrapper.get_current_state()

# Get allowed tokens for the current state of the Guide:
allowed_tokens = guide_wrapper.get_allowed_tokens()

# Advance Guide to the next state via some token_id and return allowed tokens for that new state:
next_allowed_tokens = guide_wrapper.advance(allowed_tokens[-1])

# To check if Guide is finished:
guide_finished = guide_wrapper.is_finished()

# If it's finished then this assertion holds:
assert guide_wrapper.get_allowed_tokens() == [guide_wrapper.get_eos_token_id()]

outlines/fsm/guide.py

torymur · 2025-01-23T14:47:26Z

It doesn't yet use the new Index and Vocabulary interfaces. I'll have to make it look more like this?

Yes, you will need Index and Vocabulary interfaces. But GuideWrapper in your example doesn't look very helpful, it just unnecessarily repeats the existing logic. RegexGuide is already a wrapper of the Guide just for outlines context, it should adhere to defined protocols of RegexGuide for now and that's it.

In other words, I would suggest to start with that removed RegexGuide from outlines-core and method by method try to fulfill all the requirements outlines asks from it with new interfaces:
https://github.com/dottxt-ai/outlines-core/blob/0.1.27/python/outlines_core/fsm/guide.py

Then we can iterate over what could be missing in the outlines_core interfaces if we see some struggling in the RegexGuide's logic. Or not, if it's just a backward compatibility thing, for example, then it should be in outlines only. We might generally improve on guides interfaces in outlines, but I suspect, that our hands are already full for now.

For example, you can start with something very basic like this and then see where it leads you, method by method:

class RegexGuide(Guide):
    """Guide to generate text in the language of a regular expression."""

    def __init__(self, guide, eos_tensor):
        # EOS tensor is used in instructions, so better to be created just once
        self.eos_tensor = eos_tensor
        self._guide = guide

    @classmethod
    def from_regex(cls, regex, tokenizer):
        # Tokenizer has something like `name_or_path` to get to the model's name
        vocabulary = Vocabulary.from_pretrained(tokenizer.name_or_path())
        # Unclear yet if we need attr access to Index/Vocabulary, you will see it when it develops
        index = Index(regex, vocabulary)
        guide = Guide(index)

        eos_tensor = torch.tensor([vocabulary.eos_token_id()])
        return cls(guide, eos_tensor)

torymur · 2025-01-31T23:20:00Z

It was finally published to pypi, please try using 0.2.2 version of outlines-core.

yvan-sraka · 2025-02-03T09:10:32Z

==================================== ERRORS ====================================
_________________ ERROR collecting tests/fsm/test_cfg_guide.py _________________
ImportError while importing test module '/home/runner/work/outlines/outlines/tests/fsm/test_cfg_guide.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/fsm/test_cfg_guide.py:7: in <module>
    from outlines import grammars, models
outlines/__init__.py:3: in <module>
    import outlines.generate
outlines/generate/__init__.py:3: in <module>
    from .choice import choice
outlines/generate/choice.py:7: in <module>
    from outlines_core.fsm.json_schema import build_regex_from_schema
E   ModuleNotFoundError: No module named 'outlines_core.fsm'
=========================== short test summary info ============================
ERROR tests/fsm/test_cfg_guide.py
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 2.98s ===============================

rlouf · 2025-02-03T09:25:09Z

The Json Schema conversion is happening here now.

rlouf · 2025-02-03T11:38:34Z

You can also import Guide as from outlines_core import Guide

rlouf · 2025-02-04T09:18:11Z

outlines/fsm/guide.py

+class BetterFSM:
+    def __init__(self):
+        self.finals = None
+        self.flat_transition_map = None


Do we still need this?

No we shouldn't need this

Should I remove everything that's fsm-related?

yvan-sraka · 2025-02-05T10:00:06Z

outlines/fsm/parsing.py

@@ -623,7 +619,7 @@ def match(self, text, pos, last_fsm_state_seq: Optional[Tuple[int, ...]] = None)
            text_part,
        )

-        state_seq = walk_fsm(
+        state_seq = walk_fsm(  # type: ignore  # noqa: F821


I'm not sure if I should remove the whole PartialScanner, or workaround that?

We need to keep everything that is related to grammars for now

yvan-sraka · 2025-02-05T10:02:24Z

tests/fsm/test_parsing.py

@@ -209,7 +209,7 @@ def test_sequential_parse_example(cleanup_lark_import):
 # TODO: Remove once fsm_union and walk_fsm are implemented in Outlines-Core


It seems that walk_fsm isn't yet available in outlines-core, is that comment still accurate?

yvan-sraka requested a review from torymur January 22, 2025 10:38

yvan-sraka self-assigned this Jan 22, 2025

yvan-sraka linked an issue Jan 22, 2025 that may be closed by this pull request

Prepare outlines to use outlines-core v0.2 #1380

Open