
Conversation

@f14-bertolotti
Contributor

Overview

This PR migrates the custom SampleCache system to the more robust and well-maintained diskcache library.
It follows up on issue #1053.

What’s Changed

The legacy SampleCache class and its associated cached decorator have been fully removed and replaced with a new cached decorator built on top of diskcache.

The new decorator:

  • Uses model.config.cache_dir to instantiate a shared cache via diskcache.Cache(model.config.cache_dir).
  • Detects which documents are already cached, and calls the underlying sampling method only on the subset of uncached documents.
  • Stores all newly returned samples back into the cache after each sampling method finishes.
  • Supports both synchronous and asynchronous sampling methods, enabling seamless use with both async and sync vLLM.

Most of the logic lives in src/lighteval/utils/cache_management.py, where the decorator was rewritten from scratch. All remaining changes simply remove references to the previous caching implementation.

Currently, cache writes occur only after a full sampling method completes (which can take hours for certain tasks). In the future, we can decorate inner per-request methods to enable more granular incremental caching.
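
For illustration, here is a minimal sketch of this caching pattern. It assumes a model_call(self, docs, ...) method signature and a hypothetical build_cache_key helper; the actual decorator in src/lighteval/utils/cache_management.py differs in its details.

import asyncio
import functools
import json

from diskcache import Cache


def build_cache_key(doc, sampling_method, config):
    # Hypothetical helper: derive a stable string key from the document,
    # the sampling method, and the model configuration.
    return json.dumps([repr(doc), sampling_method, repr(config)], sort_keys=True)


def cached(sampling_method):
    def decorator(model_call):
        def split(self, docs):
            # Shared cache located at model.config.cache_dir, as described above.
            cache = Cache(self.config.cache_dir)
            keys = [build_cache_key(doc, sampling_method, self.config) for doc in docs]
            hits = {i: cache[k] for i, k in enumerate(keys) if k in cache}
            misses = [i for i in range(len(docs)) if i not in hits]
            return cache, keys, hits, misses

        def merge(cache, keys, hits, misses, new_responses, total):
            results = [None] * total
            for i, response in hits.items():
                results[i] = response
            # Writes happen only after the underlying sampling method has finished.
            for i, response in zip(misses, new_responses):
                results[i] = response
                cache[keys[i]] = response
            return results

        if asyncio.iscoroutinefunction(model_call):
            @functools.wraps(model_call)
            async def wrapper(self, docs, *args, **kwargs):
                cache, keys, hits, misses = split(self, docs)
                new = []
                if misses:
                    new = await model_call(self, [docs[i] for i in misses], *args, **kwargs)
                return merge(cache, keys, hits, misses, new, len(docs))
        else:
            @functools.wraps(model_call)
            def wrapper(self, docs, *args, **kwargs):
                cache, keys, hits, misses = split(self, docs)
                new = []
                if misses:
                    new = model_call(self, [docs[i] for i in misses], *args, **kwargs)
                return merge(cache, keys, hits, misses, new, len(docs))

        return wrapper

    return decorator

A sampling method would then be wrapped with something like @cached("generative"); the label is illustrative.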

Additional Improvements

  • Added a progress bar for async vLLM to improve user feedback during long-running evaluations.

Testing

The implementation has been tested with:

uv run lighteval vllm examples/model_configs/vllm_model_config.yaml aime25   # async + sync vLLM
uv run lighteval vllm examples/model_configs/transformers_model.yaml aime25
uv run pytest

Co-authored-by: Francesco Bertolotti <[email protected]>
@f14-bertolotti changed the title from "graceful shutdown of vllm async" to "diskcache for caching" on Nov 19, 2025
@NathanHB
Member

oh great !! will test it today and come back to you for review :)
cc @clefourrier

@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@clefourrier left a comment

Looking nice overall, a couple nits - could you also try testing it with the transformers and inference provider backends?

Comment on lines 623 to 646
with rich.progress.Progress(
    "[progress.description]{task.description}",
    rich.progress.BarColumn(),
    "[progress.completed]{task.completed}/{task.total}",
    "•",
    rich.progress.TimeElapsedColumn(),
    "•",
    rich.progress.TimeRemainingColumn(),
) as pbar:
    task_id = pbar.add_task("[green]Sending Requests...", total=len(docs))

    async def track(coro):
        """Wraps a coroutine to update progress bar when done."""
        result = await coro
        pbar.update(task_id, advance=1)
        return result

    wrapped = [
        track(self._async_one_item(index=index, doc=doc, generative=generative))
        for index, doc in enumerate(docs)
    ]

    result = await asyncio.gather(*wrapped)
    return result
Member

unrelated to current PR, would make sense to have it in its standalone PR

Contributor Author

I will make a separate PR for it

# Save updated dataset
dataset = Dataset.from_list(all_samples)
dataset.to_parquet(str(cache_file))
def default_json_encoder(obj):
Member

We already have a json encoder in the utils iirc

Contributor Author

I am having trouble finding it. Could you point it out for me?

Member

EnhancedJSONEncoder in src/lighteval/logging/evaluation_tracker.py

Member

@NathanHB wdyt of moving it to the utils? it's not logging specific

Contributor Author

That encoder should work perfectly. I only need to add .model_dump() for pydantic objects.

Member

perfect, ty

Contributor Author

A quick note about the encoder: I noticed that the method is defined as an instance method, which means it expects a self parameter. I can work around this by forcing self to None:

functools.partial(EnhancedJSONEncoder.default, None)

However, it might be cleaner to define it as a @staticmethod within EnhancedJSONEncoder instead.

What do you think?

Member

I don't think you need this - it's supposed to be used with the cls arg in json.dumps, so instead of doing default=your_func you do cls=the_class
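
For reference, a minimal example of the cls= pattern described above. The EnhancedJSONEncoder here is a stand-in rather than the one in src/lighteval/logging/evaluation_tracker.py, and the .model_dump() branch reflects the pydantic extension discussed earlier.

import dataclasses
import json


class EnhancedJSONEncoder(json.JSONEncoder):
    # Stand-in encoder: handles dataclasses, plus .model_dump() for pydantic objects.
    def default(self, o):
        if dataclasses.is_dataclass(o):
            return dataclasses.asdict(o)
        if hasattr(o, "model_dump"):  # pydantic v2 models (assumption)
            return o.model_dump()
        return super().default(o)


@dataclasses.dataclass
class Example:
    name: str
    value: int


# Pass the class via cls= instead of binding default= to an unbound method.
print(json.dumps(Example("aime25", 42), cls=EnhancedJSONEncoder))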

Member

Unless I'm misunderstanding your issue, so really feel free to extend a bit on what you need if relevant :)

Contributor Author

mb, I didn't even know you could pass a class to json.dumps.

Comment on lines 76 to 77
if isinstance(docs, Doc):
    docs = [docs]
Member

you could keep as_list
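
For reference, the kind of helper being referred to, as a sketch (the actual as_list utility in lighteval may differ):

def as_list(item):
    # Normalize a single element (or a tuple) into a list, leaving lists untouched.
    if isinstance(item, list):
        return item
    if isinstance(item, tuple):
        return list(item)
    return [item]


# e.g. docs = as_list(docs) would replace the isinstance(docs, Doc) special case above.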

doc, sampling_method=sampling_method, config=self.config, args=args, kwargs=kwargs
)
if key in cache:
    logger.info("Cache hit")
Member

You're going to get too many logs with this, especially on datasets like mmlu with 10K+ samples - maybe just store how many times the cache was hit and log it at the end?
Cache was hit for x documents out of y.
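
A sketch of the aggregated logging being suggested here, with illustrative function and variable names:

import logging

logger = logging.getLogger(__name__)


def fill_from_cache(cache, keys, docs):
    # Count hits while filling results, then emit a single summary log line.
    results = [None] * len(docs)
    hits = 0
    for idx, key in enumerate(keys):
        if key in cache:
            results[idx] = cache[key]["response"]
            hits += 1
    logger.info("Cache was hit for %d documents out of %d.", hits, len(docs))
    return results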

logger.info("Cache hit")
results[idx] = cache[key]["response"]
else:
logger.info("Cache miss")
Member

idem


return wrapper
def decorator(sampler: Callable): # noqa: C901
@functools.wraps(sampler)
Member

sampler is a bit misleading here - maybe model_call would be clearer?

model_outputs = await self.model.loglikelihood(docs)
outputs[sampling_method] = model_outputs

self.model.cleanup()
Member

nice catch

Member

@NathanHB left a comment

Just gave it a go locally, it's great! Thanks for the contrib.
Will merge once the nits above are addressed and the tests are fixed :)

@f14-bertolotti
Contributor Author

Ok, I think I have addressed all the issues. I am opening another PR for the progress bar.

@f14-bertolotti
Contributor Author

I have added a last-minute commit. I forgot to remove two lines related to this comment: #1068 (comment)

Comment on lines 105 to 107
if isinstance(docs, Doc):
    docs = [docs]

Member

you should use as_list, not remove it entirely ^^"

Contributor Author

mb, I misunderstood your comment. The new commit uses as_list.

@NathanHB
Member

hey @f14-bertolotti, do you need any help on this? 🤗

@f14-bertolotti
Contributor Author

I took a look at this over the weekend. I'm having some trouble running the tests, though: on my laptop they take a very long time, and the compute nodes I can access don't have internet connectivity, which complicates things further.

From what I can tell, the issue seems to be that a list has, for some reason, become a tuple. For example:

Current: [(330, 2)]
Reference: [[330, 2]]

I don’t see how this would be related to the caching. If you could also take a look, it would be really helpful.

@f14-bertolotti
Contributor Author

Hi @NathanHB,

It took me a while, but I ran the test that appears to be failing (tests/slow_tests/test_vllm_model.py::test_vllm_model) on both the main and diskcache branches.

In both cases, the test fails due to differences in the generated outputs. For example:

E                 Output_Tokens differs:
E                   Current: [[52, 2]]
E                   Reference: [[340, 2]]

My guess is that this may be caused by vLLM not being fully deterministic. I also noticed that a CUDA environment variable (CUBLAS_WORKSPACE_CONFIG) is set, so hardware differences could be contributing as well (I ran these tests on a Mac).

Regarding the original issue we saw with tests/slow_tests/test_vllm_model.py::test_vllm_model, I haven’t been able to reproduce it on my machine.

@f14-bertolotti
Contributor Author

Hi @NathanHB ,

I was wondering what your plans are for this PR. Are you waiting for a fix to tests/slow_tests/test_vllm_model.py::test_vllm_model?

Do you have an estimated timeline for the merge? I’m asking because I’d like to start working on additional contributions once this is merged.

@NathanHB
Member

NathanHB commented Dec 4, 2025

hey @f14-bertolotti! Sorry, I've been off, and we have many other exciting things running for evals :)
I want to fix the tests before merging the PR; it should be an easy issue to fix. I will try to take a look on Friday, or otherwise at the start of next week.
Sorry for the delay on this!

