
Conversation

@f14-bertolotti
Contributor

Overview

This PR migrates the custom SampleCache system to the more robust and well-maintained diskcache library.
It follows up on issue #1053.

What’s Changed

The legacy SampleCache class and its associated cached decorator have been fully removed and replaced with a new cached decorator built on top of diskcache.

The new decorator:

  • Uses model.config.cache_dir to instantiate a shared cache via diskcache.Cache(model.config.cache_dir).
  • Detects which documents are already cached, and calls the underlying sampling method only on the subset of uncached documents.
  • Stores all newly returned samples back into the cache after each sampling method finishes.
  • Supports both synchronous and asynchronous sampling methods, enabling seamless use with both async and sync vLLM.

Most of the logic lives in src/lighteval/utils/cache_management.py, where the decorator was rewritten from scratch. All remaining changes simply remove references to the previous caching implementation.

Currently, cache writes occur only after a full sampling method completes (which can take hours for certain tasks). In the future, we can decorate inner per-request methods to enable more granular incremental caching.
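
For illustration, here is a minimal sketch of this caching pattern. It assumes a model_call(self, docs, ...) method signature and a hypothetical build_cache_key helper; the actual decorator in src/lighteval/utils/cache_management.py differs in its details.

import asyncio
import functools
import json

from diskcache import Cache


def build_cache_key(doc, sampling_method, config):
    # Hypothetical helper: derive a stable string key from the document,
    # the sampling method, and the model configuration.
    return json.dumps([repr(doc), sampling_method, repr(config)], sort_keys=True)


def cached(sampling_method):
    def decorator(model_call):
        def split(self, docs):
            # Shared cache located at model.config.cache_dir, as described above.
            cache = Cache(self.config.cache_dir)
            keys = [build_cache_key(doc, sampling_method, self.config) for doc in docs]
            hits = {i: cache[k] for i, k in enumerate(keys) if k in cache}
            misses = [i for i in range(len(docs)) if i not in hits]
            return cache, keys, hits, misses

        def merge(cache, keys, hits, misses, new_responses, total):
            results = [None] * total
            for i, response in hits.items():
                results[i] = response
            # Writes happen only after the underlying sampling method has finished.
            for i, response in zip(misses, new_responses):
                results[i] = response
                cache[keys[i]] = response
            return results

        if asyncio.iscoroutinefunction(model_call):
            @functools.wraps(model_call)
            async def wrapper(self, docs, *args, **kwargs):
                cache, keys, hits, misses = split(self, docs)
                new = []
                if misses:
                    new = await model_call(self, [docs[i] for i in misses], *args, **kwargs)
                return merge(cache, keys, hits, misses, new, len(docs))
        else:
            @functools.wraps(model_call)
            def wrapper(self, docs, *args, **kwargs):
                cache, keys, hits, misses = split(self, docs)
                new = []
                if misses:
                    new = model_call(self, [docs[i] for i in misses], *args, **kwargs)
                return merge(cache, keys, hits, misses, new, len(docs))

        return wrapper

    return decorator

A sampling method would then be wrapped with something like @cached("generative"); the label is illustrative.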

Additional Improvements

  • Added a progress bar for async vLLM to improve user feedback during long-running evaluations.

Testing

The implementation has been tested with:

uv run lighteval vllm examples/model_configs/vllm_model_config.yaml aime25   # async + sync vLLM
uv run lighteval vllm examples/model_configs/transformers_model.yaml aime25
uv run pytest

Co-authored-by: Francesco Bertolotti <[email protected]>
@f14-bertolotti changed the title from "graceful shutdown of vllm async" to "diskcache for caching" on Nov 19, 2025
@NathanHB
Member

oh great !! will test it today and come back to you for review :)
cc @clefourrier

@HuggingFaceDocBuilderDev
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@clefourrier left a comment

Looking nice overall, a couple nits - could you also try testing it with the transformers and inference provider backends?

Comment on lines 623 to 646
with rich.progress.Progress(
    "[progress.description]{task.description}",
    rich.progress.BarColumn(),
    "[progress.completed]{task.completed}/{task.total}",
    "•",
    rich.progress.TimeElapsedColumn(),
    "•",
    rich.progress.TimeRemainingColumn(),
) as pbar:
    task_id = pbar.add_task("[green]Sending Requests...", total=len(docs))

    async def track(coro):
        """Wraps a coroutine to update progress bar when done."""
        result = await coro
        pbar.update(task_id, advance=1)
        return result

    wrapped = [
        track(self._async_one_item(index=index, doc=doc, generative=generative))
        for index, doc in enumerate(docs)
    ]

    result = await asyncio.gather(*wrapped)
    return result
Member

unrelated to current PR, would make sense to have it in its standalone PR

Contributor Author

I will make a separate PR for it

# Save updated dataset
dataset = Dataset.from_list(all_samples)
dataset.to_parquet(str(cache_file))
def default_json_encoder(obj):
Member

We already have a json encoder in the utils iirc

Contributor Author

I am having trouble finding it. Could you point it out for me?

Member

EnhancedJSONEncoder in src/lighteval/logging/evaluation_tracker.py

Member

@NathanHB wdyt of moving it to the utils? it's not logging specific

Contributor Author

That encoder should work perfectly. I only need to add .model_dump() for pydantic objects.

Member

perfect, ty

Contributor Author

A quick note about the encoder: I noticed that the method is defined as an instance method, which means it expects a self parameter. I can work around this by forcing self to None:

functools.partial(EnhancedJSONEncoder.default, None)

However, it might be cleaner to define it as a @staticmethod within EnhancedJSONEncoder instead.

What do you think?

Member

I don't think you need this - it's supposed to be used with the cls arg in json.dumps, so instead of doing default=your_func you do cls=the_class
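
For reference, a minimal example of the cls= pattern described above. The EnhancedJSONEncoder here is a stand-in rather than the one in src/lighteval/logging/evaluation_tracker.py, and the .model_dump() branch reflects the pydantic extension discussed earlier.

import dataclasses
import json


class EnhancedJSONEncoder(json.JSONEncoder):
    # Stand-in encoder: handles dataclasses, plus .model_dump() for pydantic objects.
    def default(self, o):
        if dataclasses.is_dataclass(o):
            return dataclasses.asdict(o)
        if hasattr(o, "model_dump"):  # pydantic v2 models (assumption)
            return o.model_dump()
        return super().default(o)


@dataclasses.dataclass
class Example:
    name: str
    value: int


# Pass the class via cls= instead of binding default= to an unbound method.
print(json.dumps(Example("aime25", 42), cls=EnhancedJSONEncoder))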

Member

Unless I'm misunderstanding your issue, so really feel free to extend a bit on what you need if relevant :)

Contributor Author

mb, I didn't even know you could pass a class to json.dumps.

Comment on lines 76 to 77
if isinstance(docs, Doc):
    docs = [docs]
Member

you could keep as_list
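
For reference, the kind of helper being referred to, as a sketch (the actual as_list utility in lighteval may differ):

def as_list(item):
    # Normalize a single element (or a tuple) into a list, leaving lists untouched.
    if isinstance(item, list):
        return item
    if isinstance(item, tuple):
        return list(item)
    return [item]


# e.g. docs = as_list(docs) would replace the isinstance(docs, Doc) special case above.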

doc, sampling_method=sampling_method, config=self.config, args=args, kwargs=kwargs
)
if key in cache:
    logger.info("Cache hit")
Member

You're going to get too many logs with this, especially on datasets like mmlu with 10K+ samples - maybe just store how many times the cache was hit and log it at the end?
Cache was hit for x documents out of y.
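
A sketch of the aggregated logging being suggested here, with illustrative function and variable names:

import logging

logger = logging.getLogger(__name__)


def fill_from_cache(cache, keys, docs):
    # Count hits while filling results, then emit a single summary log line.
    results = [None] * len(docs)
    hits = 0
    for idx, key in enumerate(keys):
        if key in cache:
            results[idx] = cache[key]["response"]
            hits += 1
    logger.info("Cache was hit for %d documents out of %d.", hits, len(docs))
    return results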

logger.info("Cache hit")
results[idx] = cache[key]["response"]
else:
logger.info("Cache miss")
Member

idem


return wrapper
def decorator(sampler: Callable): # noqa: C901
@functools.wraps(sampler)
Member

sampler is a bit misleading here - maybe model_call would be clearer?

model_outputs = await self.model.loglikelihood(docs)
outputs[sampling_method] = model_outputs

self.model.cleanup()
Member

nice catch

Member

@NathanHB left a comment

Just gave it a go locally, it's great! Thanks for the contrib.
Will merge once the nits above are addressed and the tests are fixed :)

@f14-bertolotti
Contributor Author

Ok, I think I have addressed all the issues. I am opening another PR for the progress bar.

@f14-bertolotti
Contributor Author

I have added a last-minute commit. I forgot to remove two lines related to this comment: #1068 (comment)

Comment on lines 105 to 107
if isinstance(docs, Doc):
    docs = [docs]

Member

you should use as_list, not remove it entirely ^^"

Contributor Author

mb, I misunderstood your comment. The new commit uses as_list.

@NathanHB
Member

hey @f14-bertolotti, do you need any help on this? 🤗

@f14-bertolotti
Contributor Author

I took a look at this over the weekend. I'm having some trouble running the tests, though: on my laptop they take a very long time, and the compute nodes I can access don't have internet connectivity, which complicates things further.

From what I can tell, the issue seems to be that a list has, for some reason, become a tuple. For example:

Current: [(330, 2)]
Reference: [[330, 2]]

I don’t see how this would be related to the caching. If you could also take a look, it would be really helpful.

@f14-bertolotti
Contributor Author

Hi @NathanHB,

It took me a while, but I ran the test that appears to be failing (tests/slow_tests/test_vllm_model.py::test_vllm_model) on both the main and diskcache branches.

In both cases, the test fails due to differences in the generated outputs. For example:

E                 Output_Tokens differs:
E                   Current: [[52, 2]]
E                   Reference: [[340, 2]]

My guess is that this may be caused by vLLM not being fully deterministic. I also noticed that a CUDA environment variable (CUBLAS_WORKSPACE_CONFIG) is set, so hardware differences could be contributing as well (I ran these tests on a Mac).

Regarding the original issue we saw with tests/slow_tests/test_vllm_model.py::test_vllm_model, I haven’t been able to reproduce it on my machine.

@f14-bertolotti
Contributor Author

Hi @NathanHB ,

I was wondering what your plans are for this PR. Are you waiting for a fix to tests/slow_tests/test_vllm_model.py::test_vllm_model?

Do you have an estimated timeline for the merge? I’m asking because I’d like to start working on additional contributions once this is merged.

@NathanHB
Member

NathanHB commented Dec 4, 2025

hey @f14-bertolotti! Sorry, I've been off, and we have many other exciting things running for evals :)
I want to fix the tests before merging the PR; it should be an easy issue to fix. I will try to take a look on Friday, or otherwise at the start of next week.
Sorry for the delay on this!

