Better cache management with broadcast and caches. #406

Open · 4 tasks

cugni opened this issue Sep 10, 2024 · 0 comments
Labels
type: enhancement Improvement of existing feature or code type:performance


cugni commented Sep 10, 2024

What went wrong?

In Qbeast Spark, we call broadcast in multiple places (e.g., twice in BroadcastedTableChanges, once in DenormalizedBlock, and four times in OTreeDataAnalyzer), but we never unpersist or destroy the broadcasts, so they occupy Spark memory unnecessarily. Spark uses an LRU policy to evict the cache, but when we know a broadcast is no longer needed, we should clean it up to avoid competing with other workloads on the same Spark cluster.

The complexity of solving this is that our API often passes DataFrames around, so the part of the code that creates the broadcast is not the one that executes the final action on the DataFrame; there is no clear point at which to call unpersist.
For example, in this (non-working) Scala pseudo-code:

def f(df: DataFrame): DataFrame = {
  val bc = spark.sparkContext.broadcast(Seq(1, 2, 3))
  df.mapPartitions { rows =>
    val l = bc.value
    // do something with l
    rows
  }
  // I can't call bc.unpersist() here, as the broadcast hasn't been used yet.
}


val data = f(spark.range(10).toDF()).collect()
// Now I could unpersist bc, but I have no reference to it.
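One workaround is to make f return the broadcast handle alongside the result, so the caller can unpersist it after the action has run. A minimal self-contained sketch of the pattern, where the hypothetical FakeBroadcast stands in for Spark's Broadcast (and an eager map stands in for the lazy mapPartitions):

```scala
// FakeBroadcast is a stand-in for org.apache.spark.broadcast.Broadcast;
// only `value` and `unpersist()` are modeled here.
final class FakeBroadcast[T](val value: T) {
  var persisted: Boolean = true
  def unpersist(): Unit = persisted = false
}

// f returns both the transformed data and the broadcast handle, so the
// caller can release the broadcast once the final action has run.
def f(data: Seq[Int]): (Seq[Int], FakeBroadcast[Seq[Int]]) = {
  val bc = new FakeBroadcast(Seq(1, 2, 3))
  val out = data.map(x => x + bc.value.sum) // stand-in for mapPartitions
  (out, bc)
}

val (out, bc) = f(Seq(10, 20))
// The "action" has run; now it is safe to release the broadcast.
bc.unpersist()
```

This keeps cleanup explicit, at the cost of a more awkward signature for every function that creates a broadcast.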

Ideally, we should have something like

def f(df: DataFrame): DataFrame = {
  val bc = QbeastCacheContext.broadcast(Seq(1, 2, 3))
  df.mapPartitions { rows =>
    val l = bc.value
    // do something with l
    rows
  }
}

QbeastCacheContext.init()
val data = f(spark.range(10)).collect()
QbeastCacheContext.release()
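A hypothetical shape for such a context, sketched here without Spark dependencies (the register signature is an assumption; in real code the entries would be org.apache.spark.broadcast.Broadcast[_] and the cleanup would call unpersist()):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: a registry that tracks broadcast handles created
// during a job, so the driver can release them all after the final action.
object QbeastCacheContext {
  // Each entry is a deferred cleanup call (with Spark: bc.unpersist()).
  private val cleanups = ArrayBuffer.empty[() => Unit]

  def init(): Unit = cleanups.synchronized(cleanups.clear())

  // Register a handle together with the way to release it, and return it
  // unchanged so it can be used inline.
  def register[T](handle: T)(cleanup: T => Unit): T = {
    cleanups.synchronized(cleanups += (() => cleanup(handle)))
    handle
  }

  // Run every pending cleanup and forget the handles.
  def release(): Unit = cleanups.synchronized {
    cleanups.foreach(_())
    cleanups.clear()
  }
}
```

With Spark available, f would call something like QbeastCacheContext.register(spark.sparkContext.broadcast(...))(_.unpersist()), and the caller would wrap the final action between init() and release().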

TODO

  • Understand whether this is a problem (what is the impact of leaving an unused object in the cache)?
  • Propose an API.
  • Write code and tests for the API.
  • Update the code to use the new API.
@cugni cugni added type: enhancement Improvement of existing feature or code priority: normal This issue has normal priority type:performance labels Sep 10, 2024
@fpj fpj added type:performance and removed priority: normal This issue has normal priority type:performance labels Oct 7, 2024