collection bindings OOM for large data sets #214

Open
huahaiy opened this issue May 27, 2023 · 1 comment
Labels
enhancement New feature or request

Comments


huahaiy commented May 27, 2023

From the Clojurians datalevin channel:

andersmurphy 4:56 AM
So I’m finding that when I use collection bindings, i.e. pass a collection with :in $ [?x ...], I’m much more likely to run out of memory, even if the collection contains a single value. Inlining the values and combining them with an or clause doesn’t run out of memory. Is there something that makes collection bindings inherently expensive?
This implementation does run out of memory (with large data sets):

(d/q '[:find (pull ?a [:artist/name])
       :in $ [?c ...]
       :where [?a :artist/country ?country]
              [?country :country/name ?c]]
     db ["Canada" "Japan"])

The implementations below don’t run out of memory (with large data sets):

(d/q '[:find (pull ?a [:artist/name])
       :where [?a :artist/country ?country]
              (or [?country :country/name "Canada"]
                  [?country :country/name "Japan"])]
     db)

or

(d/q '[:find (pull ?a [:artist/name])
       :in $ ?c1 ?c2
       :where [?a :artist/country ?country]
              (or [?country :country/name ?c1]
                  [?country :country/name ?c2])]
     db
     "Canada"
     "Japan")
@huahaiy huahaiy added the enhancement New feature or request label May 27, 2023

huahaiy commented Feb 12, 2024

The first approach is very expensive because it produces a cross product of the two countries with all artists, whereas the latter two options perform a natural join. A natural join is a lot cheaper than a Cartesian product.
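To make the size difference concrete, here is a minimal sketch in plain Clojure using clojure.set. The toy relations and all names in it (artist-country, wanted, cross, joined) are hypothetical and are not taken from Datalevin; the sketch only illustrates how the number of intermediate tuples differs between the two strategies, not how the query engine actually executes.

```clojure
(require '[clojure.set :as set])

;; Hypothetical toy relation: artists already joined with their country names.
(def artist-country
  #{{:artist "A1" :name "Canada"}
    {:artist "A2" :name "Japan"}
    {:artist "A3" :name "France"}})

;; The collection binding ["Canada" "Japan"] viewed as a relation.
(def wanted #{{:name "Canada"} {:name "Japan"}})

;; Cartesian product: every artist row is paired with every wanted value,
;; so the intermediate result has |artists| * |wanted| tuples.
(def cross
  (for [row artist-country
        w   wanted]
    [row w]))

;; Natural join on the shared :name key: only matching rows are produced.
(def joined (set/join artist-country wanted))

(println "cross product tuples:" (count cross))
(println "natural join tuples: " (count joined))
```

With three artists and two countries the product has six tuples while the join has two; with millions of artists the product grows by a factor equal to the size of the collection before any filtering happens.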

A possible optimization is to automatically translate the first form into one of the latter ones. This is a more advanced optimization that is further down the line. The next release of the optimizer probably won't have this feature, as we are focusing on optimizing the where clauses and simple bindings. Optimization of collection bindings will wait until that is done.
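Until such a translation exists, the rewrite can be done by hand on the caller's side. The following is a rough sketch, not the optimizer's implementation: or-query is a hypothetical helper that takes the collection of country names and builds the inlined or form of the query from the original report using ordinary data manipulation.

```clojure
;; Hypothetical helper: expands a collection of country names into the
;; inlined `or` query from the report, instead of using [?c ...].
(defn or-query
  [countries]
  (let [clauses   (for [c countries]
                    ['?country :country/name c])
        or-clause (apply list 'or clauses)]
    ;; conj on a vector appends, so the or clause lands at the end of :where.
    (conj '[:find (pull ?a [:artist/name])
            :where [?a :artist/country ?country]]
          or-clause)))

(println (or-query ["Canada" "Japan"]))
```

The resulting vector can be passed to d/q in place of the quoted query, for example (d/q (or-query ["Canada" "Japan"]) db). An empty collection would yield an or with no branches, so a real helper would need to handle that case separately.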
