
as_of with Postgres generates a Sequential Scan with composite index enabled #1310

Open
tim-schilling opened this issue Feb 19, 2024 · 5 comments

tim-schilling commented Feb 19, 2024

Describe the bug
When using print(MyModel.history.as_of(as_of).only('id').explain(analyze=True)), it generates a Sequential Scan from this code:

elif backend == "postgresql":
    latest_pk_attr_historic_ids = (
        self.order_by(self._pk_attr, "-history_date", "-pk")
        .distinct(self._pk_attr)
        .values_list("pk", flat=True)
    )
    latest_historics = self.filter(history_id__in=latest_pk_attr_historic_ids)

To Reproduce
Steps to reproduce the behavior:

  1. Run print(MyModel.history.as_of(as_of).only('id').explain(analyze=True))
  2. Look for "Seq Scan on" in the report.
In [13]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Sort  (cost=33072.63..33264.17 rows=76618 width=16) (actual time=413.229..418.838 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Sort Method: external merge  Disk: 3576kB
  ->  Nested Loop  (cost=25555.05..26856.84 rows=76618 width=16) (actual time=119.558..391.156 rows=140182 loops=1)
        ->  HashAggregate  (cost=25554.76..25556.76 rows=200 width=4) (actual time=118.856..137.045 rows=140182 loops=1)
              Group Key: u0.history_id
              Batches: 5  Memory Usage: 10817kB  Disk Usage: 1600kB
              ->  Unique  (cost=22932.70..23698.88 rows=148470 width=16) (actual time=86.763..100.536 rows=140182 loops=1)
                    ->  Sort  (cost=22932.70..23315.79 rows=153236 width=16) (actual time=86.762..91.731 rows=145860 loops=1)
                          Sort Key: u0.id, u0.history_date DESC, u0.history_id DESC
                          Sort Method: external merge  Disk: 3720kB
                          ->  Seq Scan on lms_historicalsubmission u0  (cost=0.00..7113.45 rows=153236 width=16) (actual time=0.598..50.720 rows=145860 loops=1)
                                Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        ->  Index Scan using lms_historicalsubmission_pkey on lms_historicalsubmission  (cost=0.29..7.74 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=140182)
              Index Cond: (history_id = u0.history_id)
              Filter: ((history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone) AND ((history_type)::text <> '-'::text))
Planning Time: 2.461 ms
Execution Time: 425.243 ms

Expected behavior
Sequential scans are the worst type of operation since they mean every single record is being checked. This should not happen here, especially considering the basic nature of the query.

If we use the SQLite version of latest_of_each, which uses a Subquery, we end up hitting an index scan, which is significantly better:

In [2]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Incremental Sort  (cost=17.69..1321467.26 rows=76618 width=16) (actual time=0.549..290.055 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Presorted Key: lms_historicalsubmission.history_date
  Full-sort Groups: 4378  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
  Pre-sorted Groups: 13  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
  ->  Index Scan Backward using lms_histori_history_8e6816_idx on lms_historicalsubmission  (cost=0.42..1318028.83 rows=76618 width=16) (actual time=0.097..277.867 rows=140182 loops=1)
        Index Cond: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        Filter: (((history_type)::text <> '-'::text) AND (SubPlan 1))
        Rows Removed by Filter: 5678
        SubPlan 1
          ->  Limit  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                ->  Sort  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                      Sort Key: u0.history_date DESC, u0.history_id DESC
                      Sort Method: quicksort  Memory: 25kB
                      ->  Index Scan using lms_historicalsubmission_id_1ed5a117 on lms_historicalsubmission u0  (cost=0.42..8.44 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                            Index Cond: (id = lms_historicalsubmission.id)
                            Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
Planning Time: 3.206 ms
Execution Time: 295.404 ms

Environment (please complete the following information):

  • OS: macOS 14.2.1 (MacBook Air, M1)
  • Browser (if applicable): None
  • Django Simple History Version: development / 3.0.0
  • Django Version: 4.2.10
  • Database Version: Postgres 15.5

Suggested fix
I'd suggest dropping the postgresql logic branch in the latest_of_each method in favor of the Subquery.
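
For reference, a minimal sketch of what the Subquery-based branch roughly looks like inside latest_of_each (self and self._pk_attr mirror the Postgres branch above; the exact code in the library may differ):

from django.db.models import OuterRef, Subquery

# Rough sketch of the non-Postgres fallback branch of latest_of_each();
# `self` is the historical queryset, as in the Postgres branch shown above.
latest_pk_attr_historic_ids = (
    self.filter(**{self._pk_attr: OuterRef(self._pk_attr)})
    .order_by("-history_date", "-pk")
    .values("pk")[:1]
)
latest_historics = self.filter(history_id__in=Subquery(latest_pk_attr_historic_ids))

This correlated subquery is what shows up as SubPlan 1 in the second plan above, and it lets Postgres drive the outer query off an index scan.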

@tim-schilling

@jeking3 I believe you added this method in #866. I'd love to get your opinion on this.

dmartin commented Apr 29, 2024

I seem to be running into this as well, regardless of whether a composite index is being used. By adding a
with patch("django.db.connection.vendor", "something else"):
context manager around my .as_of() calls to avoid the Postgres-specific branch, I'm seeing up to 6x speed improvements on certain queries.
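
For anyone wanting to try the same workaround, a minimal sketch of how it could be wrapped (the helper name and model are illustrative, not part of simple_history):

from unittest.mock import patch

def as_of_without_pg_branch(model, as_of):
    # Pretend the backend is something other than "postgresql" so that
    # latest_of_each() falls through to the Subquery-based branch.
    # Build and evaluate the queryset inside the patch so the vendor
    # check sees the patched value.
    with patch("django.db.connection.vendor", "something else"):
        return list(model.history.as_of(as_of).only("id"))

Usage would be e.g. as_of_without_pg_branch(Submission, as_of).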

@tim-schilling

@ddabble any concern with removing that code block? I think this is an easy win.

ddabble commented May 3, 2024

Sure, as long as removing it doesn't severely affect performance in some cases (which I assumed was the reason it was added in the first place, but perhaps not..?), I'm all for both improving performance and reducing the amount of DB backend-specific code 🙂 It would be great to have some tests analyzing/measuring the query performance, though, but I'm not familiar with any method of doing that easily 🤔
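
One possible sketch of such a test, assuming the suite runs against Postgres and that setUpTestData creates enough historical rows that the planner would not pick a sequential scan anyway (model and test names are illustrative):

from unittest import skipUnless

from django.db import connection
from django.test import TestCase
from django.utils import timezone

# from myapp.models import Submission  # illustrative model


@skipUnless(connection.vendor == "postgresql", "EXPLAIN output is backend-specific")
class AsOfQueryPlanTests(TestCase):
    def test_as_of_avoids_sequential_scan(self):
        # The planner always chooses a sequential scan on tiny tables, so this
        # assertion is only meaningful with a realistic number of historical rows.
        plan = Submission.history.as_of(timezone.now()).only("id").explain(analyze=True)
        self.assertNotIn("Seq Scan", plan)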

ddabble commented May 3, 2024

@jeking3 I believe you added this method in #866. I'd love to get your opinion on this.

Btw, it looks like the method was mainly added by @TyrantWave in #759, and that @jeking3 simply moved the code in #866
