
as_of with Postgres generates a Sequential Scan with composite index enabled #1310

Open
tim-schilling opened this issue Feb 19, 2024 · 5 comments

tim-schilling commented Feb 19, 2024

Describe the bug
When using print(MyModel.history.as_of(as_of).only('id').explain(analyze=True)), it generates a Sequential Scan from this code:

elif backend == "postgresql":
    latest_pk_attr_historic_ids = (
        self.order_by(self._pk_attr, "-history_date", "-pk")
        .distinct(self._pk_attr)
        .values_list("pk", flat=True)
    )
    latest_historics = self.filter(history_id__in=latest_pk_attr_historic_ids)

To Reproduce
Steps to reproduce the behavior:

  1. Run print(MyModel.history.as_of(as_of).only('id').explain(analyze=True))
  2. Look for "Seq Scan on" in the report.
In [13]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Sort  (cost=33072.63..33264.17 rows=76618 width=16) (actual time=413.229..418.838 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Sort Method: external merge  Disk: 3576kB
  ->  Nested Loop  (cost=25555.05..26856.84 rows=76618 width=16) (actual time=119.558..391.156 rows=140182 loops=1)
        ->  HashAggregate  (cost=25554.76..25556.76 rows=200 width=4) (actual time=118.856..137.045 rows=140182 loops=1)
              Group Key: u0.history_id
              Batches: 5  Memory Usage: 10817kB  Disk Usage: 1600kB
              ->  Unique  (cost=22932.70..23698.88 rows=148470 width=16) (actual time=86.763..100.536 rows=140182 loops=1)
                    ->  Sort  (cost=22932.70..23315.79 rows=153236 width=16) (actual time=86.762..91.731 rows=145860 loops=1)
                          Sort Key: u0.id, u0.history_date DESC, u0.history_id DESC
                          Sort Method: external merge  Disk: 3720kB
                          ->  Seq Scan on lms_historicalsubmission u0  (cost=0.00..7113.45 rows=153236 width=16) (actual time=0.598..50.720 rows=145860 loops=1)
                                Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        ->  Index Scan using lms_historicalsubmission_pkey on lms_historicalsubmission  (cost=0.29..7.74 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=140182)
              Index Cond: (history_id = u0.history_id)
              Filter: ((history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone) AND ((history_type)::text <> '-'::text))
Planning Time: 2.461 ms
Execution Time: 425.243 ms

Expected behavior
Sequential scans are the worst type of operation since they mean every single record is being checked. This should not happen here, especially considering the basic nature of the query.

If we use the SQLite version of latest_of_each, which uses a Subquery, we end up hitting an index scan, which is significantly better:

In [2]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Incremental Sort  (cost=17.69..1321467.26 rows=76618 width=16) (actual time=0.549..290.055 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Presorted Key: lms_historicalsubmission.history_date
  Full-sort Groups: 4378  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
  Pre-sorted Groups: 13  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
  ->  Index Scan Backward using lms_histori_history_8e6816_idx on lms_historicalsubmission  (cost=0.42..1318028.83 rows=76618 width=16) (actual time=0.097..277.867 rows=140182 loops=1)
        Index Cond: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        Filter: (((history_type)::text <> '-'::text) AND (SubPlan 1))
        Rows Removed by Filter: 5678
        SubPlan 1
          ->  Limit  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                ->  Sort  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                      Sort Key: u0.history_date DESC, u0.history_id DESC
                      Sort Method: quicksort  Memory: 25kB
                      ->  Index Scan using lms_historicalsubmission_id_1ed5a117 on lms_historicalsubmission u0  (cost=0.42..8.44 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                            Index Cond: (id = lms_historicalsubmission.id)
                            Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
Planning Time: 3.206 ms
Execution Time: 295.404 ms

Environment (please complete the following information):

  • OS: macOS 14.2.1 (MacBook Air, M1)
  • Browser (if applicable): None
  • Django Simple History Version: development / 3.0.0
  • Django Version: 4.2.10
  • Database Version: Postgres 15.5

Suggested fix
I'd suggest dropping the postgresql logic branch in the latest_of_each method in favor of the Subquery.
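
For reference, a minimal sketch of what the Subquery-based branch roughly looks like inside latest_of_each (self and self._pk_attr mirror the Postgres branch above; the exact code in the library may differ):

from django.db.models import OuterRef, Subquery

# Rough sketch of the non-Postgres fallback branch of latest_of_each();
# `self` is the historical queryset, as in the Postgres branch shown above.
latest_pk_attr_historic_ids = (
    self.filter(**{self._pk_attr: OuterRef(self._pk_attr)})
    .order_by("-history_date", "-pk")
    .values("pk")[:1]
)
latest_historics = self.filter(history_id__in=Subquery(latest_pk_attr_historic_ids))

This correlated subquery is what shows up as SubPlan 1 in the second plan above, and it lets Postgres drive the outer query off an index scan.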

@tim-schilling

@jeking3 I believe you added this method in #866. I'd love to get your opinion on this.

dmartin commented Apr 29, 2024

I seem to be running into this as well, regardless of whether a composite index is being used. By adding a
with patch("django.db.connection.vendor", "something else"):
context manager around my .as_of() calls to avoid the Postgres-specific branch, I'm seeing up to 6x speed improvements on certain queries.
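
For anyone wanting to try the same workaround, a minimal sketch of how it could be wrapped (the helper name and model are illustrative, not part of simple_history):

from unittest.mock import patch

def as_of_without_pg_branch(model, as_of):
    # Pretend the backend is something other than "postgresql" so that
    # latest_of_each() falls through to the Subquery-based branch.
    # Build and evaluate the queryset inside the patch so the vendor
    # check sees the patched value.
    with patch("django.db.connection.vendor", "something else"):
        return list(model.history.as_of(as_of).only("id"))

Usage would be e.g. as_of_without_pg_branch(Submission, as_of).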

@tim-schilling

@ddabble any concern with removing that code block? I think this is an easy win.

ddabble commented May 3, 2024

Sure, as long as removing it doesn't severely affect performance in some cases (which I assumed was the reason it was added in the first place, but perhaps not..?), I'm all for both improving performance and reducing the amount of DB backend-specific code 🙂 It would be great to have some tests analyzing/measuring the query performance, though, but I'm not familiar with any method of doing that easily 🤔
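
One possible sketch of such a test, assuming the suite runs against Postgres and that setUpTestData creates enough historical rows that the planner would not pick a sequential scan anyway (model and test names are illustrative):

from unittest import skipUnless

from django.db import connection
from django.test import TestCase
from django.utils import timezone

# from myapp.models import Submission  # illustrative model


@skipUnless(connection.vendor == "postgresql", "EXPLAIN output is backend-specific")
class AsOfQueryPlanTests(TestCase):
    def test_as_of_avoids_sequential_scan(self):
        # The planner always chooses a sequential scan on tiny tables, so this
        # assertion is only meaningful with a realistic number of historical rows.
        plan = Submission.history.as_of(timezone.now()).only("id").explain(analyze=True)
        self.assertNotIn("Seq Scan", plan)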

ddabble commented May 3, 2024

@jeking3 I believe you added this method in #866. I'd love to get your opinion on this.

Btw, it looks like the method was mainly added by @TyrantWave in #759, and that @jeking3 simply moved the code in #866
