Non-deterministic performance of query on select, from 1s to 60s on table with 1 billion rows
I'm trying to investigate why the performance of this query is so non-deterministic. It can take anywhere from 1 second to 60 seconds and above. The nature of the query is to select a "time window" and get all rows from within that time window.
Here's the query in question, running on a table of approximately 1 billion rows:
SELECT CAST(extract(EPOCH from ts)*1000000 as bigint) as ts
, ticks
, quantity
, side
FROM order_book
WHERE ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND ts < TO_TIMESTAMP(1618969033073383/1000000.0)
AND zx_prod_id = 0
ORDER BY ts ASC, del desc;
This is how the table is created:
CREATE TABLE public.order_book
(
ts timestamp with time zone NOT NULL,
zx_prod_id smallint NOT NULL,
ticks integer NOT NULL,
quantity integer NOT NULL,
side boolean NOT NULL,
del boolean NOT NULL
)
The values within TO_TIMESTAMP will keep sliding forward as I walk the whole table. Here is the EXPLAIN ANALYZE output for the same query on two different time windows:
Slow Performance
Gather Merge (cost=105996.20..177498.48 rows=586308 width=18) (actual time=45196.559..45280.769 rows=539265 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=116386 read=42298
  -> Sort (cost=104996.11..105240.40 rows=97718 width=18) (actual time=45169.717..45176.775 rows=77038 loops=7)
       Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
       Sort Method: quicksort Memory: 9327kB
       Worker 0: Sort Method: quicksort Memory: 8967kB
       Worker 1: Sort Method: quicksort Memory: 9121kB
       Worker 2: Sort Method: quicksort Memory: 9098kB
       Worker 3: Sort Method: quicksort Memory: 9075kB
       Worker 4: Sort Method: quicksort Memory: 9019kB
       Worker 5: Sort Method: quicksort Memory: 9031kB
       Buffers: shared hit=116386 read=42298
       -> Result (cost=0.57..96897.07 rows=97718 width=18) (actual time=7.475..45131.932 rows=77038 loops=7)
            Buffers: shared hit=116296 read=42298
            -> Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk (cost=0.57..95187.01 rows=97718 width=18) (actual time=7.455..45101.670 rows=77038 loops=7)
                 Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                 Filter: (zx_prod_id = 0)
                 Rows Removed by Filter: 465513
                 Buffers: shared hit=116296 read=42298
Planning Time: 1.107 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 9.273 ms, Inlining 0.000 ms, Optimization 2.008 ms, Emission 36.235 ms, Total 47.517 ms
Execution Time: 45335.178 ms
Fast Performance
Gather Merge (cost=105095.94..170457.62 rows=535956 width=18) (actual time=172.723..240.628 rows=546367 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=158212
  -> Sort (cost=104095.84..104319.16 rows=89326 width=18) (actual time=146.702..152.849 rows=78052 loops=7)
       Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
       Sort Method: quicksort Memory: 11366kB
       Worker 0: Sort Method: quicksort Memory: 8664kB
       Worker 1: Sort Method: quicksort Memory: 8986kB
       Worker 2: Sort Method: quicksort Memory: 9116kB
       Worker 3: Sort Method: quicksort Memory: 8858kB
       Worker 4: Sort Method: quicksort Memory: 9057kB
       Worker 5: Sort Method: quicksort Memory: 6611kB
       Buffers: shared hit=158212
       -> Result (cost=0.57..96750.21 rows=89326 width=18) (actual time=6.145..127.591 rows=78052 loops=7)
            Buffers: shared hit=158122
            -> Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk (cost=0.57..95187.01 rows=89326 width=18) (actual time=6.124..114.023 rows=78052 loops=7)
                 Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                 Filter: (zx_prod_id = 4)
                 Rows Removed by Filter: 464498
                 Buffers: shared hit=158122
Planning Time: 0.419 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 10.405 ms, Inlining 0.000 ms, Optimization 2.185 ms, Emission 39.188 ms, Total 51.778 ms
Execution Time: 274.413 ms
I interpreted this output as most of the blame lying on this parallel index scan.
At first, I tried raising work_mem to 1 GB and shared_buffers to 24 GB, thinking that maybe it couldn't fit everything it needed in RAM, but that didn't seem to help.
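Something along these lines (the exact statements weren't preserved, so this is only a sketch of those settings):
-- sketch: the settings changed above (shared_buffers only takes effect after a server restart)
ALTER SYSTEM SET work_mem = '1GB';
ALTER SYSTEM SET shared_buffers = '24GB';
SELECT pg_reload_conf();  -- picks up work_mem; shared_buffers still requires a restart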
Next, I tried creating an index on (zx_prod_id, ts), thinking that the filter on the parallel index scan might be taking a while, but that didn't seem to do anything either.
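Roughly like this (the exact statement wasn't preserved; the index gets the default name):
CREATE INDEX ON order_book (zx_prod_id, ts);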
I'm no database expert, and so I've kind of exhausted the limits of my knowledge.
1 Answer
Explanation
The all-important difference between the two query plans is the added read=xyz bit in multiple places of the slow version.
Slow:
Buffers: shared hit=116386 read=42298
Fast:
Buffers: shared hit=158212
This tells you that Postgres encountered data (or index) pages that were not cached yet. Repeat the slow query (possibly more than once, until the added read=xyz vanishes), and you'll see comparable performance on equally cached data pages. Read the manual about BUFFERS in the EXPLAIN output.
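For reference, the buffer counts above come from including the BUFFERS option, e.g. with the query from the question:
EXPLAIN (ANALYZE, BUFFERS)
SELECT CAST(extract(EPOCH from ts)*1000000 as bigint) as ts, ticks, quantity, side
FROM order_book
WHERE ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND ts < TO_TIMESTAMP(1618969033073383/1000000.0)
AND zx_prod_id = 0
ORDER BY ts ASC, del DESC;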
Notably, increasing work_mem can actually hurt your situation, as work_mem and cache memory (both the operating system cache and Postgres' own shared memory buffers) are competitors for the available RAM. Setting work_mem higher can lead to data being evicted from the (then smaller) cache sooner, so your "slow query" becomes more likely to occur. Don't forget that Postgres can use many times the amount set with work_mem: per the manual, each sort or hash operation may use up to that much memory, and several can run at the same time. Your query uses 6 parallel workers, so that's already 6x work_mem as a possible worst case; not every worker will use its maximum allowance.
In case of insufficient work_mem you see disk mentioned in various contexts of the EXPLAIN output. That's when additional work_mem is in order.
Increasing shared_buffers can make more sense. Still, don't set it too high, either; see the manual for guidance.
If you have plenty of RAM for everything (hard to imagine for a "table with 1 billion rows"), then all that's missing is that the data pages in question just weren't cached yet. You may be interested in pg_prewarm.
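A minimal sketch of what that could look like here, assuming the default _timescaledb_internal schema for chunks (the chunk and index names are the ones appearing in the plans above):
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- load the chunk touched by the plans, and its ts index, into shared buffers
SELECT pg_prewarm('_timescaledb_internal._hyper_16_214_chunk');
SELECT pg_prewarm('_timescaledb_internal._hyper_16_214_chunk_order_book_ts_idx');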
Aside: the second query plan shows Filter: (zx_prod_id = 4), which does not match the displayed query, but I think that's insignificant for the problem.
Additional improvements
You already tried an index on (zx_prod_id, ts), but it ... That index would typically look good for the query, though.
The multicolumn index is around 50 % larger than the index on just (ts): 4 + 8 + 16 = 28 bytes vs. 4 + 8 + 8 = 20 bytes per index tuple, and possibly less compression from "index deduplication". And since RAM seems to be a contested resource, that is a downside.
I see rows=539265 vs. Rows Removed by Filter: 465513, so the query reads twice the rows it needs to. Depending on data distribution, that index may or may not be an improvement. It would certainly be if your table was physically clustered on that index: then most data pages contain rows with the same zx_prod_id value, which makes that index much more effective.
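For illustration only, physically ordering a plain Postgres table that way would look like the following (the index name is hypothetical; on a TimescaleDB hypertable, reorder_chunk or a reorder policy plays the analogous role per chunk):
-- one-time rewrite of the table in index order; takes an exclusive lock,
-- and newly inserted rows are not kept in that order afterwards
CLUSTER order_book USING order_book_zx_prod_id_ts_idx;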
Your query seems to always filter for a single value of zx_prod_id. If that's focused around a few values, or if there are only a few distinct values to begin with, a couple of partial indexes can make sense. Like, if only zx_prod_id IN (1,4) are targeted:
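For example, something like this (index names are illustrative):
CREATE INDEX order_book_ts_zx1_idx ON order_book (ts) WHERE zx_prod_id = 1;
CREATE INDEX order_book_ts_zx4_idx ON order_book (ts) WHERE zx_prod_id = 4;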
Then the query can work with a (much) smaller index, and does not have to filter half of the rows.
ts obviously is type timestamptz. Still, you handle timestamps as bigint numbers representing microseconds. That split approach has downsides and pitfalls. It's typically best to work with timestamps across the food chain. Executing CAST(extract(EPOCH from ts)*1000000 as bigint) for every row is not free; it adds up for half a million rows.
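A sketch of the same query keeping ts as a timestamptz end to end (only the conversion in the SELECT list is dropped; this also makes the sort key the plain column rather than the computed expression seen in the plans):
SELECT ts, ticks, quantity, side
FROM order_book
WHERE ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND ts < TO_TIMESTAMP(1618969033073383/1000000.0)
AND zx_prod_id = 0
ORDER BY ts, del DESC;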