Understanding Query Execution Plans
How Query Plans Affect Query Performance
In our previous post, we discussed how to use the EXPLAIN operator to check whether an index is being utilized during query execution. However, we didn’t dive deeply into how EXPLAIN works or what the output means. If you missed that post, be sure to check it out for a solid foundation.
In today’s post, we’ll take a deeper look at the Query Execution Plan and break down the various key aspects, including Index Scan, Gather, Seq Scan, and important metrics like cost, rows, and more. Let’s get started!
We have a id
column with an index created because id
is declared as the Primary Key
. Let’s retrieve all users whose id
is less than 100.
EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE ID < 100;
Execution Plan for Query Using the Indexed Column — id
+--------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------+
|Index Scan using users_pkey on users (cost=0.42..11.30 rows=106 width=52)|
| Index Cond: (id < 100) |
+--------------------------------------------------------------------------+
Let’s break down the output to understand what is going under the hood.
Query Plain Breakdown
1. Index Scan using users_pkey on users:
Index Scan
: this means that the database is using an index to locate the rows that satisfy the condition, rather than sequential scanning of the whole table row by row. Index Scan is typically more efficient for queries that can take advantage of an index.using users_pkey
: This specifies that the database is using theusers_pkey
index to perform the scan that is created automatically whenid
is declared asPrimary Key
.on users
: It indicates which table the scan is being applied to — in this caseusers
table.
2. (cost=0.42..11.30 rows=106 width=52):
cost=0.42..11.30
: This is the estimated cost of executing the query, expressed inarbitrary units
. It’s important to note that this cost is neither directly related totime
normemory usage
but represents the internal calculation PostgreSQL uses to determine the efficiency of the query plan.
- 0.42: This is the estimated startup cost, or the cost to begin scanning. the index.
- 11.30: This is the total cost of the query plan, including reading the data and completing the operation.
The database uses these cost estimates to compare different possible query plans and choose the most efficient one. A lower cost indicates a more efficient plan.rows=106
: This represents the estimated number of rows that the database expects to return based on the query plan. In this case, the database expects 106 rows to match the conditionid < 100
.width=52
: This represents the average width (in bytes) of each row returned by the query. It’s the estimated size of each row in the result set.
3. Index Cond: (id < 100):
Index Cond:
This refers to the condition that the index is using to filter rows. In this case, the condition isid < 100
.
The index (users_pkey
) is being used to filter out rows where theid
is less than 100. This is more efficient than a sequential scan because the index is sorted, and the database can quickly jump to the relevant rows rather than checking every row in the table.
Execution Plan for Query Using the Non-Indexed Column — first_name
EXPLAIN SELECT *
FROM users WHERE first_name = 'Jason';
Output
+-------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------+
|Gather (cost=1000.00..19869.57 rows=328 width=76) |
| Workers Planned: 2 |
| -> Parallel Seq Scan on users (cost=0.00..18836.77 rows=137 width=76)|
| Filter: (first_name = 'Jason'::text) |
+-------------------------------------------------------------------------+
Query Plan Breakdown
Gather (cost=1000.00..19869.57 rows=328 width=76):
Gather
: This indicates that the query is using parallelism. The query is being executed by multiple workers and the results are collected (gathered) into a single result set. It suggests that the database has decided parallel execution would be beneficial for this query.cost=1000.00..19869.57
: Thecost
range represents the estimated start-up cost and total cost for the Gather operation.
- 1000.00 is the startup cost for theGather
operation.
- 19869.57 is the total cost of the query (including both startup and processing costs).rows=328:
This represents the estimated total number of rows that will be returned by the query after gathering the results from all workers. The query planner estimates that it will return 328 rows in total.width=52:
This is the average width (in bytes) of each row in the result set. Each row is estimated to be 52 bytes in size.
2. Workers Planned: 2
- This part indicates that 2 worker processes have been planned to execute the query in parallel. These workers will each scan a part of the
users
table concurrently, which can lead to a significant performance improvement when scanning large datasets.
3. -> Parallel Seq Scan on users (cost=0.00..18836.77 rows=137 width=52)
Parallel Seq Scan:
This shows that the database is performing a sequential scan on theusers
table, but it's doing so in parallel. Each of the worker processes will scan different portions of the table. While it’s still a sequential scan (i.e., reading the table row by row), it's being done by multiple workers concurrently.cost=0.00..18836.77:
This represents the cost range for the Parallel Sequential Scan:
- 0.00 is the startup cost for each worker.
- 18836.77 is the total cost of the scan for each worker.rows=137:
This is the estimated number of rows that each worker will process. Since there are 2 workers, the total number of rows processed will be 137 rows per worker, giving an estimated total of 274 rows (although the final total rows from theGather
step is 328, which suggests some rows may be processed by the second worker or split among the workers).width=52:
This is the average width of each row in the result set for the sequential scan. Each row is estimated to take 52 bytes.
4. Filter: (first_name = ‘Jason’::text)
Filter:
This indicates that there is a filter condition applied to the data being scanned: the query is looking for rows where thefirst_name
column is equal to'Jason'
.first_name = 'Jason'
: This condition filters the rows based on the value of thefirst_name
column.'Jason'::text
: The::text
part is a type cast, meaning the string'Jason'
is being explicitly cast to thetext
data type, though in this case, it's likely redundant because'Jason'
is already a text value.
Putting it All Together
This query plan shows that the database is performing a parallel sequential scan on the users
table, with two worker processes scanning the table in parallel. Each worker is looking for rows where the first_name
column equals 'Jason'
. After scanning, the Gather
node collects the results from the two workers and returns the final set of rows.
Why Parallel Execution?
Parallel execution can be helpful when dealing with large datasets, as it allows the system to process multiple parts of the data simultaneously, which reduces the overall query execution time. This is especially useful for large tables or queries that need to scan a lot of data.
In this case, PostgreSQL decided that a parallel sequential scan would be faster than other potential strategies (like an index scan), likely because it’s scanning a large portion of the table and the filter condition (first_name = 'Jason'
) applies to many rows.
Key Points to Remember:
Gather:
Combines results from multiple parallel workers.Parallel Seq Scan:
Scans the table row-by-row, but done in parallel by multiple workers.Workers Planned:
Two parallel workers are used to scan the data.Filter:
Filters rows wherefirst_name = 'Jason'
.
Breaking Down the Index Scan on an Indexed Column
In the first example, the output for the query was as following:
EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE ID < 100;
-- OUTPUT
+-------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------+
|Index Scan using users_pkey on users (cost=0.42..11.09 rows=94 width=52)|
| Index Cond: (id < 100) |
+-------------------------------------------------------------------------+
However, what if we change the condition? Let’s say we need to retrieve all the users whose id > 100
. Let’s check what we get in this scenario.
EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE id > 100;
-- Output
+-------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------+
|Seq Scan on users (cost=0.00..26054.85 rows=989813 width=52)|
| Filter: (id > 100) |
+-------------------------------------------------------------+
OOPS! What happened to our query and index? Although index is defined on a column, the PostgreSQL used Seq Scan
instead of Index Scan
.
In the next post, we’ll dive into the factors that influence this decision, such as cardinality
and selectivity
, and explore how they affect the execution plan to better understand query performance.
In this post, we explored how PostgreSQL builds the query execution plan based on the presence and absence of a dedicated index on a column. We also delved into key aspects of the query execution, including the roles of Gather, cost, rows, width, and other important details.
If you enjoyed this post, be sure to check out my other articles! Subscribe to stay updated on future posts. In upcoming articles, we’ll continue exploring Query Plans and Database Indexes
Happy coding! 🧑💻👩💻👨💻