Understanding Query Execution Plans

Java Jedi
6 min readJan 13, 2025

How Query Plans Affect Query Performance

In our previous post, we discussed how to use the EXPLAIN operator to check whether an index is being utilized during query execution. However, we didn’t dive deeply into how EXPLAIN works or what the output means. If you missed that post, be sure to check it out for a solid foundation.

In today’s post, we’ll take a deeper look at the Query Execution Plan and break down the various key aspects, including Index Scan, Gather, Seq Scan, and important metrics like cost, rows, and more. Let’s get started!

We have a id column with an index created because id is declared as the Primary Key. Let’s retrieve all users whose id is less than 100.

EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE ID < 100;

Execution Plan for Query Using the Indexed Column — id

+--------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------+
|Index Scan using users_pkey on users (cost=0.42..11.30 rows=106 width=52)|
| Index Cond: (id < 100) |
+--------------------------------------------------------------------------+

Let’s break down the output to understand what is going under the hood.

Query Plain Breakdown

1. Index Scan using users_pkey on users:

  • Index Scan : this means that the database is using an index to locate the rows that satisfy the condition, rather than sequential scanning of the whole table row by row. Index Scan is typically more efficient for queries that can take advantage of an index.
  • using users_pkey: This specifies that the database is using the users_pkey index to perform the scan that is created automatically when id is declared as Primary Key.
  • on users : It indicates which table the scan is being applied to — in this case users table.

2. (cost=0.42..11.30 rows=106 width=52):

  • cost=0.42..11.30 : This is the estimated cost of executing the query, expressed in arbitrary units. It’s important to note that this cost is neither directly related to time nor memory usage but represents the internal calculation PostgreSQL uses to determine the efficiency of the query plan.
    - 0.42: This is the estimated startup cost, or the cost to begin scanning. the index.
    - 11.30: This is the total cost of the query plan, including reading the data and completing the operation.
    The database uses these cost estimates to compare different possible query plans and choose the most efficient one. A lower cost indicates a more efficient plan.
  • rows=106: This represents the estimated number of rows that the database expects to return based on the query plan. In this case, the database expects 106 rows to match the condition id < 100.
  • width=52: This represents the average width (in bytes) of each row returned by the query. It’s the estimated size of each row in the result set.

3. Index Cond: (id < 100):

  • Index Cond: This refers to the condition that the index is using to filter rows. In this case, the condition is id < 100.
    The index (users_pkey) is being used to filter out rows where the id is less than 100. This is more efficient than a sequential scan because the index is sorted, and the database can quickly jump to the relevant rows rather than checking every row in the table.

Execution Plan for Query Using the Non-Indexed Column — first_name

EXPLAIN SELECT *
FROM users WHERE first_name = 'Jason';

Output

+-------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------+
|Gather (cost=1000.00..19869.57 rows=328 width=76) |
| Workers Planned: 2 |
| -> Parallel Seq Scan on users (cost=0.00..18836.77 rows=137 width=76)|
| Filter: (first_name = 'Jason'::text) |
+-------------------------------------------------------------------------+

Query Plan Breakdown

Gather (cost=1000.00..19869.57 rows=328 width=76):

  • Gather: This indicates that the query is using parallelism. The query is being executed by multiple workers and the results are collected (gathered) into a single result set. It suggests that the database has decided parallel execution would be beneficial for this query.
  • cost=1000.00..19869.57: The cost range represents the estimated start-up cost and total cost for the Gather operation.
    - 1000.00 is the startup cost for the Gather operation.
    - 19869.57 is the total cost of the query (including both startup and processing costs).
  • rows=328: This represents the estimated total number of rows that will be returned by the query after gathering the results from all workers. The query planner estimates that it will return 328 rows in total.
  • width=52: This is the average width (in bytes) of each row in the result set. Each row is estimated to be 52 bytes in size.

2. Workers Planned: 2

  • This part indicates that 2 worker processes have been planned to execute the query in parallel. These workers will each scan a part of the users table concurrently, which can lead to a significant performance improvement when scanning large datasets.

3. -> Parallel Seq Scan on users (cost=0.00..18836.77 rows=137 width=52)

  • Parallel Seq Scan: This shows that the database is performing a sequential scan on the users table, but it's doing so in parallel. Each of the worker processes will scan different portions of the table. While it’s still a sequential scan (i.e., reading the table row by row), it's being done by multiple workers concurrently.
  • cost=0.00..18836.77: This represents the cost range for the Parallel Sequential Scan:
    - 0.00 is the startup cost for each worker.
    - 18836.77 is the total cost of the scan for each worker.
  • rows=137: This is the estimated number of rows that each worker will process. Since there are 2 workers, the total number of rows processed will be 137 rows per worker, giving an estimated total of 274 rows (although the final total rows from the Gather step is 328, which suggests some rows may be processed by the second worker or split among the workers).
  • width=52: This is the average width of each row in the result set for the sequential scan. Each row is estimated to take 52 bytes.

4. Filter: (first_name = ‘Jason’::text)

  • Filter: This indicates that there is a filter condition applied to the data being scanned: the query is looking for rows where the first_name column is equal to 'Jason'.
  • first_name = 'Jason': This condition filters the rows based on the value of the first_name column.
    'Jason'::text: The ::text part is a type cast, meaning the string 'Jason' is being explicitly cast to the text data type, though in this case, it's likely redundant because 'Jason' is already a text value.

Putting it All Together

This query plan shows that the database is performing a parallel sequential scan on the users table, with two worker processes scanning the table in parallel. Each worker is looking for rows where the first_name column equals 'Jason'. After scanning, the Gather node collects the results from the two workers and returns the final set of rows.

Why Parallel Execution?

Parallel execution can be helpful when dealing with large datasets, as it allows the system to process multiple parts of the data simultaneously, which reduces the overall query execution time. This is especially useful for large tables or queries that need to scan a lot of data.

In this case, PostgreSQL decided that a parallel sequential scan would be faster than other potential strategies (like an index scan), likely because it’s scanning a large portion of the table and the filter condition (first_name = 'Jason') applies to many rows.

Key Points to Remember:

  • Gather: Combines results from multiple parallel workers.
  • Parallel Seq Scan: Scans the table row-by-row, but done in parallel by multiple workers.
  • Workers Planned:Two parallel workers are used to scan the data.
  • Filter: Filters rows where first_name = 'Jason'.

Breaking Down the Index Scan on an Indexed Column

In the first example, the output for the query was as following:

EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE ID < 100;

-- OUTPUT

+-------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------+
|Index Scan using users_pkey on users (cost=0.42..11.09 rows=94 width=52)|
| Index Cond: (id < 100) |
+-------------------------------------------------------------------------+

However, what if we change the condition? Let’s say we need to retrieve all the users whose id > 100. Let’s check what we get in this scenario.


EXPLAIN SELECT id, first_name, last_name, email, birthday, is_pro
FROM users WHERE id > 100;

-- Output

+-------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------+
|Seq Scan on users (cost=0.00..26054.85 rows=989813 width=52)|
| Filter: (id > 100) |
+-------------------------------------------------------------+

OOPS! What happened to our query and index? Although index is defined on a column, the PostgreSQL used Seq Scan instead of Index Scan.

In the next post, we’ll dive into the factors that influence this decision, such as cardinality and selectivity, and explore how they affect the execution plan to better understand query performance.

In this post, we explored how PostgreSQL builds the query execution plan based on the presence and absence of a dedicated index on a column. We also delved into key aspects of the query execution, including the roles of Gather, cost, rows, width, and other important details.

If you enjoyed this post, be sure to check out my other articles! Subscribe to stay updated on future posts. In upcoming articles, we’ll continue exploring Query Plans and Database Indexes

Happy coding! 🧑‍💻👩‍💻👨‍💻

--

--

Java Jedi
Java Jedi

No responses yet