SQL Optimization Series - Part 2 - Get to know SQL Indexes | By hungnguyen028

SQL

A better way to debug your PHP code

Khoa Pham

November 16, 2022

Discover iOS Dark Mode Programming

thomas.le

November 1, 2022

Introduction

Everyone knows that “An index makes the query fast,” but that is not enough to optimize your queries. We must also know how an index works internally, and how to use it properly.

If you are looking for how to optimize your SQL performance, you have come to the right place. For a quick walkthrough on indexes, check out our previous article.

Last time, we already discussed some common SQL indexes and their use cases that cause slow performance. In this article, we will continue with two more specific cases, and how to solve them.

Function-based index

Let’s take a look back to our previous example when we created an index on last_name to speed up the performance.

Although the index helps to improve the performance considerably, it requires the search term to be in the exact same type case (UPPER or lower) as what is stored in the database.

Now, how about we simplify the process by ignoring the case, and just allowing users to search by the last name? In this case, we can just simply convert both sides of the condition to UPPER CASE:

As expected, regardless of the capitalization of the search term or the last_name column, the UPPER function makes them match. Now let's take a look at the query plan:

Uh-oh, it's Seq Scan (Full Table Scan)! This means that the DBMS needs to scan the entire table to find the matching entries. But why does DBMS not use the index on last_name?

We can easily recognize the relation between last_name and UPPER(last_name), and we expect the database to see it as well. Yet the search is not on the last_name column, but on UPPER(last_name), and that is something completely different from the database’s perspective. This is a trap that we usually all fall into.

To support that query, we need an index that covers the UPPER(last_name):

An index that contains functions or expressions is called a Function-based index (FBI*)*. The function-based index applies the function first and puts the result into the index. As a result, the index tree contains the name in upper case.

The database can use a function-based index if the exact expression of the index definition appears in an SQL statement. Let's take a look at the execution plan:

Now that the database has used the new index for the query, it traverses the B-tree and follows the leaf node chain for the matching entries.

Since there is no dedicated function for Function-based Indexes, we can even use User-defined functions in the index definition. However, the function must be determined, which means that for the same parameters, the functions must always return the same results. When inserting a new row, the database will apply the function and store the result in the index, and will only be updated when the row is changed.

Range scanning index

Not only can we use equal operators when working with indexes, but also inequalities such as <> as well. We can also take advantage of this when optimizing queries with indexes.

The biggest problem of the Range scanning index is the leaf node traversal. If there are many leaf nodes to scan, we may face a performance issue. So the best practice is to keep the range as small as possible. But the question is, how?

The answer to explicitly mention the start and end of the query:

The index on date_of_birth is only scanned within the specified range, starting at the first date and ending at the second.

But what if we also want to find employees from a specific country? Then we need a new condition on country_id:

In this case, the index needs to cover 2 columns, but in which order?

Let’s see the range scan on this new index (date_of_birth, country_id)

a screenshot of the range scan on the above new index (date_of_birth, country_id)

The index is ordered by date_of_birth first. The country_id would only be used to sort these entries if more than one employee was born on the same day.

We can easily see that the country_id becomes useless during tree traversal and there is no entry with country_id = 27 in the branch nodes (only one in the leaf nodes). The filter on date_of_birth becomes the only condition that limits the scanned index range.

Let’s try to reverse the column order of the index. The following figure illustrates the scan if the index starts with country_id:

a screenshot of an illustration of the range scan if the index starts with country_id

According to the index tree, there is no need to scan the first leaf node because the branch node has shown that there is no employee for country_id 27 born after June 25th, 1979.

The tree traversal leads to the second leaf node. With the index, the conditions on date_of_birth and country_id limit the scanned index range. The traversal could end at the same leaf node.

Let’s see the execution plan when we have the index start with date_of_birth:

This is the execution plan when we have the index start with country_id:

The performance could be very different depending on the data and search criteria, but the bigger the range becomes, the bigger the performance difference will be.

Based on the example, we can have 2 things to remember:

Index for equality first, then ranges.
The most selective column should be at the leftmost position of the index.

To optimize query performance, it is very important to know the scanned index range, yet most database systems cannot show this in the execution plan. In order to detect this range, we need to use Access Predicate and Index Filter Predicate:

The Access Predicate: expresses the start and stop conditions of the leaf node traversal
Index Filter Predicate: is applied during the leaf node traversal only and does not contribute to the start and stop conditions and neither narrows the scanned range

In this series, we use PostgreSQL 12, and unfortunately, PostgreSQL execution plans do not show Access predicate and Index Filter Predicate separately — both show up as “Index Cond”. This means that we need to compare the “Index Cond” to the index definition in order to differentiate access predicates from index filter predicates.

Back to our example, for the index that starts with date_of_birth, date_of_birth is the Access Predicate that limits the scanned range, country_id is used only as a filter. Whereas for the index that starts with country_id, country_id and date_of_birth are both Access Predicates which limits the scanned range.

If you want to dig deeper on Access Predicates and Filter Predicates, check out these articles:

Closing

In this article, we have discovered how to use Function-based Index and Range scanning Index in order to not slow down our queries. As there is no dedicated function for Function-based Indexes, we can even use user-defined functions, just make sure that your function is determined first. We can also take advantage of inequalities when optimizing with Range scanning indexes, but remember to limit your scanning range to avoid getting slowed down by leaf node traversal.

For more cool technical stuff, check out our wiki. Until next time, everyone, happy coding!