SQL Optimization Series - Part 1 - Get to know SQL Indexes | By hungnguyen028

SQL

A better way to debug your PHP code

Khoa Pham

November 16, 2022

Discover iOS Dark Mode Programming

thomas.le

November 1, 2022

About the Series

SQL optimization is the process of tuning SQL queries to improve server performance. The fundamental idea is to reduce the amount of time it takes for a user to retrieve a response after performing a query, as well as the number of resources required to process one.

For software teams that rely on relational databases, the ability to do SQL performance tuning is much needed. Follow us on this advanced SQL series to pocket yourself some handy query tips.

Introduction

Indexes are a common way to enhance database performance. Small, fast, and optimized for quick lookups, an index allows the database server to find and retrieve specific rows much faster than it could without one.

With examples, we will provide you with a closer look at SQL indexes.

SQL Index 101

This is a query to find employees by their id:

And its query plan:

As shown in the query plan, Seq Scan (or Full Table Scan) has been used for the query. Thus, the system would have to scan the entire employee table to find all matching entries. This is clearly an inefficient method. In this case, let’s try adding a new index:

And voila:

You can see that the new index has reduced significantly the execution time of the query. The database does not need to scan the entire table for the matching entries but only in the index.

But how does the index improve its performance?

An index was built using a combination of B-Tree and Doubly Linked List. B-Tree is a data structure that allows us to quickly search for the matching node (O(log n)). The Doubly Linked List builds the logical order between the leaf nodes. Let’s take a look at the following tree:

The figure above illustrates a search for key 57.

The tree traversal starts at the root node. Each entry is processed in ascending order until a value is greater than or equal to 57, it is entry 64 in the figure.
The database follows the reference to the brand node and repeats the procedure until the tree traversal reaches the leaf node.
From the leaf node, the database will access the table row based on ROWID (a reference to real data).

With the index, the database doesn’t need to scan the entire table, it quickly skips the entries which do not match the criteria and therefore reduces the number of disk accesses.

“An index makes the query fast” is the most basic explanation of an index, but despite the efficiency of the index, there are still cases where index lookup doesn’t work as fast as expected. These poor index incidents usually occur when an index is created on a column that doesn’t provide data manipulation, or on multiple columns that slow down queries instead of speeding them up.

Concatenate index

Let’s consider this is a company merger, employees can have the same id while being in different countries. As a result, the employee_id is no longer unique across multiple countries.

Now that the new primary key contains both country_id and employee_id, the index for it would be:

To find a specific employee, we use the following query:

After modifying the index, the database uses the new index for the lookup. The execution time looks good. Whenever we use the full primary keys in the where clause, the index will be used for finding matching entries. But what if we use 1 column only?

The execution plan reveals that the database does not use the index, it performs a Full Table Scan instead. As a result, the database reads the entire table and evaluates every row.

The execution time will grow with the table size. The danger of this is that it is often fast enough in small data sets like the development environment, yet would cause serious performance problems in production.

The ordering of a two-column index is like the ordering of a telephone directory: it is sorted by surname first, then by the first name. That means that a two-column index does not support searching by the second column alone.

Let’s look at the B-tree of the two-column index:

a b-tree of the two-column index for employee_id and country_id in an SQL database

The picture above shows that the entries for country_id = 20 are not stored next to each other, and there’s even no country_id = 20 in the tree (only in leaf nodes). Therefore, the tree is now useless for the query and cannot be used for the lookup.

We can create another index on country_id to improve the query speed. However, there is a better solution if we assume that searching only employee_id does not make sense.

The trick is to reverse the index column order so that country_id is in the first position so that the tree will be sorted by country_id:

The execution plan confirms that the database uses the “reversed” index. The country_id is not unique anymore so the database must follow the leaf nodes in order to find all matching entries (500 rows).

A database can use a concatenated index when searching with the leading columns. An index with three columns can be used:

When searching for the first column
When searching with the first two columns together
When searching using all columns.

To define an optimal index we must understand more than just how indexes work, we must also know how the application queries the data. This means we have to know the column combinations that appear in the where clause.

Slow index

Modifying the index might also bring unpleasant side effects. At first glance, we can sense the following query will cause the application to slow down:

There are 2 possible reasons:

The query is slow because the index lookup returns many ROWIDs, one for each employee and the database must retrieve them individually. It combines 2 factors that make an index slow: the database reads a wide index range and has to fetch many rows individually.
The second reason could be the database has to scan all the rows in the table to find matching entries.

The query plan is chosen by the planner/optimizer. The query optimizer is a component of a database that transforms an SQL statement into an execution plan. There are 2 types of query optimizers: Cost-based optimizers and Rule-based optimizers.

Cost-based optimizers use statistics of tables, columns, tree deep, number of leaf nodes, etc to give the optimal execution plan while rule-based optimizers use a set of rules to determine how to execute a query.

However, in this case, neither of the query plans is good for the query. The example cannot hide the fact that a new index is the best solution.

The index access delivers one row only. The database has to fetch only that row from the table and this query plan is definitely faster than the others. Using an index does not automatically mean a statement is executed in the best way possible.

Briefly

Everyone knows that “An index makes the query fast,” but that’s not enough to optimize your queries. We must also know how an index works internally, how to use it properly, and the most important thing that we should know is how the application queries the data.

Until next time, everyone, happy coding!

Tutorials

SQL Optimization Series - Part 1 - Get to know SQL Indexes