An index is a directory of data. A storage engine, in turn, is essentially the implementation of how data is stored, how indexes are built over that data, and how data is updated and queried, among other techniques.
Classification of Indexes#
Indexes can be classified from four perspectives.
- By "data structure": B+ tree index, Hash index, Full-text index.
- By "physical storage": clustered index (primary key index), secondary index (auxiliary index).
- By "field characteristics": primary key index, unique index, ordinary index, prefix index.
- By "number of fields": single-column index, composite index.
Classification by Data Structure#
From the perspective of data structure, common indexes in MySQL include B+ Tree index, HASH index, Full-Text index.
B+ Tree Index#
The B+ Tree index type is also the most commonly used index type by the MySQL storage engine.
When creating a table, the InnoDB storage engine selects different columns as indexes based on different scenarios:
- If there is a primary key, it will default to using the primary key as the clustered index key;
- If there is no primary key, it will choose the first unique column that does not contain NULL values as the clustered index key;
- In the absence of both, InnoDB will automatically generate an implicit auto-increment id column as the clustered index key.
Both the primary key index and secondary indexes are created as B+ Tree indexes by default.
B+ Tree is a type of multi-way tree where only leaf nodes store data, and non-leaf nodes only store indexes. Moreover, the data in each node is stored in primary key order, and each leaf node has two pointers pointing to the next and previous leaf nodes, forming a doubly linked list.
B+ Tree can store tens of millions of data with only 3-4 levels of height, which means that querying target data from a table with tens of millions of records requires at most 3-4 disk I/O operations. Therefore, compared to B-trees and binary trees, the greatest advantage of B+ Tree lies in its high query efficiency, as even with a large amount of data, the disk I/O for querying a single data point remains at 3-4 times.
Why does MySQL InnoDB choose B+ tree as the index data structure?
1. B+ Tree vs B Tree
A B+ Tree stores data only in leaf nodes, while a B Tree also stores data in non-leaf nodes. A B+ Tree non-leaf node therefore holds more keys per node, so the tree stays shorter for the same amount of data and fewer disk I/O operations are needed to reach a record. Additionally, B+ Tree leaf nodes are connected by a doubly linked list, which suits the range-based sequential scans common in MySQL, something a B Tree cannot do as efficiently.
2. B+ Tree vs Hash
Hash is extremely fast for equality queries, with a search complexity of O(1). However, hash tables are not suitable for range queries; they are better suited for equality queries, which is why B+ Tree indexes have a broader range of applicable scenarios compared to hash table indexes.
Classification by Physical Storage#
From the perspective of physical storage, indexes are divided into clustered indexes (primary key indexes) and secondary indexes (auxiliary indexes).
The differences between these two are:
- The leaf nodes of the B+ Tree of the primary key index store the actual data, with all complete user records stored in the leaf nodes of the primary key index's B+ Tree;
- The leaf nodes of the B+ Tree of the secondary index store primary key values, not the actual data.
Covering index: when the queried columns can all be found in the secondary index, there is no need to look up the primary key index; this is called a covering index.
Back to table: when the queried columns are not all in the secondary index, MySQL first searches the secondary index to find the matching leaf node and obtain the primary key value, then searches the primary key index to fetch the full row; this second lookup is called going back to the table.
Classification by Field Characteristics#
From the perspective of field characteristics, indexes are divided into primary key index, unique index, ordinary index, and prefix index.
1. Primary Key Index
A primary key index is an index established on the primary key field, usually created together when the table is created. A table can have only one primary key index, and the values of the index column cannot be NULL.
2. Unique Index
A unique index is established on UNIQUE fields, and a table can have multiple unique indexes. The values of the index column must be unique but can be NULL.
3. Ordinary Index
An ordinary index is established on ordinary fields, which are neither required to be primary keys nor required to be UNIQUE.
4. Prefix Index
A prefix index refers to an index established on the first few characters of a character-type field rather than on the entire field. Prefix indexes can be established on columns of types char, varchar, binary, or varbinary. The purpose of using a prefix index is to reduce the storage space occupied by the index and improve query efficiency.
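MySQL creates a prefix index with the column(length) syntax, e.g. CREATE INDEX idx_name ON product(name(5)). As a runnable sketch (using SQLite from Python's standard library, since it needs no server; SQLite has no native prefix indexes, so an expression index on substr approximates the idea, and the table and index names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
# Approximate a prefix index on the first 5 characters of name.
# (MySQL would write: CREATE INDEX idx_name ON product(name(5));)
cur.execute("CREATE INDEX idx_name_prefix ON product(substr(name, 1, 5))")
cur.executemany("INSERT INTO product(name) VALUES (?)",
                [("apple watch",), ("apple phone",), ("banana",)])

# A query filtering on the same expression can use the index.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM product WHERE substr(name, 1, 5) = 'apple'"
).fetchall()
print(plan)
```

The index stores only the 5-character prefix, which is the space-saving effect the prefix index aims for.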
Classification by Number of Fields#
From the perspective of the number of fields, indexes are divided into single-column index and composite index (compound index).
- An index established on a single column is called a single-column index, such as a primary key index;
- An index established on multiple columns is called a composite index.
Composite Index#
By combining multiple fields into one index, this index is referred to as a composite index. For example, combining the product_no and name fields in the product table into a composite index (product_no, name) can be done as follows:
CREATE INDEX index_product_no_name ON product(product_no, name);
In a composite index, the non-leaf nodes of the B+ Tree use the values of both fields as keys. When querying the composite index, the product_no field is compared first; where product_no values are equal, the name field is then compared.
In other words, the B+ Tree for querying the composite index is first sorted by product_no, and then, in cases where product_no is the same, it is sorted by the name field.
Thus, composite indexes follow the leftmost matching principle: index matching proceeds in leftmost-first order. If a query against a composite index does not follow the leftmost matching principle, the composite index becomes ineffective, and the advantage of fast lookups through the index is lost.
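The leftmost matching principle can be observed by comparing query plans. A minimal sketch with SQLite's EXPLAIN QUERY PLAN (chosen because it runs without a server; MySQL's EXPLAIN shows the analogous difference, and the sample table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, "
            "product_no TEXT, name TEXT, price REAL)")
cur.execute("CREATE INDEX index_product_no_name ON product(product_no, name)")
cur.executemany("INSERT INTO product(product_no, name, price) VALUES (?, ?, ?)",
                [("P001", "apple", 1.0), ("P002", "pear", 2.0)])

def plan(sql):
    # Concatenate the detail column of each plan row.
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column present: the composite index can be used.
good = plan("SELECT * FROM product WHERE product_no = 'P001' AND name = 'apple'")
# Leading column missing: the index is ineffective, a full scan results.
bad = plan("SELECT * FROM product WHERE name = 'apple'")
print(good)
print(bad)
```

The first plan searches via index_product_no_name, while the second falls back to scanning the whole table.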
Range Queries in Composite Indexes#
The leftmost matching principle of composite indexes continues matching to the right until it encounters a range query, at which point matching stops: the range-query column itself can still use the composite index, but the columns after it cannot. Strictly speaking, matching stops for open ranges such as > and <. For >=, <=, BETWEEN, and like prefix matching ('xx%'), however, the boundary involves an equality comparison, so matching does not stop there and the next column can still participate.
Index Condition Pushdown#
Index condition pushdown optimization allows records that do not satisfy part of the WHERE condition to be filtered out during the traversal of the composite index itself, reducing the number of back-to-table operations.
For a composite index (a, b), when executing the statement select * from table where a > 1 and b = 2, only the a field can be used for the index range scan. With index condition pushdown, after locating a record satisfying a > 1 in the composite index's B+ Tree, InnoDB checks whether b = 2 directly on the index record; if it does not match, the record is filtered out immediately, without going back to the table.
Index Distinctiveness#
The fields that are positioned earlier in the index have a higher probability of being used for index filtering. In actual development work, when establishing a composite index, it is important to place fields with high distinctiveness at the front, as these fields are more likely to be used by more SQL statements.
Distinctiveness is the number of distinct values in a column divided by the total number of rows in the table, with the formula: distinctiveness = count(distinct column_name) / count(*).
For example, the distinctiveness of gender is very low, making it unsuitable for creating an index or placing it at the front of a composite index. In contrast, fields like UUID are more suitable for indexing or being placed at the front of a composite index.
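The distinctiveness formula can be evaluated directly in SQL. A minimal sketch with made-up data (SQLite, for a runnable example without a MySQL server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, gender TEXT, uuid TEXT)")
rows = [(("M" if i % 2 else "F"), f"uuid-{i}") for i in range(100)]
cur.executemany("INSERT INTO users(gender, uuid) VALUES (?, ?)", rows)

# distinctiveness = count(distinct column) / count(*)
def selectivity(col):
    return cur.execute(
        f"SELECT CAST(COUNT(DISTINCT {col}) AS REAL) / COUNT(*) FROM users"
    ).fetchone()[0]

print(selectivity("gender"))  # low: only 2 distinct values across 100 rows
print(selectivity("uuid"))    # high: every value is distinct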
Sorting with Composite Index#
For the following SQL, how can we improve query efficiency using indexes?
select * from order where status = 1 order by create_time asc
A better approach is to create a composite index on the status and create_time columns, as this can avoid file sorting in the MySQL database.
When querying, if only the status index is used but the statement also requires sorting by create_time, it will require file sorting, which means that the Extra column in the SQL execution plan will show Using filesort.
Therefore, to take advantage of the ordered nature of the index, a composite index should be established on the status and create_time columns, so that the data filtered by status is already sorted by create_time, avoiding file sorting and improving query efficiency.
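The difference is visible in the query plan. A SQLite sketch (SQLite reports USE TEMP B-TREE FOR ORDER BY where MySQL's Extra column would show Using filesort; the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_order (id INTEGER PRIMARY KEY, "
            "status INTEGER, create_time TEXT)")
cur.executemany("INSERT INTO t_order(status, create_time) VALUES (?, ?)",
                [(1, "2024-01-01"), (0, "2024-01-02"), (1, "2024-01-03")])

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM t_order WHERE status = 1 ORDER BY create_time ASC"
before = plan(query)   # no usable index: an explicit sort step is needed
cur.execute("CREATE INDEX idx_status_create_time ON t_order(status, create_time)")
after = plan(query)    # index rows are already ordered by create_time
print(before)
print(after)
```

With the composite index in place, the sort step disappears from the plan, which is exactly the filesort avoidance described above.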
When is it necessary / unnecessary to create indexes?#
The greatest benefit of indexes is to improve query speed, but indexes also have drawbacks, such as:
- They require physical space, and the larger the number, the more space they occupy;
- Creating and maintaining indexes takes time, and this time increases with the amount of data;
- They can reduce the efficiency of insert, delete, and update operations because every time an index is modified, the B+ tree must be dynamically maintained to keep the index ordered.
When is it appropriate to use indexes?#
- When fields have uniqueness constraints, such as product codes;
- Fields that are frequently used in WHERE query conditions, as this can improve the overall query speed of the table. If the query condition involves multiple fields, a composite index can be established.
- Fields that are frequently used in GROUP BY and ORDER BY clauses, as this means that there is no need to sort again during querying since we already know that the records in the B+ Tree are sorted after establishing the index.
When is it unnecessary to create indexes?#
- Fields that are not used in WHERE conditions, GROUP BY, or ORDER BY clauses. The value of an index is to quickly locate data; if a field does not serve this purpose, it is generally unnecessary to create an index, as indexes occupy physical space.
- Fields with a large amount of duplicate data do not need indexes, such as a gender field with only male and female values. If male and female records are evenly distributed, searching for either value may return half the table. In such cases an index does not help: when MySQL's query optimizer detects that a value appears in a high percentage of a table's rows, it generally ignores the index and performs a full table scan.
- When the table data is too small, indexes are unnecessary;
- Fields that are frequently updated should not have indexes created on them, such as user balances in e-commerce projects, because frequently modifying indexed fields requires frequent rebuilding of the index to maintain the ordered nature of the B+ Tree, which can impact database performance.
What are some methods to optimize indexes?#
Prefix Index Optimization#
As the name suggests, a prefix index uses the first few characters of a string in a certain field to establish an index. Why do we need to use prefixes to create indexes?
Using a prefix index reduces the size of the indexed field, allowing more index values to be stored on a single index page, effectively improving the query speed of the index. When using large string fields as indexes, prefix indexes can help reduce the size of index entries.
However, prefix indexes have certain limitations, such as:
- They cannot be used with order by;
- They cannot be used as covering indexes;
Covering Index Optimization#
Suppose we only need to query the name and price of a product; what can we do to avoid back-to-table operations?
We can create a composite index with "product ID, name, price." If these data exist in the index, the query will not need to retrieve the primary index again, thus avoiding back-to-table operations.
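A runnable sketch of the covering-index effect (SQLite marks a covered query with USING COVERING INDEX, whereas MySQL's Extra column would show Using index; note that in InnoDB a secondary index leaf already contains the primary key, so an index on name and price also covers the product ID; schema and names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, "
            "name TEXT, price REAL, stock INTEGER)")
cur.execute("CREATE INDEX idx_name_price ON product(name, price)")
cur.execute("INSERT INTO product VALUES (1, 'apple', 1.5, 10)")

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# name and price are both in the index: no back-to-table lookup is needed.
covered = plan("SELECT name, price FROM product WHERE name = 'apple'")
# stock is not in the index: the full row must still be fetched.
not_covered = plan("SELECT name, price, stock FROM product WHERE name = 'apple'")
print(covered)
print(not_covered)
```

Adding one column outside the index (stock) is enough to force the extra row fetch, which is why covering indexes are designed around the exact column list a query needs.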
Primary Key Indexes Should Preferably Be Auto-Incrementing#
If we use an auto-incrementing primary key, then each new data entry will be added sequentially to the current index node position without needing to move existing data. When the page is full, a new page will be automatically allocated. Since each time a new record is inserted, it is an append operation that does not require moving data, this method of inserting data is very efficient.
If we use a non-auto-incrementing primary key, since each insertion of the primary key index value is random, each time new data is inserted, it may be placed in the middle of existing data pages, necessitating the movement of other data to accommodate the new entry, and sometimes even requiring data to be copied from one page to another. This situation is commonly referred to as page splitting. Page splitting can also cause a lot of memory fragmentation, leading to a non-compact index structure, which affects query efficiency.
Indexes Should Preferably Be Set to NOT NULL#
The first reason: if the indexed column contains NULL, it complicates the optimizer's index selection process, making it harder to optimize, as columns that can be NULL complicate indexing, index statistics, and value comparisons. For example, during index statistics, count will omit rows with NULL values.
The second reason: NULL values are meaningless but occupy physical space, leading to storage space issues. When InnoDB stores records, if the table contains fields that allow NULL, then at least 1 byte of space will be used in the row format to store the NULL value list.
Preventing Index Ineffectiveness#
Situations that cause index ineffectiveness:
- When we use left or both-sided fuzzy matching, such as like '%xx' or like '%xx%', the index becomes ineffective;
- When we perform calculations, apply functions, or do type conversions on indexed columns in query conditions, the index becomes ineffective;
- Composite indexes must be used following the leftmost matching principle; otherwise they become ineffective;
- In the WHERE clause, if the condition column before OR is an indexed column but the condition column after OR is not, the index becomes ineffective.
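The second case, applying a function to an indexed column, is easy to reproduce. A SQLite sketch (MySQL behaves the same unless a matching functional index exists; table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE INDEX idx_name ON users(name)")
cur.execute("INSERT INTO users(name) VALUES ('Alice'), ('Bob')")

def plan(sql):
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

direct = plan("SELECT id FROM users WHERE name = 'Alice'")          # uses idx_name
wrapped = plan("SELECT id FROM users WHERE upper(name) = 'ALICE'")  # index ineffective
print(direct)
print(wrapped)
```

Wrapping the column in upper() hides it from the index, so the second plan degrades to a full table scan.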
Which count performs best?#
Ordered by cost from lowest to highest: COUNT(*) = COUNT(1) < COUNT(field) (field has a secondary index) < COUNT(primary key field) (table has only the primary key index) < COUNT(field) (field has no secondary index).
What is count?#
count() is an aggregate function, and its parameters can be field names or any other expressions. The function counts how many records meet the query conditions where the specified parameter is not NULL.
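The NULL-skipping behavior is easy to verify. A minimal SQLite sketch with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO t(name) VALUES (?)", [("a",), (None,), ("b",)])

count_star, count_one, count_name = cur.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(name) FROM t"
).fetchone()
# COUNT(*) and COUNT(1) count every row; COUNT(name) skips the NULL row.
print(count_star, count_one, count_name)
```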
What is the execution process of count(primary key field)?#
When counting how many records there are using the count function, the MySQL server layer maintains a variable called count.
The server layer loops through InnoDB to read a record, and if the parameter specified by the count function is not NULL, it increments the count variable by 1 until all records that meet the query are read, at which point it exits the loop. Finally, the value of the count variable is sent to the client.
InnoDB uses B+ trees to store records, which are divided into clustered indexes and secondary indexes based on the type of index. The difference is that the leaf nodes of clustered indexes store actual data, while the leaf nodes of secondary indexes store primary key values, not actual data.
If the table only has a primary key index and no secondary index, then InnoDB will loop through the clustered index, returning the records read to the server layer, and then read the id values from the records to check if they are NULL. If they are not NULL, it increments the count variable by 1.
However, if the table has a secondary index, the object of InnoDB's loop traversal will not be the clustered index but the secondary index.
This is because the same number of secondary index records can occupy less storage space than clustered index records, so the secondary index tree is smaller than the clustered index tree. Thus, the I/O cost of traversing the secondary index is lower than that of traversing the clustered index, which is why the "optimizer" prefers to choose the secondary index.
What is the execution process of count(1) and count(*)?#
The execution process of count(1) and count(*) is basically the same, because MySQL internally converts count(*) into count(0).
InnoDB loops through the index tree (the clustered index if there is no secondary index, otherwise the smallest secondary index), returning the records read to the server layer, but it does not read any field values from them, because the parameter of the count function is 1, not a field, so there is no need to read field values from the records. The parameter 1 is clearly not NULL, so the server layer increments the count variable by 1 each time it reads a record from InnoDB.
It can be seen that count(1) is one step shorter than count(primary key field) because it does not need to read field values from the records, so it is generally said that count(1) executes slightly more efficiently than count(primary key field).
What is the execution process of count(field)?#
The execution efficiency of count(field) is the worst when the field has no secondary index, because counting then requires a full table scan of the clustered index. If the field has a secondary index, InnoDB can traverse that smaller index instead.
Differences between MyISAM and InnoDB#
When using the MyISAM engine, executing the count function takes only O(1) time, because each MyISAM table stores a row_count value in its metadata, kept consistent by table-level locks, so the count result can be read directly. InnoDB, by contrast, cannot maintain such a counter: because of MVCC, the number of rows visible differs between concurrent transactions, so it must count records on the fly.
How to optimize count(*)?#
When counting records in large tables, count executes very poorly. For example, the t_order table has over 12 million records and a secondary index, yet executing select count(*) from t_order takes about 5 seconds!
How can we improve efficiency?
First method: Approximate values
If your business does not require a precise count (for example, a search engine showing an approximate number of search results), we can use the show table status or explain command; the estimated rows value they report serves as the approximate count without scanning the table.
Second method: Store count values in an additional table
If you want to obtain the exact total count of records in a table, you can save this count value in a separate count table.
When inserting a record into the data table, increment the count field in the count table by 1; when deleting a record, decrement it. This means insert and delete operations must also maintain this count table.
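One way to keep such a count table in sync is with triggers. A SQLite sketch (MySQL trigger syntax differs slightly; the table and trigger names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_order (id INTEGER PRIMARY KEY, amount REAL)")
cur.execute("CREATE TABLE t_order_count (cnt INTEGER NOT NULL)")
cur.execute("INSERT INTO t_order_count(cnt) VALUES (0)")

# Maintain the counter on every insert and delete.
cur.execute("""CREATE TRIGGER trg_order_ins AFTER INSERT ON t_order
               BEGIN UPDATE t_order_count SET cnt = cnt + 1; END""")
cur.execute("""CREATE TRIGGER trg_order_del AFTER DELETE ON t_order
               BEGIN UPDATE t_order_count SET cnt = cnt - 1; END""")

cur.executemany("INSERT INTO t_order(amount) VALUES (?)", [(9.9,), (19.9,), (29.9,)])
cur.execute("DELETE FROM t_order WHERE amount > 20")
cnt = cur.execute("SELECT cnt FROM t_order_count").fetchone()[0]
print(cnt)  # reading the counter replaces an expensive COUNT(*)
```

The trade-off is that every write now touches two tables, so this suits read-heavy workloads where the exact total is queried often.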