Inverted File Flat Vector Indexes Online Rebuild

Data distribution shifts caused by new DML operations (inserts, updates, or deletes) that occur after the index has been created can significantly influence accuracy of the results. These changes require careful maintenance strategies to ensure sustained retrieval accuracy.

Before we explore the impact of data changes on IVF index quality, let’s briefly recall how an IVF (Inverted File Flat) index is constructed. The process begins by running K-means clustering on a sample set of vectors from the base table, with the resulting centroid vectors stored in the CENTROIDS table. For each data vector, the closest centroid is determined based on distance, and this assignment is recorded in the Centroid Partitions table. The effectiveness of the IVF index is largely shaped by choices made during the index creation. These parameters collectively determine how well the index partitions the data and, ultimately, how efficiently it can support high-quality nearest neighbor searches. For example, consider a data representing US cities. Based on the parameters set during the index creation, the data would be divided into several partitions as shown in the following figure:

Figure 6-14 IVF Clusters Corresponding to US Cities



After the IVF index is created, post-creation DML operations such as inserts, updates, and deletes can significantly alter the underlying data distribution. This often leads to partition skew, where some neighbor partitions become disproportionately large while others remain underpopulated. Since the centroids used during index creation remain static, they may no longer accurately represent the evolving dataset. This misalignment can cause newly added vectors to be misclassified into incorrect partitions, further degrading recall rates and overall index effectiveness.

For the example mentioned above, consider that new cities are added (by DML inserts) to the shared journal table after the index is created. The new inserts are added to the nearest cluster. In this case, you can see that all the new inserts are added to the same cluster, making one cluster significantly larger than the other clusters. New inserts are represented in the following figure:

Figure 6-15 DML Updates added to the Existing Index



Consider a similarity search corresponding to the search query vectors: "Redwood City" and "Gaithersburg". The top-3 matches corresponding to "Redwood City" would be San Francisco, Monterey, and Sacramento. And the top-3 matches corresponding to "Gaithersburg" would be Philadelphia, Trenton, and Annapolis. Note, these search results are from the original index represented in the figure above.

Figure 6-16 Sample Similarity Search Prior to Index Reorganization



While the search results corresponding to "Redwood City" matches the query, the search results corresponding to "Gaithersburg" are not accurately representing the query. One of the key reason is that the DML updates (shown in the table) are made to the centroid partitions table and the new data may or may not be accurately represented in the index. Also, the retrieval time could be higher due to the large cluster size.

To maintain optimal accuracy, it is essential to implement appropriate index reorganization that address these issues as the data evolves. Internal experiments confirm that the query performance per second (QPS) significantly improves after index reorganization.

Currently, the index rebuild is user initiated. You can start a rebuild using the following syntax:
ALTER INDEX ivf_index_name REBUILD ONLINE 
[PARALLEL <degree of parallelism>];

Refer Inverted File Flat ALTER INDEX for detailed syntax information and an example.

Reorganization is achieved by triggering a full rebuild of the index. Full rebuild is done in multiple phases. The changes that come in to the journal table during the static rebuild phase are applied to the index through a catchup phase.

After reorganizing the index, the same US cities index would include more clusters representing the new data added. You could also note that the search corresponding to "Gaithersburg" now returns Frederick, Washington DC and Arlington as the top 3 results, which are more accurate in comparison to the results prior to the index reorganization. The following plot represents the new clusters and also note that the size of the clusters being uniform after the reorganization.

Figure 6-17 Similarity Search After Index Reorganization