10.4. Managing Indexes

Each index that the directory uses is composed of a table of index keys and matching entry ID lists. This entry ID list is used by the directory to build a list of candidate entries that may match a client application's search request; Section 10.1, “About Indexes” describes each kind of Directory Server index. The Directory Server secondary index structure greatly improves write and search operations.
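
As a rough conceptual illustration (not Directory Server code; the attribute values and entry IDs below are invented), an equality index can be pictured as a mapping from index keys to entry ID lists, from which the candidate list for a filter is built:

    # Hypothetical model of an equality index: each index key maps to a
    # list of entry IDs (the "ID list").  This is only an illustration of
    # the concept, not the server's actual data structures.
    cn_index = {
        "john smith": [4, 97, 1523],   # entries whose cn is "john smith"
        "jane doe":   [12],
        "sam carter": [88, 4021],
    }

    def candidates(index, key):
        """Return the candidate entry IDs for an equality filter on key."""
        return index.get(key, [])

    # A search for (cn=john smith) only needs to examine three candidate
    # entries instead of every entry in the database.
    print(candidates(cn_index, "john smith"))   # -> [4, 97, 1523]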

10.4.1. Indexing Performance

While previous versions of Directory Server achieved extremely high read performance, write performance was limited by the number of bytes per second that could be written into the storage manager's transaction log file. Large log files were generated for each LDAP write operation; in fact, the data written to the log files could easily be 100 times the corresponding number of bytes changed in the Directory Server. The majority of the log file content was related to index changes (ID insert and delete operations).

The secondary index structure was separated into two levels in the old design:

  • The ID list structures, which were the province of the Directory Server backend and opaque to the storage manager.

  • The storage manager structures (Btrees), which were opaque to the Directory Server backend code.

Because it had no insight into the internal structure of the ID lists, the storage manager had to treat ID lists as opaque byte arrays. From the storage manager's perspective, when the content of an ID list changed, the entire list had changed. For a single ID that was inserted or deleted from an ID list, the corresponding number of bytes written to the transaction log was the maximum configured size for that ID list, about 8 kilobytes. Also, every database page on which the list was stored was marked as dirty, since the entire list had changed.

In the redesigned index, the storage manager has visibility into the fine-grain index structure, which optimizes transaction logging so that only the bytes actually changed need to be logged for any given index modification. ID list semantics are now provided by the storage manager, Berkeley DB. The Berkeley DB API was enhanced to support the insertion and deletion of individual IDs stored against a common key, with support for duplicate keys, and an optimized mechanism for retrieving the complete ID list for a given key.
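
The following Python sketch illustrates the kind of interface this amounts to; the class and method names (IdListStore, insert_id, delete_id, get_id_list) are invented for illustration and are not the actual Berkeley DB API:

    # Hypothetical illustration of per-ID index maintenance: each ID is kept
    # as a separate duplicate record under a common key, so inserting or
    # deleting one ID touches only that record, not the whole ID list.
    from collections import defaultdict

    class IdListStore:
        def __init__(self):
            # key -> set of entry IDs (duplicate records under one key)
            self._store = defaultdict(set)

        def insert_id(self, key, entry_id):
            # Only the single ID is written; the rest of the list is untouched.
            self._store[key].add(entry_id)

        def delete_id(self, key, entry_id):
            self._store[key].discard(entry_id)

        def get_id_list(self, key):
            # Optimized retrieval of the complete ID list for a key.
            return sorted(self._store[key])

    store = IdListStore()
    store.insert_id("objectclass=inetorgperson", 42)
    store.insert_id("objectclass=inetorgperson", 7)
    print(store.get_id_list("objectclass=inetorgperson"))   # -> [7, 42]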

The storage manager has direct knowledge of the application's intent when changes are made to ID lists, resulting in several improvements to ID list handling:

  • For long ID lists, the number of bytes written to the transaction log for any update to the list is significantly reduced, from the maximum ID list size (8 kilobytes) to twice the size of one ID (4 bytes), as the arithmetic sketch after this list illustrates.

  • For short ID lists, storage efficiency, and in most cases performance, is improved because only the storage manager metadata needs to be stored, not the ID list metadata.

  • The average number of database pages marked as dirty per ID insert or delete operation is very small because a large number of duplicate keys will fit into each database page.
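
As a rough back-of-the-envelope check of the first point, assuming the 4-byte entry IDs and 8-kilobyte maximum ID list size quoted above:

    # Transaction-log bytes per single ID insert or delete, using the
    # figures quoted in this section (an approximation, not measured data).
    ID_SIZE = 4                    # bytes per entry ID
    MAX_IDLIST_SIZE = 8 * 1024     # old design logged the whole opaque list

    old_bytes_logged = MAX_IDLIST_SIZE   # entire ID list rewritten
    new_bytes_logged = 2 * ID_SIZE       # roughly twice the size of one ID

    print(old_bytes_logged / new_bytes_logged)   # -> 1024.0, about 1000x less log traffic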

10.4.2. Search Performance

For each entry ID list, there is a size limit that is globally applied to all index keys managed by the server. In previous versions of Directory Server, this limit was called the All IDs Threshold. Because maintaining large ID lists in memory can affect performance, the All IDs Threshold set a limit on how large a single entry ID list could get. When a list hit a certain pre-determined size, the search would treat it as if the index contained the entire directory.

The difficulty in setting the All IDs Threshold hurt performance. If the threshold was too low, too many searches examined every entry in the directory. If it was too high, too many large ID lists had to be maintained in memory.

The problems addressed by the All IDs Threshold are no longer present because of the efficiency of entry insertion, modification, and deletion in the Berkeley DB design. The All IDs Threshold is removed for database write operations, and every ID list is now maintained accurately.

Since loading a long ID list from the database can significantly reduce search performance, the configuration parameter, nsslapd-idlistscanlimit, sets a limit on the number of IDs that are read before a key is considered to match the entire primary index. nsslapd-idlistscanlimit is analogous to the All IDs Threshold, but it only applies to the behavior of the server's search code, not the content of the database.

When the server uses indexes while processing a search operation, it is possible that one index key matches a large number of entries. For example, consider a search for objectclass=inetorgperson in a directory that contains one million inetorgperson entries. When the server reads the inetorgperson key in the objectclass index, it finds one million matching entries. In cases like this, it is more efficient simply to conclude during the index lookup phase of search processing that all the entries in the database match the query. Subsequent search processing then scans the entire database content, checking whether each entry matches the search filter. The time spent fetching the index keys is wasted; the search operation is likely to be processed more efficiently by omitting the index lookup.

In Directory Server, when examining an index, if more than a certain number of entries are found, the server stops reading the index and marks the search as unindexed for that particular index.
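
A minimal sketch of this decision, assuming a helper that reads IDs from an index key incrementally (the names and structures here are illustrative, not the server's internal code):

    # Illustrative sketch of the idlistscanlimit check: when more than
    # scan_limit IDs are found under one key, the key is treated as
    # unindexed and every entry is considered a candidate instead.
    ALLIDS = None   # sentinel meaning "assume every entry matches"

    def candidate_ids(index, key, scan_limit=4000):
        ids = []
        for entry_id in index.get(key, []):   # read IDs incrementally
            ids.append(entry_id)
            if len(ids) > scan_limit:
                return ALLIDS                 # stop reading the index
        return ids

    objectclass_index = {"inetorgperson": range(1, 1_000_001)}
    result = candidate_ids(objectclass_index, "inetorgperson")
    print("unindexed" if result is ALLIDS else f"{len(result)} candidates")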

The threshold number of entries is called the idlistscanlimit and is configured with the nsslapd-idlistscanlimit configuration attribute. The default value is 4000, which is designed to give good performance for a common range of database sizes and access patterns. Typically, it is not necessary to change this value. However, in rare circumstances search performance may be improved with a different value. For example, lowering the value improves performance for searches that would otherwise eventually hit the default limit of 4000, but it might reduce performance for other searches that benefit from indexing. Conversely, increasing the limit could improve performance for searches that were previously hitting the limit; with a higher limit, these searches can benefit from indexing where previously they did not.
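
As an illustration only, the following python-ldap snippet shows one way to change the limit over LDAP; the connection URL, the bind credentials, and the assumption that nsslapd-idlistscanlimit lives on the cn=config,cn=ldbm database,cn=plugins,cn=config entry should be verified against your own deployment:

    # Sketch: raising nsslapd-idlistscanlimit with python-ldap.
    # The URL, credentials, and configuration DN are assumptions to adapt.
    import ldap

    LDBM_CONFIG_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"

    conn = ldap.initialize("ldap://localhost:389")
    conn.simple_bind_s("cn=Directory Manager", "secret")   # use real credentials
    conn.modify_s(LDBM_CONFIG_DN,
                  [(ldap.MOD_REPLACE, "nsslapd-idlistscanlimit", [b"8000"])])
    conn.unbind_s()

The same pattern can be used for other ldbm configuration attributes on that entry, such as nsslapd-dbcachesize, mentioned in Section 10.4.3 below.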

For more information on search limits for the server, refer to Section 10.1.3, “Overview of the Searching Algorithm”.

10.4.3. Backwards Compatibility and Migration

While current versions of Directory Server can still operate on the old database design, only the new design is supported for this and later releases of Directory Server.

Upon startup, the server reads the database version from the DBVERSION file, which contains the text Netscape-ldbm/6.2 (old database format), Netscape-ldbm/7.1 (new database format), or bdb/4.2/libback-ldbm (new database format). If the file indicates that the old format is used, then the old code is selected for the database. Because the DBVERSION file records the database version for each backend, it is possible to have different database formats for different individual backends, but using the old database format is not recommended.
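
A simplified sketch of this version check is shown below; the version strings are those listed above, while the file path and the surrounding selection logic are assumptions made only for illustration:

    # Simplified, per-backend DBVERSION check using the version strings
    # quoted above.  The example path is an assumption, not a documented layout.
    OLD_FORMATS = {"Netscape-ldbm/6.2"}
    NEW_FORMATS = {"Netscape-ldbm/7.1", "bdb/4.2/libback-ldbm"}

    def database_format(dbversion_path):
        with open(dbversion_path) as f:
            version = f.readline().strip()
        if version in OLD_FORMATS:
            return "old"
        if version in NEW_FORMATS:
            return "new"
        raise ValueError("unrecognized database version: " + version)

    # e.g. database_format("/var/lib/dirsrv/slapd-example/db/userRoot/DBVERSION")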

All databases must be migrated to Directory Server 8.0 when the system is upgraded. Migration is supported from Directory Server 6.x versions; for releases earlier than 6.x, the databases must be dumped and Directory Server installed fresh. Migrating databases is covered in the Directory Server Installation Guide.

Also, the index sizes can be larger than in older releases, so you may want to increase your database cache size. To reconfigure the cache size, look up the nsslapd-dbcachesize entry in the Directory Server Configuration, Command, and File Reference.

