Elasticsearch is a robust, distributed search engine built on Apache Lucene. Its ability to perform full-text searches, structured searches, and real-time analytics stems from its efficient indexing mechanism. Understanding Elasticsearch indexing is crucial for optimizing search performance and ensuring efficient query execution. In this blog, we will explore Elasticsearch’s indexing architecture, types of indexes, and best practices for effective indexing, including the new Index Sorting feature introduced in Elasticsearch 6.0.

[Figure: Elasticsearch overview]

1. Overview of Indexing in Elasticsearch

In Elasticsearch, an index is a collection of documents that share a similar structure. Indexing means storing and organizing those documents so they can be searched and retrieved quickly. Elasticsearch combines an inverted index with other data structures to deliver fast full-text search and real-time analytics.

1.1. Inverted Index

  • Definition: An inverted index is a data structure mapping terms (words) to their locations within a set of documents. It allows Elasticsearch to quickly locate documents containing specific terms.
  • Components: The inverted index consists of a term dictionary and, for each term, a postings list (the IDs of the documents containing that term). Looking up a term in the dictionary leads directly to its postings list.
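
As a rough illustration of the term-to-documents mapping described above, here is a minimal Python sketch of a toy inverted index. It is not how Lucene stores its index; the function name `build_inverted_index` and the sample documents are made up for this example, and tokenization is a plain lowercase split.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The dictionary keys are the terms; the values are the postings lists.
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick dog",
}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["quick"])  # [1, 3]
```

Finding the documents for a term is then a single dictionary lookup rather than a scan over every document.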

1.2. Data Structures

  • Document: The fundamental unit of information in Elasticsearch. Documents are JSON objects containing fields with various data types.
  • Field: A key-value pair within a document. Fields are indexed to facilitate search.
  • Shard: An index is divided into shards, which are basic units of storage and search. Shards can be distributed across nodes in a cluster.
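
To make the shard idea concrete: Elasticsearch picks a primary shard for each document by hashing a routing value (the document ID by default) and taking it modulo the number of primary shards. The sketch below imitates that with `zlib.crc32` as a stand-in hash, which is an assumption for illustration only; Elasticsearch uses its own hash function, and `pick_shard` is a hypothetical helper.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Return the shard number for a routing key.

    zlib.crc32 is only a stand-in; Elasticsearch uses a different hash.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = [f"player-{i}" for i in range(10)]
assignment = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
for doc_id, shard in assignment.items():
    print(doc_id, "-> shard", shard)
```

Because the shard depends on the modulo, the same ID always lands on the same shard as long as the number of primary shards stays the same.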

2. Types of Indexes in Elasticsearch

Elasticsearch indexes are commonly organized in a few different ways, each suited to a different use case.

2.1. Standard Index

  • Definition: The default index type, designed for general-purpose search and analytics.
  • Features: Supports full-text search, structured search, and aggregations. Utilizes inverted indexing for fast text search.

2.2. Time-Based Index

  • Definition: Designed for time-series data, such as logs or metrics.
  • Features: Typically named with a date or time-based convention, allowing efficient data management and retention. Supports efficient queries and aggregations over time ranges.
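
The date-based naming convention can be sketched in a few lines of Python. The `logs` prefix, the `YYYY.MM.dd` suffix format, and both helper names are assumptions chosen for this example; the retention check only selects names and does not call Elasticsearch.

```python
from datetime import date, datetime, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Build a name such as logs-2024.01.15 for the given day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indexes(names, prefix, today, keep_days):
    """Return the index names under `prefix` whose date is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue  # belongs to another index family
        suffix = name[len(prefix) + 1:]
        index_day = datetime.strptime(suffix, "%Y.%m.%d").date()
        if index_day < cutoff:
            expired.append(name)
    return expired


names = ["logs-2024.01.01", "logs-2024.01.14", "metrics-2023.01.01"]
print(daily_index_name("logs", date(2024, 1, 15)))          # logs-2024.01.15
print(expired_indexes(names, "logs", date(2024, 1, 15), 7))  # ['logs-2024.01.01']
```

With one index per day, retention becomes a matter of deleting whole indexes instead of deleting individual documents.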

2.3. Alias-Based Index

  • Definition: An alias is a virtual index name that points to one or more underlying indexes.
  • Features: Enables management and querying of multiple indexes as a single entity. Useful for rolling indices, zero-downtime index upgrades, and more.
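
A zero-downtime switch is typically done by moving the alias from the old index to the new one in a single request to the `_aliases` endpoint. The sketch below only builds the request body; the alias and index names (`scores`, `scores_v1`, `scores_v2`) and the helper name are hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` to `new_index`."""
    return {
        "actions": [
            # Both actions are sent in one request, so they apply together.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("scores", "scores_v1", "scores_v2")
print(json.dumps(body, indent=2))
```

Clients keep querying the alias name and never need to know which underlying index currently serves it.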

3. Indexing Process in Elasticsearch

Indexing involves several steps: data ingestion, mapping, and analysis.

3.1. Data Ingestion

  • Definition: Adding documents to an index using methods like RESTful APIs, Logstash, and Beats.
  • APIs: Elasticsearch provides APIs for indexing documents, such as the Index API for single document indexing and the Bulk API for batch operations.
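
The Bulk API expects newline-delimited JSON: one action line followed by one document line per operation, with a trailing newline at the end. Here is a minimal sketch that builds such a body in Python without sending it. The helper name and sample documents are made up, and Elasticsearch 6.x additionally expects a document type, either in the URL or as `_type` in the action line, which this sketch leaves out.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize (doc_id, source) pairs into a newline-delimited bulk body."""
    lines = []
    for doc_id, source in docs:
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format requires a final newline after the last line.
    return "\n".join(lines) + "\n"


docs = [
    ("1", {"playerid": "p1", "points": 120, "game": "dragons-lair"}),
    ("2", {"playerid": "p2", "points": 95, "game": "dragons-lair"}),
]
bulk_body = build_bulk_body("scores", docs)
print(bulk_body)
```

Batching many documents into one request like this usually reduces per-request overhead compared with indexing them one at a time.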

3.2. Mapping

  • Definition: Defines the structure of documents and field types within an index. Specifies how fields are indexed and stored.
  • Dynamic Mapping: Automatically creates mappings for new fields based on data. Useful for evolving schemas but can be controlled or overridden.
  • Explicit Mapping: Allows control over indexing behavior, including field types, analyzers, and other properties.
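
To give a feel for what dynamic mapping does, the sketch below guesses a field mapping from a JSON value, loosely following Elasticsearch's default rules (whole numbers become `long`, strings become `text` with a `keyword` sub-field). It is a simplified approximation: date detection, arrays, and nulls are deliberately left out, and `infer_field_mapping` is a hypothetical helper, not an Elasticsearch API.

```python
def infer_field_mapping(value):
    """Guess a mapping for one JSON value, roughly as dynamic mapping would."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"properties": {k: infer_field_mapping(v) for k, v in value.items()}}
    raise TypeError(f"unsupported value in this sketch: {value!r}")


doc = {"points": 120, "playerid": "p1", "ratio": 0.5, "active": True}
mapping = infer_field_mapping(doc)
print(mapping["properties"]["points"])  # {'type': 'long'}
```

An explicit mapping would override these guesses, for example by declaring `playerid` as a plain `keyword` field so it is not analyzed as text.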

3.3. Analysis

  • Definition: Breaks down text into terms and tokens for indexing and searching. Includes tokenization, filtering, and normalization.
  • Analyzers: Elasticsearch uses analyzers for text processing. Built-in analyzers include the Standard Analyzer and the Whitespace Analyzer, and you can define custom analyzers that combine your own tokenization and filtering rules.
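
The difference between analyzers is easiest to see on a sample string. The two functions below are rough Python approximations, not the real implementations: the Whitespace Analyzer is modeled as a plain whitespace split, and the Standard Analyzer as "split on non-word characters, then lowercase", which ignores the Unicode segmentation rules the real analyzer follows.

```python
import re


def whitespace_analyze(text):
    """Approximate the Whitespace Analyzer: split on whitespace, keep case."""
    return text.split()


def simple_standard_analyze(text):
    """Very rough stand-in for the Standard Analyzer: word runs, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


text = "Quick-Brown Fox, 2 Dogs!"
print(whitespace_analyze(text))       # ['Quick-Brown', 'Fox,', '2', 'Dogs!']
print(simple_standard_analyze(text))  # ['quick', 'brown', 'fox', '2', 'dogs']
```

A search for `fox` matches the second token stream but not the first, which is why the analyzer chosen for a field shapes which queries can hit it.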

4. Best Practices for Indexing in Elasticsearch

To optimize performance and querying efficiency, follow these best practices:

4.1. Design Indexes for Query Patterns

  • Understand Queries: Analyze query types and design indexes to support those queries. Choose appropriate fields to index and specify analyzers accordingly.

4.2. Use Appropriate Sharding

  • Shard Sizing: Select the number of primary shards based on data size and query needs, and plan it up front, because it is set when the index is created. Too few shards can cause bottlenecks, while too many add per-shard overhead.
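
One simple way to reason about shard count is to divide the expected data size by a target size per shard and round up. The target size is an assumption you supply (30 GB below is purely illustrative, not a figure from this post), and `estimate_primary_shards` is a hypothetical helper.

```python
import math


def estimate_primary_shards(total_data_gb: float, target_shard_gb: float) -> int:
    """Round up total size / target shard size, with a minimum of one shard."""
    if total_data_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_data_gb / target_shard_gb))


print(estimate_primary_shards(200, 30))  # 7
print(estimate_primary_shards(10, 30))   # 1
```

Treat the result as a starting point and adjust it after measuring query latency and indexing throughput on real data.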

4.3. Optimize Mappings

  • Explicit Mapping: Define mappings to control indexing behavior. Specify field types, enable or disable indexing for specific fields, and use appropriate analyzers.

4.4. Manage Index Lifecycle

  • Rolling Indices: Use rolling indices for time-based data to manage data retention and performance. Implement index lifecycle management (ILM) policies to automate index management tasks.
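
As a sketch of what an ILM policy can look like, the function below builds a policy body with a rollover action in the hot phase and a delete phase. The thresholds (`7d`, `50gb`, `30d`) and the helper name are assumptions for illustration, and ILM is only available in Elasticsearch versions that ship it (later than 6.0), so check your version before relying on it.

```python
import json


def build_ilm_policy(rollover_max_age, rollover_max_size, delete_after):
    """Build a body for PUT _ilm/policy/<name>: roll over, then delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index when either limit is reached.
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    # Age is counted from rollover for rolled-over indexes.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = build_ilm_policy("7d", "50gb", "30d")
print(json.dumps(policy, indent=2))
```

Once a policy like this is attached to an index template, new rolling indices pick it up without manual intervention.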

4.5. Monitor and Tune Performance

  • Monitoring: Regularly monitor Elasticsearch cluster performance using tools like Kibana or Elastic Stack monitoring features.
  • Tuning: Adjust shard allocation, indexing settings, and mappings based on monitoring data to optimize performance.

5. Introducing Index Sorting in Elasticsearch 6.0

Elasticsearch 6.0 introduces Index Sorting, a feature that stores the documents of an index on disk in a specified order. This improves performance for queries that sort on the same fields.

5.1. Index Sorting in Lucene

  • Lucene’s IndexSorter: Lucene originally offered IndexSorter, a tool for sorting documents offline. It let users reorder the documents of an index on disk by specified criteria. A sorted index supports early termination: a query can stop as soon as it has collected the required number of documents, which improves search response times.

  • Lucene Improvements: By default, Lucene stores documents in the order they are received, so finding the top results means visiting every matching document in every segment. A later merge-policy-based approach sorted documents at merge time, which made index sorting practical for indices that are still being written to.

[Figure: Lucene benchmark]

  • Performance Considerations: Sorting at merge time was costly and reduced indexing performance. To address this, sorting was moved to flush time, which improved indexing throughput in benchmarks by 65%.
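
The benefit of early termination can be illustrated with a small simulation. The sketch below treats a segment as a plain Python list of scores and counts how many documents each strategy visits; it is a simplification of what Lucene does, and both function names are made up for this example.

```python
import heapq


def top_k_unsorted(segment, k):
    """Visit every document, keeping the k highest scores in a min-heap."""
    visited = 0
    heap = []
    for points in segment:
        visited += 1
        if len(heap) < k:
            heapq.heappush(heap, points)
        elif points > heap[0]:
            heapq.heapreplace(heap, points)
    return sorted(heap, reverse=True), visited


def top_k_sorted(segment_sorted_desc, k):
    """Segment is already sorted descending, so stop after k documents."""
    top = []
    visited = 0
    for points in segment_sorted_desc:
        visited += 1
        top.append(points)
        if len(top) == k:
            break  # early termination
    return top, visited


scores = [17, 93, 45, 8, 72, 60, 99, 31]
print(top_k_unsorted(scores, 3))                     # ([99, 93, 72], 8)
print(top_k_sorted(sorted(scores, reverse=True), 3))  # ([99, 93, 72], 3)
```

Both strategies return the same top three, but the sorted segment visits 3 documents instead of 8, and the gap grows with segment size.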

5.2. Index Sorting in Action

  • Early Termination of Search Queries: Index Sorting lets Elasticsearch store documents on disk in a specified order, so a query that sorts the same way can stop early. For example, a leaderboard query for the top three player scores does not need to visit every document if the documents are already stored in score order:
GET scores/score/_search
{
  "size": 3,
  "sort": [
      { "points": "desc" }
  ]
}
  • Specifying Index Sorting Order: To use Index Sorting, specify the sort field and order in the index settings when the index is created. The sort cannot be added or changed on an existing index:
PUT scores
{
    "settings" : {
        "index" : {
            "sort.field" : "points",
            "sort.order" : "desc"
        }
    },
    "mappings": {
        "score": {
            "properties": {
                "points": {
                  "type": "long"
                },
                "playerid": {
                  "type": "keyword"
                },
                "game" : {
                  "type" : "keyword"
                }
            }
        }
    }
}
  • Grouping Documents: Sorting by a field such as game type stores similar documents next to each other, which improves compression and query speed. The scores index could instead be created with this sort:
PUT scores
{
    "settings" : {
        "index" : {
            "sort.field" : "game",
            "sort.order" : "desc"
        }
    }
}
  • Efficient AND Conjunctions: Index Sorting also speeds up queries that combine several conditions, because documents matching the same values are grouped together and large runs of non-matching documents can be skipped:
GET players/player/_search
{
  "size": 3,
  "track_total_hits" : false,
  "query" : {
    "bool" : {
      "filter" : [
        { "term" : { "region" : "eu" } },
        { "term" : { "game" : "dragons-lair" } },
        { "term" : { "skill-rating" : 9 } },
        { "term" : { "map" : "castle" } }
      ]
    }
  }
}
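
To show why grouping helps conjunctions, here is a minimal sketch of intersecting several sorted postings lists by leapfrogging: each list skips forward to the current candidate document ID instead of stepping one document at a time. This illustrates the general technique only and is not Lucene's actual implementation; the function name and the sample postings are made up.

```python
from bisect import bisect_left


def intersect_postings(postings_lists):
    """Return the doc IDs present in every sorted postings list."""
    if not postings_lists or any(not plist for plist in postings_lists):
        return []
    lists = sorted(postings_lists, key=len)  # start from the shortest list
    positions = [0] * len(lists)
    result = []
    candidate = lists[0][0]
    while True:
        agreed = True
        for i, plist in enumerate(lists):
            # Skip ahead to the first doc ID >= candidate.
            positions[i] = bisect_left(plist, candidate, positions[i])
            if positions[i] == len(plist):
                return result  # one list is exhausted, no more matches
            if plist[positions[i]] != candidate:
                candidate = plist[positions[i]]  # leap to the larger doc ID
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1


region_eu = [2, 3, 4, 5, 9]
game_dragons_lair = [3, 4, 5, 6]
skill_rating_9 = [1, 4, 5, 7]
print(intersect_postings([region_eu, game_dragons_lair, skill_rating_9]))  # [4, 5]
```

When an index is sorted on the filtered fields, the matching IDs for each term form long contiguous runs, so each skip can jump over many non-matching documents at once.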

5.3. When Index Sorting Isn’t Ideal

Index Sorting requires additional work at index time, which can reduce write performance by 40-50%. Consider whether your application prioritizes query performance or write performance. Test Index Sorting with your specific use case and dataset to evaluate its impact.

Conclusion

Understanding and effectively implementing indexing in Elasticsearch, including the new Index Sorting feature, is essential for optimizing search performance and handling large datasets efficiently. By leveraging these features and following best practices, you can enhance your Elasticsearch deployment’s performance and scalability.