Commonly Used Tokenizers

Elasticsearch provides various built-in tokenizers, each suited for different purposes.

Standard Tokenizer

The standard tokenizer is the tokenizer used by Elasticsearch's default standard analyzer. It splits text into terms on word boundaries, following the Unicode Text Segmentation algorithm, and removes most punctuation.
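
A quick way to see this behaviour is the _analyze API, which works with built-in tokenizers even without creating an index (the sample text here is only illustrative):

GET /_analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox, jumping!"
}

This returns the tokens Quick, brown, fox, and jumping, with the comma and exclamation mark stripped. Note that the tokenizer alone does not lowercase the text; lowercasing is done by the lowercase token filter inside the standard analyzer.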

Whitespace Tokenizer

The whitespace tokenizer splits text into terms whenever it encounters whitespace.

PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "whitespace_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  }
}

Analyzing text:

GET /whitespace_example/_analyze
{
  "tokenizer": "whitespace_tokenizer",
  "text": "Elasticsearch is a powerful search engine"
}

Output:

{
  "tokens": [
    { "token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 }
  ]
}

In this example:

  • The text is split into tokens wherever whitespace occurs; letter case and any punctuation are left untouched, as the comparison below shows.
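
Because splitting happens only on whitespace, punctuation stays attached to the neighbouring term, which is the main difference from the standard tokenizer. A minimal comparison using the built-in whitespace tokenizer (no index required; the sample text is illustrative):

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, World!"
}

This produces the two tokens Hello, and World! (punctuation kept), whereas the standard tokenizer would return Hello and World.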

NGram Tokenizer

The ngram tokenizer breaks text into smaller chunks (n-grams) of specified lengths. It’s useful for partial matching and autocomplete features.

PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}

The index.max_ngram_diff setting is included because recent Elasticsearch versions only allow max_gram to exceed min_gram by 1 unless this limit is raised.

Analyzing text:

GET /ngram_example/_analyze
{
  "tokenizer": "ngram_tokenizer",
  "text": "search"
}

Output:

{
  "tokens": [
    { "token": "sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "searc", "start_offset": 0, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "earch", "start_offset": 1, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 6 },
    { "token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 7 },
    { "token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 8 }
  ]
}

In this example:

  • The text “search” is broken into every n-gram of length 3 to 5, and each token keeps the start and end offsets of the substring it came from. A typical way to wire this into an index for autocomplete is sketched below.
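
The following sketch shows one way to put the n-gram tokenizer to work for partial matching. The index, analyzer, and field names (autocomplete_example, ngram_analyzer, title) are illustrative assumptions, and search_analyzer is set to standard so that the query string itself is not split into n-grams:

PUT /autocomplete_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

With this mapping, a document whose title contains “search” is indexed under n-grams such as sea, ear, and arch, so a match query for “ear” on the title field can still find it.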

