Preface

These are notes I took while following the online course Complete Guide to Elasticsearch.

This post covers how Elasticsearch searches data.

Main text

Searching the data in an index
GET /<index>/_search
Searching for specific content
GET /<index>/_search?q=<field>:<value>

E.g.

GET /analyzer_test/_search?q=description:dog

Each matching document gets a score (_score) based on how relevant it is to the query.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "analyzer_test",
        "_type" : "_doc",
        "_id" : "DAlf5n0BeiHXdjTuK4R7",
        "_score" : 0.6931471,
        "_source" : {
          "description" : "Hey that dog!"
        }
      }
    ]
  }
}

Multiple conditions

GET /analyzer_test/_search?q=description:dog AND type:dachshund
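
When the same request is sent from code instead of the Kibana console, the query string has to be URL-encoded because it contains colons and spaces. A minimal sketch using only the Python standard library; the host http://localhost:9200 is an assumption for a local cluster, and build_search_url is a hypothetical helper name, not part of any Elasticsearch client.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, index: str, query: str) -> str:
    """Build a URI-search URL for the given index (hypothetical helper)."""
    # urlencode escapes ':' as %3A and spaces as '+', both valid in a query string.
    return f"{base_url}/{index}/_search?{urlencode({'q': query})}"


# Assumption: a local cluster listening on the default port 9200.
url = build_search_url(
    "http://localhost:9200",
    "analyzer_test",
    "description:dog AND type:dachshund",
)
print(url)
```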

Query DSL

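Queries fall into two kinds: leaf queries and compound queries. A leaf query looks for a value in a field, while a compound query wraps one or more leaf (or other compound) queries.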

E.g.
Basic query syntax

GET /analyzer_test/_search
{
  "query": {
    "match_all": {}
  }
}
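
To illustrate how a compound query is assembled from leaf queries, here is a small sketch that builds a request body as a Python dict. It assumes the description and type fields from the earlier URI search example; the bool, match, and term queries are standard Query DSL, but this particular combination is only an illustration.

```python
import json

# Leaf queries: each one looks for a value in a single field.
match_dog = {"match": {"description": "dog"}}
term_type = {"term": {"type": "dachshund"}}

# Compound query: a bool query that wraps the two leaf queries.
compound = {"query": {"bool": {"must": [match_dog, term_type]}}}

# This JSON could be used as the body of GET /analyzer_test/_search.
body = json.dumps(compound, indent=2)
print(body)
```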

How search works

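Every node can act as a coordinating node. When it receives a search request, it forwards the request to the shards of that index, which may live on different nodes, merges the partial results, and returns the combined response.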
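
The scatter-and-gather idea described above can be sketched with a toy model. This is not how Elasticsearch is implemented; the shard contents, scores, and the function names query_shard and coordinate are all made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def query_shard(docs: Dict[str, float], size: int) -> List[Hit]:
    """One shard returns its local top `size` hits."""
    return heapq.nlargest(size, ((score, doc_id) for doc_id, score in docs.items()))


def coordinate(shards: List[Dict[str, float]], size: int) -> List[str]:
    """Coordinating node: ask every shard, then merge the partial results."""
    partial = [hit for shard in shards for hit in query_shard(shard, size)]
    return [doc_id for _, doc_id in heapq.nlargest(size, partial)]


# Made-up scores for three shards of the same index.
shards = [
    {"a": 0.9, "b": 0.2},
    {"c": 0.7},
    {"d": 1.3, "e": 0.1},
]
top = coordinate(shards, size=3)
print(top)  # ['d', 'a', 'c']
```

Each shard only needs to return its own best hits, so the coordinating node merges a small list per shard instead of every matching document.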

Relevance scoring

Elasticsearch first finds the documents that match the query, then scores them.

The common factors behind a relevance score are:

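Term Frequency (TF): the more often the term appears in the field, the higher the score
Inverse Document Frequency (IDF): the opposite direction of TF; the more documents contain the term, the less weight it carries
Okapi BM25: TF/IDF with an upper limit on the TF contribution, so that very frequent terms such as stop words do not dominate the score
Field-length Norm: the longer the field, the lower the score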
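
The upper limit on TF and the field-length norm can be seen numerically. The sketch below uses the tf formula that Elasticsearch prints in its explain output, with k1 = 1.2 and b = 0.75 as shown there; the function name bm25_tf and the sample field lengths are assumptions for illustration.

```python
def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """TF part of BM25: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Saturation: more occurrences help, but the value stays below 1.
saturation = [round(bm25_tf(freq, dl=3, avgdl=3), 3) for freq in (1, 2, 10, 100)]
print(saturation)

# Field-length norm: the same single occurrence counts less in a longer field.
short_field = bm25_tf(1, dl=3, avgdl=3)
long_field = bm25_tf(1, dl=30, avgdl=3)
print(short_field > long_field)  # True
```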

Add the parameter explain: true to see how the relevance score was calculated

GET /analyzer_test/_search
{
  "explain": true, 
  "query": {
    "term": {
      "description": "dog"
    }
  }
}

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_shard" : "[analyzer_test][0]",
        "_node" : "0lmKyQR9SQ-uYLPan-8BZw",
        "_index" : "analyzer_test",
        "_type" : "_doc",
        "_id" : "DAlf5n0BeiHXdjTuK4R7",
        "_score" : 0.6931471,
        "_source" : {
          "description" : "Hey that dog!"
        },
        "_explanation" : {
          "value" : 0.6931471,
          "description" : "weight(description:dog in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.6931471,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 0.6931472,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 2,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 3.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 3.0,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
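
As a sanity check, the score can be recomputed by hand from the values in _explanation. This sketch simply plugs the numbers from the response above into the two formulas it prints; bm25_score is a hypothetical helper name.

```python
import math


def bm25_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    """Recompute boost * idf * tf using the formulas printed in _explanation."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


# Values copied from the explain response above.
score = bm25_score(freq=1.0, n=1, N=2, dl=3.0, avgdl=3.0)
print(round(score, 6))  # 0.693147
```

The result agrees with the reported _score of 0.6931471 up to floating-point precision.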

Term level query vs. Full text query

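term: the search value is not analyzed; it is compared as-is against the terms stored in the index
full text: the search text is analyzed first, using the standard analyzer by default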
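
The practical difference shows up with capitalized input. The sketch below imitates the behavior with a very rough stand-in for the standard analyzer (lowercase, then split on non-word characters); simple_analyze, term_query, and match_query are hypothetical names, and the real analyzer does more than this.

```python
import re
from typing import List, Set


def simple_analyze(text: str) -> List[str]:
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


# Tokens stored in the index for the document "Hey that dog!".
indexed: Set[str] = set(simple_analyze("Hey that dog!"))


def term_query(value: str) -> bool:
    """Term-level: the value is looked up as-is, with no analysis."""
    return value in indexed


def match_query(text: str) -> bool:
    """Full-text: the text is analyzed first, then any resulting token may match."""
    return any(token in indexed for token in simple_analyze(text))


print(term_query("Dog"))   # False: "Dog" is not the stored token "dog"
print(match_query("Dog"))  # True: analysis lowercases it to "dog"
```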

Reference

Complete Guide to Elasticsearch

Percy's blog

Elasticsearch Learning Notes Part 6 - searching
