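Preface

This article is a set of notes taken from the online course Complete Guide to Elasticsearch.

The Elasticsearch (ES) version used in this article is 7.16.2.

This article introduces how to use Elasticsearch's join queries to achieve an effect similar to joining two tables in an RDBMS.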

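Main text

In an RDBMS, normalization splits large tables into several smaller ones, which are then combined again with joins.

In Elasticsearch, however, **denormalization** is the recommended approach because it benefits ES more. Denormalization uses storage less efficiently, but ES is usually not the primary database, so space is traded for performance.

Although ES cannot join the way an RDBMS does, it can still perform some simple document joins.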

Querying nested objects

First, create an index named department that contains a nested field:

PUT /department
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "employees": {
        "type": "nested"
      }
    }
  }
}

Add a document:

POST /department/_doc
{
  "name": "HR",
  "employees": [
    {
      "name": "Joy",
      "age": "42",
      "position": "Senior Marketing Manager",
      "gender": "F"
    },
    {
      "name": "Toe",
      "age": "18",
      "position": "Interm",
      "gender": "M"
    }
  ]
}

To search the employees data, the following approach does not work:

GET /department/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "employees.position": "interm"
          }
        },
        {
          "term": {
            "employees.gender.keyword": "M"
          }
        }
      ]
    }
  }
}

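This is because the objects in a nested field are not stored the way they appear in the source document. Elasticsearch indexes each nested object as a separate hidden document, so a regular query on employees.* cannot see them. They have to be searched with the nested query, which also keeps the conditions tied to one object at a time.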
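To make the difference concrete, here is a minimal Python sketch that contrasts how an array of plain objects is flattened with how a nested field keeps each object separate. It is only a simulation under simplifying assumptions: values are compared by exact equality instead of being analyzed, and the function names (`flatten_objects`, `match_flattened`, `match_nested`) are hypothetical helpers, not part of any Elasticsearch client.

```python
# Simplified simulation; real Elasticsearch analyzes text and stores data in Lucene.
department = {
    "name": "HR",
    "employees": [
        {"name": "Joy", "position": "Senior Marketing Manager", "gender": "F"},
        {"name": "Toe", "position": "Interm", "gender": "M"},
    ],
}


def flatten_objects(doc, field):
    """Mimic the default object type: one multi-valued field per sub-key."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def match_flattened(doc, field, conditions):
    """Check each condition against the pooled values, so different
    objects may satisfy different conditions."""
    flat = flatten_objects(doc, field)
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in conditions.items()
    )


def match_nested(doc, field, conditions):
    """Mimic the nested type: all conditions must hold within one object."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in doc[field]
    )


# No single employee is both an "Interm" and female.
conditions = {"position": "Interm", "gender": "F"}
print(match_flattened(department, "employees", conditions))  # True (false positive)
print(match_nested(department, "employees", conditions))     # False
```

The flattened version reports a match even though the two values come from different employees, which is the loss of association that the nested type is meant to avoid.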

Inner hits

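With inner hits, a query can additionally return which objects inside the nested field actually matched the conditions.

Adding "_source": false keeps the response from becoming too large.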

GET /department/_search
{
  "_source": false, 
  "query": {
    "nested": {
      "path": "employees",
      "inner_hits": {},
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "employees.position": "interm"
              }
            },
            {
              "term": {
                "employees.gender.keyword": "M"
              }
            }
          ]
        }
      }
    }
  }
}

The inner hits show the details of the nested object that matched:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.5645323,
    "hits" : [
      {
        "_index" : "department",
        "_type" : "_doc",
        "_id" : "GQkfI34BeiHXdjTu0YQ5",
        "_score" : 1.5645323,
        "inner_hits" : {
          "employees" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 1.5645323,
              "hits" : [
                {
                  "_index" : "department",
                  "_type" : "_doc",
                  "_id" : "GQkfI34BeiHXdjTu0YQ5",
                  "_nested" : {
                    "field" : "employees",
                    "offset" : 1
                  },
                  "_score" : 1.5645323,
                  "_source" : {
                    "gender" : "M",
                    "name" : "Toe",
                    "position" : "Interm",
                    "age" : "18"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
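Once a response like this has been parsed into a dictionary, the matching nested objects can be pulled out by walking the inner_hits structure. The following is a small sketch under the assumption that the response has already been loaded as a Python dict; `matched_nested_objects` is a hypothetical helper name, and the sample dict is a trimmed copy of the response shown above.

```python
def matched_nested_objects(response, path):
    """Yield (parent_id, offset, source) for every inner hit under `path`."""
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            yield hit["_id"], inner_hit["_nested"]["offset"], inner_hit["_source"]


# Trimmed copy of the search response, keeping only the keys used here.
response = {
    "hits": {
        "hits": [
            {
                "_id": "GQkfI34BeiHXdjTu0YQ5",
                "inner_hits": {
                    "employees": {
                        "hits": {
                            "hits": [
                                {
                                    "_nested": {"field": "employees", "offset": 1},
                                    "_source": {
                                        "gender": "M",
                                        "name": "Toe",
                                        "position": "Interm",
                                        "age": "18",
                                    },
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

for parent_id, offset, source in matched_nested_objects(response, "employees"):
    print(parent_id, offset, source["name"])  # GQkfI34BeiHXdjTu0YQ5 1 Toe
```

The offset is the position of the matching object inside the employees array of the parent document, so offset 1 refers to the second employee.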

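Mappings for document relationships

Create a mapping that relates department and employee. join_field is a user-defined field name. If the department index from the previous section still exists, delete it first, because an existing index cannot be created again.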

PUT /department
{
  "mappings": {
    "properties": {
      "join_field": {
        "type": "join",
        "relations": {
          "department": "employee"
        }
      }
    }
  }
}

Add a department:

PUT /department/_doc/1
{
  "name": "Development",
  "join_field": "department"
}

Add an employee:

# points to the parent
PUT /department/_doc/3?routing=1
{
  "name": "percy",
  "age": 27,
  "gender": "M",
  "join_field": {
    "name": "employee",
    "parent": 1
  }
}

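routing determines which shard a document is stored on. By default the routing value is derived from the document ID, so a child document must be routed explicitly with its parent's ID to land on the same shard as the parent. If routing is not specified for a child document, the request fails with the error: [routing] is missing for join field [join_field]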
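The effect of the routing value can be sketched in Python. This is a simplified model, not Elasticsearch's actual implementation: the real formula hashes the routing value with Murmur3 and also takes the number of routing shards into account, whereas this sketch uses `zlib.crc32` as a stand-in hash and a plain modulo. The shard count of 5 and the function name `shard_for` are assumptions made for illustration.

```python
import zlib

NUMBER_OF_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a shard from a routing value (stand-in hash, not Murmur3)."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % number_of_shards


# The parent (ID 1) is routed by its own ID.
parent_shard = shard_for(1)

# The child (ID 3) indexed with ?routing=1 uses the parent's ID instead.
child_shard_with_routing = shard_for(1)

# Without explicit routing the child would be routed by its own ID,
# which is not guaranteed to be the parent's shard.
child_shard_default = shard_for(3)

print(parent_shard == child_shard_with_routing)  # True
print(parent_shard, child_shard_default)
```

Because the same routing value always hashes to the same shard, passing the parent's ID as routing is what keeps parent and child together.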

Query by parent ID

Query child documents by their parent's ID. type must be the child relation name that belongs to that parent.

GET /department/_search
{
  "query": {
    "parent_id": {
      "type": "employee",
      "id": 1
    }
  }
}

Query child doc by parent

Sometimes the ID of a child's parent is not known. In that case has_parent finds child documents whose parent matches a given query.

GET /department/_search
{
  "query": {
    "has_parent": {
      "parent_type": "department",
      "query": {
        "term": {
          "name.keyword": "Development"
        }
      }
    }
  }
}

Parent matching does not affect scoring by default. To take the parent's relevance score into account, add "score": true:

GET /department/_search
{
  "query": {
    "has_parent": {
      "score": true,
      "parent_type": "department",
      "query": {
        "term": {
          "name.keyword": "Development"
        }
      }
    }
  }
}

Query parents by child

Parents can also be queried by their children:

GET /department/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "age": {
                  "gte": 50
                }
              }
            }
          ],
          "should": [
            {
              "term": {
                "gender.keyword": "M"
              }
            }
          ]
        }
      }
    }
  }
}
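The semantics of parent_id, has_parent, and has_child can be summarized with an in-memory Python sketch. It is an illustration only, assuming a plain list of dicts in place of an index and Python predicates in place of query clauses; the helper names mirror the query names but are hypothetical functions, not a client API.

```python
docs = [
    {"_id": 1, "name": "Development", "join_field": "department"},
    {"_id": 3, "name": "percy", "age": 27, "gender": "M",
     "join_field": {"name": "employee", "parent": 1}},
]


def relation(doc):
    """Return the relation name stored in join_field."""
    join = doc["join_field"]
    return join["name"] if isinstance(join, dict) else join


def parent_of(doc):
    """Return the parent ID, or None for a top-level document."""
    join = doc["join_field"]
    return join.get("parent") if isinstance(join, dict) else None


def parent_id(docs, child_type, pid):
    """Children of the given relation whose parent has ID `pid`."""
    return [d for d in docs if relation(d) == child_type and parent_of(d) == pid]


def has_parent(docs, parent_type, predicate):
    """Children whose parent is of `parent_type` and satisfies `predicate`."""
    ids = {d["_id"] for d in docs if relation(d) == parent_type and predicate(d)}
    return [d for d in docs if parent_of(d) in ids]


def has_child(docs, child_type, predicate):
    """Parents with at least one `child_type` child satisfying `predicate`."""
    ids = {parent_of(d) for d in docs if relation(d) == child_type and predicate(d)}
    return [d for d in docs if d["_id"] in ids]


print([d["name"] for d in parent_id(docs, "employee", 1)])
print([d["name"] for d in has_parent(docs, "department",
                                     lambda d: d["name"] == "Development")])
print([d["name"] for d in has_child(docs, "employee",
                                    lambda d: d.get("age", 0) >= 50)])
```

With the sample data, the first two calls return the employee percy, while the last one returns nothing because the only employee is 27 years old.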

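Neither the child query nor the parent query computes relevance scores by default, but the child query offers more scoring options.

For has_child, the scoring parameter is named score_mode. It accepts none (the default), min, max, sum, and avg.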
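How score_mode combines the scores of matching children into one value for the parent can be sketched as follows. This is a rough model rather than Elasticsearch's scoring code: `parent_score` is a hypothetical helper, and for the none mode it simply returns None to signal that the child scores are ignored.

```python
from statistics import mean


def parent_score(child_scores, score_mode="none"):
    """Combine the scores of matching children according to score_mode."""
    if not child_scores:
        raise ValueError("a parent only matches if at least one child matches")
    if score_mode == "none":
        return None  # child scores are ignored
    combine = {"min": min, "max": max, "sum": sum, "avg": mean}
    if score_mode not in combine:
        raise ValueError(f"unsupported score_mode: {score_mode}")
    return combine[score_mode](child_scores)


child_scores = [1.0, 3.0]  # made-up scores of two matching children
for mode in ("none", "min", "max", "sum", "avg"):
    print(mode, parent_score(child_scores, mode))
```

Choosing sum favors parents with many matching children, while max or avg rank parents by the quality of their best or typical child.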

Multi-level relation

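Relationships between documents are not always as simple as a single parent and child pair; they often span several levels of parents and children.

Assume a hierarchy in which company is the parent of department and department is the parent of employee. This requires the relations of join_field to declare both levels.

Expressed as documents, the hierarchy looks like this: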

PUT /department/_doc/1
{
  "name": "Company",
  "join_field": "company"
}

PUT /department/_doc/2?routing=1
{
  "name": "Development",
  "join_field": {
    "name": "department",
    "parent": 1
  }
}

PUT /department/_doc/3?routing=1
{
  "name": "Percy",
  "join_field": {
    "name": "employee",
    "parent": 2
  }
}

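> The one thing to watch is that the employee also uses routing=1, because every document must be stored on the same shard as the document at the top of the hierarchy.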
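
Finding the right routing value for a deeply nested document amounts to walking up the hierarchy until the top-level document is reached. A minimal sketch, assuming the parent links are available as a simple dict (the `parents` mapping and the `routing_for` name are hypothetical):

```python
# Maps each document ID to its parent ID (None for the top-level document).
parents = {
    1: None,  # company
    2: 1,     # department, child of company 1
    3: 2,     # employee, child of department 2
}


def routing_for(doc_id, parents):
    """Follow the parent links up to the root and return the root's ID."""
    current = doc_id
    while parents[current] is not None:
        current = parents[current]
    return current


for doc_id in (1, 2, 3):
    print(doc_id, routing_for(doc_id, parents))
```

Every document in the hierarchy resolves to 1, which is why both the department and the employee are indexed with routing=1.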

Even with more levels, the way of searching stays the same:

GET /department/_search
{
  "query": {
    "has_child": {
      "type": "department",
      "query": {
        "has_child": {
          "type": "employee",
          "query": {
            "term": {
              "name.keyword": "Percy"
            }
          }
        }
      }
    }
  }
}

inner_hits can also be used with has_parent and has_child.

Terms lookup mechanism

Add the user and stories data:

# user 1 follows users 2 and 3
PUT /users/_doc/1
{
  "name": "John",
  "following": [2, 3]
}

# Each story records which user created it and its content

PUT /stories/_doc/1
{
  "user": 2,
  "content": "ya! 2"
}

PUT /stories/_doc/2
{
  "user": 3,
  "content": "ya! 3"
}

PUT /stories/_doc/3
{
  "user": 4,
  "content": "ya! 4"
}

# Search for stories written by the users that user 1 follows
GET /stories/_search
{
  "query": {
    "terms": {
      "user": {
        "index": "users",
        "id": "1",
        "path": "following"
      }
    }
  }
}
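Conceptually, a terms lookup is a two-step operation: fetch the list of values from one document, then use that list as the terms to filter another index. The sketch below simulates this with in-memory dicts standing in for the users and stories indices; `terms_lookup` is a hypothetical helper, not a client call.

```python
users = {
    "1": {"name": "John", "following": [2, 3]},
}

stories = {
    "1": {"user": 2, "content": "ya! 2"},
    "2": {"user": 3, "content": "ya! 3"},
    "3": {"user": 4, "content": "ya! 4"},
}


def terms_lookup(target_docs, field, lookup_docs, lookup_id, path):
    """Return IDs of target docs whose `field` is among the looked-up terms."""
    # Step 1: read the terms from the lookup document.
    terms = lookup_docs[lookup_id][path]
    # Step 2: keep the target documents whose field value is one of them.
    return [doc_id for doc_id, doc in target_docs.items() if doc[field] in terms]


print(terms_lookup(stories, "user", users, "1", "following"))  # ['1', '2']
```

Story 3 is left out because its author, user 4, is not in the following list of user 1.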

Join limitation

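Joins have the following limitations:

Documents must be stored in the same index, otherwise performance would be very poor.
Parent and child documents must be stored on the same shard.
Each index can have only one join field.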

Reference

Complete Guide to Elasticsearch

Percy's blog

Elasticsearch Study Notes Part 9 - Join Queries