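This post is a set of notes taken from the online course Complete Guide to Elasticsearch.
The Elasticsearch (ES) version used here is 7.16.2.
It covers how to use Elasticsearch's join queries to get an effect similar to joining two tables in an RDBMS.
In an RDBMS, normalization splits one large table into several smaller ones, which are then combined again with joins.
Elasticsearch instead recommends **denormalization**, because it performs better in ES. Denormalization uses storage less efficiently, but ES is usually not the primary data store, so space is traded for query performance.
ES cannot join the way an RDBMS does, but it does support some simple document joins.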
Querying nested objects
Suppose we create an index named department with a nested field employees, then index one document:
| PUT /department { "mappings": { "properties": { "name": { "type": "text" }, "employees": { "type": "nested" } } } }
| POST /department/_doc { "name": "HR", "employees": [ { "name": "Joy", "age": "42", "position": "Senior Marketing Manager", "gender": "F" }, { "name": "Toe", "age": "18", "position": "Interm", "gender": "M" } ] }
Searching the employees data with a plain bool query like the one below does not work:
| GET /department/_search { "query": { "bool": { "must": [ { "match": { "employees.position": "interm" } }, { "term": { "employees.gender.keyword": "M" } } ] } } }
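The query returns no hits. With the default object type, an array of objects is flattened when stored, so the link between the fields of each individual employee is lost. The nested type avoids this by indexing every object as a separate hidden document, which is also why a plain query on employees.* cannot see them: they have to be searched through a nested query.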
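To make the flattening problem concrete, here is a small Python sketch. It does not talk to Elasticsearch; `flatten_object_array`, `flat_match`, and `nested_match` are hypothetical helpers that only imitate how an `object` array and a `nested` array behave when two conditions must hold for the same employee.

```python
from collections import defaultdict

department = {
    "name": "HR",
    "employees": [
        {"name": "Joy", "position": "Senior Marketing Manager", "gender": "F"},
        {"name": "Toe", "position": "Interm", "gender": "M"},
    ],
}


def flatten_object_array(objects, prefix):
    """Imitate the default `object` type: each sub-field becomes one flat list of values."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{prefix}.{key}"].append(value)
    return dict(flat)


def flat_match(flat, conditions):
    """Each condition only needs its value to appear somewhere in the flattened field."""
    return all(value in flat.get(field, []) for field, value in conditions.items())


def nested_match(objects, conditions, prefix):
    """Imitate `nested`: all conditions must hold inside one and the same object."""
    return any(
        all(obj.get(field[len(prefix) + 1:]) == value for field, value in conditions.items())
        for obj in objects
    )


# No single employee is both a Senior Marketing Manager and male.
conditions = {"employees.position": "Senior Marketing Manager", "employees.gender": "M"}
flat = flatten_object_array(department["employees"], "employees")

print(flat_match(flat, conditions))  # True: a false positive across two employees
print(nested_match(department["employees"], conditions, "employees"))  # False
```

The flattened version matches because "Senior Marketing Manager" comes from Joy and "M" comes from Toe. Keeping each object separate is what rules that combination out.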
Inner hits
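Adding inner_hits to a nested query shows which objects inside the nested field actually matched the conditions.
Setting "_source": false keeps the response from getting too large.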
| GET /department/_search { "_source": false, "query": { "nested": { "path": "employees", "inner_hits": {}, "query": { "bool": { "must": [ { "match": { "employees.position": "interm" } }, { "term": { "employees.gender.keyword": "M" } } ] } } } } }
The inner_hits section of the response gives the details of the nested objects that matched:
| { "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 1.5645323, "hits" : [ { "_index" : "department", "_type" : "_doc", "_id" : "GQkfI34BeiHXdjTu0YQ5", "_score" : 1.5645323, "inner_hits" : { "employees" : { "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 1.5645323, "hits" : [ { "_index" : "department", "_type" : "_doc", "_id" : "GQkfI34BeiHXdjTu0YQ5", "_nested" : { "field" : "employees", "offset" : 1 }, "_score" : 1.5645323, "_source" : { "gender" : "M", "name" : "Toe", "position" : "Interm", "age" : "18" } } ] } } } } ] } }
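When consuming such a response in application code, the matched nested objects have to be dug out of the `inner_hits` section. The following is a minimal sketch that works on the response as a plain Python dict; `matched_nested_objects` is a hypothetical helper name, and the `response` literal is a trimmed copy of the output above.

```python
def matched_nested_objects(response, inner_hits_name):
    """Yield (parent_id, offset, source) for every inner hit in a search response."""
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            nested = inner_hit.get("_nested", {})
            yield hit["_id"], nested.get("offset"), inner_hit.get("_source")


# Trimmed copy of the response shown above.
response = {
    "hits": {"hits": [{
        "_id": "GQkfI34BeiHXdjTu0YQ5",
        "inner_hits": {"employees": {"hits": {"hits": [{
            "_nested": {"field": "employees", "offset": 1},
            "_source": {"gender": "M", "name": "Toe", "position": "Interm", "age": "18"},
        }]}}},
    }]}
}

rows = list(matched_nested_objects(response, "employees"))
for parent_id, offset, source in rows:
    print(parent_id, offset, source["name"])  # GQkfI34BeiHXdjTu0YQ5 1 Toe
```

The `offset` is the position of the matched object inside the `employees` array of the parent document, so offset 1 is the second employee.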
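Mappings for document relationships
Create a mapping with a join field, join_field, that relates department (the parent) to employee (the child). If the department index from the previous section still exists, delete it first.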
| PUT /department { "mappings": { "properties": { "join_field": { "type": "join", "relations": { "department": "employee" } } } } }
Add a department document:
| PUT /department/_doc/1 { "name": "Development", "join_field": "department" }
Add an employee document:
| # "parent" and routing both point at the parent document's ID
PUT /department/_doc/3?routing=1 { "name": "percy", "age": 27, "gender": "M", "join_field": { "name": "employee", "parent": 1 } }
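routing determines which shard a document is stored on. By default the value is the document's own ID, so a child document has to be indexed with its parent's ID as the routing value in order to land on the same shard. Leaving it out for a child document fails with the error "[routing] is missing for join field [join_field]".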
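The effect of the routing value can be sketched in a few lines of Python. This is a simplification and not Elasticsearch's real implementation: `pick_shard` is a hypothetical helper, `zlib.crc32` stands in for the Murmur3 hash that ES uses, and the shard count of 5 is an assumed setting, so the shard numbers will not match a real cluster.

```python
import zlib


def pick_shard(routing_value, number_of_shards):
    """Simplified shard selection: hash(routing) modulo the shard count.

    Assumption: zlib.crc32 is only a stand-in for the hash Elasticsearch uses.
    """
    return zlib.crc32(str(routing_value).encode("utf-8")) % number_of_shards


NUMBER_OF_SHARDS = 5  # hypothetical index setting

parent_shard = pick_shard("1", NUMBER_OF_SHARDS)   # parent: routing defaults to its own _id
child_default = pick_shard("3", NUMBER_OF_SHARDS)  # child if it were routed by its own _id
child_routed = pick_shard("1", NUMBER_OF_SHARDS)   # child indexed with ?routing=1

print(parent_shard, child_default, child_routed)
```

Whatever the hash function is, the same routing value always maps to the same shard, which is why passing the parent's ID as `routing` keeps parent and child together.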
Query by parent ID
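The parent_id query finds child documents by the ID of their parent. type must be the name of the child relation (employee here), and id is the parent document's ID.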
| GET /department/_search { "query": { "parent_id": { "type": "employee", "id": 1 } } }
Query child doc by parent
Sometimes the parent's ID is not known. In that case has_parent can select child documents by matching on their parent instead.
| GET /department/_search { "query": { "has_parent": { "parent_type": "department", "query": { "term": { "name.keyword": "Development" } } } } }
By default, has_parent does not take the parent's relevance score into account. To enable it, add "score": true:
| GET /department/_search { "query": { "has_parent": { "score": true, "parent_type": "department", "query": { "term": { "name.keyword": "Development" } } } } }
Query parents by child
The reverse also works: has_child finds parent documents by matching on their children.
| GET /department/_search { "query": { "has_child": { "type": "employee", "query": { "bool": { "must": [ { "range": { "age": { "gte": 50 } } } ], "should": [ { "term": { "gender.keyword": "M" } } ] } } } } }
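Neither has_parent nor has_child calculates scores by default, but has_child offers more scoring options.
For has_child the parameter is called score_mode, and it accepts none (the default), avg, max, min, and sum.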
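As a sketch of how `score_mode` fits into the request body, the hypothetical helper `has_child_query` below builds a `has_child` search body as a Python dict and rejects values outside the list above. It only constructs JSON and does not send anything to a cluster.

```python
import json

# Values accepted by has_child's score_mode; "none" is the default.
SCORE_MODES = {"none", "avg", "max", "min", "sum"}


def has_child_query(child_type, child_query, score_mode="none"):
    """Build a has_child search body, validating score_mode before use."""
    if score_mode not in SCORE_MODES:
        raise ValueError(f"unsupported score_mode: {score_mode!r}")
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "score_mode": score_mode,
                "query": child_query,
            }
        }
    }


body = has_child_query("employee", {"range": {"age": {"gte": 50}}}, score_mode="sum")
print(json.dumps(body, indent=2))
```

With `sum`, the scores of all matching employees are added up for their department, so the resulting body can be pasted after `GET /department/_search`.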
Multi-level relation
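Real data is rarely as simple as a single parent and child pair. More often it is a hierarchy made up of several parent and child relations.
Taking a company, department, employee hierarchy as the example, the documents are indexed as follows. This assumes the join_field mapping declares company as the parent of department and department as the parent of employee.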
| PUT /department/_doc/1 { "name": "Company", "join_field": "company" }
PUT /department/_doc/2?routing=1 { "name": "Development", "join_field": { "name": "department", "parent": 1 } }
PUT /department/_doc/3?routing=1 { "name": "Percy", "join_field": { "name": "employee", "parent": 2 } }
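> The one thing to watch is that the employee also uses routing=1 even though its parent is document 2, because every document in the hierarchy must be stored on the same shard as the top-level document.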
| GET /department/_search { "query": { "has_child": { "type": "department", "query": { "has_child": { "type": "employee", "query": { "term": { "name.keyword": "Percy" } } } } } } }
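The routing rule for multi-level relations can be sketched as "walk up to the top-level ancestor and use its ID". In the Python sketch below, `root_routing` and `index_request` are hypothetical helpers, and the `parents` map simply mirrors the three documents indexed above.

```python
def root_routing(doc_id, parents):
    """Walk up the parent chain and return the ID of the top-level ancestor."""
    seen = set()
    current = doc_id
    while current in parents:
        if current in seen:
            raise ValueError(f"cycle in parent chain at {current!r}")
        seen.add(current)
        current = parents[current]
    return current


def index_request(doc_id, parents):
    """Return the request line for a document, adding ?routing= for non-root documents."""
    routing = root_routing(doc_id, parents)
    if routing == doc_id:
        return f"PUT /department/_doc/{doc_id}"
    return f"PUT /department/_doc/{doc_id}?routing={routing}"


# child ID -> parent ID, taken from the three documents above
parents = {2: 1, 3: 2}

for doc_id in (1, 2, 3):
    print(index_request(doc_id, parents))
# PUT /department/_doc/1
# PUT /department/_doc/2?routing=1
# PUT /department/_doc/3?routing=1
```

Document 3 has document 2 as its parent, but its routing value is still 1, the ID of the company at the top of the hierarchy.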
inner_hits can also be used with has_parent and has_child.
Terms lookup mechanism
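A terms lookup takes the values for a terms query from a field of another document instead of listing them in the query. First add some users and stories: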
| # user 1 follows users 2 and 3
PUT /users/_doc/1 { "name": "John", "following": [2, 3] }
# each story records which user wrote it, plus its content
PUT /stories/_doc/1 { "user": 2, "content": "ya! 2" }
PUT /stories/_doc/2 { "user": 3, "content": "ya! 3" }
PUT /stories/_doc/3 { "user": 4, "content": "ya! 4" }
| # search for stories written by the users that user 1 follows
GET /stories/_search { "query": { "terms": { "user": { "index": "users", "id": "1", "path": "following" } } } }
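Conceptually the lookup is two steps: read the `following` field from the users document, then use those values as the terms for the stories search. The Python sketch below imitates this with in-memory dicts. `terms_lookup` and `search_stories` are hypothetical helpers for illustration, not part of any Elasticsearch client.

```python
# In-memory stand-ins for the two indices created above.
users = {"1": {"name": "John", "following": [2, 3]}}
stories = {
    "1": {"user": 2, "content": "ya! 2"},
    "2": {"user": 3, "content": "ya! 3"},
    "3": {"user": 4, "content": "ya! 4"},
}


def terms_lookup(lookup_index, doc_id, path):
    """Step 1: fetch the terms from a field of another document."""
    value = lookup_index[doc_id][path]
    return set(value if isinstance(value, list) else [value])


def search_stories(story_index, field, terms):
    """Step 2: return the IDs of the stories whose field matches any of the terms."""
    return sorted(sid for sid, doc in story_index.items() if doc[field] in terms)


terms = terms_lookup(users, "1", "following")
print(search_stories(stories, "user", terms))  # ['1', '2']
```

Story 3 is excluded because it was written by user 4, whom user 1 does not follow.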
Join limitation
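Joins come with the following limitations:
- Parent and child documents must be stored in the same index; joining across indices would perform very poorly
- A parent and its children must be stored on the same shard
- Each index can have only one join field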
