
Getting Started with Elasticsearch

Posted on:2018-06-22 00:00

Last modified:2022-05-15 11:51

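Elastic's products used to carry inconsistent version numbers, some at 2.x and some at 4.x. To sell them as one bundle, Elastic aligned Elasticsearch (ES), Kibana, and Logstash at version 5.0. The current version at the time of writing is 7.x.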

Broadly speaking, ES before 7.0 can be understood like this:

Relational database: Databases -> Tables of Rows -> Columns
Elasticsearch: Indices -> Types of Documents -> Fields

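An Elasticsearch cluster can contain multiple indices, each index can contain multiple types, each type contains multiple documents, and each document contains multiple fields. ES also uses the word mapping a lot: it is the mapping from fields to their types, roughly what a schema is. Mappings can be nested, which means a document can have other documents as children.

One thing needs special attention, though: in SQL, columns in different tables are completely unrelated, whereas in ES, fields with the same name in different types of the same index are the same field.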

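Index (noun): as described above, an index is like a database in a traditional relational database; it is where related documents are stored. The plural of index is indices or indexes.
Index (verb): "to index a document" means to store a document in an index (noun) so that it can be retrieved or queried. This is much like the INSERT keyword in SQL, except that if the document already exists, the new document overwrites the old one.
Inverted index: a traditional database can add an index, such as a B-Tree index, to a specific column to speed up lookups. Elasticsearch and Lucene use a data structure called an inverted index for the same purpose.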
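
To make the inverted index idea concrete, here is a minimal sketch in Python. It is a toy, not how Lucene stores things: the function name build_inverted_index and the lowercase-and-split tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the sorted IDs
    of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Elasticsearch uses Lucene",
    2: "Lucene builds an inverted index",
}
index = build_inverted_index(docs)
print(index["lucene"])  # IDs of every document containing "lucene"
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the whole point of the structure.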

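Important update: ES 7.0 and later break with this model and remove types, so an index now holds only one type, _doc. If you want different kinds of documents, either define a type field of your own to tell them apart, or store only one kind of document in each index. If the tutorial you are reading still talks about types, or a Python tutorial still uses the doc_type parameter, stop reading and find another one; doc_type is about to be removed in upcoming versions.

In Elasticsearch, the data in every field is indexed by default. In other words, each field has its own inverted index for fast retrieval. A document holds not only data but also metadata. The required metadata fields are: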

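_index: the index the document lives in
_type: the class of object the document represents; removed in ES 7.0, so wherever it is still required, always use _doc
_id: the unique identifier of the document
_version: used for conflict control via optimistic locking; it can be supplied externally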
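
The optimistic-locking idea behind _version can be sketched with a tiny in-memory store. This is only a toy model of the concept: TinyStore, VersionConflict, and expected_version are hypothetical names, and none of this mirrors the actual ES request parameters.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class TinyStore:
    def __init__(self):
        self._docs = {}  # _id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # Optimistic lock: never block, just reject writes based on stale reads.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)  # new document replaces the old
        return new_version


store = TinyStore()
v1 = store.index("1", {"name": "Andrew"})
v2 = store.index("1", {"name": "Andrew", "age": 45}, expected_version=v1)
```

A second writer that still holds version 1 would now get a VersionConflict and has to re-read the document before trying again.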

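Installation

Because cloud vendors such as AWS keep leeching off open source, the default ES distribution now has to be installed from Elastic's own brew tap: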

brew tap elastic/tap
brew install elastic/tap/elasticsearch-full elastic/tap/kibana-full

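Creating an index and inserting data

A mapping defines the field types of the documents in ES. With dynamic mapping, ES infers a field's type the first time it sees the field. That causes problems: a timestamp, for example, may be inferred as long. So we usually specify the mapping explicitly when creating the index.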

PUT http://localhost:9200/company

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    // text analysis (tokenization) settings
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  // field types
  "mappings": {
    "properties": {
      "age": {
        "type": "long"
      },
      "experienceInYears": {
        "type": "long"
      },
      "name": {
        "type": "text",
        "analyzer": "analyzer-name"  // or specify ik_smart directly
      }
    }
  }
}
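
For scripting the same call from Python, a sketch using only the standard library could look like this. The helper name build_create_index_request is made up for illustration, the mapping is trimmed to two fields, and the request is only constructed, not sent, so it runs without a node on localhost:9200.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, settings, mappings):
    """Build (but do not send) the PUT request that creates an index."""
    body = json.dumps({"settings": settings, "mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_create_index_request(
    "http://localhost:9200",
    "company",
    settings={"number_of_shards": 1, "number_of_replicas": 1},
    mappings={"properties": {"age": {"type": "long"}, "name": {"type": "text"}}},
)
# With a node running, sending it would be:
# urllib.request.urlopen(request)
```

Because json.dumps produces strict JSON, the comments and trailing commas that are easy to leave in a hand-written body cannot sneak in.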

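ES field types include boolean, binary, long, double, text, date, and others.

If you need to store an enum-like value, just set {"index": false} on a text field. Note that a field with index set to false cannot be searched.
The date type recognizes some date formats automatically, such as epoch timestamps and yyyy-MM-dd; a format can also be given explicitly with format.
Use analyzer to specify the analyzer for a text field.

When inserting a document, note that since types no longer exist, every URL uses the fixed _doc segment.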

POST http://localhost:9200/company/_doc/

{
  "name": "Andrew",
  "age" : 45,
  "experienceInYears" : 10
}

Updating data

Use the update API, POST /<index>/_update/<id>, and put the changed fields inside {"doc": ...}. Alas, update_by_query does not support doc.
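
What a partial update with {"doc": ...} does to the stored document can be modelled in a few lines. This is a sketch of my understanding of the merge rule (nested objects are merged recursively, everything else, arrays included, is replaced); apply_partial_update is a hypothetical helper, not part of any client library.

```python
def apply_partial_update(source, doc):
    """Return a copy of source with the partial document doc merged in."""
    merged = dict(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Objects are merged field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Scalars and arrays simply replace the old value.
            merged[key] = value
    return merged


stored = {"name": "Andrew", "age": 45, "address": {"city": "Berlin", "zip": "10115"}}
updated = apply_partial_update(stored, {"age": 46, "address": {"city": "Munich"}})
```

Fields that the partial document does not mention, such as name and address.zip here, are left untouched.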

Text Analyzer

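As everyone knows, the first step in building an inverted index is preprocessing the text, above all tokenization. English is easy: it comes already split into words and only needs some morphological normalization that reduces words to their base form, i.e. stemming. Chinese needs special handling. In ES, the component responsible for this work is called the text analyzer.

A text analyzer generally consists of three parts:

Char filter: processes individual characters
Tokenizer: splits the text into tokens
Token filter: processes the tokens; adding synonyms and stemming also happen here

The text analyzer is invoked both when documents are added to the index and when the index is queried.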
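
The three stages can be sketched as a small Python pipeline. All the functions here are stand-ins written for illustration; in particular naive_stem_filter is a deliberately crude rule, nothing like a real stemmer.

```python
import re


def strip_html(text):
    """Char filter: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def naive_stem_filter(tokens):
    """Token filter: drop a trailing "s" from tokens longer than three characters."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Quick</b> brown Dogs",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, naive_stem_filter],
)
print(tokens)
```

Running the same analyze function on the query text is what makes a search for "dog" line up with a document that said "Dogs".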

Specifying the text analyzer for a field:

PUT my-index-000001

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

Specifying the default text analyzer for an index:

PUT my-index-000001

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  }
}

Using the IK analyzer

The most commonly used Chinese tokenizer is IK. It already has ten thousand stars on GitHub, so it should be trustworthy.

VERSION=7.10.2
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v${VERSION}/elasticsearch-analysis-ik-${VERSION}.zip

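Replace the version number with the version of your ES installation.

If you upgrade ES, you will run into an error like "Plugin [analysis-ik] was built for Elasticsearch version 7.9.2 but version 7.11.1 is running". Just remove the plugin and install the matching version: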

elasticsearch-plugin remove analysis-ik
elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.11.1/elasticsearch-analysis-ik-7.11.1.zip

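IK adds two analyzers to Elasticsearch: ik_smart and ik_max_word. ik_smart produces fewer tokens, while ik_max_word exhausts every possible split to produce as many tokens as it can.

For example: 今天天气真好.

ik_smart produces: 今天天气, 真好.
ik_max_word produces: 今天天气, 今天, 天天, 真好.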
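
The difference between a coarse and a fine-grained split can be imitated with a toy dictionary segmenter. To be clear, this is not IK's algorithm: the two functions and the four-word dictionary are assumptions chosen so that the toy reproduces the two outputs quoted above.

```python
def all_dictionary_words(text, dictionary):
    """Fine-grained toy: emit every substring that is a dictionary word."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):  # longest candidate first
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def forward_maximum_match(text, dictionary):
    """Coarse toy: repeatedly take the longest dictionary word at the cursor."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # unknown character: emit it on its own
        tokens.append(text[start:end])
        start = end
    return tokens


dictionary = {"今天天气", "今天", "天天", "真好"}
coarse = forward_maximum_match("今天天气真好", dictionary)
fine = all_dictionary_words("今天天气真好", dictionary)
```

The fine-grained list helps recall at index time, while the coarse list keeps queries from matching on accidental fragments such as 天天.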

ES needs to be restarted after the IK analyzer is installed.

Search

GET /[index]/_search

GET /bank/_search
{
  "query": { "match_all": {} },  // the query condition
  "sort": [
    { "account_number": "asc" }  // the sort order
  ],
  "from": 10,  // offset, used for paging
  "size": 10   // page size
}

{
  "took" : 63,  // time the search took, in milliseconds
  "timed_out" : false, // whether the search timed out
  "_shards" : {  // information about the shards that were searched
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {  // the matching results
    "total" : {  // number of matching results
        "value": 1000,
        "relation": "eq"
    },
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "_doc",
      "_id" : "0",
      "sort": [0],
      "_score" : null,
      // content of the retrieved document
      "_source" : {"account_number":0,"balance":16623, ...}
    }, {
      "_index" : "bank",
      "_type" : "_doc",
      "_id" : "1",
      "sort": [1],
      "_score" : null,
      "_source" : {"account_number":1,"balance":39225, ...}
    }, ...
    ]
  }
}

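Search has many parameters that can be tuned:

track_total_hits: by default the hit count is exact only when it is below 10,000, because counting every result is an O(n) operation.
filter: filters results by certain conditions, for example size and color when searching for clothes on an e-commerce site.
highlighter: extracts the snippets of each result that contain the keywords.
_source: an array specifying which fields to return. By default all fields are returned.

Querying fields

The simplest query: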

GET /_search
{
  "query": {
    "match": {
      "message": "this is a test"
    }
  }
}

match analyzes the query text, for example lowercasing all of it. For an exact match, use term:

GET /_search
{
  "query": {
    "term": {"name": "john"}
  }
}

By default, the terms of the query are combined with OR, which is obviously not what we want; the operator can be set to and:

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test",
        "operator": "and"
      }
    }
  }
}
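
The semantics of match, its operator, and term can be mimicked on a single text field in Python. This is a sketch under the assumption that the field uses an analyzer that lowercases and splits on whitespace; match, term, and analyze are toy functions, not client calls.

```python
def analyze(text):
    """Stand-in analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def match(field_value, query, operator="or"):
    """The query text is analyzed; its terms are combined with OR or AND."""
    indexed_terms = set(analyze(field_value))
    hits = [term in indexed_terms for term in analyze(query)]
    return all(hits) if operator == "and" else any(hits)


def term(field_value, value):
    """The value is NOT analyzed; it must equal an indexed term exactly."""
    return value in set(analyze(field_value))


message = "This is only a drill"
print(match(message, "this is a test"))                  # OR: some terms match
print(match(message, "this is a test", operator="and"))  # AND: "test" is missing
```

The term function also shows a common surprise: on an analyzed text field, term("John Smith", "John") finds nothing because the indexed term is the lowercased "john".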

match_all fetches all documents:

GET /_search
{
  "query": {
    "match_all": {}  // this parameter is always empty
  }
}

Compound queries

To use OR/AND/NOT, you need a bool clause:

AND uses "must"
OR uses "should"
NOT uses "must_not"

For example, round AND (red OR blue) can be expressed as:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {"shape": "round"}
        },
        {
          "bool": {
            "should": [
              {"term": {"color": "red"}},
              {"term": {"color": "blue"}}
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

Generally, to get a real OR you should also add minimum_should_match: 1, which means that at least one of the conditions in should must hold.
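
To check how a nested bool query behaves, a small evaluator over plain dicts is enough. This is an illustrative sketch that handles only term and bool (with must, should, must_not, the bool clause for NOT, and minimum_should_match) and ignores scoring entirely; the function name matches is made up.

```python
def matches(query, doc):
    """Return True if doc satisfies the (term | bool) query."""
    if "term" in query:
        ((field, value),) = query["term"].items()
        return doc.get(field) == value
    if "bool" in query:
        clause = query["bool"]
        must = clause.get("must", [])
        should = clause.get("should", [])
        must_not = clause.get("must_not", [])
        # When a must clause is present, should is optional unless told otherwise.
        default_minimum = 0 if must else 1
        minimum = clause.get("minimum_should_match", default_minimum) if should else 0
        return (
            all(matches(q, doc) for q in must)
            and not any(matches(q, doc) for q in must_not)
            and sum(matches(q, doc) for q in should) >= minimum
        )
    raise ValueError(f"unsupported query: {query!r}")


round_and_red_or_blue = {
    "bool": {
        "must": [
            {"term": {"shape": "round"}},
            {"bool": {
                "should": [{"term": {"color": "red"}}, {"term": {"color": "blue"}}],
                "minimum_should_match": 1,
            }},
        ]
    }
}
print(matches(round_and_red_or_blue, {"shape": "round", "color": "blue"}))
```

The default_minimum line is the reason for spelling out minimum_should_match: a should placed next to a must in the same bool only influences ranking unless a minimum is given.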

Deleting all data

Delete all the data but keep the index's mapping.

POST http://localhost:9200/index/_delete_by_query

{
  "query": {
    "match_all": {}
  }
}

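Paging through search results

The from and size parameters can be used for paging, but just like LIMIT and OFFSET in SQL, the deeper the page, the worse the performance, because they naively produce all the results up to that point and then take the slice in the middle.

Beyond that, there is the scroll API, which has ES keep the query result cached so that you fetch one segment at a time. Obviously this is not suitable for ad-hoc user searches either.

The last approach is search_after, which is the equivalent of using where id > $id in SQL and passing in the largest ID from the previous query each time. In ES, the values to pass are the sort values of the last hit of the previous query, i.e. the largest (or smallest) seen so far.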

GET my-index-000001/_search

{
  "size": 10,
  "query": {
    "match" : {
      "message" : "foo"
    }
  },
  "search_after": [1463538857, "654323"],  // one value per field in sort
  "sort": [
    {"@timestamp": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
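
The keyset idea behind search_after can be simulated on an in-memory list. This is a sketch only: search_after_page is a hypothetical function, and the documents are assumed to carry a timestamp plus a unique id that serves as the tie breaker.

```python
def search_after_page(docs, size, after=None):
    """Return one page sorted by (timestamp, id) ascending, plus the cursor
    (the sort values of the last hit) to pass in for the next page."""
    ordered = sorted(docs, key=lambda d: (d["timestamp"], d["id"]))
    if after is not None:
        ordered = [d for d in ordered if (d["timestamp"], d["id"]) > tuple(after)]
    page = ordered[:size]
    cursor = [page[-1]["timestamp"], page[-1]["id"]] if page else None
    return page, cursor


docs = [{"timestamp": 100 + i // 2, "id": f"{i:03d}"} for i in range(5)]

seen, cursor = [], None
while True:
    page, cursor = search_after_page(docs, size=2, after=cursor)
    if not page:
        break
    seen.extend(d["id"] for d in page)
print(seen)
```

Several documents share a timestamp here, which is why the unique tie breaker belongs in the sort: with the timestamp alone, a page boundary that falls between two equal timestamps would skip documents.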

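Sorting search results

By default, search results are ordered by the computed _score, i.e. their relevance to the search query. We can also specify the ordering ourselves with a custom sort field.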

GET /my-index-000001/_search

{
  "sort" : [
    { "post_date" : {"order" : "asc"}},
    "user",
    { "name" : "desc" },
    { "age" : "desc" },
    "_score",
    "_doc"
  ],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
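
How a multi-field sort resolves ties can be reproduced in Python with repeated stable sorts. This is a sketch assuming plain field values: sort_hits is a made-up helper, and special entries such as _score and _doc are left out.

```python
def sort_hits(docs, sort_spec):
    """Order docs by a list of (field, "asc" | "desc") pairs, first pair wins."""
    ordered = list(docs)
    # Apply the least significant key first; each stable sort keeps the
    # order established by the previous passes among equal values.
    for field, order in reversed(sort_spec):
        ordered.sort(key=lambda d: d[field], reverse=(order == "desc"))
    return ordered


hits = [
    {"name": "b", "age": 30},
    {"name": "a", "age": 30},
    {"name": "c", "age": 40},
]
ordered = sort_hits(hits, [("age", "desc"), ("name", "asc")])
print([h["name"] for h in ordered])
```

Later entries in the sort array only matter when all earlier ones are equal, which is also why something like _doc tends to sit at the very end as a final tie breaker.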

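Query DSL

ES implements its own query language in JSON. It is basically an AST, so you just write it out directly. There are two kinds of clauses:

Query (leaf) clauses
Compound clauses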

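Compound fields

In Elasticsearch any field can hold an array, but there are two problems:

Appending to an array is a fairly involved operation
Array elements cannot be deduplicated; there is no unique index

Querying, on the other hand, is very simple; just separate the path with a dot: post.tags

So for complex nested data, such as the comments on an article, it is best not to store it this way; for simple data such as tags it is acceptable.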
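
Since arrays have neither a cheap append nor a uniqueness guarantee, one client-side workaround is read, modify, write: fetch the document, change the list in application code, and index the whole document again. The sketch below shows only the middle step; append_unique is a hypothetical helper, and the fetch and re-index calls are left out.

```python
def append_unique(source, field, value):
    """Return a copy of source with value appended to the list stored in
    field, unless the list already contains it."""
    updated = dict(source)
    values = list(updated.get(field, []))
    if value not in values:
        values.append(value)
    updated[field] = values
    return updated


post = {"title": "Hello", "tags": ["es", "search"]}
post = append_unique(post, "tags", "lucene")
post = append_unique(post, "tags", "es")  # already present, so nothing changes
print(post["tags"])
```

Two writers doing this at the same time can still overwrite each other, so in practice the write-back would need some form of version check.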


All contents are under the CC-BY-NC-SA license, if not otherwise specified.

Opinions expressed here are solely my own and do not express the views or opinions of my employer.
