A Detailed Guide to Elasticsearch Index Mapping Configuration

This topic was created 2302 days ago; some of the information in it may have evolved or changed since then.

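Overview

One clear difference between Elasticsearch and a traditional SQL database is that Elasticsearch is an unstructured, or schemaless, database: it can infer a mapping from the documents you index, although you can also define one explicitly. The three most important concepts for data in Elasticsearch are the index, the type, and the document. The index is especially important, and it can be loosely compared to a table in a traditional SQL database. This article starts with how to configure the mapping of an Elasticsearch index.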

Note: this article was first published on My Personal Blog. Feel free to drop by!

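A mind map of this article is shown below. The article is about 1,500 characters long and takes roughly 5 minutes to read.

[Figure: mind map of this article]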

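Index Schema Mapping

When creating an index, you can define its structure yourself. For example, a users index that stores user information would typically have the following fields:

id: unique identifier
name: name
birthday: date of birth
hobby: hobbies

For this we can create an index mapping file in JSON format, users.json: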

{
	"mappings" : {
		"user" : {
			"properties" : {
				"id" : {
					"type" : "long",
					"store" : "yes"
				},
				"name" : {
					"type" : "string",
					"store" : "yes",
					"index" : "analyzed"
				},
				"birthday" : {
					"type" : "date",
					"store" : "yes"
				},
				"hobby" : {
					"type" : "string",
					"store" : "no",
					"index" : "analyzed"
				}
				
			}
		}
	}
}

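The JSON above, combined with the create request that follows, means:

Create an index named users (the name comes from the request URL, not from the file)
The index contains a type named user
The user type has four fields
Each field has its own attribute definitions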
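As a quick sanity check before sending the file to a cluster, the structure listed above can be read back programmatically. The following is a minimal sketch, not part of the original article: the helper name summarize_mapping is hypothetical, and the mapping is embedded as a string only so that the example is self-contained.

```python
import json

# The same mapping as users.json, embedded so the sketch is self-contained.
USERS_MAPPING = """
{
  "mappings": {
    "user": {
      "properties": {
        "id": {"type": "long", "store": "yes"},
        "name": {"type": "string", "store": "yes", "index": "analyzed"},
        "birthday": {"type": "date", "store": "yes"},
        "hobby": {"type": "string", "store": "no", "index": "analyzed"}
      }
    }
  }
}
"""


def summarize_mapping(mapping_json):
    """Return {type_name: {field_name: field_type}} for a mapping document."""
    mapping = json.loads(mapping_json)
    summary = {}
    for type_name, type_def in mapping["mappings"].items():
        fields = type_def.get("properties", {})
        summary[type_name] = {name: attrs["type"] for name, attrs in fields.items()}
    return summary


if __name__ == "__main__":
    # Prints one type (user) with its four fields and their declared types.
    print(summarize_mapping(USERS_MAPPING))
```

Parsing the file this way also catches plain JSON syntax errors (a missing comma or brace) before Elasticsearch ever sees the request.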

Then run the following command to create the index:

curl -X PUT http://47.98.43.236:9200/users -d @users.json
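For readers who prefer to script this step, here is a rough Python equivalent of the curl call, using only the standard library. It is a sketch under assumptions: the function name build_create_index_request and the localhost URL are illustrative, and the Content-Type header is added because newer Elasticsearch releases require it, which the original command does not show.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, mapping):
    """Build (but do not send) a PUT request that creates an index with a mapping."""
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        # Assumption: Elasticsearch 6.0+ rejects request bodies without this header.
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # A trimmed, illustrative mapping; in practice it would be loaded from users.json.
    mapping = {"mappings": {"user": {"properties": {"id": {"type": "long"}}}}}
    request = build_create_index_request("http://localhost:9200", "users", mapping)
    print(request.get_method(), request.full_url)
    # To actually send it to a running node: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a cluster and makes the URL and body easy to inspect.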

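The result is shown below: the users index, the user type, and the four fields have all been created successfully:

[Figure: creating a new index]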

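The available field types include:

string: string
number: numeric types (for example long, as used for id above, or integer)
date: date
boolean: boolean
binary: binary
ip: IP address
token_count: stores the number of tokens in an analyzed string

For the attributes each type supports, see the official documentation for your Elasticsearch version; there are too many to list here.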

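Using Analyzers

An analyzer is a tool that analyzes data or processes it the way the user wants. For string fields, Elasticsearch lets users define custom analyzers.

First, define a custom analyzer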

{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "myanalyzer" : {
            "tokenizer" : "standard",
            "filter" : [
              "asciifolding",
              "lowercase",
              "myFilter"
            ]
          }
        },
        "filter" : {
          "myFilter" : {
            "type" : "kstem"
          }
        }
      }

    }
  },
	"mappings" : {
		"user" : {
			"properties" : {
				"id" : {
					"type" : "long",
					"store" : "yes"
				},
				"name" : {
					"type" : "string",
					"store" : "yes",
					"index" : "analyzed",
					"analyzer" : "myanalyzer"
				},
				"birthday" : {
					"type" : "date",
					"store" : "yes"
				},
				"hobby" : {
					"type" : "string",
					"store" : "no",
					"index" : "analyzed"
				}

			}
		}
	}
}

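In the JSON above, we define an analyzer named myanalyzer. It consists of one tokenizer and three filters:

Tokenizer: standard
Filter: asciifolding
Filter: lowercase
Filter: myFilter (a custom filter that is essentially kstem)

Next, how to test and use the custom analyzer

The analyze API can be exercised through a RESTful call like the following: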

curl -X GET 'http://47.98.43.236:9200/users/_analyze?field=user.name' -d 'Cars Trains'

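We passed in the plain string "Cars Trains" and got back car and train. The phrase "Cars Trains" was split into two terms, lowercased, and finally stemmed, which confirms that the custom analyzer defined above is in effect.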
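The stages described above (tokenize, fold to ASCII, lowercase, stem) can be imitated in a few lines of Python to make the pipeline concrete. This is only a toy model: the regular-expression tokenizer is a rough stand-in for the standard tokenizer, and toy_stem is a hypothetical plural stripper, not the real kstem algorithm.

```python
import re
import unicodedata


def standard_tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Approximate asciifolding: decompose accented characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Toy plural stripper for illustration only; NOT the real kstem algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run the tokenizer, then apply each filter in the order myanalyzer lists them."""
    tokens = standard_tokenize(text)
    for token_filter in (ascii_fold, str.lower, toy_stem):
        tokens = [token_filter(token) for token in tokens]
    return tokens


if __name__ == "__main__":
    print(analyze("Cars Trains"))  # ['car', 'train']
```

The point of the sketch is the ordering: filters run one after another on the tokenizer's output, so the stemmer only ever sees tokens that are already folded and lowercased.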

Configuring the Similarity Model

Elasticsearch lets you specify a different similarity scoring model for each field in the index mapping. For example:

	"mappings" : {
		"user" : {
			"properties" : {
				"id" : {
					"type" : "long",
					"store" : "yes"
				},
				"name" : {
					"type" : "string",
					"store" : "yes",
					"index" : "analyzed",
					"analyzer" : "myanalyzer",
					"similarity" : "BM25"
				},
				"birthday" : {
					"type" : "date",
					"store" : "yes"
				},
				"hobby" : {
					"type" : "string",
					"store" : "no",
					"index" : "analyzed"
				}

			}
		}
	}

In the JSON above, the name field uses the BM25 similarity model. It is set by adding a similarity key-value pair to the field definition, so that Elasticsearch computes relevance scores for name with BM25.
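To give a feel for what BM25 computes, here is a sketch of the textbook per-term scoring formula. It is an illustration rather than Lucene's exact implementation: the function and parameter names are made up for this example, and k1=1.2 and b=0.75 are the commonly cited defaults.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of a single term for a single document."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: documents longer than average are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: repeated occurrences add less and less.
    return idf * term_freq * (k1 + 1) / (term_freq + k1 * length_norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 50):
        score = bm25_term_score(tf, doc_len=10, avg_doc_len=10,
                                doc_count=100, docs_with_term=5)
        print(tf, round(score, 3))
```

Running it shows the two properties that distinguish BM25 from a plain term-frequency score: the score grows with term frequency but levels off, and shorter documents score higher for the same frequency.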

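Configuring the Postings Format

Elasticsearch supports specifying a postings format per field, so that performance can be improved by changing how the field is indexed. The available postings formats are:

default: the default postings format, which provides on-the-fly compression of stored fields and term vectors
pulsing: encodes the postings list into the term array for fields with few duplicate values, which can speed up lookups on that field
direct: loads terms into uncompressed in-memory arrays at read time; it can improve performance for frequently used fields, at the cost of memory
memory: writes all data to disk, then uses an FST (finite state transducer) to read terms and postings lists into memory
bloom_default: an extension of the default format that also writes a bloom filter to disk. At read time the bloom filter is loaded into memory so that the existence of a given value can be checked quickly
bloom_pulsing: an extension of the pulsing format that also adds bloom filter support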
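The bloom filter mentioned for the two bloom_ formats is a compact structure that can answer "definitely absent" or "possibly present" for a value. The sketch below shows the general technique only; the class name, bit-array size, and SHA-256 based hashing are assumptions for illustration and say nothing about how Lucene implements its filter.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means the value was never added; True means it may have been.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


if __name__ == "__main__":
    seen_ids = BloomFilter()
    for user_id in ("1001", "1002", "1003"):
        seen_ids.add(user_id)
    print(seen_ids.might_contain("1002"))
```

Because a negative answer is always correct, a lookup for a value that is not in a segment can often be rejected from memory without touching the term dictionary on disk.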

The postings format attribute (postings_format) can be set on any field. An example configuration:

	"mappings" : {
		"user" : {
			"properties" : {
				"id" : {
					"type" : "long",
					"store" : "yes",
					"postings_format" : "pulsing"
				},
				"name" : {
					"type" : "string",
					"store" : "yes",
					"index" : "analyzed",
					"analyzer" : "myanalyzer"
				},
				"birthday" : {
					"type" : "date",
					"store" : "yes"
				},
				"hobby" : {
					"type" : "string",
					"store" : "no",
					"index" : "analyzed"
				}

			}
		}
	}

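In this example we changed the postings format of the id field to pulsing, which can speed up lookups on that field.

Configuring Doc Values and Their Format

The doc values field attribute lets the values of a given field be written into a more memory-efficient structure, which makes sorting and searching on that field more efficient. It is usually added to fields that need to be sorted on.

It is configured through the doc_values_format attribute. There are three commonly used values, and their meaning can largely be guessed from the names:

default: the default format, which uses little memory while still performing well
disk: stores the data on disk and needs almost no memory
memory: stores the data in memory

Here is an example: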

	"mappings" : {
		"user" : {
			"properties" : {
				"id" : {
					"type" : "long",
					"store" : "yes"
				},
				"name" : {
					"type" : "string",
					"store" : "yes",
					"index" : "analyzed",
					"analyzer" : "myanalyzer"
				},
				"birthday" : {
					"type" : "date",
					"store" : "yes"
				},
				"hobby" : {
					"type" : "string",
					"store" : "no",
					"index" : "analyzed"
				},
				"age" : {
					"type" : "integer",
					"doc_values_format" : "memory"
				}
			}
		}
	}
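
To see why a per-field structure helps sorting, doc values can be pictured as a column in which position i holds the value for document i. The following is a conceptual sketch only; the function names and sample users are invented for illustration and do not reflect Elasticsearch's on-disk format.

```python
def build_doc_values(docs, field):
    """Store one field as a column: position i holds the value for document i."""
    return [doc.get(field) for doc in docs]


def sort_doc_ids(column, reverse=False):
    """Return document ids ordered by their value in the column."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id],
                  reverse=reverse)


if __name__ == "__main__":
    users = [
        {"name": "alice", "age": 31},
        {"name": "bob", "age": 25},
        {"name": "carol", "age": 40},
    ]
    ages = build_doc_values(users, "age")
    print(ages)                # [31, 25, 40]
    print(sort_doc_ids(ages))  # [1, 0, 2]
```

Sorting only needs the age column, so the full documents never have to be loaded; that is the access pattern the age field in the mapping above is prepared for.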