Elasticsearch: Data In, Data Out

htsjk.Com, 2019-06-28 07:12

Data In, Data Out

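Document Metadata

_index: the collection of documents grouped together because they share common characteristics
_type: the class of object that the document represents; a sub-partition of the index
_id: the unique identifier of the document
Index name: must be lowercase, cannot begin with an underscore, and cannot contain commas
Type name: may be uppercase or lowercase, but cannot begin with an underscore or a period, should not contain commas, and is limited to 256 characters
ID: a string; when creating a document, either supply your own _id or let Elasticsearch generate one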
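
The naming rules above can be written as two small checks. This is a minimal Python sketch of the rules as listed here, not Elasticsearch's own validation code; the function names are hypothetical, and rejecting the empty string is an added assumption.

```python
def is_valid_index_name(name: str) -> bool:
    """Index names: lowercase, no leading underscore, no commas."""
    return (
        bool(name)                      # assumption: empty names are rejected
        and name == name.lower()
        and not name.startswith("_")
        and "," not in name
    )


def is_valid_type_name(name: str) -> bool:
    """Type names: any case, no leading underscore or period, no commas, at most 256 characters."""
    return (
        bool(name)                      # assumption: empty names are rejected
        and not name.startswith(("_", "."))
        and "," not in name
        and len(name) <= 256
    )
```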

Indexing a Document

Using a custom ID

PUT /website/blog/123     # explicit ID
{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

Using an auto-generated ID. An auto-generated ID is a URL-safe, Base64-encoded GUID string that is 20 characters long.

POST /website/blog/
{
  "title": "My second blog entry",
  "text":  "Still trying this out...",
  "date":  "2014/01/01"
}
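
The two request forms above differ only in the HTTP method and in whether the path ends with an ID. A rough Python sketch of that choice follows; index_request is a hypothetical helper that only builds the method, path and body, and it sends nothing.

```python
import json
from typing import Optional, Tuple


def index_request(index: str, doc_type: str, doc: dict,
                  doc_id: Optional[str] = None) -> Tuple[str, str, str]:
    """Return (method, path, body) for indexing one document."""
    body = json.dumps(doc)
    if doc_id is None:
        # No ID supplied: POST to the type and let Elasticsearch generate one.
        return "POST", f"/{index}/{doc_type}/", body
    # ID supplied: PUT to the full document path.
    return "PUT", f"/{index}/{doc_type}/{doc_id}", body
```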

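The _version field: every time a document is changed (including when it is deleted), the value of _version is incremented. This makes it possible to ensure that a change made by one part of an application does not overwrite a change made by another part.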

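Retrieving a Document

Retrieving the whole document

GET /website/blog/123?pretty

The pretty parameter pretty-prints the JSON response body so that it is easier to read (except for the _source field).

Retrieving part of a document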

GET /website/blog/123?_source=title,text

The response still contains the metadata, but the _source field keeps only the title and text fields

Without metadata

GET /website/blog/123/_source

Result:

{
   "title": "My first blog entry",
   "text":  "Just trying this out...",
   "date":  "2014/01/01"
}

Checking Whether a Document Exists

Use the HEAD method

curl -i -XHEAD http://localhost:9200/website/blog/123

Result:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

or

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
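
The same check can be made from Python's standard library. This is a sketch under assumptions: document_exists and its opener parameter are hypothetical names, and the opener is injectable only so that the function can be exercised without a running cluster.

```python
import urllib.error
import urllib.request


def document_exists(url: str, opener=urllib.request.urlopen) -> bool:
    """Send a HEAD request; 200 means the document exists, 404 means it does not."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status is a real error, not an answer


# Usage against a local node (URL taken from the curl example above):
# document_exists("http://localhost:9200/website/blog/123")
```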

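Updating a Whole Document

Documents are immutable and cannot be changed in place. To change a document, it has to be reindexed or replaced.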

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

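In the response, the version number has increased and created is false (the document already existed). Internally, the old document has been marked as deleted (it can no longer be accessed, but it does not disappear immediately; it is cleaned up in the background as more data is indexed) and an entirely new document has been added.
- The update API appears to modify the document in place, but it is a comparatively expensive operation that actually performs the following steps:
1. Build the JSON from the old document
2. Change the JSON
3. Delete the old document
4. Index a new document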
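
The four steps can be mimicked with an in-memory stand-in. The TinyStore class below is purely illustrative and is not Elasticsearch code; it only shows why an update behaves like a full replacement that bumps _version.

```python
import copy


class TinyStore:
    """In-memory stand-in for one index (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}  # _id -> {"_version": int, "_source": dict}

    def index(self, doc_id, source):
        """Store a whole document; replaces any existing one and bumps its version."""
        old = self.docs.get(doc_id)
        version = old["_version"] + 1 if old else 1
        self.docs[doc_id] = {"_version": version, "_source": copy.deepcopy(source)}
        return {"_id": doc_id, "_version": version, "created": old is None}

    def update(self, doc_id, partial):
        """Partial update expressed as retrieve, change, replace."""
        old = self.docs[doc_id]
        source = copy.deepcopy(old["_source"])  # 1. build JSON from the old document
        source.update(partial)                  # 2. change the JSON
        return self.index(doc_id, source)       # 3 + 4. old copy is replaced by a new one
```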

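Creating a New Document

This section is about making sure that a request creates a new document rather than overwriting an existing one.
One option is not to specify an ID and let Elasticsearch generate it.
When supplying your own _id, use either of the following two forms, which return the same result: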

PUT /megacorp/employee/2?op_type=create
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}

PUT /megacorp/employee/2/_create
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}

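Both forms behave the same: if a document with the same _index, _type and _id already exists, a 409 Conflict is returned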
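
The create-only behaviour can be sketched with a plain dictionary standing in for the index. ConflictError and create are hypothetical names used for illustration; the 201 and 409 values mirror the HTTP status codes Elasticsearch returns for a successful create and for a conflict.

```python
class ConflictError(Exception):
    """Raised when the document already exists (HTTP 409 in Elasticsearch)."""
    status = 409


def create(store: dict, doc_id: str, source: dict) -> dict:
    """Add a document only if no document with this ID exists yet."""
    if doc_id in store:
        raise ConflictError(f"document {doc_id!r} already exists")
    store[doc_id] = dict(source)
    return {"_id": doc_id, "created": True, "status": 201}
```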

Deleting a Document

DELETE

DELETE /website/blog/123

If the document does not exist (the _version value is still incremented), the response is

{
  "found" :    false,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 4
}

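Dealing with Conflicts

When multiple requests modify the same document at the same time, data can easily be lost or corrupted
Optimistic concurrency control relies on the _version field: because this field is incremented whenever the document changes, it can be used to ensure that changes are applied in the correct order, so that conflicting changes in the application do not cause data loss
Using versions from an external system

Given a document whose current version is 5, it can be updated by specifying a higher version number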
```
PUT /website/blog/2?version=10&version_type=external
{
    "title": "My first external blog entry",
    "text":  "This is a piece of cake..."
}
```
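The update succeeds. If the PUT had specified a version that is not greater than the current one (5 or lower), the request would have failed with a version conflict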
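
The version comparison can be summarised in a few lines. This sketch assumes the semantics of the Elasticsearch versions these notes cover: with the default internal versioning the supplied version must equal the current one, while with version_type=external it must be strictly greater. check_version is a hypothetical name.

```python
from typing import Optional


def check_version(current: Optional[int], requested: int,
                  version_type: str = "internal") -> bool:
    """Return True if a write carrying `requested` would be accepted."""
    if version_type == "internal":
        # Optimistic concurrency control: the caller must have seen the latest version.
        return requested == current
    if version_type == "external":
        # External versions only need to move forward; a new document accepts any version.
        return current is None or requested > current
    raise ValueError(f"unknown version_type: {version_type}")
```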

Partial Updates to Documents

https://www.elastic.co/guide/cn/elasticsearch/guide/current/partial-updates.html

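Advantages of the update API: the update happens inside the shard, which avoids the network overhead of multiple requests. It also shortens the interval between the retrieve and reindex steps, which reduces the chance of conflicting changes from multiple processes (compared with a get-modify-put cycle)
Changes are passed in a doc field: they are merged with the existing document, overwriting existing fields and adding new ones, and the new fields are added to _source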

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

GET result:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  3,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags": [ "testing" ],
      "views":  0
   }
}

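Partial updates with a script (Groovy): a script can be used in the update API to change the contents of _source, which is referred to as ctx._source inside an update script. For example, to change the number of views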

POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"    # adds 1 on every call
}

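upsert: the document to be updated may not exist yet. For example, for a page-view counter that is incremented on every visit, a default value for the missing document can be set as follows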

POST /website/pageviews/1/_update
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 1
   }
}
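
The upsert behaviour can be sketched with a dictionary as the store: when the document is missing, the upsert body is stored as-is and the script does not run; otherwise the script is applied to the existing source. scripted_upsert and increment_views are hypothetical names, and a Python callable stands in for the Groovy script.

```python
def scripted_upsert(store: dict, doc_id: str, script, upsert: dict) -> dict:
    """Apply `script` to an existing document, or insert `upsert` if it is missing."""
    if doc_id not in store:
        store[doc_id] = dict(upsert)   # first request: store the default document
    else:
        script(store[doc_id])          # later requests: run the script on _source
    return store[doc_id]


def increment_views(source: dict) -> None:
    """Python stand-in for the script ctx._source.views+=1."""
    source["views"] += 1
```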

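Updates and conflicts: the smaller the interval between the retrieve and reindex steps, the smaller the chance of a conflicting change. It is still possible, however, that a request from another process changes the document before update has managed to reindex it, which would lose data. To avoid this, the update API reads the document's current _version in the retrieve step and passes it to the index request in the reindex step. If another process has changed the document between the retrieve and reindex steps, the _version will not match and the update request fails
Handling: when the order of the updates does not matter, simply retry after a failure. The retry_on_conflict parameter sets how many times update should retry before failing; its default value is 0

POST /website/pageviews/1/_update?retry_on_conflict=5       # retry up to 5 times before failing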
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}
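
The retry behaviour amounts to a retrieve-change-reindex loop. The sketch below is a client-side illustration under assumptions, not the server's implementation: update_with_retry, VersionConflict, and the get_doc and put_doc callbacks are all hypothetical names.

```python
class VersionConflict(Exception):
    """Raised by put_doc when the document changed since it was read."""


def update_with_retry(get_doc, put_doc, change, retry_on_conflict=0):
    """Retrieve, change and reindex; on a version conflict, start over.

    get_doc() returns (version, source); put_doc(source, expected_version=...)
    raises VersionConflict if the stored version no longer matches.
    """
    attempts = retry_on_conflict + 1
    for attempt in range(attempts):
        version, source = get_doc()
        new_source = change(dict(source))   # work on a copy of the source
        try:
            return put_doc(new_source, expected_version=version)
        except VersionConflict:
            if attempt == attempts - 1:
                raise                        # retries exhausted: report the conflict
```

Retrying is only safe when the change gives the right result regardless of order, as with incrementing a counter.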

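Retrieving Multiple Documents (Combining Requests)

Advantage: avoids the network latency and overhead of processing each request individually
Example: the request body is a docs array, and the response is also an array whose order matches the order of the elements in the request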

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

Result:

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      {
         "_index" :   "website",
         "_id" :      "1",
         "_type" :    "pageviews",
         "found" :    true,
         "_version" : 2,
         "_source" : {
            "views" : 2
         }
      }
   ]
}

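If the documents share the same _index, or even the same _type, a default /_index or /_index/_type can be specified in the URL; individual entries in the request body can still override these defaults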

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

If both _index and _type are the same for all documents, an array of ids is enough

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

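Result:

The HTTP status code of an mget response is 200 even if none of the documents is found, because the mget request itself executed successfully. To know whether each individual lookup succeeded, check its found flag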
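
Because the status code says nothing about individual documents, a client usually inspects the found flag of each entry. A minimal Python sketch, where split_mget_response is a hypothetical helper operating on an already-parsed response:

```python
from typing import List, Tuple


def split_mget_response(response: dict) -> Tuple[List[dict], List[dict]]:
    """Separate the entries of an mget response into found and missing documents."""
    found, missing = [], []
    for doc in response.get("docs", []):
        if doc.get("found"):
            found.append(doc)
        else:
            missing.append(doc)
    return found, missing
```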

Cheaper in Bulk

Format: every line ends with \n (including the last one), and lines cannot contain unescaped newline characters (so the JSON cannot be pretty-printed)

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n

The action must be one of: create, index, update, delete
Example

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
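
A body in this format can be assembled with the standard json module, which writes each object on a single line and escapes newlines inside strings. build_bulk_body is a hypothetical helper; it only builds the text and leaves sending it to the caller.

```python
import json
from typing import Iterable, Optional, Tuple


def build_bulk_body(operations: Iterable[Tuple[str, dict, Optional[dict]]]) -> str:
    """Build a bulk body from (action, metadata, request_body) triples.

    request_body is None for actions without a body line, such as delete.
    """
    lines = []
    for action, metadata, request_body in operations:
        lines.append(json.dumps({action: metadata}))
        if request_body is not None:
            lines.append(json.dumps(request_body))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"
```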

Result:

{
   "took": 4,
   "errors": false,
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}

Note: a bulk request is not atomic and cannot be used to implement transactions; each sub-request is executed independently
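
Since each sub-request succeeds or fails on its own, a client has to walk the items array to find the failures. A small sketch, assuming that a failed item carries an error field or a status of 400 or above; failed_items is a hypothetical helper and the error text in the usage below is made up for illustration.

```python
from typing import List, Tuple


def failed_items(bulk_response: dict) -> List[Tuple[int, str, dict]]:
    """Return (position, action, result) for every sub-request that failed."""
    failures = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key object such as {"create": {...}}.
        action, result = next(iter(item.items()))
        if "error" in result or result.get("status", 200) >= 400:
            failures.append((position, action, result))
    return failures
```

The position matters because the items come back in the same order as the sub-requests, so it identifies which action to retry or report.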
The bulk request URL accepts a default /_index or /_index/_type

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }

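Performance: the entire bulk request has to be loaded into memory by the node that receives it, so the larger the request, the more memory it takes up. There is an optimal size for a bulk request; above it, performance no longer improves and may even drop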
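
One way to look for that optimal size is to cap each request at a byte budget and try different budgets. The sketch below splits already-serialized operations into batches; chunk_bulk_operations and the max_bytes parameter are hypothetical, and a suitable value has to be found by experiment.

```python
from typing import Iterable, Iterator


def chunk_bulk_operations(operations: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Group serialized operations into bulk bodies of at most max_bytes each.

    Each element of `operations` is one complete operation: the action line plus
    its optional body line, with every line ending in a newline. An operation
    larger than max_bytes is sent on its own rather than dropped.
    """
    batch, size = [], 0
    for operation in operations:
        operation_size = len(operation.encode("utf-8"))
        if batch and size + operation_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(operation)
        size += operation_size
    if batch:
        yield "".join(batch)
```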