Elasticsearch Slow Logs in Practice: Configuration and Analysis

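When Elasticsearch suddenly stalls in production, the slow log is the first place to look. ES provides two kinds of slow log: the search slow log and the indexing slow log. This article covers how to configure them, how to read the log format, how to analyze the entries, and a few common pitfalls.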

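1. Search Slow Log

The search slow log records search requests that exceed a configured threshold. Thresholds are set separately for the two phases of a search:

Phase	Meaning	Setting
query	Query phase (executed on each shard)	`threshold.query`
fetch	Fetch phase (retrieving the matching documents from the shards)	`threshold.fetch`

Configuration example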

PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.query.debug": "200ms",
  "index.search.slowlog.threshold.query.trace": "100ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms",
  "index.search.slowlog.threshold.fetch.info": "200ms",
  "index.search.slowlog.threshold.fetch.debug": "100ms",
  "index.search.slowlog.threshold.fetch.trace": "50ms",
  "index.search.slowlog.level": "info"
}

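Log levels:

Level	Meaning	When to use
warn	Requests above this threshold are logged at WARN	Recommended in production
info	Requests above this threshold are logged at INFO	Enable as needed while troubleshooting
debug/trace	Finer granularity	Development and debugging

Production advice: for the query phase set warn to 2s and info to 1s; for the fetch phase set warn to 1s. Do not enable debug/trace, or the log volume will explode under high QPS. A threshold set to -1 is disabled. Note that `index.search.slowlog.level` was deprecated in 7.x and removed in 8.0; on 8.x, omit it and control verbosity through the thresholds alone.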
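
As a rough sketch of the guidance above, the Python helper below builds a settings body with those thresholds and disables debug and trace by setting them to -1. The function name and its defaults are illustrative assumptions, not part of any client library; the resulting dict would be sent as the body of PUT /<index>/_settings with whichever client you use.

```python
import json


def build_search_slowlog_settings(query_warn="2s", query_info="1s", fetch_warn="1s"):
    """Build an index-settings body for the search slow log.

    debug and trace thresholds are set to "-1", which disables them.
    """
    prefix = "index.search.slowlog.threshold"
    return {
        f"{prefix}.query.warn": query_warn,
        f"{prefix}.query.info": query_info,
        f"{prefix}.query.debug": "-1",
        f"{prefix}.query.trace": "-1",
        f"{prefix}.fetch.warn": fetch_warn,
        f"{prefix}.fetch.debug": "-1",
        f"{prefix}.fetch.trace": "-1",
    }


if __name__ == "__main__":
    # Print the JSON body so it can be pasted into Kibana Dev Tools.
    print(json.dumps(build_search_slowlog_settings(), indent=2))
```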

2. Indexing Slow Log

The indexing slow log records indexing operations on individual documents that take longer than the threshold.

PUT /my_index/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "500ms",
  "index.indexing.slowlog.threshold.index.info": "200ms",
  "index.indexing.slowlog.threshold.index.debug": "100ms",
  "index.indexing.slowlog.threshold.index.trace": "50ms",
  "index.indexing.slowlog.level": "info",
  "index.indexing.slowlog.source": "1000"
}

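The source setting controls how much of the document's _source is written to the log. A value of 1000 logs the first 1000 characters. Setting it to 0 or false skips the source entirely, and setting it to true logs the full source regardless of size. (The documented values are a character count, true, and false; do not rely on -1 as a shorthand for "unlimited".)

3. Configuring Thresholds per Index

Different indices tolerate different latencies. A log index can afford to be slower, for example, while the index behind a search feature has to be fast.

// Business search index: tighter thresholds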
PUT /product_search/_settings
{
  "index.search.slowlog.threshold.query.warn": "500ms",
  "index.search.slowlog.threshold.query.info": "200ms"
}

// Log archive index: looser thresholds
PUT /logs-2026.05/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}

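Thresholds can also be managed in bulk through an index template. Template settings are applied when an index is created, so existing indices still need an explicit _settings update: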

PUT /_index_template/slowlog_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.search.slowlog.threshold.query.warn": "5s",
      "index.search.slowlog.threshold.query.info": "2s"
    }
  }
}
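
To illustrate the per-index tiering described in this section, here is a minimal Python sketch that picks thresholds by index name pattern and produces the matching settings body. The tier table, the default values, and the function names are assumptions made for the example; only the setting keys come from the configuration shown above.

```python
from fnmatch import fnmatchcase

# Hypothetical tiers: index pattern -> (query warn, query info).
TIERS = [
    ("product_search", ("500ms", "200ms")),  # business search: tight
    ("logs-*", ("5s", "2s")),                # log archive: loose
]
DEFAULT_THRESHOLDS = ("2s", "1s")  # assumed fallback for everything else


def thresholds_for(index_name):
    """Return (warn, info) for the first tier whose pattern matches."""
    for pattern, thresholds in TIERS:
        if fnmatchcase(index_name, pattern):
            return thresholds
    return DEFAULT_THRESHOLDS


def settings_body(index_name):
    """Build the _settings body for one index from its tier."""
    warn, info = thresholds_for(index_name)
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
    }
```

Keeping the tiers in one table makes it easy to review them alongside the index template, so that new and existing indices end up with the same values.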

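4. Reading the Slow Log Format

A typical search slow log entry (ES 7.x+) looks like this. It is wrapped here for readability; in the log file each entry is a single line: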

[2024-06-15T10:23:45,678][WARN ][i.s.s.query              ] [node-1]
[my_index-2024.06][0] took[1.8s], took_millis[1800],
total_hits[12345 hits], types[], stats[],
search_type[QUERY_THEN_FETCH], total_shards[5],
source[{"query":{"bool":{"must":[{"match":{"title":"elasticsearch"}}]}}}],

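Field	Meaning
`took` / `took_millis`	Time the query took on this shard
`total_hits`	Number of matching documents
`total_shards`	Number of shards involved in the search
`source`	The query body as JSON (ES 7.x+)

ES 7.x improvement: earlier versions reportedly logged only the first characters of source, while 7.x logs the full query body, which makes troubleshooting much faster.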
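
To make the format concrete, the following Python sketch parses one plain-text entry into a dict. It assumes the entry has already been joined onto a single line and follows the layout of the sample above; the function name and the returned keys are choices made for this example, and other ES versions or the JSON log format would need a different parser.

```python
import json
import re

# Header layout assumed from the sample: [time][LEVEL][logger] [node] [index][shard]
HEADER = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]"
)
NUMERIC_FIELDS = {
    "took_millis": re.compile(r"took_millis\[(\d+)\]"),
    "total_hits": re.compile(r"total_hits\[(\d+)"),
    "total_shards": re.compile(r"total_shards\[(\d+)\]"),
}


def parse_slowlog_line(line):
    """Parse one plain-text search slow log entry, or return None."""
    header = HEADER.match(line)
    if header is None:
        return None
    entry = header.groupdict()
    entry["shard"] = int(entry["shard"])
    for name, pattern in NUMERIC_FIELDS.items():
        found = pattern.search(line)
        entry[name] = int(found.group(1)) if found else None
    # The query body contains nested brackets, so decode it as JSON
    # starting right after "source[" instead of matching it with a regex.
    entry["source"] = None
    start = line.find("source[")
    if start != -1:
        try:
            entry["source"], _ = json.JSONDecoder().raw_decode(
                line, start + len("source[")
            )
        except ValueError:
            pass  # truncated or non-JSON source: leave it as None
    return entry


SAMPLE = (
    '[2024-06-15T10:23:45,678][WARN ][i.s.s.query              ] [node-1] '
    '[my_index-2024.06][0] took[1.8s], took_millis[1800], '
    'total_hits[12345 hits], types[], stats[], '
    'search_type[QUERY_THEN_FETCH], total_shards[5], '
    'source[{"query":{"bool":{"must":[{"match":{"title":"elasticsearch"}}]}}}],'
)
entry = parse_slowlog_line(SAMPLE)
```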

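5. Finding Hot Spots from the Slow Log

Step 1: Collect the slow log entries

Find the top N slowest queries in the slow log, paying particular attention to:

Slow queries that keep recurring on the same index
Entries with a very large total_hits (a sign that a lot of data was scanned)
Queries that use a leading wildcard (*keyword) or a script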
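
The checks in the list above can be sketched in a few lines of Python. The example assumes each entry has already been parsed into a dict with index, took_millis, and total_hits keys; that shape, the function name, and the 100,000-hit cutoff are assumptions for illustration rather than anything ES prescribes.

```python
from collections import Counter


def summarize(entries, top_n=10, big_hits=100_000):
    """Return (slowest entries, slow-entry count per index, large-hit entries)."""
    slowest = sorted(entries, key=lambda e: e["took_millis"], reverse=True)[:top_n]
    per_index = Counter(e["index"] for e in entries)
    heavy = [e for e in entries if e["total_hits"] >= big_hits]
    return slowest, per_index, heavy


# Made-up entries, only to show the call.
entries = [
    {"index": "product_search", "took_millis": 1800, "total_hits": 12345},
    {"index": "product_search", "took_millis": 950, "total_hits": 250000},
    {"index": "logs-2026.05", "took_millis": 5200, "total_hits": 40},
]
slowest, per_index, heavy = summarize(entries, top_n=2)
```

An index that dominates per_index is usually the better place to start than the single slowest query, since a recurring pattern costs more in total.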

Step 2: Analyze the DSL

Copy the JSON from source into Kibana Dev Tools and run it with profiling enabled:

GET /product_search/_search
{
  "profile": true,    # enable the Profile API to see the time spent in each phase
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ]
    }
  }
}

The profile field in the response looks like this:

{
  "profile": {
    "shards": [
      {
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "title:elasticsearch",
                "time_in_nanos": 234567890,
                "breakdown": {
                  "build_scorer": 34567890,
                  "next_doc": 198765432,
                  ...
                }
              }
            ]
          }
        ]
      }
    ]
  }
}

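Step 3: Locate the hot spot

Profile field	What a high value suggests
`next_doc` high	Many matching documents to iterate through
`advance` high	Heavy skipping between candidate documents, typical of conjunctions (must/filter clauses)
`build_scorer` high	Expensive per-segment scorer setup, for example term expansion in wildcard or range queries
`match` high	Costly second-phase verification of candidates, as in phrase queries
`score` high	Heavy scoring computation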
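
As a small aid for this step, the Python sketch below walks a profile response and reports, for every query node, which breakdown timer is the largest. It assumes the response shape shown earlier (profile, shards, searches, query, with nested nodes under children) and skips the *_count counters; the function name and the tuple layout are illustrative.

```python
def hottest_breakdown(response):
    """Return (type, description, timer name, nanos) for every query node."""
    results = []

    def walk(node):
        # Keep only timers; keys ending in _count are invocation counters.
        timers = {
            key: value
            for key, value in node.get("breakdown", {}).items()
            if not key.endswith("_count")
        }
        if timers:
            key = max(timers, key=timers.get)
            results.append((node.get("type"), node.get("description"), key, timers[key]))
        for child in node.get("children", []):
            walk(child)

    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                walk(query)
    return results


# Trimmed-down response using the numbers from the example output.
response = {
    "profile": {"shards": [{"searches": [{"query": [{
        "type": "BooleanQuery",
        "description": "title:elasticsearch",
        "time_in_nanos": 234567890,
        "breakdown": {
            "build_scorer": 34567890,
            "next_doc": 198765432,
            "next_doc_count": 12345,
        },
    }]}]}]}
}
hot = hottest_breakdown(response)
```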

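6. Pitfalls from the Field

Case 1: A leading-wildcard query saturates the CPU

Symptom: every night at 2 a.m. CPU usage spikes to 95% for about five minutes.

Investigation: the slow log contained this query:

{ "wildcard": { "username": { "value": "*admin*" } } }

A leading wildcard (*admin*) forces ES to scan every term in the field's term dictionary, because the index cannot narrow the search by prefix.

Fix: if prefix matching is enough, replace the wildcard with match_phrase_prefix. For true substring matching, index the field with an ngram tokenizer, or consider the wildcard field type available since 7.9.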
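
Queries like this can be flagged automatically once the source JSON has been extracted from the slow log. The following Python sketch walks a query DSL dict and collects wildcard patterns that start with * or ?. It handles the short form and the value/wildcard object forms of the wildcard query; the function name is made up for this example, and other pattern-based queries such as regexp or query_string are not covered.

```python
def find_leading_wildcards(node, found=None):
    """Collect (field, pattern) pairs for wildcard queries with a leading * or ?."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "wildcard" and isinstance(value, dict):
                for field, spec in value.items():
                    # Object form {"value": "..."} / {"wildcard": "..."} or short form "...".
                    if isinstance(spec, dict):
                        pattern = spec.get("value", spec.get("wildcard"))
                    else:
                        pattern = spec
                    if isinstance(pattern, str) and pattern[:1] in ("*", "?"):
                        found.append((field, pattern))
            else:
                find_leading_wildcards(value, found)
    elif isinstance(node, list):
        for item in node:
            find_leading_wildcards(item, found)
    return found


query = {
    "query": {
        "bool": {
            "must": [
                {"wildcard": {"username": {"value": "*admin*"}}},
                {"wildcard": {"email": "admin*"}},
            ]
        }
    }
}
offenders = find_leading_wildcards(query)
```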

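Case 2: Deep pagination leading to OOM

Symptom: one query occasionally takes 30 seconds, accompanied by Full GC.

Slow log analysis:

took[32.5s], source[{"from":10000,"size":100,...}]

from + size = 10100, so every shard has to build a priority queue holding its top 10,100 hits. With 5 shards, the coordinating node then has to merge 50,500 entries, which puts heavy pressure on the heap. This request also exceeds the default index.max_result_window of 10000, so it can only run on an index where that limit has been raised.

Fix: replace deep pagination with search_after.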
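
The arithmetic and the fix can be sketched together in Python. The first function reproduces the cost estimate above; the second builds a request body that pages with search_after instead of from. The sort fields created_at and id in the usage lines are hypothetical; search_after needs a deterministic sort, ideally with a unique tiebreaker field.

```python
def deep_page_cost(from_, size, shards):
    """Return (entries queued per shard, entries merged by the coordinator)."""
    per_shard = from_ + size
    return per_shard, per_shard * shards


def next_page_body(query, sort, size, last_sort_values=None):
    """Build a search body that pages with search_after instead of from.

    last_sort_values is the "sort" array of the last hit on the previous
    page; pass None for the first page.
    """
    body = {"query": query, "sort": sort, "size": size}
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body


per_shard, merged = deep_page_cost(10000, 100, 5)

# Hypothetical sort: newest first, with "id" as a tiebreaker.
sort = [{"created_at": "desc"}, {"id": "asc"}]
first_page = next_page_body({"match_all": {}}, sort, 100)
second_page = next_page_body({"match_all": {}}, sort, 100, ["2024-06-15T10:23:45Z", "doc-4711"])
```

With search_after each shard only keeps size entries per page, no matter how far into the result set the client has paged.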

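Case 3: Using the indexing slow log to find a bulk bottleneck

Symptom: write latency occasionally spikes to 2 seconds.

Indexing slow log:

took[2.1s], source[{"status":"pending","data":"...many log fields..."}]

Investigation showed that the source of some documents exceeded 1 MB because it contained a Base64-encoded image field.

Fix: keep images out of ES. Store them in object storage and index only the URL.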
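
A guard on the write path can catch such documents before they reach a bulk request. The Python sketch below flags documents whose serialized size exceeds a limit; the 1,000,000-byte default mirrors the 1 MB figure from this case, and the function name is an assumption. The size is that of the JSON produced by json.dumps, which approximates but may not exactly equal what the client sends.

```python
import json


def oversized_docs(docs, limit_bytes=1_000_000):
    """Return (position, size in bytes) for documents larger than the limit."""
    flagged = []
    for position, doc in enumerate(docs):
        size = len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))
        if size > limit_bytes:
            flagged.append((position, size))
    return flagged


docs = [
    {"status": "pending", "data": "small payload"},
    {"status": "pending", "image": "A" * 2_000_000},  # stand-in for a Base64 image
]
flagged = oversized_docs(docs)
```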

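7. Recommended Tools for Slow Log Analysis

Tool	Use case	Notes
Filebeat + Logstash back into ES	Production	Ships the slow logs back into ES so they can be visualized in Kibana
es-query-finder	Script-based analysis	Python script that parses slow log files
esrally	Benchmarking	Runs a standard query set to establish a performance baseline
ElasticHQ	GUI management	Built-in slow query monitoring panel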

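8. Best Practices

Practice	Notes
Tier slow log thresholds by index	Tight thresholds for core business indices, loose ones for log archive indices
Manage thresholds with index templates	So that newly created indices are not missed
Enable only warn/info in production	debug/trace will flood the logs
Ship slow logs back into ES	Makes aggregation and analysis in Kibana easy
Review the top 10 slow queries regularly	Optimize continuously instead of waiting for an incident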

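Summary

The slow log is the "scene of the crime" for ES performance work. With sensible thresholds, a working knowledge of the log format, and the Profile API to pinpoint bottlenecks, you can quickly find and fix the queries that drag a cluster down. Most importantly, make slow log analysis a routine process rather than reactive firefighting after an incident.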