Troubleshooting Logstash "Timeout executing grok"

Contents

  • Troubleshooting "Timeout executing grok"
    • 1. Background
      • 1. Base configuration
      • 2. Pipeline configuration
    • 2. Symptoms
    • 3. First attempt: tuning the memory queue and poll size
    • 4. Pinpointing the problem with GC logs

Troubleshooting "Timeout executing grok"

While running Logstash I kept hitting this error, and it made Logstash unstable. Since the input side is Kafka, the instability triggered repeated Kafka rebalances. My initial troubleshooting direction was wrong, and the material I found online each covered only one of the possible causes, which steered me even further off course.

1. Background

I run a group of Logstash instances that consume log data from Kafka, parse it, and write it to Elasticsearch to provide log search. The setup looks roughly like this.

1. Base configuration

  • Logstash version: 7.5.0

  • JVM

  • java -version
        openjdk version "1.8.0_312"
        OpenJDK Runtime Environment (build 1.8.0_312-b07)
        OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
  • -Xms6g -Xmx6g
  • config/logstash.yml

  • pipeline.workers: 6
        pipeline.batch.size: 50

2. Pipeline configuration

    input {
      kafka {
        bootstrap_servers => "xxx:9092"
        topics_pattern => "log_.*"
        auto_offset_reset => "latest"
        group_id => "log-es-find"
        consumer_threads => 1
        client_id => "node123"
        partition_assignment_strategy => "org.apache.kafka.clients.consumer.RoundRobinAssignor"
        max_poll_interval_ms => "600000"
        max_poll_records => "100"
        codec => json { charset => "UTF-8" }
      }
    }

    filter {
      mutate {
        remove_field => [ "host", "@timestamp" ]
      }
      if "_LOGV3_" not in [message] {
        grok {
          match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} - %{SESSION_ID:session_id} %{GREEDYDATA:message_log}$' }
          remove_field => ["message"]
        }
      } else {
        grok {
          match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} _LOGV3_ %{SESSION_ID:session_id} %{EVENT_MARK:event} %{GREEDYDATA:message_log}$' }
        }
        json {
          source => "message_log"
          target => "doc.custom"
          remove_field => ["message", "message_log"]
        }
        mutate {
          rename => { "doc.custom" => "doc.%{event}" }
        }
      }
      if "_groktimeout" in [tags] {
        grok {
          match => { "message" => "^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type}" }
          remove_tag => [ "_groktimeout" ]
          add_tag => [ "timeout" ]
        }
      }
      if "_grokparsefailure" in [tags] {
        grok {
          remove_tag => [ "_grokparsefailure" ]
          add_tag => [ "grok_failure" ]
          match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{GREEDYDATA:message_log}$' }
        }
      }
      date {
        match => ["time", "ISO8601"]
        target => "@timestamp"
        locale => "cn"
      }
      mutate {
        remove_field => ["time", "@metadata", "host", "input", "agent", "ecs", "log"]
      }
      if "_grokparsefailure" in [tags] {
        mutate {
          add_field => { "service_type" => "grok_failure" }
        }
      }
      if "_groktimeout" in [tags] {
        mutate {
          add_field => { "service_type" => "grok_time_out" }
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["http://es:9200"]
        index => "mai_log_%{[service_type]}_detail_%{+YYYY.MM.dd}"
        user => "elastic"
        password => "xxx"
      }
    }
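
One note on the patterns: LOGLEVEL, TIMESTAMP_ISO8601, HOST, and GREEDYDATA are stock grok patterns, but SERVICCE, STRING, SESSION_ID, and EVENT_MARK are custom ones, so the grok filters above rely on a patterns file (loaded via patterns_dir, which I have not shown). The sketch below is only illustrative, inferred from the sample log line quoted later; EVENT_MARK in particular is a pure guess:

    # ./patterns/custom  (load with patterns_dir => ["./patterns"] inside each grok filter)
    SERVICCE    [a-zA-Z0-9_\-]+
    STRING      \S+
    SESSION_ID  [0-9a-f]{32}
    EVENT_MARK  \S+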

2. Symptoms

Because I had built this stack before, I set it up quickly and did not do much extra Logstash configuration. Crucially, I did not configure GC logging, so no GC log was being written, and that clouded my later judgment.

Logstash misbehaved soon after startup. One problem was data-related: some records carry a JSON payload whose set of keys can grow without bound, which broke several ES indices; switching the affected fields to the flattened mapping type fixed that (sketched a little further below). But a second problem remained: after roughly 7-8 hours of running, Logstash would become unstable, repeatedly losing contact with Kafka and triggering constant rebalances. The logs contained entries like this:

    [2021-11-25T11:22:57,107][WARN ][logstash.filters.grok ][main] Timeout executing grok '%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} - %{SESSION_ID:session_id} %{GREEDYDATA:message_log}$' against field 'message' with value 'Value too large to output (3876 bytes)! First 255 chars are: debug 2021-11-25T11:01:35.303029+08:00 kvm-jobs-job-index-online3 go_recall_job 30343 30349 - 10e11e80e227f6053f1637809295e246 uid=187853021, user={"Uid":187853021,"ExpectPosition":"跨境運(yùn)營(yíng)","LikeWords":[{"Term":"運(yùn)營(yíng)","Weight":0.62232894},{"Term":"跨境","Weight"'!
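
As an aside, this warning is governed by the grok filter's timeout_millis option (30000 ms by default); events that exceed it are tagged _groktimeout, which is why the pipeline above has a handler for that tag. A minimal sketch of the knob, with a placeholder pattern:

    filter {
      grok {
        match => { "message" => "^%{LOGLEVEL:log_level} %{GREEDYDATA:rest}" }
        timeout_millis => 30000  # default; events exceeding this are tagged _groktimeout
      }
    }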

The constant rebalancing left Logstash nearly unable to make progress.
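
For reference, the flattened change mentioned above amounts to mapping the dynamic-key field as flattened, so ES stores the whole JSON object under a single mapping entry instead of creating one mapping field per key. A minimal sketch using a legacy index template; the template name, index pattern, and field name are placeholders, not my production ones:

    PUT _template/mai_log_flattened
    {
      "index_patterns": ["mai_log_*"],
      "mappings": {
        "properties": {
          "doc.custom": { "type": "flattened" }
        }
      }
    }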

3. First attempt: tuning the memory queue and poll size

At this point I suspected that the grok pattern for certain fields was too complex, causing Logstash to time out while processing them. Searching for "Timeout executing grok" mostly turned up the same explanation, an overly complex grok pattern, so, naively, I set about optimizing my patterns (which were not actually complex). Along the way I studied Logstash's consumption model: the input stage and the (filter + output) stage are decoupled. The input only pulls data and places it in an in-memory buffer, the memory queue; the filter workers consume from that queue. The memory queue is bounded, so when filter + output are slow, the input blocks. That raised a new theory: maybe each poll fetched too much data from Kafka, the filters could not keep up, the input waited too long, and the Kafka session timed out. I assumed the "Timeout executing grok" errors were caused by a small number of unusual records, which also matched how sporadically the problem appeared.

    于是上面的一些配置做了改動(dòng)

    pipeline.workers: 6
    pipeline.batch.size: 50

    max_poll_interval_ms => "600000"
    max_poll_records => "100"

These are actually the post-change values; before this I was on the defaults, which are larger.
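
To be explicit about where each knob lives (the values are the post-change ones above): the pipeline settings belong in config/logstash.yml, while the poll settings are options of the kafka input plugin in the pipeline file:

    # config/logstash.yml -- pipeline-level settings
    pipeline.workers: 6        # filter+output worker threads
    pipeline.batch.size: 50    # events each worker pulls from the memory queue per batch

    # pipeline .conf -- consumer settings inside the kafka input block
    input {
      kafka {
        # ... other options as shown in the pipeline configuration above ...
        max_poll_interval_ms => "600000"  # allowed gap between polls before Kafka evicts the consumer
        max_poll_records => "100"         # records returned by a single poll
      }
    }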

I restarted Logstash, confident this would be the end of it. About a day later the same problem was back. Out of ideas, I decided the only option left was to analyze the GC behavior, so I enabled GC logging.

4. Pinpointing the problem with GC logs

Enable GC logging in jvm.options:

    ## GC logging
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCDateStamps
    -XX:+PrintClassHistogram
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime

    # log GC status to a file with time stamps
    # ensure the directory exists
    -Xloggc:./gc_logs/gc.log
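
Once the log is being written, a quick way to check whether full GCs dominate (a minimal sketch, assuming the log path above):

    # how many full GCs so far?
    grep -c "Full GC" gc_logs/gc.log
    # inspect the most recent stop-the-world pauses
    grep "application threads were stopped" gc_logs/gc.log | tail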

After another restart, I watched the GC log once the problem recurred. Full GCs were happening back to back, so much so that Logstash could barely do any work because nearly all of its time went to GC. Worse, heap usage did not drop after a full GC (the old generation stayed at 4194303K->4194303K, i.e. completely full), the classic signature of a memory leak. Apparently I had triggered some Logstash bug:

    2021-11-27T17:36:21.048+0800: 78068.514: [CMS-concurrent-mark-start]
    2021-11-27T17:36:23.878+0800: 78071.344: [Full GC (Allocation Failure) 2021-11-27T17:36:23.878+0800: 78071.344: [CMS2021-11-27T17:36:25.512+0800: 78072.978: [CMS-concurrent-mark: 4.462/4.464 secs] [Times: user=39.30 sys=0.82, real=4.47 secs](concurrent mode failure): 4194303K->4194303K(4194304K), 35.7906436 secs] 6081790K->4204353K(6081792K), [Metaspace: 103267K->103267K(1155072K)], 35.7908146 secs] [Times: user=44.59 sys=0.67, real=35.79 secs]
    2021-11-27T17:36:59.669+0800: 78107.135: Total time for which application threads were stopped: 35.7917723 seconds, Stopping threads took: 0.0000556 seconds
    2021-11-27T17:36:59.673+0800: 78107.139: Total time for which application threads were stopped: 0.0027825 seconds, Stopping threads took: 0.0001835 seconds
    2021-11-27T17:37:00.674+0800: 78108.140: Total time for which application threads were stopped: 0.0007674 seconds, Stopping threads took: 0.0000516 seconds
    2021-11-27T17:37:01.669+0800: 78109.135: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194303K(4194304K)] 4310321K(6081792K), 0.0121234 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
    2021-11-27T17:37:01.681+0800: 78109.147: Total time for which application threads were stopped: 0.0129999 seconds, Stopping threads took: 0.0000897 seconds
    2021-11-27T17:37:01.681+0800: 78109.147: [CMS-concurrent-mark-start]
    2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-mark: 4.678/4.678 secs] [Times: user=43.90 sys=1.14, real=4.67 secs]
    2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-preclean-start]
    2021-11-27T17:37:06.569+0800: 78114.035: [Full GC (Allocation Failure) 2021-11-27T17:37:06.569+0800: 78114.035: [CMS2021-11-27T17:37:12.639+0800: 78120.106: [CMS-concurrent-preclean: 6.278/6.280 secs] [Times: user=7.09 sys=0.17, real=6.28 secs](concurrent mode failure): 4194303K->4194303K(4194304K), 40.9531845 secs] 6081787K->4205126K(6081792K), [Metaspace: 103268K->103268K(1155072K)], 40.9533777 secs] [Times: user=39.95 sys=0.84, real=40.96 secs]
    2021-11-27T17:37:47.523+0800: 78154.989: Total time for which application threads were stopped: 40.9546434 seconds, Stopping threads took: 0.0000614 seconds

I used jmap to look at the heap distribution:

    jmap -histo:live $pid

     num     #instances         #bytes  class name
    ----------------------------------------------
       1:      17289297      691571880  org.jruby.util.ByteList
       2:      17245027      689801080  org.jruby.RubyString
       3:      17417048      659620560  [C
       4:      17334568      554706176  java.util.concurrent.ConcurrentHashMap$Node
       5:      17214558      550865856  org.logstash.FieldReference
       6:      17296590      514218192  [B
       7:      17416092      417986208  java.lang.String
       8:          8271      135421720  [Ljava.util.concurrent.ConcurrentHashMap$Node;
       9:        139671        4469472  java.util.HashMap$Node
      10:         74368        4338232  [Ljava.lang.Object;

Something was clearly wrong: tens of millions of org.jruby.RubyString, ConcurrentHashMap$Node, and org.logstash.FieldReference instances surviving a live histogram. I opened a thread on the Elastic discuss forum, and sure enough it was a Logstash bug; frustratingly, someone had already submitted a commit for it, but the fix had not been released yet.
What triggered the bug in my case: some of my JSON payloads use poorly designed, effectively unbounded keys, so the total number of distinct JSON keys across all logs runs into the hundreds of millions. Logstash caches a FieldReference for every distinct field name it sees and never evicts them (which matches the histogram above), so memory leaked. I was hitting it through the json filter.
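
Until the fix ships, one way to sidestep the bug is to stop expanding the unbounded-key JSON into Logstash event fields at all. The sketch below is a hypothetical workaround, not the upstream fix: it replaces the json { source => "message_log" ... } step and keeps the payload as a single opaque string (doc_custom_json is a placeholder name), so no per-key FieldReference is ever created. The trade-off is that the payload is no longer structured inside Logstash:

    filter {
      # the dynamic JSON stays one string field, so the FieldReference
      # cache cannot grow with the number of distinct keys
      mutate {
        rename => { "message_log" => "doc_custom_json" }
      }
    }

Fixing the key generation upstream, so the key space is bounded, would of course be the cleaner long-term answer.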

Looking back at this problem: at first I was convinced the fault was in my own usage, so I never bothered to enable GC logging; and leaning on past experience, I assumed the ELK stack would have few bugs. I really did not expect to take such a long detour and trip over an actual one.
