记一次内存泄露问题的解决

Viewed 205

我是通过下面的配置进行/var/log/messages文件的错误日志监控的,现在发现一个问题,当messages文件定时产生

Apr  6 18:00:09 mt-hermesContract-101 crond: sendmail: warning: inet_protocols: disabling IPv6 name/address support: Address family not supported by protocol
Apr  6 18:00:09 mt-hermesContract-101 crond: postdrop: warning: inet_protocols: disabling IPv6 name/address support: Address family not supported by protocol

这种错误的时候,categraf占用的内存会一直增长
categraf 版本为0.2.31.
system.mtail配置文件如下:

# file BlockForMore.mtail:
counter system_blocked_morethan_error
/blocked for more than/ {
 system_blocked_morethan_error++
}

# file BUGHardareLookup.mtail:
counter system_hardlookup_bug
/Hard LOCKUP/ {
 system_hardlookup_bug++
}

# file BUGSoftLookup.mtail:
counter system_softlookup_bug
/BUG: soft lockup/ {
 system_softlookup_bug++
}

# file CallTrace.mtail:
counter system_call_trace_error
/Call Trace/ {
  system_call_trace_error++
}

# file HardwareError.mtail:
counter system_hardware_error
/Hardware Error/ {
 system_hardware_error++
}

# file KernelBug.mtail:
counter system_kernel_bug
/kernel BUG at/ {
 system_kernel_bug++
}

# file KernelPanic.mtail:
counter system_kernel_panic_error
/Kernel panic/ {
 system_kernel_panic_error++
}

# file MemoryCgroupOutOfMemory.mtail:
counter cgroup_out_of_memory_error
/Memory cgroup out of memory/ {
  cgroup_out_of_memory_error++
}

# file OutOfMemory.mtail:
counter system_out_of_memory_error
/Out of memory/ {
  system_out_of_memory_error++
}

# file UnableHandleKernel.mtail:
counter system_unable_handle_kernel_error
/BUG: unable to handle kernel/ {
 system_unable_handle_kernel_error++
}

图片.png
mtail 插件配置:

[root@mt-tfFileConvert-101 ~]# cat /httx/run/categraf/conf/input.mtail[[instances]]
progs = "/httx/run/categraf/progs/system.mtail" # prog dir1
logs = ["/var/log/messages"]
# override_timezone = "Asia/Shanghai"
# emit_metric_timestamp = "true" #string type

# [[instances]]
# progs = "/path/to/prog2" # prog dir2
# logs = ["/path/to/logdir/"]
# override_timezone = "Asia/Shanghai"
# emit_metric_timestamp = "true" # string type

现在是偶发的几台服务器出现categraf内存泄露问题,看categraf日志找到如下日志

2023/04/06 15:48:51 collector.go:74: I! Everything is ready for traces, begin tracing.
2023/04/06 16:48:51 store.go:157: Running Store.Expire()

出现store.go的日志之后,categraf进程内存会增加占用,然后查代码发现,store.go是mtail使用的,倒查发现了messages的报错,而且出现内存泄露的服务器都有这个问题,其他正常服务器是关闭系统邮件服务的

经过定位排查,发现是trace链路数据采集的问题,数据量大会产生堆积,导致内存泄露,优化配置解决

感谢反馈,categraf又清白了 /doge

3 Answers

你给出的规则和日志没办法支持你的结论。
贴一下完整的配置。
另外,排查下是否用了通配符,然后/var/log下又有很多日志呢,比如清理周期长,不停的产生老文件?

mtail 插件配置:

[root@mt-tfFileConvert-101 ~]# cat /httx/run/categraf/conf/input.mtail[[instances]]
progs = "/httx/run/categraf/progs/system.mtail" # prog dir1
logs = ["/var/log/messages"]
# override_timezone = "Asia/Shanghai"
# emit_metric_timestamp = "true" #string type

# [[instances]]
# progs = "/path/to/prog2" # prog dir2
# logs = ["/path/to/logdir/"]
# override_timezone = "Asia/Shanghai"
# emit_metric_timestamp = "true" # string type

会不会有一种可能,是因为categraf本身的日志也写在message中呢?
因为检测message导致categraf产生新的日志->写入message->categraf重新检测。。。。死循环了

我的categraf日志是输出到当前目录的out.log文件的

[log]
# file_name is the file to write logs to
file_name = "out.log"
# options below will not be work when file_name is stdout or stderr
# max_size is the maximum size in megabytes of the log file before it gets rotated. It defaults to 100 megabytes.
max_size = 1024
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.  

你好楼主,看了你的监控项,挺全的,不知道能否分享下告警规则和监控大盘?