disk_device_error metric is not being collected correctly


Version info: categraf: v0.3.18 n9e: 6.7.2 os: rhel7.9
Symptom: the kernel log shown by dmesg reports disk failures, but disk_device_error stays at 0.

[24250333.761624] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250333.761632] sd 0:2:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 05 f6 c0 8b 08 00 00 00 08 00 00
[24250333.761636] blk_update_request: I/O error, dev sdb, sector 25614650120
[24250333.763544] EXT4-fs warning (device sdb1): __ext4_read_dirblock:903: error reading directory block (ino 400228521, block 0)
[24250333.766782] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250333.766788] sd 0:2:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 05 f6 c0 8b 08 00 00 00 08 00 00
[24250333.766791] blk_update_request: I/O error, dev sdb, sector 25614650120
[24250333.768684] EXT4-fs warning (device sdb1): __ext4_read_dirblock:903: error reading directory block (ino 400228521, block 0)
[24250378.539706] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250378.539727] sd 0:2:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 05 f6 c0 8b 08 00 00 00 08 00 00
[24250378.539734] blk_update_request: I/O error, dev sdb, sector 25614650120
[24250378.541579] EXT4-fs error (device sdb1): ext4_find_entry:1318: inode #400228521: comm prometheus: reading directory lblock 0
[24250378.545220] EXT4-fs (sdb1): previous I/O error to superblock detected
[24250378.547096] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250378.547102] sd 0:2:1:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 00 00 00 00 08 00 00
[24250378.547105] blk_update_request: I/O error, dev sdb, sector 2048
[24250378.548948] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
[24250393.775773] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250393.775782] sd 0:2:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 05 f6 c0 8b 08 00 00 00 08 00 00
[24250393.775786] blk_update_request: I/O error, dev sdb, sector 25614650120
[24250393.777702] EXT4-fs warning (device sdb1): __ext4_read_dirblock:903: error reading directory block (ino 400228521, block 0)
[24250393.780913] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[24250393.780920] sd 0:2:1:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 05 f6 c0 8b 08 00 00 00 08 00 00
[24250393.780923] blk_update_request: I/O error, dev sdb, sector 25614650120
[24250393.782819] EXT4-fs warning (device sdb1): __ext4_read_dirblock:903: error reading directory block (ino 400228521, block 0)
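
The kernel lines above follow a regular pattern, so the failing device can be tallied straight from the log. The following is only a sketch: the regular expression and the helper name `count_io_errors` are assumptions made for illustration, it is not how categraf computes disk_device_error, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   blk_update_request: I/O error, dev sdb, sector 25614650120
# It deliberately does not match "Buffer I/O error on dev ..." lines.
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of block I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[24250333.761636] blk_update_request: I/O error, dev sdb, sector 25614650120",
    "[24250333.763544] EXT4-fs warning (device sdb1): __ext4_read_dirblock:903: error reading directory block (ino 400228521, block 0)",
    "[24250378.547105] blk_update_request: I/O error, dev sdb, sector 2048",
    "[24250378.548948] Buffer I/O error on dev sdb1, logical block 0, lost sync page write",
]

io_errors = count_io_errors(sample)
print(dict(io_errors))
```

On a live host the same function could be fed the output of `dmesg` line by line.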

categraf log in debug mode

09:45:59 disk_device_error agent_hostname=1.1.1.1  device=sdb1 fstype=ext4 mode=rw path=/data 0

I would like to ask:

  1. How is disk_device_error implemented?
  2. Has it ever successfully captured a disk error before, and if so, what kind of error was it?
3 Answers

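Can you check the disk_device_error data over a period of time in the instant query view? Is there ever a 1?

No, there is never a 1. It stays at 0 the whole time.

What time did the dmesg errors occur? You can check with dmesg -T.

With dmesg -T the errors show up at around 9 a.m. today. I ran an instant query over this morning's data and it is all 0 as well.

Just now I manually triggered reads and writes on the failing disk. dmesg recorded the disk errors with the current time, but n9e still shows 0.

Can you post a screenshot?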

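After the reminder I watched the logs in real time again. The timestamps recorded by dmesg turn out to be 3 minutes earlier than the ones categraf records, that is, earlier than the operating system clock. Would this affect metric collection?
The log lines in the screenshot were produced at the same moment, so the dmesg time is the one that is wrong. The value reported by categraf afterwards has also stayed at 0.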
image.png
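
On the timestamp gap: the raw number in square brackets in dmesg output is seconds since boot, and `dmesg -T` turns it into wall-clock time by adding it to the boot time. The dmesg manual warns that this conversion can be inaccurate, and with an uptime of roughly 280 days (24250333 s) a drift of a few minutes is plausible. A rough sketch of that conversion follows; the helper names and the `boot_time` value are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Leading "[seconds.microseconds]" stamp of a raw dmesg line.
STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_since_boot(line):
    """Extract the seconds-since-boot stamp, or None if the line has none."""
    match = STAMP_RE.match(line)
    return float(match.group(1)) if match else None


def to_wall_clock(line, boot_time):
    """Approximate wall-clock time of a dmesg line: boot time + kernel stamp."""
    secs = seconds_since_boot(line)
    if secs is None:
        return None
    return boot_time + timedelta(seconds=secs)


# Hypothetical boot time, chosen only to make the example concrete.
boot_time = datetime(2023, 1, 1, tzinfo=timezone.utc)
line = "[24250333.761636] blk_update_request: I/O error, dev sdb, sector 25614650120"
print(to_wall_clock(line, boot_time))
```

If the check really is df-like, as described further down in this thread, the dmesg timestamp would not be an input to it, so this offset alone would not be expected to keep the metric at 0.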

Run df -h /dev/sdb (put the failing partition after -h) and post a screenshot of the output.

df -h /dev/sdb1

Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 14T 32G 13T 1% /container

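device error is currently detected mainly through something similar to df -h, which picks up partition-level errors. What is curious here is that dmesg reports errors while df -h shows the partition as healthy.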
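
To illustrate what a df-like check looks like, here is a minimal sketch. It assumes the check amounts to asking the kernel for filesystem statistics on the mount point and reporting 1 only when that call fails. The helper name `device_error` is hypothetical and this is not categraf's actual code.

```python
import os


def device_error(mount_point):
    """Return 1 if filesystem statistics cannot be read for mount_point, else 0.

    Hypothetical helper mirroring a df-style check: it uses the same kind of
    statfs query that df relies on and does not read any file data.
    """
    try:
        os.statvfs(mount_point)
    except OSError:
        return 1
    return 0


print(device_error("/"))
```

A query like this is usually answered from filesystem metadata the kernel already holds in memory, so it can keep succeeding while reads of specific sectors fail. That would be consistent with df -h looking normal and the metric staying at 0 even though dmesg shows I/O errors.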