filesystem error count metric #3113

@anarcat

We are porting various alerts from Nagios to the Prometheus ecosystem, and we've found one useful Nagios check that seems to be missing from the node exporter. It looks at EXT filesystems with the tune2fs -l command and (basically) greps for the "FS Error count" field.
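To make the "grep for the field" step concrete, here is a minimal sketch of the parsing in isolation. The helper name `fs_error_count` and the sample text are made up for illustration, and treating a missing field as zero is an assumption, not something the original check does explicitly.

```ruby
# Hypothetical helper: pull the "FS Error count" value out of `tune2fs -l` output.
# Assumption: if the field is absent, treat the count as 0.
def fs_error_count(tune2fs_output)
  match = tune2fs_output.match(/^FS Error count:\s*(\d+)/)
  match ? match[1].to_i : 0
end

# Abbreviated, made-up sample in the general shape of `tune2fs -l` output.
sample = <<~OUT
  Filesystem volume name:   <none>
  Filesystem state:         clean with errors
  FS Error count:           3
OUT

puts fs_error_count(sample)
```

Keeping the parsing separate from the `tune2fs` invocation makes it possible to test it without root access to a block device.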

This should normally be zero, but under certain circumstances (failing disk, filesystem bug, power outage) it will rise. Running fsck on the filesystem will fix this. Normally, a reboot after a power outage runs fsck, but under certain circumstances it might not do so fully.

So I think the node exporter should do this. I've tried to find metrics about this in our node exporters and couldn't find anything under the node_filesystem_* namespace. There is node_filesystem_readonly and, according to this post, node_filesystem_device_error (but I can't see that metric here), but neither of those is the same as the error count.

Am I missing something, or is this missing from the node exporter?

Here's a copy of the check, called dsa-check-filesystems here:

#!/usr/bin/ruby

require 'filesystem'

ignorefs = ["NFS", "nfs", "nfs4", "nfsd", "afs", "binfmt_misc", "proc", "smbfs",
	   "autofs", "iso9660", "ncpfs", "coda", "devpts", "ftpfs", "devfs",
	   "mfs", "shfs", "sysfs", "cifs", "lustre_lite", "tmpfs", "usbfs",
	   "udf", "fusectl", "fuse.snapshotfs", "rpc_pipefs"]
mountpoints = {}

FileSystem.mounts.each do |m|
	if ((not ignorefs.include?(m.fstype)) && (m.options !~ /bind/))
		mountpoints[m.device] = { 'type' => m.fstype, 'mount' => m.mount }
	end
end

def check_ext3(dev, mnt)
	output=%x{tune2fs -l #{dev}}
	if output =~ /FS Error count:\s*(\d+)/ and $1.to_i > 0
		return "#{dev} (#{mnt}) has #{$1} errors"
	end
end

output = []
mountpoints.keys.each do |m|
	temp = ''
	begin
		if mountpoints[m]['type'] =~ /ext/
			temp = check_ext3(m, mountpoints[m]['mount'])
		end
	rescue Exception => e
	end
	if temp && (temp.length > 0)
		output << temp
	end
end

if output.length > 0
	puts output.join("\n")
	exit 1
end
puts "OK: All filesystems ok."
exit 0
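In the meantime, one possible workaround is to have a script like the one above write its results for the node exporter's textfile collector instead of exiting with a Nagios status. The sketch below only shows the formatting step. The metric name `node_filesystem_ext_error_count`, the label names, and the helper names are hypothetical, not an existing node exporter metric or API.

```ruby
# Hypothetical metric name; this is NOT an existing node exporter metric.
METRIC = "node_filesystem_ext_error_count"

# Escape backslash, double quote and newline in a label value,
# as the Prometheus text exposition format requires.
def escape_label(value)
  value.gsub(/[\\"\n]/) { |c| c == "\n" ? "\\n" : "\\#{c}" }
end

# counts maps [device, mountpoint] => error count (an Integer).
def render_metrics(counts)
  lines = [
    "# HELP #{METRIC} FS Error count reported by tune2fs -l.",
    # Gauge rather than counter: the value can drop back to zero after fsck.
    "# TYPE #{METRIC} gauge",
  ]
  counts.each do |(device, mountpoint), count|
    lines << format('%s{device="%s",mountpoint="%s"} %d',
                    METRIC, escape_label(device), escape_label(mountpoint), count)
  end
  lines.join("\n") + "\n"
end

# Example with made-up devices and counts.
puts render_metrics({ ["/dev/sda1", "/"] => 0, ["/dev/sdb1", "/srv"] => 3 })
```

The output would be written to a `.prom` file in the directory passed to the node exporter with `--collector.textfile.directory`, typically from a cron job or timer that has enough privileges to run `tune2fs` on the block devices.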
