knoerre - fast check tool and HTTP server for nagios remote checks
knoerre [ key ]
knoerre is a tool for checking a wide range of server parameters. Its
primary purpose is to serve check values to a (remote) requesting instance
such as nagios using simplified HTTP.
It was developed as a replacement for the net-snmp package, which is
oversized, sometimes very buggy, sometimes difficult to configure and often slow.
knoerre uses (should use) tcpserver of DJB’s software suite ucspi-tcp. Only
the brave among yourselves will have the heart to do the daring deed of
using (x)inetd.
The usage of DJB’s daemontools and ucspi-tcp (for tcpserver) is strongly
recommended.
knoerre can be easily set up with knoerre-conf(1).
Access restrictions by IP# can be done with knoerre-update-tcprules(1).
A key is a specific request to knoerre, e.g. "load1". All keys can be used
locally or by HTTP request, e.g. knoerre load1, knoerre diskusage/home or
GET /load1 HTTP/1.1. A key given on the command line takes precedence over
reading an HTTP request from stdin (by tcpserver). An HTTP request is
internally limited to 512 bytes.
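The request format can be exercised with a few lines of client code. The following Python sketch is illustrative only: the function names are hypothetical, the blank line sent after the request line is an assumption, and the check value is taken to be the last non-empty line of the response.

```python
import socket

MAX_REQUEST_BYTES = 512  # request size limit stated in this page


def build_request(key: str) -> bytes:
    """Build a minimal HTTP request line for a knoerre key such as 'load1'."""
    request = f"GET /{key} HTTP/1.1\r\n\r\n".encode("ascii")
    if len(request) > MAX_REQUEST_BYTES:
        raise ValueError("request exceeds the 512-byte limit")
    return request


def last_line_value(response: str) -> str:
    """Return the last non-empty line of a response (assumed to be the check value)."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty response")
    return lines[-1]


def query(host: str, port: int, key: str, timeout: float = 5.0) -> str:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(key))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return last_line_value(b"".join(chunks).decode("ascii", errors="replace"))
```

A call such as query("server", 8888, "load1") would then return the value as a string, which the caller can convert and compare against its own thresholds.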
As with keys on the command line, knoerre can also be used with other kinds
of nagios remote checks: called by ssh, NRPE or the slow snmpd. Nevertheless
the usage of tcpserver is strongly recommended. Using tcpserver and a request
like load1 you’ll receive an approx. 25% faster response than from a local
"/bin/cat /proc/loadavg". A local "knoerre load1" is 4 times faster than
"/bin/cat".
Here’s a short speed comparison, 5000 times remote request "load1":
net-snmp default, default nagios check_snmp: 8 mins 50 secs
NRPE: 43 secs
tcpserver/knoerre: 3 secs
With the recommended usage of daemontools and ucspi-tcp you don’t have to
care about starting, stopping or restarting knoerre. Because knoerre is
started on demand by tcpserver(1), there is no continuously running knoerre
process as with other daemons. The controlling tcpserver process can be
managed with svc(8).
Some basic checks are built into knoerre. These built-in
checks don’t need to call an external program.
- cachedvalue
- Return cached value from a file
Format: cachedvalue/XXXXX/absolute/path/to/file
where XXXXX is the max age in minutes the file may have.
Return the contents of the given file. The file should contain one line
beginning with "OK ", "WARNING " or "CRITICAL " causing knoerre to exit
with the matching exit code.
These conditions will also cause a critical exit code:
lstat error, file’s modtime is older than XXXXX minutes, not a regular file,
empty file, file too large, open error, read error
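The rules of the cachedvalue check can be summarised in a short Python sketch. It is not knoerre's implementation: the exit codes 0/1/2 follow the usual nagios plugin convention, the size limit MAX_SIZE is a hypothetical value, and treating an unrecognised prefix as critical is an assumption.

```python
import os
import stat
import time

OK, WARNING, CRITICAL = 0, 1, 2  # conventional nagios plugin exit codes
MAX_SIZE = 4096                  # hypothetical limit; the real limit is not documented here


def cachedvalue(path, max_age_minutes, now=None):
    """Return (exit_code, text) for a cache file, following the listed conditions."""
    now = time.time() if now is None else now
    try:
        info = os.lstat(path)
    except OSError as err:
        return CRITICAL, f"lstat error: {err.strerror}"
    if not stat.S_ISREG(info.st_mode):
        return CRITICAL, "not a regular file"
    if now - info.st_mtime > max_age_minutes * 60:
        return CRITICAL, "file too old"
    if info.st_size == 0:
        return CRITICAL, "empty file"
    if info.st_size > MAX_SIZE:
        return CRITICAL, "file too large"
    try:
        with open(path, "r", encoding="ascii", errors="replace") as handle:
            text = handle.read()
    except OSError as err:
        return CRITICAL, f"open/read error: {err.strerror}"
    for prefix, code in (("OK ", OK), ("WARNING ", WARNING), ("CRITICAL ", CRITICAL)):
        if text.startswith(prefix):
            return code, text
    # Assumption: a line without a known prefix is treated as critical.
    return CRITICAL, text
```

A cron job that writes "OK ...", "WARNING ..." or "CRITICAL ..." into the file at regular intervals is enough to feed this check; a stale file then shows up as critical.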
- cat
- Cat content of a file.
Format: cat/absolute/path/to/file
"Cat" the content of a given file after "cat/". The first line contains
the filename and also the date of the file (if no error occurred). The last
line of the file should contain an integer value to check by nagios. You
can also use this check to test if an NFS-mounted FS is actually working
by "cat"ting a file which should contain just "1" in a line. But to prevent
blocking knoerre processes you should prefer the nfs check. If an error
or timeout happens then 9999 or a bigger value is returned.
- cmdline
- Return the number of instances of a process by cmdline match.
Format: cmdline/XXXX
where XXXX is a string which should be part of the cmdline.
Like process, but uses /proc/.../cmdline to also detect script processes,
e.g. python loadlogger.py, whose process name is only "python".
- cmp
- Compare a string to the content of a file.
Format: cmp/string/absolute/path/to/file
Compare a string to the content of a file. If the string is equal to the
content (LF is ignored) then 0 is returned otherwise 1. If an error or timeout
happens then 9999 or a bigger value is returned.
- cpu
- Show CPU usage in percent values.
Format: cpuXY/SECONDS
where X is one of (u|n|s|i|w|I) and Y one of (t|c) and optional SECONDS the measuring
interval.
The times of CPU usage can be shown ’t’otal since kernel start or ’c’urrent
values of a measuring interval of 10 seconds default. The CPU times are
’u’ser, ’n’ice, ’s’ystem, ’i’dle or I/O ’w’ait. The ’I’ values are "inverted" against
100 percent, e.g. print 99 for idle of 1%.
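How a "current" percentage and its inverted counterpart can be derived from two samples of the "cpu" line in /proc/stat is sketched below in Python. The function names are hypothetical, and counting only the first five fields (user, nice, system, idle, iowait) towards the total is an assumption; knoerre may compute the total differently.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:6])))


def cpu_percent(before, after):
    """Return per-state percentages for the interval between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: second[name] - first[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    percent = {name: 100.0 * delta[name] / total for name in FIELDS}
    # The 'I' value described above: idle inverted against 100 percent.
    percent["inverted_idle"] = 100.0 - percent["idle"]
    return percent
```

For a measurement, read the first line of /proc/stat, sleep for the measuring interval, read it again and pass both lines to cpu_percent().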
- ctxtswitch
- Return context switches per second.
Format: ctxtswitch
Format: ctxtswitch/SECONDS
Count the context switches per second. If no seconds are given a default
of 10 is used.
- direntries
- Return the number of entries recursively in a directory.
Format: direntries/absolute/path/to/dir
It counts entries in a dir - not inodes. This check is equal to "direntries"
in recursive mode. See direntries(1).
- dirlevels
- Return the maximum recursion level
Format: dirlevels/absolute/path/to/dir
Step recursively into dir, count recursion level and print the max count.
One "@" can be used as wildcard like asterisk.
- diskinodes
- Return used disk inodes percentage.
Format: diskinodes/absolute/path/to/fs
Like diskusage but for inodes and not diskspace.
- diskusage
- Return used disk space percentage.
Format: diskusage/absolute/path/to/fs
Return the amount of used space on a filesystem given after "diskusage/".
NOTE: Because just one simple stat() call is used, you can use this check
also for testing existence of files like e.g. "/var/lib/mysql/mysql.sock".
See nagios-check-diskfree(1).
- fileexists
- Return whether file exists.
Format: fileexists/X/absolute/path/to/file
with X one of [fdcbplsaFDCBPLSA] for f(ile), d(ir), c(har dev), b(lock dev),
p(ipe), l(ink), s(ocket) or a(ny type).
If the file exists and matches the type then 0 is returned, otherwise 1. An
upper case type letter inverts the test. If the file is a small regular file
then its content is also printed before the last line.
- filesizes
- Return (max) filesize(s) in KB.
Format: filesizes/absolute/path/to/file
Get the filesize in KB of a single file or the maximum filesize of a group
of files. You can use one dot or ’@’ as one wildcard like asterisk in a shell.
See Examples.
- filesizesbypattern
- Return (max) filesize(s) by given filename pattern.
Format: filesizesbypattern/XXXXX/Y/absolute/path/to/file-or-dir
Format: filesizesbypatternmaxage/XXXXX/Y/ZZ/absolute/path/to/file-or-dir
where XXXXX is a filename pattern, e.g. log, digit Y is the recursive
search depth and number ZZ is the max age (modtime) in days from 1 to 99.
Get the filesize in KB of a single file or the maximum filesize of a group
of files by a given filename pattern and a maximum depth to search in. You
can use one dot or ’@’ as one wildcard like asterisk in a shell.
- filesizesbysuffix
- Return (max) filesize(s) by given filename suffix.
Format: filesizesbysuffix/XXXXX/Y/absolute/path/to/file-or-dir
where XXXXX is a filename suffix, e.g. .gif, and digit Y is the recursive
search depth.
Get the filesize in KB of a single file or the maximum filesize of a group
of files by a given filename suffix and a maximum depth to search in. You
can use one dot or ’@’ as one wildcard like asterisk in a shell. See Examples.
- filetimestamp
- Return age of file in minutes.
Format: filetimestamp/X/absolute/path/to/file
with X one of [acmoACMO] using access, change or modification time or the
oldest of these.
Upper case means return no error but just 0 if file does not exist. If file
is a small regular file then also print its content before last line.
- kernellog
- Count "bad lines" in kernellog.
Format: kernellog/XX/absolute/path/to/kernellog
where XX is a two-digit number.
As with tslogentries, the first parameter is the number of chars from
the beginning of a log line which must be equal to the beginning of the
last line of kernellog. If you use e.g. kernellog/07/var/log/kernel on Aug
29, then all lines starting with "Aug 29 " are scanned but not lines with
"Aug 28".
"Bad entries" are hardcoded in source and are strings like "access beyond
end of device", "ector repair", "kernel BUG" and more.
Up to 10 "bad lines" of kernellog are returned in lines above the count
return value for nagios.
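The prefix matching used by kernellog can be sketched in Python as follows. The function name is hypothetical, BAD_STRINGS contains only the three strings quoted above (the real list is hardcoded in the source and longer), and reporting the first ten matches rather than the last ten is an assumption.

```python
# Only the strings quoted in this page; knoerre's hardcoded list contains more.
BAD_STRINGS = ("access beyond end of device", "ector repair", "kernel BUG")


def count_bad_lines(lines, prefix_len, bad_strings=BAD_STRINGS, max_report=10):
    """Return (reported_lines, count) for lines sharing the last line's prefix."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if not lines:
        return [], 0
    # The first prefix_len chars of the last line select the lines to scan.
    prefix = lines[-1][:prefix_len]
    bad = [
        line for line in lines
        if line.startswith(prefix) and any(text in line for text in bad_strings)
    ]
    return bad[:max_report], len(bad)
```

With a prefix length of 7 and a last line starting with "Aug 29 ", only lines of Aug 29 are scanned; "ector repair" matches both "Sector repair" and "sector repair".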
- load1 load5 load15
- Return load average per 1/5/15 minutes.
Just return the load average value requested in the last line and all of
/proc/loadavg in the line above.
If knoerre was compiled with gcc-option -DOPENVZDEFAULT then the load value
will be divided by the number of cpu cores online as listed in /proc/stat.
Additionally the number of cores will be appended to the line with loadavg
data.
- loaduser
- Return the highest process count of a single account
Format: loaduser/XXX/YYY
where XXX and YYY are the min/max uid of the processes to be checked.
Return the highest number of running processes owned by a single account. For
every uid in the given range all processes are counted. Up to 3 top users and
their process counts are printed and the value in the last line is the max
process count.
- longprocp
- Return minutes of the longest running user process.
Format: longprocp/XXX/YYY[/A[/B[/C]]]
where XXX and YYY are the min/max uid of the processes to be checked and
the optional A, B, ... are names of processes to be excluded from check (up
to 15).
Check for long running processes. This check returns the time in minutes
of the longest running user process. Its goal is to detect suspicious processes
like PHP-shells of hacked user accounts. The only difference to longprocs
is that min/max uid and process excludes are given by HTTP request and
are not configured in /etc/knoerrerc. It’s useful in cases when you want
to build a monolithic version of knoerre which does not read knoerrerc.
- longprocs
- Return minutes of the longest running user process.
Format: longprocs
Check for long running processes. This check returns the time in minutes
of the longest running user process. Its goal is to detect suspicious processes
like PHP-shells of hacked user accounts. The values for min/max uid and optional
exclude process names must be specified in /etc/knoerrerc. See
nagios-check-longuserprocesses(1).
- mailqsize
- Return postfix mailqueue size.
Format: mailqsize
Format: mailqsize/XXXXX
Return the size of the mailqueue (active and deferred subdirs) on a postfix
server. See postfix-mailqsize(1). With the second format you can specify up
to 4 subdirs to check and an optional mode character. Just use any combination
of single chars like a(ctive), d(eferred), m(aildrop) or i(ncoming). Using
’M’ as mode char for maximum count you won’t get the sum of all emails but
the maximum count of one of the specified dirs.
- maxdirentries
- Return the maximum number of entries recursively in directories.
Format: maxdirentries/X/absolute/path/to/dir
where digit X is the recursive search depth.
This check is equal to "direntries" in max mode. See direntries(1).
- maxfilesizes
- Return biggest file size recursively.
Format: maxfilesizes/X/absolute/path/to/dir
Format: maxfilesizessum/X/absolute/path/to/dir
where digit X is the recursive search depth.
Find the biggest files and print paths and sizes in MB. The return value
is the size of the biggest file in MB or the sum of the sizes of the scanned
files.
- mountopts
- Check mountpoint and options
Format: mountopts/XXXXX/absolute/path/to/mountpoint
where XXXXX is an option string which should match the beginning of the
mount options.
/proc/mounts is used for the actual mount options and mountpoint. If the given
option string matches the beginning of the actual mount options over its full
length then 0 is returned, otherwise 1. If an error occurs, e.g. a nonexistent
mountpoint or a timeout, then 9999 or a bigger value is returned.
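The comparison made by mountopts amounts to a prefix match on the fourth field of the matching /proc/mounts line. The Python sketch below illustrates this; the function name is hypothetical, and using the last matching entry when a mountpoint is listed more than once is an assumption.

```python
ERROR_VALUE = 9999  # value documented for errors such as a missing mountpoint


def mountopts(mounts_text, wanted_opts, mountpoint):
    """Return 0 if the mount options start with wanted_opts, 1 if not, 9999 on error."""
    result = ERROR_VALUE
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 4 and fields[1] == mountpoint:
            result = 0 if fields[3].startswith(wanted_opts) else 1
    return result
```

In real use the first argument would be the content of /proc/mounts. Because the match is anchored at the beginning of the option string, the order of options in the key matters.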
- mysqlerr
- Count errors in mysqld errlog
Format: mysqlerr/absolute/path/to/mysqld.err
As with kernellog, you must specify the absolute path to the MySQL daemon error
logfile. Only lines with a timestamp of the current day are examined. Every
"Note" counts once, every "Warning" counts ten times and every "ERROR" has a
weight of 100.
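The weighting of mysqlerr can be illustrated with a short Python sketch. The function name is hypothetical, and both the timestamp layout (a day prefix at the start of each line) and the bracketed tags in the test data are assumptions about the log format, not taken from knoerre.

```python
# Weights as described above; the most severe tag found on a line wins.
WEIGHTS = (("ERROR", 100), ("Warning", 10), ("Note", 1))


def mysqlerr_score(lines, today_prefix):
    """Sum the weights of all lines whose timestamp starts with today_prefix."""
    score = 0
    for line in lines:
        if not line.startswith(today_prefix):
            continue
        for tag, weight in WEIGHTS:
            if tag in line:
                score += weight
                break
    return score
```

A single "ERROR" therefore outweighs dozens of notes, which makes it straightforward to pick nagios thresholds that ignore routine notes but react to errors.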
- netlinksdown
- Count net interfaces without link
Format: netlinksdown
Check all network interfaces for missing link (cable).
- nettraf
- Count network traffic
Format: nettraf/XXXX/SECONDS
where XXXX is the device name and optional SECONDS the measuring interval.
Traffic data is read from /proc/net/dev. Units are KiB and KiB/s. The line
before the last shows the total traffic during the measuring interval and
the measuring interval itself.
- nfs
- Check availability of an NFS-mounted fs.
Format: nfs/absolute/path/to/file
Check the availability of an NFS-mounted fs. It does this by "cat"ting the
content of a given file after "nfs/", which should contain "1". If this
file does not exist, or NFS is not available and the timeout of 2 seconds
expires, then a value bigger than 1 is returned. For NFS this check should
be preferred over cat because it forks a child which may block and is
then killed afterwards. See nagios-check-nfs(1).
- proccount
- Number of all processes
Format: proccount
Format: proccounttg
Format: proccountovz
"proccount" shows the count of all processes as shown by /proc/loadavg
(including "threads"). "proccounttg" counts processes by stepping through
/proc and count every PID-dir (no "threads", just processes with pid==tgid).
The alternative "proccountovz" is disabled by default. It additionally shows
the three "top" instances of OpenVZ in the line before last line.
- process
- Count instances of a process.
Format: process/XXXXX
Format: process0/XXXXX
Format: processd/XXXXX
Format: process/OpenVZ-CTID_YYYY/XXXXX
Format: processd/OpenVZ-CTID_YYYY/XXXXX *** CURRENTLY NOT IMPLEMENTED ***
where XXXXX is the name of a process as in /proc/.../stat and YYYY is the
CTID to match on an OpenVZ host.
If the key is "processd" then count only "real" daemons running as session/process
leader with PPID 1.
With "process", a value of 999999999999999999 is returned if no such process
runs. To return just 0 instead you must use "process0".
See nagios-check-process(1).
- rsbackup
- Return the minutes since the last backup.
Format: rsbackup
The last backup time is taken from "/var/log/backup.timestamp" and the difference
to the current time is returned. See nagios-check-backup(1).
- timediff
- System clock difference between local and remote.
Format: timediff/XXXXX
where XXXXX must be the unix timestamp from the requesting server in seconds
since epoch.
The difference between remote and local system time is returned as a (positive)
value in seconds.
A sample check in a shell:
lynx -dump http://172.16.1.1:8888/timediff/$(date +%s)
- tslogentries
- Count last lines in a logfile with the same beginning of line.
Format: tslogentries/XY/absolute/path/to/file
where digit X is the field count and the optional Y is a separator char.
If you have logfiles with a timestamp at the beginning of every logline
then you can count e.g. how many mails were sent or files were transferred
today. The first argument must be a digit as field count and an optional
char taken as field separator to create a matching pattern. The pattern
is created from the last line and the field count and separator. If no separator
char is specified then ’ ’ (space) will be used as default. The second argument
is the path. You can use one dot or ’@’ as one wildcard like asterisk in a
shell. See Examples.
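How a matching pattern can be built from the last line, the field count and the separator is sketched below in Python. The function name is hypothetical, and counting every line that starts with the pattern (rather than only the trailing run of lines) is an assumption; for a log in chronological order both give the same result.

```python
def tslogentries(lines, field_count, separator=" "):
    """Count lines that begin with the first field_count fields of the last line."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if not lines:
        return 0
    fields = lines[-1].split(separator)[:field_count]
    # The pattern ends with the separator so that "2024-05-02" does not
    # also match a longer first field such as "2024-05-020".
    pattern = separator.join(fields) + separator
    return sum(1 for line in lines if line.startswith(pattern))
```

With a field count of 1 and the default separator, a log whose lines start with a date counts the entries of the most recent day.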
- sockets
- Count sockets / sockets per port
Format: sockets/PROTO/XXXXXX/YYYY
Format: sockets/PROTO/XXXXXX/YYYY/ZZZZZZZZ
where XXXXXX is local, remote, wlocal or wremote. YYYY is the port as 4-digit
hexstring and ZZZZZZZZ is an optional IP address to be excluded from counting.
PROTO is one of tcp, udp, tcp6 and udp6. It is also the name of the proc-file
in /proc/net/ which is read to get socket data. If you want to know e.g. the
number of sockets of a locally running apache then you should use the key
sockets/tcp/local/0050, and if you want to count outgoing ssh connections
excluding connections to 172.16.0.1 then you should use
sockets/tcp/remote/0016/010010AC. Sockets in state "06" (TIME_WAIT) are
ignored unless you prefix local/remote with ’w’.
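Building the hex parts of a sockets key by hand is error-prone, so a small helper can be useful. The Python sketch below is hypothetical and handles IPv4 addresses only; the reversed byte order of the address is inferred from the example above, where 172.16.0.1 becomes 010010AC.

```python
import ipaddress

PROTOCOLS = ("tcp", "udp", "tcp6", "udp6")
SIDES = ("local", "remote", "wlocal", "wremote")


def sockets_key(proto, side, port, exclude_ip=None):
    """Build a key such as 'sockets/tcp/remote/0016/010010AC'."""
    if proto not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {proto}")
    if side not in SIDES:
        raise ValueError(f"unknown side: {side}")
    if not 0 <= port <= 0xFFFF:
        raise ValueError(f"port out of range: {port}")
    key = f"sockets/{proto}/{side}/{port:04X}"
    if exclude_ip is not None:
        # Reverse the four address bytes, as in the 172.16.0.1 example.
        packed = ipaddress.IPv4Address(exclude_ip).packed
        key += "/" + packed[::-1].hex().upper()
    return key
```

For example, sockets_key("tcp", "wlocal", 80) yields a key that also counts sockets in TIME_WAIT on local port 80.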
- swap
- Used swap space in MB
Format: swap
Used swap space in MB is calculated with values of /proc/meminfo. MemTotal
and SwapTotal in MB are printed in line before last. If you don’t need this
data you should use swaps because /proc/swaps holds just swap information.
The "swap" key is disabled by default.
- swaps
- Used swap(s) space in MB
Format: swaps
This is an alternative version to swap. The amount of used swap space is
calculated by adding the "Used" fields in /proc/swaps. The number of active
swaps is printed in line before last.
- uptime
- Return inverted uptime
Format: uptime
Format: uptime/INVERSIONLEVEL
Return an "inverted" uptime in seconds. The value returned is
(INVERSIONLEVEL - uptime) or 0 if the value would be negative. The
inversionlevel may be specified by the key string, e.g. uptime/3600. If no
inversionlevel was specified then a default of 86400 will be used.
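The inversion is a one-line calculation, shown here as a Python sketch with hypothetical function names. Reading the uptime from the first field of /proc/uptime is an assumption about where the value comes from.

```python
DEFAULT_INVERSION_LEVEL = 86400  # documented default, one day in seconds


def inverted_uptime(uptime_seconds, inversion_level=DEFAULT_INVERSION_LEVEL):
    """Return (inversion_level - uptime), clamped to 0 if it would be negative."""
    return max(0, inversion_level - int(uptime_seconds))


def read_uptime(path="/proc/uptime"):
    """Read the uptime in seconds from the first field of /proc/uptime (Linux)."""
    with open(path, "r", encoding="ascii") as handle:
        return float(handle.read().split()[0])
```

The result is large directly after a reboot and falls to 0 once the uptime reaches the inversion level, so an ordinary "value above threshold" rule in nagios can flag recent reboots.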
- wc-l
- Count lines of a file.
Format: wc-l/absolute/path/to/file
Just like shell cmd "wc -l" it counts lines of a file. You can use it for
checking e.g. apache running out of semaphores with wc-l/proc/sysvipc/sem.
The (optional) resource config file is "/etc/knoerrerc". You
can just specify some basic settings like external commands or parameters
for "longprocs".
To specify an external program which is called by knoerre please use "CMD
programurl command arg1 arg2 .. arg15", e.g.
CMD loadavg cat /proc/loadavg
NOTE1: The number of args is limited to 15.
NOTE2: knoerre doesn’t use insecure and oversized popen(). You don’t get a
shell to execute the external program.
NOTE3: You can’t specify a path to your external program. For security reasons
knoerre uses an internal path list to search for the program.
Parameters for the longprocs function can be specified like this:
LONGPROC_UID_MIN 630
LONGPROC_UID_MAX 65533
LONGPROC_EXCLUDES vsftpd bash sftp-server
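The layout of the resource file is simple enough to be read by other tools as well, for example to audit which external commands a host exposes. The Python sketch below is a hypothetical reader, not knoerre's parser: it only knows the directives shown above, splits strictly on whitespace and silently skips anything else.

```python
MAX_ARGS = 15  # documented limit for arguments of a CMD entry


def parse_knoerrerc(text):
    """Return (commands, longproc) parsed from the content of a knoerrerc file."""
    commands, longproc = {}, {}
    for raw_line in text.splitlines():
        words = raw_line.split()
        if not words:
            continue
        if words[0] == "CMD" and len(words) >= 3:
            name, program, args = words[1], words[2], words[3:]
            if len(args) > MAX_ARGS:
                raise ValueError(f"too many arguments for {name}")
            commands[name] = [program] + args
        elif words[0] in ("LONGPROC_UID_MIN", "LONGPROC_UID_MAX") and len(words) == 2:
            longproc[words[0]] = int(words[1])
        elif words[0] == "LONGPROC_EXCLUDES":
            longproc[words[0]] = words[1:]
    return commands, longproc
```

The commands dictionary maps each programurl to its argument vector, which mirrors the fact that no shell is involved when the command is executed.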
knoerre uses one configuration file and one access restrictions file for its
tcpserver daemon:
- /etc/knoerrerc
- rc-file for non-monolithic knoerre
- /etc/knoerre.tcprules.cdb
- tcprules for use with tcpserver
tcpserver(1), knoerre-conf(1), knoerre-update-tcprules(1), svc(8),
check_remote_by_http(1), check_remote_by_http_time(1)
http://cr.yp.to/ucspi-tcp.html
http://cr.yp.to/daemontools.html
Here’s a simple example of a client and server communication:
server$ tcpserver -v -RHl localhost 0 8888 knoerre
client$ lynx -dump -mime-header http://server:8888/load1
HTTP/1.0 200 OK
Server: knoerre/0.8.5m
Content-Type: text/plain
1.51
You can also use something like
echo "GET /loadavg HTTP/1.1" | knoerre
or
knoerre loadavg
This example shows the usage of a @ as wildcard:
$ knoerre filesizes/home/www/@/log/access_log
/home/www/user_hans/log/access_log
52222
A very "complex" example with three arguments (suffix, depth and path)
and wildcard usage is this:
$ knoerre filesizesbysuffix/.gif/2/home/@/html/typo3temp
/home/www/user_hans/html/typo3temp/pics/30363cbb32.gif
201
Also filesizesbysuffix:
$ knoerre filesizesbysuffix/cache_pages.ibd/1/var/lib/mysql
/var/lib/mysql/user-database-1/cache_pages.ibd=3022848
3022848
$ knoerre filesizesbysuffix/.ibd/1/var/lib/mysql
/var/lib/mysql/user-database-2/index_rel.ibd=3248128
3248128
Which user sent the most emails today?
$ knoerre tslogentries/1/home/www/@/log/mail.log
/home/www/user_hans/log/mail.log
858
Which user runs the most processes?
$ knoerre loaduser/1/60000
hans=32 jack=3 john=1
32
Is /home rw-mounted and nosuid?
$ grep home /proc/mounts
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
$ knoerre/knoerre mountopts/rw,nosuid/home
/home==rw,nosuid?
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
0
knoerre does not support dropping privileges. When it is used as a remote
check tool with tcpserver you can drop privileges with tcpserver. knoerre does
not need to run as root, but different checks, dirs and files may require
different permissions. Don’t use setuid bits; uid/euid checks are not made.
Keys that are too long are truncated or answered with an HTTP redirection.
HTTP requests are limited to 512 bytes.
Keys containing ".." are answered with an HTTP redirection.
All stat-calls are lstat()-calls.
No writes are made to filesystem(s), all open()-calls are read-only. Data is
only written to stdout/stderr.
No external libs are used. Only the standard C library is used. No
stdio-functions are used. "External" input data is used with bounds checks.
Arrays are "oversized" to avoid off-by-one errors.
An internal timeout prevents "dead" knoerre processes that block in read()
waiting for data which will never come.
The number of syscalls and the number of different syscalls is low. The
source code and the executable file are both small.
Using external commands with "CMD" in /etc/knoerrerc can be a security risk
because the external program is forked/exec’ed by knoerre.
knoerre doesn’t use insecure and oversized popen() to execute external
commands. You don’t get a shell to execute an external program. You can’t put
strings in quotes. Space does always separate. You can’t specify a path to
your external program. knoerre uses an internal path list to search for the
program.
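The restrictions on external commands can be modelled in a few lines of Python. This is a sketch of the idea, not of knoerre's code: the function name and the directories in SEARCH_DIRS are hypothetical, since the internal path list is not documented in this page.

```python
import os

SEARCH_DIRS = ("/bin", "/usr/bin")  # hypothetical; the real internal list is not documented here
MAX_ARGS = 15


def build_argv(command_line, search_dirs=SEARCH_DIRS):
    """Split on whitespace only and resolve the program in a fixed directory list."""
    words = command_line.split()  # no quoting: whitespace always separates
    if not words:
        raise ValueError("empty command")
    program, args = words[0], words[1:]
    if "/" in program:
        raise ValueError("paths are not allowed in the program name")
    if len(args) > MAX_ARGS:
        raise ValueError("too many arguments")
    for directory in search_dirs:
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return [candidate] + args
    raise ValueError(f"program not found: {program}")
```

The resulting list can be handed to subprocess.run() without shell=True, so that shell metacharacters in arguments are never interpreted.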
It’s strongly recommended that you only allow TCP access for your nagios
server. Put one entry "knoerre: ALL" in /etc/hosts.deny and one entry with the
nagios server IP# in /etc/hosts.allow. After changing them you must use
knoerre-update-tcprules(1) to update tcpserver’s cdb file. Always keep in mind
that host based authentication is not really authentication.
To encrypt network traffic please use e.g. ipsec or vpn.
Due to "leaf optimization" in direntries recursive mode it can produce wrong
results on non-unix-like filesystems.
The maximum internal absolute pathname length is 16384 chars.
Frank Bergmann, http://www.tuxad.com