We have a mailing list at Mozilla which receives mail sent to root on any of our servers. The majority of this mail is cron job output. I have filters set up in my Zimbra account to put the cron job mail into its own folder, separate from the rest of the mail to that list. I was on vacation last week, and on my last day before leaving I completely emptied that folder. On my return, it contained 26,373 messages, all dated within the last week. Separating the nuisance mail from the real problems by hand is all but impossible at that volume.
Obviously one task is to eliminate the nuisance mail. This has to be done carefully, because typically you still want to get errors from cron jobs, but you don’t want the general output. Not all jobs are good about their use of standard error and standard output, so often you can’t just redirect standard output to /dev/null and expect to get mail only when there’s a problem. Fixing the nuisance mail therefore sometimes means writing a wrapper script for a cron job that does some grep or awk work to filter the output. But even with the nuisance mail gone, there is still a lot of mail to sift through to find any real problems.
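As a rough illustration of the wrapper idea, here is a minimal sketch in Python rather than grep or awk. It assumes Python 3.7 or later, and the script name `cronwrap.py`, the function names, and the noise patterns are all hypothetical and would need tailoring to each job. It runs the real command, merges standard error into standard output, drops lines that match known-harmless patterns, and prints whatever is left. Since cron only sends mail when a job produces output, a clean run stays silent.

```python
import re
import subprocess
import sys

# Hypothetical noise patterns; replace these with whatever a given job
# prints on a normal, successful run.
NOISE_PATTERNS = [
    re.compile(r"^\s*$"),                    # blank lines
    re.compile(r"^(INFO|DEBUG)\b"),          # routine log chatter
    re.compile(r"^\d+ files? processed$"),   # success summaries
]


def filter_output(text, patterns=NOISE_PATTERNS):
    """Return the lines of text that match none of the noise patterns."""
    return [
        line for line in text.splitlines()
        if not any(p.search(line) for p in patterns)
    ]


def run_quietly(command):
    """Run command and return (exit_code, lines worth mailing)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # jobs often mix up the two streams
        text=True,
    )
    kept = filter_output(result.stdout)
    if result.returncode != 0:
        # Always report a failure, even if the job printed nothing useful.
        kept.append(f"exit status {result.returncode}: {' '.join(command)}")
    return result.returncode, kept


if __name__ == "__main__" and len(sys.argv) > 1:
    code, lines = run_quietly(sys.argv[1:])
    if lines:
        print("\n".join(lines))  # any output here becomes cron mail
    sys.exit(code)
```

In a crontab, a job would then be invoked as something like `cronwrap.py /path/to/real-job --its-args`, where the path and arguments are placeholders. The non-zero exit check is there so that a job which fails without printing anything still generates mail.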
So, I filed bug 377043 with an idea for a tool to do some automated analysis of all this cron job output: keep track of patterns, point out things that need looking at, and so on. Unfortunately, both cron jobs and data analysis are pretty popular topics (and usually not related to each other), so Google isn’t much help when I search for existing tools. Does anyone know of any existing tools that do something like this, which we might either be able to use or build upon?
Does Splunk not do anything like that?
-Max