Hopefully you don't experience a lot of job failures. But when multiple high-frequency jobs start failing, your inbox can get overloaded. Besides being annoying, an inbox full of job failure emails can make it hard to sift out the signal from the noise. In this post we'll walk through a system that I have used to monitor job failures on 100's of instances running 150+ jobs a piece, many of which run every 5 minutes.

Overview

At it's core, the system has three components: a stored procedure, a SQL Agent job, and a table. The table stores job failure details. The agent job executes a procedure to copy job failure details to the table, and cleans up the job history table in MSDB.

The Table

The table is simple, and holds the information you would expect: :::sql USE [DBA] GO

CREATE TABLE [Maintenance].[JobFailureArchive] (
    JobFailureArchiveID INT IDENTITY(1,1) NOT NULL,
    JobID UNIQUEIDENTIFIER NOT NULL,
    JobName SYSNAME NOT NULL,
    RunDate DATETIME NOT NULL,
    StepID INT NOT NULL,
    OutcomeText NVARCHAR(4000) NOT NULL,
    RunDurationSec INT NOT NULL,
    LogDateGMT DATETIME NOT NULL, -- Date and time the record was added
    CONSTRAINT PK_JobFailureArchive PRIMARY KEY CLUSTERED (
        JobFailureArchiveID
    ) WITH ( FILLFACTOR = 100 ),
    INDEX IX_JobFailureArchive_RunDate NONCLUSTERED (
        RunDate,
        JobName
    ) WITH ( FILLFACTOR = 100 )
);
GO

Throughout the rest of this post I will be using the DBA database. This is a database I create on any instance I manage. I use it to store any objects I need to perform monitoring and troubleshooting tasks. I also tend to break out my objects by schema, in these examples I will be using the Maintenance schema.

The Procedure

The procedure copies job failure data from the dbo.sysjobhistory table in MSDB to the JobFailureArchive table. Before it does this, the procedure scans the JobFailureArchive table to determine the date of the most recent job failure it had captured.

USE [DBA]
GO

----------------------------------------------------------------------------------
-- Procedure Name: Maintenance.ArchiveFailedJobs
--
-- Desc: Archives failed job details before purging job history.
--
-- Parameters:
--
-- Auth: Mark Wilkinson (@m82labs)
-- Date: 2017.08.20
----------------------------------------------------------------------------------

CREATE OR ALTER PROCEDURE [Maintenance].[ArchiveFailedJobs]
AS

DECLARE @lastFailure DATETIME -- Last time a failure was captured

-- Get the most recent job failure recorded.
SET @lastFailure = (
    SELECT
        ISNULL(MAX(RunDate),'19000101') AS MaxRunDate
    FROM
        Maintenance.JobFailureArchive
)

-- Insert new job failures
INSERT INTO Maintenance.JobFailureArchive
( JobID, JobName, RunDate, StepID, OutcomeText, RunDuration )
SELECT
    sj.job_id,
    sj.name,
    msdb.dbo.agent_datetime(jh.run_date,jh.run_time),
    jh.step_id,
    jh.message,
    (jh.run_duration / 10000 * 60 * 60) +
    (jh.run_duration / 100 % 100 * 60) +
    (jh.run_duration % 100 ) As RunDurationSec,
    GETUTCDATE()
FROM
    msdb.dbo.sysjobhistory AS jh
    INNER JOIN msdb.dbo.sysjobs AS sj
        ON jh.job_id = sj.job_id
WHERE
    jh.run_status = 0 -- Just get failures
    AND jh.step_id <> 0 -- Skip the '0' step that gives us summary info
    AND msdb.dbo.agent_datetime(jh.run_date,jh.run_time) > @lastFailure;

Alerting

How you approach alerting is up to you, but you have a lot of different options. In my environment I query the JobFailureArchive table via a Nagios service check. You could do the same thing with any monitoring solution that allows you to add custom queries. No matter which direction you take, writing a query to get the job failure details is simple:

SELECT  TOP(1) WITH TIES
        JobName,
        Rundate,
        StepID,
        OutcomeText,
        RunDurationSec
FROM    DBA.Maintenance.JobFailureArchive
ORDER BY LogDateGMT

Since we are storing the date the records are added to the table, this query will always return the latest set of failures. This is a simple example, but the possibilities are endless: * Send the results of this query via database mail * Join with dbo.sysjobs and dbo.syscategories, alerting on different thresholds per job category * Extend the TOP (1) to include multiple capture periods and alert on average failures per capture

Configuring the SQL Agent Job

A SQL Agent job is the last component that ties everything together. At it's simplest the job executes the procedure in the example above. This will get your job failures into the archive table, but there is more this job can do.

Since we are archiving job failures, we can clean up job history in MSDB more aggressively. To do this, add another step to the agent job that executes the following (which limits job history to a rolling 24 hours):

DECLARE @OldestJob DATETIME = DATEADD(day,-1,GETDATE())

EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @OldestJob;

Reducing the size of the sysjobhistory table can have a lot of benefits including: quicker restore times on MSDB, more responsive SSMS when viewing job history, and a reduction in contention when writing to a busy sysjobhistory table.

You could also add a step to send an email alert when new job failures are inserted. Database mail is a rabbit hole we aren't going to explore in this post. If you do plan on using this job to alert, start by reading through the resources at the bottom of this page. Specifically the one by Jes Borland on sending query results using database mail.

The article on using tokens in agent jobs is also a good read. Using tokens you can find job failures inserted since the job started executing:

SELECT  JobName,
        Rundate,
        StepID,
        OutcomeText,
        RunDurationSec
FROM    DBA.Maintenance.JobFailureArchive
WHERE   LogDateGMT >= msdb.dbo.agent_datetime($(ESCAPE_SQUOTE(STRTDT)), $(ESCAPE_SQUOTE(STRTTM)))
ORDER BY LogDateGMT

The system function msdb.dbo.agent_datetime converts the date and time format used by SQL Agent into a standard datetime. The tokens $(ESCAPE_SQUOTE(STRTDT)) and $(ESCAPE_SQUOTE(STRTTM)), will return the date and time the job started executing. Combined with Jes's post this could be all you need for job failure alerts.

If this job fails for any reason you will stop receiving notifications for any job failures. Make sure to configure this job to alert via email if it fails!

PowerShell (Bonus Feature!)

One nice thing about having job failure data in a dedicated table is that it is easy to query. You can also give normal users access to the data without having to put them in special agent roles or granting them access to system databases. Below is a PowerShell function you, or your users, could use to retrieve job failures based on several criteria:

function Get-SqlJobFailures
{
    <#
    .SYNOPSIS
    Retrieves job failure details from the given instance

    .PARAMETER SqlInstance (Accepts values from the pipeline)
    Instance we are getting job failures from

    .PARAMETER JobName
    All or part of the job name you are interested in

    .PARAMETER Newest
    The number of results to return based on date

    .PARAMETER Since
    The oldest date to show in the results

    .PARAMETER SearchString
    This parameter can be used to search the error output

    #>

    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true,ValueFromPipeline=$true)]
        [string]$SqlInstance,
        [string]$JobName = '',
        [int]$Newest = 10000,
        [datetime]$Since = '2017-01-01',
        [string]$SearchString
    )

    BEGIN
    {
        [string]$GetFailures_Query = @"
        SELECT  TOP($($Newest))
                @@SERVERNAME AS ServerName,
                JobName,
                Rundate,
                StepID,
                OutcomeText,
                RunDurationSec
        FROM    DBA.Maintenance.JobFailureArchive
        WHERE   rundate >= '$($Since.ToString('yyyMMdd HH:mm'))'
                AND JobName LIKE '%$($JobName)%'
                $( if($SearchString){ "AND OutcomeText LIKE '%$($SearchString)%'" })
        ORDER BY RunDate DESC;
"@

        $data = @()
    }

    PROCESS
    {
        Write-Host "Testing connection to $($SqlInstance): " -NoNewline 
        try
        {
            Test-Connection -ComputerName $SqlInstance -Count 1 -Quiet -ErrorAction Stop | Out-Null
            Write-Host "success" -ForegroundColor Green
        }
        catch
        {
            Write-Host "error - $($_.Exception.Message)" -ForegroundColor Red
            throw
        }

        Write-Host "Getting job failure data: " -NoNewline
        try
        {
            Invoke-Sqlcmd -Query $GetFailures_Query -ServerInstance $SqlInstance -Database master | % {
                $data += $_
            }

            Write-Host "done" -ForegroundColor Green
        }
        catch
        {
            Write-Host "error - $($_.Exception.Message)" -ForegroundColor Red
            throw
        }
    }

    END
    {
        if ( $data )
        {
            return $($data | Sort-Object -Property ServerName, RunDate) 
        }
    }
}

If you are not familiar with PowerShell I recommend giving it a try. It has become an indispensable tool for my day to day tasks. The power of this function comes when you run it across multiple instances:

PS> @('sql-instance-01', `
    'sql-instance-02', `
    'sql-instance-03' ) | Get-SqlJobFailures -Newest 15 | Out-Gridview

This will get the 15 most recent job failures for the three instances listed, and display them in a graphical grid view.

Final Thoughts

Writing job failures to a user table seems like a simple idea, but you'll be surprised how much you use it. We only scratched the surface in this post, I would love to hear what other uses you find for this.

Resources