Kendra’s query was a good starting point, and I used most of it in the first CTE shown below. This query basically looks at msdb.dbo.sysjobhistory and msdb.dbo.sysjobactivity, joining them on the job_id, which is the PK. However, we are only looking at steps, which is anything with a step_id > 0. The step_id = 0 is for the overall job.Recently a customer was asking for a way to alert on job steps that failed, but the job succeeded. They really wanted a custom metric for SQL Monitor, which I submitted, but this post looks at the query in general, trying to determine if a job step failed.
Note: Let me start by noting that this is based on work by Kendra from her post, SQL Agent Jobs: Checking for failed steps at the end of a job.
Based on Kendra’s query, I looked through what is happening in msdb.dbo.sysjobhistory and msdb.dbo.sysjobactivity. In Kendra’s query, she is looking for a specific job, but I wanted all jobs. This lead me to build a CTE that queries for the data.
As you can see in the code below, I use most of Kendra’s query to join these two tables together. First, we look for steps, so step_id != 0, and we look for failures. A status of 0 is a failure here, where 1 is success.
Update: Changed the query after the comment and another bug I noticed.
WITH cteActivity (job_id, start_execution_date) AS ( -- get the latest job execution for all jobs SELECT job_id, MAX(start_execution_date) AS start_execution_date FROM msdb.dbo.sysjobactivity GROUP BY job_id) , cteJobStep (Job_ID, Step_Name, run_date, run_time) AS (SELECT jh.job_id, jh.step_name AS JobStepName, jh.run_date, jh.run_time FROM msdb.dbo.sysjobhistory jh INNER JOIN cteActivity ja ON jh.job_id = ja.job_id WHERE jh.run_status = 0 --step failed AND jh.step_id != 0 -- and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC') AND --this block ensures that we just pull information from the most recent job run ( -- The start time of the step, converted to datetime CONVERT(DATETIME, RTRIM(jh.run_date)) + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100 * 10 ) / 216e4 >= ja.start_execution_date -- the time the job last started )
Next, I kept most of Kendra’s query, but I commented out the line that limits this to a specific ID. I just want all step failures. I did keep the part that only checks the latest execution of the job.
The outer query just counts these up and returns a number. This lets me know how many job steps have failed during their latest execution.
This is a good first step, but this is something I could add in SQL Monitor as a custom metric or in any tool for alerting. When I have steps failing, I might want to know.