使用长时间运行作业的警报监控 SQL 代理

养了一个闲月亮 2021-09-09 10:26:49 浏览数 (2401)
反馈

本篇文章将为大家详细介绍一种通过电子邮件技术发出的警报,可以应用在长时间运行的 sql 代理作业时收到通知。下面是详细请内容,希望能够对大家有所帮助!

问题

当你必须在繁忙的生产服务器上管理数百个计划作业时,不可避免地会出现作业需要很长时间才能完成的情况,从而导致大量等待或影响其他进程的性能。在深入调查性能下降的原因之前,我们想知道一项工作何时开始花费太长时间。有些工作不仅需要很长时间才能完成,而且它们可能需要比通常使用的时间更长的时间才能完成。

哪些工作的表现正在倒退?缓慢的工作现在是否一直成为问题,还是一次性发生?当需要调查一项工作时,通常在时间限制内,运行时长是我们分析中首先要考虑的指标。

为了将当前运行时间与同一作业之前执行的持续时间进行比较,我们通常会查看作业的前一个运行时间的历史记录,并看到当前运行时间比之前的几个持续时间长。然而,如果你以 DBA、开发人员或 DevOps 工程师的身份管理 500 份工作,一次调查一项工作的任务可能需要半天时间,但仍然无法让你更接近结论。

你可能需要检查 20 个作业,你不想一直单独照看每份工作。在这种情况下,发送到你的邮箱的自动警报会派上用场。本文中的解决方案是创建一个将接受参数的存储过程。此参数是用于计算过去作业平均持续时间的天数。该过程会生成一份报告,显示当前正在运行的作业的性能正在下降(退化)以及可选的电子邮件警报。

理解这种方法的某些步骤以及我们已经在生产中使用的方便的存储过程如下所示(在运行受监控作业的同一服务器上安排为单独的作业(有关此类作业的脚本,请参阅“附录”) ) 以帮助您收到有关在轮询时正在退化的一个或多个作业的电子邮件警报。我们​asp_long_running_Regressing_Jobs_Alerts​每分钟在我们的环境中执行 ​Stored Proc​, ,因为我们的作业持续时间从几秒到几小时不等。在给予之前SP DDL,SP 的某些组件被分解为先决条件(小尺寸代码片段),有助于全面了解警报 SP。

所需工具列表:​SQL Server​(以下代码在 2012 及更高版本上测试)

这里的目标是将每个当前正在运行的作业的持续时间与给定时间段内同一作业的所有运行时间的平均持续时间进行比较。

注意:在继续编译下面给出的存储过程之前,请确保你在要分析计划作业的环境中具有高访问权限。理想情况下,管理员。要检查你的访问级别(如果你不是该服务器的管理员,请运行此或类似的 T-SQL 语句。如果你没有此查询的结果,则你没有对 MSDB 表的选择权限。

USE MSDB
GO
SELECT HAS_PERMS_BY_NAME(QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + 
       QUOTENAME(name), 'OBJECT', 'SELECT') AS have_select, *
FROM   sys.tables
GO

图 1

图 1:结果集显示你拥有 SELECT 权限的表。

解决方案

下面存储过程的 T-SQL DDL 代码可以在您选择的任何数据库中编译。我们使用专用​DBA_db​于此类管理 SP。

USE [DBA_db]
GO
 
-- <begin stored procedure DDL/>
 
CREATE PROCEDURE asp_long_running_Regressing_Jobs_Alerts
         @history_days int = 7,
         @avg_duration_multiplier float = 1.5,
         @bEmail bit = 0,
         @bSaveToTable bit = 0,
         @RecipientsList Varchar(1000) = 'myName@myCoDomain.com',
         @ignore_zero_durations bit = 0
 
AS
/* example of usage:
   exec DBA_db..asp_long_running_Regressing_Jobs_Alerts
               @history_days = 45,
               @avg_duration_multiplier = 2,
               @bEmail = 0,
               @bSaveToTable = 0,
               @RecipientsList  = 'myName@myCoDomain.com;'   ,
               @ignore_zero_durations = 1
 
AUTHOR(s):
Vladimir Isaev;
-- + V.B., S.L;
-- contact@sqlexperts.org
*/      
 
/*PARAMETERS:
@history_days int          (how many days back we use for AVF run duration)
@avg_duration_multiplier   (how many times longer than AVG will qualify job 
                            for producing an alert)
@bEmail                    (send out Alert Email or just print the msg about Regressing jobs)
                           -- 'REGRESSION' is defined here by Duration only
*/
SET NOCOUNT ON
BEGIN
 
        select sj.name,
               sja.start_execution_date,
               sja.stop_execution_date,
               ajt.min_run_duration,
               ajt.max_run_duration,
               ajt.avg_run_duration,
               datediff(ss, start_execution_date, getdate()) as cur_run_duration
        into #Regressing_Jobs
 
        from msdb..sysjobactivity sja
               left join
               (select job_id,
                       avg(dbo.udf_convert_int_time2ss(run_duration)) as avg_run_duration,
                       min(dbo.udf_convert_int_time2ss(run_duration)) as min_run_duration,
                       max(dbo.udf_convert_int_time2ss(run_duration)) as max_run_duration
                       from msdb..sysjobhistory
                       where step_id=0
                       and run_date >CONVERT(varchar(8),GETDATE() - @history_days,112)
                       and ((run_duration <> 0 or @ignore_zero_durations = 0))
                       and run_duration < 240000
                       group by job_id
               )ajt on sja.job_id=ajt.job_id
        join msdb..sysjobs sj on sj.job_id=sja.job_id
        where
               sja.session_id = (SELECT TOP 1 session_id
                                   FROM msdb.dbo.syssessions
                                  ORDER BY agent_start_date DESC)
               AND start_execution_date is not null
               and stop_execution_date is null
               and datediff(ss, start_execution_date, getdate()) >
                   ajt.avg_run_duration * @avg_duration_multiplier
 
        select name as JobName,
               start_execution_date,
               stop_execution_date,
               dateadd(second, min_run_duration, 0) as min_run_duration,
               dateadd(second, max_run_duration, 0) as max_run_duration,
               dateadd(second, avg_run_duration, 0) as avg_run_duration,
               dateadd(second, cur_run_duration, 0) as cur_run_duration
        into #Regressing_Jobs_DurAsDate
        from #Regressing_Jobs
                                            --  waitfor delay '00:00:10'
        declare @sHtml varchar(max) = ''
 
        declare @tableHTML  nvarchar(max) =
               N'<H1>Job(s) taking longer than recent baseline duration
                (in descending avg duration order):</H1>' + Char(13)
               + N'    <table border="1">'           + Char(13)
               + N'    <tr bgcolor="#ddd">'          + Char(13)
               + N'           <th>Start Time</th>'   + Char(13)
               + N'           <th>Job Name</th>'     + Char(13)
               + N'           <th>Host Name</th>'    + Char(13)
               + N'           <th>History Days</th>' + Char(13)
               + N'           <th>Avg Dur Mul</th>'  + Char(13)
               + N'           <th>Min Dur</th>'      + Char(13)
               + N'           <th>Max Dur</th>'      + Char(13)
               + N'           <th>Avg Dur</th>'      + Char(13)
               + N'           <th>Cur Dur</th>'      + Char(13)
               + N'    </tr>'                        + Char(13)
 
        select @tableHTML =  @tableHTML
               + FORMATMESSAGE(
                       '<tr><td>%s</td>'      _
                          + Char(13) --start_execution_date
                       + '<td>%s</td>'        + Char(13) --name
                       + '<td>%s</td>'        + Char(13) --@@SERVERNAME
                       + '<td style="text-align:center">%i</td>' _
                          + Char(13) --@history_days
                       + '<td style="text-align:center">%s</td>' 
                          + Char(13) --@avg_duration_multiplier
                       + '<td>%s</td>'        + Char(13) --Min Dur
                       + '<td>%s</td>'        + Char(13) --Max Dur
                       + '<td>%s</td>'        + Char(13) --Avg Dur
                       + '<td>%s</td>'        + Char(13),--Cur Dur
                               convert(varchar, start_execution_date, 120),
                               JobName,
                               @@SERVERNAME,
                               @history_days,
                               convert(varchar, @avg_duration_multiplier),
                               format(min_run_duration, N'HH\hmm\mss\s'),
                               format(max_run_duration, N'HH\hmm\mss\s'),
                               format(avg_run_duration, N'HH\hmm\mss\s'),
                               format(cur_run_duration, N'HH\hmm\mss\s')
                       )
          from #Regressing_Jobs_DurAsDate
          order by avg_run_duration desc, JobName
 
               select @tableHTML = @tableHTML + '</tr></table>' + Char(13)
 
        select @sHtml = @tableHTML
        --select @sHtml
 
        declare @DateStr varchar(30) = convert(varchar,getdate(),121)
        IF @bEmail = 1 and (select count(*) from #Regressing_Jobs) > 0
         begin
              
               declare @sSubject varchar(250)
                   = @@SERVERNAME + ' Job(s) taking longer than recent baseline duration: ' _
                     + @DateStr 
 
               EXEC msdb.dbo.sp_send_dbmail  @profile_name='SQL Server Monitoring Account',
                                              @recipients= @RecipientsList,
                                              @subject=@sSubject, 
                                               @body=@sHtml,
                                              @body_format = 'HTML'
 
               print 'email sent: ' + CHAR(13) + @sHtml
        end
 
        IF @bSaveToTable = 1
          begin
               insert into RegressingJobs
         (
 CaptureDateTime,
 JobName,
 start_execution_date,
 HostName,
 history_days,
 avg_duration_multiplier,
                           min_run_duration,
 max_run_duration,
 avg_run_duration,
 cur_run_duration
 )
select         @DateStr,
JobName,
start_execution_date,
@@SERVERNAME,
@history_days,   
@avg_duration_multiplier,
min_run_duration,
max_run_duration,
avg_run_duration,
cur_run_duration
               from #Regressing_Jobs_DurAsDate
          end
 
        begin
 
         SELECT 'JOBS THAT ARE TAKING LONGER THAN USUAL:  '
         select  @DateStr as CaptureDateTime, JobName, _
            start_execution_date, @@SERVERNAME as 'Server',
                 @history_days as '@history_days', _
                    @avg_duration_multiplier as '@avg_duration_multiplier',
                 min_run_duration, max_run_duration, _
                    avg_run_duration, cur_run_duration
         from    #Regressing_Jobs_DurAsDate
 
        end
 
--all currently running jobs:
       begin
               SELECT ' ALL JOBS THAT ARE CURRENTLY RUNNING:  '
               SELECT
                 -- '',  -- CAST (ja.job_id AS VARCHAR(max)),
                       j.name AS job_name,
                  cast ( ja.start_execution_date as varchar) start_execution_time,  
                  cast ( ja.stop_execution_date  as varchar) stop_execution_time,  
                  -- ISNULL(last_executed_step_id,0)+1 AS current_executed_step_id,
                       Js.step_name step_name
               FROM msdb.dbo.sysjobactivity ja
               LEFT JOIN msdb.dbo.sysjobhistory jh
                       ON ja.job_history_id = jh.instance_id
               JOIN msdb.dbo.sysjobs j
                 ON ja.job_id = j.job_id
               JOIN msdb.dbo.sysjobsteps js
                 ON ja.job_id = js.job_id
                AND ISNULL(ja.last_executed_step_id,0)+1 = js.step_id
               WHERE ja.session_id =
                    (SELECT TOP 1 session_id
               FROM msdb.dbo.syssessions
           ORDER BY agent_start_date DESC)
               AND start_execution_date is not null
               AND stop_execution_date  is null;
 
        end
END
 
GO
 
-- <end of stored procedure DDL/>

asp_long_running_Regressing_Jobs_Alerts 使用的 2 个 UDF

-- dependencies of asp_long_running_Regressing_Jobs_Alerts:
 
-- udf_convert_int_time
CREATE   FUNCTION [dbo].[udf_convert_int_time] (@time_in INT)
RETURNS TIME
AS
 BEGIN
        DECLARE @time_out TIME
        DECLARE @time_in_str varchar(6)
        SELECT  @time_in_str = RIGHT('000000' + CAST(@time_in AS VARCHAR(6)), 6)
        SELECT  @time_out    = CAST(STUFF(STUFF(@time_in_str,3,0,':'),6,0,':') AS TIME)
  RETURN @time_out
 END
GO
 
-- udf_convert_int_time2ss
CREATE   FUNCTION [dbo].[udf_convert_int_time2ss] (@time_in INT)
RETURNS int
AS
 BEGIN
        DECLARE @time_out int
        select @time_out = datediff(ss, 0,  dbo.udf_convert_int_time(@time_in))
  RETURN @time_out
 END
GO

除了 SP 标头中列出的调用示例之外,以下是一个典型调用的示例:

exec dba_DB.dbo.asp_long_running_Regressing_Jobs_Alerts
       @history_days = 45,
       @avg_duration_multiplier = 2,
       @bEmail = 1,
       @bSaveToTable = 0,
       @RecipientsList  = 'myName@myCoDomain.com;
                           AssociateName@myCoDomain.com’
       @ignore_zero_durations = 1

SP 的此调用意味着以下内容:

给我一份报告(或警报),说明在 45 天内完成相同作业的平均运行时间所需时间的两倍的所有作业。通过电子邮件将此类报告发送给我 ( ​myName​) 和我的同事 ( ​myAssociateName​)。不要将此数据保存到基线表,并且不包括持续时间为零的作业。

https://www.sqlservercentral.com/wp-content/uploads/2021/08/img_6118df89ac710.png

图 2:SSMS 中对 SP 的另一个类似调用的示例输出:在这种情况下,显示现在花费的时间比过去 2 天花费 AVG 多 10% 的作业。

下图显示了由 SP 生成的电子邮件警报在 HTML 中的外观示例。

此电子邮件通知仅在 时通过调用此 SP 生成​parameter @bEmail = 1​。

https://www.sqlservercentral.com/wp-content/uploads/2021/08/img_6118e0f061fcd.png

图 3

电子邮件主题行如下:​<ServerName> ​作业花费的时间比最近的基线持续时间长;​yyyy-mm-dd mm:ss

将结果保存到表格以供将来的历史分析

如果你决定将报告保存到表格,请执行以下操作:除了通过电子邮件或在 SSMS 中直接运行 SP 收到警报外,还需要一个表格。用​@bSaveToTable = 1​呼叫 SP 。(它是 SP 的依赖项之一,因此即使此时您没有将结果放入表格,也最好创建它)。这是表 DDL:

CREATE TABLE [RegressingJobs](
                            [CaptureDateTime]        [datetime]      NULL,
                            [JobName]                [sysname]       NOT NULL,
                            [start_execution_date]   [datetime]      NULL,
                            [HostName]               [sysname]       NOT NULL,
                            [history_days]           [int]           NULL,
                            [avg_duration_multiplier]    [float]     NULL,
                            [min_run_duration]           [time](7)   NULL,
                            [max_run_duration]           [time](7)   NULL,
                            [avg_run_duration]           [time](7)   NULL,
                            [cur_run_duration]           [time](7)   NULL
 ) ON [PRIMARY]
   GO

此调用具有完整功能的 SP 的示例,包括保存到表格和电子邮件警报 ( ​bSaveToTable= 1, bEmail=1​):

EXEC DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts
                             @history_days = 30,
                             @avg_duration_multiplier = 2,
                             @bEmail = 1,
                             @bSaveToTable = 1,
                             @RecipientsList  = 'myName@myCoDomain.com;',
                             @ignore_zero_durations = 1

如何出于各种目的调用此 SP 的其他示例

将当前运行的作业与其过去 30 天的历史进行比较,并报告当前持续时间超过 30 天平均值 1.5 倍的每个作业。不要发送警报电子邮件,也不要将此信息保存到基线表:

EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 30, 1.5, 0, 0

将当前运行的作业与其过去 60 天的历史进行比较,并报告当前持续时间超过 60 天平均值 2 倍的每个作业。向默认收件人(列表)发送电子邮件警报,并且不要将此信息保存到基线表:

EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 60, 1.5, 1, 0

注意:强烈建议对参数进行赋值,明确命名每个参数。给出上述示例是为了简洁。

结论

本文描述了管理员在高度自动化的工作负载环境中管理和分析多个作业的性能问题时面临的问题。我在此处共享的存储过程允许管理员在某些作业在给定时间段内花费的时间超过其过去平均持续时间时收到警报。

附录

下面是用于创建名为 [ ​MyMonitoredServerName_Maintenance- Regressing Jobs​]的计划作业的 DDL,以​asp_long_running_Regressing_Jobs_Alerts​每分钟执行一次并向​BigShotAdminBigShotAdmin@MyCoDomain.com​发送警报。

USE [msdb]
GO
 
/****** Object:  Job [MyMonitoredServerName_Maintenance - Regressing_Jobs] ******/
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
IF NOT EXISTS (SELECT name
                 FROM msdb.dbo.syscategories
   WHERE name=N'[BigLoad]'
     AND category_class=1)
BEGIN
 EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[BigLoad]'
 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
 
END
 
DECLARE @jobId BINARY(16)
 EXEC   @ReturnCode =  msdb.dbo.sp_add_job
        @job_name=N'MyMonitoredServerName_Maintenance - Regressing_Jobs',
                                @enabled=1,
                                @notify_level_eventlog=0,
                                @notify_level_email=0,
                                @notify_level_netsend=0,
                                @notify_level_page=0,
                                @delete_level=0,
                                @description=N'Send email notifications to _
                                @RrecipientsList (last parameter in SP) on CURRENTLY RUNNING _
                                Agent Jobs that regress in performance by duration compared to _
                                baseline (baseline collected during the number of days before _
                                getdate() specified by the first parameter @history_days.',
                                @category_name=N'[BigLoad]',
                                @owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object:  Step [asp_long_running_Regressing_Jobs_Alerts] ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, _
   @step_name=N'asp_long_running_Regressing_Jobs_Alerts',
                                @step_id=1,
                                @cmdexec_success_code=0,
                                @on_success_action=1,
                                @on_success_step_id=0,
                                @on_fail_action=2,
                                @on_fail_step_id=0,
                                @retry_attempts=0,
                                @retry_interval=0,
                                @os_run_priority=0, @subsystem=N'TSQL',
                                @command=N'exec _
                                           dbmaint..asp_long_running_Regressing_Jobs_Alerts
                                        @history_days = 45,
                                        @avg_duration_multiplier = 2,
                                        @bEmail = 1,
                                        @bSaveToTable = 1,
                                        @RecipientsList  = ''BigShotAdmin@MyCoDomain.com;'',
                                        @ignore_zero_durations = 1',
                                        @database_name=N'master',
                                        @flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'every 1 min',
                                @enabled=1,
                                @freq_type=4,
                                @freq_interval=1,
                                @freq_subday_type=4,
                                @freq_subday_interval=1,
                                @freq_relative_interval=0,
                                @freq_recurrence_factor=0,
                                @active_start_date=20201222,
                                @active_end_date=99991231,
                                @active_start_time=60000,
                                @active_end_time=235959,
                                @schedule_uid=N'999ac144-4e13-4965-82f2-55555cc37a09'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

享受有关回归/长时间运行的 SQL 代理作业的警报!


SQL

0 人点赞