Cron Job Monitoring: The Complete Developer's Guide

Master cron job monitoring with practical examples, code samples, and best practices. Learn to detect silent failures, timeouts, and missed executions before they cause data loss.

A critical database backup fails silently at 3 AM. Your team discovers it three weeks later when a server crashes and you need those backups. This nightmare scenario happens more often than you think.

Cron jobs are the invisible backbone of modern infrastructure—running backups, processing data, sending reports, cleaning logs, and keeping systems healthy. But here’s the problem: when cron jobs fail, they fail silently. No error messages, no alerts, just quiet failure that can go unnoticed for days, weeks, or months.

In this comprehensive guide, you’ll learn how to implement professional cron job monitoring that catches failures before they become disasters, complete with practical code examples and battle-tested strategies.

What is Cron Job Monitoring?

Cron job monitoring is the practice of actively tracking your scheduled tasks to ensure they:

  • Execute on schedule (not skipped or delayed)
  • Complete successfully (exit code 0)
  • Finish within expected timeframes (no timeouts)
  • Produce expected results (validation beyond exit codes)
  • Run on the correct systems (especially in clustered environments)
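
In code terms, a monitor keeps a small record of each run and compares it against the job's schedule and time budget. A minimal Python sketch of those checks (the JobRun structure and the 1.5x threshold are illustrative, not tied to any particular tool):

from dataclasses import dataclass
import time

@dataclass
class JobRun:
    started_at: float       # Unix timestamp of the last start ping
    finished_at: float      # completion ping timestamp, 0.0 if never received
    exit_code: int          # exit code reported on completion
    interval: int           # seconds between scheduled runs
    max_duration: int       # seconds the job is allowed to take

def problems(run: JobRun, now=None):
    """Return a list of issues detected for the most recent run."""
    now = now or time.time()
    issues = []
    if now - run.started_at > run.interval * 1.5:
        issues.append("missed or delayed execution")
    if run.finished_at == 0.0 and now - run.started_at > run.max_duration:
        issues.append("running past its time budget")
    if run.finished_at and run.exit_code != 0:
        issues.append(f"failed with exit code {run.exit_code}")
    return issues

Validating results and confirming the job ran on the right host need job-specific checks, which later sections cover.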

The Silent Killer: Why Cron Jobs Fail Without Warning

Unlike web applications that fail loudly with 500 errors and user complaints, cron jobs fail in silence. Consider these real-world scenarios:

Scenario 1: The Disappeared Backup

# This backup job runs every night at 2 AM
0 2 * * * /usr/local/bin/backup-database.sh

# What the job does:
# 1. Dumps the database
# 2. Compresses the dump
# 3. Uploads to S3
# 4. Cleans up local files

# What goes wrong:
# - AWS credentials expire → Upload fails, exit code 0 (script continues)
# - Disk is full → Dump fails, script exits silently
# - Network issues → Upload times out, no retry logic
# - Script has a bug → Stops at step 2, returns 0

# Result: No backups for weeks, discovered only during disaster recovery

Scenario 2: The Payment Processor

# Process pending payments every 15 minutes
*/15 * * * * /app/bin/process-payments

# What goes wrong:
# - Database connection pool exhausted → Job hangs indefinitely
# - Payment gateway API changes → All transactions fail
# - Server memory issue → Process killed by OOM, no logging
# - Timezone bug → Job runs at wrong time, misses payment windows

# Result: Thousands of failed payments, angry customers, revenue loss

Scenario 3: The Report Generator

# Generate daily sales report at 6 AM
0 6 * * * /usr/local/bin/generate-sales-report.sh

# What goes wrong:
# - Report server migrated, crontab not updated → Never runs
# - Dependencies updated → Python script breaks with import error
# - Email server down → Report generated but never sent
# - Report takes 4 hours instead of 30 minutes → Still running when next job starts

# Result: Executives don't get reports, business decisions delayed

Traditional Monitoring Fails for Cron Jobs

Why your existing monitoring doesn’t catch cron failures:

Server monitoring only shows:

  • CPU usage
  • Memory consumption
  • Disk space
  • Process counts

But it doesn’t tell you:

  • If a specific cron job ran
  • If it succeeded or failed
  • How long it took
  • What errors occurred

Log monitoring falls short:

# Cron logs look like this:
Nov 15 02:00:01 server CRON[12345]: (root) CMD (/usr/local/bin/backup.sh)

# That's it. No indication of:
# - Did the script succeed?
# - How long did it take?
# - What was the output?
# - Were there errors?

The Cost of Silent Cron Failures

Let’s look at the real-world impact:

Data Loss Scenarios

Financial Impact:

  • Lost backups discovered during outage: Critical data unrecoverable
  • ETL pipeline failing silently: Business analytics using stale data
  • Log rotation not running: Disk fills, brings down production

Time Impact:

  • Average time to discover failed cron job: 2-3 weeks
  • Time to diagnose root cause without logs: 4-8 hours
  • Time to rebuild lost data: Days to weeks

Compliance and Security Risks

# This security audit log cleanup should run daily
0 3 * * * /usr/local/bin/rotate-audit-logs.sh

# When it fails:
# - Compliance violations (logs not retained properly)
# - Storage costs spike (logs never archived)
# - Security events lost (old logs overwritten)
# - Audit failures during inspections

Types of Cron Job Failures (And How to Detect Them)

1. Never Started (Execution Failure)

The cron job never runs at all.

Common Causes:

  • Cron daemon not running
  • Syntax errors in crontab
  • Incorrect file permissions
  • Server timezone issues
  • User account disabled

Example:

# You think this runs every day at midnight:
0 0 * * * /home/user/backup.sh

# But it never runs because:
# - backup.sh doesn't have execute permissions
# - User 'user' was deleted
# - Cron daemon crashed and wasn't restarted

How to detect: Monitor for the absence of expected heartbeats. If a job should run every hour but you haven’t received a ping in 90 minutes, something’s wrong.
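
A minimal Python sketch of that check, assuming you record the timestamp of each received ping (the interval and grace values are per-job settings, not fixed numbers):

import time

def is_overdue(last_ping_ts: float, interval_seconds: int, grace_seconds: int) -> bool:
    """True when no heartbeat arrived within the schedule interval plus grace."""
    return time.time() > last_ping_ts + interval_seconds + grace_seconds

# Hourly job with a 30-minute grace period: alert when this returns True
# is_overdue(last_ping_ts, interval_seconds=3600, grace_seconds=1800)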

2. Failed Execution (Non-Zero Exit Code)

The job runs but exits with an error.

Example:

#!/bin/bash
# backup-database.sh

# This will exit with code 1 if pg_dump fails
pg_dump mydb > /backups/db-$(date +%Y%m%d).sql

if [ $? -ne 0 ]; then
    # Even with this check, cron won't alert you
    echo "Backup failed!" >&2
    exit 1
fi

# Upload to S3
aws s3 cp /backups/db-$(date +%Y%m%d).sql s3://my-bucket/

Without monitoring, this silent failure continues indefinitely.

3. Timeout/Hung Process

The job starts but never completes.

Example:

#!/usr/bin/env python3
# data-sync.py

import time
import psycopg2

def sync_data():
    # This query locks and never returns
    conn = psycopg2.connect("dbname=mydb")
    cursor = conn.cursor()
    
    # Query has a table lock, waits forever
    cursor.execute("""
        SELECT * FROM large_table 
        WHERE status = 'pending' 
        FOR UPDATE
    """)
    
    # Never reaches here
    process_rows(cursor.fetchall())

if __name__ == "__main__":
    sync_data()

Without timeout monitoring:

  • Process runs indefinitely
  • Next execution starts while previous still running
  • Eventually exhausts system resources
  • Multiple hung processes accumulate

4. Silent Logic Failure

The job completes “successfully” but produces wrong results.

Example:

#!/bin/bash
# process-orders.sh

# Processes new orders from database
ORDER_COUNT=$(psql -t -c "SELECT COUNT(*) FROM orders WHERE status='new'")

# Bug: Variable is empty due to connection error
# But script continues anyway
for order_id in $(psql -t -c "SELECT id FROM orders WHERE status='new'"); do
    process_order $order_id
done

# Returns 0 even though no orders were processed!
exit 0

Result: Exit code 0, cron thinks it succeeded, but zero orders processed.
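
The remedy is to treat "no work done when work was expected" as a failure in its own right. A hedged Python sketch of that guard (fetch_pending_ids and process_order stand in for your own data access code):

import sys

def process_new_orders(fetch_pending_ids, process_order):
    """Process pending orders, failing loudly instead of silently doing nothing."""
    order_ids = fetch_pending_ids()
    if order_ids is None:
        # A connection error should surface as a failure, not an empty loop
        print("Could not fetch pending orders", file=sys.stderr)
        sys.exit(1)

    processed = 0
    for order_id in order_ids:
        process_order(order_id)
        processed += 1

    if processed == 0:
        print("Zero orders processed; treating run as failed", file=sys.stderr)
        sys.exit(1)

    return processed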

5. Resource Exhaustion

The job fails due to system constraints.

Example:

#!/bin/bash
# generate-reports.sh

# Generates CSV reports for all customers
for customer_id in $(get_customer_ids); do
    # Each report is 500MB, stored in /tmp
    generate_report $customer_id > /tmp/report-${customer_id}.csv
done

# Problem: /tmp fills up after 10 reports
# Remaining 990 customers get no reports
# Script exits 0 because loop completed

Implementing Professional Cron Job Monitoring

The Heartbeat Monitoring Pattern

The most reliable cron monitoring uses a “dead man’s switch” or heartbeat pattern:

  1. Your cron job pings a monitoring service when it starts
  2. It pings again when it completes successfully
  3. The monitoring service expects pings at regular intervals
  4. If pings stop coming, you get alerted

Why this works:

  • No polling required
  • Works across firewalls/NAT
  • Minimal performance impact
  • Detects all failure types

Basic Heartbeat Monitoring with Seiri

Seiri provides simple HTTP endpoints for cron monitoring using the heartbeat pattern.

Quick Start Example

# Your existing cron job
0 2 * * * /usr/local/bin/backup-database.sh

# Enhanced with Seiri monitoring
0 2 * * * curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/start && /usr/local/bin/backup-database.sh && curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/success || curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/fail

Breaking down the monitoring:

# Before job starts - "I'm about to run"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/start

# Run the actual job
/usr/local/bin/backup-database.sh

# After success - "I completed successfully"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/success

# On failure - "I failed"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/fail

Curl flags explained:

  • -f: Fail silently on HTTP errors
  • -s: Silent mode (no progress bar)
  • -S: Show errors even in silent mode
  • -m 10: Maximum 10 seconds for the operation
  • --retry 5: Retry up to 5 times on transient errors

Bash Script Wrapper Pattern

The one-liner is convenient, but it only runs the job if the start ping succeeds, and it captures neither output nor duration. For better reliability, wrap your cron jobs in a monitoring script:

#!/bin/bash
# cron-wrapper.sh - Universal cron job wrapper with monitoring

set -euo pipefail

# Configuration
SEIRI_PING_URL="${SEIRI_PING_URL:-}"
JOB_NAME="${1:-unknown-job}"
shift
JOB_COMMAND=("$@")   # keep as an array so arguments with spaces survive

# Validate configuration
if [ -z "$SEIRI_PING_URL" ]; then
    echo "Error: SEIRI_PING_URL not set" >&2
    exit 1
fi

# Function to send ping with retry logic
send_ping() {
    local status="$1"
    local max_attempts=3
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        if curl -fsS -m 10 "${SEIRI_PING_URL}/${status}" 2>/dev/null; then
            return 0
        fi
        echo "Ping attempt $attempt failed, retrying..." >&2
        sleep 2
        attempt=$((attempt + 1))
    done
    
    echo "Warning: Failed to send $status ping after $max_attempts attempts" >&2
    return 1
}

# Send start ping
send_ping "start"

# Record start time
START_TIME=$(date +%s)

# Execute the job and capture output
JOB_OUTPUT=$(mktemp)
JOB_EXIT_CODE=0

if "${JOB_COMMAND[@]}" > "$JOB_OUTPUT" 2>&1; then
    JOB_EXIT_CODE=0
    PING_STATUS="success"
else
    JOB_EXIT_CODE=$?
    PING_STATUS="fail"
fi

# Calculate duration
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Send completion ping with metadata
send_ping "${PING_STATUS}"

# Optional: Send detailed metrics
curl -fsS -X POST "${SEIRI_PING_URL}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF 2>/dev/null || true
{
    "job_name": "${JOB_NAME}",
    "exit_code": ${JOB_EXIT_CODE},
    "duration_seconds": ${DURATION},
    "output_lines": $(wc -l < "$JOB_OUTPUT"),
    "timestamp": $(date +%s)
}
EOF

# If job failed, output the logs
if [ $JOB_EXIT_CODE -ne 0 ]; then
    echo "Job ${JOB_NAME} failed with exit code ${JOB_EXIT_CODE}" >&2
    echo "Job output:" >&2
    cat "$JOB_OUTPUT" >&2
fi

# Cleanup
rm -f "$JOB_OUTPUT"

exit $JOB_EXIT_CODE

Usage in crontab:

# Set the Seiri ping URL
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>

# Use the wrapper for all jobs
0 2 * * * /usr/local/bin/cron-wrapper.sh "database-backup" /usr/local/bin/backup-database.sh
0 6 * * * /usr/local/bin/cron-wrapper.sh "sales-report" /usr/local/bin/generate-sales-report.sh
*/15 * * * * /usr/local/bin/cron-wrapper.sh "process-payments" /app/bin/process-payments

Python Implementation

For Python-based cron jobs:

#!/usr/bin/env python3
"""
cron_monitor.py - Python decorator for cron job monitoring
"""

import os
import sys
import time
import requests
import functools
import traceback
from typing import Callable, Any

class CronMonitor:
    """Monitor cron jobs with Seiri heartbeat pings"""
    
    def __init__(self, ping_url: str, job_name: str):
        self.ping_url = ping_url
        self.job_name = job_name
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': 'Seiri-Cron-Monitor/1.0'})
    
    def send_ping(self, status: str, retry: int = 3) -> bool:
        """Send ping with retry logic"""
        url = f"{self.ping_url}/{status}"
        
        for attempt in range(retry):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return True
            except requests.RequestException as e:
                if attempt == retry - 1:
                    print(f"Failed to send {status} ping: {e}", file=sys.stderr)
                    return False
                time.sleep(2 ** attempt)  # Exponential backoff
        
        return False
    
    def send_metrics(self, metrics: dict) -> bool:
        """Send detailed metrics via POST"""
        try:
            response = self.session.post(
                self.ping_url,
                json=metrics,
                timeout=10
            )
            response.raise_for_status()
            return True
        except requests.RequestException as e:
            print(f"Failed to send metrics: {e}", file=sys.stderr)
            return False
    
    def __call__(self, func: Callable) -> Callable:
        """Decorator to wrap cron job functions"""
        
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            # Send start ping
            self.send_ping('start')
            
            start_time = time.time()
            exit_code = 0
            error_message = None
            result = None
            
            try:
                # Execute the job
                result = func(*args, **kwargs)
                status = 'success'
                
            except Exception as e:
                # Job failed with exception
                status = 'fail'
                exit_code = 1
                error_message = str(e)
                
                # Log the full traceback
                print(f"Job {self.job_name} failed:", file=sys.stderr)
                traceback.print_exc()
                
                raise  # Re-raise the exception
                
            finally:
                # Calculate duration
                duration = time.time() - start_time
                
                # Send completion ping
                self.send_ping(status)
                
                # Send detailed metrics
                metrics = {
                    'job_name': self.job_name,
                    'status': status,
                    'exit_code': exit_code,
                    'duration_seconds': round(duration, 2),
                    'timestamp': int(time.time())
                }
                
                if error_message:
                    metrics['error_message'] = error_message
                
                self.send_metrics(metrics)
            
            return result
        
        return wrapper


# Example usage
SEIRI_PING_URL = os.getenv('SEIRI_PING_URL')

if not SEIRI_PING_URL:
    raise ValueError("SEIRI_PING_URL environment variable not set")

@CronMonitor(SEIRI_PING_URL, 'database-backup')
def backup_database():
    """Daily database backup job"""
    import subprocess
    
    print("Starting database backup...")
    
    # Run pg_dump
    result = subprocess.run(
        ['pg_dump', '-h', 'localhost', '-U', 'postgres', 'mydb'],
        capture_output=True,
        check=True
    )
    
    # Save to file
    backup_file = f"/backups/db-{time.strftime('%Y%m%d')}.sql"
    with open(backup_file, 'wb') as f:
        f.write(result.stdout)
    
    print(f"Backup saved to {backup_file}")
    
    # Upload to S3
    subprocess.run(
        ['aws', 's3', 'cp', backup_file, 's3://my-bucket/backups/'],
        check=True
    )
    
    print("Backup uploaded to S3")
    
    return backup_file


@CronMonitor(SEIRI_PING_URL, 'process-orders')
def process_pending_orders():
    """Process pending orders from database"""
    import psycopg2
    
    conn = psycopg2.connect(
        host='localhost',
        database='mydb',
        user='postgres'
    )
    
    cursor = conn.cursor()
    
    # Get pending orders
    cursor.execute("SELECT id FROM orders WHERE status = 'pending' LIMIT 100")
    order_ids = [row[0] for row in cursor.fetchall()]
    
    if not order_ids:
        print("No pending orders to process")
        return 0
    
    processed_count = 0
    failed_count = 0
    
    for order_id in order_ids:
        try:
            # Process each order
            cursor.execute("""
                UPDATE orders 
                SET status = 'processed', processed_at = NOW() 
                WHERE id = %s
            """, (order_id,))
            conn.commit()
            processed_count += 1
            
        except Exception as e:
            print(f"Failed to process order {order_id}: {e}", file=sys.stderr)
            conn.rollback()
            failed_count += 1
    
    cursor.close()
    conn.close()
    
    print(f"Processed {processed_count} orders, {failed_count} failed")
    
    # Raise exception if too many failures
    if failed_count > processed_count * 0.1:  # More than 10% failed
        raise Exception(f"Too many order processing failures: {failed_count}/{len(order_ids)}")
    
    return processed_count


if __name__ == '__main__':
    # This script can be called directly from cron
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: cron_jobs.py <job_name>", file=sys.stderr)
        sys.exit(1)
    
    job_name = sys.argv[1]
    
    if job_name == 'backup':
        backup_database()
    elif job_name == 'process-orders':
        process_pending_orders()
    else:
        print(f"Unknown job: {job_name}", file=sys.stderr)
        sys.exit(1)

Crontab configuration:

# Export Seiri ping URL
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>

# Run different jobs
0 2 * * * /usr/local/bin/cron_monitor.py backup
*/15 * * * * /usr/local/bin/cron_monitor.py process-orders

Node.js Implementation

#!/usr/bin/env node
/**
 * cron-monitor.js - Node.js cron job monitoring
 */

const axios = require('axios');
const { performance } = require('perf_hooks');

class CronMonitor {
    constructor(pingUrl, jobName) {
        this.pingUrl = pingUrl;
        this.jobName = jobName;
        this.client = axios.create({
            timeout: 10000,
            headers: { 'User-Agent': 'Seiri-Cron-Monitor/1.0' }
        });
    }
    
    async sendPing(status, retries = 3) {
        const url = `${this.pingUrl}/${status}`;
        
        for (let attempt = 0; attempt < retries; attempt++) {
            try {
                await this.client.get(url);
                return true;
            } catch (error) {
                if (attempt === retries - 1) {
                    console.error(`Failed to send ${status} ping:`, error.message);
                    return false;
                }
                await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
            }
        }
        
        return false;
    }
    
    async sendMetrics(metrics) {
        try {
            await this.client.post(this.pingUrl, metrics);
            return true;
        } catch (error) {
            console.error('Failed to send metrics:', error.message);
            return false;
        }
    }
    
    async wrap(jobFunction) {
        // Send start ping
        await this.sendPing('start');
        
        const startTime = performance.now();
        let status = 'success';
        let exitCode = 0;
        let errorMessage = null;
        let result = null;
        
        try {
            // Execute the job
            result = await jobFunction();
            
        } catch (error) {
            // Job failed
            status = 'fail';
            exitCode = 1;
            errorMessage = error.message;
            
            console.error(`Job ${this.jobName} failed:`, error);
            throw error;
            
        } finally {
            // Calculate duration
            const duration = (performance.now() - startTime) / 1000;
            
            // Send completion ping
            await this.sendPing(status);
            
            // Send detailed metrics
            const metrics = {
                job_name: this.jobName,
                status,
                exit_code: exitCode,
                duration_seconds: Math.round(duration * 100) / 100,
                timestamp: Math.floor(Date.now() / 1000)
            };
            
            if (errorMessage) {
                metrics.error_message = errorMessage;
            }
            
            await this.sendMetrics(metrics);
        }
        
        return result;
    }
}

// Example usage
const SEIRI_PING_URL = process.env.SEIRI_PING_URL;

if (!SEIRI_PING_URL) {
    console.error('Error: SEIRI_PING_URL environment variable not set');
    process.exit(1);
}

// Database backup job
async function backupDatabase() {
    const { exec } = require('child_process');
    const { promisify } = require('util');
    const execAsync = promisify(exec);
    
    console.log('Starting database backup...');
    
    const date = new Date().toISOString().split('T')[0].replace(/-/g, '');
    const backupFile = `/backups/db-${date}.sql`;
    
    // Run pg_dump
    await execAsync(`pg_dump -h localhost -U postgres mydb > ${backupFile}`);
    console.log(`Backup saved to ${backupFile}`);
    
    // Upload to S3
    await execAsync(`aws s3 cp ${backupFile} s3://my-bucket/backups/`);
    console.log('Backup uploaded to S3');
    
    return backupFile;
}

// Process orders job
async function processOrders() {
    const { Client } = require('pg');
    
    const client = new Client({
        host: 'localhost',
        database: 'mydb',
        user: 'postgres'
    });
    
    await client.connect();
    
    try {
        // Get pending orders
        const result = await client.query(
            "SELECT id FROM orders WHERE status = 'pending' LIMIT 100"
        );
        
        if (result.rows.length === 0) {
            console.log('No pending orders to process');
            return 0;
        }
        
        let processedCount = 0;
        let failedCount = 0;
        
        for (const row of result.rows) {
            try {
                await client.query(
                    "UPDATE orders SET status = 'processed', processed_at = NOW() WHERE id = $1",
                    [row.id]
                );
                processedCount++;
            } catch (error) {
                console.error(`Failed to process order ${row.id}:`, error.message);
                failedCount++;
            }
        }
        
        console.log(`Processed ${processedCount} orders, ${failedCount} failed`);
        
        // Fail if too many errors
        if (failedCount > result.rows.length * 0.1) {
            throw new Error(`Too many failures: ${failedCount}/${result.rows.length}`);
        }
        
        return processedCount;
        
    } finally {
        await client.end();
    }
}

// Main execution
async function main() {
    const jobName = process.argv[2];
    
    if (!jobName) {
        console.error('Usage: node cron-monitor.js <job_name>');
        process.exit(1);
    }
    
    let monitor;
    let jobFunction;
    
    if (jobName === 'backup') {
        monitor = new CronMonitor(SEIRI_PING_URL, 'database-backup');
        jobFunction = backupDatabase;
    } else if (jobName === 'process-orders') {
        monitor = new CronMonitor(SEIRI_PING_URL, 'process-orders');
        jobFunction = processOrders;
    } else {
        console.error(`Unknown job: ${jobName}`);
        process.exit(1);
    }
    
    try {
        await monitor.wrap(jobFunction);
        process.exit(0);
    } catch (error) {
        process.exit(1);
    }
}

if (require.main === module) {
    main();
}

module.exports = { CronMonitor };

Crontab:

SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>

0 2 * * * /usr/local/bin/node /app/cron-monitor.js backup
*/15 * * * * /usr/local/bin/node /app/cron-monitor.js process-orders

PHP Implementation

<?php
/**
 * CronMonitor.php - PHP cron job monitoring
 */

class CronMonitor {
    private $pingUrl;
    private $jobName;
    private $timeout = 10;
    private $retries = 3;
    
    public function __construct($pingUrl, $jobName) {
        $this->pingUrl = $pingUrl;
        $this->jobName = $jobName;
    }
    
    private function sendPing($status) {
        $url = $this->pingUrl . '/' . $status;
        
        for ($attempt = 0; $attempt < $this->retries; $attempt++) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
            curl_setopt($ch, CURLOPT_USERAGENT, 'Seiri-Cron-Monitor/1.0');
            
            $result = curl_exec($ch);
            $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);
            
            if ($httpCode >= 200 && $httpCode < 300) {
                return true;
            }
            
            if ($attempt < $this->retries - 1) {
                sleep(pow(2, $attempt));
            }
        }
        
        error_log("Failed to send $status ping after {$this->retries} attempts");
        return false;
    }
    
    private function sendMetrics($metrics) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->pingUrl);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($metrics));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        curl_setopt($ch, CURLOPT_HTTPHEADER, [
            'Content-Type: application/json',
            'User-Agent: Seiri-Cron-Monitor/1.0'
        ]);
        
        $result = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        
        if ($httpCode < 200 || $httpCode >= 300) {
            error_log("Failed to send metrics: HTTP $httpCode");
            return false;
        }
        
        return true;
    }
    
    public function wrap(callable $jobFunction) {
        // Send start ping
        $this->sendPing('start');
        
        $startTime = microtime(true);
        $status = 'success';
        $exitCode = 0;
        $errorMessage = null;
        $result = null;
        
        try {
            // Execute the job
            $result = $jobFunction();
            
        } catch (Exception $e) {
            // Job failed
            $status = 'fail';
            $exitCode = 1;
            $errorMessage = $e->getMessage();
            
            error_log("Job {$this->jobName} failed: " . $e->getMessage());
            error_log($e->getTraceAsString());
            
            throw $e;
            
        } finally {
            // Calculate duration
            $duration = microtime(true) - $startTime;
            
            // Send completion ping
            $this->sendPing($status);
            
            // Send detailed metrics
            $metrics = [
                'job_name' => $this->jobName,
                'status' => $status,
                'exit_code' => $exitCode,
                'duration_seconds' => round($duration, 2),
                'timestamp' => time()
            ];
            
            if ($errorMessage !== null) {
                $metrics['error_message'] = $errorMessage;
            }
            
            $this->sendMetrics($metrics);
        }
        
        return $result;
    }
}

// Example usage
$SEIRI_PING_URL = getenv('SEIRI_PING_URL');

if (!$SEIRI_PING_URL) {
    fwrite(STDERR, "Error: SEIRI_PING_URL environment variable not set\n");
    exit(1);
}

// Database backup job
function backupDatabase() {
    echo "Starting database backup...\n";
    
    $date = date('Ymd');
    $backupFile = "/backups/db-{$date}.sql";
    
    // Run mysqldump
    $command = "mysqldump -h localhost -u root mydb > " . escapeshellarg($backupFile);
    exec($command, $output, $returnCode);
    
    if ($returnCode !== 0) {
        throw new Exception("mysqldump failed with exit code $returnCode");
    }
    
    echo "Backup saved to $backupFile\n";
    
    // Upload to S3
    $command = "aws s3 cp " . escapeshellarg($backupFile) . " s3://my-bucket/backups/";
    exec($command, $output, $returnCode);
    
    if ($returnCode !== 0) {
        throw new Exception("S3 upload failed with exit code $returnCode");
    }
    
    echo "Backup uploaded to S3\n";
    
    return $backupFile;
}

// Process orders job
function processOrders() {
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'root', '');
    
    // Get pending orders
    $stmt = $pdo->query("SELECT id FROM orders WHERE status = 'pending' LIMIT 100");
    $orders = $stmt->fetchAll(PDO::FETCH_COLUMN);
    
    if (empty($orders)) {
        echo "No pending orders to process\n";
        return 0;
    }
    
    $processedCount = 0;
    $failedCount = 0;
    
    foreach ($orders as $orderId) {
        try {
            $stmt = $pdo->prepare(
                "UPDATE orders SET status = 'processed', processed_at = NOW() WHERE id = ?"
            );
            $stmt->execute([$orderId]);
            $processedCount++;
            
        } catch (Exception $e) {
            fwrite(STDERR, "Failed to process order $orderId: " . $e->getMessage() . "\n");
            $failedCount++;
        }
    }
    
    echo "Processed $processedCount orders, $failedCount failed\n";
    
    // Fail if too many errors
    if ($failedCount > count($orders) * 0.1) {
        throw new Exception("Too many failures: $failedCount/" . count($orders));
    }
    
    return $processedCount;
}

// Main execution
if (php_sapi_name() === 'cli') {
    if ($argc < 2) {
        fwrite(STDERR, "Usage: php cron-monitor.php <job_name>\n");
        exit(1);
    }
    
    $jobName = $argv[1];
    
    try {
        if ($jobName === 'backup') {
            $monitor = new CronMonitor($SEIRI_PING_URL, 'database-backup');
            $monitor->wrap('backupDatabase');
        } elseif ($jobName === 'process-orders') {
            $monitor = new CronMonitor($SEIRI_PING_URL, 'process-orders');
            $monitor->wrap('processOrders');
        } else {
            fwrite(STDERR, "Unknown job: $jobName\n");
            exit(1);
        }
        
        exit(0);
        
    } catch (Exception $e) {
        exit(1);
    }
}
?>

Crontab:

SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>

0 2 * * * php /app/cron-monitor.php backup
*/15 * * * * php /app/cron-monitor.php process-orders

Advanced Monitoring Strategies

1. Execution Time Monitoring

Track how long jobs take to identify performance degradation:

import sys
import time
import statistics

import requests

class ExecutionTimeMonitor:
    """Track execution time trends"""
    
    def __init__(self, ping_url, job_name):
        self.ping_url = ping_url
        self.job_name = job_name
        self.history_file = f'/var/log/cron/{job_name}-times.log'
    
    def record_execution(self, duration_seconds):
        """Record execution time"""
        with open(self.history_file, 'a') as f:
            f.write(f"{int(time.time())},{duration_seconds}\n")
        
        # Keep only last 100 executions
        self.trim_history()
    
    def trim_history(self, keep=100):
        """Keep only recent executions"""
        try:
            with open(self.history_file, 'r') as f:
                lines = f.readlines()
            
            if len(lines) > keep:
                with open(self.history_file, 'w') as f:
                    f.writelines(lines[-keep:])
        except FileNotFoundError:
            pass
    
    def get_statistics(self):
        """Calculate execution time statistics"""
        try:
            with open(self.history_file, 'r') as f:
                times = [float(line.split(',')[1]) for line in f]
            
            if not times:
                return None
            
            return {
                'mean': statistics.mean(times),
                'median': statistics.median(times),
                'stdev': statistics.stdev(times) if len(times) > 1 else 0,
                'min': min(times),
                'max': max(times),
                'recent': times[-1],
                'count': len(times)
            }
        except (FileNotFoundError, ValueError):
            return None
    
    def check_for_anomaly(self, current_duration):
        """Detect if current execution is anomalously slow"""
        stats = self.get_statistics()
        
        if not stats or stats['count'] < 10:
            return False  # Not enough data
        
        # If current duration is more than 3 standard deviations from mean
        threshold = stats['mean'] + (3 * stats['stdev'])
        
        return current_duration > threshold


# Usage example
@CronMonitor(SEIRI_PING_URL, 'data-sync')
def sync_data():
    time_monitor = ExecutionTimeMonitor(SEIRI_PING_URL, 'data-sync')
    
    start = time.time()
    
    # Do the actual work
    perform_sync()
    
    duration = time.time() - start
    
    # Record execution time
    time_monitor.record_execution(duration)
    
    # Check for anomaly
    if time_monitor.check_for_anomaly(duration):
        stats = time_monitor.get_statistics()
        print(f"WARNING: Job took {duration:.1f}s (mean: {stats['mean']:.1f}s)", 
              file=sys.stderr)
        
        # Send alert metric
        requests.post(SEIRI_PING_URL, json={
            'alert': 'slow_execution',
            'current_duration': duration,
            'mean_duration': stats['mean'],
            'deviation': duration - stats['mean']
        })

2. Output Validation

Don’t just check exit codes—validate the actual results:

#!/bin/bash
# validate-backup.sh - Validate backup job results

SEIRI_PING_URL="https://cloud.seiri.app/webhook/<cid>:<unique-id>"
BACKUP_DIR="/backups"
MIN_SIZE_MB=100  # Minimum expected backup size
MAX_AGE_HOURS=2  # Maximum age for latest backup

# Send start ping
curl -fsS "${SEIRI_PING_URL}/start"

# Find latest backup
LATEST_BACKUP=$(find "$BACKUP_DIR" -name "db-*.sql" -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2)

if [ -z "$LATEST_BACKUP" ]; then
    echo "ERROR: No backup files found" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi

# Check backup age
BACKUP_AGE_HOURS=$(( ($(date +%s) - $(stat -c %Y "$LATEST_BACKUP")) / 3600 ))

if [ $BACKUP_AGE_HOURS -gt $MAX_AGE_HOURS ]; then
    echo "ERROR: Latest backup is ${BACKUP_AGE_HOURS} hours old (max: ${MAX_AGE_HOURS})" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi

# Check backup size
BACKUP_SIZE_MB=$(du -m "$LATEST_BACKUP" | cut -f1)

if [ $BACKUP_SIZE_MB -lt $MIN_SIZE_MB ]; then
    echo "ERROR: Backup is only ${BACKUP_SIZE_MB}MB (min: ${MIN_SIZE_MB}MB)" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi

# Validate SQL syntax (quick check)
if ! head -100 "$LATEST_BACKUP" | grep -q "CREATE TABLE"; then
    echo "ERROR: Backup doesn't appear to be valid SQL" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi

# All validations passed
echo "Backup validation passed: ${BACKUP_SIZE_MB}MB, ${BACKUP_AGE_HOURS}h old"

curl -fsS "${SEIRI_PING_URL}/success"

# Send detailed metrics
curl -fsS -X POST "${SEIRI_PING_URL}" \
    -H "Content-Type: application/json" \
    -d "{
        \"backup_size_mb\": ${BACKUP_SIZE_MB},
        \"backup_age_hours\": ${BACKUP_AGE_HOURS},
        \"validation_passed\": true
    }"

3. Dependency Chain Monitoring

Monitor jobs that depend on each other:

import requests
import time

class DependencyChain:
    """Monitor dependent cron jobs"""
    
    def __init__(self, ping_url):
        self.ping_url = ping_url
    
    def wait_for_upstream(self, upstream_job, timeout_seconds=3600, check_interval=60):
        """Wait for upstream job to complete"""
        start_time = time.time()
        
        while time.time() - start_time < timeout_seconds:
            # Check if upstream job completed successfully
            # This assumes Seiri provides a status check endpoint
            try:
                response = requests.get(
                    f"{self.ping_url}/../{upstream_job}/status",
                    timeout=10
                )
                
                if response.status_code == 200:
                    data = response.json()
                    
                    if data.get('last_status') == 'success':
                        # Check recency (within last hour)
                        last_run = data.get('last_run_timestamp', 0)
                        if time.time() - last_run < 3600:
                            return True
            
            except requests.RequestException:
                pass
            
            # Wait before checking again
            time.sleep(check_interval)
        
        # Timeout reached
        raise TimeoutError(f"Upstream job {upstream_job} did not complete within {timeout_seconds}s")


# Example: Data pipeline with dependencies
@CronMonitor(SEIRI_PING_URL, 'extract-data')
def extract_data():
    """Step 1: Extract data from source"""
    # Extract logic here
    pass


@CronMonitor(SEIRI_PING_URL, 'transform-data')
def transform_data():
    """Step 2: Transform extracted data (depends on extract)"""
    chain = DependencyChain(SEIRI_PING_URL)
    
    # Wait for extract job to complete
    chain.wait_for_upstream('extract-data', timeout_seconds=1800)
    
    # Now safe to transform
    # Transform logic here
    pass


@CronMonitor(SEIRI_PING_URL, 'load-data')
def load_data():
    """Step 3: Load transformed data (depends on transform)"""
    chain = DependencyChain(SEIRI_PING_URL)
    
    # Wait for transform job to complete
    chain.wait_for_upstream('transform-data', timeout_seconds=1800)
    
    # Now safe to load
    # Load logic here
    pass

Crontab for pipeline:

# Extract runs at 1 AM
0 1 * * * /app/pipeline.py extract

# Transform runs at 2 AM (after extract)
0 2 * * * /app/pipeline.py transform

# Load runs at 3 AM (after transform)
0 3 * * * /app/pipeline.py load

4. Grace Period Configuration

Different jobs need different monitoring windows:

class GracePeriodConfig:
    """Configure appropriate grace periods for different job types"""
    
    CONFIGS = {
        'database-backup': {
            'expected_duration_minutes': 30,
            'grace_period_minutes': 15,  # 15 minutes past expected
            'schedule_cron': '0 2 * * *',  # Daily at 2 AM
            'criticality': 'high'
        },
        'log-rotation': {
            'expected_duration_minutes': 2,
            'grace_period_minutes': 5,
            'schedule_cron': '0 0 * * *',  # Daily at midnight
            'criticality': 'medium'
        },
        'send-reports': {
            'expected_duration_minutes': 10,
            'grace_period_minutes': 20,
            'schedule_cron': '0 6 * * 1',  # Weekly on Monday at 6 AM
            'criticality': 'medium'
        },
        'process-payments': {
            'expected_duration_minutes': 5,
            'grace_period_minutes': 3,
            'schedule_cron': '*/15 * * * *',  # Every 15 minutes
            'criticality': 'critical'
        }
    }
    
    @classmethod
    def get_config(cls, job_name):
        """Get configuration for a job"""
        return cls.CONFIGS.get(job_name, {
            'expected_duration_minutes': 60,
            'grace_period_minutes': 30,
            'criticality': 'low'
        })
    
    @classmethod
    def should_alert(cls, job_name, duration_minutes):
        """Determine if duration warrants an alert"""
        config = cls.get_config(job_name)
        threshold = config['expected_duration_minutes'] + config['grace_period_minutes']
        
        return duration_minutes > threshold
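
A brief usage sketch of the config above, checking one run's duration against its budget (the 52-minute figure is just an example):

# A backup run that took 52 minutes
job_name = 'database-backup'
duration_minutes = 52

config = GracePeriodConfig.get_config(job_name)
if GracePeriodConfig.should_alert(job_name, duration_minutes):
    budget = config['expected_duration_minutes'] + config['grace_period_minutes']
    print(f"{job_name} ran for {duration_minutes} min, over its {budget} min budget")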

Platform-Specific Monitoring

Linux Cron Jobs

Standard cron monitoring on Linux systems:

#!/bin/bash
# monitor-cron.sh - Monitor cron daemon health

SEIRI_PING_URL="https://cloud.seiri.app/ping/cron-daemon-health"

# Check if cron daemon is running
if ! pgrep -x cron > /dev/null && ! pgrep -x crond > /dev/null; then
    echo "ERROR: Cron daemon not running" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    
    # Attempt to restart cron
    if command -v systemctl > /dev/null; then
        systemctl restart cron || systemctl restart crond
    else
        service cron restart || service crond restart
    fi
    
    exit 1
fi

# Check cron logs for recent activity
if [ -f /var/log/cron ]; then
    RECENT_JOBS=$(grep "CMD" /var/log/cron | tail -10)
    
    if [ -z "$RECENT_JOBS" ]; then
        echo "WARNING: No recent cron activity in logs" >&2
    fi
fi

# All checks passed
curl -fsS "${SEIRI_PING_URL}/success"

Run this health check hourly:

0 * * * * /usr/local/bin/monitor-cron.sh

Windows Task Scheduler

Monitoring Windows scheduled tasks:

# Monitor-ScheduledTask.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$TaskName,
    
    [Parameter(Mandatory=$true)]
    [string]$SeiriPingUrl
)

# Send start ping
Invoke-RestMethod -Uri "$SeiriPingUrl/start" -Method Get -TimeoutSec 10

try {
    # Get task information
    $task = Get-ScheduledTask -TaskName $TaskName -ErrorAction Stop
    $taskInfo = Get-ScheduledTaskInfo -TaskName $TaskName
    
    # Check if task is enabled
    if ($task.State -ne 'Ready') {
        throw "Task is in state: $($task.State)"
    }
    
    # Check last run result
    if ($taskInfo.LastTaskResult -ne 0) {
        throw "Last run failed with code: $($taskInfo.LastTaskResult)"
    }
    
    # Check if task ran recently (within expected window)
    $expectedIntervalHours = 24  # Adjust based on task schedule
    $hoursSinceRun = ((Get-Date) - $taskInfo.LastRunTime).TotalHours
    
    if ($hoursSinceRun -gt ($expectedIntervalHours + 2)) {
        throw "Task hasn't run in $hoursSinceRun hours (expected: $expectedIntervalHours)"
    }
    
    # Send success ping with metrics
    $metrics = @{
        task_name = $TaskName
        last_run_time = $taskInfo.LastRunTime.ToString('o')
        last_result = $taskInfo.LastTaskResult
        next_run_time = $taskInfo.NextRunTime.ToString('o')
        state = $task.State
    } | ConvertTo-Json
    
    Invoke-RestMethod -Uri "$SeiriPingUrl/success" -Method Get -TimeoutSec 10
    Invoke-RestMethod -Uri $SeiriPingUrl -Method Post -Body $metrics -ContentType "application/json" -TimeoutSec 10
    
    exit 0
    
} catch {
    Write-Error "Task monitoring failed: $_"
    
    # Send failure ping
    Invoke-RestMethod -Uri "$SeiriPingUrl/fail" -Method Get -TimeoutSec 10
    
    exit 1
}

Wrap your Windows scheduled task:

# Instead of running your script directly:
C:\Scripts\backup.ps1

# Run it through the monitor:
powershell.exe -File C:\Scripts\Monitor-ScheduledTask.ps1 -TaskName "Database Backup" -SeiriPingUrl "https://cloud.seiri.app/ping/your-id" && C:\Scripts\backup.ps1

Kubernetes CronJobs

Monitoring Kubernetes CronJobs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:14
            env:
            - name: SEIRI_PING_URL
              valueFrom:
                secretKeyRef:
                  name: seiri-credentials
                  key: ping-url
            command:
            - /bin/sh
            - -c
            - |
              # Send start ping
              wget -qO- "${SEIRI_PING_URL}/start"
              
              # Run backup
              if pg_dump -h postgres -U postgres mydb > /backup/db-$(date +%Y%m%d).sql; then
                # Success
                wget -qO- "${SEIRI_PING_URL}/success"
                exit 0
              else
                # Failure
                wget -qO- "${SEIRI_PING_URL}/fail"
                exit 1
              fi
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure

AWS Lambda Scheduled Functions

Monitoring Lambda functions triggered by EventBridge:

import json
import os
import urllib3

http = urllib3.PoolManager()

SEIRI_PING_URL = os.environ['SEIRI_PING_URL']

def send_ping(status):
    """Send ping to Seiri"""
    try:
        http.request('GET', f"{SEIRI_PING_URL}/{status}", timeout=10)
    except Exception as e:
        print(f"Failed to send ping: {e}")

def lambda_handler(event, context):
    """AWS Lambda handler with monitoring"""
    
    # Send start ping
    send_ping('start')
    
    try:
        # Your actual Lambda logic here
        result = perform_scheduled_task()
        
        # Send success ping
        send_ping('success')
        
        return {
            'statusCode': 200,
            'body': json.dumps('Success')
        }
        
    except Exception as e:
        print(f"Error: {e}")
        
        # Send failure ping
        send_ping('fail')
        
        return {
            'statusCode': 500,
            'body': json.dumps(f'Error: {str(e)}')
        }

def perform_scheduled_task():
    """Your scheduled task logic"""
    # Task implementation
    return {'processed': 100}

Google Cloud Scheduler

from flask import Flask, request
import requests
import os

app = Flask(__name__)

SEIRI_PING_URL = os.environ['SEIRI_PING_URL']

@app.route('/scheduled-task', methods=['POST'])
def scheduled_task():
    """Endpoint triggered by Cloud Scheduler"""
    
    # Send start ping
    requests.get(f"{SEIRI_PING_URL}/start", timeout=10)
    
    try:
        # Your task logic
        result = perform_task()
        
        # Send success ping
        requests.get(f"{SEIRI_PING_URL}/success", timeout=10)
        
        return 'Success', 200
        
    except Exception as e:
        # Send failure ping
        requests.get(f"{SEIRI_PING_URL}/fail", timeout=10)
        
        return f'Error: {str(e)}', 500

def perform_task():
    """Your scheduled task"""
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

The Dead Man’s Switch Pattern

The dead man’s switch (also called heartbeat monitoring) is the most reliable pattern for cron job monitoring. Here’s why it works and how to implement it properly.

What is a Dead Man’s Switch?

The concept comes from trains: a dead man’s switch requires the operator to continuously hold a button. If they become incapacitated (the “dead man”), they release the button and the train stops automatically.

In cron monitoring:

  • Your job sends regular “I’m alive” pings
  • The monitoring service expects pings at specific intervals
  • If pings stop, you get alerted

Why it’s superior:

  • Works across firewalls: Your server pings out, no inbound connections needed
  • Detects all failure types: Job didn’t run, hung, crashed, server down—all result in missing pings
  • Simple integration: Just add a curl command to your jobs
  • No polling overhead: The monitoring service waits for pings, doesn’t poll your servers

Implementing Dead Man’s Switch

import sys
import time

import requests

class DeadManSwitch:
    """Implement dead man's switch pattern"""
    
    def __init__(self, ping_url, interval_minutes):
        self.ping_url = ping_url
        self.interval_minutes = interval_minutes
    
    def send_heartbeat(self):
        """Send heartbeat ping"""
        try:
            response = requests.get(
                self.ping_url,
                timeout=10,
                headers={'User-Agent': 'DeadManSwitch/1.0'}
            )
            response.raise_for_status()
            return True
        except Exception as e:
            print(f"Heartbeat failed: {e}", file=sys.stderr)
            return False
    
    def get_next_expected(self):
        """Calculate when next heartbeat is expected"""
        return time.time() + (self.interval_minutes * 60)


# For a job that runs every hour
@CronMonitor(SEIRI_PING_URL, 'hourly-sync')
def hourly_sync_job():
    """Job that runs every hour"""
    # Job logic here
    pass

# Crontab:
# 0 * * * * /app/jobs.py hourly-sync
# 
# Seiri is configured to expect heartbeats every hour with 10-minute grace period
# If no ping received in 70 minutes → Alert

Advanced Dead Man’s Switch: Multiple Heartbeats

For long-running jobs, send multiple heartbeats:

import sys
import threading
import time

import requests

class ContinuousHeartbeat:
    """Send heartbeats during long-running jobs"""
    
    def __init__(self, ping_url, interval_seconds=60):
        self.ping_url = ping_url
        self.interval_seconds = interval_seconds
        self.running = False
        self.thread = None
    
    def start(self):
        """Start sending heartbeats"""
        self.running = True
        self.thread = threading.Thread(target=self._heartbeat_loop, daemon=True)
        self.thread.start()
    
    def stop(self):
        """Stop sending heartbeats"""
        self.running = False
        if self.thread:
            self.thread.join(timeout=5)
    
    def _heartbeat_loop(self):
        """Continuous heartbeat loop"""
        while self.running:
            try:
                requests.get(f"{self.ping_url}/heartbeat", timeout=10)
            except Exception as e:
                print(f"Heartbeat failed: {e}", file=sys.stderr)
            
            time.sleep(self.interval_seconds)


# Usage for long-running job
def long_running_job():
    """Job that takes 2+ hours"""
    heartbeat = ContinuousHeartbeat(SEIRI_PING_URL, interval_seconds=300)  # Every 5 minutes
    
    try:
        heartbeat.start()
        
        # Long-running work
        process_large_dataset()  # Takes 2 hours
        
    finally:
        heartbeat.stop()

Monitoring Best Practices

1. Categorize Jobs by Criticality

Not all cron jobs are equally important:

class JobCriticality:
    """Categorize jobs by business impact"""
    
    CRITICAL = {
        'alert_immediately': True,
        'retry_on_failure': True,
        'escalate_after_minutes': 5,
        'notify_channels': ['pagerduty', 'sms', 'slack'],
        'max_acceptable_delay_minutes': 5
    }
    
    HIGH = {
        'alert_immediately': True,
        'retry_on_failure': True,
        'escalate_after_minutes': 30,
        'notify_channels': ['slack', 'email'],
        'max_acceptable_delay_minutes': 15
    }
    
    MEDIUM = {
        'alert_immediately': False,
        'retry_on_failure': True,
        'escalate_after_minutes': 120,
        'notify_channels': ['email'],
        'max_acceptable_delay_minutes': 60
    }
    
    LOW = {
        'alert_immediately': False,
        'retry_on_failure': False,
        'escalate_after_minutes': 1440,  # 24 hours
        'notify_channels': ['email'],
        'max_acceptable_delay_minutes': 240
    }


# Example categorization
JOBS = {
    'process-payments': JobCriticality.CRITICAL,
    'database-backup': JobCriticality.CRITICAL,
    'generate-daily-reports': JobCriticality.HIGH,
    'send-weekly-newsletter': JobCriticality.MEDIUM,
    'cleanup-temp-files': JobCriticality.LOW,
    'update-cache': JobCriticality.MEDIUM
}

2. Avoid Alert Fatigue

Don’t alert on everything:

import time

class SmartAlerting:
    """Intelligent alerting to prevent fatigue"""
    
    def __init__(self):
        self.failure_counts = {}
        self.last_alert_time = {}
    
    def should_alert(self, job_name, current_failure):
        """Determine if we should send an alert"""
        
        # Always alert on first failure
        if job_name not in self.failure_counts:
            self.failure_counts[job_name] = 1
            self.last_alert_time[job_name] = time.time()
            return True
        
        # Increment failure count
        self.failure_counts[job_name] += 1
        failures = self.failure_counts[job_name]
        
        # Alert on specific failure counts: 1, 3, 5, 10, 25, 50, 100
        alert_thresholds = [1, 3, 5, 10, 25, 50, 100]
        if failures in alert_thresholds:
            self.last_alert_time[job_name] = time.time()
            return True
        
        # Alert if it's been more than 24 hours since last alert
        time_since_last_alert = time.time() - self.last_alert_time.get(job_name, 0)
        if time_since_last_alert > 86400:  # 24 hours
            self.last_alert_time[job_name] = time.time()
            return True
        
        return False
    
    def reset_on_success(self, job_name):
        """Reset counters when job succeeds"""
        self.failure_counts.pop(job_name, None)
        self.last_alert_time.pop(job_name, None)
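
A short usage sketch showing how the alerter above could be driven from incoming pings (the notify callable is a placeholder for whatever channel you use):

alerter = SmartAlerting()

def handle_ping(job_name, status, notify):
    """Call on every success or fail ping for a job."""
    if status == 'success':
        alerter.reset_on_success(job_name)
    elif status == 'fail':
        if alerter.should_alert(job_name, current_failure=True):
            count = alerter.failure_counts.get(job_name, 1)
            notify(f"Cron job {job_name} is failing ({count} consecutive failures)")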

3. Document Expected Behavior

Create a job manifest:

# cron-jobs-manifest.yaml
jobs:
  database-backup:
    schedule: "0 2 * * *"
    expected_duration_minutes: 30
    grace_period_minutes: 15
    timeout_minutes: 120
    criticality: critical
    owner: platform-team
    description: "Daily PostgreSQL backup to S3"
    dependencies: []
    validates: "backup file size > 100MB"
    
  process-orders:
    schedule: "*/15 * * * *"
    expected_duration_minutes: 5
    grace_period_minutes: 3
    timeout_minutes: 10
    criticality: critical
    owner: payments-team
    description: "Process pending payment orders"
    dependencies: []
    validates: "at least 1 order processed in last hour"
    
  generate-reports:
    schedule: "0 6 * * 1"
    expected_duration_minutes: 10
    grace_period_minutes: 20
    timeout_minutes: 60
    criticality: high
    owner: analytics-team
    description: "Weekly sales reports"
    dependencies: ["database-backup"]
    validates: "report sent to [email protected]"

4. Test Your Monitoring

Regularly test that monitoring actually works:

#!/bin/bash
# test-monitoring.sh - Test that cron monitoring catches failures

SEIRI_PING_URL="https://cloud.seiri.app/ping/test-job"

echo "Testing cron monitoring..."

# Test 1: Successful job
echo "Test 1: Success case"
curl -fsS "${SEIRI_PING_URL}/start"
sleep 2
curl -fsS "${SEIRI_PING_URL}/success"
echo "✓ Success ping sent"

sleep 5

# Test 2: Failed job
echo "Test 2: Failure case"
curl -fsS "${SEIRI_PING_URL}/start"
sleep 2
curl -fsS "${SEIRI_PING_URL}/fail"
echo "✓ Failure ping sent"

sleep 5

# Test 3: Job that never completes (timeout)
echo "Test 3: Timeout case"
curl -fsS "${SEIRI_PING_URL}/start"
# Never send completion ping
echo "✓ Start ping sent, no completion (should timeout)"

echo ""
echo "Check your Seiri dashboard to verify:"
echo "1. Test 1 shows as success"
echo "2. Test 2 shows as failure"
echo "3. Test 3 shows as timeout/missing after grace period"

Troubleshooting Common Issues

Issue 1: Cron Job Not Running

Symptoms:

  • No pings received
  • No entries in cron logs
  • Job never executes

Diagnosis:

#!/bin/bash
# diagnose-cron.sh

echo "=== Cron Daemon Status ==="
if pgrep -x cron > /dev/null || pgrep -x crond > /dev/null; then
    echo "✓ Cron daemon is running"
else
    echo "✗ Cron daemon NOT running"
fi

echo ""
echo "=== Crontab for current user ==="
crontab -l

echo ""
echo "=== Recent cron activity ==="
if [ -f /var/log/cron ]; then
    tail -20 /var/log/cron
elif [ -f /var/log/syslog ]; then
    grep CRON /var/log/syslog | tail -20
fi

echo ""
echo "=== Check environment ==="
env | sort

echo ""
echo "=== Test cron job manually ==="
echo "Run your cron command manually to check for errors:"
echo "/path/to/your/script.sh"

Common fixes:

# Fix 1: Cron daemon not running
sudo systemctl start cron  # or crond

# Fix 2: Syntax error in crontab
crontab -e  # Check for errors

# Fix 3: Script permissions
chmod +x /path/to/script.sh

# Fix 4: Missing PATH
# Add to crontab:
PATH=/usr/local/bin:/usr/bin:/bin

# Fix 5: User deleted/disabled
# Check if user exists:
id username

Issue 2: Job Runs But Fails Silently

Symptoms:

  • Job appears in logs
  • No heartbeat pings received
  • Exit code 0 but work not done

Diagnosis:

#!/bin/bash
# debug-cron-job.sh - Enhanced logging for troubleshooting

# Redirect all output to log file
exec 1>/var/log/cron-jobs/$(basename $0)-$(date +%Y%m%d-%H%M%S).log
exec 2>&1

# Enable error exit
set -e

# Log environment
echo "=== Environment ==="
env | sort
echo ""

echo "=== Working Directory ==="
pwd
echo ""

echo "=== Start Time ==="
date
echo ""

# Your actual job (capture its exit code so a failure doesn't abort logging under set -e)
echo "=== Job Execution ==="
JOB_EXIT=0
/path/to/actual/script.sh || JOB_EXIT=$?
echo ""

echo "=== End Time ==="
date
echo ""

echo "=== Exit Code: $JOB_EXIT ==="
exit $JOB_EXIT

Issue 3: Monitoring Calls Failing

Symptoms:

  • Job executes successfully
  • No pings received by Seiri
  • Network/DNS errors in logs

Diagnosis:

#!/bin/bash
# test-seiri-connectivity.sh

SEIRI_PING_URL="https://cloud.seiri.app/ping/your-id"

echo "Testing connectivity to Seiri..."

# Test DNS resolution
echo "1. DNS Resolution:"
host cloud.seiri.app

# Test HTTPS connectivity
echo "2. HTTPS Connectivity:"
curl -v "${SEIRI_PING_URL}/test" 2>&1 | head -20

# Test from cron environment
echo "3. Test from minimal environment (simulating cron):"
env -i PATH=/usr/bin:/bin curl -v "${SEIRI_PING_URL}/test" 2>&1 | head -20

# Check for proxy settings
echo "4. Proxy Configuration:"
env | grep -i proxy

Common fixes:

# Fix 1: DNS resolution (needs root; resolvconf/NetworkManager may overwrite this)
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

# Fix 2: SSL certificate issues
curl -k "${SEIRI_PING_URL}/test"  # Warning: Only for testing!

# Fix 3: Proxy configuration
export http_proxy=http://proxy.company.com:8080
export https_proxy=http://proxy.company.com:8080

# Fix 4: Timeout issues
curl -m 30 "${SEIRI_PING_URL}/test"  # Increase timeout
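
If your servers sit behind a corporate proxy, remember that exporting proxy variables in an interactive shell (Fix 3 above) doesn't carry over into cron. Set them at the top of the crontab instead so every job inherits them; the proxy host and port are placeholders:

# Crontab-level proxy settings so monitoring pings can reach Seiri
http_proxy=http://proxy.company.com:8080
https_proxy=http://proxy.company.com:8080
no_proxy=localhost,127.0.0.1

# ...existing job entries below...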

Issue 4: Job Times Out

Symptoms:

  • Job starts but never completes
  • Process hangs indefinitely
  • Multiple instances accumulate

Solution:

#!/bin/bash
# timeout-wrapper.sh - Enforce timeout on jobs
# Usage: timeout-wrapper.sh <command> [args...]

TIMEOUT_SECONDS=3600  # 1 hour

# Use the timeout command; pass the job command and its arguments through
# as "$@" so arguments containing spaces survive intact
if timeout "${TIMEOUT_SECONDS}" "$@"; then
    echo "Job completed successfully"
    exit 0
else
    EXIT_CODE=$?
    if [ $EXIT_CODE -eq 124 ]; then
        echo "Job timed out after ${TIMEOUT_SECONDS} seconds" >&2
        exit 124
    else
        echo "Job failed with exit code $EXIT_CODE" >&2
        exit $EXIT_CODE
    fi
fi
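
In the crontab, put the wrapper in front of the job command (the wrapper path is assumed):

# Enforce a hard 1-hour limit on the nightly backup
0 2 * * * /usr/local/bin/timeout-wrapper.sh /usr/local/bin/backup-database.sh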

Prevent multiple instances:

#!/bin/bash
# single-instance.sh - Prevent concurrent execution

LOCKFILE="/var/lock/$(basename "$0").lock"
SEIRI_PING_URL="https://cloud.seiri.app/ping/your-id"

# Try to acquire lock
exec 200>"$LOCKFILE"
if ! flock -n 200; then
    echo "Another instance is already running" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi

# Cleanup on exit
trap 'rm -f "$LOCKFILE"' EXIT

# Send start ping
curl -fsS "${SEIRI_PING_URL}/start"

# Run the actual job
if /path/to/actual/job.sh; then
    curl -fsS "${SEIRI_PING_URL}/success"
else
    curl -fsS "${SEIRI_PING_URL}/fail"
fi

Getting Started with Seiri

Quick Setup (5 Minutes)

Step 1: Sign up for Seiri

Visit https://cloud.seiri.app and create your free account.

Step 2: Create your first cron monitor

  1. Navigate to “Cron Jobs” in your dashboard

  2. Click “Create New Monitor”

  3. Configure your job:

    • Name: “Database Backup”
    • Schedule: Every day at 2 AM
    • Grace Period: 15 minutes
    • Timeout: 2 hours

  4. Copy your unique ping URL

Step 3: Add monitoring to your cron job

# Your existing cron job
0 2 * * * /usr/local/bin/backup-database.sh

# Enhanced with Seiri (the start ping ends with ';' so a monitoring
# outage never blocks the backup itself)
0 2 * * * curl -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/start; /usr/local/bin/backup-database.sh && curl -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/success || curl -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/fail

Step 4: Configure alerts

In your Seiri dashboard:

  • Add Slack webhook for instant notifications
  • Add email for backup alerts
  • Set up SMS for critical jobs (optional)
  • Configure PagerDuty integration (optional)

Step 5: Test it

Run your cron job manually:

/usr/local/bin/backup-database.sh

Check your Seiri dashboard—you should see:

  • Start time
  • Completion status
  • Duration
  • Any error messages

Production Best Practices

For production deployments:

#!/bin/bash
# production-cron-wrapper.sh - Production-ready cron wrapper

set -euo pipefail

# Configuration (default to empty strings so `set -u` doesn't abort before validation)
SEIRI_PING_URL="${SEIRI_PING_URL:-}"
JOB_NAME="${1:-}"
shift || true
JOB_COMMAND=("$@")

# Validation
if [ -z "$SEIRI_PING_URL" ] || [ -z "$JOB_NAME" ] || [ ${#JOB_COMMAND[@]} -eq 0 ]; then
    echo "Error: Missing required configuration" >&2
    exit 1
fi

# Create log directory
LOG_DIR="/var/log/cron-jobs"
mkdir -p "$LOG_DIR"

# Log file with timestamp
LOG_FILE="$LOG_DIR/${JOB_NAME}-$(date +%Y%m%d-%H%M%S).log"

# Redirect output
exec 1>>"$LOG_FILE"
exec 2>&1

# Log environment for debugging
echo "=== Job: $JOB_NAME ==="
echo "Start: $(date)"
echo "Command: $JOB_COMMAND"
echo "User: $(whoami)"
echo "Host: $(hostname)"
echo ""

# Function to send pings with retry
send_ping() {
    local status="$1"
    local max_attempts=5
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        if curl -fsS -m 10 "${SEIRI_PING_URL}/${status}" 2>>/var/log/seiri-errors.log; then
            return 0
        fi
        echo "Ping attempt $attempt/$max_attempts failed" >&2
        sleep $((2 ** attempt))
        attempt=$((attempt + 1))
    done
    
    echo "ERROR: All ping attempts failed for $status" >&2
    return 1
}

# Send start ping (non-fatal: a monitoring outage must not block the job itself)
send_ping "start" || true

# Execute job
START_TIME=$(date +%s)
EXIT_CODE=0

if "${JOB_COMMAND[@]}"; then
    EXIT_CODE=0
    STATUS="success"
else
    EXIT_CODE=$?
    STATUS="fail"
fi

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Log completion
echo ""
echo "End: $(date)"
echo "Duration: ${DURATION}s"
echo "Exit Code: $EXIT_CODE"
echo "Status: $STATUS"

# Send completion ping (also non-fatal if the monitoring endpoint is unreachable)
send_ping "$STATUS" || true

# Send detailed metrics
curl -fsS -X POST "${SEIRI_PING_URL}" \
    -H "Content-Type: application/json" \
    -d "{
        \"job_name\": \"${JOB_NAME}\",
        \"exit_code\": ${EXIT_CODE},
        \"duration_seconds\": ${DURATION},
        \"hostname\": \"$(hostname)\",
        \"log_file\": \"${LOG_FILE}\"
    }" 2>>/var/log/seiri-errors.log || true

# Cleanup old logs (keep last 30 days)
find "$LOG_DIR" -name "${JOB_NAME}-*.log" -mtime +30 -delete

exit $EXIT_CODE

Production crontab:

# Set Seiri URL
SEIRI_PING_URL=https://cloud.seiri.app/ping/your-production-id

# Set PATH
PATH=/usr/local/bin:/usr/bin:/bin

# Critical jobs with monitoring
0 2 * * * /usr/local/bin/production-cron-wrapper.sh "database-backup" /usr/local/bin/backup-database.sh
0 6 * * * /usr/local/bin/production-cron-wrapper.sh "generate-reports" /usr/local/bin/generate-reports.sh
*/15 * * * * /usr/local/bin/production-cron-wrapper.sh "process-payments" /app/bin/process-payments
0 3 * * * /usr/local/bin/production-cron-wrapper.sh "sync-data" /usr/local/bin/sync-data.sh

Conclusion

Cron job monitoring is not optional—it’s essential infrastructure for any production system. Silent failures in scheduled tasks cost companies millions in lost data, missed SLA obligations, and operational overhead.

Key takeaways:

  • Use heartbeat/dead man’s switch pattern for reliable monitoring
  • Monitor execution time to catch performance degradation early
  • Validate output, not just exit codes
  • Categorize jobs by criticality and alert appropriately
  • Prevent alert fatigue with smart alerting logic
  • Test your monitoring regularly
  • Document expected behavior for all jobs

The cost of not monitoring:

  • Lost backups discovered during disasters
  • Silent payment processing failures
  • Data pipelines breaking for weeks
  • Compliance violations
  • Customer-facing features degrading

The cost of monitoring:

  • 5 minutes to set up
  • One curl command per job
  • Peace of mind that failures are caught immediately

Ready to stop worrying about silent cron failures?

Seiri provides intelligent cron job monitoring with heartbeat detection, smart alerting, and detailed execution tracking. Monitor unlimited cron jobs, get instant alerts when jobs fail, and sleep better knowing your scheduled tasks are watched 24/7.

Start monitoring your cron jobs for free


Have questions about monitoring complex cron job scenarios? Contact our team - we love helping developers build more reliable infrastructure.

Published by Seiri Team