Cron Job Monitoring: The Complete Developer's Guide
Master cron job monitoring with practical examples, code samples, and best practices. Learn to detect silent failures, timeouts, and missed executions before they cause data loss.
A critical database backup fails silently at 3 AM. Your team discovers it three weeks later when a server crashes and you need those backups. This nightmare scenario happens more often than you think.
Cron jobs are the invisible backbone of modern infrastructure—running backups, processing data, sending reports, cleaning logs, and keeping systems healthy. But here’s the problem: when cron jobs fail, they fail silently. No error messages, no alerts, just quiet failure that can go unnoticed for days, weeks, or months.
In this comprehensive guide, you’ll learn how to implement professional cron job monitoring that catches failures before they become disasters, complete with practical code examples and battle-tested strategies.
What is Cron Job Monitoring?
Cron job monitoring is the practice of actively tracking your scheduled tasks to ensure they:
- Execute on schedule (not skipped or delayed)
- Complete successfully (exit code 0)
- Finish within expected timeframes (no timeouts)
- Produce expected results (validation beyond exit codes)
- Run on the correct systems (especially in clustered environments)
The Silent Killer: Why Cron Jobs Fail Without Warning
Unlike web applications that fail loudly with 500 errors and user complaints, cron jobs fail in silence. Consider these real-world scenarios:
Scenario 1: The Disappeared Backup
# This backup job runs every night at 2 AM
0 2 * * * /usr/local/bin/backup-database.sh
# What the job does:
# 1. Dumps the database
# 2. Compresses the dump
# 3. Uploads to S3
# 4. Cleans up local files
# What goes wrong:
# - AWS credentials expire → Upload fails, exit code 0 (script continues)
# - Disk is full → Dump fails, script exits silently
# - Network issues → Upload times out, no retry logic
# - Script has a bug → Stops at step 2, returns 0
# Result: No backups for weeks, discovered only during disaster recovery
Scenario 2: The Payment Processor
# Process pending payments every 15 minutes
*/15 * * * * /app/bin/process-payments
# What goes wrong:
# - Database connection pool exhausted → Job hangs indefinitely
# - Payment gateway API changes → All transactions fail
# - Server memory issue → Process killed by OOM, no logging
# - Timezone bug → Job runs at wrong time, misses payment windows
# Result: Thousands of failed payments, angry customers, revenue loss
Scenario 3: The Report Generator
# Generate daily sales report at 6 AM
0 6 * * * /usr/local/bin/generate-sales-report.sh
# What goes wrong:
# - Report server migrated, crontab not updated → Never runs
# - Dependencies updated → Python script breaks with import error
# - Email server down → Report generated but never sent
# - Report takes 4 hours instead of 30 minutes → Still running when next job starts
# Result: Executives don't get reports, business decisions delayed
Traditional Monitoring Fails for Cron Jobs
Why your existing monitoring doesn’t catch cron failures:
Server monitoring only shows:
- CPU usage
- Memory consumption
- Disk space
- Process counts
But it doesn’t tell you:
- If a specific cron job ran
- If it succeeded or failed
- How long it took
- What errors occurred
Log monitoring falls short:
# Cron logs look like this:
Nov 15 02:00:01 server CRON[12345]: (root) CMD (/usr/local/bin/backup.sh)
# That's it. No indication of:
# - Did the script succeed?
# - How long did it take?
# - What was the output?
# - Were there errors?
The Cost of Silent Cron Failures
Let’s look at the real-world impact:
Data Loss Scenarios
Financial Impact:
- Lost backups discovered during outage: Critical data unrecoverable
- ETL pipeline failing silently: Business analytics using stale data
- Log rotation not running: Disk fills, brings down production
Time Impact:
- Average time to discover failed cron job: 2-3 weeks
- Time to diagnose root cause without logs: 4-8 hours
- Time to rebuild lost data: Days to weeks
Compliance and Security Risks
# This security audit log cleanup should run daily
0 3 * * * /usr/local/bin/rotate-audit-logs.sh
# When it fails:
# - Compliance violations (logs not retained properly)
# - Storage costs spike (logs never archived)
# - Security events lost (old logs overwritten)
# - Audit failures during inspections
Types of Cron Job Failures (And How to Detect Them)
1. Never Started (Execution Failure)
The cron job never runs at all.
Common Causes:
- Cron daemon not running
- Syntax errors in crontab
- Incorrect file permissions
- Server timezone issues
- User account disabled
Example:
# You think this runs every day at midnight:
0 0 * * * /home/user/backup.sh
# But it never runs because:
# - backup.sh doesn't have execute permissions
# - User 'user' was deleted
# - Cron daemon crashed and wasn't restarted
How to detect: Monitor for the absence of expected heartbeats. If a job should run every hour but you haven’t received a ping in 90 minutes, something’s wrong.
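The "hourly job, no ping in 90 minutes" rule can be written as a small check. This is a minimal sketch, not how any particular monitoring service works: `is_overdue` and its parameters are hypothetical names, and the 30-minute grace period is an assumption chosen to match the 90-minute figure in the text.

```python
from datetime import datetime, timedelta


def is_overdue(last_ping: datetime, interval: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """Return True when more than interval + grace has passed since the last ping."""
    return now - last_ping > interval + grace


# Hourly job with a 30-minute grace period (assumed values)
last_ping = datetime(2024, 11, 15, 2, 0)
hourly = timedelta(minutes=60)
grace = timedelta(minutes=30)

print(is_overdue(last_ping, hourly, grace, datetime(2024, 11, 15, 3, 20)))  # False (80 minutes)
print(is_overdue(last_ping, hourly, grace, datetime(2024, 11, 15, 3, 45)))  # True (105 minutes)
```

The grace period matters: without it, a job that normally finishes a few minutes late would trigger an alert on every run.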
2. Failed Execution (Non-Zero Exit Code)
The job runs but exits with an error.
Example:
#!/bin/bash
# backup-database.sh
# Compute the file name once so the dump and the upload use the same path
BACKUP_FILE="/backups/db-$(date +%Y%m%d).sql"
# Exit with code 1 if pg_dump fails
if ! pg_dump mydb > "$BACKUP_FILE"; then
    # Even with this check, cron won't alert you
    echo "Backup failed!" >&2
    exit 1
fi
# Upload to S3
aws s3 cp "$BACKUP_FILE" s3://my-bucket/
Without monitoring, this silent failure continues indefinitely.
3. Timeout/Hung Process
The job starts but never completes.
Example:
#!/usr/bin/env python3
# data-sync.py
import time
import psycopg2
def sync_data():
    # This query locks and never returns
    conn = psycopg2.connect("dbname=mydb")
    cursor = conn.cursor()
    # Query has a table lock, waits forever
    cursor.execute("""
        SELECT * FROM large_table
        WHERE status = 'pending'
        FOR UPDATE
    """)
    # Never reaches here
    process_rows(cursor.fetchall())
if __name__ == "__main__":
    sync_data()
Without timeout monitoring:
- Process runs indefinitely
- Next execution starts while previous still running
- Eventually exhausts system resources
- Multiple hung processes accumulate
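One way to keep a hung job from running forever is to launch it with a hard time limit. The sketch below uses only the standard library; `run_with_timeout` is a hypothetical helper name, and the one-second limit is just for the demonstration. `subprocess.run` kills the direct child process when the timeout expires, but not any grandchildren the job may have spawned.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command and return (status, exit_code).

    status is "success", "fail", or "timeout"; exit_code is None on timeout.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point
        return "timeout", None
    status = "success" if completed.returncode == 0 else "fail"
    return status, completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung job: sleeps far longer than the limit
    hung_job = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung_job, timeout_seconds=1))
```

The returned status maps directly onto the start/success/fail pings used later in this guide, with a timeout reported as a failure.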
4. Silent Logic Failure
The job completes “successfully” but produces wrong results.
Example:
#!/bin/bash
# process-orders.sh
# Processes new orders from database
ORDER_COUNT=$(psql -t -c "SELECT COUNT(*) FROM orders WHERE status='new'")
# Bug: Variable is empty due to connection error
# But script continues anyway
for order_id in $(psql -t -c "SELECT id FROM orders WHERE status='new'"); do
    process_order "$order_id"
done
# Returns 0 even though no orders were processed!
exit 0
Result: Exit code 0, cron thinks it succeeded, but zero orders processed.
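The fix for this class of failure is to treat "no output" as an error rather than as "zero rows". A minimal sketch, assuming the job can compare how many orders it expected with how many it processed; `parse_count` and `check_processed` are hypothetical helpers, not part of any library.

```python
def parse_count(raw_output: str) -> int:
    """Parse the output of a COUNT(*) query.

    Empty output raises instead of silently becoming zero, because an
    empty string usually means the query never ran.
    """
    text = raw_output.strip()
    if not text:
        raise ValueError("query returned no output; the connection may have failed")
    return int(text)


def check_processed(expected: int, processed: int) -> None:
    """Raise when fewer orders were processed than the query reported."""
    if processed < expected:
        raise RuntimeError(f"processed {processed} of {expected} orders")


print(parse_count(" 42\n"))  # 42
check_processed(expected=42, processed=42)  # passes silently
```

Raising an exception makes the job exit non-zero, which is what lets a monitoring wrapper report the failure.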
5. Resource Exhaustion
The job fails due to system constraints.
Example:
#!/bin/bash
# generate-reports.sh
# Generates CSV reports for all customers
for customer_id in $(get_customer_ids); do
    # Each report is 500MB, stored in /tmp
    generate_report "$customer_id" > "/tmp/report-${customer_id}.csv"
done
# Problem: /tmp fills up after 10 reports
# Remaining 990 customers get no reports
# Script exits 0 because loop completed
Implementing Professional Cron Job Monitoring
The Heartbeat Monitoring Pattern
The most reliable cron monitoring uses a “dead man’s switch” or heartbeat pattern:
- Your cron job pings a monitoring service when it starts
- It pings again when it completes successfully
- The monitoring service expects pings at regular intervals
- If pings stop coming, you get alerted
Why this works:
- No polling required
- Works across firewalls/NAT
- Minimal performance impact
- Detects missed runs, failed runs, and hangs (logic errors still need the output validation covered later in this guide)
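To make the pattern concrete, here is a rough sketch of what the receiving side of a dead man's switch might do with the pings. It is an illustration under stated assumptions, not Seiri's implementation: `JobState`, `record_ping`, and `evaluate` are hypothetical names, timestamps are plain seconds, and every job is assumed to send a start ping before its success or fail ping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobState:
    interval_seconds: float       # how often the job is scheduled
    grace_seconds: float          # slack before a missing run counts as missed
    max_runtime_seconds: float    # longest acceptable run
    last_start: Optional[float] = None
    last_finish: Optional[float] = None
    last_status: Optional[str] = None  # "success" or "fail"


def record_ping(state: JobState, kind: str, now: float) -> None:
    """Store a start, success, or fail ping."""
    if kind == "start":
        state.last_start = now
    elif kind in ("success", "fail"):
        state.last_finish = now
        state.last_status = kind
    else:
        raise ValueError(f"unknown ping kind: {kind}")


def evaluate(state: JobState, now: float) -> str:
    """Return "no-data", "ok", "hung", "failed", or "missed"."""
    if state.last_start is None:
        return "no-data"
    # A start ping newer than the last finish means a run is in progress
    running = state.last_finish is None or state.last_finish < state.last_start
    if running:
        too_long = now - state.last_start > state.max_runtime_seconds
        return "hung" if too_long else "ok"
    if state.last_status == "fail":
        return "failed"
    if now - state.last_start > state.interval_seconds + state.grace_seconds:
        return "missed"
    return "ok"


state = JobState(interval_seconds=3600, grace_seconds=1800, max_runtime_seconds=900)
record_ping(state, "start", 0)
record_ping(state, "success", 120)
print(evaluate(state, 1000))   # ok
print(evaluate(state, 6000))   # missed
record_ping(state, "start", 7200)
print(evaluate(state, 8200))   # hung
record_ping(state, "fail", 8300)
print(evaluate(state, 8400))   # failed
```

Note that the job never has to be reachable from the monitor: all state comes from pings the job sends out, which is why the pattern works across firewalls and NAT.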
Basic Heartbeat Monitoring with Seiri
Seiri provides simple HTTP endpoints for cron monitoring using the heartbeat pattern.
Quick Start Example
# Your existing cron job
0 2 * * * /usr/local/bin/backup-database.sh
# Enhanced with Seiri monitoring
0 2 * * * curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/start; if /usr/local/bin/backup-database.sh; then curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/success; else curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/fail; fi
Breaking down the monitoring:
# Before job starts - "I'm about to run"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/start
# Run the actual job
/usr/local/bin/backup-database.sh
# After success - "I completed successfully"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/success
# On failure - "I failed"
curl -fsS -m 10 --retry 5 https://cloud.seiri.app/webhook/<cid>:<unique-id>/fail
Curl flags explained:
- -f: Return a non-zero exit code on HTTP errors (4xx/5xx) instead of printing the error page
- -s: Silent mode (no progress bar)
- -S: Show errors even in silent mode
- -m 10: Maximum 10 seconds for the operation
- --retry 5: Retry up to 5 times on transient errors
Bash Script Wrapper Pattern
For better reliability, wrap your cron jobs in a monitoring script:
#!/bin/bash
# cron-wrapper.sh - Universal cron job wrapper with monitoring
set -euo pipefail
# Configuration
SEIRI_PING_URL="${SEIRI_PING_URL:-}"
if [ $# -lt 2 ]; then
    echo "Usage: $0 <job-name> <command> [args...]" >&2
    exit 1
fi
JOB_NAME="$1"
shift
# Keep the command as an array so arguments containing spaces survive
JOB_COMMAND=("$@")
# Validate configuration
if [ -z "$SEIRI_PING_URL" ]; then
    echo "Error: SEIRI_PING_URL not set" >&2
    exit 1
fi
# Function to send ping with retry logic
send_ping() {
    local status="$1"
    local max_attempts=3
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if curl -fsS -m 10 "${SEIRI_PING_URL}/${status}" 2>/dev/null; then
            return 0
        fi
        echo "Ping attempt $attempt failed, retrying..." >&2
        sleep 2
        attempt=$((attempt + 1))
    done
    echo "Warning: Failed to send $status ping after $max_attempts attempts" >&2
    return 1
}
# Send start ping ("|| true" stops set -e from aborting the job when a ping fails)
send_ping "start" || true
# Record start time
START_TIME=$(date +%s)
# Execute the job and capture output
JOB_OUTPUT=$(mktemp)
JOB_EXIT_CODE=0
if "${JOB_COMMAND[@]}" > "$JOB_OUTPUT" 2>&1; then
    PING_STATUS="success"
else
    JOB_EXIT_CODE=$?
    PING_STATUS="fail"
fi
# Calculate duration
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
# Send completion ping
send_ping "${PING_STATUS}" || true
# Optional: Send detailed metrics
curl -fsS -m 10 -X POST "${SEIRI_PING_URL}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF 2>/dev/null || true
{
"job_name": "${JOB_NAME}",
"exit_code": ${JOB_EXIT_CODE},
"duration_seconds": ${DURATION},
"output_lines": $(wc -l < "$JOB_OUTPUT"),
"timestamp": $(date +%s)
}
EOF
# If job failed, output the logs
if [ "$JOB_EXIT_CODE" -ne 0 ]; then
    echo "Job ${JOB_NAME} failed with exit code ${JOB_EXIT_CODE}" >&2
    echo "Job output:" >&2
    cat "$JOB_OUTPUT" >&2
fi
# Cleanup
rm -f "$JOB_OUTPUT"
exit "$JOB_EXIT_CODE"
Usage in crontab:
# Set the Seiri ping URL
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>
# Use the wrapper for all jobs
0 2 * * * /usr/local/bin/cron-wrapper.sh "database-backup" /usr/local/bin/backup-database.sh
0 6 * * * /usr/local/bin/cron-wrapper.sh "sales-report" /usr/local/bin/generate-sales-report.sh
*/15 * * * * /usr/local/bin/cron-wrapper.sh "process-payments" /app/bin/process-payments
Python Implementation
For Python-based cron jobs:
#!/usr/bin/env python3
"""
cron_jobs.py - Python decorator for cron job monitoring
"""
import os
import sys
import time
import requests
import functools
import traceback
from typing import Callable, Any
class CronMonitor:
    """Monitor cron jobs with Seiri heartbeat pings"""
    def __init__(self, ping_url: str, job_name: str):
        self.ping_url = ping_url
        self.job_name = job_name
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': 'Seiri-Cron-Monitor/1.0'})
    def send_ping(self, status: str, retry: int = 3) -> bool:
        """Send ping with retry logic"""
        url = f"{self.ping_url}/{status}"
        for attempt in range(retry):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return True
            except requests.RequestException as e:
                if attempt == retry - 1:
                    print(f"Failed to send {status} ping: {e}", file=sys.stderr)
                    return False
                time.sleep(2 ** attempt)  # Exponential backoff
        return False
    def send_metrics(self, metrics: dict) -> bool:
        """Send detailed metrics via POST"""
        try:
            response = self.session.post(
                self.ping_url,
                json=metrics,
                timeout=10
            )
            response.raise_for_status()
            return True
        except requests.RequestException as e:
            print(f"Failed to send metrics: {e}", file=sys.stderr)
            return False
    def __call__(self, func: Callable) -> Callable:
        """Decorator to wrap cron job functions"""
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            # Send start ping
            self.send_ping('start')
            start_time = time.time()
            # Assume failure until the job returns, so the finally block
            # reports correctly even on SystemExit or KeyboardInterrupt
            status = 'fail'
            exit_code = 1
            error_message = None
            result = None
            try:
                # Execute the job
                result = func(*args, **kwargs)
                status = 'success'
                exit_code = 0
            except Exception as e:
                # Job failed with exception
                error_message = str(e)
                # Log the full traceback
                print(f"Job {self.job_name} failed:", file=sys.stderr)
                traceback.print_exc()
                raise  # Re-raise the exception
            finally:
                # Calculate duration
                duration = time.time() - start_time
                # Send completion ping
                self.send_ping(status)
                # Send detailed metrics
                metrics = {
                    'job_name': self.job_name,
                    'status': status,
                    'exit_code': exit_code,
                    'duration_seconds': round(duration, 2),
                    'timestamp': int(time.time())
                }
                if error_message:
                    metrics['error_message'] = error_message
                self.send_metrics(metrics)
            return result
        return wrapper
# Example usage
SEIRI_PING_URL = os.getenv('SEIRI_PING_URL')
if not SEIRI_PING_URL:
    raise ValueError("SEIRI_PING_URL environment variable not set")
@CronMonitor(SEIRI_PING_URL, 'database-backup')
def backup_database():
    """Daily database backup job"""
    import subprocess
    print("Starting database backup...")
    # Run pg_dump
    result = subprocess.run(
        ['pg_dump', '-h', 'localhost', '-U', 'postgres', 'mydb'],
        capture_output=True,
        check=True
    )
    # Save to file
    backup_file = f"/backups/db-{time.strftime('%Y%m%d')}.sql"
    with open(backup_file, 'wb') as f:
        f.write(result.stdout)
    print(f"Backup saved to {backup_file}")
    # Upload to S3
    subprocess.run(
        ['aws', 's3', 'cp', backup_file, 's3://my-bucket/backups/'],
        check=True
    )
    print("Backup uploaded to S3")
    return backup_file
@CronMonitor(SEIRI_PING_URL, 'process-orders')
def process_pending_orders():
    """Process pending orders from database"""
    import psycopg2
    conn = psycopg2.connect(
        host='localhost',
        database='mydb',
        user='postgres'
    )
    cursor = conn.cursor()
    # Get pending orders
    cursor.execute("SELECT id FROM orders WHERE status = 'pending' LIMIT 100")
    order_ids = [row[0] for row in cursor.fetchall()]
    if not order_ids:
        print("No pending orders to process")
        return 0
    processed_count = 0
    failed_count = 0
    for order_id in order_ids:
        try:
            # Process each order
            cursor.execute("""
                UPDATE orders
                SET status = 'processed', processed_at = NOW()
                WHERE id = %s
            """, (order_id,))
            conn.commit()
            processed_count += 1
        except Exception as e:
            print(f"Failed to process order {order_id}: {e}", file=sys.stderr)
            conn.rollback()
            failed_count += 1
    cursor.close()
    conn.close()
    print(f"Processed {processed_count} orders, {failed_count} failed")
    # Raise exception if more than 10% of the batch failed
    if failed_count > len(order_ids) * 0.1:
        raise Exception(f"Too many order processing failures: {failed_count}/{len(order_ids)}")
    return processed_count
if __name__ == '__main__':
    # This script can be called directly from cron
    if len(sys.argv) < 2:
        print("Usage: cron_jobs.py <job_name>", file=sys.stderr)
        sys.exit(1)
    job_name = sys.argv[1]
    if job_name == 'backup':
        backup_database()
    elif job_name == 'process-orders':
        process_pending_orders()
    else:
        print(f"Unknown job: {job_name}", file=sys.stderr)
        sys.exit(1)
Crontab configuration:
# Export Seiri ping URL
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>
# Run different jobs
0 2 * * * /usr/local/bin/cron_jobs.py backup
*/15 * * * * /usr/local/bin/cron_jobs.py process-orders
Node.js Implementation
#!/usr/bin/env node
/**
* cron-monitor.js - Node.js cron job monitoring
*/
const axios = require('axios');
const { performance } = require('perf_hooks');
class CronMonitor {
constructor(pingUrl, jobName) {
this.pingUrl = pingUrl;
this.jobName = jobName;
this.client = axios.create({
timeout: 10000,
headers: { 'User-Agent': 'Seiri-Cron-Monitor/1.0' }
});
}
async sendPing(status, retries = 3) {
const url = `${this.pingUrl}/${status}`;
for (let attempt = 0; attempt < retries; attempt++) {
try {
await this.client.get(url);
return true;
} catch (error) {
if (attempt === retries - 1) {
console.error(`Failed to send ${status} ping:`, error.message);
return false;
}
await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
}
}
return false;
}
async sendMetrics(metrics) {
try {
await this.client.post(this.pingUrl, metrics);
return true;
} catch (error) {
console.error('Failed to send metrics:', error.message);
return false;
}
}
async wrap(jobFunction) {
// Send start ping
await this.sendPing('start');
const startTime = performance.now();
let status = 'success';
let exitCode = 0;
let errorMessage = null;
let result = null;
try {
// Execute the job
result = await jobFunction();
} catch (error) {
// Job failed
status = 'fail';
exitCode = 1;
errorMessage = error.message;
console.error(`Job ${this.jobName} failed:`, error);
throw error;
} finally {
// Calculate duration
const duration = (performance.now() - startTime) / 1000;
// Send completion ping
await this.sendPing(status);
// Send detailed metrics
const metrics = {
job_name: this.jobName,
status,
exit_code: exitCode,
duration_seconds: Math.round(duration * 100) / 100,
timestamp: Math.floor(Date.now() / 1000)
};
if (errorMessage) {
metrics.error_message = errorMessage;
}
await this.sendMetrics(metrics);
}
return result;
}
}
// Example usage
const SEIRI_PING_URL = process.env.SEIRI_PING_URL;
if (!SEIRI_PING_URL) {
console.error('Error: SEIRI_PING_URL environment variable not set');
process.exit(1);
}
// Database backup job
async function backupDatabase() {
const { exec } = require('child_process');
const { promisify } = require('util');
const execAsync = promisify(exec);
console.log('Starting database backup...');
const date = new Date().toISOString().split('T')[0].replace(/-/g, '');
const backupFile = `/backups/db-${date}.sql`;
// Run pg_dump
await execAsync(`pg_dump -h localhost -U postgres mydb > ${backupFile}`);
console.log(`Backup saved to ${backupFile}`);
// Upload to S3
await execAsync(`aws s3 cp ${backupFile} s3://my-bucket/backups/`);
console.log('Backup uploaded to S3');
return backupFile;
}
// Process orders job
async function processOrders() {
const { Client } = require('pg');
const client = new Client({
host: 'localhost',
database: 'mydb',
user: 'postgres'
});
await client.connect();
try {
// Get pending orders
const result = await client.query(
"SELECT id FROM orders WHERE status = 'pending' LIMIT 100"
);
if (result.rows.length === 0) {
console.log('No pending orders to process');
return 0;
}
let processedCount = 0;
let failedCount = 0;
for (const row of result.rows) {
try {
await client.query(
"UPDATE orders SET status = 'processed', processed_at = NOW() WHERE id = $1",
[row.id]
);
processedCount++;
} catch (error) {
console.error(`Failed to process order ${row.id}:`, error.message);
failedCount++;
}
}
console.log(`Processed ${processedCount} orders, ${failedCount} failed`);
// Fail if too many errors
if (failedCount > result.rows.length * 0.1) {
throw new Error(`Too many failures: ${failedCount}/${result.rows.length}`);
}
return processedCount;
} finally {
await client.end();
}
}
// Main execution
async function main() {
const jobName = process.argv[2];
if (!jobName) {
console.error('Usage: node cron-monitor.js <job_name>');
process.exit(1);
}
let monitor;
let jobFunction;
if (jobName === 'backup') {
monitor = new CronMonitor(SEIRI_PING_URL, 'database-backup');
jobFunction = backupDatabase;
} else if (jobName === 'process-orders') {
monitor = new CronMonitor(SEIRI_PING_URL, 'process-orders');
jobFunction = processOrders;
} else {
console.error(`Unknown job: ${jobName}`);
process.exit(1);
}
try {
await monitor.wrap(jobFunction);
process.exit(0);
} catch (error) {
process.exit(1);
}
}
if (require.main === module) {
main();
}
module.exports = { CronMonitor };
Crontab:
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>
0 2 * * * /usr/local/bin/node /app/cron-monitor.js backup
*/15 * * * * /usr/local/bin/node /app/cron-monitor.js process-orders
PHP Implementation
<?php
/**
* CronMonitor.php - PHP cron job monitoring
*/
class CronMonitor {
private $pingUrl;
private $jobName;
private $timeout = 10;
private $retries = 3;
public function __construct($pingUrl, $jobName) {
$this->pingUrl = $pingUrl;
$this->jobName = $jobName;
}
private function sendPing($status) {
$url = $this->pingUrl . '/' . $status;
for ($attempt = 0; $attempt < $this->retries; $attempt++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Seiri-Cron-Monitor/1.0');
$result = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode >= 200 && $httpCode < 300) {
return true;
}
if ($attempt < $this->retries - 1) {
sleep(pow(2, $attempt));
}
}
error_log("Failed to send $status ping after {$this->retries} attempts");
return false;
}
private function sendMetrics($metrics) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $this->pingUrl);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($metrics));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'Content-Type: application/json',
'User-Agent: Seiri-Cron-Monitor/1.0'
]);
$result = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode < 200 || $httpCode >= 300) {
error_log("Failed to send metrics: HTTP $httpCode");
return false;
}
return true;
}
public function wrap(callable $jobFunction) {
// Send start ping
$this->sendPing('start');
$startTime = microtime(true);
$status = 'success';
$exitCode = 0;
$errorMessage = null;
$result = null;
try {
// Execute the job
$result = $jobFunction();
} catch (Exception $e) {
// Job failed
$status = 'fail';
$exitCode = 1;
$errorMessage = $e->getMessage();
error_log("Job {$this->jobName} failed: " . $e->getMessage());
error_log($e->getTraceAsString());
throw $e;
} finally {
// Calculate duration
$duration = microtime(true) - $startTime;
// Send completion ping
$this->sendPing($status);
// Send detailed metrics
$metrics = [
'job_name' => $this->jobName,
'status' => $status,
'exit_code' => $exitCode,
'duration_seconds' => round($duration, 2),
'timestamp' => time()
];
if ($errorMessage !== null) {
$metrics['error_message'] = $errorMessage;
}
$this->sendMetrics($metrics);
}
return $result;
}
}
// Example usage
$SEIRI_PING_URL = getenv('SEIRI_PING_URL');
if (!$SEIRI_PING_URL) {
fwrite(STDERR, "Error: SEIRI_PING_URL environment variable not set\n");
exit(1);
}
// Database backup job
function backupDatabase() {
echo "Starting database backup...\n";
$date = date('Ymd');
$backupFile = "/backups/db-{$date}.sql";
// Run mysqldump
$command = "mysqldump -h localhost -u root mydb > " . escapeshellarg($backupFile);
exec($command, $output, $returnCode);
if ($returnCode !== 0) {
throw new Exception("mysqldump failed with exit code $returnCode");
}
echo "Backup saved to $backupFile\n";
// Upload to S3
$command = "aws s3 cp " . escapeshellarg($backupFile) . " s3://my-bucket/backups/";
exec($command, $output, $returnCode);
if ($returnCode !== 0) {
throw new Exception("S3 upload failed with exit code $returnCode");
}
echo "Backup uploaded to S3\n";
return $backupFile;
}
// Process orders job
function processOrders() {
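$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'root', '');
// Throw exceptions on SQL errors so the catch block below can count failures
// (this is the default only since PHP 8.0)
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);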
// Get pending orders
$stmt = $pdo->query("SELECT id FROM orders WHERE status = 'pending' LIMIT 100");
$orders = $stmt->fetchAll(PDO::FETCH_COLUMN);
if (empty($orders)) {
echo "No pending orders to process\n";
return 0;
}
$processedCount = 0;
$failedCount = 0;
foreach ($orders as $orderId) {
try {
$stmt = $pdo->prepare(
"UPDATE orders SET status = 'processed', processed_at = NOW() WHERE id = ?"
);
$stmt->execute([$orderId]);
$processedCount++;
} catch (Exception $e) {
fwrite(STDERR, "Failed to process order $orderId: " . $e->getMessage() . "\n");
$failedCount++;
}
}
echo "Processed $processedCount orders, $failedCount failed\n";
// Fail if too many errors
if ($failedCount > count($orders) * 0.1) {
throw new Exception("Too many failures: $failedCount/" . count($orders));
}
return $processedCount;
}
// Main execution
if (php_sapi_name() === 'cli') {
if ($argc < 2) {
fwrite(STDERR, "Usage: php cron-monitor.php <job_name>\n");
exit(1);
}
$jobName = $argv[1];
try {
if ($jobName === 'backup') {
$monitor = new CronMonitor($SEIRI_PING_URL, 'database-backup');
$monitor->wrap('backupDatabase');
} elseif ($jobName === 'process-orders') {
$monitor = new CronMonitor($SEIRI_PING_URL, 'process-orders');
$monitor->wrap('processOrders');
} else {
fwrite(STDERR, "Unknown job: $jobName\n");
exit(1);
}
exit(0);
} catch (Exception $e) {
exit(1);
}
}
?>
Crontab:
SEIRI_PING_URL=https://cloud.seiri.app/webhook/<cid>:<unique-id>
0 2 * * * php /app/cron-monitor.php backup
*/15 * * * * php /app/cron-monitor.php process-orders
Advanced Monitoring Strategies
1. Execution Time Monitoring
Track how long jobs take to identify performance degradation:
import sys
import time
import statistics
import requests
class ExecutionTimeMonitor:
    """Track execution time trends"""
    def __init__(self, ping_url, job_name):
        self.ping_url = ping_url
        self.job_name = job_name
        self.history_file = f'/var/log/cron/{job_name}-times.log'
    def record_execution(self, duration_seconds):
        """Record execution time"""
        with open(self.history_file, 'a') as f:
            f.write(f"{int(time.time())},{duration_seconds}\n")
        # Keep only last 100 executions
        self.trim_history()
    def trim_history(self, keep=100):
        """Keep only recent executions"""
        try:
            with open(self.history_file, 'r') as f:
                lines = f.readlines()
            if len(lines) > keep:
                with open(self.history_file, 'w') as f:
                    f.writelines(lines[-keep:])
        except FileNotFoundError:
            pass
    def get_statistics(self):
        """Calculate execution time statistics"""
        try:
            with open(self.history_file, 'r') as f:
                times = [float(line.split(',')[1]) for line in f]
            if not times:
                return None
            return {
                'mean': statistics.mean(times),
                'median': statistics.median(times),
                'stdev': statistics.stdev(times) if len(times) > 1 else 0,
                'min': min(times),
                'max': max(times),
                'recent': times[-1],
                'count': len(times)
            }
        except (FileNotFoundError, ValueError, IndexError):
            return None
    def check_for_anomaly(self, current_duration):
        """Detect if current execution is anomalously slow"""
        stats = self.get_statistics()
        if not stats or stats['count'] < 10:
            return False  # Not enough data
        # If current duration is more than 3 standard deviations above the mean
        threshold = stats['mean'] + (3 * stats['stdev'])
        return current_duration > threshold
# Usage example
# CronMonitor and SEIRI_PING_URL come from the Python implementation above;
# perform_sync() is a placeholder for the real work
@CronMonitor(SEIRI_PING_URL, 'data-sync')
def sync_data():
    time_monitor = ExecutionTimeMonitor(SEIRI_PING_URL, 'data-sync')
    start = time.time()
    # Do the actual work
    perform_sync()
    duration = time.time() - start
    # Record execution time
    time_monitor.record_execution(duration)
    # Check for anomaly
    if time_monitor.check_for_anomaly(duration):
        stats = time_monitor.get_statistics()
        print(f"WARNING: Job took {duration:.1f}s (mean: {stats['mean']:.1f}s)",
              file=sys.stderr)
        # Send alert metric
        requests.post(SEIRI_PING_URL, json={
            'alert': 'slow_execution',
            'current_duration': duration,
            'mean_duration': stats['mean'],
            'deviation': duration - stats['mean']
        }, timeout=10)
2. Output Validation
Don’t just check exit codes—validate the actual results:
#!/bin/bash
# validate-backup.sh - Validate backup job results
SEIRI_PING_URL="https://cloud.seiri.app/webhook/<cid>:<unique-id>"
BACKUP_DIR="/backups"
MIN_SIZE_MB=100 # Minimum expected backup size
MAX_AGE_HOURS=2 # Maximum age for latest backup
# Send start ping
curl -fsS "${SEIRI_PING_URL}/start"
# Find latest backup
# cut -f2- keeps the whole path even if it contains spaces
LATEST_BACKUP=$(find "$BACKUP_DIR" -name "db-*.sql" -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
if [ -z "$LATEST_BACKUP" ]; then
echo "ERROR: No backup files found" >&2
curl -fsS "${SEIRI_PING_URL}/fail"
exit 1
fi
# Check backup age
BACKUP_AGE_HOURS=$(( ($(date +%s) - $(stat -c %Y "$LATEST_BACKUP")) / 3600 ))
if [ $BACKUP_AGE_HOURS -gt $MAX_AGE_HOURS ]; then
echo "ERROR: Latest backup is ${BACKUP_AGE_HOURS} hours old (max: ${MAX_AGE_HOURS})" >&2
curl -fsS "${SEIRI_PING_URL}/fail"
exit 1
fi
# Check backup size
BACKUP_SIZE_MB=$(du -m "$LATEST_BACKUP" | cut -f1)
if [ $BACKUP_SIZE_MB -lt $MIN_SIZE_MB ]; then
echo "ERROR: Backup is only ${BACKUP_SIZE_MB}MB (min: ${MIN_SIZE_MB}MB)" >&2
curl -fsS "${SEIRI_PING_URL}/fail"
exit 1
fi
# Validate SQL syntax (quick check)
if ! head -100 "$LATEST_BACKUP" | grep -q "CREATE TABLE"; then
echo "ERROR: Backup doesn't appear to be valid SQL" >&2
curl -fsS "${SEIRI_PING_URL}/fail"
exit 1
fi
# All validations passed
echo "Backup validation passed: ${BACKUP_SIZE_MB}MB, ${BACKUP_AGE_HOURS}h old"
curl -fsS "${SEIRI_PING_URL}/success"
# Send detailed metrics
curl -fsS -X POST "${SEIRI_PING_URL}" \
-H "Content-Type: application/json" \
-d "{
\"backup_size_mb\": ${BACKUP_SIZE_MB},
\"backup_age_hours\": ${BACKUP_AGE_HOURS},
\"validation_passed\": true
}"
3. Dependency Chain Monitoring
Monitor jobs that depend on each other:
import requests
import time
class DependencyChain:
    """Monitor dependent cron jobs"""
    def __init__(self, ping_url):
        self.ping_url = ping_url
    def wait_for_upstream(self, upstream_job, timeout_seconds=3600, check_interval=60):
        """Wait for upstream job to complete"""
        start_time = time.time()
        while time.time() - start_time < timeout_seconds:
            # Check if upstream job completed successfully
            # This assumes Seiri provides a status check endpoint
            try:
                response = requests.get(
                    f"{self.ping_url}/../{upstream_job}/status",
                    timeout=10
                )
                if response.status_code == 200:
                    data = response.json()
                    if data.get('last_status') == 'success':
                        # Check recency (within last hour)
                        last_run = data.get('last_run_timestamp', 0)
                        if time.time() - last_run < 3600:
                            return True
            except requests.RequestException:
                pass
            # Wait before checking again
            time.sleep(check_interval)
        # Timeout reached
        raise TimeoutError(f"Upstream job {upstream_job} did not complete within {timeout_seconds}s")
# Example: Data pipeline with dependencies
@CronMonitor(SEIRI_PING_URL, 'extract-data')
def extract_data():
    """Step 1: Extract data from source"""
    # Extract logic here
    pass
@CronMonitor(SEIRI_PING_URL, 'transform-data')
def transform_data():
    """Step 2: Transform extracted data (depends on extract)"""
    chain = DependencyChain(SEIRI_PING_URL)
    # Wait for extract job to complete
    chain.wait_for_upstream('extract-data', timeout_seconds=1800)
    # Now safe to transform
    # Transform logic here
    pass
@CronMonitor(SEIRI_PING_URL, 'load-data')
def load_data():
    """Step 3: Load transformed data (depends on transform)"""
    chain = DependencyChain(SEIRI_PING_URL)
    # Wait for transform job to complete
    chain.wait_for_upstream('transform-data', timeout_seconds=1800)
    # Now safe to load
    # Load logic here
    pass
Crontab for pipeline:
# Extract runs at 1 AM
0 1 * * * /app/pipeline.py extract
# Transform runs at 2 AM (after extract)
0 2 * * * /app/pipeline.py transform
# Load runs at 3 AM (after transform)
0 3 * * * /app/pipeline.py load
4. Grace Period Configuration
Different jobs need different monitoring windows:
class GracePeriodConfig:
    """Configure appropriate grace periods for different job types"""
    CONFIGS = {
        'database-backup': {
            'expected_duration_minutes': 30,
            'grace_period_minutes': 15,  # 15 minutes past expected
            'schedule_cron': '0 2 * * *',  # Daily at 2 AM
            'criticality': 'high'
        },
        'log-rotation': {
            'expected_duration_minutes': 2,
            'grace_period_minutes': 5,
            'schedule_cron': '0 0 * * *',  # Daily at midnight
            'criticality': 'medium'
        },
        'send-reports': {
            'expected_duration_minutes': 10,
            'grace_period_minutes': 20,
            'schedule_cron': '0 6 * * 1',  # Weekly on Monday at 6 AM
            'criticality': 'medium'
        },
        'process-payments': {
            'expected_duration_minutes': 5,
            'grace_period_minutes': 3,
            'schedule_cron': '*/15 * * * *',  # Every 15 minutes
            'criticality': 'critical'
        }
    }
    @classmethod
    def get_config(cls, job_name):
        """Get configuration for a job"""
        return cls.CONFIGS.get(job_name, {
            'expected_duration_minutes': 60,
            'grace_period_minutes': 30,
            'criticality': 'low'
        })
    @classmethod
    def should_alert(cls, job_name, duration_minutes):
        """Determine if duration warrants an alert"""
        config = cls.get_config(job_name)
        threshold = config['expected_duration_minutes'] + config['grace_period_minutes']
        return duration_minutes > threshold
Platform-Specific Monitoring
Linux Cron Jobs
Standard cron monitoring on Linux systems:
#!/bin/bash
# monitor-cron.sh - Monitor cron daemon health
SEIRI_PING_URL="https://cloud.seiri.app/ping/cron-daemon-health"
# Check if cron daemon is running
if ! pgrep -x cron > /dev/null && ! pgrep -x crond > /dev/null; then
    echo "ERROR: Cron daemon not running" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    # Attempt to restart cron
    if command -v systemctl > /dev/null; then
        systemctl restart cron || systemctl restart crond
    else
        service cron restart || service crond restart
    fi
    exit 1
fi
# Check cron logs for recent activity
if [ -f /var/log/cron ]; then
    RECENT_JOBS=$(grep "CMD" /var/log/cron | tail -10)
    if [ -z "$RECENT_JOBS" ]; then
        echo "WARNING: No recent cron activity in logs" >&2
    fi
fi
# All checks passed
curl -fsS "${SEIRI_PING_URL}/success"
Run this health check hourly:
0 * * * * /usr/local/bin/monitor-cron.sh
Windows Task Scheduler
Monitoring Windows scheduled tasks:
# Monitor-ScheduledTask.ps1
param(
    [Parameter(Mandatory=$true)]
    [string]$TaskName,
    [Parameter(Mandatory=$true)]
    [string]$SeiriPingUrl
)
# Send start ping
Invoke-RestMethod -Uri "$SeiriPingUrl/start" -Method Get -TimeoutSec 10
try {
    # Get task information
    $task = Get-ScheduledTask -TaskName $TaskName -ErrorAction Stop
    $taskInfo = Get-ScheduledTaskInfo -TaskName $TaskName
    # Check that the task is enabled and idle (state 'Ready')
    if ($task.State -ne 'Ready') {
        throw "Task is in state: $($task.State)"
    }
    # Check last run result
    if ($taskInfo.LastTaskResult -ne 0) {
        throw "Last run failed with code: $($taskInfo.LastTaskResult)"
    }
    # Check if task ran recently (within expected window)
    $expectedIntervalHours = 24 # Adjust based on task schedule
    $hoursSinceRun = ((Get-Date) - $taskInfo.LastRunTime).TotalHours
    if ($hoursSinceRun -gt ($expectedIntervalHours + 2)) {
        throw "Task hasn't run in $hoursSinceRun hours (expected: $expectedIntervalHours)"
    }
    # Send success ping with metrics
    $metrics = @{
        task_name = $TaskName
        last_run_time = $taskInfo.LastRunTime.ToString('o')
        last_result = $taskInfo.LastTaskResult
        next_run_time = $taskInfo.NextRunTime.ToString('o')
        state = $task.State
    } | ConvertTo-Json
    Invoke-RestMethod -Uri "$SeiriPingUrl/success" -Method Get -TimeoutSec 10
    Invoke-RestMethod -Uri $SeiriPingUrl -Method Post -Body $metrics -ContentType "application/json" -TimeoutSec 10
    exit 0
} catch {
    Write-Error "Task monitoring failed: $_"
    # Send failure ping
    Invoke-RestMethod -Uri "$SeiriPingUrl/fail" -Method Get -TimeoutSec 10
    exit 1
}
Wrap your Windows scheduled task:
# Instead of running your script directly:
C:\Scripts\backup.ps1
# Run it through the monitor (the && operator requires cmd.exe or PowerShell 7+):
powershell.exe -File C:\Scripts\Monitor-ScheduledTask.ps1 -TaskName "Database Backup" -SeiriPingUrl "https://cloud.seiri.app/ping/your-id" && powershell.exe -File C:\Scripts\backup.ps1
Kubernetes CronJobs
Monitoring Kubernetes CronJobs:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:14
            env:
            - name: SEIRI_PING_URL
              valueFrom:
                secretKeyRef:
                  name: seiri-credentials
                  key: ping-url
            command:
            - /bin/sh
            - -c
            - |
              # Send start ping
              wget -qO- "${SEIRI_PING_URL}/start"
              # Run backup
              if pg_dump -h postgres -U postgres mydb > /backup/db-$(date +%Y%m%d).sql; then
                # Success
                wget -qO- "${SEIRI_PING_URL}/success"
                exit 0
              else
                # Failure
                wget -qO- "${SEIRI_PING_URL}/fail"
                exit 1
              fi
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
AWS Lambda Scheduled Functions
Monitoring Lambda functions triggered by EventBridge:
import json
import os
import urllib3

http = urllib3.PoolManager()
SEIRI_PING_URL = os.environ['SEIRI_PING_URL']

def send_ping(status):
    """Send ping to Seiri"""
    try:
        http.request('GET', f"{SEIRI_PING_URL}/{status}", timeout=10)
    except Exception as e:
        print(f"Failed to send ping: {e}")

def lambda_handler(event, context):
    """AWS Lambda handler with monitoring"""
    # Send start ping
    send_ping('start')
    try:
        # Your actual Lambda logic here
        result = perform_scheduled_task()
        # Send success ping
        send_ping('success')
        return {
            'statusCode': 200,
            'body': json.dumps('Success')
        }
    except Exception as e:
        print(f"Error: {e}")
        # Send failure ping
        send_ping('fail')
        # Re-raise so Lambda records the invocation as failed; returning a
        # 500-style dict would still count as a successful invocation
        raise

def perform_scheduled_task():
    """Your scheduled task logic"""
    # Task implementation
    return {'processed': 100}
Google Cloud Scheduler
from flask import Flask, request
import requests
import os

app = Flask(__name__)
SEIRI_PING_URL = os.environ['SEIRI_PING_URL']

@app.route('/scheduled-task', methods=['POST'])
def scheduled_task():
    """Endpoint triggered by Cloud Scheduler"""
    # Send start ping
    requests.get(f"{SEIRI_PING_URL}/start", timeout=10)
    try:
        # Your task logic
        result = perform_task()
        # Send success ping
        requests.get(f"{SEIRI_PING_URL}/success", timeout=10)
        return 'Success', 200
    except Exception as e:
        # Send failure ping
        requests.get(f"{SEIRI_PING_URL}/fail", timeout=10)
        return f'Error: {str(e)}', 500

def perform_task():
    """Your scheduled task"""
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
The Dead Man’s Switch Pattern
The dead man’s switch (also called heartbeat monitoring) is the most reliable pattern for cron job monitoring. Here’s why it works and how to implement it properly.
What is a Dead Man’s Switch?
The concept comes from trains: a dead man’s switch requires the operator to continuously hold a button. If they become incapacitated (the “dead man”), they release the button and the train stops automatically.
In cron monitoring:
- Your job sends regular “I’m alive” pings
- The monitoring service expects pings at specific intervals
- If pings stop, you get alerted
Why it’s superior:
- Works across firewalls: Your server pings out, no inbound connections needed
- Detects most failure types: job didn’t run, hung, crashed, or the server is down. All of these result in missing pings (a job that exits 0 without doing its work still needs output validation)
- Simple integration: Just add a curl command to your jobs
- No polling overhead: The monitoring service waits for pings, doesn’t poll your servers
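To make the monitoring side of the pattern concrete, here is a minimal sketch of the check a heartbeat service might run for each job. The function name `is_overdue` and its parameters are illustrative assumptions, not Seiri's actual implementation; the sketch also assumes a job that has never pinged should count as overdue.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_overdue(
    last_ping: Optional[datetime],
    interval: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """Return True when no ping has arrived within interval + grace.

    Hypothetical helper: a job that has never pinged is treated as overdue.
    """
    if last_ping is None:
        return True
    return now - last_ping > interval + grace


if __name__ == "__main__":
    # An hourly job with a 10-minute grace period alerts after 70 minutes of silence
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    interval = timedelta(hours=1)
    grace = timedelta(minutes=10)
    for minutes_ago in (65, 75):
        last_ping = now - timedelta(minutes=minutes_ago)
        print(minutes_ago, is_overdue(last_ping, interval, grace, now))
```

Because the check only compares timestamps, it behaves the same whether the job never started, hung partway, or ran on a server that went down: in every case the last ping simply gets too old.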
Implementing Dead Man’s Switch
import sys
import time

import requests

class DeadManSwitch:
    """Implement dead man's switch pattern"""
    def __init__(self, ping_url, interval_minutes):
        self.ping_url = ping_url
        self.interval_minutes = interval_minutes

    def send_heartbeat(self):
        """Send heartbeat ping"""
        try:
            response = requests.get(
                self.ping_url,
                timeout=10,
                headers={'User-Agent': 'DeadManSwitch/1.0'}
            )
            response.raise_for_status()
            return True
        except Exception as e:
            print(f"Heartbeat failed: {e}", file=sys.stderr)
            return False

    def get_next_expected(self):
        """Calculate when next heartbeat is expected"""
        return time.time() + (self.interval_minutes * 60)

# For a job that runs every hour
@CronMonitor(SEIRI_PING_URL, 'hourly-sync')
def hourly_sync_job():
    """Job that runs every hour"""
    # Job logic here
    pass

# Crontab:
# 0 * * * * /app/jobs.py hourly-sync
#
# Seiri is configured to expect heartbeats every hour with 10-minute grace period
# If no ping received in 70 minutes → Alert
Advanced Dead Man’s Switch: Multiple Heartbeats
For long-running jobs, send multiple heartbeats:
import sys
import threading

import requests

class ContinuousHeartbeat:
    """Send heartbeats during long-running jobs"""
    def __init__(self, ping_url, interval_seconds=60):
        self.ping_url = ping_url
        self.interval_seconds = interval_seconds
        self._stop_event = threading.Event()
        self.thread = None

    def start(self):
        """Start sending heartbeats"""
        self._stop_event.clear()
        self.thread = threading.Thread(target=self._heartbeat_loop, daemon=True)
        self.thread.start()

    def stop(self):
        """Stop sending heartbeats"""
        self._stop_event.set()
        if self.thread:
            self.thread.join(timeout=5)

    def _heartbeat_loop(self):
        """Continuous heartbeat loop"""
        while not self._stop_event.is_set():
            try:
                requests.get(f"{self.ping_url}/heartbeat", timeout=10)
            except Exception as e:
                print(f"Heartbeat failed: {e}", file=sys.stderr)
            # Wait for the next interval, waking early if stop() is called
            self._stop_event.wait(self.interval_seconds)

# Usage for long-running job
def long_running_job():
    """Job that takes 2+ hours"""
    heartbeat = ContinuousHeartbeat(SEIRI_PING_URL, interval_seconds=300)  # Every 5 minutes
    try:
        heartbeat.start()
        # Long-running work
        process_large_dataset()  # Takes 2 hours
    finally:
        heartbeat.stop()
Monitoring Best Practices
1. Categorize Jobs by Criticality
Not all cron jobs are equally important:
class JobCriticality:
    """Categorize jobs by business impact"""
    CRITICAL = {
        'alert_immediately': True,
        'retry_on_failure': True,
        'escalate_after_minutes': 5,
        'notify_channels': ['pagerduty', 'sms', 'slack'],
        'max_acceptable_delay_minutes': 5
    }
    HIGH = {
        'alert_immediately': True,
        'retry_on_failure': True,
        'escalate_after_minutes': 30,
        'notify_channels': ['slack', 'email'],
        'max_acceptable_delay_minutes': 15
    }
    MEDIUM = {
        'alert_immediately': False,
        'retry_on_failure': True,
        'escalate_after_minutes': 120,
        'notify_channels': ['email'],
        'max_acceptable_delay_minutes': 60
    }
    LOW = {
        'alert_immediately': False,
        'retry_on_failure': False,
        'escalate_after_minutes': 1440,  # 24 hours
        'notify_channels': ['email'],
        'max_acceptable_delay_minutes': 240
    }

# Example categorization
JOBS = {
    'process-payments': JobCriticality.CRITICAL,
    'database-backup': JobCriticality.CRITICAL,
    'generate-daily-reports': JobCriticality.HIGH,
    'send-weekly-newsletter': JobCriticality.MEDIUM,
    'cleanup-temp-files': JobCriticality.LOW,
    'update-cache': JobCriticality.MEDIUM
}
2. Avoid Alert Fatigue
Don’t alert on everything:
import time

class SmartAlerting:
    """Intelligent alerting to prevent fatigue"""
    def __init__(self):
        self.failure_counts = {}
        self.last_alert_time = {}

    def should_alert(self, job_name, current_failure):
        """Determine if we should send an alert"""
        now = time.time()
        # Increment the consecutive failure count (starts at 1 on first failure)
        failures = self.failure_counts.get(job_name, 0) + 1
        self.failure_counts[job_name] = failures
        # Alert on specific failure counts: 1, 3, 5, 10, 25, 50, 100
        alert_thresholds = {1, 3, 5, 10, 25, 50, 100}
        # Also alert if it's been more than 24 hours since last alert
        time_since_last_alert = now - self.last_alert_time.get(job_name, 0)
        if failures in alert_thresholds or time_since_last_alert > 86400:
            # Record every alert so the 24-hour reminder counts from the latest one
            self.last_alert_time[job_name] = now
            return True
        return False

    def reset_on_success(self, job_name):
        """Reset counters when job succeeds"""
        self.failure_counts.pop(job_name, None)
        self.last_alert_time.pop(job_name, None)
3. Document Expected Behavior
Create a job manifest:
# cron-jobs-manifest.yaml
jobs:
  database-backup:
    schedule: "0 2 * * *"
    expected_duration_minutes: 30
    grace_period_minutes: 15
    timeout_minutes: 120
    criticality: critical
    owner: platform-team
    description: "Daily PostgreSQL backup to S3"
    dependencies: []
    validates: "backup file size > 100MB"
  process-orders:
    schedule: "*/15 * * * *"
    expected_duration_minutes: 5
    grace_period_minutes: 3
    timeout_minutes: 10
    criticality: critical
    owner: payments-team
    description: "Process pending payment orders"
    dependencies: []
    validates: "at least 1 order processed in last hour"
  generate-reports:
    schedule: "0 6 * * 1"
    expected_duration_minutes: 10
    grace_period_minutes: 20
    timeout_minutes: 60
    criticality: high
    owner: analytics-team
    description: "Weekly sales reports"
    dependencies: ["database-backup"]
    validates: "report sent to [email protected]"
4. Test Your Monitoring
Regularly test that monitoring actually works:
#!/bin/bash
# test-monitoring.sh - Test that cron monitoring catches failures
SEIRI_PING_URL="https://cloud.seiri.app/ping/test-job"
echo "Testing cron monitoring..."
# Test 1: Successful job
echo "Test 1: Success case"
curl -fsS "${SEIRI_PING_URL}/start"
sleep 2
curl -fsS "${SEIRI_PING_URL}/success"
echo "✓ Success ping sent"
sleep 5
# Test 2: Failed job
echo "Test 2: Failure case"
curl -fsS "${SEIRI_PING_URL}/start"
sleep 2
curl -fsS "${SEIRI_PING_URL}/fail"
echo "✓ Failure ping sent"
sleep 5
# Test 3: Job that never completes (timeout)
echo "Test 3: Timeout case"
curl -fsS "${SEIRI_PING_URL}/start"
# Never send completion ping
echo "✓ Start ping sent, no completion (should timeout)"
echo ""
echo "Check your Seiri dashboard to verify:"
echo "1. Test 1 shows as success"
echo "2. Test 2 shows as failure"
echo "3. Test 3 shows as timeout/missing after grace period"
Troubleshooting Common Issues
Issue 1: Cron Job Not Running
Symptoms:
- No pings received
- No entries in cron logs
- Job never executes
Diagnosis:
#!/bin/bash
# diagnose-cron.sh
echo "=== Cron Daemon Status ==="
if pgrep -x cron > /dev/null || pgrep -x crond > /dev/null; then
    echo "✓ Cron daemon is running"
else
    echo "✗ Cron daemon NOT running"
fi
echo ""
echo "=== Crontab for current user ==="
crontab -l
echo ""
echo "=== Recent cron activity ==="
if [ -f /var/log/cron ]; then
    tail -20 /var/log/cron
elif [ -f /var/log/syslog ]; then
    grep CRON /var/log/syslog | tail -20
fi
echo ""
echo "=== Check environment ==="
env | sort
echo ""
echo "=== Test cron job manually ==="
echo "Run your cron command manually to check for errors:"
echo "/path/to/your/script.sh"
Common fixes:
# Fix 1: Cron daemon not running
sudo systemctl start cron # or crond
# Fix 2: Syntax error in crontab
crontab -e # Check for errors
# Fix 3: Script permissions
chmod +x /path/to/script.sh
# Fix 4: Missing PATH
# Add to crontab:
PATH=/usr/local/bin:/usr/bin:/bin
# Fix 5: User deleted/disabled
# Check if user exists:
id username
Issue 2: Job Runs But Fails Silently
Symptoms:
- Job appears in logs
- No heartbeat pings received
- Exit code 0 but work not done
Diagnosis:
#!/bin/bash
# debug-cron-job.sh - Enhanced logging for troubleshooting
# Redirect all output to log file
exec 1>"/var/log/cron-jobs/$(basename "$0")-$(date +%Y%m%d-%H%M%S).log"
exec 2>&1
# Enable error exit
set -e
# Log environment
echo "=== Environment ==="
env | sort
echo ""
echo "=== Working Directory ==="
pwd
echo ""
echo "=== Start Time ==="
date
echo ""
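# Your actual job
echo "=== Job Execution ==="
# Capture the exit code right away: a later "$?" would report the last echo,
# and "set -e" would otherwise abort before anything is logged
EXIT_CODE=0
/path/to/actual/script.sh || EXIT_CODE=$?
echo ""
echo "=== End Time ==="
date
echo ""
echo "=== Exit Code: $EXIT_CODE ==="
exit "$EXIT_CODE"
Issue 3: Monitoring Calls Failing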
Symptoms:
- Job executes successfully
- No pings received by Seiri
- Network/DNS errors in logs
Diagnosis:
#!/bin/bash
# test-seiri-connectivity.sh
SEIRI_PING_URL="https://cloud.seiri.app/ping/your-id"
echo "Testing connectivity to Seiri..."
# Test DNS resolution
echo "1. DNS Resolution:"
host cloud.seiri.app
# Test HTTPS connectivity
echo "2. HTTPS Connectivity:"
curl -v "${SEIRI_PING_URL}/test" 2>&1 | head -20
# Test from cron environment
echo "3. Test from minimal environment (simulating cron):"
env -i PATH=/usr/bin:/bin curl -v "${SEIRI_PING_URL}/test" 2>&1 | head -20
# Check for proxy settings
echo "4. Proxy Configuration:"
env | grep -i proxy
Common fixes:
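# Fix 1: DNS resolution (needs root; a plain "sudo echo ... >>" would fail because
# the redirect runs as your user, and tools such as systemd-resolved or
# NetworkManager may overwrite this file)
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf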
# Fix 2: SSL certificate issues
curl -k "${SEIRI_PING_URL}/test" # Warning: Only for testing!
# Fix 3: Proxy configuration
export http_proxy=http://proxy.company.com:8080
export https_proxy=http://proxy.company.com:8080
# Fix 4: Timeout issues
curl -m 30 "${SEIRI_PING_URL}/test" # Increase timeout
Issue 4: Job Times Out
Symptoms:
- Job starts but never completes
- Process hangs indefinitely
- Multiple instances accumulate
Solution:
#!/bin/bash
# timeout-wrapper.sh - Enforce timeout on jobs
TIMEOUT_SECONDS=3600 # 1 hour
# Use timeout command; "$@" passes each argument through exactly as given,
# so arguments containing spaces are not split
if timeout "${TIMEOUT_SECONDS}" "$@"; then
    echo "Job completed successfully"
    exit 0
else
    EXIT_CODE=$?
    if [ $EXIT_CODE -eq 124 ]; then
        echo "Job timed out after ${TIMEOUT_SECONDS} seconds" >&2
        exit 124
    else
        echo "Job failed with exit code $EXIT_CODE" >&2
        exit $EXIT_CODE
    fi
fi
Prevent multiple instances:
#!/bin/bash
# single-instance.sh - Prevent concurrent execution
LOCKFILE="/var/lock/$(basename "$0").lock"
SEIRI_PING_URL="https://cloud.seiri.app/ping/your-id"
# Try to acquire lock
exec 200>"$LOCKFILE"
if ! flock -n 200; then
    echo "Another instance is already running" >&2
    curl -fsS "${SEIRI_PING_URL}/fail"
    exit 1
fi
# The lock is released automatically when the script exits. The lock file is
# left in place: deleting it can let two instances each hold a lock on a
# different file with the same name.
# Send start ping
curl -fsS "${SEIRI_PING_URL}/start"
# Run the actual job
if /path/to/actual/job.sh; then
    curl -fsS "${SEIRI_PING_URL}/success"
else
    curl -fsS "${SEIRI_PING_URL}/fail"
fi
Getting Started with Seiri
Quick Setup (5 Minutes)
Step 1: Sign up for Seiri
Visit https://cloud.seiri.app and create your free account.
Step 2: Create your first cron monitor
Navigate to “Cron Jobs” in your dashboard
Click “Create New Monitor”
Configure your job:
- Name: “Database Backup”
- Schedule: Every day at 2 AM
- Grace Period: 15 minutes
- Timeout: 2 hours
Copy your unique ping URL
Step 3: Add monitoring to your cron job
# Your existing cron job
0 2 * * * /usr/local/bin/backup-database.sh
# Enhanced with Seiri (the ";" after the start ping lets the backup run even if that ping fails)
0 2 * * * curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/start; /usr/local/bin/backup-database.sh && curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/success || curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/fail
Step 4: Configure alerts
In your Seiri dashboard:
- Add Slack webhook for instant notifications
- Add email for backup alerts
- Set up SMS for critical jobs (optional)
- Configure PagerDuty integration (optional)
Step 5: Test it
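Run the full monitored command manually. Running the backup script on its own sends no pings, because the pings live in the crontab entry:
curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/start; /usr/local/bin/backup-database.sh && curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/success || curl -fsS -m 10 --retry 3 https://cloud.seiri.app/ping/abc123/fail
Check your Seiri dashboard. You should see: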
- Start time
- Completion status
- Duration
- Any error messages
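The fields listed above (start time, completion status, duration) come from the pings sent around the job. As a rough sketch of where they originate, the same start/success/fail flow can be written in Python. The names `run_monitored` and `http_ping` are illustrative assumptions, and the ping function is injectable so the flow can be exercised without network access.

```python
import subprocess
import sys
import time
import urllib.request


def http_ping(url):
    """Best-effort GET; a failed ping is logged but never raises."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError as exc:  # URLError, HTTPError, and timeouts are OSError subclasses
        print(f"Ping failed: {exc}", file=sys.stderr)


def run_monitored(command, ping_url, ping=http_ping):
    """Run command (a list of arguments) between a start ping and a result ping.

    Returns (exit_code, duration_seconds).
    """
    ping(f"{ping_url}/start")
    started = time.monotonic()
    result = subprocess.run(command)
    duration = time.monotonic() - started
    status = "success" if result.returncode == 0 else "fail"
    ping(f"{ping_url}/{status}")
    return result.returncode, duration
```

A cron entry could then call a small script that runs `run_monitored(["/usr/local/bin/backup-database.sh"], "https://cloud.seiri.app/ping/abc123")` and exits with the returned code. Keeping ping failures non-fatal means a monitoring outage never prevents the backup itself from running.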
Production Best Practices
For production deployments:
#!/bin/bash
# production-cron-wrapper.sh - Production-ready cron wrapper
set -euo pipefail
# Configuration (":-" defaults keep "set -u" from aborting before validation runs)
SEIRI_PING_URL="${SEIRI_PING_URL:-}"
JOB_NAME="${1:-}"
# Validation
if [ -z "$SEIRI_PING_URL" ] || [ -z "$JOB_NAME" ] || [ "$#" -lt 2 ]; then
    echo "Error: Missing required configuration" >&2
    exit 1
fi
shift
# Keep the command as an array so arguments containing spaces survive intact
JOB_COMMAND=("$@")
# Create log directory
LOG_DIR="/var/log/cron-jobs"
mkdir -p "$LOG_DIR"
# Log file with timestamp
LOG_FILE="$LOG_DIR/${JOB_NAME}-$(date +%Y%m%d-%H%M%S).log"
# Redirect output
exec 1>>"$LOG_FILE"
exec 2>&1
# Log environment for debugging
echo "=== Job: $JOB_NAME ==="
echo "Start: $(date)"
echo "Command: ${JOB_COMMAND[*]}"
echo "User: $(whoami)"
echo "Host: $(hostname)"
echo ""
# Function to send pings with retry
send_ping() {
    local status="$1"
    local max_attempts=5
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        if curl -fsS -m 10 "${SEIRI_PING_URL}/${status}" 2>>/var/log/seiri-errors.log; then
            return 0
        fi
        echo "Ping attempt $attempt/$max_attempts failed" >&2
        sleep $((2 ** attempt))
        attempt=$((attempt + 1))
    done
    echo "ERROR: All ping attempts failed for $status" >&2
    return 1
}
# Send start ping ("|| true": under "set -e" a failed ping must not stop the job)
send_ping "start" || true
# Execute job
START_TIME=$(date +%s)
EXIT_CODE=0
if "${JOB_COMMAND[@]}"; then
    EXIT_CODE=0
    STATUS="success"
else
    EXIT_CODE=$?
    STATUS="fail"
fi
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
# Log completion
echo ""
echo "End: $(date)"
echo "Duration: ${DURATION}s"
echo "Exit Code: $EXIT_CODE"
echo "Status: $STATUS"
# Send completion ping
send_ping "$STATUS" || true
# Send detailed metrics
curl -fsS -X POST "${SEIRI_PING_URL}" \
    -H "Content-Type: application/json" \
    -d "{
    \"job_name\": \"${JOB_NAME}\",
    \"exit_code\": ${EXIT_CODE},
    \"duration_seconds\": ${DURATION},
    \"hostname\": \"$(hostname)\",
    \"log_file\": \"${LOG_FILE}\"
}" 2>>/var/log/seiri-errors.log || true
# Cleanup old logs (keep last 30 days)
find "$LOG_DIR" -name "${JOB_NAME}-*.log" -mtime +30 -delete || true
exit $EXIT_CODE
Production crontab:
# Set Seiri URL
SEIRI_PING_URL=https://cloud.seiri.app/ping/your-production-id
# Set PATH
PATH=/usr/local/bin:/usr/bin:/bin
# Critical jobs with monitoring
0 2 * * * /usr/local/bin/production-cron-wrapper.sh "database-backup" /usr/local/bin/backup-database.sh
0 6 * * * /usr/local/bin/production-cron-wrapper.sh "generate-reports" /usr/local/bin/generate-reports.sh
*/15 * * * * /usr/local/bin/production-cron-wrapper.sh "process-payments" /app/bin/process-payments
0 3 * * * /usr/local/bin/production-cron-wrapper.sh "sync-data" /usr/local/bin/sync-data.sh
Conclusion
Cron job monitoring is not optional; it is essential infrastructure for any production system. Silent failures in scheduled tasks lead to lost data, missed SLA obligations, and operational overhead.
Key takeaways:
- Use heartbeat/dead man’s switch pattern for reliable monitoring
- Monitor execution time to catch performance degradation early
- Validate output, not just exit codes
- Categorize jobs by criticality and alert appropriately
- Prevent alert fatigue with smart alerting logic
- Test your monitoring regularly
- Document expected behavior for all jobs
The cost of not monitoring:
- Lost backups discovered during disasters
- Silent payment processing failures
- Data pipelines breaking for weeks
- Compliance violations
- Customer-facing features degrading
The cost of monitoring:
- 5 minutes to set up
- One curl command per job
- Peace of mind that failures are caught immediately
Ready to stop worrying about silent cron failures?
Seiri provides intelligent cron job monitoring with heartbeat detection, smart alerting, and detailed execution tracking. Monitor unlimited cron jobs, get instant alerts when jobs fail, and sleep better knowing your scheduled tasks are watched 24/7.
Start monitoring your cron jobs for free →
Have questions about monitoring complex cron job scenarios? Contact our team - we love helping developers build more reliable infrastructure.