Proprietary Tool Internal • Active

Relay

Task scheduler and workflow engine built to manage complex, interdependent work across teams and systems. Handles reliability, observability, and human oversight.

Why We Built It

Project delivery involves hundreds of tasks: code reviews, deployments, data migrations, incident response. Standard schedulers like cron are fragile: they don't retry failed jobs, they offer no way to roll back, and when something breaks at 2am, you're flying blind.

We needed a system that:

  • Retries failed tasks automatically with exponential backoff
  • Lets us inspect, debug, and manually trigger tasks from a dashboard
  • Stores complete history so we can audit what happened
  • Integrates with our monitoring so failures surface before they hurt users
  • Makes it safe to deploy new workflows without shutting down running ones

How It Works

Relay runs as a distributed service. Tasks are defined as code and stored in a database. A scheduler polls for due tasks and dispatches them to workers. Workers execute, report results, and the system decides whether to retry or move on.
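The poll-and-dispatch cycle described above can be sketched in a few lines. This is an illustrative model only, not Relay's actual code: the `Task` class, `poll_once`, and `execute` are hypothetical names, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical in-memory stand-in for a stored task row."""
    name: str
    run_at: float                # time (in seconds) at which the task is due
    action: Callable[[], None]   # the work a worker performs
    state: str = "scheduled"


def poll_once(tasks: List[Task], now: float) -> List[Task]:
    """Scheduler step: find due tasks and mark them as running."""
    due = [t for t in tasks if t.state == "scheduled" and t.run_at <= now]
    for task in due:
        task.state = "running"
    return due


def execute(task: Task) -> None:
    """Worker step: run the action and report the result as a state."""
    try:
        task.action()
        task.state = "success"
    except Exception:
        # The retry-or-give-up decision is made elsewhere, not by the worker.
        task.state = "failed"
```

In a real deployment the scheduler would call something like `poll_once` on a timer and hand each due task to a separate worker process; here both steps run in one process to keep the sketch short.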

[Figure: Relay task execution flow. A workflow definition (YAML or JSON, stored in Git) feeds the scheduler, which polls every 10s for due tasks and dispatches them to N worker instances.]

Task states:

  • scheduled: waiting
  • running: in progress
  • success: done
  • failed: retry?
  • dead_letter: max retries reached

Example execution timeline:

  • Deploy new version, 1:00 PM: success
  • Sync database, 1:15 PM: failed, retry 1
  • Critical alert response: triggered, running, complete

Built-in: exponential backoff • max retry limits • task grouping • conditional execution • manual override UI
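The task states named in the execution flow can be read as a small state machine. The sketch below is one plausible reading of that flow, not a description of Relay's internals: the transition table, `transition`, and `after_failure` are all assumptions made for illustration.

```python
# Assumed transition table, inferred from the state names in the flow above.
ALLOWED = {
    "scheduled": {"running"},
    "running": {"success", "failed"},
    "failed": {"scheduled", "dead_letter"},  # retry, or give up
    "dead_letter": {"scheduled"},            # manual retry by an operator
    "success": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


def after_failure(attempts: int, max_retries: int) -> str:
    """Decide where a failed task goes next, given attempts made so far."""
    return "scheduled" if attempts < max_retries else "dead_letter"
```

Keeping the allowed moves in one table makes it easy to reject impossible changes, such as re-running a task that already succeeded.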

Core Features

Automatic Retries

Failed tasks retry with exponential backoff (1s, 2s, 4s, 8s, ...). The retry limit depends on the task: idempotent tasks can be retried indefinitely, while critical ones fail after N attempts.
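A minimal sketch of the backoff schedule, assuming a 1-second base that doubles on each attempt. The function names, the 300-second cap, and the use of `None` to mean "no retry limit" are illustrative assumptions, not Relay's configuration.

```python
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    The cap is an assumption: it keeps delays bounded for tasks
    that are allowed to retry indefinitely.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(cap, base * 2 ** (attempt - 1))


def should_retry(attempts_made: int, max_attempts: Optional[int]) -> bool:
    """True if another attempt is allowed; None means no limit."""
    return max_attempts is None or attempts_made < max_attempts


delays = [backoff_delay(n) for n in range(1, 5)]
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```

Production systems often add random jitter to these delays so that many failed tasks do not all retry at the same instant; that is left out here for clarity.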

Complete Observability

Every task execution is logged: what ran, when it ran, and what its input and output were. The dashboard lets you search, filter, and drill into any execution.
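To make the logged fields concrete, here is a hypothetical shape for one execution record along with a simple filter of the kind a dashboard search might run. The field names and the `search` helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class ExecutionRecord:
    """One logged execution: what ran, when, with what input and output."""
    task: str
    started_at: str   # ISO 8601 timestamp, e.g. "2024-01-01T13:15:00Z"
    input: Any
    output: Any
    state: str        # e.g. "success" or "failed"


def search(
    records: List[ExecutionRecord],
    task: Optional[str] = None,
    state: Optional[str] = None,
) -> List[ExecutionRecord]:
    """Return records matching every filter that was supplied."""
    return [
        r for r in records
        if (task is None or r.task == task)
        and (state is None or r.state == state)
    ]
```

Records are frozen (immutable) in this sketch because an audit trail is only useful if entries cannot be changed after the fact.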

Manual Controls

Need to rerun a task that failed? Trigger it from the UI. Need to cancel a long-running workflow? Click a button. No SSH required.

Dead Letter Queues

Tasks that exhaust their retries move to a dead letter queue, a holding area where nothing runs automatically. Ops can investigate, fix the underlying issue, and retry when ready.
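The holding area described above might look roughly like this. It is an in-memory sketch under stated assumptions: the `DeadLetterQueue` class and its `park`, `inspect`, and `release` methods are hypothetical names, and a real system would persist these entries.

```python
from typing import Dict, List, Tuple


class DeadLetterQueue:
    """Holds tasks that ran out of retries until an operator acts."""

    def __init__(self) -> None:
        # Maps task id -> last error message, in the order tasks were parked.
        self._parked: Dict[str, str] = {}

    def park(self, task_id: str, last_error: str) -> None:
        """Move a task into the holding area with its last error."""
        self._parked[task_id] = last_error

    def inspect(self) -> List[Tuple[str, str]]:
        """List parked tasks and their errors for investigation."""
        return list(self._parked.items())

    def release(self, task_id: str) -> str:
        """Remove a task so it can be scheduled again; KeyError if absent."""
        del self._parked[task_id]
        return task_id
```

The key property is that nothing leaves the queue on its own: a task is only released after someone has looked at the error and chosen to retry.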

How It Improves Delivery

  • 90% reduction in manual ops work
  • 5 min MTTR (mean time to recovery)
  • 0 tasks run twice by accident
  • 100% of task history auditable

With Relay, we went from worrying about whether background jobs would complete to confidently deploying complex multi-step workflows. Recovery from failures is now automatic.

Need reliable background job handling? Let's talk about your infrastructure →