Staff Operational Resilience Engineer

  • Leeds, UK
  • Full time
  • Competitive
  • 20th October 2025

Full Description

Overview
At Fanatics Betting & Gaming (FBG), a core part of Fanatics’ mission to build the ultimate end-to-end digital sports platform, we’re shaping the future of sports betting. We are seeking a Staff Operational Resilience Engineer to strengthen the reliability, transparency, and operational resilience of our platforms.

This role is not a traditional SRE position. It combines technical awareness with strong cross-functional coordination to ensure platform stability, effective incident response, and operational excellence. As a first responder and process owner for high-priority incidents, you will bridge communication between engineering teams and operational stakeholders, reduce the frequency and impact of Priority 1 (P1) incidents, and embed a culture of continuous improvement.

Success in this role will mean P1 incidents occur less frequently and with reduced business impact, with measurable improvements in Mean Time to Mitigation (MTTM). Incident response processes will be smooth, repeatable, and transparent, keeping engineering teams focused while stakeholders remain informed. Post-incident reviews will drive actionable follow-ups that prevent recurrence, while proactive operations through automation, runbooks, and enhanced monitoring will reduce day-to-day overhead. By tracking, reporting, and acting on key operational health metrics, you will help build data-driven resilience across the business.

Responsibilities

  • Incident Command & Escalation:

    • Participate in the on-call rotation as primary incident commander for P1 incidents.
    • Lead real-time response across engineering, product, and operations teams.
    • Assess customer impact, manage bridge calls, and deliver timely executive updates.
    • Ensure decisions, timelines, and actions are accurately captured in real time.
       
  • Monitoring, Runbooks, Triage & Dispatch:

    • Monitor critical Slack channels for escalations or support requests.
    • Execute documented runbooks where applicable.
    • Review and act on Datadog dashboards and alerts, escalating when necessary.
    • Triage unresolved issues, ensuring accurate ticketing, ownership, and priority.
       
  • Post-Incident Management:

    • Track and ensure timely closure of all action items from incidents and support cases.
    • Lead or participate in retrospectives and incident reviews.
    • Drive the Correction of Errors (CoE) process to prevent recurrence of customer-impacting issues.
    • Promote visibility into operational health metrics and knowledge sharing.
       
  • Operational Excellence & Process Improvement:

    • Contribute to the Operational Excellence program to mature incident handling practices.
    • Enhance templates, training materials, and incident documentation.
    • Propose improvements to incident management processes, reporting cadence, and operational tooling.
    • Collaborate with teams to strengthen observability, alerting, and escalation systems.
    • Ensure metadata accuracy for all P1 and P2 incidents.

Required Qualifications

  • 5+ years in a technical or incident-response-focused role within cloud-based systems or platform operations.
  • Experience coordinating incident response across multiple engineering teams.
  • Working knowledge of:
    • Datadog, GitHub, feature flagging tools
    • Common stacks such as Java/Spring Boot
    • Terraform-managed infrastructure
       
  • Familiarity with Agile workflows and software delivery practices.
  • Strong communication skills, with the ability to translate technical detail into clear updates for both engineers and executives.
  • Experience working in metrics-driven environments with KPIs such as Change Failure Rate, Bug Escape Rate, MTTD/MTTM/MTTR, and System Uptime/Availability.

Other Qualifications / Nice to Have

  • Experience in high-availability, high-scale industries such as sports betting, online gaming, or entertainment.
  • Knowledge of ITIL or incident management best practices (certification not required).
  • Familiarity with incident tooling such as PagerDuty or FireHydrant.

Ready to build the future of sports betting? If you possess some of these skills but not all of them, we still encourage you to apply!

Please note that visa sponsorship is not available for this position. We are open to fully remote candidates based in the United Kingdom or Ireland, but we strongly encourage those who can join us on campus two days per week.

The organisation

Fanatics
  • Data & Technology
  • New York, USA
  • 2000+ employees

Relentlessly Enhancing the Fan Experience
