Site Reliability Engineer operating in a zero-downtime, latency-sensitive trading environment. Experienced in designing, administering, and automating hybrid infrastructure across AWS cloud environments, on-premise virtualised infrastructure, and dedicated bare-metal systems. Owns high-availability observability, scheduling, and automation platforms supporting 12,000+ hosts across production, staging, and development environments. Strong Linux systems administrator with deep automation expertise and production incident leadership experience. Comfortable operating across infrastructure architecture, cluster design, workload orchestration, automation engineering, and secure multi-tenant environments.
Overview
3 years of professional experience
Work History
Site Reliability Engineer – Monitoring, Management & Automation Systems
Susquehanna International Group (SIG)
01.2023 - 02.2026
Owned and operated 16 high-availability enterprise platforms within a regulated trading environment supporting 12,000+ hosts.
Designed, provisioned, and maintained infrastructure across AWS cloud environments, on-premises virtualised systems, and dedicated bare-metal servers.
Supported monitoring, scheduling, and automation platforms deployed across mixed infrastructure models (cloud and physical).
Provisioned new physical hardware and virtual machines to support platform expansion and performance scaling.
Worked with Kubernetes environments, provisioning pods and supporting platform-level deployments for observability and automation workloads.
Configured and maintained load balancers, reverse proxies, SSL termination, and firewall segmentation to secure multi-tenant environments.
Managed LDAP integration and enterprise access controls across production systems.
Ensured performance optimisation in latency-sensitive trading infrastructure, minimising overhead from logging agents and monitoring collectors.
Architected and administered enterprise-scale Splunk clusters (Indexers, Search Heads, Heavy Forwarders, Deployment Servers) and ELK clusters.
Managed log ingestion from 12,000+ hosts across production, staging, and development environments.
Designed retention, index, and shard allocation strategies balancing performance and cost.
Owned Checkmk HA clusters with custom monitoring checks and alerting frameworks.
Designed Prometheus + Thanos architecture for resilient, long-term time-series storage.
Built executive and trader-facing dashboards (P&L analytics, operational KPIs, risk visibility).
Architected and maintained highly available TIDAL Enterprise Scheduler clusters (Fault Monitor, Primary and Secondary nodes).
Guided engineering teams in designing resilient job workflows and dependency chains.
Ensured reliability and recoverability of business-critical automated processes.
Automated infrastructure provisioning, scaling, and configuration using Ansible and AWX.
Developed modular playbooks for monitoring platform lifecycle management and remediation workflows.
Integrated automation platforms into CI/CD pipelines to standardise deployments.
Administered Octopus Deploy clusters and enterprise Artifactory repositories.
Supported GitLab and Bitbucket service administration.
Served as senior escalation point for infrastructure and platform incidents.
Led triage and resolution during live trading production incidents.
Executed sensitive upgrades and migrations with zero unplanned downtime.
Reduced recurring operational support tickets by 15–30 per week through automation initiatives.
Balanced product ownership responsibilities with hands-on systems administration.