Job Description
We’re looking for a DevOps / SRE Engineer who will own the reliability, scalability, and automation of The Sports Market’s cloud infrastructure. This is a full-time contract-to-hire role, offering the opportunity to transition into a long-term core engineering position after an initial contract period.
You’ll be instrumental in our migration from a managed ledger platform to a fully self-hosted, Kubernetes-driven, AWS-based architecture. This role puts you at the heart of The Sports Market’s next evolution: building the platform that powers a global sports trading ecosystem.
If you thrive in cloud-native environments, enjoy automating complex systems, and want to help architect a next-generation platform, this role is built for you.
Key Responsibilities
Cloud Infrastructure & AWS Platform Engineering
– Build and maintain AWS infrastructure using Terraform (VPC, EKS, networking, IAM, Secrets Manager, Route 53, ALBs/NLBs).
– Operate and optimize production-grade EKS clusters: node groups, autoscaling, RBAC, OIDC integration.
– Implement TLS, certificates, ingress controllers, and network policies.
– Ensure secure, consistent multi-environment deployments across staging and production.
Kubernetes Operations
– Deploy and manage workloads for integrations, adapters, backend services, ledger components, and payment orchestration.
– Configure Helm charts/manifests, resource limits, autoscaling (HPA/VPA), and pod governance.
– Support distributed ledger components managed via Catalyst Blockchain Manager (CBM), including Canton participants and sequencer nodes.
– Maintain operational reliability for critical workloads: event ingestion, trading integrations, settlement flows, payment orchestration, and automations.
CI/CD, GitOps & Automation
– Build and maintain CI/CD pipelines (GitLab → ArgoCD) for automated deployments and infrastructure provisioning.
– Implement GitOps patterns and progressive delivery strategies (blue/green, canary).
– Automate secrets management, configuration flows, and cluster operations.
Observability, Monitoring & System Insights
– Expand platform observability using Datadog, Prometheus/Grafana, and log aggregation pipelines.
– Build dashboards and alerts for Kubernetes, ledger nodes, integrations, payment workflows, and API workloads.
– Establish SLIs/SLOs and ensure system reliability targets are consistently met.
System Stability & Reliability Engineering
– Investigate incidents, identify root causes, and implement long-term reliability improvements.
– Improve resiliency through redundancy, autoscaling, and failure recovery strategies.
– Maintain deployment safety, rollback strategies, and operational runbooks.
Security, Compliance & Operational Hardening
– Implement IAM least-privilege policies, encryption, secrets management, and secure network segmentation.
– Maintain secure ingress patterns for third-party services (payments, KYC, trading).
– Ensure operational readiness and compliance alignment with platform standards.
Collaboration & Platform Support
– Work closely with backend and full-stack teams to ensure smooth deployments and runtime reliability.
– Support teams during platform migration efforts and environment transitions.
– Participate in incident response and observability improvements, and promote DevOps best practices across teams.
Qualifications
– Bachelor’s degree in Computer Science or Engineering, or equivalent experience.
– 5+ years of DevOps/SRE experience operating production-grade systems.
– Strong hands-on experience with:
  – Kubernetes operations
  – AWS services (EKS, VPC, IAM, ALB/NLB load balancers, Secrets Manager, Route 53)
  – Terraform (IaC)
  – GitOps tooling (ArgoCD)
  – CI/CD pipelines (GitLab preferred)
  – Docker & containerized systems
  – Datadog (APM, logs, dashboards)
Preferred Qualifications
– Experience with distributed systems or ledger/blockchain platforms (Canton/CBM a plus).
– Familiarity with Prometheus/Grafana and alerting best practices.
– Experience supporting payment/KYC integrations from an infrastructure perspective.
– Background in high-availability or real-time platforms.
– AWS, Terraform, Kubernetes, or DevOps certifications.
Work Environment
– 100% remote workforce
– Modern cloud-native architecture
– High ownership, fast-moving environment
– Direct influence on the next generation of our platform