Senior Site Reliability Engineer, Database Operations: ClickHouse : GitLab

March 21, 2025

Job Description

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly 24x7x365. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the GitLab codebase. We specialize in systems, whether networking, the Linux kernel, or a more specific interest in scaling, algorithms, or distributed systems. This role covers the following functions:

Design, build, and maintain ClickHouse and PostgreSQL clusters to support high-demand, enterprise-scale workloads.

Provision and orchestrate cloud infrastructure in GCP using configuration management tools (Ansible, Chef), IaC (Terraform), the Kubernetes ecosystem (Helm charts, Operators), and distributed consensus (etcd).

Design and implement enterprise-grade, high-availability ClickHouse solutions with ClickHouse Keeper, sharding, and replication, optimized for large-scale and dynamic datasets.

Optimize and scale high-transaction PostgreSQL clusters with Patroni and streaming replication for GitLab’s core applications on GCP.

Build and maintain early warning systems, monitoring, and alerting tools (e.g., Prometheus/Grafana) to predict capacity needs, monitor query latency and replication lag, and ensure resource optimization across platforms.

Enable cross-database integrations and workflows, such as ClickHouse-to-PostgreSQL data federation, CDC, and logical replication, to support hybrid analytics.

Respond to platform alerts, user emergencies, and support requests while ensuring strict adherence to SLOs, including during SRE on-call rotations.

Enhance infrastructure security by implementing and updating measures that protect GitLab’s systems and ensure compliance with regulatory requirements (e.g., GDPR, FedRAMP, SOC2, ISO).

Partner with internal and external compliance assessors as Subject Matter Experts during certifications and recertifications.

Collaborate with engineering teams to address architectural bottlenecks, plan service rollouts and migrations, and shape the future roadmap while maintaining strong operational readiness.

Mandatory technical skills and experience

  1. Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale
  2. Advanced cloud infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators, and Kubernetes
  3. Solid experience with at least one programming language: Go, Ruby, or Python
  4. Advanced experience with Linux
  5. Extensive on-call experience as an SRE supporting mission-critical systems
  6. Solid incident management experience across all phases: Analysis, Remediation, RCA, and Corrective Actions
  7. Solid experience implementing monitoring at scale (preferably Prometheus and Grafana)

Mandatory non-technical skills, experience and characteristics

  1. Willingness and ability to live and promote GitLab’s unique CREDIT Values in one’s day-to-day work and interactions with teammates.
  2. Superior verbal and written communication skills
  3. Cool, collected, and composed under pressure
  4. Comfortable and productive working asynchronously across time zones and cultures, at the speed and scale of the business.
  5. Enable others to excel
  6. Be a Leader of One
  7. Act Like an Owner with GitLab’s resources.

How GitLab will support you