Senior Data Platform Reliability Engineer

Job Description

Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.

Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).

Design for customer experience. Help data scientists and customers reduce failed or slow jobs, improve time-to-data, and optimize costs, so customers notice faster pipelines and fewer surprises.

Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.

Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.

Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.

Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.

Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.

Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.

Qualifications

Background: Bachelor’s in IT/Engineering (or equivalent practical experience).

Data operations: Hands-on support for ETL/ELT, SQL, and production pipelines/workflows.

Platform depth: Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).

Scripting/Programming: Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.

Cloud & containers: Real-world production use of AWS or GCP as a user or administrator; Kubernetes (or Docker) for scheduling/scale.

Ops craft: Incident management, post-incident reviews, change management, and service reporting.

Communication: Clear customer-facing comms (status updates, RCAs, runbooks).

Tenure: 5+ years across the domains above, with depth in at least 1–2 tools per domain.

About The IT Solutions Provider

IT Solutions Provider specializes in making data center and cloud operations teams thrive. Our global team of service architects, infrastructure admins, and software engineers has built and operated some of the world's largest, most scalable environments over the past two decades. Our philosophy is about making technology "werk" for our customers by tailoring solutions to their exact needs. We believe that products, services, tools, and processes should serve the needs of people, NOT the other way around. From 24/7 monitoring to infrastructure design to application development, we've got you covered.

IT Solutions Provider

Mandaluyong City

Salary: 80,000-120,000/month

Position Level: Professional

Job Level: Experienced Hire

Job Type: Full Time

Hiring Until: 11/29/2025
