Senior Data Platform Reliability Engineer

Job Description

  • Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
  • Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).
  • Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time-to-data, and optimize costs, so customers notice faster pipelines and fewer surprises.
  • Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
  • Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
  • Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.
  • Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.
  • Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.
  • Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.

Qualifications

  • Background: Bachelor’s in IT/Engineering (or equivalent practical experience).
  • Data operations: Hands-on support for ETL/ELT, SQL, and production pipelines/workflows.
  • Platform depth: Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).
  • Scripting/Programming: Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.
  • Cloud & containers: Real-world AWS or GCP usage in production environments as a user or administrator; Kubernetes (or Docker) for scheduling/scale.
  • Ops craft: Incident management, post-incident reviews, change management, and service reporting.
  • Communication: Clear customer-facing comms (status updates, RCAs, runbooks).
  • Tenure: 5+ years across the domains above, with depth in at least 1–2 tools per domain.

About The IT Solutions Provider

    IT Solutions Provider specializes in making data center and cloud operation teams thrive. Our global team of service architects, infrastructure admins, and software engineers has built and operated some of the world's largest, most scalable environments over the past two decades. Our philosophy is about making technology "werk" for our customers by tailoring solutions to their exact needs. We believe that products, services, tools, and processes should serve the needs of people, NOT the other way around. From 24/7 monitoring to infrastructure design to application development, we've got you covered.

    Senior Data Platform Reliability Engineer

    IT Solutions Provider

    Mandaluyong City

    Salary

    80,000-120,000/month

    Position Level

    Professional

    Job Level

    Experienced Hire

    Job Type

    Full Time

    Hiring Until

    11/29/2025