Site Reliability Engineer
Infrastructure Commander • Chaos Warrior • DevOps Champion
Site Reliability Engineer with production experience at PhonePe, managing large-scale Linux infrastructure, NGINX edge systems, distributed databases, and observability platforms. Experienced in on-call operations, incident response, HA cluster design, and infrastructure automation across bare metal environments. Strong focus on reliability, performance, and reducing operational toil through internal tooling.
PhonePe Private Limited, Bangalore
July 2024 - Present
PhonePe Private Limited, Bangalore
February 2024 - July 2024
Build observability infrastructure for 800+ hosts using Prometheus, Grafana, and InfluxDB.
Manage and operate high-availability clusters for MySQL Galera, Percona, Aerospike, and RabbitMQ.
Infrastructure automation and configuration management using SaltStack, Ansible, and Bash scripting.
Manage NGINX edge layer serving millions of requests per second with zero-downtime deployments.
Design and operate HA clusters on bare metal infrastructure with KVM/QEMU hypervisor-based virtualization.
Developed a Slackbot (70% efficiency gain), JIRA portals, and a CMR generator (hours to seconds).
Participate in round-the-clock on-call rotations for staging and production environments, responding to incidents and troubleshooting issues to maintain service availability for millions of users.
Manage and operate the NGINX edge layer serving millions of requests per second. Execute configuration changes and zero-downtime deployments.
Administer and maintain distributed data systems including MySQL Galera, Percona, and Aerospike. Execute schema changes, checksums, truncates, and production maintenance tasks.
Built and maintained observability infrastructure for 800+ hosts using Prometheus, Grafana, InfluxDB, Zabbix, Riemann, and PMM, improving infrastructure visibility and incident detection.
Designed and operated high-availability clusters for Percona/MariaDB, RabbitMQ, Elasticsearch, and Aerospike on bare metal infrastructure using KVM/QEMU virtualization.
Sole developer of a production automation platform that has generated 1500+ CMRs and database scripts. Implemented an "error-zero" step architecture for DB alters and deployment scripts, reducing production incidents to nearly zero. Also developed a Slackbot (70% efficiency boost) and a JIRA portal.
Computer Science and Engineering
BMS Institute of Technology, Bangalore
Ready to join the mission? Let's collaborate on building robust, scalable infrastructure.