What you will accomplish:
Lead Incident Management: Act as the Incident Commander to drive resolution of major incidents, manage alarms, and ensure effective communication with leadership and partner teams.
Proactive Monitoring: Continuously monitor the health of eBay's critical services to identify and address potential issues before they escalate.
Collaborative Problem Solving: Work closely with partner teams to resolve recurring technical issues, onboard new alerts, and develop high-quality Standard Operating Procedures (SOPs).
Automation and Process Enhancement: Identify and implement opportunities to enhance automation and reduce manual workload, improving overall efficiency.
Solution Development: Collaborate with Architecture, Engineering, and Operations teams to develop solutions that ensure high site availability, reliability, and performance.
Enhance Monitoring Tools
What you will bring:
3 years of experience in large-scale internet/server environments, including cloud computing and multi-tier architectures.
Strong incident management and leadership skills, with excellent technical triage and troubleshooting abilities, especially during crises. (for TDO)
Hands-on software engineering skills, including Java, Python, Go, etc.
Expert knowledge of large-scale web operations, including web-based Java/J2EE architectures, JVM configurations, and a deep understanding of UNIX, Linux, networking (TCP/IP), and databases (both relational and NoSQL).
Experience in Android and iOS application debugging.
Experience with observability tools such as Grafana and Prometheus, and skills in documenting procedures for knowledge management.