- Support and maintain services by measuring and monitoring availability, latency, and overall system health.
- Develop, manage and support SRE tools and applications.
- Engage in improving the whole lifecycle of services from inception through deployment, operations, and refinement.
- Analyze logs and telemetry data by writing monitoring and automation code.
- Provide on-call support to first-level production support teams.
- Provide hands-on technical expertise during service-impacting events.
- Collaborate with other engineers on code reviews, internal infrastructure improvements and process enhancements.
- Comfortable working with large-scale server deployments.
- Knowledge of additional programming languages and platforms, including Kubernetes, AWS, Kafka, Cassandra, and Hadoop.
- Strong ability and enthusiasm to learn new technologies in a short time.
- Self-starter with vision and strong leadership capabilities.
- Excellent communication skills for collaborating across many participating teams.
You will work with internal teams across many other groups to lead and deliver best-in-class products in an exciting, fast-paced environment. Dynamic, smart people and inspiring, innovative technologies are the norm here. Will you join us in crafting solutions that do not yet exist?
- Driven approach to continually improving service levels
- Consistent track record of troubleshooting and resolving issues in live production environments and implementing strategies to prevent their recurrence
- Proficiency coding in Python, Java, Bash, or similar languages
- Strong grasp of Linux systems, networking, and security
- Experience with monitoring tools such as Splunk, Nagios, and Grafana
Education & Experience
BS degree in computer science or an equivalent field with 5+ years of experience, or MS degree with 3+ years of experience, or equivalent.