Description
We are looking to hire a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team. The ideal candidate should have expertise in DevOps and a deep understanding of Service Level Management (SLM) metrics, along with experience in event-driven infrastructure projects utilizing tools like Terraform, New Relic, Kubernetes, AWS, and Kafka.

In this role, you will serve as a vital member of the Platform Engineering team, collaborating with other engineering groups to ensure our platform infrastructure tools meet their needs and positively impact Developer Experience. Additionally, you will assist teams in identifying the appropriate configurations and thresholds for alerts or automations within their applications.

Kindly note that this role supports remote work, but only from within Ukraine.

Responsibilities
- Design scalable and highly available systems, implementing solutions that use load balancing, auto-scaling patterns, canary releases, and blue-green deployments
- Develop monitoring and logging dashboards with tools such as New Relic, Prometheus, Grafana, and Datadog, ensuring observability through metrics, tracing, log aggregation, and alerting
- Assist teams in defining settings and thresholds for application-specific alerts and automations, acknowledging varying application performance requirements like response times and resource constraints
- Monitor system reliability and optimize performance using tools such as New Relic while applying DORA metrics to enhance development and operational performance, and maintain compliance with SLM metrics like SLAs, SLOs, and SLIs
- Advocate for and implement "Chaos" engineering practices to strengthen system resiliency
- Collaborate with cross-functional teams to improve platform engineering practices and ensure effective metrics analysis

Requirements
- Knowledge of Infrastructure-as-Code tooling, such as Terraform, for infrastructure management
- Understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments
- Proficiency in DevOps metrics (e.g., DORA) to measure and improve development and operational performance
- Familiarity with Service Level Management (SLM) metrics (e.g., SLAs, SLOs, and SLIs) to define, monitor, and ensure compliance within expected standards
- Expertise in monitoring, logging, and observability tools such as New Relic, Prometheus, Grafana, and Datadog
- Background in using Kafka to enhance the performance of event-driven, real-time data processing and streaming architectures
- Competency in tools that measure SLM, DevOps, and DORA metrics, including Apache DevLake, Grafana, and New Relic
- Skills in managing cloud infrastructure with providers such as AWS, Azure, or GCP
- Proficiency in CI/CD pipeline tools such as GitHub Actions, Jenkins, or GitLab CI
- Analytical skills to interpret metrics and provide actionable improvements
- Strong communication skills to foster collaboration within teams and with stakeholders

Nice to have
- Understanding of Observability-as-Code tools and best practices
- Background in using "Chaos" engineering methodologies to enhance system resiliency

We offer

With us you can:
- Work on a flexible schedule remotely or from any of our comfortable offices or coworking spaces in Ukraine
- Receive the necessary equipment to perform your work tasks
- Change projects and technology stacks within EPAM
- Gain experience in various business domains (Insurance, E-commerce, Healthcare, Finance, Travelling, Media, Artificial Intelligence, and more)
- Relocation opportunities may be available for eligible candidates, depending on the role and openings at other EPAM locations
- Participate in volunteer, charity programs and communities (both technical and interest-based)

We focus on your professional growth:
- You can plan your individual career path together with your manager
- Receive regular feedback from colleagues
- Improve your English for free with certified teachers (Speaking Clubs, client interview preparation courses, etc.)
- Get the opportunity to undergo free training and certification in AWS, GCP, or Azure Clouds
- Use the internal E-learn training program (18,200+ specialized training and mentoring programs)
- Access corporate accounts on LinkedIn Learning, Get Abstract and other partner resources
- Study at EPAM Solution Architecture School with the instructors who are practicing architects
- Develop as a leader, join Delivery Management, Resource Management, Leadership Essentials school and more
- Participate in internal communities (500+ meetups, technical discussions, brainstorming sessions, online events and conferences annually)

What we offer:
- Vacation and sick leave (including a sick leave without a medical certificate)
- A wide range of Voluntary Medical Insurance programs providing both medical treatment and various preventive options (including sports activities)
- Medical insurance for family members at corporate rates
- Company support during significant life events (childbirth or adoption, marriage, etc.)
- Support for psychological comfort: discounts on services from mental health specialists or coaches, thematic training
- E-kids program - a free programming language training program for EPAMers' children

Kindly be advised that the set of benefits, including learning, certification, and other opportunities, may vary depending on the role you apply for. Our recruiter will be able to share more details about the specific opportunity during your general interview.

EPAM strives to provide its global team of over 61,700 professionals in more than 55 countries with opportunities for professional growth from day one of collaboration. Our colleagues are the source of EPAM's success, so we value cooperation, strive to always understand our clients' business and aim for the highest quality standards. No matter where you are, you will join a dedicated, diverse community that will help you realize your potential to the fullest.