YAM Management - Site Reliability Engineer III
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The SRE team is responsible for keeping user-facing services and other YAM Worldwide production systems running smoothly. We specialize in systems, whether that means networking, operating systems, or a more specific interest in scaling, cloud, or distributed systems. As SREs, we are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to solve a broad spectrum of problems.
As an SRE you will:
- Be on an on-call rotation to respond to YAMWW availability incidents and provide support for operations or development escalations.
- Identify the root cause of failures, and prevent them from ever happening again.
- Run our infrastructure with Ansible, CloudFormation, and Jenkins.
- Continuously improve monitoring and alerting so that we are alerted on symptoms, not only on outages.
- Document every action so your findings turn into repeatable actions, and then into automation.
- Improve the deployment process for applications, infrastructure, and services to make it as automated as possible.
- Design, build, and maintain core infrastructure pieces that allow YAMWW services to scale from small to enterprise-class deployments.
- Debug production issues across all services and levels of the stack.
You may be a fit for this role if you:
- Think about systems: edge cases, failure modes, behaviors, and specific implementations.
- Know your way around Linux, the Unix shell, and Windows PowerShell.
- Know how to use configuration management systems, preferably Ansible.
- Are proficient in at least one scripting language.
- Have an urge to collaborate and communicate effectively.
- Have an urge to document all the things so you don't need to learn the same thing twice.
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't let it go until it has been improved or fixed.
- Have an urge to deliver quickly and iterate fast.
- Have experience with Nutanix, AWS, Ansible, Docker, CloudFormation, Kubernetes, or similar technologies.
Projects you could work on:
- Coding infrastructure automation with CloudFormation and Ansible.
- Improving our monitoring and alerting frameworks with Datadog and other tooling.
- Helping our development teams deploy and test their releases with CI/CD pipelines.
- Planning, designing, and deploying infrastructure for new retail locations for any of our sister companies.
- Identifying SLOs and SLAs that can be effectively measured and used to improve service uptime and performance.
- Designing and implementing solutions using AWS to migrate out of our on-premises data centers.
You should have technical experience with:
- Cloud frameworks (AWS preferred)
- Docker containers in an enterprise environment
- Networking (firewalls and switching)
- Virtual environments (Nutanix or VMware)
- Databases (Postgres, MSSQL, MySQL)
- Web servers for web-based applications (NGINX or Apache)
- Enterprise Linux and Windows Server
Your background should include:
- Very strong problem-solving and troubleshooting skills, including the ability to perform root cause analysis to prevent recurrence.
- A proven track record of working with preventive and predictive maintenance techniques.
- Strong analytical, planning, organizational, and time management skills.
- A Bachelor's degree in an Engineering discipline or equivalent experience.