As a Site Reliability Engineer – OpenStack Private Cloud (Storage), you will work with other SREs, Engineers, Developers and our support & operations teams to ensure maximum performance, reliability and automation of our Private Cloud deployments and infrastructure.
We recognize that manual approaches to operations do not scale, and are launching a new team in Private Cloud Engineering to tackle the significant problems of managing many discrete Private Cloud installations with multiple offerings and form factors at scale worldwide. Our Site Reliability Engineer is someone who is familiar with both software and systems engineering, with a desire not just to resolve a problem but to prevent it from recurring. You should have excellent written and verbal communication skills and be comfortable operating in a fast-paced environment.
In this role, you will focus on our private cloud storage offerings, including Ceph, Swift, and Hummingbird. These are a mix of block, filesystem and object storage systems offered as part of the private cloud product. The role requires understanding how OpenStack integrates these technologies, along with operational expertise and debugging skills.
In addition to resolving issues and automating fixes internally and downstream, you will submit patches upstream whenever a problem is better served by a fix in the upstream open source code, improving the operational and reliability aspects of the upstream projects.
- Design and architect new operational solutions, and maintain existing ones, for managing our customer environments and infrastructure across data centers and technologies, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
- Implement and maintain monitoring and alerting solutions that help discover failures in a timely fashion while working with engineers to identify root cause and fix issues.
- Provide basic to intermediate network administration and troubleshooting.
- Handle day-to-day operational management, including response, incident, event and problem management activities, alongside our service delivery and engineering teams.
- Participate in on-call rotation duties.
- Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
- Support services & deployments before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Direct experience administering, using and/or operating one or more of the following: Ceph, Gluster, Amazon S3, OpenStack Swift or other block, filesystem and object storage systems.
- Experience with algorithms, data structures, complexity analysis and software design.
- Experience in one or more of the following is a must: Python, Go, or cross-platform scripting.
- Experience with Linux systems administration and tuning.
- Experience with automation tools such as Docker, Jenkins, Ansible, and Terraform.
- Comfort with collaboration, open communication and remote teams.
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.
- A mindset that treats infrastructure and automation as code and as critical engineering tasks.