Site Reliability Engineer
TUI Travel’s vision is to make travel experiences special. To fulfil this vision, we never stop looking ahead, seeking new ways to delight our customers and grow our business. We recognise the power of digital and the massive contribution this brings to creating a truly unique and differentiated customer experience.
The Mobility Hub is committed to delivering the TUI Mobile App (TDA) for multiple countries across Europe and for all markets, reliably and frequently, at a high level of quality and stability. The scope of the TDA keeps growing, so the ability to implement a continuous delivery pipeline that provides a stable delivery process is essential.
Site Reliability Engineers (SREs) are responsible for keeping all production systems running smoothly. SREs are a blend of pragmatic operators and software engineers who apply sound engineering principles, operational discipline, and mature automation to our environments and code base.
Joining the Mobility Hub team as a Site Reliability Engineer, you will be key to maturing our organisation's continuous delivery practice and will help implement the processes that ensure high availability and quality. I build it, I run it.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviours, specific implementations.
- Have an urge to collaborate and communicate asynchronously.
- Have an urge to document all the things you do so you don't need to learn the same thing twice.
- Have an enthusiastic, go-for-it attitude.
- Have an urge for delivering quickly and iterating fast.
- Share our values, and work in accordance with those values.
Specific Mobile Apps skills:
- App Store Deployments (App Store, Google Play Store).
- iOS Code Signing and Provisioning.
- Xcode Command Line Tools.
- Android SDK Command Line Tools.
- Gradle Build Tools.
General DevOps skills:
- AWS services, e.g. EC2, ECS, ECR, Lambda.
- Monitoring and logging, e.g. ELK, CloudWatch, CloudTrail, Grafana, Dynatrace, Datadog.
- Security tooling, e.g. Tenable.
- AWS security features and best practices.
- Design of self-healing and fault-tolerant services.
- Techniques and strategies for maintaining high availability.
- CI tooling, e.g. Jenkins, GitLab CI.
- Certificate management.
Responsibilities:
- Be on a PagerDuty rotation to respond to availability incidents and customer incidents (level 2 and level 3 support).
- Use your on-call shift to prevent incidents from ever happening again.
- Spend at least 50% of your time on development and automation.
- Run our infrastructure with CloudFormation and Terraform (all environments).
- Set up monitoring and alerting that triggers on symptoms rather than on outages (application and infrastructure in all environments).
- Document every action so your findings turn into repeatable actions, and then into automation.
- Improve the deployment process to make it as boring as possible.
- Design, build and maintain core infrastructure pieces that allow scaling.
- Debug production issues across services and levels of the stack.
- Plan the growth of infrastructure.
- Write blog posts.
- Contribute to the handbook, runbooks and general documentation.
- Maintain good relationships with other engineering teams in TUI that help improve the TDA product.
- Participate in DevOps Champions/Community of Practice.