Site Reliability Engineer
- Software Engineering
- Professional
A career in IBM Software means you’ll be part of a team that transforms our customers’ challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world’s leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
We are seeking a skilled SRE to join our Platform Engineering team for the Data and AI organization within IBM Software. As part of our team, you will be responsible for designing, building, and maintaining the underlying infrastructure and tools necessary to support and enable software development, deployment, and operations at scale.
Your Role and Responsibilities
Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and applications.
Monitoring and Alerting: Set up and maintain monitoring and alerting systems to proactively identify and resolve issues before they impact customers or services.
Performance Optimization: Identify opportunities for performance optimization and work with development teams to implement improvements.
Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive the resulting initiatives.
Required Technical and Professional Expertise
- Experience with Cloud Platforms: Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform, including expertise in deploying and managing services in these environments.
- Containerized Applications: Experience managing and troubleshooting containerized applications.
- Automation and Scripting: Strong scripting skills (e.g., Python, Bash) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
- Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.
Preferred Technical and Professional Expertise
- DevOps Culture: Experience working in a DevOps culture, including a strong understanding of how development and operations teams collaborate to achieve business goals.
- Container Orchestration: Proficiency in container orchestration tools such as Kubernetes and OpenShift, including experience in deploying, managing, and troubleshooting containerized applications.
- Monitoring and Logging: Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) to monitor the health and performance of infrastructure and applications.
- Experience with Scalable Architectures: Experience designing and implementing scalable architectures for cloud-based applications, including knowledge of best practices for scalability, performance, and reliability.
- Experience with Monitoring and Observability: Experience with advanced monitoring and observability practices, including using tools such as Prometheus, Grafana, and Kubernetes-native monitoring solutions to gain insights into system performance and behavior.