Site Reliability Architect Lead

Posted 08 December 2022
Location Oklahoma City, United States of America
Job type Full Time
Discipline IT
Reference 8200

Job Description

Location: Oklahoma City, OK - Hybrid

Position Summary:

The Site Reliability Architect is responsible for providing continuous feedback on site health, reliability, availability, and user experience for all core products. Meaningful, relevant real-time measurements from production environments will be collected, aggregated, analyzed, and fed back to the Product Line team to give insight and visibility into product performance and activity. The Site Reliability Architect will provide user experience analysis to internal business partners, executive leadership, and product delivery teams to help drive changes that increase customer satisfaction, product availability, and reliability. In addition to monitoring and insight, a heavy focus will be placed on identifying automation opportunities and automating operational processes to maintain 99% availability of core products. These efforts are in addition to operational and support responsibilities: quickly responding to and resolving production incidents, and performing required administration and operational activities on systems.

Level of Responsibility:

  • Responsibilities are varied and complex.

  • May work outside area of assigned duties. Expert in own area of responsibility.

  • Works independently.

  • Resolves complex problems within area of responsibility.

  • Identifies opportunities and innovative solutions.

  • Recommends changes in procedures.

  • Reviews progress and evaluates results.

  • Authority to make decisions related to job responsibilities.


Essential Functions:

  • Contribute to definition of strategy, standardization of technologies, and establishment of patterns for rapid and continuous development and application of automated solutions to address reliability issues and automate manual tasks.

  • Contribute to the maintenance and continuity of the Site Reliability Architecture strategies and processes.

  • Define and report SLOs / SLAs for 99% availability to executive leadership and business partners.

  • Ensure solutions can be delivered without negatively impacting business partners or external customers.

  • Strong understanding of how network, storage, hosting technologies, databases and applications interact and integrate to deliver a business outcome.

  • Implement and train team members on measuring and testing site reliability using Chaos Monkey-based methodologies.

  • Implement and train team members on the tool consolidation strategy to optimize spend versus value for our end-to-end monitoring platform.

  • Influence product delivery teams to implement usability and reliability enhancements leading to improved user experience index scores and improved availability.

  • Lead, implement and train team members on measurement capability of core product availability across the external and the internal Cloud using HTTP endpoint testing and synthetic user testing.

  • Lead, implement and train team members on the DevOps principle of Feedback by creating user experience measures for all products.

  • Maintain automated site availability reporting and data platform.

  • Maintain technological awareness and provide assessments of architecture, design, code, technology and industry trends.

  • Present usability, reliability, incident, and user experience of products to senior and/or executive leadership.

  • Assist with production support, incident management, problem management, and service restoration as needed to quickly respond to and resolve production issues.

  • Provide detailed analysis and troubleshooting for system outages, providing feedback to the Product Line team.

  • Provide technical leadership for calculating system availability SLAs across all products.

  • Provide technical leadership for usage and maintenance of tools for measuring core product health in production (with opportunities to extend those capabilities all the way back through the entire DevOps pipeline).

  • Drive root cause analysis (RCA) activities with vendors and service organizations to achieve alignment on the root cause of failures and on corrective actions.

  • Review all major incidents and outages to confirm with confidence that appropriate lessons were learned, documented and appropriate actions taken to prevent similar issues in the future.

  • Serve as a lead in troubleshooting service disruptions, including those affecting composite applications.

  • The position focuses on the architecture of complex, composite systems over their life cycle taking into consideration requirements, reliability, change, configuration management and disaster recovery – including the coordination of different support teams.

  • Track all non-compliances and any remediation where Suppliers are not compliant with current Architecture standards and guidelines.

  • Work with appropriate teams to provide budget analysis information in order to make informed budget considerations regarding system replacements, upgrades in technology, and in-place applications.

  • Work with Business Relationship Managers to assess business impact levels associated with availability, incidents and problems, and establish well-defined standard outage windows.

  • Work with project teams to ensure operational requirements are defined and the proper level of event monitoring is designed into the solution.

  • Work with Service Management to define Service Levels.

  • Work with the Site Reliability Architecture community, Enterprise Architecture, and other Product Line teams to implement and contribute to strategy for DevOps CI/CD performance and monitoring quality gates within the delivery pipeline.

  • All other duties as assigned


Working Conditions:

  • Office environment.

  • May work hours outside normal schedule.


Knowledge, Skills & Abilities:

  • A working knowledge of virtual machine (VM) technology

  • Ability to think laterally and constructively question established process.

  • Ability to understand Enterprise architecture (EA) artifacts, roadmaps, principles and associated strategies

  • An understanding of configuration management or containerization toolsets

  • Basic knowledge and understanding of Security (CIA Model and PCI compliance) is a plus

  • Be proficient in one or more cloud providers, including Azure

  • Broad knowledge of corporate policies and business strategies including disaster recovery, business continuity and risk management.

  • Can identify and mitigate reliability risks

  • Design solutions with failure in mind to ensure reliability

  • Enjoy pushing scalability to the limit with high throughput services

  • Excellent communication and troubleshooting skills

  • Excellent verbal and written communication skills

  • Excellent communication skills with the ability to communicate with customers, peers, management, etc. in both formal and informal situations

  • Exceptional leadership, planning, problem-solving and organizational skills.

  • Experience balancing the service reliability, sustainability, and technical debt for services running at scale

  • Experience with continuous integration/deployment frameworks such as Jenkins

  • Experience with operational monitoring tools such as Dynatrace with a mindset towards predictive analysis

  • Experience with troubleshooting and debugging issues at any level 

  • Expert-level knowledge of IT infrastructure (networks, servers, VMs, OSs) implementation and operations.

  • Focus on performance bottlenecks and performance improvement techniques

  • Good general level knowledge of the component pieces of composite systems and associated technologies.

  • Good understanding of networking including L2 and L3 concepts, including Firewall, Load Balancing, Routing and Switching.

  • Knowledge and understanding of microservices based architectures, APIs, etc.

  • Knowledge and understanding of utility business process, objectives, milestones and structure.

  • Knowledge of fundamental networking protocols, such as TCP/IP, HTTP, SSL, and DNS

  • Like looking through metrics and logs as if it were a treasure hunt

  • Must be comfortable working with mission critical and sensitive systems, with a sense of urgency appropriate to the responsibilities

  • Prefer to build automation to perform redundant tasks rather than manually handling toil

  • Strong analytical and problem-solving skills

  • Strong facilitation skills.

  • Strong scripting skills including ability to write scripts from scratch

  • Understanding of databases and data modeling

  • Versed with the entire software development lifecycle, DevOps, and SRE practices

  • Working knowledge of automation tools


Education/Work Experience:

  • Bachelor's Degree in MIS, Engineering, CIS or other directly related discipline, and 7 years of experience building/operating IT solutions, documenting and defining Architecture (data, applications, monitoring, security); or High School Diploma/GED and 12 years of experience building/operating IT solutions, documenting and defining Architecture (data, applications, security) and Disaster Recovery.