Observability Senior Architect
Job Overview:
We are looking for professionals with at least 10 years of relevant experience setting up monitoring solutions with products such as Dynatrace, Datadog, the ELK stack, Splunk, and Grafana/Prometheus, especially in critical production environments. We also value at least 5 years of experience in end-to-end observability covering technical, user-experience, and business-outcome metrics. Experience with AIOps would be a significant advantage.
Qualification/Experience:
· Your experience with private cloud and cloud-native public-cloud (particularly AWS) hosted applications will be highly beneficial. We are particularly interested in individuals who have worked with multi-tenancy setups and data segregation on the observability and AIOps stack.
· We are looking for expertise in designing and building an Observability & Maintenance (O&M) module for multi-tenant solutions, as well as defining SLIs and setting up SLOs for these solutions.
· Experience implementing container and network monitoring, APM, RUM, log analytics, end-to-end tracing, and custom alerts using tools such as Grafana, Prometheus, and Grafana Loki (alternatively Logstash or Fluent Bit) is required. Experience with third-party products such as Dynatrace is also valuable.
· Proficiency with containers and multi-tenancy setup for the observability solution is essential. The ability to configure custom alerts, monitors, and build AIOps workflows based on telemetry is another critical aspect we are focusing on. A solid understanding of setting up integration capabilities with other systems via APIs, consuming external APIs for IAM, and ingesting metric-based telemetry via collectors is also required.
· We need someone capable of building custom observability dashboards tailored to different portfolios and personas. Setting up Synthetic Monitoring and Test Automation and integrating their telemetry into the observability stack is also a key requirement. The role also involves tenant and data segregation and the ability to obfuscate sensitive information on the common observability schema.
· Proficiency in coding, particularly in Python, Java, and Ansible scripting, is highly preferable. Experience with other public clouds (GCP/Azure) is also desirable.
· The Observability Foundation certification from the DevOps Institute, or any product-level accreditation, would be highly valuable for this role. Recognized System Architecture qualifications such as TOGAF would be a great bonus.
Responsibilities and Duties:
· The primary responsibilities for this role include architecting, designing, and ensuring the implementation of the entire observability solution to be packaged as a module within our multi-tenant private cloud solution. This also involves implementing the observability solution to monitor and apply the same feature-set across all tenants, effectively serving as a hypervisor.
· The candidate will be responsible for designing and implementing integrations and for exposing APIs to external consumers. They will also need to set up authentication and authorization controls by integrating with an IAM layer.
· Collaboration with the UI/UX teams is essential to design dashboards for the Observability & Maintenance platform for both the tenants and the host. The role also entails designing and setting up an AIOps module responsible for anomaly detection and for automated remediation workflows such as capacity scaling and container restarts.
· The candidate will also work on building Proof-of-Concept solutions to view end-to-end tube-maps or service flows for the respective tenant's services. Defining and setting up a CMDB to serve as a source for infrastructure and application telemetry is another crucial responsibility.
· They will work with other teams to ensure the system is well-tested and scales to meet tenant demands. Defining business-aligned SLIs and setting SLOs for core services and journeys will also be part of their responsibilities.
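
For candidates who want a concrete picture of the SLI/SLO work mentioned above, the following is a minimal sketch only, not our implementation. It assumes an availability-style SLI computed as the ratio of good events to total events, and an error budget derived from the SLO target. The names `Slo`, `availability_sli`, and `error_budget_remaining` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition for one service or user journey."""
    name: str
    target: float  # required fraction of good events, e.g. 0.99


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; no traffic is treated as fully healthy."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target or zero traffic leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    checkout = Slo(name="checkout-journey", target=0.99)
    sli = availability_sli(good_events=9950, total_events=10000)
    budget = error_budget_remaining(checkout, good_events=9950, total_events=10000)
    print(f"{checkout.name}: SLI={sli:.4f}, error budget remaining={budget:.0%}")
```

In a multi-tenant setup, a calculation like this would typically run per tenant and per journey, with the event counts coming from the telemetry stack rather than hard-coded values.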