To support our continued business growth in the US and in international markets, we are setting up a world-class product development and operations support center in Bangalore, India. We are looking for people with a passion for innovation and the creative use of new technologies to build consumer applications in the retail domain.
We're looking for talented and driven Site Reliability Engineers. In this role, you will work to improve the reliability and performance of our services. You will help expand, maintain, and troubleshoot a geographically diverse network of services in a 24x7 operations environment across different verticals.
Responsible for the uptime and reliability of Quotient's infrastructure.
Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
24x7x365 monitoring and response to events, incident management, problem management, change management activities, KPI reporting, and CMDB management.
Responsible for activities/projects involving data center migration to the cloud (GCP, Azure, etc.).
Responsible for managing activities/projects for the Network team, DB team, BI, etc.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Troubleshoot issues spanning all layers of the OSI model. Work with Engineering teams to achieve maximum network and application uptime and swift resolution of all issues.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems to improve reliability and velocity.
Use industry tools such as Netcool, ServiceNow, Moogsoft, Cacti, SolarWinds, Nagios, Splunk, Cloudera, PagerDuty, etc.
Provide input into processes and procedures for increasing reliability, reducing procedural errors, and managing change within the data centers.
Extensive process-level and node-level monitoring and auto-healing of the entire cluster.
Manage, provision, and service data center and cloud servers.
Contribute to back-end services and their infrastructure system design.
Responsible for identifying problem incidents and driving them to resolution.
Responsible for driving RCA for high-priority incidents and working with the respective development teams on preventive measures.
Experience in a Systems Engineering/SRE role in a large-scale environment.
Experience troubleshooting incidents/problems and working with a team to resolve large-scale production issues.
Knowledge of at least one programming language: Python, Perl, Java, C++, PowerShell, etc.
Strong knowledge of Linux systems.
Good understanding of standard networking protocols and concepts such as HTTP, DNS, TCP/IP, the OSI model, and load balancing.
Technical skills in Apache, Tomcat, Jetty, Memcached, Java, CDN technologies, network analysis tools, or equivalent.
Familiarity with logging systems like Splunk.
Experience with Puppet/Ansible.
Experience with monitoring tools such as AppDynamics, ExtraHop, SolarWinds, etc.
Experience with Git, continuous integration, and testing methods.
Effective written and verbal communication skills are required.
Strong problem-solving and troubleshooting skills are required.
Willingness to work in a 24x7x365 setup, including night shifts.