SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of the storage infrastructure for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.
SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.
What You Will Be Doing:
- Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
- Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
- Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
- Implement monitoring and alerting systems to proactively identify and address issues.
- Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
- Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
- Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
What We Need To See:
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), with 5+ years of equivalent practical experience.
- Proven experience in storage system administration and site reliability engineering.
- Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
- Experience in one or more of the following languages: Ansible, Bash, Python, Go, YAML, Java.
- Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
- Experience using observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.
Ways to stand out from the crowd:
- Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
- Strong Linux and network troubleshooting skills, with hands-on use of command-line diagnostic tools.
- Demonstrated SRE mindset, with a customer-first approach and a passion for ensuring customer satisfaction and success.
- Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
- Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.