Senior System Reliability Engineer

Nvidia • Full-time • Taiwan, Hsinchu • 1w ago

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing — with the GPU acting as the brains of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and build our teams with the most thoughtful people in the world. Join us at the forefront of technological advancement.

GPU servers are one of the fastest-growing segments for NVIDIA and the artificial intelligence industry. As computational power increases with every GPU generation, developing efficient and reliable systems is imperative. We are looking for a System Reliability Engineer to join NVIDIA's existing Reliability Engineering team, which supports NVIDIA's diverse system product range, specifically Graphics and High-Performance Computing printed circuit boards and Data Center Servers.

What you will be doing:

This position is not for silicon or chip reliability, but for printed circuit board assemblies (PCBAs) and server products, ranging from graphics cards to HGX/DGX AI servers. The position is located in Taiwan and reports to the U.S.
Work closely with contract manufacturers (CMs) and original design manufacturers (ODMs). You will have the opportunity to interact with all pertinent engineering groups and suppliers, ensuring the desired reliability is achieved using Design for Reliability (DfR) approaches, including FMEA and DoE.
Establish, deliver, and maintain product reliability standards and metrics for NVIDIA's new system technologies, using existing tools and processes or developing new ones as required.
Provide reliability predictions along with test plan definition and methods to assess and drive product reliability to the desired levels.
Perform and lead appropriate testing with associated failure analysis and recommendations for improving designs and manufacturing.
Develop and present methods of correlating reliability test results with actual field performance.

What we need to see:

BS/MS in EE/ME/Computer Engineering or equivalent experience; graduate degree preferred.
5+ years of experience in a hardware validation/reliability environment related to printed circuit boards and servers.
Hands-on experience with Reliability demonstration & testing along with accelerated life methods such as Thermal Cycling, Shock & Vibration, ALT/HALT/HASS, Burn-in, and ORT for components, subassemblies, and complete products.
Understanding of power supplies, memory, high-speed I/O, PCI Express, Ethernet, and I2C.
Strong command and understanding of statistical concepts/models/analysis and how they relate to product reliability & life analysis.
Fluent in Chinese and English, with good verbal and written skills and the ability to communicate at a high level.
Self-motivating, independent, and committed to getting things done.
Good project management skills and ability to balance multiple simultaneous projects during development and production stages.

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers; we have some of the most forward-thinking and hardworking people in the world working for us and, due to unparalleled growth, best-in-class teams are rapidly growing. If you’re creative and autonomous with a real passion for your work, we want to hear from you!
