Senior NPU Kernel / Operator Engineer

San Jose

USA

Permanent

Hardware

Overview

We are seeking a Senior NPU Kernel / Operator Engineer to lead the development and optimization of high-performance deep learning operators for a next-generation AI accelerator platform.

This role focuses on kernel design, hardware-aware performance tuning, and correctness validation across a broad range of neural network workloads.

The ideal candidate will have deep experience optimizing compute-intensive software for GPUs, NPUs, DSPs, SIMD units, embedded accelerators, compiler backends, or HPC systems, and the ability to reason from model-level requirements down to hardware execution efficiency.

Responsibilities

Design, implement, and optimize high-performance operators such as:
- Normalization
- Reduction
- Transpose
- Reshape
- Gather / Scatter
- Quantization / Dequantization
- Fused elementwise kernels
Own performance optimization across key hardware constraints, including:
- Memory bandwidth
- SRAM utilization
- Data reuse
- DMA latency
- Bank conflicts
- Compute utilization
Develop advanced optimization strategies including:
- Tiling
- Blocking
- Vectorization
- Memory scheduling
Analyze and resolve bottlenecks related to:
- Memory hierarchy
- Synchronization overhead
- Instruction scheduling
- Data movement
Validate operator correctness and numerical precision against reference implementations (e.g., PyTorch, NumPy)
Benchmark and profile kernel performance across simulation, emulation, FPGA, or production silicon environments
Debug complex issues involving:
- Tensor layouts
- Precision loss
- Memory access patterns
- Performance regressions
Build performance models and optimize operators toward hardware roofline limits
Collaborate closely with compiler, runtime, hardware architecture, and ML model teams to improve operator APIs and execution efficiency
Document optimization strategies, tensor layouts, and performance improvements
Mentor junior engineers and help define engineering best practices

Requirements

BS / MS / PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field
5+ years of experience in one or more of the following:
- Accelerator programming
- GPU / NPU development
- Compiler backend engineering
- Embedded systems
- High-performance computing
- Performance optimization
Strong programming skills in:
- C/C++
- Python
Deep understanding of:
- Tensor computation
- Neural network operators
Strong knowledge of computer architecture concepts:
- Memory hierarchy
- Bandwidth and latency analysis
- Cache / SRAM behavior
- Parallelism and synchronization
- Data locality and vectorization
Proven experience optimizing performance-critical kernels or numerical compute pipelines
Ability to identify and resolve performance bottlenecks at every level, from algorithm design down to hardware execution
Strong debugging, profiling, and analytical problem-solving skills

Preferred Experience

Experience with one or more of the following:

Frameworks / Tooling

CUDA
Triton
OpenCL
TVM
MLIR
Halide

Systems Experience

SIMD
DSP
Embedded C/C++
GPU / NPU programming
FPGA development
HPC systems

Advanced Optimization Techniques

Tiling and blocking
Vectorization
Memory access optimization
Instruction scheduling
Mixed-precision optimization

Numerical Formats

FP32
FP16
BF16
FP8
INT8 / INT4

AI Accelerator Architecture Familiarity

Matrix engines
Vector engines
Systolic arrays
DMA engines
SRAM / NoC / DRAM systems

Bonus

Experience with simulator, emulator, FPGA, or silicon bring-up

Opportunity

Join a highly technical team building cutting-edge AI compute infrastructure and contribute directly to the performance of next-generation machine learning hardware. This is an opportunity to work at the intersection of AI systems, compiler optimization, and hardware acceleration, with significant ownership and technical impact.

Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.

Reece Waldon