Senior NPU Kernel / Operator Engineer

San Jose

USA

Permanent

Hardware

Overview

We are seeking a Senior NPU Kernel / Operator Engineer to lead the development and optimization of high-performance deep learning operators for a next-generation AI accelerator platform.

This role focuses on kernel design, hardware-aware performance tuning, and correctness validation across a broad range of neural network workloads.

The ideal candidate will have deep experience optimizing compute-intensive software for GPUs, NPUs, DSPs, SIMD architectures, embedded accelerators, compiler backends, or HPC systems, with the ability to reason from model-level requirements down to hardware execution efficiency.


Responsibilities

  • Design, implement, and optimize high-performance operators such as:
    • Normalization
    • Reduction
    • Transpose
    • Reshape
    • Gather / Scatter
    • Quantization / Dequantization
    • Fused elementwise kernels
  • Own performance optimization across key hardware constraints, including:
    • Memory bandwidth
    • SRAM utilization
    • Data reuse
    • DMA latency
    • Bank conflicts
    • Compute utilization
  • Develop advanced optimization strategies including:
    • Tiling
    • Blocking
    • Vectorization
    • Memory scheduling
  • Analyze and resolve bottlenecks related to:
    • Memory hierarchy
    • Synchronization overhead
    • Instruction scheduling
    • Data movement
  • Validate operator correctness and numerical precision against reference implementations (e.g. PyTorch, NumPy)
  • Benchmark and profile kernel performance across simulation, emulation, FPGA, or production silicon environments
  • Debug complex issues involving:
    • Tensor layouts
    • Precision loss
    • Memory access patterns
    • Performance regressions
  • Build performance models and optimize operators toward hardware roofline limits
  • Collaborate closely with compiler, runtime, hardware architecture, and ML model teams to improve operator APIs and execution efficiency
  • Document optimization strategies, tensor layouts, and performance improvements
  • Mentor junior engineers and help define engineering best practices

Requirements

  • BS / MS / PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field
  • 5+ years of experience in one or more of the following:
    • Accelerator programming
    • GPU / NPU development
    • Compiler backend engineering
    • Embedded systems
    • High-performance computing
    • Performance optimization
  • Strong programming skills in:
    • C/C++
    • Python
  • Deep understanding of:
    • Tensor computation
    • Neural network operators
  • Strong knowledge of computer architecture concepts:
    • Memory hierarchy
    • Bandwidth and latency analysis
    • Cache / SRAM behavior
    • Parallelism and synchronization
    • Data locality and vectorization
  • Proven experience optimizing performance-critical kernels or numerical compute pipelines
  • Ability to identify and resolve performance bottlenecks from algorithm through to hardware execution
  • Strong debugging, profiling, and analytical problem-solving skills

Preferred Experience

Experience with one or more of the following:

Frameworks / Tooling

  • CUDA
  • Triton
  • OpenCL
  • TVM
  • MLIR
  • Halide

Systems Experience

  • SIMD
  • DSP
  • Embedded C/C++
  • GPU / NPU programming
  • FPGA development
  • HPC systems

Advanced Optimization Techniques

  • Tiling and blocking
  • Vectorization
  • Memory access optimization
  • Instruction scheduling
  • Mixed-precision optimization

Numerical Formats

  • FP32
  • FP16
  • BF16
  • FP8
  • INT8 / INT4

AI Accelerator Architecture Familiarity

  • Matrix engines
  • Vector engines
  • Systolic arrays
  • DMA engines
  • SRAM / NoC / DRAM systems

Bonus

  • Experience with simulator, emulator, FPGA, or silicon bring-up

Opportunity

Join a highly technical team building cutting-edge AI compute infrastructure and contribute directly to the performance of next-generation machine learning hardware. This is an opportunity to work at the intersection of AI systems, compiler optimization, and hardware acceleration, with significant ownership and technical impact.

Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.

Reece Waldon
