Darwin Recruitment

Senior NPU Kernel / Operator Engineer

Overview

We are seeking a Senior NPU Kernel / Operator Engineer to lead the development and optimization of high-performance deep learning operators for a next-generation AI accelerator platform.

This role focuses on kernel design, hardware-aware performance tuning, and correctness validation across a broad range of neural network workloads.

The ideal candidate will have deep experience optimizing compute-intensive software for GPUs, NPUs, DSPs, SIMD units, embedded accelerators, compiler backends, or HPC systems, and will be able to reason from model-level requirements down to hardware execution efficiency.


Responsibilities

  • Design, implement, and optimize high-performance operators such as:
    • Normalization
    • Reduction
    • Transpose
    • Reshape
    • Gather / Scatter
    • Quantization / Dequantization
    • Fused elementwise kernels
  • Own performance optimization across key hardware constraints, including:
    • Memory bandwidth
    • SRAM utilization
    • Data reuse
    • DMA latency
    • Bank conflicts
    • Compute utilization
  • Develop advanced optimization strategies including:
    • Tiling
    • Blocking
    • Vectorization
    • Memory scheduling
  • Analyze and resolve bottlenecks related to:
    • Memory hierarchy
    • Synchronization overhead
    • Instruction scheduling
    • Data movement
  • Validate operator correctness and numerical precision against reference implementations (e.g. PyTorch, NumPy)
  • Benchmark and profile kernel performance across simulation, emulation, FPGA, or production silicon environments
  • Debug complex issues involving:
    • Tensor layouts
    • Precision loss
    • Memory access patterns
    • Performance regressions
  • Build performance models and optimize operators toward hardware roofline limits
  • Collaborate closely with compiler, runtime, hardware architecture, and ML model teams to improve operator APIs and execution efficiency
  • Document optimization strategies, tensor layouts, and performance improvements
  • Mentor junior engineers and help define engineering best practices

Requirements

  • BS / MS / PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field
  • 5+ years of experience in one or more of the following:
    • Accelerator programming
    • GPU / NPU development
    • Compiler backend engineering
    • Embedded systems
    • High-performance computing
    • Performance optimization
  • Strong programming skills in:
    • C/C++
    • Python
  • Deep understanding of:
    • Tensor computation
    • Neural network operators
  • Strong knowledge of computer architecture concepts:
    • Memory hierarchy
    • Bandwidth and latency analysis
    • Cache / SRAM behaviour
    • Parallelism and synchronization
    • Data locality and vectorization
  • Proven experience optimizing performance-critical kernels or numerical compute pipelines
  • Ability to identify and resolve performance bottlenecks from algorithm through to hardware execution
  • Strong debugging, profiling, and analytical problem-solving skills

Preferred Experience

Experience with one or more of the following:

Frameworks / Tooling

  • CUDA
  • Triton
  • OpenCL
  • TVM
  • MLIR
  • Halide

Systems Experience

  • SIMD
  • DSP
  • Embedded C/C++
  • GPU / NPU programming
  • FPGA development
  • HPC systems

Advanced Optimization Techniques

  • Tiling and blocking
  • Vectorization
  • Memory access optimization
  • Instruction scheduling
  • Mixed-precision optimization

Numerical Formats

  • FP32
  • FP16
  • BF16
  • FP8
  • INT8 / INT4

AI Accelerator Architecture Familiarity

  • Matrix engines
  • Vector engines
  • Systolic arrays
  • DMA engines
  • SRAM / NoC / DRAM systems

Bonus

  • Experience with simulator, emulator, FPGA, or silicon bring-up

Opportunity

Join a highly technical team building cutting-edge AI compute infrastructure and contribute directly to the performance of next-generation machine learning hardware. This is an opportunity to work at the intersection of AI systems, compiler optimization, and hardware acceleration, with significant ownership and technical impact.

Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.

Reece Waldon
