Junior NPU Kernel/Operator Engineer
Role
We are looking for a Junior NPU Kernel/Operator Engineer to develop and optimize deep learning operators for a custom AI accelerator (NPU). The role focuses on operator implementation, performance tuning, and correctness validation across a broad range of neural network workloads.
This is a good fit for candidates with strong C/C++ and Python skills who are interested in hardware-aware software optimization. Prior NPU experience is helpful but not required.
Responsibilities
- Implement and optimize NPU operators such as normalization, reduction, transpose, reshape, gather/scatter, quant/dequant, and fused elementwise kernels.
- Tune kernels for memory bandwidth, SRAM usage, data reuse, DMA latency, bank conflicts, and compute utilization.
- Validate operator correctness against PyTorch, NumPy, or framework reference results.
- Benchmark operator performance on the simulator or on silicon.
- Debug correctness, precision, memory layout, and performance issues.
- Work with compiler, runtime, hardware, and model teams.
- Document operator behavior, tensor layout, tiling strategy, and performance results.
Requirements
- BS/MS in CS, EE, Computer Engineering, or a related field.
- Strong C/C++ and Python programming skills.
- Basic understanding of tensor computation and neural network operators.
- Familiarity with basic computer architecture concepts such as memory hierarchy, bandwidth, latency, cache/SRAM, and parallelism.
- Good debugging and problem-solving skills.
Preferred
- Experience with any of the following:
  - CUDA, Triton, OpenCL, TVM, MLIR, or Halide
  - SIMD, DSP, embedded C/C++, GPU, NPU, FPGA, or HPC programming
  - Compiler or runtime development
- Understanding of tiling, vectorization, memory access optimization, or mixed precision.
- Experience with FP32, FP16, BF16, INT8, or other numerical formats.