McKelvey School of Engineering
Industry Connect
Black Sesame Technologies

NPU Kernel Engineer

Recruitment began on April 27, 2026, and the job listing expires on May 28, 2026.
Full-Time

Junior NPU Kernel/Operator Engineer

Role

We are looking for a Junior NPU Kernel/Operator Engineer to develop and optimize deep learning operators for a custom AI accelerator / NPU. The role focuses on kernel/operator implementation, performance tuning, and correctness validation across a broad range of neural network workloads.

This is a good fit for candidates with strong C/C++ and Python skills who are interested in hardware-aware software optimization. Prior NPU experience is helpful but not required.

Responsibilities

  • Implement and optimize NPU operators such as normalization, reduction, transpose, reshape, gather/scatter, quantization/dequantization, and fused elementwise kernels.
  • Tune kernels for memory bandwidth, SRAM usage, data reuse, DMA latency, bank conflicts, and compute utilization.
  • Validate operator correctness against PyTorch, NumPy, or framework reference results.
  • Benchmark performance on simulator or silicon.
  • Debug correctness, precision, memory layout, and performance issues.
  • Work with compiler, runtime, hardware, and model teams.
  • Document operator behavior, tensor layout, tiling strategy, and performance results.

Requirements

  • BS/MS in CS, EE, Computer Engineering, or related field.
  • Strong C/C++ and Python programming skills.
  • Basic understanding of tensor computation and neural network operators.
  • Familiarity with basic computer architecture concepts such as memory hierarchy, bandwidth, latency, cache/SRAM, and parallelism.
  • Good debugging and problem-solving skills.

Preferred

  • Experience with any of the following:
    • CUDA, Triton, OpenCL, TVM, MLIR, Halide
    • SIMD, DSP, embedded C/C++, GPU, NPU, FPGA, or HPC programming
    • compiler/runtime development
  • Understanding of tiling, vectorization, memory access optimization, or mixed precision.
  • Experience with FP32, FP16, BF16, INT8, or other numerical formats.