TensorDynamic: Bridging Application- and Instruction-Level Fault Injection for DNN Tensor Core Execution
Abstract
Deep neural network (DNN) inference relies heavily on Tensor Core operations, which are vulnerable to transient hardware faults in computation pipelines not protected by error-correcting codes (ECC). Prior fault injection work has explored both application-level and instruction-level effects on DNN accuracy. However, existing application-level approaches support only coarse perturbations and do not capture hardware execution details, while instruction-level approaches lack application-level context.
To address this gap, we propose TensorDynamic, an application-aware instruction-level dynamic fault injection tool for Tensor Core execution in DNN workloads. TensorDynamic enables fine-grained fault injection into MMA (matrix-multiply-accumulate) instructions during DNN execution. Across multiple models, we show that, under the same error injection rate and severity, application-level fault injection can produce substantially different inference outcomes from instruction-level fault injection. This result underscores the need for execution-aware fault injection when evaluating DNN resilience on GPU Tensor Cores.
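As a rough illustration of why the two injection levels can diverge, the toy sketch below flips the same bit either in an intermediate accumulator between MMA-style accumulation steps or in the final output value. It is not TensorDynamic's implementation: the helper names (`flip_bit`, `tiled_matmul`), the tiling scheme, and the fault tuple are all hypothetical, and the arithmetic runs in ordinary Python floats, with only the bit flip applied to a float32 encoding.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) in the float32 encoding of value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


def tiled_matmul(a, b, tile_k, fault=None):
    """Compute C = A @ B by accumulating over K in tiles of size tile_k.

    Each tile stands in for one MMA-style accumulate step (an assumption
    made for illustration). fault = (step, row, col, bit) flips one bit of
    the accumulator element right after the given step completes.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for step, k0 in enumerate(range(0, k, tile_k)):
        for i in range(m):
            for j in range(n):
                for kk in range(k0, min(k0 + tile_k, k)):
                    c[i][j] += a[i][kk] * b[kk][j]
        if fault is not None and fault[0] == step:
            _, fi, fj, bit = fault
            c[fi][fj] = flip_bit(c[fi][fj], bit)
    return c


a = [[1.0, 2.0, 3.0, 4.0]]
b = [[1.0], [1.0], [1.0], [1.0]]

# Fault-free result: 1 + 2 + 3 + 4 = 10.0
clean = tiled_matmul(a, b, tile_k=2)

# Instruction-level style: flip the sign bit of the partial sum after step 0
# (3.0 becomes -3.0), then the remaining accumulation continues: -3 + 3 + 4 = 4.0
instr = tiled_matmul(a, b, tile_k=2, fault=(0, 0, 0, 31))

# Application-level style: flip the same bit in the final output: 10.0 becomes -10.0
app = [[flip_bit(clean[0][0], 31)]]
```

Here the same single-bit flip yields 4.0 when applied mid-accumulation and -10.0 when applied to the finished output, which is the kind of gap between injection levels that the abstract describes.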
BibTeX
@inproceedings{jia2026tensordynamic,
author = {Yuxiao Jia and Euijun Chung and Huanzhi Pu and Ben Feinberg and Hyesoon Kim},
title = {{TensorDynamic: Bridging Application- and Instruction-Level Fault Injection for DNN Tensor Core Execution}},
booktitle = {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
year = {2026},
month = {apr},
address = {Seoul, South Korea},
doi = {}
}