Mitigating silent data corruption in high performance computing

A new technical paper titled “Mitigation of Silent Data Corruption in HPC Applications Across Multiple Program Inputs” has been released by researchers at the University of Iowa, Baidu Security, and Argonne National Lab. The paper was a finalist for Best Paper at SC22.

The researchers propose MinpSID, an automated selective instruction duplication (SID) framework that “automatically identifies and re-prioritizes incubative instructions in a given program to enhance SDC coverage. Evaluation shows MinpSID can effectively mitigate the loss of SDC coverage across multiple inputs,” the paper states.

The technical paper was published in November 2022. Presentation slides are also available.

Huang, Yafan, et al. “Mitigation of Silent Data Corruption in HPC Applications Across Multiple Program Inputs.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2022.

Related reading
Troubleshooting silent data errors
Some SDEs can be found with targeted electrical testing and 100% inspection, but not all.
Silent data corruption
How to prevent defects that can cause errors.
Why silent data errors are so hard to find
Subtle integrated circuit faults in data center processors lead to miscalculations.