FTHPC

From NetSysLab

GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general-purpose computing, and General Purpose GPU (GPGPU) applications are now common in error-sensitive domains. However, the reliability implications of using GPUs in the presence of hardware faults are unclear. This project aims to evaluate the error resilience of GPGPU applications and to build a fault-tolerance framework for them that takes performance, reliability, and power consumption into account.

We begin this study by characterizing the reliability of GPGPU applications. This characterization provides insight for developing heuristics that apply fault detection mechanisms to GPGPU applications in order to reduce Silent Data Corruptions (SDCs).

People

Bo Fang
Jiesheng Wei
Karthik Pattabiraman
Matei Ripeanu
Sudhanva Gurumurthi

Download

GPU-Qin is a fault injector for GPGPU (CUDA) applications. github

Publications

[8] ePVF: An Enhanced Program Vulnerability Factor Methodology for Holistic Resilience Analysis, Bo Fang, Qining Lu, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’16), June 2016 (acceptance rate: 58/253=22%) pdf slides
[7] SDC is in the Eye of the Beholder: A Survey and Preliminary Study, Bo Fang, Panruo Wu, Qiang Guan, Nathan DeBardeleben, Laura Monroe, Sean Blanchard, Zizhong Chen, Karthik Pattabiraman, Matei Ripeanu, 3rd IEEE International Workshop on Reliability and Security Data Analysis (RSDA 2016), June 2016 pdf slides
[6] A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications, Bo Fang, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi, IEEE Transactions on Parallel and Distributed Systems (TPDS), accepted January 2016. pdf
[5] GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications, Bo Fang, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14), March 23-25, 2014, Monterey, CA. pdf (superseded by technical report) slides
[4] GPGPUs: How to Combine High Computational Power with High Reliability, L. Bautista Gomez, F. Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, M. Sonza Reorda, DATE'14, pdf
[3] Evaluating the Error Resilience of Parallel Programs, Bo Fang, Karthik Pattabiraman, Sudhanva Gurumurthi, Matei Ripeanu, Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), collocated with DSN 2014, Atlanta, GA, June 2014. pdf slides
[2] Towards Building Error Resilient GPGPU Applications, Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu, to appear in the 3rd Workshop on Resilient Architecture (WRA) in conjunction with MICRO 2012, Vancouver, Canada. pdf slides
[1] Evaluating the Error Resilience of GPGPU Applications, Bo Fang, Jiesheng Wei, Karthik Pattabiraman, Matei Ripeanu, poster at IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC12), Salt Lake City, UT, November 2012. pdf poster