Analysis of Node Failures in High Performance Computers Based on System Logs

Ghiasvand, Siavash and Ciorba, Florina M. and Tschüter, Ronny and Nagel, Wolfgang E. (2015) Analysis of Node Failures in High Performance Computers Based on System Logs. 28th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015).

Official URL: http://edoc.unibas.ch/40834/

Abstract

The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, the mean time between failures of HPC systems is expected to become too short for current failure recovery mechanisms to recover the systems from failures. Early failure detection is thus essential to prevent the destructive effects of failures. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of the correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.
Faculties and Departments: 05 Faculty of Science > Departement Mathematik und Informatik > Informatik > High Performance Computing (Ciorba)
UniBasel Contributors: Ciorba, Florina M.
Item Type: Other
Note: Publication type according to Uni Basel Research Database: Other publications
Language: English
Last Modified: 18 May 2018 14:03
Deposited On: 28 Aug 2017 11:48