Analysis of Node Failures in High Performance Computers Based on System Logs

Date Issued
2015-01-01
Author(s)
Ghiasvand, Siavash
Ciorba, Florina M.  
Tschüter, Ronny
Nagel, Wolfgang E.
Abstract
The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, the mean time between failures of HPC systems is expected to become too short, and current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent the destructive effects of failures. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.
File(s)
Name
20160106143839_568d18dfc57a2.pdf
Size
100.04 KB
Format
Adobe PDF
Checksum (MD5)
773981559687dcce3da94b70c8c730b1

University of Basel

edoc
Open Access Repository University of Basel