On benchmarking of deep learning systems: software engineering issues and reproducibility challenges

Maffia, Antonio. On benchmarking of deep learning systems: software engineering issues and reproducibility challenges. 2023, Doctoral Thesis, University of Basel, Faculty of Science.

Available under License CC BY-NC-SA (Attribution-NonCommercial-ShareAlike).


Official URL: https://edoc.unibas.ch/93976/



Since AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, Deep Learning (and Machine Learning/AI in general) has attracted exponentially growing interest.
Nowadays, its adoption spans numerous sectors, such as automotive, robotics, healthcare, and finance.
Advances in ML go hand in hand with the quality improvements delivered by these solutions.
However, these improvements do not come for free: ML algorithms demand ever-increasing computational power, which pushes computer engineers to develop new devices capable of coping with this demand for performance.
To foster the evolution of domain-specific architectures (DSAs), and thus ML research, it is key to make them easy to experiment with and compare. This can be challenging: even though the software built around these devices simplifies their usage, obtaining the best performance is not always straightforward.
The situation gets even worse when experiments are not conducted in a reproducible way.
Even though the importance of reproducibility for research is evident, this does not directly translate into reproducible experiments. In fact, as previous studies in other research fields have already shown, ML too is facing a reproducibility crisis.
Our work addresses the reproducibility of ML applications. Reproducibility in this context has two aspects: reproducibility of results and reproducibility of performance. While reproducibility of results is mandatory, performance reproducibility cannot be neglected either, because the use of high-performance devices incurs cost. To understand the state of performance reproducibility in ML, we reproduce results published for the MLPerf suite, which appears to be the most widely used machine learning benchmark.
Because of the wide range of devices and frameworks used across benchmark submissions, we focus on a subset of accuracy and performance results submitted to the MLPerf Inference benchmark. We present a detailed analysis of the difficulties a scientist may encounter when trying to reproduce such a benchmark, and a possible solution using our workflow tool for experiment reproducibility: PROVA!.
We designed PROVA! to support reproducibility in traditional HPC experiments, but we show how we extended it to serve as a 'driver' for MLPerf benchmark applications.
The PROVA! driver mode allows us to experiment with different versions of the MLPerf Inference benchmark, switching among different hardware and software combinations and comparing them in a reproducible way.
In the last part, we present the results of our reproducibility study, demonstrating the importance of a support tool for reproducing and extending the original experiments and for gaining deeper insight into performance behaviour.
Advisors:Burkhart, Helmar and Ciorba, Florina M.
Committee Members:Resch, Michael M.
Faculties and Departments:05 Faculty of Science > Departement Mathematik und Informatik > Ehemalige Einheiten Mathematik & Informatik > High Performance and Web Computing (Burkhart)
UniBasel Contributors:Burkhart, Helmar and Ciorba, Florina M.
Item Type:Thesis
Thesis Subtype:Doctoral Thesis
Thesis no:15007
Thesis status:Complete
Number of Pages:xxii, 181
Identification Number:
  • urn: urn:nbn:ch:bel-bau-diss150075
edoc DOI:
Last Modified:23 Jun 2023 01:30
Deposited On:11 May 2023 09:10
