Alam, Sadaf R. and Fourestey, Gilles and Videau, Brice and Genovese, Luigi and Goedecker, Stefan and Dugan, Nazim. (2012) Overlapping Computations with Communications and I/O Explicitly Using OpenMP Based Heterogeneous Threading Models. In: OpenMP in a Heterogeneous World. Berlin, pp. 267-270.
Full text not available from this repository.
Official URL: https://edoc.unibas.ch/74603/
Abstract
Holistic tuning and optimization of hybrid MPI and OpenMP applications is becoming a focus for parallel code developers as the number of cores and hardware threads in the processing nodes of high-end systems continues to increase. For example, a Cray XE6 node with Interlagos processors supports 32 hardware threads, while the IBM Blue Gene/Q system could support up to 64 threads per node. Note that, by default, OpenMP threads and MPI tasks are pinned to processor cores on these high-end systems, and throughout the paper we assume fixed bindings of threads to physical cores. A number of OpenMP runtimes also support user-specified bindings of threads to physical cores. Parallel and node efficiencies on these high-end systems for hybrid MPI and OpenMP applications largely depend on balancing and overlapping computation and communication workloads. This issue is further intensified when the nodes have a non-uniform memory access (NUMA) model and I/O accelerator devices. In these environments, where access to I/O devices such as GPUs for code acceleration and network interfaces for MPI communication and parallel file I/O is managed and scheduled by a host CPU, application developers could introduce innovative solutions to overlap CPU and I/O operations to improve node and parallel efficiencies. For example, in a production-level application called BigDFT, the developers have introduced a master-slave model to explicitly overlap blocking collective communication operations with local multi-threaded computation. Similarly, some applications parallelized with MPI, OpenMP and GPU acceleration could assign a management thread for GPU data and control orchestration and an MPI control thread for communication management while the CPU threads perform overlapping calculations, and potentially set aside a background thread for file I/O based fault tolerance.
Considering these emerging application design needs, we would like to motivate the OpenMP standards committee, through examples and empirical results, to introduce thread and task heterogeneity in the language specification. This will allow code developers, especially those programming for large-scale distributed-memory HPC systems and accelerator devices, to design and develop portable solutions with overlapping control and data flow for their applications without resorting to custom solutions.
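The explicit overlap described in the abstract can be illustrated with a minimal C sketch. It is not taken from BigDFT or from the paper: `fake_blocking_collective` is a hypothetical stand-in for a blocking collective (in a real code this might be an `MPI_Allreduce` issued by the master thread under `MPI_THREAD_FUNNELED`), and `overlap_step` is an assumed name. Thread 0 performs the "communication" while the remaining threads share the local computation by hand, since a standard work-sharing loop would also assign iterations to thread 0.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for a blocking collective: doubles each element. */
static void fake_blocking_collective(const double *send, double *recv, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        recv[i] = 2.0 * send[i];
}

/* Thread 0 runs the collective; the other threads compute the sum of
 * squares of local[0..n_local). With a single thread (or without OpenMP)
 * the one thread does both, one after the other. */
double overlap_step(const double *send, double *recv, size_t n_comm,
                    const double *local, size_t n_local)
{
    assert(send && recv && local);
    double sum = 0.0;

#pragma omp parallel reduction(+ : sum)
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        /* Communication role: only thread 0 enters the blocking call. */
        if (tid == 0)
            fake_blocking_collective(send, recv, n_comm);

        /* Computation role: threads 1..nthreads-1 split the loop cyclically.
         * If there is only one thread, it also acts as the sole worker. */
        int nworkers = (nthreads > 1) ? nthreads - 1 : 1;
        int wid = (nthreads > 1) ? tid - 1 : 0;
        if (wid >= 0) {
            for (size_t i = (size_t)wid; i < n_local; i += (size_t)nworkers)
                sum += local[i] * local[i];
        }
    }   /* implicit barrier: recv and sum are both complete here */
    return sum;
}
```

The manual split by thread id is the kind of custom solution the abstract refers to: the roles of the threads are encoded by the programmer rather than expressed through the OpenMP specification. Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result.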
Faculties and Departments: | 05 Faculty of Science > Departement Physik > Physik > Physik (Goedecker) |
---|---|
UniBasel Contributors: | Goedecker, Stefan |
Item Type: | Book Section |
Book Section Subtype: | Further Contribution in a Book |
Publisher: | Springer |
ISBN: | 978-3-642-30960-1 |
e-ISBN: | 978-3-642-30961-8 |
Series Name: | Lecture Notes in Computer Science |
Issue Number: | 7312 |
Note: | Publication type according to Uni Basel Research Database: Book item |
Last Modified: | 23 Mar 2020 11:22 |
Deposited On: | 23 Mar 2020 11:22 |