Record Details

Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

Electronic Theses of Indian Institute of Science

 
 
Field Value
 
Title Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
 
Creator Pandit, Prasanna Vasant
 
Subject Heterogeneous Computers
Open Computing Language
FluidiCL
Fluidic Kernels
OpenCL Application Programming Interface
Graphics Processing Unit (GPU)
Central Processing Unit (CPU)
Computer Architecture
FluidiCL Runtime
Heterogeneous OpenCL Runtime
OpenCL Programs
CPU–GPU Systems
Computer Engineering
 
Description Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPUs) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge, as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well.
In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL. We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case.
 
Contributor Govindarajan, R
 
Date 2018-05-01T06:49:24Z
2018-05-01
2013
 
Type Thesis
 
Identifier http://etd.iisc.ernet.in/2005/3468
http://etd.iisc.ernet.in/abstracts/4335/G25888-Abs.pdf
 
Language en_US
 
Relation G25888