As someone who maintains a fair number of software packages that build, in one way or another, on OpenCL (and keeps creating more!), a friend recently asked me what I thought of the state of the OpenCL ecosystem in 2018. In part, I am writing this to support the notion that it is healthier than one might assume. So, with that: here is my personal, subjective view of how OpenCL is doing.
OpenCL as a Standard
First, to get one thing out of the way, independently of the state of concrete implementations (which I’ll get to in a second), the OpenCL machine model makes a great deal of sense (to me). It provides:
- Two heavily overdecomposed levels of parallelism (one for cores, the other for vector lanes), with lightweight (“barely any”) synchronization.
- A dependency graph between computations (“kernel invocations”) that are submitted in large batches.
- Shared virtual memory. This means that memory handles are just pointers in the host address space, and multiple devices and the host can access this memory with potentially strong coherency guarantees. (This is in OpenCL 2, on devices that support it.)
- Just-in-time compilation.
Even as OpenCL approaches its tenth birthday, that machine model still maps very cleanly onto the multi-core CPUs and GPUs of today, and if you write scientific computing code with code generation, there is hardly a better, more coherent machine model that lets you target a broad class of machines. So even if OpenCL were not to ‘make it’, the abstraction itself is likely to continue to make sense for the foreseeable future. For instance, OpenCL and CUDA differ by little more than spelling. OCCA and ispc expose much the same model. A very similar computing environment can be assembled out of OpenMP’s core parallelism and #pragma simd.
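To make that model concrete, here is a minimal PyOpenCL sketch (the kernel and all variable names are mine, purely illustrative): an OpenCL C kernel is compiled just in time for whatever device happens to be available, then launched over a two-level decomposition into work-groups and work-items.

```python
import numpy as np
import pyopencl as cl

# Pick whatever device is available and create a command queue on it.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# One work-item per array entry, grouped into work-groups:
# the two overdecomposed levels of parallelism described above.
src = """
__kernel void scale(__global float *x, float alpha)
{
    int i = get_global_id(0);  /* index across the whole grid */
    x[i] = alpha * x[i];
}
"""

# Just-in-time compilation happens here, for the chosen device.
prg = cl.Program(ctx, src).build()

x_host = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
x_dev = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x_host)

# Global size (1024,), split into work-groups of size (64,).
prg.scale(queue, (1024,), (64,), x_dev, np.float32(2.0))

cl.enqueue_copy(queue, x_host, x_dev)
print(x_host[:4])  # [0. 2. 4. 6.]
```

Swap out the context for a different device and the same code runs on a CPU, an integrated GPU, or a discrete GPU; that portability is much of the appeal.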
In addition, OpenCL-the-standard provides:
- A fairly complete set of transcendental functions that, typically, vectorize cleanly
- Easy access from Python and interoperability with numpy (OK, this one is a shameless plug; see the small sketch after this list)
- Compared with many custom JIT solutions, OpenCL has fairly rigorous semantics, so you’re not likely to be left guessing whether an implementation’s behavior is correct or not.
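To illustrate the numpy interoperability (and the built-in transcendentals), here is a small sketch; the array contents and the tolerance are arbitrary:

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
import pyopencl.clmath as clmath

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Ship a numpy array off to the device...
x = cl_array.to_device(queue, np.linspace(0, 1, 10000).astype(np.float32))

# ...evaluate OpenCL's built-in exp on it, elementwise...
y = clmath.exp(x)

# ...and check the result against numpy back on the host.
assert np.allclose(y.get(), np.exp(x.get()), rtol=1e-5)
```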
Implementations
A standard is only as good as its implementations—so what is the state of the available OpenCL implementations? OpenCL is a large and complex specification, so implementing it fully and correctly is a substantial effort.
A number of fairly recent developments have helped substantially in that regard:
- LLVM’s built-in OpenCL support keeps improving
- Khronos’ OpenCL conformance test suite became open-source and publicly available
- Khronos’ OpenCL ICD loader has become open-source
Nvidia GPU
Nvidia’s OpenCL implementation is capable and generally works well.
Occasionally, one does get the impression that, as a vendor of a directly competing solution (CUDA), Nvidia has a vested interest in crippling OpenCL support on their devices: from lacking tool support (debugging, profiling) to reports of performance differences from CUDA (in both directions!) to lagging support for newer OpenCL versions, there is plenty that could be better.
That said, perhaps surprisingly, Nvidia’s leverage in this regard may be limited:
- POCL (see below) has a proof-of-concept CUDA backend that uses Nvidia’s CUDA APIs to implement OpenCL. With some work, this could be made actually useful.
- There is ongoing work in the Linux graphics stack to support computation. While much remains to be done, this effort has come very far in the past few years.
AMD GPU
AMD’s brand-new open-source rocm stack has just become usable for me on Debian on a Radeon R9 Fury GPU, now that they’ve switched to shipping their kernel driver as a DKMS module. While I haven’t done comprehensive performance measurements, it passed most of the PyOpenCL test suite on the first attempt.
POCL CPU
POCL is an LLVM-based OpenCL implementation for (mainly) CPUs. It has just celebrated its 1.0 release. Most of all, it is conformant and correct. My group builds and tests software mainly on POCL, because it is easy to install on basically any machine (see below under ‘Shipping Software’).
Its vectorization performance varies from competitive to ‘needs work’, but in general, code executes at least as fast as reasonably written C or Fortran code. With work out of Intel on an outer-loop vectorizer for LLVM, this has the potential for substantial improvement in the near future. POCL also integrates the SLEEF vectorized special function library.
Intel Integrated GPUs
Intel’s Beignet implements CL for Linux on their integrated GPUs. It passes most of the PyOpenCL test suite and performs well.
Update June 2018: In early 2018, Intel released an additional OpenCL compute runtime for its “Gen” family of integrated GPUs that supersedes Beignet for Broadwell (“Gen8”) and newer chips. I do not have first-hand experience with this ICD yet, but since it is reportedly based on Intel’s previously closed-source Windows CL runtime, I expect a mostly competent implementation.
Apple GPU/CPU
While Apple originated the OpenCL standard, the implementations they ship are borderline unusable due to bugs. For CPU work, POCL can be built on macOS with little trouble (or see below under ‘Shipping Software’). For GPU work, Linux provides a nicer environment.
Update June 2018: Apple has deprecated OpenCL in favor of their own graphics abstraction, Metal, which is unavailable on any other platform.
Intel CPU (Updated August 2019)
Intel has a (now open-source) CPU OpenCL ICD that has fairly competitive code generation. Based on file names, it appears that this is an updated version of their previous closed-source OpenCL CPU ICD.
One unfortunate aspect of this implementation is that it ships with a number of shared libraries that are very likely to clash with other shared libraries already installed on your system. So, when using this ICD, make sure to configure (e.g.) the $LD_LIBRARY_PATH environment variable so that the libraries shipped with the implementation are preferred over the system ones. This bug report contains a discussion of the issue.
A previous version of this guide mentioned code generation bugs and crashes. I suspect that was not accurate, and that the crashes I observed were instead due to the clashes with system libraries noted above.
Shipping Software
Shipping software based on OpenCL sounds like it might be troublesome, since, in addition to the actual software, a user must have an ICD loader (libOpenCL.so) as well as an ICD (the actual implementation).
This concern is mostly a thing of the past, though. Five easy shell commands install a usable environment on Linux and macOS.
The Future
Recent news out of the OpenCL working group is that OpenCL might become implementable on top of Vulkan, a graphics API that is rapidly gaining ubiquity, by way of a single ‘shim’ ICD provided by the OpenCL working group. While this is dependent on a number of future developments, it would make OpenCL support essentially ubiquitous on all GPUs supporting Vulkan.
Conclusions
While OpenCL has been around for a while, many recent developments make me optimistic for its future. If I missed anything, do let me know, and I would be happy to add it here.