The State of OpenCL for Scientific Computing in 2018

I maintain a fair number of software packages that build, in one way or another, on OpenCL (and keep creating more!), so a friend recently asked me what I thought of the state of the OpenCL ecosystem in 2018. In part, I am writing this to support the notion that it is healthier than one might assume. So, with that: here is my personal, subjective view of how OpenCL is doing.

OpenCL as a Standard

First, to get one thing out of the way, independently of the state of concrete implementations (which I’ll get to in a second), the OpenCL machine model makes a great deal of sense (to me). It provides:

Two heavily overdecomposed levels of parallelism (one for cores, the other for vector lanes), with lightweight (“barely any”) synchronization.
A dependency graph between computations (“kernel invocations”) that are submitted in large batches
Shared virtual memory. This means that memory handles are just pointers in the host address space, and multiple devices and the host can access this memory with potentially strong coherency guarantees. (This is in OpenCL 2, on devices that support it.)
Just-in-time compilation
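To make the two levels of parallelism concrete, here is a minimal sketch in plain Python that emulates how a one-dimensional OpenCL index space is split into work-groups (roughly, cores) and work-items within a group (roughly, vector lanes). The names run_ndrange and scale are made up for illustration, and the emulation runs sequentially; a real implementation executes work-items in parallel and promises no particular order.

```python
def run_ndrange(kernel, global_size, local_size, *args):
    """Call `kernel` once per work-item, grouped the way OpenCL groups them.

    Sequential emulation only: real OpenCL runs these in parallel.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")

    for group_id in range(global_size // local_size):  # outer level: "cores"
        for local_id in range(local_size):              # inner level: "lanes"
            # Same relation as get_global_id(0) in OpenCL C (zero offset).
            global_id = group_id * local_size + local_id
            kernel(global_id, group_id, local_id, *args)


def scale(global_id, group_id, local_id, factor, data, out):
    """Hypothetical 'kernel': each work-item handles exactly one element."""
    out[global_id] = factor * data[global_id]


data = list(range(8))
out = [0] * len(data)
run_ndrange(scale, len(data), 4, 3, data, out)  # 2 groups of 4 work-items
print(out)
```

Because each work-item touches only its own element, no synchronization is needed, which is the "barely any" case the model is designed around.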

Even though OpenCL is headed for its tenth birthday, that machine model still maps very cleanly onto the multi-core CPUs and GPUs of today, and if you write scientific computing code with code generation, there is hardly a better, more coherent machine model that lets you target a broad class of machines. So even if OpenCL were to not ‘make it’, the abstraction itself is likely to continue to make sense for the foreseeable future. For instance, OpenCL and CUDA do not differ by much aside from spelling. OCCA and ispc expose much the same model. A very similar computing environment can be assembled out of OpenMP’s core parallelism and #pragma simd.

In addition, OpenCL-the-standard provides:

A fairly complete set of transcendental functions that, typically, vectorize cleanly
Easy access from Python and interoperability with numpy (OK, this one is a shameless plug)
Compared with many custom JIT solutions, OpenCL has fairly rigorous semantics, so you’re not likely to be left guessing whether an implementation’s behavior is correct or not.
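As a rough illustration of the Python access and just-in-time compilation mentioned above, the following sketch uses PyOpenCL to compile a small kernel at run time and apply it to numpy arrays. The kernel name axpy and the function axpy_on_device are made up for this example, and it assumes that PyOpenCL, numpy, and at least one working OpenCL implementation are installed. The imports sit inside the function so that the module can be loaded on machines without them.

```python
# OpenCL C source, compiled at run time by whichever implementation is chosen.
KERNEL_SOURCE = """
__kernel void axpy(const float a,
                   __global const float *x,
                   __global float *y)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
"""


def axpy_on_device(a, x, y):
    """Return a*x + y, computed by OpenCL.

    x and y are assumed to be contiguous 1-D float32 numpy arrays
    of equal length.
    """
    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()   # may ask which platform/device to use
    queue = cl.CommandQueue(ctx)

    # Copy the numpy arrays to device memory.
    x_dev = cl_array.to_device(queue, x)
    y_dev = cl_array.to_device(queue, y)

    # Just-in-time compilation happens here.
    program = cl.Program(ctx, KERNEL_SOURCE).build()

    # One work-item per element; None lets the implementation pick group sizes.
    program.axpy(queue, x.shape, None, np.float32(a), x_dev.data, y_dev.data)

    # Copy the result back into a new numpy array.
    return y_dev.get()
```

On a machine with a working OpenCL setup, calling axpy_on_device(2.0, x, y) with two float32 arrays should give the same result as 2.0 * x + y computed in numpy, up to floating-point rounding.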

Implementations

A standard is only as good as its implementations—so what is the state of the available OpenCL implementations? OpenCL is a large and complex specification, so implementing it fully and correctly is a substantial effort.

A number of fairly recent developments have helped substantially in that regard:

LLVM’s built-in OpenCL support keeps improving
Khronos’ OpenCL conformance test suite became open-source and publicly available
Khronos’ OpenCL ICD loader has become open-source

Nvidia GPU

Nvidia’s OpenCL implementation is capable and generally works well.

Occasionally, one does get the impression that, as a vendor of a directly competing solution (CUDA), Nvidia has a vested interest in crippling OpenCL support on their devices: from lacking tool support (debugging, profiling) to reports of performance differences from CUDA (in both directions!) to lagging support for newer OpenCL versions, there is plenty that could be better.

That said, Nvidia’s leverage in this regard may be limited—surprisingly.

POCL (see below) has a proof-of-concept CUDA backend that uses Nvidia’s CUDA APIs to implement OpenCL. With some work, this could be made actually useful.
There is ongoing work in the Linux graphics stack to support computation. While there is substantial work to be done, this work has come very far in the past few years.

AMD GPU

AMD’s brand-new open-source ROCm stack has just become usable for me on Debian on a Radeon R9 Fury GPU, now that they’ve switched to shipping their kernel driver as a DKMS module. While I haven’t done comprehensive performance measurements, it passed most of the PyOpenCL test suite on the first attempt.

POCL CPU

POCL is an LLVM-based OpenCL implementation for (mainly) CPUs. It has just celebrated its 1.0 release. Most of all, it is conformant and correct. My group builds and tests software mainly on POCL, because it is easy to install on basically any machine (see below under ‘Shipping Software’).

Its vectorization performance varies from competitive to ‘needs work’, but in general, code executes at least as fast as reasonably written C or Fortran code. With work out of Intel on an outer-loop vectorizer for LLVM, this has the potential for substantial improvement in the near future. POCL also integrates the SLEEF vectorized special function library.

Intel Integrated GPUs

Intel’s Beignet implements CL for Linux on their integrated GPUs. It passes most of the PyOpenCL test suite and performs well.

Update June 2018: In early 2018, Intel released an additional OpenCL compute runtime for its “Gen” family of integrated GPUs that supersedes Beignet for Broadwell (“Gen8”) and newer chips. I do not have first-hand experience with this ICD yet, but since it is reportedly based on Intel’s previously closed Windows CL runtime, I expect a mostly competent implementation.

Apple GPU/CPU

While Apple originated the OpenCL standard, the implementations they ship are borderline unusable due to bugs. For CPU work, POCL can be built on macOS with little trouble (or see below under ‘Shipping Software’). For GPU work, Linux provides a nicer environment.

Update June 2018: Apple has deprecated OpenCL in favor of their own graphics abstraction, Metal, which is unavailable on any other platform.

Intel CPU (Updated August 2019)

Intel has a (now open-source) CPU OpenCL ICD that has fairly competitive code generation. Based on file names, it appears that this is an updated version of their previous closed-source OpenCL CPU ICD. One unfortunate aspect of this implementation is that it ships with a number of shared libraries that are very likely to clash with other shared libraries already installed on your system. So, to use this ICD, make sure to configure (e.g.) the $LD_LIBRARY_PATH environment variable so as to prefer the libraries shipped with the implementation over the system ones. This bug has a discussion of this issue.

A previous version of this guide mentioned code generation bugs and crashes. I suspect that was not true, and that the crashes I observed were instead due to the clashes with system libraries noted above.

Shipping Software

Shipping software based on OpenCL sounds like it might be troublesome, since, in addition to the actual software, a user must have both an ICD loader (libOpenCL.so) and an ICD (the actual implementation).

This concern is mostly a thing of the past though. Five easy shell commands install a usable environment on Linux and macOS.
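Once an ICD loader and at least one ICD are in place, a quick sanity check is to ask the loader which implementations it can see. The sketch below does this through PyOpenCL; the helper names describe_platforms and list_installed_icds are made up for illustration, and the second function assumes PyOpenCL is installed.

```python
def describe_platforms(platforms):
    """Format platform and device names as a list of text lines.

    `platforms` is any iterable of objects with `name`, `version`, and
    `get_devices()`, such as the result of pyopencl.get_platforms().
    """
    lines = []
    for platform in platforms:
        lines.append(f"{platform.name} ({platform.version})")
        for device in platform.get_devices():
            lines.append(f"  {device.name}")
    return lines


def list_installed_icds():
    """Print every platform (ICD) the loader found, with its devices."""
    import pyopencl as cl  # imported here so the helper above works without it

    for line in describe_platforms(cl.get_platforms()):
        print(line)
```

Calling list_installed_icds() prints one line per installed implementation, followed by its devices. If the loader finds no ICD at all, the call typically fails with an error rather than printing an empty list.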

The Future

Recent news out of the OpenCL working group is that OpenCL might become implementable on top of Vulkan, a graphics API that is rapidly gaining ubiquity, by way of a single ‘shim’ ICD provided by the OpenCL working group. While this is dependent on a number of future developments, it would make OpenCL support essentially ubiquitous on all GPUs supporting Vulkan.

Conclusions

While OpenCL has been around for a while, many recent developments make me optimistic for its future. If I missed anything, do let me know, and I would be happy to add it here.