As someone who maintains a fair number of software packages that build, in one way or another, on OpenCL (and who keeps creating more!), I was recently asked by a friend what I thought of the state of the OpenCL ecosystem in 2018. In part, I am writing this to support the notion that it is healthier than one might assume. So, with that: here is my personal, subjective view of how OpenCL is doing.

OpenCL as a Standard

First, to get one thing out of the way, independently of the state of concrete implementations (which I’ll get to in a second), the OpenCL machine model makes a great deal of sense (to me). It provides:

  • Two heavily overdecomposed levels of parallelism (one for cores, the other for vector lanes), with lightweight (“barely any”) synchronization.
  • A dependency graph between computations (“kernel invocations”) that are submitted in large batches.
  • Shared virtual memory: memory handles are just pointers in the host address space, and multiple devices and the host can access this memory with potentially strong coherency guarantees. (This is in OpenCL 2, on devices that support it.)
  • Just-in-time compilation.

Even as OpenCL heads for its tenth birthday, that machine model still maps very cleanly onto the multi-core CPUs and GPUs of today, and if you write scientific computing code with code generation, there is hardly a better, more coherent machine model that lets you target a broad class of machines. So even if OpenCL were not to ‘make it’, the abstraction itself is likely to continue to make sense for the foreseeable future. For instance, OpenCL and CUDA do not differ much beyond spelling. OCCA and ispc expose much the same model. A very similar computing environment can be assembled from OpenMP’s core parallelism and #pragma simd.
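To make the two-level decomposition concrete, here is a toy Python sketch (deliberately not using OpenCL itself, and with all names made up for illustration) that emulates how an NDRange splits into work-groups, which map onto cores, and work-items within each group, which map onto vector lanes:

```python
# Toy emulation of OpenCL's two-level NDRange decomposition (illustrative only):
# work-groups map onto cores; work-items within a group map onto vector lanes.

def run_kernel(kernel, global_size, local_size, *args):
    """Invoke `kernel` once per work-item with its global and local ID."""
    assert global_size % local_size == 0
    for group_id in range(global_size // local_size):   # one iteration per "core"
        for local_id in range(local_size):              # one per "vector lane"
            gid = group_id * local_size + local_id
            kernel(gid, local_id, *args)

# A "kernel" in this model is the work of a single scalar work-item.
def vec_add(gid, lid, a, b, out):
    out[gid] = a[gid] + b[gid]

a, b, out = [1.0] * 8, [2.0] * 8, [0.0] * 8
run_kernel(vec_add, 8, 4, a, b, out)
# out is now [3.0] * 8
```

In a real implementation, the outer loop runs across cores and the inner loop is (ideally) collapsed into vector instructions; the point of the overdecomposition is that the programmer supplies far more parallel work than the hardware has execution resources, leaving the mapping to the implementation.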

In addition, OpenCL-the-standard provides:

  • A fairly complete set of transcendental functions that, typically, vectorize cleanly.
  • Easy access from Python and interoperability with numpy (OK, this one is a shameless plug).
  • Fairly rigorous semantics compared with many custom JIT solutions, so you are unlikely to be left guessing whether an implementation’s behavior is correct.


Implementations

A standard is only as good as its implementations—so what is the state of the available OpenCL implementations? OpenCL is a large and complex specification, so implementing it fully and correctly is a substantial effort.

A number of fairly recent developments have helped substantially in that regard:

  • LLVM’s built-in OpenCL support keeps improving
  • Khronos’ OpenCL conformance test suite became open-source and publicly available
  • Khronos’ OpenCL ICD loader has become open-source

Nvidia GPU

Nvidia’s OpenCL implementation is capable and generally works well.

Occasionally, one does get the impression that, as a vendor of a directly competing solution (CUDA), Nvidia has a vested interest in crippling OpenCL support on their devices: from lacking tool support (debugging, profiling) to reports of performance differences from CUDA (in both directions!) to lagging support for newer OpenCL versions, there is plenty that could be better.

That said, Nvidia’s leverage in this regard may be limited—surprisingly.

  • POCL (see below) has a proof-of-concept CUDA backend that uses Nvidia’s CUDA APIs to implement OpenCL. With some work, this could be made actually useful.
  • There is ongoing work in the Linux graphics stack to support computation. While there is substantial work to be done, this work has come very far in the past few years.


AMD GPU

AMD’s brand-new open-source ROCm stack has just become usable for me on Debian on a Radeon R9 Fury GPU, now that they’ve switched to shipping their kernel driver as a DKMS module. While I haven’t done comprehensive performance measurements, it passed most of the PyOpenCL test suite on the first attempt.


POCL

POCL is an LLVM-based OpenCL implementation aimed (mainly) at CPUs. It has just celebrated its 1.0 release. Most importantly, it is conformant and correct. My group builds and tests its software mainly on POCL, because it is easy to install on basically any machine (see below under ‘Shipping Software’).

Its vectorization performance varies from competitive to ‘needs work’, but in general, code executes at least as fast as reasonably written C or Fortran code. With work out of Intel on an outer-loop vectorizer for LLVM, this has the potential for substantial improvement in the near future. POCL also integrates the SLEEF vectorized special function library.

Intel iGPU

Intel’s Beignet implements OpenCL for Linux on their integrated GPUs. It passes most of the PyOpenCL test suite and performs well.


Apple

While Apple originated the OpenCL standard, the implementations they ship are borderline unusable due to bugs. For CPU work, POCL can be built on macOS with little trouble (or see below under ‘Shipping Software’). For GPU work, Linux provides a nicer environment.

Intel CPU

Intel also ships a CPU OpenCL driver with fairly competitive code generation. Unfortunately, numerous code-generation bugs make this implementation less usable for demanding applications.

Shipping Software

Shipping software based on OpenCL sounds like it might be troublesome: in addition to the actual software, a user must have an ICD loader and an ICD (the actual implementation).

This concern is mostly a thing of the past, though. Five easy shell commands install a usable environment on Linux and macOS.
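For the record, one plausible such sequence uses conda: the conda-forge channel ships both POCL and PyOpenCL as binary packages. (The installer URL below is for Linux x86-64; adjust it for your platform.)

```shell
# Install a self-contained conda environment (Linux x86-64 installer shown).
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
source ~/miniconda3/bin/activate
# conda-forge provides both the POCL implementation and PyOpenCL.
conda config --add channels conda-forge
conda install -y pocl pyopencl
```

Since the whole stack, including the ICD loader and the POCL ICD, lives inside the conda environment, no root access or system-wide driver installation is needed for CPU-based OpenCL.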

The Future

Recent news out of the OpenCL working group is that OpenCL might become implementable on top of Vulkan, a graphics API that is rapidly gaining ubiquity, by way of a single ‘shim’ ICD provided by the OpenCL working group. While this is dependent on a number of future developments, it would make OpenCL support essentially ubiquitous on all GPUs supporting Vulkan.


Conclusions

While OpenCL has been around for a while, many recent developments make me optimistic about its future. If I missed anything, do let me know, and I will be happy to add it here.