PyCUDA

PyCUDA lets you access Nvidia’s CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won’t detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia’s C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA’s driver API at your disposal, if you wish.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA’s base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation.
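To give a feel for a few of these points, here is a minimal sketch (not taken from the PyCUDA documentation) that doubles an array on the GPU twice: once through GPUArray and once through a hand-written kernel. It assumes a current PyCUDA release, where SourceModule is imported from pycuda.compiler, as well as NumPy and a CUDA-capable device. The names KERNEL_SOURCE, grid_size and double_on_gpu are made up for this example.

```python
"""Sketch: double an array on the GPU via GPUArray and via SourceModule."""

# CUDA C source for a kernel that doubles each element in place.
KERNEL_SOURCE = """
__global__ void double_it(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2.0f;
}
"""


def grid_size(n, block_size):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size


def double_on_gpu(host_array, block_size=256):
    """Return (result via GPUArray, result via hand-written kernel)."""
    # Imported lazily so this module still loads on machines without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context, released at exit)
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    a = np.array(host_array, dtype=np.float32)  # private float32 copy

    # Convenience: GPUArray arithmetic, no kernel code needed.
    doubled_by_gpuarray = (2 * gpuarray.to_gpu(a)).get()

    # Full control: compile the kernel above and launch it ourselves.
    func = SourceModule(KERNEL_SOURCE).get_function("double_it")
    try:
        func(cuda.InOut(a), np.int32(a.size),
             block=(block_size, 1, 1),
             grid=(grid_size(a.size, block_size), 1))
    except cuda.Error as exc:
        # CUDA failures arrive as Python exceptions, not as return codes.
        raise RuntimeError(f"kernel launch failed: {exc}") from exc

    # Device memory held by the temporaries is freed when they go out of scope.
    return doubled_by_gpuarray, a
```

On a machine with a GPU, double_on_gpu([1, 2, 3]) should return two arrays that both hold the doubled input. Nothing is freed by hand, which is the object-lifetime point from the list above.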

Documentation

See the PyCUDA Documentation.

If you'd like to get an impression of what PyCUDA is being used for in the real world, head over to the PyCUDA showcase.

Support

Having trouble with PyCUDA? Maybe the nice people on the PyCUDA mailing list can help.

Download

PyCUDA may be downloaded from its Python Package Index page or obtained directly from my source code repository by typing

git clone http://git.tiker.net/trees/pycuda.git

You may also browse the source.

Prerequisites:

  • Boost (any recent version should work)
  • CUDA (version 2.0 beta or newer)
  • Numpy (version 1.0.4 or newer)

Why is GPU Computing significant?

In my opinion, GPU computing is significant because I--as a grad student--can easily afford a machine that allows me to perform a simulation like the following in 40 minutes instead of a whole workday. That's why.

If you're curious, this shows the density of a vortex shedding flow behind a square obstacle at Re=100 and Ma=0.1. The attentive viewer may notice a sound wave at the beginning as the system settles from uniform flow to flow around the obstacle, as well as the passing of a gentle density "nudge" intended to throw the system off balance and accelerate the onset of shedding. This was computed using my Discontinuous Galerkin solver hedge on an Nvidia GTX 260.

This work owes a lot to Hendrik Riedmann from IAG, Uni Stuttgart who wrote the initial version of the Navier-Stokes operator in hedge.

(Btw: did you notice how the movie cleverly avoids the typical criticism of being "CFD"--colorful fluid dynamics? :-)

Submitted: PyCUDA: GPU Run-Time Code Generation for High-Performance Computing

Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih and I have recently submitted an article that explains how PyCUDA allows the user to do run-time code generation ("RTCG"), and how that is an enormous boon to implementation efforts of most high-performance codes. Among many other things, PyCUDA also underlies our efforts to bring discontinuous Galerkin PDE solvers onto the GPU.

Get it while it's hot: arXiv, Brown SC

Update: Fixed arXiv link.

Abstract

High-performance scientific computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), and PyCUDA, an open-source toolkit that supports this technique.

In introducing PyCUDA, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. It is further observed that, compared to competing techniques, the effort required to create codes using run-time code generation with PyCUDA grows more gently in response to growing needs. The concept of RTCG is simple and easily implemented using existing, robust tools. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
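To make the RTCG idea a bit more concrete, here is a small sketch of what generating a kernel at run time can look like. It is an illustration under my own assumptions, not code from the article: the kernel scale_add, the template, and the helpers generate_kernel_source and compile_kernel are all hypothetical names. The scale factor and the unroll count are only known at run time, and both get baked into the CUDA source as literals before it is handed to the compiler.

```python
from string import Template

# CUDA C skeleton; $unroll and $body are filled in at run time.
KERNEL_TEMPLATE = Template("""
__global__ void scale_add(float *out, const float *x, const float *y, int n)
{
    // Each thread handles $unroll elements, spaced blockDim.x apart.
    int base = blockIdx.x * blockDim.x * $unroll + threadIdx.x;
$body
}
""")


def generate_kernel_source(scale, unroll):
    """Return CUDA source with `scale` as a literal and the loop unrolled."""
    statements = [
        f"    {{ int i = base + {k} * blockDim.x; "
        f"if (i < n) out[i] = {float(scale)!r}f * x[i] + y[i]; }}"
        for k in range(unroll)
    ]
    return KERNEL_TEMPLATE.substitute(unroll=unroll, body="\n".join(statements))


def compile_kernel(source):
    """Compile generated source with PyCUDA (needs a GPU and the CUDA toolkit)."""
    # Imported lazily so the generator itself runs anywhere.
    import pycuda.autoinit  # noqa: F401
    from pycuda.compiler import SourceModule

    return SourceModule(source).get_function("scale_add")


if __name__ == "__main__":
    # Inspect the generated code; no GPU is required for this step.
    print(generate_kernel_source(scale=2.5, unroll=4))
```

Because the source is an ordinary Python string, trying a different unroll count or constant is just another function call, which is what makes automated tuning loops cheap to write.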

PyCUDA talk at Nvidia's GPU Technology Conference

This past week, I had the honor of presenting a talk on PyCUDA at Nvidia's inaugural GPU Technology Conference.


Please click the following link to view the slides: pycuda-nvidia.pdf.

Update: Nvidia has posted a recording of the session that you may watch or download.

SciPy'09: Advanced Tutorial on PyCUDA

The SciPy'09 conference ended less than a week ago. At the invitation of the SciPy'09 organizers (especially Fernando Perez), Nicolas Pinto gave a talk in the Advanced Tutorials track on how to use PyCUDA to do GPU scripting.

First, I would like to use this opportunity to publicly thank Nicolas for all the work and time he put into making this tutorial a reality. Second, I would like to point out the video of his session, which you can watch below:

Talk slides: PyCuda@MIT

Nicolas Pinto at MIT was nice enough to invite me over to give a talk in their CUDA class. I made a bunch of slides that I think are of general interest to people who are interested in PyCuda. You can find them here.

Submitted: "Nodal Discontinuous Galerkin Methods on Graphics Processors"

Tim Warburton, Jeff Bridge, my advisor Jan Hesthaven and I have recently submitted an article detailing our efforts to accelerate Discontinuous Galerkin computations by using Nvidia CUDA GPUs. DG seems to be a good fit for these machines.

Get it while it's hot: arXiv, Brown SC reports, JCP

Abstract

Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. Lately, another property of DG has been growing in importance: The majority of a DG operator is applied in an element-local way, with weak penalty-based element-to-element coupling.

The resulting locality in memory access is one of the factors that enables DG to run on off-the-shelf, massively parallel graphics processors (GPUs). In addition, DG's high-order nature lets it require fewer data points per represented wavelength and hence fewer memory accesses, in exchange for higher arithmetic intensity. Both of these factors work significantly in favor of a GPU implementation of DG.

Using a single US$400 Nvidia GTX 280 GPU, we accelerate a solver for Maxwell's equations on a general 3D unstructured grid by a factor of 40 to 60 relative to a serial computation on a current-generation CPU. In many cases, our algorithms exhibit full use of the device's available memory bandwidth. Example computations achieve and surpass 200 gigaflops/s of net application-level floating point work.

In this article, we describe and derive the techniques used to reach this level of performance. In addition, we present comprehensive data on the accuracy and runtime behavior of the method.

PyCuda 0.91

I'm happy to announce the availability of PyCuda 0.91. There is full, up-to-date documentation available.

The following exciting stuff is in PyCuda 0.91:

  • Support for Windows and MacOS X, in addition to Linux. (Gert Wohlgemuth, Cosmin Stejerean, Znah on the Nvidia forums, and David Gadling)
  • Support more arithmetic operators on pycuda.gpuarray.GPUArray. (Gert Wohlgemuth)
  • Add pycuda.gpuarray.arange(). (Gert Wohlgemuth)
  • Add pycuda.curandom. (Gert Wohlgemuth)
  • Add pycuda.cumath. (Gert Wohlgemuth)
  • Add pycuda.autoinit.
  • Add pycuda.tools.
  • Add pycuda.tools.DeviceData and pycuda.tools.OccupancyRecord.
  • pycuda.gpuarray.GPUArray parallelizes properly on GTX200-generation devices.
  • Add support for compiling on CUDA 1.1. Added version query pycuda.driver.get_version(). Updated documentation to show 2.0-only functionality.
  • Make pycuda.driver.Function resource usage available to the program. (See, e.g. pycuda.driver.Function.registers.)
  • Cache kernels compiled by pycuda.driver.SourceModule.
  • Allow for faster, prepared kernel invocation. See pycuda.driver.Function.prepare().
  • Added memory pools (pycuda.tools.DeviceMemoryPool) as experimental, undocumented functionality. For some workloads, this can cure the slowness of pycuda.driver.mem_alloc().
  • Fix the memset family of functions.
  • Improve Error Reporting.

Check the docs change list for a fully hyperlinked version of the above.
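If you want to poke at a few of the new modules, the following sketch might be a starting point. It assumes a current PyCUDA install with NumPy and a CUDA-capable device; cpu_reference and gpu_vs_cpu_max_error are names invented for this example, and the exact keyword arguments may differ in older releases such as 0.91.

```python
import math


def cpu_reference(n):
    """sin(0), sin(1), ..., sin(n - 1) computed on the CPU in double precision."""
    return [math.sin(i) for i in range(n)]


def gpu_vs_cpu_max_error(n=1024):
    """Compute sin(arange(n)) on the GPU and compare it with the CPU result."""
    # Imported lazily so this module still loads on machines without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (sets up a context on the first device)
    import pycuda.cumath as cumath
    import pycuda.curandom as curandom
    import pycuda.gpuarray as gpuarray

    x = gpuarray.arange(n, dtype=np.float32)   # 0, 1, ..., n - 1 on the device
    gpu_sin = cumath.sin(x).get()              # elementwise sine, copied back
    noise = curandom.rand((n,), dtype=np.float32).get()  # uniform random numbers

    max_error = float(np.max(np.abs(gpu_sin - np.array(cpu_reference(n)))))
    return max_error, noise
```

The returned error should be small but not zero, since the GPU works in single precision here.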

Have fun, Andreas