PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?
See the PyCUDA Documentation.
If you'd like to get an impression of what PyCUDA is being used for in the real world, head over to the PyCUDA showcase.
Having trouble with PyCUDA? Maybe the nice people on the PyCUDA mailing list can help.
PyCUDA may be downloaded from its Python Package Index page or obtained directly from my source code repository by typing
git clone --recursive http://git.tiker.net/trees/pycuda.git
You may also browse the source.
Prerequisites:
This October I had the honor of presenting my work on using Python with GPUs at PyData NYC 2012. Here's a video of my talk:
There was also a panel discussion on Python+Parallel that I was a part of--here's the video of that:
Also be sure to check out all the videos of the other great talks to see what you've missed.
If this sounds interesting to you, also be sure to check out their next conference, PyData Silicon Valley 2013. And please (continue to!) support NumFocus and check out what Continuum are doing for big data in Python. They deserve a lot of credit for bringing the Python community together at events like PyData.
Like last year, I had the honor of being invited to present PyCUDA and PyOpenCL, along with a few examples of their use, to a great crowd at Nvidia's GPU Technology Conference 2010.

Please click the following link to view the slides: pycuda-pyopencl-gtc-2010.pdf.
Update: Nvidia has posted a recording of the session. There's also a full list of sessions, with many talks that are worth watching. In particular, I'd like to recommend the ones by Bryan Catanzaro on Copperhead, which is built on top of PyCUDA, and by Tim Warburton on all things GPU-based discontinuous Galerkin. Also check out the poster on Atomic Hedgehog by Cyrus Omar.
At the recent PyCon Quattro, which took place in early May in the beautiful Tuscan city of Florence, Fabrizio Milo gave a talk on PyCUDA entitled
PyCuda: Come sfruttare la potenza delle schede video nelle applicazioni python (PyCUDA: How to make use of the power of graphics cards in Python applications)
He made a set of rather nice slides (in English), which may be of interest. They are downloadable in PDF form at the link.
Thanks Fabrizio for taking the time to talk about PyCUDA!
In my opinion, GPU computing is significant because I--as a grad student--can easily afford a machine that allows me to perform a simulation like the following in 40 minutes instead of a whole workday. That's why.
If you're curious, this shows the density of a vortex shedding flow behind a square obstacle at Re=100 and Ma=0.1. The attentive viewer may notice a sound wave at the beginning as the system settles from uniform flow to flow around the obstacle, as well as the passing of a gentle density "nudge" intended to throw the system off balance and accelerate the onset of shedding. This was computed using my Discontinuous Galerkin solver hedge on an Nvidia GTX 260.
This work owes a lot to Hendrik Riedmann from IAG, Uni Stuttgart who wrote the initial version of the Navier-Stokes operator in hedge.
(Btw: did you notice how the movie cleverly avoids the typical criticism of being "CFD"--colorful fluid dynamics? :-)
Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih and I have recently submitted an article that explains how PyCUDA allows the user to do run-time code generation ("RTCG"), and how that is an enormous boon to implementation efforts of most high-performance codes. Among many other things, PyCUDA also underlies our efforts to bring discontinuous Galerkin PDE solvers onto the GPU.
Get it while it's hot: arXiv, Brown SC
Update: Fixed arXiv link.
High-performance scientific computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), and PyCUDA, an open-source toolkit that supports this technique.
In introducing PyCUDA, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. It is further observed that, compared to competing techniques, the effort required to create codes using run-time code generation with PyCUDA grows more gently in response to growing needs. The concept of RTCG is simple and easily implemented using existing, robust tools. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
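To make the idea of run-time code generation concrete, here is a minimal sketch, not taken from the article. It builds CUDA C source as a Python string at run time, with the data type, unroll count, and scaling factor baked in as constants. The kernel name `scale_add`, the template, and both helper functions are hypothetical and chosen only for illustration. The optional compile step uses PyCUDA's `SourceModule` and needs a CUDA-capable GPU, so it is kept in a separate function that is not called here.

```python
from string import Template

# Hypothetical kernel template: the placeholders are filled in at run time.
KERNEL_TEMPLATE = Template("""
__global__ void scale_add($dtype *dest, const $dtype *a, const $dtype *b)
{
  const int base = (blockIdx.x * blockDim.x + threadIdx.x) * $unroll;
$body
}
""")


def generate_kernel(dtype="float", unroll=4, factor=2.0):
    """Return CUDA C source with the loop unrolled and constants hard-coded."""
    # Single-precision literals need an 'f' suffix in CUDA C.
    suffix = "f" if dtype == "float" else ""
    lines = [
        f"  dest[base + {i}] = {factor!r}{suffix} * a[base + {i}] + b[base + {i}];"
        for i in range(unroll)
    ]
    return KERNEL_TEMPLATE.substitute(
        dtype=dtype, unroll=unroll, body="\n".join(lines)
    )


def compile_kernel(source):
    """Compile generated source with PyCUDA (requires PyCUDA and a GPU)."""
    # Imported lazily so that the generator itself runs without PyCUDA.
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    from pycuda.compiler import SourceModule

    return SourceModule(source).get_function("scale_add")


if __name__ == "__main__":
    # Print one generated variant; no GPU is needed for this step.
    print(generate_kernel(dtype="float", unroll=2, factor=2.0))
```

Because the source is ordinary text, a script could generate several variants (different unroll counts, say), time each on the actual hardware, and keep the fastest one.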
This past week, I had the honor of presenting a talk on PyCUDA at Nvidia's inaugural GPU Technology Conference.

Please click the following link to view the slides: pycuda-nvidia.pdf.
Update: Nvidia has posted a recording of the session that you may watch or download.
Update 2: Giancarlo Colasante has transcoded the above video into just 16 MB. You may download the resulting video here.
The SciPy'09 conference ended less than a week ago. At the invitation of the SciPy'09 organizers (especially Fernando Perez), Nicolas Pinto gave a talk in the Advanced Tutorials track on how to use PyCUDA to do GPU scripting.
First, I would like to use this opportunity to publicly thank Nicolas for all the work and time he put into making this tutorial a reality. Second, I would like to point out the video of his session, which you can watch below:
Nicolas Pinto at MIT was nice enough to invite me over to give a talk in their CUDA class. I made a bunch of slides that I think are of general interest to anyone interested in PyCUDA. You can find them here.
Tim Warburton, Jeff Bridge, my advisor Jan Hesthaven and I have recently submitted an article detailing our efforts to accelerate Discontinuous Galerkin computations by using Nvidia CUDA GPUs. DG seems to be a good fit for these machines.
Get it while it's hot: arXiv, Brown SC reports, JCP
Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. Lately, another property of DG has been growing in importance: The majority of a DG operator is applied in an element-local way, with weak penalty-based element-to-element coupling.
The resulting locality in memory access is one of the factors that enables DG to run on off-the-shelf, massively parallel graphics processors (GPUs). In addition, DG's high-order nature lets it require fewer data points per represented wavelength and hence fewer memory accesses, in exchange for higher arithmetic intensity. Both of these factors work significantly in favor of a GPU implementation of DG.
Using a single US$400 Nvidia GTX 280 GPU, we accelerate a solver for Maxwell's equations on a general 3D unstructured grid by a factor of 40 to 60 relative to a serial computation on a current-generation CPU. In many cases, our algorithms exhibit full use of the device's available memory bandwidth. Example computations achieve and surpass 200 gigaflops/s of net application-level floating point work.
In this article, we describe and derive the techniques used to reach this level of performance. In addition, we present comprehensive data on the accuracy and runtime behavior of the method.
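The trade-off between memory accesses and arithmetic intensity mentioned in the abstract can be illustrated with a rough back-of-the-envelope model. This is a simplified sketch, not the article's performance analysis. It assumes tetrahedral elements with the standard nodal count of (N+1)(N+2)(N+3)/6 points at polynomial order N, a dense element-local matrix that is shared by all elements and stays in fast on-chip memory, and single-precision values. The function names are hypothetical.

```python
def nodes_per_tet(order):
    """Number of nodal points in a tetrahedron at the given polynomial order."""
    return (order + 1) * (order + 2) * (order + 3) // 6


def local_matvec_intensity(order, bytes_per_value=4):
    """Estimated flops per byte for one dense element-local matrix-vector product.

    Assumption: only the element's input and output vectors travel through
    global memory; the operator matrix is reused from on-chip storage.
    """
    n = nodes_per_tet(order)
    flops = 2 * n * n                      # one multiply and one add per matrix entry
    bytes_moved = 2 * n * bytes_per_value  # read n inputs, write n outputs
    return flops / bytes_moved


if __name__ == "__main__":
    for order in range(1, 8):
        print(order, nodes_per_tet(order), local_matvec_intensity(order))
```

Under these assumptions the intensity grows linearly with the number of nodes per element, so higher-order elements do more arithmetic for each byte fetched from global memory.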