Quote of the day:
In Python 2.6, no one can hear you scream.
(source) This reminds me of that one time when I was debugging some odd interaction between threading and contexts in PyCUDA. Never again, hopefully. :)

Chambana, here I come! :)
Ever wanted to include a nice illustration from someone else's PDF in your own, while preserving vector happiness and not resorting to bitmap screenshots (yuck)? No longer a problem: Enter InteractiCrop!
Obviously: Make sure to request permission when appropriate, and credit your source! In fact, here's a TikZ snippet that will help you add a credit box to your slide:
% Note: the "current page" node needs "remember picture" and two LaTeX runs.
\newcommand{\creditto}[1]{%
\begin{tikzpicture}[remember picture,overlay]
\node [xshift=1cm,yshift=0.5cm]
at (current page.south west)
[font=\scriptsize,fill=gray!30,anchor=south west,opacity=0.5]
{#1};
\end{tikzpicture}%
}
This October I had the honor of presenting my work on using Python with GPUs at PyData NYC 2012. Here's a video of my talk:
There was also a panel discussion on Python+Parallel that I was a part of--here's the video of that:
Also be sure to check out all the videos of the other great talks to see what you've missed.
If this sounds interesting to you, also be sure to check out their next conference, PyData Silicon Valley 2013. And please (continue to!) support NumFocus and check out what Continuum are doing for big data in Python. They deserve a lot of credit for bringing the Python community together at events like PyData.
Like last year, I had the honor of being invited to present PyCUDA and PyOpenCL, along with a few examples of their use, to a great crowd at Nvidia's GPU Technology Conference 2010.

Please click the following link to view the slides: pycuda-pyopencl-gtc-2010.pdf.
Update: Nvidia has posted a recording of the session. There's also a full list of sessions, with many talks that are worth watching. In particular, I'd like to recommend the ones by Bryan Catanzaro on Copperhead, which is built on top of PyCUDA, and by Tim Warburton on all things GPU-based discontinuous Galerkin. Also check out the poster on Atomic Hedgehog by Cyrus Omar.
At the recent PyCon Quattro, which took place in early May in the beautiful Tuscan city of Florence, Fabrizio Milo gave a talk on PyCUDA entitled
PyCuda: Come sfruttare la potenza delle schede video nelle applicazioni python (PyCUDA: How to make use of the power of graphics cards in Python applications)
He made a set of rather nice slides (in English), which may be of interest. They are downloadable in PDF form at the link.
Thanks, Fabrizio, for taking the time to talk about PyCUDA!
Quite often, I hear complaints that coding for GPUs is difficult. In response to such comments, I believe that, for correct perspective, the discussion needs to be framed somewhat differently.
First of all, squeezing the last drop of performance out of modern CPUs is hard, too. Here's a nice article on cache effects by Igor Ostrovsky that explains some of the phenomena one needs to take into account and the surprising things that can happen.
It just appears to me that on the CPU, fewer people care about good performance, whereas for GPUs, you admit that you do care simply by your choice of architecture. Not caring about CPU performance is not entirely unreasonable--you are somewhat likely to get 'average' performance even without detailed analyses. On the GPU, on the other hand, carelessly written code is not as likely to perform well.
So, in summary, my belief is that CPUs and GPUs can be equally difficult to understand; it's just that the potential payoff of caring about performance is much greater on the GPU than on the CPU.
In my opinion, GPU computing is significant because I--as a grad student--can easily afford a machine that allows me to perform a simulation like the following in 40 minutes instead of a whole workday. That's why.
If you're curious, this shows the density of a vortex shedding flow behind a square obstacle at Re=100 and Ma=0.1. The attentive viewer may notice a sound wave at the beginning as the system settles from uniform flow to flow around the obstacle, as well as the passing of a gentle density "nudge" intended to throw the system off balance and accelerate the onset of shedding. This was computed using my Discontinuous Galerkin solver hedge on an Nvidia GTX 260.
This work owes a lot to Hendrik Riedmann from IAG, Uni Stuttgart who wrote the initial version of the Navier-Stokes operator in hedge.
(Btw: did you notice how the movie cleverly avoids the typical criticism of being "CFD"--colorful fluid dynamics? :-)
Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih and I have recently submitted an article that explains how PyCUDA allows the user to do run-time code generation ("RTCG"), and how that is an enormous boon to implementation efforts of most high-performance codes. Among many other things, PyCUDA also underlies our efforts to bring discontinuous Galerkin PDE solvers onto the GPU.
Get it while it's hot: arXiv, Brown SC
Update: Fixed arXiv link.
High-performance scientific computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), and PyCUDA, an open-source toolkit that supports this technique.
In introducing PyCUDA, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. It is further observed that, compared to competing techniques, the effort required to create codes using run-time code generation with PyCUDA grows more gently in response to growing needs. The concept of RTCG is simple and easily implemented using existing, robust tools. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.
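To make the RTCG idea from the abstract a little more concrete, here is a minimal sketch in Python. It is not taken from the article: the kernel name `scale`, the template, and the helpers `generate_scale_kernel` and `compile_kernel` are made up for illustration. The only PyCUDA calls it relies on are `pycuda.autoinit` and `pycuda.compiler.SourceModule`, and they are kept in a separate function that is not called here, since it would need a CUDA-capable GPU.

```python
from string import Template

# Hypothetical kernel template (illustration only). The placeholders
# ${dtype} and ${factor} are filled in at run time, so the value of
# `factor` ends up hard-coded in the generated CUDA C source.
KERNEL_TEMPLATE = Template("""
__global__ void scale(${dtype} *dest, const ${dtype} *src, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    dest[i] = ${factor} * src[i];
}
""")


def generate_scale_kernel(dtype="float", factor=2):
    """Return CUDA C source specialized to the given dtype and factor."""
    if dtype not in ("float", "double"):
        raise ValueError("unsupported dtype: %r" % (dtype,))
    return KERNEL_TEMPLATE.substitute(dtype=dtype, factor=repr(factor))


def compile_kernel(source):
    """Compile generated source with PyCUDA (assumes a working GPU setup)."""
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
    from pycuda.compiler import SourceModule

    return SourceModule(source).get_function("scale")


# Generate (but do not compile) a kernel specialized for doubles, factor 3.
source = generate_scale_kernel("double", 3)
print(source)
```

The point of the sketch is only that the kernel text is ordinary data in the scripting language: values known at run time (here the element type and the scale factor) can be baked into the source before it is handed to the compiler.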
This past week, I had the honor of presenting a talk on PyCUDA at Nvidia's inaugural GPU Technology Conference.

Please click the following link to view the slides: pycuda-nvidia.pdf.
Update: Nvidia has posted a recording of the session that you may watch or download.
Update 2: Giancarlo Colasante has transcoded the above video into just 16 MB. You may download the resulting video here.