After hours of experimenting ...
I wrote a couple of programs that take 1000 integers and put them into 16 bins (a very simple histogram).
The first program uses OpenCL on the GPU and the second program uses only the CPU (2.66 GHz Intel Core 2 Duo). The GPU is an NVIDIA GeForce 9400 with 256 MB of VRAM.
I added some clocking and the results were:
OpenCL: 0.0011 seconds
CPU: 0.000008 seconds
So the OpenCL program takes roughly 140 times longer to complete, which is about 2 orders of magnitude. (Note that I put the timestamps around the *execution* only, not around loading the kernel program, creating the 1000 values, etc.)
My question is: where is the value in OpenCL?
Is it that I have a bad GPU (my brother-in-law told me my GPU was not-so-great)? Is it that this can go on in the background while the more powerful CPU does other work in the foreground asynchronously? Am I breaking things up into workgroups incorrectly?
I tried breaking up the OpenCL work into a couple of workgroups, but I'm probably doing it wrong. I would think parallel execution would make it more efficient, but it takes about 100 microseconds longer when I break it up.
I'm attaching my files, but they only work on OS X 10.6 (Snow Leopard). The build commands are:
g++ -framework OpenCL -o test_opencl opencl_histo.cpp
g++ -framework OpenCL -o test_cpu cpu_histo.cpp