Maximum Speed Needed



lixo1
20th March 2009, 14:09
Hi everybody,

I'm creating a mathematical application with a very long calculation phase. On Windows, using the VC++ Express compiler + .NET Framework, my application needs 3 minutes to finish (using a BackgroundWorker, which is something like QThread).

Well, now I have restarted my project using Qt4. I created a QThread, but unfortunately when I compile it in release mode the needed time is > 15 min (on both Linux and Windows).

So, is there any way to speed it up to the maximum, using all CPUs? Is there some qmake option that allows this kind of performance?

Thank you very much in advance for any kind of help,
Cheers,
Louis

caduel
20th March 2009, 14:20
Have you
* checked that you have turned on compiler optimizations?
* perhaps changed your code in other ways (is the algorithm really the same)?
* checked whether BackgroundWorker perhaps splits the work not into one thread but into a whole bunch of threads? (I.e. check if the other app has more than 2 threads running.)

wysota
20th March 2009, 14:22
Are you talking about compilation time or execution time?

lixo1
20th March 2009, 14:32
Hi caduel and wysota

>Are you talking about compilation time or execution time?
Execution time.

> * checked that you have turned on compiler optimizations?
Well, in my Qt4 project I put CONFIG+=release, and when it compiles it adds the -O2 option to the compiler.
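For reference, a minimal sketch of the relevant .pro lines (the commented-out flag is just an example of how extra flags could be added, not something I have actually set):

CONFIG += release
# qmake already passes -O2 to g++ for release builds; extra optimization
# flags could be appended like this:
# QMAKE_CXXFLAGS_RELEASE += -O3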

> * perhaps changed your code in other ways (is the algorithm really the same)?
The algorithm is the same (copy/paste).

> * checked whether BackgroundWorker perhaps splits the work not into one thread but into a whole bunch of threads? (I.e. check if the other app has more than 2 threads running.)

No, it's only 1 thread.

Thank you very much for all the attention.
Cheers,
Louis

seneca
20th March 2009, 14:47
Qt can also be compiled with Visual Studio; that should give you the same speed.

I'm not sure whether it can be compiled with Intel's C++ compiler, but that one would probably be even better than Microsoft's.

lixo1
20th March 2009, 15:10
Hi seneca,

> Qt can also be compiled with Visual Studio; that should give you the same speed.
Yes, that's the point: on Windows I'm using VC++ with Qt4 and on Linux g++, and I got 15 min!

Cheers,
Louis

lni
20th March 2009, 15:47
How do you compile the one that runs in 3 minutes, and how do you compile the other one that runs in 15 minutes?

You should be able to compare the two compilations and make adjustments. Make sure they use the same flags, same compiler, etc...

I had the same issue, but my program runs for 3 days, so it is harder to compare because I need to wait 3 days to get a result...

Hiker Hauk
20th March 2009, 16:04
Well, now I have restarted my project using Qt4. I created a QThread, but unfortunately when I compile it in release mode the needed time is > 15 min (on both Linux and Windows).


Is it implied here that if you compile with the Qt debug library it runs faster, in less than 15 min?



The algorithm is the same (copy/paste).


If you can copy and paste it conveniently, I suggest you test that piece of code with some other third-party threading library on Linux and see how long it takes.

If only .NET can run it that fast, we can do nothing about it. It is Microsoft's framework running on Microsoft's OS :P


Also, you can time the various parts of your code to see if a certain part takes very long and optimize it.

lixo1
20th March 2009, 17:06
Hi everybody,

> Is it implied here that if you compile with the Qt debug library it runs faster, in less than 15 min?
No, release mode is faster than debug mode.

Well, on Windows I created a VC++ project using TEMPLATE = vcapp; qmake created a VC++ project with all the same compiler options as my .NET application, except the /clr option (I will not link it against the .NET Framework). The result is good: now it takes 5 min instead of 15 min. Great, thank you.

The problem is on Linux (in my case Ubuntu 8.10) with g++: it takes 15 min. Is it possible that g++ cannot optimize it? In the past I worked on Windows using Dev-C++ (which uses g++/MinGW), and unfortunately this same program, without the graphical interface, took about 1 h to finish.

Thank you very much for your help!
Cheers,
Louis

fullmetalcoder
20th March 2009, 18:01
Did you think about turning on SSE (or other SIMD instruction sets available on your PC)? The compiler flags below are known to dramatically improve the performance of apps making intensive use of floating-point computations, which is probably your case:

QMAKE_CXXFLAGS += -march=native -mfpmath=sse

To figure out which extra instruction sets are supported by your CPU, you can examine the output of "cat /proc/cpuinfo". If the "flags" field contains at least "sse" you can safely use the above flags (if "sse2", "sse3", "ssse3" or "sse4" are present, it is even better).

Please note however that such binaries will not be portable.

Also, did you consider using QtConcurrent to distribute the computations across multiple threads?

lixo1
20th March 2009, 18:15
Hi fullmetalcoder,

Thank you for your help, but unfortunately I again got 15 min (my CPU flags contain sse and sse2).

It's very strange!
Now, about multiple threads: in my case the "thread" is very simple, it's just a set of nested for() loops like the code below:



for (i=0;i<12;i++) {compt[i][0]=0;}
for (ii=0;ii<=2;ii=ii+1)
{
for (i=0;i<12;i++) {compt[i][1]=0;}
for (jj=0;jj<=2;jj=jj+1)
{
for (i=0;i<12;i++) {compt[i][2]=0;}
for (kk=0;kk<=2;kk=kk+1)
{
for (i=0;i<12;i++) {compt[i][3]=0;}
for (ll=0;ll<=2;ll=ll+1)
{
for (i=0;i<12;i++) {compt[i][4]=0;}
for (mm=0;mm<=2;mm=mm+1)
{
for (i=0;i<12;i++) {compt[i][5]=0;}
for (nn=0;nn<=2;nn=nn+1)
{
for (i=0;i<12;i++) {compt[i][6]=0;}
for (oo=0;oo<=2;oo=oo+1)
{
for (i=0;i<12;i++) {compt[i][7]=0;}
for (pp=0;pp<=2;pp=pp+1)
{

for (i=0;i<12;i++) {compt[i][8]=0;}
for (psi=0;psi<2*M_PI;psi=psi+step)
{
for (i=0;i<12;i++) {compt[i][9]=0;}
for (phi=0;phi<2*M_PI;phi=phi+step)
{
for (i=0;i<12;i++) {compt[i][10]=0;}
for (teta=0;teta<M_PI;teta=teta+step)
{ //*V 90
compt[0][10]=compt[0][10] + sin(teta)*(0.5*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(0,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(0,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]));

compt[1][10]=compt[1][10] + sin(teta)*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp])
+ 2*sin(teta)*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]);

compt[2][10]=compt[2][10] + sin(teta)*0.5*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(1,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(0,jj,teta,phi,psi)*eulerb(1,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(0,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]);
//*H 90
compt[3][10]=compt[3][10] + sin(teta)*(0.5*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(0,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(0,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]));
compt[4][10]=compt[4][10] + sin(teta)*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(0,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp])
+ 0.5*sin(teta)*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(0,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(0,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]);
compt[5][10]=compt[5][10] + sin(teta)*0.5*step*step*step*(eulerb(2,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(1,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll]
+ eulerb(1,ii,teta,phi,psi)*eulerb(2,jj,teta,phi,psi)*eulerb(1,kk,teta,phi,psi)*eulerb(1,ll,teta,phi,psi)*gamma_a[ii][jj][kk][ll])
* (eulerb(2,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]
+ eulerb(1,mm,teta,phi,psi)*eulerb(2,nn,teta,phi,psi)*eulerb(1,oo,teta,phi,psi)*eulerb(1,pp,teta,phi,psi)*gamma_a[mm][nn][oo][pp]);
//Cancel process if stop requested
if(isTerminated){goto end;}

}
for (i=0;i<12;i++) {compt[i][9]=compt[i][9]+compt[i][10];}
}
for (i=0;i<12;i++) {compt[i][8]=compt[i][8]+compt[i][9];}
}
for (i=0;i<12;i++) {compt[i][7]=compt[i][7]+compt[i][8];}
}
for (i=0;i<12;i++) {compt[i][6]=compt[i][6]+compt[i][7];}
}
for (i=0;i<12;i++) {compt[i][5]=compt[i][5]+compt[i][6];}
}
for (i=0;i<12;i++) {compt[i][4]=compt[i][4]+compt[i][5];}
}
for (i=0;i<12;i++) {compt[i][3]=compt[i][3]+compt[i][4];}
}

//Increment ProgressBar + set number in %
if(DipCoefChecked){ percent++; emit progbar(percent);}
else{ percent +=2; emit progbar(percent); }

for (i=0;i<12;i++) {compt[i][2]=compt[i][2]+compt[i][3];}
}
for (i=0;i<12;i++) {compt[i][1]=compt[i][1]+compt[i][2];}
}
for (i=0;i<12;i++) {compt[i][0]=compt[i][0]+compt[i][1];}
}

fullmetalcoder
20th March 2009, 19:23
So you added the flags to your project and it did not change anything, right? Well, did you


* rebuild your whole app? (A change of compiler flags is not taken into account by make's dependency check, AFAIK.)
* check that these flags were actually passed to the compiler? (A quick look at the compile log would be enough.) There could be something as simple as a spelling error in the variable name or something similar.

Maybe I'm a bit touchy, but your code is among the worst formatted I have ever seen, and this has a dramatic effect on readability...

Another important point is that you use HUGE assignments in the inner loop, which are probably not very well optimized by GCC because it cannot make any assumptions about the values returned by sin() and eulerb() (specifically, it cannot guess that several calls with the exact same parameters will return the same result). Besides, most if not all of your assignments could be changed to use the += operator (for readability and to make sure the compiler optimizes as much as possible). You could also compute step*step*step only once, outside the loop (making it const could help optimizations as well).

Suggested refactoring (which I am not entirely doing for you ;)):


for (i=0;i<12;i++)
  { compt[i][0]=0; }

const double stepCube = step * step * step;

for (ii=0;ii<=2;ii=ii+1)
{
  for (i=0;i<12;i++)
    { compt[i][1]=0; }

  for (jj=0;jj<=2;jj=jj+1)
  {
    for (i=0;i<12;i++)
      { compt[i][2]=0; }

    for (kk=0;kk<=2;kk=kk+1)
    {
      for (i=0;i<12;i++)
        { compt[i][3]=0; }

      for (ll=0;ll<=2;ll=ll+1)
      {
        for (i=0;i<12;i++)
          { compt[i][4]=0; }

        const double gammaIJKL = gamma_a[ii][jj][kk][ll];
        for (mm=0;mm<=2;mm=mm+1)
        {
          for (i=0;i<12;i++)
            { compt[i][5]=0; }

          for (nn=0;nn<=2;nn=nn+1)
          {
            for (i=0;i<12;i++)
              { compt[i][6]=0; }

            for (oo=0;oo<=2;oo=oo+1)
            {
              for (i=0;i<12;i++)
                { compt[i][7]=0; }

              for (pp=0;pp<=2;pp=pp+1)
              {
                for (i=0;i<12;i++)
                  { compt[i][8]=0; }

                const double gammaMNOP = gamma_a[mm][nn][oo][pp];
                for (psi=0;psi<2*M_PI;psi=psi+step)
                {
                  for (i=0;i<12;i++)
                    { compt[i][9]=0; }

                  for (phi=0;phi<2*M_PI;phi=phi+step)
                  {
                    for (i=0;i<12;i++)
                      { compt[i][10]=0; }

                    for (teta=0;teta<M_PI;teta=teta+step)
                    { //*V 90
                      const double sinTheta = sin(teta);
                      const double stsc = sinTheta * stepCube;
                      const double euler1i = eulerb(1,ii,teta,phi,psi);
                      const double euler2i = eulerb(2,ii,teta,phi,psi);
                      const double euler1i2i = euler1i + euler2i;

                      const double euler0j = eulerb(0,jj,teta,phi,psi);
                      const double euler0k = eulerb(0,kk,teta,phi,psi);
                      const double euler0l = eulerb(0,ll,teta,phi,psi);

                      const double euler1j = eulerb(1,jj,teta,phi,psi);
                      const double euler1k = eulerb(1,kk,teta,phi,psi);
                      const double euler1l = eulerb(1,ll,teta,phi,psi);

                      const double euler2j = eulerb(2,jj,teta,phi,psi);
                      const double euler2k = eulerb(2,kk,teta,phi,psi);
                      const double euler2l = eulerb(2,ll,teta,phi,psi);

                      // euler1m2m, euler0n, euler0o and euler0p are built the
                      // same way from mm, nn, oo and pp
                      compt[0][10] += stsc * 0.5 *
                        euler1i2i*euler0j*euler0k*euler0l*gammaIJKL
                        *
                        euler1m2m*euler0n*euler0o*euler0p*gammaMNOP;
                      ...
Of course, since the step is fixed during the whole loop, you could (and probably should) precompute most of the values (sinTheta, eulerXXX, ...) and put them in a vector/hash (for fast lookup) to reuse them in the inner loop instead of recomputing them every time.
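For instance, a minimal sketch of the look-up idea for sin() alone (buildSinTable is a name I just made up):

#include <cmath>
#include <vector>

std::vector<double> buildSinTable(double step)
{
    const int nTeta = (int)(M_PI / step); // number of teta samples
    std::vector<double> table(nTeta);
    for (int t = 0; t < nTeta; ++t)
        table[t] = sin(t * step);         // sin() evaluated once per angle
    return table;
}

// the innermost loop then reads table[t] instead of calling sin(teta)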

With all these optimizations I'm willing to bet you can go below one minute, despite your algorithm being rather suboptimal.

wysota
20th March 2009, 19:28
I would do such computations in assembly if I were you... If you have a multicore or multi-CPU machine available, try parallelizing the computation. Furthermore, replace the for loops with something faster - the first loop can be replaced by memset(), for example. Also optimize your code to make proper use of your processor's data cache. Cache partial results - there is no need to recompute sin(teta) and all the other stuff all the time.
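A quick sketch of the memset() idea - note it assumes the values being cleared sit contiguously in memory (a plain double array); a strided column like compt[i][0] would first need its indices swapped:

#include <cstring>

double acc[12];
memset(acc, 0, sizeof(acc)); // replaces: for (i=0;i<12;i++) acc[i]=0;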

And lose the goto :)

lixo1
20th March 2009, 20:15
Hi fullmetalcoder,

Thank you very much for all the patience, help and examples; I will modify all the code.
I'm an amateur (a physicist, actually), so I'm very sorry for the chaotic readability.

wysota > And lose the goto
What do you recommend using instead?
I don't use "return;" because after the loop I delete my arrays.

Cheers,
Louis

wysota
20th March 2009, 21:27
wysota > And lose the goto
What do you recommend using instead?
I don't use "return;" because after the loop I delete my arrays.

There are numerous ways to do it. The easiest one is to allocate all the data on the stack and not on the heap (the latter meaning the "new" operator). Another is to delete the data in an external function call and then call return. You can also use the "break" keyword to bail out of a loop.
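For instance, a minimal sketch of the first option (compute() and isTerminated are just placeholders, not your actual code): with the data in a container that cleans itself up, a plain return becomes safe.

#include <vector>

void compute(const volatile bool &isTerminated)
{
    std::vector<double> compt(12 * 11, 0.0);  // freed automatically on return
    for (int iter = 0; iter < 1000000; ++iter) {
        if (isTerminated)
            return;                           // no goto, no manual delete[]
        compt[iter % compt.size()] += 1.0;    // stand-in for the real work
    }
}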

lni
20th March 2009, 21:50
You are computing the same thing a billion times :)

For instance, you put sin( teta ) in the innermost loop and repeatedly compute the same sin( teta ) in every single outer loop iteration. You can fill a sin( teta ) array, computing each value only once. sin( teta ) is a very slow instruction.

Also, you have step * step * step, which is another constant, but it is computed 3 trillion times :)

As for the function calls, don't call a function with the same arguments many times, such as eulerb( 2, ii, teta, phi, psi ). You can do:

double foo = eulerb( 2, ii, teta, phi, psi );

and then use foo to replace the repeated function calls.

Have fun

wysota
20th March 2009, 22:32
You can even use the "register" keyword to hint the compiler about most often used variables.

fullmetalcoder
21st March 2009, 08:29
I would do such computations in assembly if I were you...
You're going to scare him to death with such advice. Besides, well-written C code can perform pretty well and is a lot more maintainable and portable than assembly.


You can even use the "register" keyword to hint the compiler about most often used variables.
This would actually be a bad idea given the number of most used variables and the fact that modern compilers are typically smarter about guessing an optimal register allocation strategy than the user could ever be.

wysota
21st March 2009, 09:08
Besides, well-written C code can perform pretty well and is a lot more maintainable and portable than assembly.
I'm guessing portability is not much of a concern for him here, and it's obviously some formula that is unlikely to change, so the benefit of using assembly (or even NVIDIA's CUDA technology) is huge.


This would actually be a bad idea given the number of most used variables and the fact that modern compilers are typically smarter about guessing an optimal register allocation strategy than the user could ever be.

I would not agree here.

fullmetalcoder
21st March 2009, 09:42
I would not agree here.
Well, maybe some individuals are able to best the compiler regarding register allocation strategy, but then it is very likely that they are able to produce more optimized assembly code outright, which renders the register keyword useless.

AFAIK, careful use of the const and restrict keywords (with strict aliasing enabled for the latter to be of any use) gives pretty good hints to the compiler in such a situation, and register is unlikely to make a big difference.
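For illustration, a tiny sketch of the restrict idea using GCC's __restrict__ extension (the function itself is made up):

// promising the compiler that out and in never alias lets it keep values
// in registers across iterations instead of reloading them from memory
void scale(double *__restrict__ out, const double *__restrict__ in,
           double factor, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}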


I'm guessing portability is not much of a concern for him here, and it's obviously some formula that is unlikely to change, so the benefit of using assembly (or even NVIDIA's CUDA technology) is huge.

True enough, but given the difficulty he encountered writing it in C, I doubt that he will be tempted to learn assembly. CUDA might be a viable alternative as it does not require assembly AFAIK, but its biggest advantage is that it offers massive parallelization, and the algorithm above looks like an iterative approximation to me, so I'm not sure parallelizing it will be easy...

lixo1
21st March 2009, 10:04
Hi guys,

This thread is a well of culture!
Thank you very much for all this high-level information. I would just like to ask you, if possible, to link some web tutorials (or simple book titles) about the different strategies, because it's very difficult to find information about these technologies if you don't work with them.

Just one thing: my application needs to be redistributable on any Windows/Linux/Mac machine.

Thank you very much for this great thread.
Cheers,
Louis

fullmetalcoder
21st March 2009, 10:12
NVIDIA CUDA: http://www.nvidia.com/object/cuda_home.html

Some blog posts on the restrict keyword (http://www.cellperformance.com/mike_acton/2006/05/demystifying_the_restrict_keyw.html) and strict aliasing (http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html).

I can't seem to find a link to that blog discussing the use of the register keyword so you'll have to search yourself.

lixo1
21st March 2009, 10:26
Great thank you very much.

wysota
21st March 2009, 12:28
I would also point you to QtConcurrent, which is Qt's multi-core solution for parallelization.

lni
21st March 2009, 19:24
I think you should be able to speed it up 100 times in C code if you don't calculate the same data repeatedly. Besides the sin( teta ) optimization, you can build an eulerb array of eulerb[ 2 x 2 x nteta x nphi x npsi ], which means the total number of calls to eulerb is 2 x 2 x nteta x nphi x npsi.

Since your phi/psi/teta ranges are the same size, with dimension 2*M_PI/step, the total size for eulerb is 2 x 2 x (2*M_PI/step)^3. If your step is 0.00314 you will have an array size of 100 in teta, phi and psi, and the total memory needed is 122 MB; this amount of memory is very little for a modern computer. With step = 0.00157 you would need 976 MB of memory, but that is way oversampled.

Going to assembly or CUDA/GPU is not that easy...

FORTRAN vectorization can be useful too, and it is much easier than CUDA.
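A rough sketch of the eulerb table (buildEulerbTable is a made-up name; the first two indices run over 0..2 here because that is how eulerb is called in the code above):

#include <vector>

double eulerb(int a, int b, double teta, double phi, double psi); // existing function

std::vector<double> buildEulerbTable(double step, int nAngle)
{
    // layout: [a][b][teta][phi][psi], flattened into one contiguous block
    std::vector<double> table((size_t)3 * 3 * nAngle * nAngle * nAngle);
    size_t idx = 0;
    for (int a = 0; a < 3; ++a)
      for (int b = 0; b < 3; ++b)
        for (int t = 0; t < nAngle; ++t)
          for (int p = 0; p < nAngle; ++p)
            for (int s = 0; s < nAngle; ++s)
              table[idx++] = eulerb(a, b, t * step, p * step, s * step); // computed exactly once
    return table;
}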

wysota
21st March 2009, 22:53
Do you know what "CUDA" means in Polish? It means "miracles". Funny name, right? And I guess it's true you can do miracles with it ;)

lixo1
21st March 2009, 23:22
Hi guys,

I optimized my code, and now it takes 3 min :-) on both Windows and Linux. Thank you.

About CUDA: it seems very interesting and good, but unfortunately my PC has an NVIDIA GeForce 7400, so it's not compatible.

About QtConcurrent: it seems to be a great implementation, but I cannot understand exactly how to use it. There are examples of its use, but they don't show how to employ QtConcurrent in a mathematical (in general, calculation-heavy) application. Do you have a tutorial, or could you explain to me how to use it?

Once again, thank you very much for all the help.
Cheers,
Louis

lni
22nd March 2009, 03:35
Do you know what "CUDA" means in Polish? It means "miracles". Funny name, right? And I guess it's true you can do miracles with it ;)

A GPU with CUDA is about 10 to 20 times faster than a CPU for a suitably sized data set. However, I am currently having trouble with it: I can't monitor the GPU using Linux's "top" command.

I have a job that runs very slowly on the GPU; the CPU's load is about 30% most of the time. I have been pushing NVIDIA to provide a "top"-like command to monitor the GPU in real time, but they said most GPU jobs only take a few milliseconds, so it can't be monitored. I disagree; my job at least takes minutes to hours on the GPU.

wysota
22nd March 2009, 09:11
About CUDA: it seems very interesting and good, but unfortunately my PC has an NVIDIA GeForce 7400, so it's not compatible.
That's true, but you can have more than one version of the algorithm implemented, so that when you run the application on a machine that supports CUDA you get the result faster.


About QtConcurrent: it seems to be a great implementation, but I cannot understand exactly how to use it. There are examples of its use, but they don't show how to employ QtConcurrent in a mathematical (in general, calculation-heavy) application. Do you have a tutorial, or could you explain to me how to use it?

It allows you to operate concurrently on different data - either doing the same calculations (with map() or filter()) or different ones (with QRunnable).


I have a job that runs very slowly on the GPU; the CPU's load is about 30% most of the time. I have been pushing NVIDIA to provide a "top"-like command to monitor the GPU in real time, but they said most GPU jobs only take a few milliseconds, so it can't be monitored. I disagree; my job at least takes minutes to hours on the GPU.

Maybe you should provide them with examples of such jobs to convince them?

lixo1
22nd March 2009, 11:58
Ok thank you,

QtConcurrent:
So if I understand well, I need to create a QRunnable function, that is, copy my QThread::run() into this new function, then apply something like
QFuture<void> calculation = QtConcurrent::map(...) ??????
Is that correct?

Thank you another time,
Cheers,
Louis

wysota
22nd March 2009, 12:57
No, runnables and map() are different mechanisms. In a runnable you can place whatever you want and ask QtConcurrent to run such a task for you. With map() you divide your problem into subproblems where each subproblem does exactly the same thing (same algorithm) but on different data. Then you ask QtConcurrent to run the calculations for you and return the result when it's ready.

For instance if you had such a task to perform:

int array[10];
struct Input {
    int base;
    uint exp;
};
Input input[10];
for(int i=0;i<10;i++){
    input[i].base = qrand() % 1000;
    input[i].exp = qrand() % 100;
}
for(int i=0;i<10;i++){
    array[i] = 1;
    for(int j = 0; j < input[i].exp; j++)
        array[i] = array[i]*input[i].base;
}
which is a naive way to calculate random powers of random values 10 times in sequence, you could parallelize this as:


QList<Input> input;
for(int i=0;i<10;i++){
    Input inp;
    inp.base = qrand() % ...
    inp.exp = ...
    input << inp;
}

QList<int> result = QtConcurrent::blockingMapped(input, myFunc);
with myFunc() defined as:

int myFunc(const Input &inp){
    int result = 1;
    for(int i = 0; i<inp.exp; i++)
        result = result * inp.base;
    return result;
}

The difference is that the first method runs iteratively on a single core while the second one runs in parallel on all computing units of your system; hence, in the second case you get the result much faster and with better resource usage.

songyuncen
22nd March 2009, 12:58
Here are my suggestions, in order; I hope they help.
1. Using a better algorithm is the first choice, such as using a look-up table to strike a balance between time and space.
2. Parallelism:
(1) Parallelism across threads: I think OpenMP is a much easier solution than other multithreading techniques (see the sketch after this list); however, it is not supported in the Express version of Visual C++. It is a set of compiler directives which makes your computation parallel automatically.
(2) Parallelism at the instruction level: it is very hard for the compiler to find the parallelism in your computation even if the SIMD flag has been turned on, so the assembly code from these compilers mostly just uses SIMD instructions to accomplish sequential tasks. You need to write the main part of your computation in assembly language, or in some platform-independent wrapper, all by yourself. It can speed up your computation more than the multithreading technique.
(3) CUDA/OpenCL, or what may be called heterogeneous parallel computation: it's really fast, but it is hard to learn with little knowledge of graphics computing architecture, and you need an NVIDIA card in your computer, which makes your software harder to deploy freely.
3. Embedded computation: ASIC/FPGA or DSP processors. OK, I'm going too far now ...
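A minimal OpenMP sketch for point 2(1) (not code from this thread; it assumes g++ with -fopenmp):

#include <cmath>

double integrate(double step)
{
    const int n = (int)(M_PI / step);
    double sum = 0.0;
    // the pragma spreads the iterations over all cores; reduction(+:sum)
    // gives each thread a private accumulator that is combined at the end
    #pragma omp parallel for reduction(+:sum)
    for (int t = 0; t < n; ++t)
        sum += sin(t * step) * step;
    return sum;
}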

lixo1
22nd March 2009, 13:33
Hi wysota,

Thank you very much for the example; now I understand what I have to do.
To start, I will implement QtConcurrent and verify the time needed.
Thanks again,
Cheers,
Louis

wysota
22nd March 2009, 13:47
Just remember that to employ SIMD correctly, the result of one computation can't influence the results of other computations. In other words, one iteration can't depend on data modified by some other iteration or on the order of iterations, or the overall result has to be independent of the order of the "chunks" of the calculation.
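For example, a sketch of an order-independent reduction (the names are illustrative): each chunk's partial sum depends only on its own data, so the chunks can be computed in any order - or in parallel - and combined afterwards.

#include <numeric>
#include <vector>

double chunkedSum(const std::vector< std::vector<double> > &chunks)
{
    std::vector<double> partial(chunks.size(), 0.0);
    for (size_t c = 0; c < chunks.size(); ++c)  // parallelizable: no chunk reads another's data
        partial[c] = std::accumulate(chunks[c].begin(), chunks[c].end(), 0.0);
    return std::accumulate(partial.begin(), partial.end(), 0.0); // combine once at the end
}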

lixo1
22nd March 2009, 20:39
Hi wysota,

Just one last question:
I tried QtConcurrent::run() and it works (but the speed is the same as with a normal QThread - OK, that's expected). Now I would like to use blockingMapped, but I cannot set it up correctly for my function inside my class.
For example:


QFuture<void> result = QtConcurrent::run(this,&MyClass::calc1);
//but it doesn't accept the following
QFuture<void> result = QtConcurrent::blockingMapped(this,mysequence,&MyClass::calc2);

Thank you very much for all help,
Cheers,
Louis

wysota
22nd March 2009, 22:27
If you want to use a method, it either has to be declared static or it has to be a method of the class that serves as the item type of the data container.
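For instance, a hedged sketch of a call that satisfies this requirement (Item, calc2 and the usage lines are assumptions, not your actual code):

#include <QList>
#include <QtConcurrentMap>

struct Item { double value; };

// a free function or static method whose parameter is the sequence's item type
static double calc2(const Item &it)
{
    return it.value * it.value; // stand-in for the real calculation
}

// usage:
// QList<Item> mysequence = ...;
// QList<double> results = QtConcurrent::blockingMapped(mysequence, calc2);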

QtConcurrent::run() uses a runnable so you won't gain any speed on it. The second statement is surely invalid. The first argument (this) shouldn't be there - you should pass an input sequence and a function pointer taking items of the sequence and returning the type of the result sequence. And blockingMapped() doesn't return a QFuture by the way. I suggest you take a look at the docs.