
View Full Version : does a Qt application run faster on a windows 64 bit machine?



qt_gotcha
7th January 2011, 13:54
I am developing a soil erosion model in Qt (mostly the interface is Qt; the model itself uses some Qt but is mostly plain C or C++), http://sourceforge.net/projects/lisem/. Now a user asks whether the application is faster on a 64-bit machine. This type of model typically simulates the water balance for gridded maps of 500x500 cells and uses some 400 steps to do that. So it's really number crunching.
I have no idea if 64-bit is faster, and I can't test it. Should I compile in a certain way?

thanks

wysota
7th January 2011, 14:12
If the application is built for a 64-bit target rather than a 32-bit one, the compiler is likely to optimize some things, so "number crunching" should be a bit faster (up to 30% perhaps, but it depends on the case). Qt, however, is not much about number crunching, so it will not gain much here, and it will certainly not influence any part of your application that doesn't use Qt classes.

pkohut
7th January 2011, 16:07
Since the code is heavy on loops, reducing the number of "if" statements will provide a nice speed boost. For example, "lisInfiltration.cpp" has these lines of code (the same applies to other files as well):


if (SwitchBuffers && !SwitchSedtrap)
    if (BufferID->Drc > 0 && BufferVol->Drc > 0)
        WH->Drc = 0;

Because of the short-circuiting && operator, those two if lines compile to four conditional branches. Using the bitwise & operator can reduce this to two branches without changing the structure of the code:


if (SwitchBuffers & !SwitchSedtrap)
    if (BufferID->Drc > 0 & BufferVol->Drc > 0)
        WH->Drc = 0;

This could further be reduced to a single if statement ...

if (SwitchBuffers & !SwitchSedtrap & BufferID->Drc > 0 & BufferVol->Drc > 0)
But careful observation and profiling are needed to make sure this doesn't adversely affect the speed and run slower than the original if statements. The reason is that accessing memory outside of the L1 and L2 caches is slow: if BufferID is not already in those caches, it will be brought into the cache unnecessarily whenever (SwitchBuffers & !SwitchSedtrap) is false.

The second example would be my preferred choice.

Next, to help the compiler, you should define constant variables as const. This tells the compiler that the variable is not supposed to change, so it can apply some optimizations. For example, this


for (...) {
    double Perim, Radius, Area, beta = 0.6;
    double _23 = 2.0/3.0;
    double beta1 = 1/beta;
    double wh = ChannelWH->Drc;
    double FW = ChannelWidth->Drc;
    double grad = sqrt(ChannelGrad->Drc);
    double dw = 0.5*(ChannelWidthUpDX->Drc - FW); // extra width when non-rectangular
}
... can be changed to this.


for (...) {
    double Perim, Radius, Area;
    const double beta = 0.6;
    const double _23 = 2.0/3.0;
    const double beta1 = 1/beta;
    double wh = ChannelWH->Drc;
    double FW = ChannelWidth->Drc;
    double grad = sqrt(ChannelGrad->Drc);
    double dw = 0.5*(ChannelWidthUpDX->Drc - FW); // extra width when non-rectangular
}
Again, this doesn't really change the structure of what you've already got; it just makes your intent clear and helps the compiler along the way.

And finally, this one does change the code structure, but it eliminates an extra if statement when certain conditions are met. Change this ...


if (Perim > 0)
    Radius = Area/Perim;
else
    Radius = 0;

ChannelAlpha->Drc = pow(ChannelN->Drc/grad * powl(Perim, _23), beta);

if (ChannelAlpha->Drc > 0)
    ChannelQ->Drc = pow(Area/ChannelAlpha->Drc, beta1);
else
    ChannelQ->Drc = 0;
to this ... (preloading)


Radius = 0;
if (Perim > 0)
    Radius = Area/Perim;

ChannelAlpha->Drc = pow(ChannelN->Drc/grad * powl(Perim, _23), beta);

ChannelQ->Drc = 0;
if (ChannelAlpha->Drc > 0)
    ChannelQ->Drc = pow(Area/ChannelAlpha->Drc, beta1);

Here each variable is preloaded with one of the two possible values. You want to preload either the value most likely to occur or the one least expensive to load. Here preloading with 0 is the least expensive (an "xor var, var" instruction), even though the ChannelAlpha->Drc branch is the one most likely to be taken.

HTH

wysota
7th January 2011, 18:21
I don't really agree with the last part. Since Perim is much, much more likely to be non-zero, preloading a 0 radius is probably a waste of work. I don't know if the other statement has a similar probability, but it likely does. Besides, this is all academic talk; most of what is described should (and probably will) be done by the compiler anyway. The only thing that can make a real difference is the use of "const" with const values and const methods. The compiler could then optimize some of it down to an immediate value and skip memory reads completely.

But all these optimizations are nothing compared to implementing your calculations in OpenCL. Then it doesn't matter whether your CPU is 32- or 64-bit.
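
To give an idea of what that looks like, here is a minimal sketch of a per-cell OpenCL kernel (not LISEM code; the kernel name, buffers, and the update rule are made up for illustration). Each work-item updates one cell of a flattened grid, so all cells run in parallel instead of being looped over serially:

__kernel void update_cells(__global float *wh,
                           __global const float *mask,
                           const int n)
{
    int i = get_global_id(0);        // one work-item per grid cell
    if (i < n && mask[i] != 0.0f)    // skip missing-value cells
        wh[i] = 0.5f * wh[i];        // placeholder per-cell update
}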

pkohut
7th January 2011, 20:04
I don't really agree with the last part. Since Perim is much, much more likely to be non-zero, preloading a 0 radius is probably a waste of work. I don't know if the other statement has a similar probability, but it likely does.
In this case the extra work amounts to the generation of an "xor var, var" instruction or some other equally fast zeroing of the variable/register. If preloading the variable and then dropping into a single if statement is less work than running an if-then-else compare, then it is a reduction in work. Of course the only way to know for sure is to profile and/or look at the assembly output generated by the compiler.

Simple probabilities will give a good idea of what to expect, but branch misprediction penalties are not equal or weighted the same across machine architectures. Measurements must be done.

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

Besides, this is all academic talk; most of what is described should (and probably will) be done by the compiler anyway.
I agree, to a point. The best bet is to make some changes and profile. Because the OP's code is so loop-heavy, the changes should be measurable.

For instance, I wrote a kd-tree algorithm and then optimized it with every trick I could think of, timing it after each change. One of the final optimizations I did was if-statement simplification and preloading. I don't recall how much time preloading saved, but I left it in the code, so it must have saved something (I will check after logging into Windows and recompiling). Together, those changes saved 226 seconds from a prior runtime of 1178 seconds (from a base of 1,000,000 points, find all sets of points within 10,000 units of each other).
http://www.theswamp.org/index.php?topic=32874.msg401784#msg401784 (see charts)


But all these optimizations are nothing compared to implementing your calculations in OpenCL. Then it doesn't matter whether your CPU is 32- or 64-bit.
True enough; however, this requires a complete rewrite of the code and data structures, which might be beyond the abilities of the OP.

SixDegrees
7th January 2011, 20:34
Because of the short-circuiting && operator, those two if lines compile to four conditional branches.

Hard to say. Most compilers perform short-circuit evaluation of such constructs; as soon as one term fails, the remaining terms are skipped and the body is not entered, so putting your least likely cases first can often yield a big payoff.

Overall, the machine code generated by compilers is often very difficult to predict from the source statements, particularly if optimizations are cranked up. The only real way to be sure is to do timing runs; a decent profiler can be a big help here. Even then, execution can be greatly affected by the data being processed (as noted above), so you have to make sure your inputs are representative.
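
For quick timing runs in a Qt application, QElapsedTimer (available since Qt 4.7) is enough. A minimal sketch, where runOneModelStep() is a hypothetical stand-in for whatever loop is being measured:

#include <QElapsedTimer>
#include <QDebug>

void runOneModelStep(); // stand-in for the code under test

void timeOneRun()
{
    QElapsedTimer timer;
    timer.start();
    runOneModelStep();
    qDebug() << "elapsed:" << timer.elapsed() << "ms";
}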

nroberts
7th January 2011, 21:36
Most normal applications will be made slower by compiling them 64-bit, due to the larger memory footprint and larger data types. You probably won't notice, but it's certainly not bound to save you any time. Furthermore, most applications are pretty far from being 64-bit compatible, and the kind of bugs that produces are nasty little f'ers. You could have a lot of work ahead if you've got to convert.

Of course most applications will run faster on a 64-bit machine, simply because the raw processing power of 64-bit processors is greater. Not so much because they're 64-bit, but because they're just plain made to be faster. But unless you need the 64-bit stuff, it's hardly worth the trouble to mess with.

I did some testing a long time ago with a Chinese chess engine I wrote that used bitboards; it seemed like the perfect thing for comparing 32- vs. 64-bit. I tried 64-bit vs. 32-bit builds and also constructed my own bitmap<> that used SSE registers as a second variant. The fastest, iirc, was the 64-bit build with the standard bitmap implementation, but not by so much that I felt I was getting major value. A standard chess board might have shown better results. I was especially disappointed in the SSE bitmap: it was actually slower, even though the whole bitboard fit in the one register (Chinese chess has an extra row and column, so the board isn't a nice round 64 squares).

The Qt bits are almost sure to be no faster, if not slower. What you should be checking is your own underlying model code. Are you performing heavy calculations that get a major boost from 64-bit? If not, then f'ck it.

pkohut
7th January 2011, 22:23
Just did the kd-tree tests using a dataset of 1,000,000 points. The FindNearest routine is the most heavily used routine in the app. Because of L1 and L2 cache misses, preloading the left or right node (leaf) is worse than using the if-then-else case. Since I brought it up before, I figured I'd better show results (preloading works in some conditions, but measurements have to be done to weed out adverse changes).
edit: the tests were performed on already highly optimized code. This thread http://www.theswamp.org/index.php?topic=32874.msg400856#msg400856 and the one after it talk in detail about the changes made a few months back, which resulted in an 18% speed boost.

The test code connects all points within R(ange) of each other.

31.81126 seconds for query range of 1000, of 1,000,000 points

pNode1 = pNode->left;
if (dx > 0.0)
    pNode1 = pNode->right;

31.53121 seconds for query range of 1000, of 1,000,000 points

if (dx > 0.0)
    pNode1 = pNode->right;
else
    pNode1 = pNode->left;

Current code optimization: eliminate the if statement altogether.
30.50571 seconds for query range of 1000, of 1,000,000 points


int bDxGreaterThanZero = dx > 0.0;
KDNODE<T, DIM> * pNode1 = (KDNODE<T, DIM> *) ((size_t *) &pNode->left)[bDxGreaterThanZero];

This works because the left and right nodes are purposely paired next to each other in the KDNODE structure.
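
For clarity, the layout assumption looks something like this (a sketch, not the actual KDNODE definition):

template <typename T, int DIM>
struct KDNODE {
    KDNODE *left;   // index 0 when the pair is viewed as a 2-element array
    KDNODE *right;  // index 1; must immediately follow left
    T pt[DIM];      // remaining members can live anywhere after the pair
};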


Hard to say. Most compilers perform short-circuit evaluation of such constructs; as soon as one term fails, the remaining terms are skipped and the body is not entered, so putting your least likely cases first can often yield a big payoff.

From the latest divide-and-conquer code in that swamp thread,

if (fabs(pts[high].x - pts[low].x) <= distance &&
    fabs(pts[high].y - pts[low].y) <= distance &&
    pts[low].distanceTo(pts[high]) <= distance) {
is slower than

if ((fabs(pts[high].x - pts[low].x) <= distance) &
    (fabs(pts[high].y - pts[low].y) <= distance) &&
    pts[low].distanceTo(pts[high]) <= distance) {

The above changes to the code shaved 3.5% off the previous runtime.
http://www.theswamp.org/index.php?topic=32874.msg400856#msg400856
p.s. the thread after that one discusses preloading in detail; there I describe the reduced runtimes I got from preloading.

wysota
7th January 2011, 23:48
Simple probabilities will give a good idea of what to expect, but branch misprediction penalties are not equal or weighted the same across machine architectures. Measurements must be done.
What I meant doesn't require branch prediction. Handling division by zero is an uncommon case: it is very rare for a useful algorithm to divide by zero in most iterations, so pure analysis of the algorithm should determine whether it is a rare case (or not).


True enough, however this requires a complete rewrite of the code and data structures which might be beyond the abilities of the OP.
Still, the extra speed makes such an approach worth considering. And even if not, it's a good guess (I haven't seen the exact code, so that's just a guess) that some parts of the current algorithm can be parallelized (e.g. by using QtConcurrent), especially if the calculations are iteration-based.


Hard to say. Most compilers perform short-circuit evaluation of such constructs; as soon as one term fails, the remaining terms are skipped and the body is not entered, so putting your least likely cases first can often yield a big payoff.
"Most compilers" or "all compilers"? Isn't it part of the standard that the very first failed component stops the evaluation of the whole condition? The same goes for OR. If only "most" compilers did it, the code would not be deterministic.

qt_gotcha
9th January 2011, 16:13
dear all, thanks a lot for all your answers and even helping me with my code. I'll implement some of your advice and start some testing

qt_gotcha
11th January 2011, 14:20
if you are still following this, by far the most operations are of the kind:
for r = 0 to nrrows
    for c = 0 to nrcols
        do something

are these for loops expensive, and can it be done more efficiently?
thanks

edit: I see there is tons of this on the internet, I'll start reading...

pkohut
11th January 2011, 16:40
if you are still following this, by far the most operations are of the kind:
for r = 0 to nrrows
    for c = 0 to nrcols
        do something

are these for loops expensive, and can it be done more efficiently?
thanks

edit: I see there is tons of this on the internet, I'll start reading...

>>snippet from a previous message
This type of model typically simulates the water balance for gridded maps of 500x500 cells and uses some 400 steps to do that<<

It's the work done inside the loops that gets expensive, with the possibility that each of the 400 steps is processed serially. Without changing the core of your application, apply const to all the variables that should be/are constant, especially inside those loops.

How long does it take to process a typical model?

pkohut
11th January 2011, 18:46
In the code you have these macros defined


#define Drc Data[r][c]
#define DrcOutlet Data[r_outlet][c_outlet]
#define MV(r,c) IS_MV_REAL4(&Mask->Data[r][c])
#define FOR_ROW_COL_MV for (int r = 0; r < nrRows; r++)\
    for (int c = 0; c < nrCols; c++)\
        if (!IS_MV_REAL4(&Mask->Data[r][c]))

#define FOR_ROW_COL_MV_CH for (int r = 0; r < nrRows; r++)\
    for (int c = 0; c < nrCols; c++)\
        if (!IS_MV_REAL4(&ChannelMask->Data[r][c]))
Normally I'm not a big fan of macros, but in this circumstance they work well.

With the exception of any code that relies on variables r and c directly (it looked like only a few places), the macros can be changed and the loops simplified (flattened). How much of a speed-up this gives is unknown; I'd change a few functions and benchmark to see if the effort is worth it.



// compute the row and col index from the flat index idx and nCols
void RowColFromIndex(int & row, int & col, int idx, int nCols)
{
    row = idx / nCols;
    col = idx % nCols;
}

// return the flat index computed from row, col, and nCols
int IndexFromRowCol(int row, int col, int nCols)
{
    return row * nCols + col;
}

#define DrcIdx Data[rIdx][cIdx]
// the row/col computation is folded into the if condition via the comma
// operator so each macro can still be followed by a braced block
#define FOR_ROW_COL_MV for (int idx = 0, rIdx = 0, cIdx = 0; idx < nrRows * nrCols; ++idx)\
    if (RowColFromIndex(rIdx, cIdx, idx, nrCols), !IS_MV_REAL4(&Mask->Data[rIdx][cIdx]))

#define FOR_ROW_COL_MV_CH for (int idx = 0, rIdx = 0, cIdx = 0; idx < nrRows * nrCols; ++idx)\
    if (RowColFromIndex(rIdx, cIdx, idx, nrCols), !IS_MV_REAL4(&ChannelMask->Data[rIdx][cIdx]))

So basically, anywhere variables r and c are used directly, they need to be changed to rIdx and cIdx (using rIdx and cIdx helps find porting errors); see the sketch below.
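
A hypothetical before/after of a loop body (WH and the Drc macro as in the earlier snippets):

// before: r and c come from the nested loops
FOR_ROW_COL_MV {
    WH->Drc = 0;     // expands to WH->Data[r][c]
}

// after: rIdx and cIdx come from the flat index
FOR_ROW_COL_MV {
    WH->DrcIdx = 0;  // expands to WH->Data[rIdx][cIdx]
}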

wysota
11th January 2011, 19:50
All those loops can be easily parallelized. Those that don't depend on the order of calculations can be done with QtConcurrent or with a simple OpenCL kernel. Those that depend on the order of execution are a bit more tricky, but I'm sure they can be sped up as well. With OpenCL you can get a speed-up on the order of 10-50x. With QtConcurrent the speed-up depends on the number of CPU cores (so it's probably around 2-4x max). There is also the possibility of using the SIMD optimizations available in your CPU, such as SSE.
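
As a taste of the SSE option, a sketch (not LISEM code) that assumes the grids are flattened float arrays whose length is a multiple of 4:

#include <xmmintrin.h>  // SSE intrinsics

// add grid b into grid a, four cells per instruction
void addGrids(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          // load 4 floats (unaligned ok)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb)); // a[i..i+3] += b[i..i+3]
    }
}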

pkohut
11th January 2011, 21:10
All those loops can be easily parallelized. Those that don't depend on the order of calculations can be done with QtConcurrent or with a simple OpenCL kernel. Those that depend on the order of execution are a bit more tricky, but I'm sure they can be sped up as well. With OpenCL you can get a speed-up on the order of 10-50x. With QtConcurrent the speed-up depends on the number of CPU cores (so it's probably around 2-4x max). There is also the possibility of using the SIMD optimizations available in your CPU, such as SSE.

While technically true, there is no free lunch. "Easily parallelized" and "simple OpenCL kernel" again depend on the OP's capabilities and understanding of parallel programming issues.

wysota
11th January 2011, 21:38
While technically true, there is no free lunch. "Easily parallelized" and "simple OpenCL kernel" again depend on the OP's capabilities and understanding of parallel programming issues.

It's not that bad. A for loop such as the following:

for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
        doSomething(r, c);
    }
}
can be rewritten as:

QFutureSynchronizer<void> waiter;
for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
        waiter.addFuture(QtConcurrent::run(doSomething, r, c));
    }
}
waiter.waitForFinished(); // or do something else in the meantime

This doesn't require any special skills, and provided that doSomething() does more than just add two ints together, you will get a decent speed improvement on a multicore system. Of course there is a good chance the loop can be substituted with a flat call to QtConcurrent::mappedReduced(). OpenCL is more tricky, although I see the OP's calculations are mostly simple vector operations (additions and multiplications), so the code wouldn't need many changes.
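
For example, summing a per-cell quantity over a flattened grid (a sketch; cellRunoff and addValue are made-up stand-ins, not the OP's functions):

#include <QtConcurrentMap>
#include <QFuture>
#include <QVector>

double cellRunoff(const float &wh) { return 0.5 * wh; }   // map step (hypothetical)
void addValue(double &sum, const double &v) { sum += v; } // reduce step

double totalRunoff(const QVector<float> &cells)           // cells: flattened grid
{
    QFuture<double> f = QtConcurrent::mappedReduced(cells, cellRunoff, addValue);
    return f.result();                                    // blocks until finished
}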

qt_gotcha
21st January 2011, 09:15
thanks for the help and suggestions guys, I really appreciate it.