update progressBar in parallel loop, using OpenMP [Archive]

View Full Version : update progressBar in parallel loop, using OpenMP

jesse_mark

22nd January 2013, 21:25

Hello,

I have subclassed a Qthread to do some work, and I also wanted to update a progressbar value.

for that, i have done this:

#pragma omp parallel for private(...) firstprivate(...) reduction(+:progressValue)
for (int i =0 ; i< n ; i++)
{
....
......
for (int j = 0 ; j <n; j++)
{
mLoack.lock();
access vector[j];
mLoack.unlock();
.......
.......
} // forj
#pragma omp atomic
progressValue +=1;

emit updateProgressbarValue(progressValue);
}

This loop in a the mythread, I emit a signale to update the progressebar.

but when i run it, the progressbar keep plinking and just reach to 12% only.

can anyone please tell me how can i update the progressBar value in parallel loop.

Thank you so much.

wysota

22nd January 2013, 21:28

Where is this code located?

jesse_mark

22nd January 2013, 22:37

In a separate thread. I emit the signal and i connect it to a slot in the main thread.

wysota

22nd January 2013, 23:03

So you have a QThread and somewhere in its run() implementation you start a bunch of other threads using OpenMP and the code you posted is inside this MP block?

Can we see the exact code?

jesse_mark

22nd January 2013, 23:23

Yes, this is the case. as you can see i do have the OpemMP pragma with the code i posted. this OpenMP pragma will fork threads based on your how many cores the machine has. in my case they are 8.

// in the main thread

mythead threadCompute = new compute(VecotrA,VecotrB);
connect(threadCompute,SIGNAL(updateProgressbarValu e(int)),this,SLOT(onUpdateProgressBarValue(int)));

void myClass::onUpdateProgressBarValue(int value)
{
ui->ProgressBar->setValue(value);
}

void myClass::computeVectorA()
{
threadCompute->start();
}

///in the QThread ,

//header
SIGNAL:
void updateProgressbarValue(int value);

//cpp
void myThread::run()
{

compute();
}

void compute()
{
#pragma omp parallel for private(...) firstprivate(...) reduction(+:progressValue)
for (int i =0 ; i < n ; i++)
{
for (int j = 0 ; j <n; j++)
{
updateVectorAat( i ,VecrtorB.at(i));

} // forj

#pragma omp atomic
progressValue +=1;

emit updateProgressbarValue(progressValue);
}

}

this is almost what i am doing, I have not problem with the computation actually, only with updating the progressBar.

thanks

wysota

23rd January 2013, 00:34

Is progressValue incremented properly for you reaching the expected value?

jesse_mark

23rd January 2013, 14:56

Actually this is was my problem .. so was hoping that someone has tried that before.

I solved the problem by removing the "progressValue" from the reduction clause, and just use the atmoic pragma.

.. so when i create the thread >> new myThread (....), does that mean .. this thread is created and will not be used by the system ?? or its will be mine just when it has work to do ?? only. I mean it only be mine when i called "Qthread.start()".

Thank you.

wysota

23rd January 2013, 15:26

A thread is a thread. It's not "yours" or "anyone else's". A thread belongs to the process and each process has its own set of threads. All threads in the system share the processor -- at some moment a thread from process A can use the CPU but a moment later a thread from a different process might use the same CPU.

As a side note, I think in your particular case there are no benefits from using multiple threads. I don't know how exactly OpenMP works so I can't say what the compiler will do to your code however a for loop working on the same data in each iteration can't really be parallelized that much.

jesse_mark

23rd January 2013, 16:45

OpenMP will just divide the outer loop, on the number of threads threads.
so if there is no dependencies between i =1 and i =2, then i=1 and i=2 can be each given to different threads to run them in parallel.
ex.
if we have a loop from 1 to 100 and all the logic in this loop has no dependence between different iterations ,then we can safely run i =1 or i = 100 which first does not affect the result. in such cases we can use OpenMP.
if we assume we have 5 threads then each 20 iteration are given to one thread and execute them in parallel. our loop will become groups of (20, 20 ,20 ,20,20) each group will be run by a thread in the same time.

maybe you are right, i am new to parallel programming tho.
but my case is that, i am working on two QVectors, each vector is could be over 1M elements , and holds a strut for each element.

first vecrtorA has (iD,point,value)
2nd vectorB has (point, value)
all what I am trying to do is :

for each point in VectorA, trying to find the closest point to it in vectorB. If IU find one i take the valaue of that point to assinge it to vectorA not tht value of vectorA for that point will be -1

QVector<mystruct1> vectorA;
QVector<mystruct1> vectorB;

laodVecotrA(); // vlaue is empty now.
loadVectorB()
/////////////////////////////////////////////

mystrcut1 instanceOfMyStruct1;
mystrcut2 instanceOfMyStruct2;
progressValue = 0;

#pragma omp parallel for private (mystrcut1,mystrcut2)
for (int i = 0 < i < vectorA.size();i++) // vectorA.size() > 1M
{
instanceOfMyStruct1 = vectorA.at(i);
bool found = false;

for(int j = 0 ; j < vectorB.size(); j ++ ) // vectorB.size() > 1M
{
instanceOfMyStruct2 = vectorB.at(j);
found = finedClosestPoint(instanceOfMyStruct1.point , instanceOfMyStruct2.point);
if (found)
{
break ;
}

}//for j

if (found)
{
instanceOfMyStruct1.value = instanceOfMyStruct2.value ;
vectorA[ i ] = instanceOfMyStruct1;

}
else
{
instanceOfMyStruct1.value = -1
vectorA[ i ] = instanceOfMyStruct1;
}

#pragma omp atomic
progressValue++;

emit updateProgressbarValue(progressValue);
}

any suggestion in doing that in a better way is really appreciated.

thank you so much for your time and your help

wysota

23rd January 2013, 17:21

so if there is no dependencies between i =1 and i =2, then i=1 and i=2 can be each given to different threads to run them in parallel.
There is a dependency on j.

any suggestion in doing that in a better way is really appreciated.
I think your algorithm is too naive. I think it is possible to come up with an algorithm that makes use of already calculated results to cut off calculations for further input. I think doing some clustering before trying to find neighbours should give a large increase in speed. It's just a thought though.

jesse_mark

23rd January 2013, 20:49

There is a dependency on j.

we are runnig the out loop in parallel not the inner loop. so for each i and j loop will run in sequential case1 = (i =1 and j =0 to n), case2 = (i = k and j =0 to n)
if we run case1 1st then case2 or in the same time it wont make a difference .

I think your algorithm is too naive

Yeah, I know.

I think it is possible to come up with an algorithm that makes use of already calculated results to cut off calculations for further input
if i removed the matched result from vectorB, will that improve the speed ?? such as : :VectorB.remove(j)" each time i found a match. idk what slow down from removing element from random location in a vector

wysota

23rd January 2013, 21:01

we are runnig the out loop in parallel not the inner loop. so for each i and j loop will run in sequential case1 = (i =1 and j =0 to n), case2 = (i = k and j =0 to n)
I know that but the thing is multiple threads are concurrently accessing the same data which may lead to race conditions or data corruption. I don't know how the OpenMP compiler will react to it. Based on your problems with incrementing the counter, it doesn't exactly work correctly.

if i removed the matched result from vectorB, will that improve the speed ?? such as : :VectorB.remove(j)" each time i found a match.
No, that would lead to a race condition among threads and it would also simply be incorrect. One point in B can be the closest point of more than one point in A.

I was rather thinking about some heuristics that if B1 is closest to A1 and there is some relation between A1 and A2 then Bx will not be closer to A2 than B1. I think Dijkstra's algorithm could be adapted for this case.

jesse_mark

23rd January 2013, 21:27

No, that would lead to a race condition among threads and it would also simply be incorrect. One point in B can be the closest point of more than one point in A.

if we consider the sequential case of using one thread. Also in my case each point in B is close to only one point in A. ?? does the remove could do any improve??

Actaully i have tried it. and i had improve in one case and other case it slow it down.

case1 was vectorA >> vectorB .... time improve.

case2 was vectorB >> vectorA .... slow down.

thank you so much, your suggestions are highly appreciated.

wysota

23rd January 2013, 22:08

does the remove could do any improve??
Only if you use just one thread.

Try keeping both vectors sorted by the X position (sort while inserting, not after all data is there)

Then start iterating over the B vector. If the point you are checking fulfills condition (Xb-Xa) >= dist where "dist" is the distance between point in vector A which you are trying to find the closest point to and point in B that has the minimum distance found so far then you can stop checking all further points because they will be further away from point in A than the currently found point. If you start searching from the middle of vector B, you can do the same in the other direction. In many cases this will shorten the lookup significantly. If you connect that with removing already taken points from the B vector, you should get additional speedup for single-threaded approach.

Added after 27 minutes:

Here is a potential Qt5 implementation (I don't know if it works correctly, I haven't checked the results):

#include <QCoreApplication> // QT += core
#include <QtConcurrent> // QT += concurrent

bool lessThanX(const QPoint &pt1, const QPoint &pt2) {
return pt1.x() < pt2.x();
}

static QAtomicInt ignored = 0;

struct FindClosestPoint
{
FindClosestPoint(const QVector<QPoint> &vec) : B(vec) {}

typedef QPoint result_type;

QPoint operator()(const QPoint &point)
{
QPoint candidate;
double distance = -1;
const int len = B.size();
candidate = B.at(0);
distance = (point-B.at(0)).manhattanLength(); // for simplicity, you can use Euclidean distance
for(int i=1;i < len; ++i) {
if(qAbs(point.x()-B.at(i).x()) >= distance){
qDebug() << "Ignored" << len-i-1 << "elements";
ignored.fetchAndAddAcquire(len-i-1);
return candidate; // stop condition
}
double newDist = (point-B.at(i)).manhattanLength(); // for simplicity
if(newDist < distance) {
distance = newDist;
candidate = B.at(i);
}
}
return candidate;
}

QVector<QPoint> B;
};

int main(int argc, char *argv[])
{
QVector<QPoint> A;
QVector<QPoint> B;

qsrand(QDateTime::currentDateTime().toTime_t());

int lenA = qrand() % 1000;
int lenB = qrand() % 1000;

for(int i=0;i<lenA;++i) {
QPoint pt(qrand() % 100, qrand() % 100);
QVector<QPoint>::iterator iter = qLowerBound(A.begin(), A.end(), pt, lessThanX);
A.insert(iter, pt);
}

for(int i=0; i<lenB; ++i) {
QPoint pt(qrand() % 100, qrand() % 100);
QVector<QPoint>::iterator iter = qLowerBound(B.begin(), B.end(), pt, lessThanX);
B.insert(iter, pt);
}

QVector<QPoint> results = QtConcurrent::blockingMapped(A, FindClosestPoint(B));

for(int i=0;i<results.count();++i) {
qDebug() << i << A.at(i) << results.at(i);
}
int totIgnored = ignored.load();
qDebug() << "Average ignored elements:" << totIgnored << "out of " << A.size()*B.size();
return 0;
}

Of course every run is different but my tests show a 40-60% reduction in execution time (using 6 core CPU) compared to an unsorted approach (e.g. 10ms vs 33ms). To calculate real speedup, you need to also consider the time needed to presort those vectors so the real speedup is somewhat smaller. Of course provided my algorithm is correctly implemented. Furthermore if you use the two-way approach, the speedup will be even nicer.