Is there a maximum vector length other than a memory limit?



agerlach
28th May 2010, 18:56
I am using Win 7 64-bit with 12 GB of RAM.
I want to allocate a QVector<double> of length 714838*256 with:

QVector<double> *qwert = new QVector<double>(714838*256);

but I get a bad_alloc exception. So I tried plain:

double *qwert = new double[714838 * 256];

and I still get an error. After a few trials I found that I can allocate 178077685 doubles but not 178077686. Yet

QVector<double> *qwert = new QVector<double>(178077685);

still throws the same exception. That works out to roughly 1.3 GB of doubles, which is far less than the 12 GB of RAM available.
I've looked all day online for a reason why I cannot allocate a larger double array, and the only explanation I can find is a RAM limitation, which is not the case here.
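A sketch of such a probe (assuming allocation success is monotonic in block size; the nothrow form of new returns null on failure instead of throwing):

#include <cstddef>
#include <iostream>
#include <limits>
#include <new>

int main()
{
    // Binary-search the largest count of doubles new will grant in one block.
    std::size_t lo = 0;
    std::size_t hi = std::numeric_limits<std::size_t>::max() / sizeof(double);
    while (lo + 1 < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        double *p = new (std::nothrow) double[mid];
        if (p) { lo = mid; delete[] p; } // mid fits, try larger
        else   { hi = mid; }             // mid fails, try smaller
    }
    std::cout << "largest single block: " << lo << " doubles\n";
    return 0;
}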

tbscope
28th May 2010, 19:11
There are more limits on your PC than just the amount of RAM. (As in: do you know exactly where this memory is reserved?)

Why, dear oh dear, why?
Why do you need that much memory?

To me this screams "complete and utter bad design".
Can you tell us why you need to do this?

SixDegrees
28th May 2010, 19:26
What happens if you try to allocate a simple array of doubles, as in


double* t = new double[178077686];

Are the results different?

agerlach
28th May 2010, 19:29
What happens if you try to allocate a simple array of doubles, as in


double* t = new double[178077686];

Are the results different?

I may have been unclear, but I tried that. This works:
double* t = new double[178077685];
but this does not:

double* t = new double[178077686];

agerlach
28th May 2010, 19:43
There are more limits on your PC than just the amount of RAM. (As in: do you know exactly where this memory is reserved?)

Why, dear oh dear, why?
Why do you need that much memory?

To me this screams "complete and utter bad design".
Can you tell us why you need to do this?

The 714838*256 size represents 714838 16x16 "images", called a training stack. It is used by taking another 16x16 "image" and correlating it with every image in the stack, and this needs to be fast. I currently have this algorithm built in Matlab, where I have no problem allocating this much memory. Based on the time it takes in Matlab, and on the speed-ups I have seen in the past from moving to C++, I expect to do this in < 1 s. Given the speed requirements I would rather not write intermediate steps to the hard disk. Also, this code will not be deployed, so I can control any hardware constraints.

If anybody has suggestions on better ways to do this that will not sacrifice speed, I am open to ideas.

What other limits would I have besides RAM? And what do you mean by "do you know exactly where this memory is reserved?"

tbscope
28th May 2010, 19:53
What other limits would I have besides RAM? And what do you mean by "do you know exactly where this memory is reserved?"

What sits between your physical RAM and your program code?
It's a chain with limits.

For example, you will not be able to use the 12 GB of RAM on a 32-bit system.

agerlach
28th May 2010, 19:55
I understand that I will not be able to use 12 GB of RAM on a 32-bit system, but this is a 64-bit OS.

tbscope
28th May 2010, 19:59
What are your stack and heap sizes?

SixDegrees
28th May 2010, 20:26
I'm not a Windows guy, but the failure of even Plain Old C Arrays suggests some sort of OS-imposed limit on your processes. In Unix, both the OS and the user can fiddle with the maximum memory a given process can allocate; this is normally set to "unlimited", but I've seen it restricted on some systems. I suspect Windows is probably "helping" you by imposing such a limit. No idea what the command would be to increase it, though.

Here's an article on the subject (http://msdn.microsoft.com/en-us/library/aa366778%28VS.85%29.aspx) - sounds like it might be worth investigating.

agerlach
28th May 2010, 22:19
What are your stack and heap sizes?

I'm sorry, but could you tell me how to check that?

agerlach
28th May 2010, 22:22
I'm not a Windows guy, but the failure of even Plain Old C Arrays suggests some sort of OS-imposed limit on your processes. In Unix, both the OS and the user can fiddle with the maximum memory a given process can allocate; this is normally set to "unlimited", but I've seen it restricted on some systems. I suspect Windows is probably "helping" you by imposing such a limit. No idea what the command would be to increase it, though.

Here's an article on the subject (http://msdn.microsoft.com/en-us/library/aa366778%28VS.85%29.aspx) - sounds like it might be worth investigating.

You are on to something: it looks like my program is compiling as a 32-bit application, which is limited to 2 GB of RAM. I created an empty project in VS2010, compiled it as x64, and was able to allocate significantly more memory. I have been using Qt Creator for development, but I don't really know how to set it up to build my projects as x64. Would it be easier for me to just start using VS instead?
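(A quick way to confirm how a binary was built - a minimal sketch, nothing Qt-specific - is to print the pointer width; 32 means a 32-bit process, which by default gets about 2 GB of user address space on Windows:)

#include <iostream>

int main()
{
    // Prints 32 on a 32-bit build, 64 on a 64-bit build.
    std::cout << sizeof(void*) * 8 << "-bit process\n";
    return 0;
}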

SixDegrees
28th May 2010, 22:45
Again, I'm not Windows-savvy, but make sure your QMAKESPEC is pointing to the 64-bit version of its files, not the 32-bit version.

ChrisW67
29th May 2010, 00:32
As I understand it, the MinGW environment bundled with the Qt SDK is 32-bit only.

agerlach
3rd June 2010, 21:08
OK, for my project I am using Qt, VTK, and the newmat C++ matrix library. I compiled newmat as x64, downloaded the Qt source, and ran configure.exe -> nmake -> nmake install from the VS2008 x64 command prompt. Everything built fine. I then built VTK against the x64 Qt build, also as x64. Again, everything built fine. My question now is how to start using the x64 Qt I built from Qt Creator, and how to handle QMAKESPEC. I don't see anything in the mkspecs folder that seems appropriate; there is only win32-msvc2008.

JohannesMunk
3rd June 2010, 22:49
The problem is that the vector requires all that data to be in one big chunk.

Your correlation will probably be interframe for each pixel? So that you need fast access to different frames of the same pixel.

Why don't you manage your data as a 256-long QList of vectors of 714838 doubles each?

Access won't be slower, but you allow the allocator some fragmentation.
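A minimal sketch of that layout, using the sizes from this thread:

#include <QList>
#include <QVector>

int main()
{
    // 256 independently allocated chunks of 714838 doubles (~1.4 GB in total)
    // instead of one contiguous ~1.4 GB block, so the allocator may scatter
    // the chunks anywhere in the address space.
    QList< QVector<double> > stack;
    for (int i = 0; i < 256; ++i)
        stack.append(QVector<double>(714838));
    return 0;
}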

Joh

agerlach
9th June 2010, 14:35
OK, for my project I am using Qt, VTK, and the newmat C++ matrix library. I compiled newmat as x64, downloaded the Qt source, and ran configure.exe -> nmake -> nmake install from the VS2008 x64 command prompt. Everything built fine. I then built VTK against the x64 Qt build, also as x64. Again, everything built fine. My question now is how to start using the x64 Qt I built from Qt Creator, and how to handle QMAKESPEC. I don't see anything in the mkspecs folder that seems appropriate; there is only win32-msvc2008.

Apparently Qt Creator handles QMAKESPEC for you based on the build of qmake you point it to. Everything works well except debugging. I downloaded cdb for x64 and pointed Qt Creator to it, but it gives the following error:

Unable to load the debugger engine library 'C:\Program Files\Debugging Tools for Windows (x64)\dbghelp.dll': Cannot load library C:\Program Files\Debugging Tools for Windows (x64)\dbghelp.dll

even though the files exist at that location.

agerlach
9th June 2010, 14:48
The problem is that the vector requires all that data to be in one big chunk.

Your correlation will probably be interframe for each pixel? So that you need fast access to different frames of the same pixel.

Why don't you manage your data as a 256-long QList of vectors of 714838 doubles each?

Access won't be slower, but you allow the allocator some fragmentation.

Thanks, that makes sense, but what do you mean by "interframe for each pixel"? The following equation is how the correlation is calculated. P is the randomly selected image and Q is one of the 714838 images in the training stack; pi and qi are the ith pixel values of P and Q respectively. There are 256 pixels per image. Normally W^2 = 256, but in this case it equals the number of pixels that are nonzero in both P and Q.
R(P,Q) = (W^2 * sum(pi*qi) - sum(pi)*sum(qi)) / sqrt( (W^2 * sum(pi^2) - (sum(pi))^2) * (W^2 * sum(qi^2) - (sum(qi))^2) )

JohannesMunk
9th June 2010, 15:19
Ah, OK. So you correlate whole frames, and you need to correlate every new frame against your complete training set?

Then you should definitely store the per-frame pixel sums and squared-pixel sums to speed up calculating the denominator of your formula.

Store a struct/class for each frame with the raw pixel data and the precalculated values.

Put all your frames into a QList. That will effectively be an array of pointers to your structs/classes. As that array will be quite small, and each frame struct is small, you should not have a problem allocating that memory.
As the pointers need to be resolved only once per (much more expensive) correlation calculation, they won't have an impact on performance either.

HIH

Johannes

agerlach
9th June 2010, 19:01
Ah, OK. So you correlate whole frames, and you need to correlate every new frame against your complete training set?

Then you should definitely store the per-frame pixel sums and squared-pixel sums to speed up calculating the denominator of your formula.

Store a struct/class for each frame with the raw pixel data and the precalculated values.

Put all your frames into a QList. That will effectively be an array of pointers to your structs/classes. As that array will be quite small, and each frame struct is small, you should not have a problem allocating that memory.
As the pointers need to be resolved only once per (much more expensive) correlation calculation, they won't have an impact on performance either.


Just to make sure I understand everything: when you say "... as that array will be quite small", you are referring to its size in memory, since it is just an array of pointers? If so, this makes a ton of sense and is a huge help. Thanks!

On a second note, I never thought about precalculating the sums in the equation. I never did in the past because only the pixel values in Q that correspond to nonzero pixel values in P are used in the calculation. So if Q = {1 2 3} and P = {0 0 1}, then sum(Q) = 3, since Q(0) and Q(1) are not used. But now that I think about it, since the images are never all zero, it would be more efficient to calculate the sum and then subtract the elements that correspond to a 0 value in P.

JohannesMunk
9th June 2010, 19:15
Just to make sure I understand everything: when you say "... as that array will be quite small", you are referring to its size in memory, since it is just an array of pointers? If so, this makes a ton of sense and is a huge help. Thanks!
That's exactly what it means! Of course the overall memory usage will be equally big, but now the memory manager can decide where to put all the little pieces (= frames) and is not forced (and doesn't fail) to allocate one big contiguous chunk.



On a second note, I never thought about precalculating the sums in the equation. I never did in the past because only the pixel values in Q that correspond to nonzero pixel values in P are used in the calculation. So if Q = {1 2 3} and P = {0 0 1}, then sum(Q) = 3, since Q(0) and Q(1) are not used. But now that I think about it, since the images are never all zero, it would be more efficient to calculate the sum and then subtract the elements that correspond to a 0 value in P.
If Q={1,2,3} then sum(Qi)=6; if P={0,0,1} then sum(Pi*Qi)=3. I was only talking about the denominator of your formula. There are no mixed (p and q) sum expressions there.

Johannes

agerlach
9th June 2010, 22:12
If Q={1,2,3} then sum(Qi)=6; if P={0,0,1} then sum(Pi*Qi)=3. I was only talking about the denominator of your formula. There are no mixed (p and q) sum expressions there.

No, that's not correct. The calculation is only performed using pixels whose values are nonzero in BOTH P and Q. Imagine a loop before the correlation calculation that modifies Q based on P. Say Q = {1,2,3} and P = {0,0,1}, with the following loop:

for (int ii = 0; ii < 3; ii++) {
    if (P[ii] == 0)
        Q[ii] = 0;
    else if (Q[ii] == 0)
        P[ii] = 0;
}

Then Q = {0,0,3} and sum(Qi) = 3, not 6.

Then the equation is used. But since you don't know P when the training stack is created, you cannot precalculate the sums there. My point was that the time required to calculate the correlation matters, not the time to calculate the training set. So if I precalculate the sums when I generate the training set and then, while calculating the correlation, subtract the values Q[ii] where P[ii] == 0, I reduce the computation in the correlation step even though I increase the total number of calculations.
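A sketch of that correction (hypothetical helper, names made up; 'total' is the sum stored at training time):

// Remove from a precomputed total the q values wherever p is zero.
double maskedSum(const float *P, const float *Q, int n, double total)
{
    for (int i = 0; i < n; ++i)
        if (P[i] == 0.0f)
            total -= Q[i];
    return total;
}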

JohannesMunk
9th June 2010, 22:44
Mmh.. that's a strange correlation if you look only at those pixels that are present in both images. You lose most of the 'geometric' non-matching information. But maybe you are well aware of that and intend to do so.

Going by your definition: if two of your frames were checkerboards, but one of them shifted by just one pixel, your way of calculating the correlation would result in 0, whereas there is maximum anticorrelation between those two frames => R(P,Q) should be -1.

I would expect the numerator of a correlation term to contain only those values that match (when both Pi and Qi are nonzero = Pi*Qi), and the denominator (= normalization) to contain all possible values... In that case you can definitely precalculate the sums for the individual frames and compute the denominator for each pair very fast. The numerator, of course, is a different story!

See http://en.wikipedia.org/wiki/Correlation .

To make this clear: your Pi and Qi should just be the untouched raw data, not some Pi*Qi as you suggested. Otherwise this is not really a correlation.

If you do it your way: have you checked your normalization? If you compare different R(P,Q), are they even comparable the way you do it?

Joh

agerlach
9th June 2010, 23:09
I'll try to explain a little more about what the image actually is, which should help explain why I do things this way. These images are "spin-images": imagine taking a 3D surface mesh, spinning a finite plane about a randomly selected vertex's normal, and recording where every other vertex in the mesh intersects the plane. This can then be discretized into bins; it is really a 2D histogram representing the accumulation of mesh vertices on the plane. During training the 3D mesh is an isolated model of your object of interest, but when you observe a real-world scene of that object it is typically not isolated (there is clutter), and other objects will be included when calculating spin-images for that surface.

By only using pixels (or, more correctly, bins in the histogram) that are nonzero in both the training image, Q, and the observed scene, P, you are considering only the overlap of the images. Even though the modified correlation coefficient is robust to clutter, it can erroneously report high or complete correlation between images with little overlap. However, with more overlapping pixels there is more confidence in the value of the correlation coefficient obtained. The correlation coefficient can then be used to calculate a similarity measure that combines the correlation coefficient and its variance into a single function. This is done by a change of variables on the correlation coefficient with the hyperbolic arctangent, in which case the variance becomes a direct function of the number of nonzero pixels used in the calculation.

Normalization is still fine since the W^2 in the equation becomes the number of nonzero bins in both P and Q.

JohannesMunk
9th June 2010, 23:44
The spin-images sound like an interesting concept, and I still don't understand your "overlap" reasoning, but you've obviously put a lot of thought into it, so it'll be all right :->

Back to computation:

1) When you calculate a correlation, make sure you iterate through your pixel data only once and gather all the required sums in one pass. Memory fetching is always time consuming.

2) Inside your loop, calculate Pi*Qi first. That goes fast if one of them is zero, and you can use the result to decide whether to add pi and qi to their respective sums (see the sketch after this list).

3) You can very easily parallelize the computation of the different correlation coefficients => multithreading. See QtConcurrent.

4) If you need this REALLY fast, consider going for OpenCL (GPU).

5) Make sure you really need doubles. Depending on your platform, using singles or even integers could speed up computation a lot. On GPUs, double performance is about half of single performance; on CPUs the difference between f64 and f32 will probably be insignificant.

6) Make sure you compile with the vector instruction sets enabled (SSE2, ...) and high optimization (-O3, ...). Depending on your code this may produce fast SIMD (SSE) code.
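For points 1 and 2, a one-pass sketch (hypothetical helper, written against the masking convention from earlier in the thread) could look like this:

#include <cmath>

// Gather every sum the coefficient needs in a single pass over the pixels.
// Pixels that are zero in either image are skipped, so w2 ends up being the
// effective W^2 from the formula above.
double correlate(const float *p, const float *q, int n)
{
    double sp = 0, sq = 0, spp = 0, sqq = 0, spq = 0;
    int w2 = 0;
    for (int i = 0; i < n; ++i) {
        const double pq = double(p[i]) * q[i]; // cheap zero test first
        if (pq == 0.0)
            continue;                          // missing in one of the images
        sp  += p[i];
        sq  += q[i];
        spp += double(p[i]) * p[i];
        sqq += double(q[i]) * q[i];
        spq += pq;
        ++w2;
    }
    const double num = w2 * spq - sp * sq;
    const double den = std::sqrt((w2 * spp - sp * sp) * (w2 * sqq - sq * sq));
    return den > 0.0 ? num / den : 0.0;
}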

Good luck with your project!

Joh

agerlach
10th June 2010, 00:16
I'm way ahead of you! I have this all working on the GPU - not OpenCL, but NVIDIA's CUDA. I have all the algorithms working in Matlab and I can call the CUDA kernels through MEX functions. I also have a rough C++ MEX version of just the correlation portion of the algorithm. I get about a 22x speed-up going from Matlab to C++, and over 1000x going from Matlab to CUDA! In all those implementations I have been using floats instead of doubles, which is probably what I'll use here as well.

Thanks for your ideas, you've been a huge help.

One quick question: would I be better off using QVector instead of QList, since I know the desired length of the array and can allocate it on initialization with QVector? Also, let's say I make a class called siStore to store the pixels and any precalculated values. I then have:

QList<siStore> *siStack = new QList<siStore>();

How then do I tell the QList how many elements it has, and how do I tell each siStore in the QList how many elements it has? All my programming experience is in Matlab, so classes, pointers, and templates are all new to me!

JohannesMunk
10th June 2010, 11:37
That's great!

QList vs QVector will not make much of a difference. See http://doc.qt.nokia.com/4.6/containers.html#container-classes


#include <QtCore>
#include <QtGui>

struct siFrame {
    // default constructor - initialize sums..
    siFrame() { sum = 0; sqsum = 0; }
    float pi[256];
    float sum;
    float sqsum;
};

int main(int argc, char *argv[])
{
    QApplication a(argc, argv);

    qDebug() << "Start Vector";
    QVector<siFrame*> siStore(714838);
    for (int i = 0; i < siStore.count(); ++i)
    {
        siFrame* frame = new siFrame();
        for (int j = 0; j < 256; ++j) { frame->pi[j] = j; frame->sum += j; frame->sqsum += j*j; }
        // if you have the data already in memory use: memcpy()
        siStore[i] = frame;
    }

    qDebug() << "Start List";

    QList<siFrame*> siStoreL;
    for (int i = 0; i < 714838; ++i)
    {
        siFrame* frame = new siFrame();
        for (int j = 0; j < 256; ++j) { frame->pi[j] = j; frame->sum += j; frame->sqsum += j*j; }
        // if you have the data already in memory use: memcpy()
        siStoreL.append(frame);
    }

    qDebug() << "End";

    // Just a widget
    QWidget wdg;
    wdg.show();

    int result = a.exec();

    // Clean up - just loop through all elements and call delete on the pointers.
    qDeleteAll(siStore);
    siStore.clear();
    qDeleteAll(siStoreL);
    siStoreL.clear();

    return result;
}
This requires 1.5 GB RAM and works without problems in Win32.

I don't know if saving sum and sqsum really makes sense for your kind of correlation. Your implementation will look a lot uglier if you subtract only some of the values, and you still need to iterate through all the pixels.

Joh

agerlach
10th June 2010, 15:42
Thanks, that works. However, how would I set this up in a class? Both the 256 and the 714838 are dynamic. Instead of a struct for siFrame I made a class with a constructor that lets me set the size of the array, using a QVector in siFrame instead of a float[].

I have a class called siTrainer that needs to load a queue of things to train and process that queue by allocating an siStore list, calculating each siFrame, populating the siStore list, writing it to the hard disk, deleting it, and then repeating. So I need the class to be able to initialize and clear the siStore list. I need something like this:


class siTrainer
{
public:
    siTrainer();
    ~siTrainer();

private:
    void initStore();
    void deleteStore();
    int numPts;   // The number of images, 714838 in my example
    int siWidth;  // The size of each image, 256 in my example
    QVector<siFrame *> *siStore;
};
How would I implement initStore()?

QVector<siFrame *> siStore(714838);

won't work, since it just creates a local variable that cannot later be accessed by deleteStore(). When I try

siStore = new QVector<siFrame *>(numPts);

I get the following error:

binary '=': no operator found which takes a right-hand operand of type 'siFrame *'

for the following line:

siStore[i] = frame;

JohannesMunk
10th June 2010, 16:02
I'm not sure about your trainer logic, but you should be able to get everything you need out of this. I had already shown you how to insert a frame into a container.



#include <QtCore>
#include <QtGui>

struct siFrame {
    siFrame(int numpixels) { sum = 0; sqsum = 0; pi = new float[numpixels]; }
    ~siFrame() { delete[] pi; }   // release the pixel data with the frame
    float* pi;
    float sum;
    float sqsum;
};

class siTrainer {
public:
    siTrainer(int numframes, int numpixels) {
        for (int i = 0; i < numframes; ++i)
        {
            siFrame* frame = new siFrame(numpixels);
            // initialize your frame..
            for (int j = 0; j < numpixels; ++j)
            {
                frame->pi[j] = j;
                frame->sum += j;
                frame->sqsum += j*j;
            }
            // store the frame
            siStore.append(frame);
        }
    }
    ~siTrainer() {
        qDeleteAll(siStore);
        siStore.clear();
    }
private:
    QList<siFrame*> siStore;
};

int main(int argc, char *argv[])
{
    // for each training ..
    {
        siTrainer* s = new siTrainer(714838, 256);
        // do something with your trainer..

        // free it up
        delete s;
    }
    return 0;
}

To make this clear:

siFrame* frame = new siFrame(numpixels);

defines and initializes a pointer to a frame instance. (BTW, you could just change struct to class if you need more functionality; both are the same thing, except that struct members are public by default.)

siStore.append(frame);

saves the pointer into the list. When frame goes out of scope (at the end of each iteration), only the pointer variable itself goes away, not the object it points to, so the copy of the pointer stored in the list still points to a valid object.

Joh

JohannesMunk
10th June 2010, 16:11
Your siStore is a pointer to a QVector, so try:

(*siStore)[i] = frame;

if you insist on staying with a vector. But if you run my little app you will see that there is no performance difference between QList and QVector. QList is internally an array too, which resizes according to a growing strategy; append is amortized O(1). The memory overhead of a pointer array that is slightly too big is insignificant. It just handles better.

Joh

agerlach
10th June 2010, 16:13
Thanks again, you've been a huge help. I just didn't know if there was something better than .append(), since I know the length needed. I understand that QList allocates more memory than it needs so that .append() is fast, and I figured that doing so many appends would have a performance effect, but I don't see any when I run it.

agerlach
10th June 2010, 16:33
OK, everything works, but qDeleteAll() doesn't behave the way I expected. In my siTrainer class I have a method run() that gets called to process the training queue, and it basically works like this:


void siTrainer::run()
{
    foreach (item, queue) // pseudocode
    {
        this->initStore();
        this->doWork();
        this->deleteStore();
    }
}


where initStore() and deleteStore() are implemented with the QList as you suggest above.
qDeleteAll(siStore) works in that I don't have a memory leak, but the count of siStore never resets, so the second time through the loop initStore() appends the siFrames to a list that already has a count of 714838.

JohannesMunk
10th June 2010, 16:37
My fault! qDeleteAll just iterates through the items and deletes each one, but the now-invalid pointers stay in the container. You have to clear the container as well:

qDeleteAll(c);
c.clear();

Another question is whether you really need to delete all the frames. Is it a whole new set every time, or do you recreate all of them for each incoming frame?

Joh

agerlach
10th June 2010, 16:48
That works perfectly!

It is a whole new set every time. The GUI for this allows a user to load a 3D model, set the desired spin-image parameters, and add it to a queue. The user can then load a completely different model, set its parameters, and append it to the queue. Once the user has added all the models, the code goes through the queue and trains each model independently.

Since training takes so long, it's nice to add a bunch of models and then be able to walk away from the machine!

JohannesMunk
10th June 2010, 16:51
Depending on what your trainer class does beyond what I've seen, it could be a better design to create a new trainer for each completely different problem. But that's just aesthetics.

Good luck with your project!

Joh