QList of variable data type? [Archive]

View Full Version : QList of variable data type?

Phlucious

15th November 2011, 23:07

In my program I import a set of millions of 3D points with multiple properties each (RGB values, XYZ values, alpha, etc) from an external file and store it in memory. Rather than storing them as a simple array, I was hoping to store them in a Qt container class like QList so that I could tap into Qt's sorting capabilites. In order to conserve memory, I've written several class-based point data types to choose fromâ€”some with XYZ only, others that store XYZ and RGB data, etcâ€”all that inherit from a core data type:

class p3d_core_t {/*variables here*/}
class p3d_0_t : public p3d_core_t {/*variables here*/}
class p3d_1_t : public p3d_core_t {/*variables here*/}
class p3d_2_t : public p3d_core_t {/*variables here*/}
A given data file specifies in its header which data type it contains (0, 1, 2, ..., N), so I don't know which type to store in my QList until run-time. I have two questions:

How do I construct a new QList during run-time once I know which data type I'm importing?
Is it possible to list the data types themselves in a QHash so that I can index them instead of having a series of if-else statements? For example, something like this

int dataTypeIndex = 2;
QList* list;
QHash<int, pointdatatype> hash;
list = new QList<hash.value(dataTypeIndex)>
would create a QList of p3d_2_t data, ready for importing.

I can hardcode this with a huge set of if-else statements, but I have a feeling there's an elegant "Qt way" to do this, especially since the number of point data types changes periodically. I'd rather it be a little more flexible. Because these data sets get huge, even a little memory or speed boost makes a huge difference.

Thanks in advance!

I'd also like to add that I'm open to alternatives to QList or QHash that are more appropriate for lists that get into the 10-999 million member range.

Phlucious

16th November 2011, 01:11

I suppose that I could achieve the desired functionality using the Property System (http://doc.qt.nokia.com/4.7/properties.html) with a QList of QObjects, one QObject for every point. However, wouldn't doing so significantly inflate the memory use of the point dataset? When you have 100,000,000 QObject points, even a couple bytes per point makes a big difference.

nightghost

16th November 2011, 10:07

Thats not possible since c++ templates are evaluated at compile time and not at runtime.

Phlucious

17th November 2011, 20:46

Apparently QList pointers require a type, too, so I'm going to have to re-evaluate how I will accomplish this. I don't want to have to create an array of 1,000,000 structs that each occupy 423 bytes (v8) when I'm importing a data version that only occupies 67 bytes (v2), just to account for the possibility that I might import v8 data.

Versioning sure makes things tricky... Know any good resources for that? I'm always willing to learn.

The QList documentation mentions that it's optimized for <1000 members. Anyone know if it can manage 100,000,000 members efficiently? I'm thinking a basic C array allocated with malloc() may be the best (most efficient) option.

MarekR22

17th November 2011, 21:41

I don't know what are you trying achieve, but IMHO you should use simple data format.
Using polymorphism or other universal solution cost memory for handling type recognition and storing information about heap (in case of QVariant copy will be kept in heap).
If you have only 1M of small/average size elements then soring them in single fragment of memory would be more effective (423 byte it is small size and will give you about 500 GB size). QVector or QVarLengthArray can be better choice.
For storing 100M of average size elements you have to store data on hard drive cache. Qt has some templates to handle that:
QCache - for random access,
QContiguousCache - for sequential access (this is probably for you).

Without details it is hard to suggest something more.

d_stranz

17th November 2011, 21:44

If your data types are all derived from a common base class, why not make your QList a list of pointers to the base class instead of a list of actual object instances? You could then use virtual methods of the base class to get at the actual content of the real instances.

Consider using std::vector<> instead of QList<> or QVector<>. The Qt purists here may disagree, but for PC platforms, I think the C++ Standard Library is better for large containers. While I don't typically have containers of 100 million things, I do have containers with a million or more and use STL containers without hesitation.

If you haven't ever read about C++ template metaprogramming, you should check into that. Much of it is concerned with how to write algorithms on containers of arbitrary things.

Phlucious

17th November 2011, 23:40

Generally speaking, my application is to open, manipulate, and close a LiDAR (http://en.wikipedia.org/wiki/LIDAR) point cloud data file. These files can reach a few GB in sizeâ€”each!â€”with around 1-500 million points apiece. Data quantities add up pretty fast when you collect around 120,000 points per second, minimum.

If your data types are all derived from a common base class, why not make your QList a list of pointers to the base class instead of a list of actual object instances? You could then use virtual methods of the base class to get at the actual content of the real instances.
I like the idea of using virtual methods declared by the base class. I didn't know that calling the virtual method of a base class pointer would call the method of the child it's pointing to. For example,

p3d_1_t pt1; //p3d_1_t's constructor calls p3d_1_t::clear(), which sets x=0
p3d_core_t* pt1_ptr = &pt1; //p3d_core_t's constructor calls p3d_core_t::clear(), which sets x=9999
pt1_ptr->clear(); //this calls p3d_1_t::clear(), not p3d_core_t::clear(), even though the pointer's type is p3d_core_t*
would allow me to code in a generalized fashion given a set of get/set functions in the core class for each variable, even though not all of the derived classes use every variable. However...

Using polymorphism or other universal solution cost memory for handling type recognition and storing information about heap (in case of QVariant copy will be kept in heap).
That's basically my concern with using a polymorphic class-based approach. I'm unsure how class functions are stored in memory or in the executable when you have multiple instances of the same class. How much overhead is added by having 100,000,000 instances with get/set functions? Do the functions occupy space in memory for each instance or are they shared somehow? As I understand it, static class functions are shared across every instance, but I don't know how normal non-static functions are stored/called. I'd rather not make them all static and force myself to pass a "this" every time, but I'll do it if it means saving that much memory. I've already pushed past 4 gis of RAM. Speed's the main issue, though.

btw, I've pretty much abandoned the QObject::setProperty() concept, so I'm not going to bother with QVariants and type conversions.

If you have only 1M of small/average size elements then soring them in single fragment of memory would be more effective (423 byte it is small size and will give you about 500 GB size). QVector or QVarLengthArray can be better choice.
For storing 100M of average size elements you have to store data on hard drive cache. Qt has some templates to handle that:
QCache - for random access,
QContiguousCache - for sequential access (this is probably for you).
I believe that I'm looking for random access, so QCache is probably more applicable. I'd forgotten and misunderstood how those worked. Would it make sense to process the data in one QThread and retrieve it from the disk in another thread? Otherwise I don't see how a QCache would help me in this case because I've always had HUGE slowdowns when I try to process off the hard drive.

It'd be nice if I could spatially sort the data somehow (bin and grid the data according to XYZ values) using prebuilt methods, but that's probably asking too much of the Qt library. :)

Consider using std::vector<> instead of QList<> or QVector<>. The Qt purists here may disagree, but for PC platforms, I think the C++ Standard Library is better for large containers. While I don't typically have containers of 100 million things, I do have containers with a million or more and use STL containers without hesitation.

If you haven't ever read about C++ template metaprogramming, you should check into that. Much of it is concerned with how to write algorithms on containers of arbitrary things.
I've been leery of STL containers because I had so many headaches from them, vectors in particular, early in my programming experience (I wouldn't call it a career... I started dabbling in coding 2 years ago). I'll give vectors another look! TMP looks promising, but still a little over my head. I like how fast it sounds, though.

Thanks for the tips, both of you! This has been a blast to learn.

d_stranz

18th November 2011, 00:09

The issues with LiDAR point clouds are similar to what we face with mass spectrometric images. Each point in the image is an entire spectrum, so MS image files are routinely 40 GB or larger.

I'm unsure how class functions are stored in memory or in the executable when you have multiple instances of the same class. How much overhead is added by having 100,000,000 instances with get/set functions?

The only thing that is allocated with each instance of an object is the memory for the data variables themselves. There is only just a single copy of the code, ever, regardless of whether it is a static method or not. The main difference between static and non-static methods is that a static method is basically the same as an ordinary function call, except that it is defined within a class' namespace. Static methods have no "this" pointer, so they can only modify static member variables of the class, not the member data associated with a particular instance of the class.

There is some small overhead associated with using virtual methods. Most compilers implement classes with something called a "vtable", which is a lookup table of function locations. Depending on the implementation, virtual methods might have to go through another lookup at run time to determine which of the overridden functions to call, based on the actual type of the instance.

As for deriving your data types from any QObject type, I wouldn't do it. If you really have millions of these, the overhead would kill performance. And Q_PROPERTY goes through all sorts of lookup and conversion to get or set values.

I referred to metaprogramming because there are tricks in there for retrieving information out of object instances when you don't actually know (or care) what type the object really is. Look here (http://www.cplusplus.com/reference/stl/vector/) under "Member types" for how std::vector<> does this.

Phlucious

19th November 2011, 00:30

Thanks for the tips! It seems to be working pretty well, at least conceptually. I seem to be having some issues with the actual size of the class in that it's somehow getting inflated.

For example, if you add up all of the variables stored in this example class, it should add up to 20 bytes, and yet when I query its size with sizeof(), I get a value of 32. Why would that be if all each instance stores is the variables, not the functions?

class ptcore_t
{
public:
ptcore_t();
virtual void clear();

qint32 xyz[3];
quint16 intensity;
quint8 retid;
quint8 classification;
qint8 scanangle;
quint8 misc;
quint16 ptsourceid;
}

If it's relevant, I'm coding this with a 64bit version of Qt, on Windows 7 x64.

d_stranz

19th November 2011, 18:14

Probably because the compiler aligns storage for each member on 4-byte boundaries in a 64-bit machine. Possibly if you rearrange the order of your member variables, you could end up with a different size because the alignment changes.

Why would that be if all each instance stores is the variables, not the functions?

Think about what this means. Functions (code, basically) are invariant. Every single instance of an object of a given type executes exactly the same code in its member functions. The only differences between different invocations of the same method are the arguments passed in and the values of the member variables at that particular time. So why would the compiler need to store the addresses of each of the functions in every object, instead of one table for each class?

Each compiler can make its own choices, but one way is that there is an "invisible" pointer that is passed as the first argument of every member function call, the pointer to "this", the actual object instance for which the method is being invoked. C++ hides this, so that when you refer to your member variable "intensity" within some method for ptcore_t, the compiler is actually looking at "this->intensity".

Phlucious

30th November 2011, 01:54

@d_stranz: You've been very helpful. I'm pretty much there with my implementation following your advice. My custom QFile class has an empty std::vector<ptcore_t*> in its constructor. The ptcore_t abstract class has virtual get/set functions for every possible variable.

class ptcore_t
{
public:
ptcore_t();
virtual void clear();

/* Generalized data recording and retrieval functions */
virtual double setXYZ(int i, double v);
virtual quint16 setIntensity(quint16 v);

/* Placeholder functions for other formats */
virtual double setTimeStamp(double v) { return 0.0; }
//Lots and lots more placeholders that can be mixed and matched

private:
/* Point data fields */
double xyz[3];
quint16 intensity;
}
An inheriting class (such as pt1_t) overrides the get/set function for its specialized data.

class pt1_t : public ptcore_t
{
public:
pt1_t();
void clear();

double setTimeStamp(double v);
private:
double gpstime;
}
While reading in the data from an external binary file, I declare a ptN_t*, with the value of N depending on a flag in the file itself. Once the point is read, I push_back the ptN_t* into the vector container and move onto the next point. Can I safely assume that vector::clear() will release the memory like I expect it to if I want to read in a different set of points?

One question regarding optimization. When I start reading the points, I already know what point format they will be since that was in the file's header. After reading the XYZ and Intensity data from the file, what comes next will vary depending on the point format. I'd rather not do a dozen if() checks for every point. Is there an efficient way to do this? I'm coding this at the file level, not the point level. Can I check if gpstime is defined, for example? (like #ifdef, except during run-time)

d_stranz

30th November 2011, 16:42

Can I safely assume that vector::clear() will release the memory like I expect it to if I want to read in a different set of points?

No. std::vector<>::clear() does not do anything to free the memory pointed to by the objects it stores. If you are storing pointers, you must manually delete each pointer before clearing the vector. Likewise, you must do this with any operation that removes a vector element (erase) or replaces it with a different one (assign, operator=, etc.). Otherwise, you leak the memory occupied by the pointer that is left dangling.

Whenever I create vectors of pointers, I usually use a smart pointer instead of a raw pointer. Smart pointers use reference counting and will automatically delete themselves when the reference count reaches zero (i.e. no one is holding onto a copy of the pointer anymore). Smart pointers might be too much overhead and slow performance too much for your application. If you want to look into it, we use the smart pointer implementation from the Boost C++ Libraries. (http://www.boost.org/)

Note also that for some implementations (Microsoft VC++ in particular), calling clear() does not actually free up the memory occupied by the vector itself. So if your vector is sized to hold 10^6 points and you then clear it, it still occupies the same amount of storage (10^6 * sizeof( ptN_t *) + a little overhead ). The only way we have found that actually reduces the size of the vector itself is to do the following:

myVector.clear();
{
std::vector< double > reallyAndTrulyEmpty;
myVector.swap( reallyAndTrulyEmpty );
}

In other words, forcing a swap with a local empty vector does the trick. Simply calling clear just sets the value returned by size() to zero.

Can I check if gpstime is defined, for example?

I don't know what you mean by this. If what you want is a way to parse each line of the file to retrieve N-dimensional point data, without a multi-level if-else logic tree for each point, define a virtual method for each point type that parses the line into fields:

class ptcore_t
{
public:
ptcore_t();
virtual void clear();

// ...

/* Reimplement for each concrete type to extract type-specific fields */
virtual bool parseLine( const std::wstring & line );

private:
// ...
}

So your file reader will basically read a line of text (if that's how it is stored), construct an object of the type stored in the file, then pass the line to that object so it can parse its contents out. You could even make a constructor that did this as part of the allocation:

ptN_t * ptr = new ptN_t( line );

Alternatively, when you first open the file and determine the type of points it contains, you could construct function objects (specific for the point type) to create a point of the appropriate type and to parse the line into the member variables. The parser function object then parses the line and fills in the appropriate member variables for the pointer. The function objects basically look like this:

struct pt_t_builder
{
virtual ptcore_t * operator()() const = 0; // pure virtual, must be defined in derived structs
};

struct pt1_t_builder : public pt_t_builder
{
virtual ptcore_t * operator()() const { return new pt1_t; }
};

struct pt_t_parser
{
virtual bool operator()( ptcore_t * ptcore, const std::wstring & line ) const = 0;
};

struct pt1_t_parser : public pt_t_parser
{
bool operator()( ptcore_t * ptcore, const std::wstring & line ) const
{
pt1_t * pt1 = static_cast< pt1_t * >( ptcore );

// parse "line" into fields, store into pt1 members
// return true / false based on success
}
};

You then use this roughly as follows:

// When point type is first determined
pt_t_builder * ptBuilder = 0;
pt_t_parser * ptParser = 0;
switch( fileType )
{
case pt1_t_file:
{
ptBuilder = new pt1_t_builder;
ptParser = new pt1_t_parser;
}
break;

// .. additional cases for other point types
}

// Later, when reading each line:
std::wstring line; // read the line into "line"

ptcore_t * point = (*ptBuilder)(); // create the appropriate point type based on the builder
if ( (*ptParser)( point, line ) ) // load the new point's contents based on the parser and store if OK
myVector[ nPoint++ ] = point;

HTH. Obviously not compiled or tested, so some tweaking might be required.

Phlucious

30th November 2011, 18:40

Thank you so much for the vector information! I've found it very difficult to find good resources on vectors in a high-performance environment, so I really appreciate your input. In my experience, Boost tends to slow things down enough that I'm okay with programming the deletion myself. It'll be good practice, anyway. :)

I really like the idea of passing a data chunk into a point constructor. That sounds very clean, especially since the file header also gives me the size-on-disk of each point format, I already know programmatically how big of a chunk to read for each point.

Since my data file is stored in a binary format (not ASCII text), would it be faster to pull data one point (30 bytes, for example) at a time, or one component at a time (2 bytes)? I imagine it would be the former because of the reduced number of calls to the disk, especially with a typical HDD. In that case, would it make sense to read in a QByteArray using QIODevice::read(qint64) and pass that to the point constructor?

I desperately need to take a good course on code optimization so I don't have to rely on you all. Thanks again! This should be my last question.

Phlucious

2nd December 2011, 00:25

One more... Why would I use static_cast as in
pt1_t * pt1 = static_cast< pt1_t * >( ptcore ); instead of simply equating it and calling the constructor?
ptcore_t* ptcore;
pt1_t* pt1 = new pt1_t();
ptcore = pt1;
//or....
ptcore_t* ptcore = new pt1_t();