I just wanted to say that the problem was finally solved, but it turned out to be a lot more complex than I had expected.
Without going into too much detail, it turns out that there is a problem with aligning 16-byte SSE2 variables inside a C++ class with a Microsoft compiler.
So jacek, I am guessing that you are not using a Microsoft compiler, since you did not get any errors with the little test program I posted. I am even guessing you are not using anything Microsoft (like Visual Studio) at all...?
The solution was to overload the "new" operator for the class so that heap-allocated instances, and the SSE2 members inside them, end up on a 16-byte boundary.
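
For anyone who finds this thread later, here is a minimal sketch of that kind of fix. It is not the actual code from my project: the class name Vec4 and the two helper functions are made up for illustration, and I am assuming _mm_malloc/_mm_free (available through the SSE headers) as the aligned allocator.

```cpp
#include <cstddef>      // std::size_t
#include <cstdint>      // std::uintptr_t
#include <new>          // std::bad_alloc
#include <emmintrin.h>  // SSE2 intrinsics; pulls in _mm_malloc/_mm_free

// Hypothetical class holding a 16-byte SSE member.
struct Vec4 {
    __m128 v;

    // Class-specific operator new: request 16-byte aligned storage
    // instead of relying on the default allocator's alignment.
    static void* operator new(std::size_t size) {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }

    // Memory from _mm_malloc must be released with _mm_free.
    static void operator delete(void* p) noexcept {
        _mm_free(p);
    }
};

// Illustrative helper: true if the address is a multiple of 16.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Illustrative check: allocate on the heap, test the address, free it.
bool heap_vec4_is_aligned() {
    Vec4* p = new Vec4;          // goes through Vec4::operator new
    p->v = _mm_set1_ps(1.0f);    // touch the SSE member
    bool ok = is_aligned16(p);
    delete p;                    // goes through Vec4::operator delete
    return ok;
}
```

If you also allocate arrays of such objects with new[], the same pair would be needed for operator new[] and operator delete[]. Any class that contains one of these objects as a member needs the same treatment, since its own heap allocations would otherwise go through the default allocator.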