Since NCOL is 40, I thought I would have 40*8 objects
Actually, that would be 40 * 40 * 8 if the math were that simple, but I am not sure I can calculate the true number.

So let's see:

For the first row:

mat[0][0] contains 8 copies of mat elements, none of which contain copies of their own.
mat[0][1] contains 8 more copies, one of which (mat[0][0]) also has 8 copies.
mat[0][2] through mat[0][38] also each contain 8 copies, one of which (mat[0][n-1]) has 8 more.
mat[0][39] has 8 copies plus 16 more (from mat[0][0] and mat[0][38]), since I think your grid wraps.

For the second row:

mat[1][0] has 8 copies, plus mat[0][0] and mat[0][1] for 16 more. But mat[0][1] also has mat[0][0] for 8 more.
mat[1][1] has 8, plus mat[0][0], mat[0][1] and mat[0][2] for 24 more. The last two have mat[0][n-1] for another 16.
Same for 2 - 38.
mat[1][39] has 8 plus 24 plus 16 plus the 32 from mat[1][0] because of wrapping.

And it just gets worse. By the time you get to [39][39] the total is somewhere near a gazillion. :-)
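
If you want a number instead of my hand-waving, here is a rough sketch in Python (I don't know what language your code is in, so treat it as pseudocode you can run). It rests on assumptions I can't check from here: cells are filled in row-major order, every cell copies all 8 neighbours, the grid wraps on every edge, and each copy is a deep copy that drags along whatever that neighbour was holding at the time. The name nested_copy_counts is made up for this example. Because it follows the nesting all the way down, it will give larger figures than my rough tallies above.

```python
def nested_copy_counts(nrow, ncol):
    """Return a grid of how many copied objects each cell ends up holding.

    Assumptions (not taken from the original code): row-major fill order,
    8 wrapping neighbours per cell, and deep copies of each neighbour.
    """
    # 0 means "not filled yet", so an unfilled neighbour drags nothing along.
    sizes = [[0] * ncol for _ in range(nrow)]
    for r in range(nrow):
        for c in range(ncol):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell does not copy itself
                    nr, nc = (r + dr) % nrow, (c + dc) % ncol
                    # One object for the copy itself, plus everything
                    # that neighbour was already holding when copied.
                    total += 1 + sizes[nr][nc]
            sizes[r][c] = total
    return sizes


sizes = nested_copy_counts(40, 40)
print("mat[0][0]:", sizes[0][0])
print("mat[0][1]:", sizes[0][1])
print("grand total:", sum(map(sum, sizes)))
```

Python integers don't overflow, so the grand total prints in full no matter how large it gets.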

Glad the simpler solution worked for you.