Dear All, I would like to receive comments on the following problem. Suppose you have a matrix K and a matrix M, where M is small compared to K. Matrix K is large and sparse, and it is formed by the assembly of several M matrices. So, I work with one matrix M whose size changes every time I need to perform an assembly into K, and I always know the size of M just before performing the assembly. Because K is sparse and large, I want to store only its non-zero elements. Here is the question: what would be the most efficient programming technique for the assembly stage, i.e., putting matrix M into matrix K? Any comments are much appreciated. Thanks, Alejandro.

Maybe I need to explain more deeply in order to get a reply. First of all, I am using allocatable arrays, and I have no problems with the use of C++. This is an algorithmic question: how to do the assembly stage efficiently, from both the memory-management and running-time points of view.

Actually, this is a code for a meshfree method (MM) to solve a partial differential equation (PDE). The method is similar to the Finite Element Method (FEM). Basically, I need to construct a system of linear equations, Ku = f, where K is large and sparse (a lot of zeros inside). I don't need the zeros, so I am using a compressed-row format to store matrix K. The solution of the linear system is not a problem for me, since I am using TAUCS, which is a library for solving sparse linear systems. My problem is the assembly stage. In FEM the M matrices are always of the same size, so I can assemble them into K very efficiently (BTW, K has a fixed size as well). In MM the size of M is not fixed: in one assembly stage it can be, for instance, 10 x 10, in the next stage 50 x 50, in the next stage 20 x 20, and so on.

Some other things: matrix K is formed from the elements of M, and K has a fixed size, say 6000 x 6000. Once M is assembled (copied into K), the elements of M are not needed anymore, so I can deallocate it and allocate again in the next stage. So, I need to send the elements of M to the appropriate indices in K. There are a couple of ways to do this, and I would like to know which is more efficient. Maybe I should detail how it can be done, so you can figure out what would be more efficient from the memory and running-time points of view.

First option: since I know the size of M before performing the assembly, I can form the matrix M, say of size m x m, and then assemble the complete matrix M into K with a loop structure, with the aid of a vector that stores, in some way, the index (i,j) of matrix K into which element (k,r) of matrix M must be copied.
Second option: instead of forming matrix M, I can compute each element of M, say (k,r), directly into position (i,j) of K, with the aid of the same index vector described in the first option. I really don't like to bold words, but I did it above to make the difference between the two options evident. The difference is that in option 1 I assemble a complete matrix M into K, while in option 2 I compute each element of matrix M directly into K. In both options, I have to repeat the operation several times until K has been completely filled. During the assembly process, when a position (i,j) of matrix K has already received a number in a previous stage, the new element for (i,j) is simply added to the previous one. So, I would like to know which of the above options should run faster. Thanks again.

I think the only way to determine which is more efficient is to run trials. If I understand correctly, option 1 means K is a sparse array of M locations and you have a separate list of Ms, whereas in option 2 the Ms don't exist and K is just a sparse array of individual (albeit clustered) values.

In option 1, to determine K[p,q] you'd have to determine which M (p,q) corresponds to, then look up M[p-M.xpos, q-M.ypos]. Locating an M in K would be quicker, as there are fewer of them. In option 2, to determine K[p,q] you just have to look down the list of entries in K. If you hash the positions in K of the Ms or of the individual values, say as p*QMAX+q, then you can do a binary search; this applies to both options.

My feeling is that option 2 would be quicker (whether you use a binary search or not): although the search would be longer, this is offset by the fact that once the search is complete you already have the value you want, whereas option 1 requires two subtractions, an add, a multiply, and two memory lookups to pull the value out of M. A binary search iteration comprises a compare, a divide by two, an add or subtract, and a memory lookup, so the overhead of the longer search is smaller than the overhead of looking up in the list of Ms even if M is only 2x2, and option 1 gets worse as the size of M increases.

That might be completely wrong, though. If K comprises a large number of small Ms, then an individual M will take longer to find than if K comprises a small number of large Ms. Either way, the final lookup into the specific M will be just as quick, since it doesn't matter how large p-M.xpos and q-M.ypos are. But where K is a large number of individual values, the search, while immediately returning the final value, will likely compare favourably with the search time for a large number of small Ms, yet be considerably slower than if K contains a small number of large Ms.
If M is 50x50, K has 2500 times as many entries in option 2 as it does in option 1, so option 1's search is likely to be quicker in that case. You might find there is a particular size of M, for the same number of individual K[p,q] values, where option 1 is better on one side of that size and option 2 better on the other.
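The hashed-position binary search suggested above can be sketched as follows. This is only a sketch for the option-2 representation, assuming the nonzeros are kept sorted by the flattened key p*QMAX+q; Entry and QMAX are illustrative names (QMAX would be the column count of K, e.g. 6000):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed upper bound on the column index, taken from the 6000 x 6000
// example in the thread; any value larger than the widest row works.
constexpr std::int64_t QMAX = 6000;

// One stored nonzero of K, keyed by the flattened position p*QMAX + q.
struct Entry {
    std::int64_t key;  // p * QMAX + q
    double value;
};

// Binary search for K[p][q] in a key-sorted entry list; a position that
// is not stored is a structural zero.
double lookup(const std::vector<Entry>& entries, int p, int q) {
    std::int64_t key = static_cast<std::int64_t>(p) * QMAX + q;
    auto it = std::lower_bound(
        entries.begin(), entries.end(), key,
        [](const Entry& e, std::int64_t k) { return e.key < k; });
    if (it != entries.end() && it->key == key) return it->value;
    return 0.0;
}
```

Each probe of the search is the compare-halve-lookup iteration described above, so the cost grows only logarithmically with the number of stored values.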

Thanks for your reply. There is another piece of information that may be needed: the entries of matrix K are not known in advance, while the entries of M are known after you compute them and before placing them into K. In option 1, you compute a complete matrix M (very small compared to K) and then, with the aid of a vector that stores the indices of K at which the elements of M must be placed, you assemble the complete M into K. Thus, option 1 has a loop over the elements of M, but not over K, because thanks to the index vector you know exactly the positions in K. In option 2, you don't have to allocate memory for M; you simply compute a number into a variable of type double and, using the same index vector mentioned in option 1, place it directly into K. This option also has a loop, but not over a matrix M: the loop repeatedly computes the number in the double variable, which is then placed into K.

Before computing them. Remember, you first need to compute the values of M (option 1) or the value of a double variable (option 2). Then, when you have placed them (or it) into K, you know the values of K. When you have completed the procedure, you will have a large, sparse matrix K with which to solve the system Ku = f.

No, I'm still not following you. Let me use a couple of 1D examples; perhaps you can show me by modification what I'm not understanding. Let's say:

M1 = {M1a M1b}
M2 = {M2a M2b M2c}

and the matrix we want K to represent is:

{0 0 0 0 M1a M1b 0 0 0 0 M2a M2b M2c 0}

Option 1: let (2,4) represent "2 elements starting at position 4". Store K as {M1=(2,4), M2=(3,10)}, total size 14, and also store M1 and M2. So to calculate K[n]:

if n in (2,4) return M1[n-4]
else if n in (3,10) return M2[n-10]
else return 0

"n in (2,4)" would be coded, for example, as (n>=4 && n<6).

Option 2: now let (4) represent "1 element starting at position 4". Then K = {M1a=(4), M1b=(5), M2a=(10), M2b=(11), M2c=(12)}. So to calculate K[n]:

if n==4 return M1a
else if n==5 return M1b
etc.

All of this is done immediately before you want to solve Ku=f, and after you've computed the values for M. Perhaps you could post some (simplified) code; as I've said before, English descriptions of code can sometimes be a lot harder to understand than the code itself.
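The two 1D representations described above could be coded like this. It is only a sketch of the reply's example; Block, Cell, and the lookup functions are illustrative names:

```cpp
#include <cassert>
#include <vector>

// Option 1 in the reply's 1D example: K is a list of blocks, each
// recording where a stored M starts in K and how many entries it has.
struct Block {
    int count;                 // number of elements in this M
    int start;                 // starting position in K
    std::vector<double> data;  // the M itself
};

// Find which block (if any) covers position n, then index into its M.
double lookup_option1(const std::vector<Block>& K, int n) {
    for (const Block& b : K)
        if (n >= b.start && n < b.start + b.count)
            return b.data[n - b.start];
    return 0.0;  // no block covers n, so K[n] is zero
}

// Option 2: K is a flat list of (position, value) pairs.
struct Cell {
    int pos;
    double value;
};

double lookup_option2(const std::vector<Cell>& K, int n) {
    for (const Cell& c : K)
        if (c.pos == n) return c.value;
    return 0.0;  // position not stored, so K[n] is zero
}
```

With M1 = {M1a M1b} at position 4 and M2 = {M2a M2b M2c} at position 10, both lookups reproduce the 14-element K shown above.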

Hello, thanks for your answer. This is an example. First you initialize K to zero.

Option 1:

M = [ m11 m12 m13
      m21 m22 m23
      m31 m32 m33 ]

Then assemble M into K:

K = [ m11 0.0 0.0 0.0 m12 0.0 m13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      m21 0.0 0.0 0.0 m22 0.0 m23
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      m31 0.0 0.0 0.0 m32 0.0 m33 ]

Now I don't need M anymore, so I can deallocate it and reallocate for a new M that is going to be assembled into K. If a position of K is already occupied by a number, the new number is added to the previous one. The assembly process is done several times until you get the K which is used to solve Ku = f.

Option 2: suppose M in this stage is 3x3, as in the example for option 1, but you don't need to allocate memory for M; you just have a double variable in which you compute each of the elements that M would contain. Then you need to pass 9 doubles to K, element by element. Compute m11 in a double variable and assemble it into K:

K = [ m11 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 ]

Compute m12 in a double variable and assemble it into K:

K = [ m11 0.0 0.0 0.0 m12 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 ]

This is done until all the elements of M are assembled. Again, you need to do this several times until you get the K which is used to solve Ku = f.

The difference between the two approaches: in option 1 you first compute a matrix M which is not fixed in size (i.e., in one stage it can be 3 by 3 and in the next stage 50 by 50, and so on) and assemble it into K using a loop structure over M, while in option 2 you compute the values of M one at a time, directly into K. In both cases you are going to assemble the same number of values, but in option 1 you assemble several values in one loop and in option 2 you assemble one value at a time. I hope it is clearer now.

Oh, I see. In that case I think option 2 is the clear winner. Option 1 becomes:

for each new M
    allocate memory for M
    for n*n entries in M
        calculate entry
        place in M
    end for
    for n*n entries in M
        calculate location in K
        add to K[loc]
    end for
    deallocate memory
end for

and option 2 becomes:

for each new M
    for n*n entries in M
        calculate entry
        calculate location in K
        add to K[loc]
    end for
end for

So if anything option 1 uses more memory and is slower, because it loops over the same number of values twice, stores the data in the additional location M, and has the additional overhead of the memory allocate and deallocate for each new M. So I'd probably prefer option 2 unless you specifically need M for anything. If option 1 is preferable for code readability/maintainability, the extra loop and reading the data out of M won't make a lot of difference on a modern processor, but you should drop the deallocate/allocate part and use the same memory for all Ms, reallocating only to increase the scratch space. So for 3x3 you allocate 9; if the next is 50x50 you have to resize this to 2500; but if the next is 20x20 you just reuse 400 of the 2500-element array. This will also help avoid memory fragmentation, which can become a problem if you do vast amounts of allocate/deallocate.
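The grow-only scratch-space idea from the last paragraph can be sketched like this (Scratch is an illustrative name; the buffer is enlarged only when a stage needs more room than any previous stage did):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One scratch buffer reused across all assembly stages, as suggested in
// the reply: never deallocate between stages, only grow on demand.
class Scratch {
public:
    // Returns storage for at least m*m doubles, reusing prior capacity.
    double* reserve_for(int m) {
        std::size_t need = static_cast<std::size_t>(m) * m;
        if (need > buf_.size()) buf_.resize(need);  // grow only, never shrink
        return buf_.data();
    }

    // High-water mark: the largest m*m requested so far.
    std::size_t capacity() const { return buf_.size(); }

private:
    std::vector<double> buf_;
};
```

For the 3x3 / 50x50 / 20x20 sequence in the reply, the buffer grows to 9, then to 2500, and then stays at 2500 while the 20x20 stage reuses 400 of those slots.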

Hello, thanks again for your reply. That is exactly what I wanted to see. BTW, I don't need M for any other purpose, so option 2 is fine.