I store matrix entries in std::vector<double>, such that the reading is row by row. This means, for matrix 1 3 4 8 9 3 3 6 8 1 1 2 2 0 9 8 7 6 the std::vector<double> would have entries: {1,3,4,8,9,3,3,6,8,1,1,2,2,0,9,8,7,6}. To transpose the matrix I use the naive approach of iterating through std::vector, calculating for each entry its position in a matrix transpose. Recently, I got one suggestion: "When you have a matrix and a mapping to a one-dimensional vector, transposing is equivalent to a permutation. Now execute this permutation by sorting the elements by requested rank in the permutation result." I understand that transposing is equivalent to a permutation, but I dont know how to approach the problem this way. Any suggestion on how to do this (or to do matrix transpose is a faster way) is welcome.

It basically means swapping elements around. If it's a non-square array you'll also need to change the height and width variables. If you have a 2x3 matrix 1 2 3 4 5 6 stored left to right and top to bottom as 1 2 3 4 5 6, and you want to transpose it to 1 3 5 2 4 6 then the new storage will be 1 3 5 2 4 6. To get there: 1 2 3 4 5 6 1 4 3 2 5 6 swap 2 4 (positions) 1 5 3 2 4 6 swap 2 5 1 3 5 2 4 6 swap 2 3 and the matrix is now 3x2 instead of 2x3. Note we're swapping the data at the numbered positions, so the C code for "swap 2 4" would be tmp=arr[1]; arr[1]=arr[3]; arr[3]=tmp; - remembering that C counts from zero. A pi/2 rotation of the original would give 2 4 6 1 3 5 storage 2 4 6 1 3 5, so to get here: 1 2 3 4 5 6 2 1 3 4 5 6 swap 1 2 2 4 3 1 5 6 swap 2 4 2 4 6 1 5 3 swap 3 6 2 4 6 1 3 5 swap 5 6 If profiling shows this to be a bottleneck (AND NOT BEFORE - don't start suffering from premature optimisation) you could optimise it by minimising assignments to tmp variables (Tn) and copy instead of swap: 1 2 3 4 5 6 * 2 3 4 5 6 1->T1 2 * 3 4 5 6 2->1 2 4 3 * 5 6 4->2 2 4 * * 5 6 3->T2 2 4 6 * 5 * 6->3 2 4 6 * * 5 5->6 2 4 6 1 * 5 T1->4 2 4 6 1 3 5 T2->5 which takes it down from 12 copies (each swap involves 3 copies) to 8, although uses 2 temp variables. A better optimisation (6 copies) is available if you just copy from one matrix to another, and the best optimisation is not to have to transpose at all (maybe access the members in a different way, for example using a lookup table to access the members in the order 2 4 6 1 3 5).