Matrix transposition is one of the most common things computers do as this operation creeps itself into many day-to-day computer operations

From optimized 1D convolution kernels that make gaussian blurs possible to optimized matrix multiply algorithms that are the heart of many machine learning operations, it is safe to say this operation has received adequate research

## The naive way to transpose a matrix

Using the earlier representation of a matrix, matrix transposition is a simple data access and data write for every element in the array

1
2
3
4
5
6
7
8
9
10
11
12
13

pub fn transpose_scalar(in_matrix: &[u8],
out_matrix: &mut [u8],
in_stride: usize,
out_stride: usize)
{
for i in 0..out_stride
{
for j in 0..in_stride
{
out_matrix[(j * out_stride) + i] = in_matrix[(i * width) + j]
}
}
}

This is simple, ergonomic, works, does nothing fancy and gets the job done and it’s slow.

The latter is what comes to pinch us here.

From a CPU standpoint, array accesses to `in_matrix`

happen linearly, i.e we move from start to end, which is great, and the cpu can prefetch elements, but accesses to `out_matrix`

happen very stride-wise, assuming `out_stride`

was 800, accesses to `B`

for 5 iterations would be

1

0 800 1600 2400 3200 4000

This makes it a bit harder for prefetch logic to kick in, and even when it does, it causes a lot of cache wastage as we are loading a whole cache line to write one value.

## Tiled matrix transpose

Another way to look at matrix transposition is that a matrix transposition can be viewed as a submatrix transposing followed by figuring out where the newly transposed submatrix should go to.

E.g in the above image, we can transpose the three-by-three matrix (coloured in red) and if we place it at the position coloured yellow, we have achieved transposition.

Or more generically, we are trying to achieve…

But why transpose sub-matrices? Well,the most important is that cache misses reduce *significantly* for the operation. The sub-matrices have good spatial locality, hence we use the cache better.

But there is another thing with sub-matrices, sub-matrix transposes can be done in SIMD registers.

If we use a small enough sub-matrix, we can get rid of intermediate buffers to store the results of the matrix transposition and use SIMD instructions to perfom matrix transposition inside SIMD registers.

Intel has a famous `_MM_TRANSPOSE4_PS`

that can transpose floats in four SIMD registers so if we choose a 4 by 4 sub-matrix of floats, we can do some tilling to get it to work (with some caveats of course ^{1})

For my use case, I happened to be working with u8’s/bytes/chars hence I didn’t have the luxury of enjoying macros.

The only difference arising is that I just need to roll up my routine, which does a submatrix transpose of any size that comfortably fits into a SIMD register, and am 8 by 8 transpose seemed to hit the spot.

There is a small awkward issue with this, that being the fact SSE register is 128 bits while 8 u8’s give me a total of 64 bits, meaning that data loading and storing are a bit awkward but otherwise everything is good.

The 8 by 8 SSE transposition algorithm is not my own, credits to Hamid Buzidi(powturbo) for the Stack Overflow answer.

So a high overview of the implementation we are trying to implement is.

- Load a submatrix
- Transpose the submatrix
- Write the submatrix.

So let’s get on with it

Loading data

SSE registers usually fit 16 bytes in them, but for our use case we are working with a 8 by 8 sub-matrix.

To load data into it, we use

`_mm_loadl_epi64`

to only load 8 bytes into the lower half and then we interleave the two subsequent loads into one register using`_mm_unpacklo_epi64`

instruction. This allows us to pack two rows separated by a stride into a single`xmm`

register.The effect is that we have two rows of our sub-matrix inside one SSE register,hence we just need 4 registers to hold our 8 by 8 sub-matrix

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

pub unsafe fn transpose_inner(
in_matrix: &[u8], out: &mut [u8],
in_stride: usize, out_stride: usize,
)
{
let pos = 0;
// Load data from memory
// Load 64 bites to ensure we only take 8 u8's
let mn_0 = _mm_loadl_epi64(in_matrix[pos..].as_ptr().cast());
pos += in_stride;
let mn_1 = _mm_loadl_epi64(in_matrix[pos..].as_ptr().cast());
pos += in_stride;
// pack the first row with the second row
let rw_01 = _mm_unpacklo_epi64(mn_0, mn_1);
// repeat this 3 more times

- Transposing

Use `pushfb`

to perform a lookup that does some data moving for us.

1
2
3
4
5
6

let sv = _mm_set_epi8(15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8, 0);
//
let ov_0 = _mm_shuffle_epi8(rw_01, sv);
let ov_1 = _mm_shuffle_epi8(rw_23, sv);
// Two more times

`pushfb`

is a pretty nifty instruction we can use to do some data moving within a SIMD register. It’s usually hard to grasp but it has a nice magic to it.

To understand what the `shuffle`

will do, think about it like this, if the source register `rw_0`

was an array from 0-15, the shuffle instruction would move the data to be the same as itself, i.e `15`

,`7`

,`14`

,etc, etc in a single cycle, pretty neat.

We then unpack and interleave the data until we have two adjacent rows next to each other in the registers.

1
2
3
4
5
6
7

// repeat this two more times
let iv_0 = _mm_unpacklo_epi16(ov_0, ov_1);
let iv_1 = _mm_unpackhi_epi16(ov_0, ov_1);
// repeat this again two more times
let av_0 = _mm_unpacklo_epi32(iv_0, iv_2);
let av_1 = _mm_unpackhi_epi32(iv_0, iv_2);
//...

- Storing

To store the values, we write high and low values of the register to different destinations, separated by stride. extracting the high value into a lower value is done by a `_mm_unpackhi_epi64`

instruction and we store with `_mm_storel_epi64`

to only store the lower 64 bits of the `xmm`

register.

1
2
3
4
5
6
7
8
9
10
11

// Extract the higher part of the register since we will be writing it
// to a different stride, (one store should only add 8 bytes)
let sv_0 = _mm_unpackhi_epi64(av_0, _mm_setzero_si128());
let mut pos = 0;
_mm_storel_epi64(out[pos..].as_mut_ptr().cast(), av_0);
pos += out_stride;
_mm_storel_epi64(out[pos..].as_mut_ptr().cast(), sv_0);
pos += out_stride;
// Repeat this 6 more times.

And that’s it.

To recap,doing SIMD matrix transposition is:

- Load a sub-matrix into a SIMD register.
- Perform in-place matrix transposition.
- Write the data out to the appropriate destination

## Benchmarks

To see which one is faster, we can simply do a matrix transposition,utilizing Rust’s benchmark facilities (which require a nightly compiler)

The benchmark routine looks like this,

1
2
3
4
5
6
7
8
9
10
11
12
13
14

// -- Sniped ---
#[bench]
fn transpose_benchmark(b: &mut test::Bencher)
{
let width = 800;
let height = 800;
let dimensions = width * height;
let in_vec = vec![255; dimensions];
let mut out_vec = vec![0; dimensions];
b.iter(|| {
// transpose scalar or vector
})
}

It transposes an 800 by 800 matrix(medium size) array out of place and we time how many nanoseconds it takes to perform a single iteration.

And the results.

Test | ns/iter |
---|---|

transpose_scalar | 496,739 ns/iter (+/- 38,252) |

transpose_sse | 68,020 ns/iter (+/- 7,733) |

*Results are reproducible on x86 hardware*

Yep, that’s a 7x improvement right there.

## Further Reads

- Nvidia’s blog on Tiled matrix transposition on the GPU: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

### Footer

Only works for matrices whose widths and heights are multiples of 4 ↩