GEMM: From Pure C to SSE Optimized Micro Kernels

On the next pages we try to discover how BLIS achieves such great performance. For this journey we set up our own BLAS implementation!

In our ulmBLAS project we have implemented a simple matrix-matrix product that follows the ideas described in "BLIS: A Framework for Rapidly Instantiating BLAS Functionality".

Note that all benchmarks on these pages were generated when doctool transformed the doc files to HTML. They all ran on my MacBook Pro, which has a 2.4 GHz Intel Core 2 Duo (P8600, “Penryn”). The theoretical peak performance of one core is 9.6 GFLOPS.
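For readers who wonder where the 9.6 GFLOPS comes from, here is a minimal sketch of the arithmetic. It assumes 4 double-precision floating point operations per cycle, which is the figure usually quoted for this microarchitecture (one 128-bit SSE add and one 128-bit SSE multiply per cycle, each working on two doubles). That factor is an assumption of this sketch and is not stated above; the helper name peak_gflops is made up for illustration.

```c
#include <assert.h>

/*
 * Theoretical peak of one core in GFLOPS:
 *
 *     peak = clock rate in GHz * floating point operations per cycle
 *
 * flops_per_cycle is an assumed property of the CPU (4 for double
 * precision SSE on this sketch's assumptions), not a measured value.
 */
double
peak_gflops(double clock_ghz, int flops_per_cycle)
{
    assert(clock_ghz > 0.0);
    assert(flops_per_cycle > 0);

    return clock_ghz * flops_per_cycle;
}
```

With the numbers from above, peak_gflops(2.4, 4) gives 2.4 * 4 = 9.6. A benchmark result can then be read as a fraction of this peak, for example a measured 4.8 GFLOPS would be 50% of what one core can do in theory.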

