GEMM: From Pure C to SSE Optimized Micro Kernels

Note: Unfortunately on NA Digest I posted the https URL of this site. As our server uses only a self signed SSL certificate that is inconvenient, e.g. some browsers will not display formulas properly even if you trust the certificate. Use the http URL
instead. In the mean time I will order a proper signed certificate.

On the next pages we try to discover how BLIS can achieve such a great performance. For this journey we set up our own BLAS implementation!

In our ulmBLAS project we have implemented a simple matrix-matrix product that follows the ideas described in BLIS: A Framework for Rapidly Instantiating BLAS Functionality.

Note that all benchmarks on these pages were generated when doctool transformed the doc files to HTML. All this happened on my MacBook Pro which has a 2.4 GHz Intel Core 2 Duo (P8600, “Penryn”). The theoretical peak performance of one core is 9.6 GFLOPS.

Back to the main course