The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply

BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically.

First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance ANSI C) coding suggestions, which include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine-sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler.

We have developed a BLAS-GEMM-compatible, multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and 80% of peak on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the GEMM in the vendor-optimized BLAS. Other generators, search scripts, and performance results are under development.
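
The coding suggestions named above (manual loop unrolling, explicit removal of unnecessary dependencies, and machine-sympathetic C constructs) can be illustrated with a small hand-written sketch. This is not output from the PHiPAC generator: the function names matmul_naive and matmul_2x2, the fixed 2 x 2 register block, and the row-major layout are all assumptions made for illustration. An actual generator would presumably expose the block sizes as parameters and add cache-level blocking around a kernel like this one.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: plain triple loop computing C += A * B.
 * A is M x K, B is K x N, C is M x N; all row-major and contiguous.
 * Every update goes through memory, so the compiler must assume that
 * a store to C may alias a later load from A or B. */
void matmul_naive(size_t M, size_t K, size_t N,
                  const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

/* Illustrative stand-in for generated code (hypothetical, hand-written).
 * The i and j loops are unrolled by two, giving a 2 x 2 register block.
 * Requires M and N to be even; a real generator would emit cleanup code. */
void matmul_2x2(size_t M, size_t K, size_t N,
                const double *A, const double *B, double *C)
{
    assert(M % 2 == 0 && N % 2 == 0);

    for (size_t i = 0; i < M; i += 2) {
        for (size_t j = 0; j < N; j += 2) {
            /* Accumulate in local scalars rather than in C. Locals cannot
             * alias A or B, so the false memory dependencies disappear and
             * the accumulators can stay in registers. */
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;

            /* Walk the operands with pointers instead of recomputing
             * two-dimensional index expressions in the inner loop. */
            const double *a0 = A + i * K;   /* row i of A     */
            const double *a1 = a0 + K;      /* row i + 1 of A */
            const double *b  = B + j;       /* B[0][j]        */

            for (size_t k = 0; k < K; k++) {
                /* Load each operand once, then reuse it twice. */
                const double a0k = a0[k];
                const double a1k = a1[k];
                const double b0  = b[0];
                const double b1  = b[1];

                /* Four independent multiply-adds per iteration. */
                c00 += a0k * b0;
                c01 += a0k * b1;
                c10 += a1k * b0;
                c11 += a1k * b1;

                b += N;                     /* next row of B */
            }

            /* Write the finished block back to C exactly once. */
            double *c0 = C + i * N + j;
            double *c1 = c0 + N;
            c0[0] += c00;  c0[1] += c01;
            c1[0] += c10;  c1[1] += c11;
        }
    }
}
```

Both routines compute the same update, so matmul_naive can serve as a correctness check for the unrolled version. Whether the unrolled form is actually faster depends on the machine and compiler, which is the reason the methodology pairs the generator with a parameter search rather than fixing one block size.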

