The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply. BLAS3 matrix-matrix operations usually have great potential for agressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near peak performance. We have developed a methodology whereby near-peak performance on such routines can be acheved automatically. First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance Ansi C) coding suggestions that include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS-GEMM compatible multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, SGI Octane R10k, and 80% on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the GEMM in the vendor-optimized BLAS GEMM. Other generators, search scripts, and performance results are under development.

References in zbMATH (referenced in 46 articles )

Showing results 1 to 20 of 46.
Sorted by year (citations)

1 2 3 next

  1. Jianyu Huang, Robert A. van de Geijn: BLISlab: A Sandbox for Optimizing GEMM (2016) arXiv
  2. Audet, Charles; Dang, Kien-Cong; Orban, Dominique: Optimization of algorithms with OPAL (2014)
  3. Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack: From CUDA to opencl: towards a performance-portable solution for multi-platform GPU programming (2012) ioport
  4. Kalinnik, Natalia; Korch, Matthias; Rauber, Thomas: An efficient time-step-based self-adaptive algorithm for predictor-corrector methods of Runge-Kutta type (2011)
  5. D’Alberto, Paolo; Nicolau, Alexandru: Adaptive Winograd’s matrix multiplications (2009)
  6. Youseff, Lamia; Seymour, Keith; You, Haihang; Zagorodnov, Dmitrii; Dongarra, Jack; Wolski, Rich: Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software (2009) ioport
  7. Nishtala, Rajesh; Vuduc, Richard W.; Demmel, James W.; Yelick, Katherine A.: When cache blocking of sparse matrix vector multiply works and why (2007)
  8. Hitczenko, Paweł; Johnson, Jeremy R.; Huang, Hung-Jen: Distribution of a class of divide and conquer recurrences arising from the computation of the Walsh-Hadamard transform (2006)
  9. Korch, Matthias; Rauber, Thomas: Optimizing locality and scalability of embedded Runge-Kutta solvers using block-based pipelining (2006)
  10. Naono, Ken; Imamura, Toshiyuki: An evaluation towards automatically tuned eigensolvers (2006)
  11. Qasem, Apan; Kennedy, Ken; Mellor-Crummey, John: Automatic tuning of whole applications using direct search and a performance-based transformation system (2006) ioport
  12. Bender, Michael A.; Farach-Colton, Martín; Pemmasani, Giridhar; Skiena, Steven; Sumazin, Pavel: Lowest common ancestors in trees and directed acyclic graphs (2005)
  13. Lee, Yoon-Ju; Diniz, Pedro C.; Hall, Mary W.; Lucas, Robert: Empirical optimization for a sparse linear solver: A case study (2005) ioport
  14. Elmroth, Erik; Gustavson, Fred; Jonsson, Isak; Kågström, Bo: Recursive blocked algorithms and hybrid data structures for dense matrix library software (2004)
  15. Hunold, S.; Rauber, T.; Rünger, G.: Hierarchical matrix-matrix multiplication based on multiprocessor tasks (2004)
  16. Irony, Dror; Toledo, Sivan; Tiskin, Alexander: Communication lower bounds for distributed-memory matrix multiplication (2004)
  17. -: Bibliography of “Algorithms for memory hierarchies. Advanced lectures” (2003)
  18. Knijnenburg, P. M. W.; Kisuki, T.; O’Boyle, M. F. P.: Combined selection of tile sizes and unroll factors using iterative compilation (2003)
  19. Kowarschik, Markus; Weiß, Christian: An overview of cache optimization techniques and cache-aware numerical algorithms (2003)
  20. Quintana-Ortí, Enrique S.; van de Geijn, Robert A.: Formal derivation of algorithms: the triangular sylvester equation (2003)

1 2 3 next