Highly Efficient and Scalable Parallel Matrix Multiplication Algorithm Exploiting Advanced Communication Protocols

For many scientific applications, dense matrix multiplication is one of the most important linear algebra operations. Because optimized matrix multiplication can be extremely efficient, computational scientists routinely attempt to reformulate the mathematical description of their applications in terms of matrix multiplications. Parallel matrix multiplication algorithms have been investigated for over two decades; the currently leading SUMMA algorithm is included in ScaLAPACK and is the one predominantly used. A novel algorithm, called SRUMMA [1] (Shared and Remote memory access based Universal Matrix Multiplication Algorithm), was developed that provides better performance and scalability on a variety of computer architectures than the leading algorithms used today. Unlike other algorithms, which are based on message passing, the new algorithm relies on ARMCI, a high-performance remote memory access (one-sided) communication library developed under the DoE PModels project. In addition to fast communication (shared memory, nonblocking remote memory access), the new algorithm relies on careful scheduling of communication operations to minimize contention in access to blocks of the distributed matrices. ARMCI exploits native network communication interfaces and system resources (such as shared memory and RDMA) to achieve the best possible performance of remote memory access/one-sided communication. It exploits high-performance network protocols on clustered systems that use Myrinet, Quadrics, or InfiniBand networks.
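To make the one-sided, block-fetch pattern described above concrete, the sketch below shows a block-distributed matrix multiply in which each process pulls the remote A and B blocks it needs with one-sided gets and skews the order of its fetches so that processes target different owners at each step. It is a minimal illustration, not the authors' SRUMMA code: it uses MPI RMA (MPI_Win_allocate, MPI_Get, MPI_Win_flush) as a stand-in for ARMCI's remote-memory-access calls, and the block size, square process grid, and all names are assumptions made for the example.

```c
/* Illustrative sketch (not the authors' code): one-sided block-fetch
 * matrix multiply.  MPI RMA stands in for ARMCI here; the square
 * process grid and block size B are assumptions for the example. */
#include <mpi.h>
#include <stdlib.h>

#define B 256                      /* block dimension (assumed) */

static void dgemm_block(const double *a, const double *b, double *c) {
    for (int i = 0; i < B; i++)
        for (int k = 0; k < B; k++)
            for (int j = 0; j < B; j++)
                c[i*B + j] += a[i*B + k] * b[k*B + j];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int p = 0;                      /* use the largest p x p process grid */
    while ((p + 1) * (p + 1) <= nproc) p++;
    int row = rank / p, col = rank % p;

    /* Each process exposes its local blocks of A and B in RMA windows,
     * so other processes can fetch them without its participation. */
    double *a_loc, *b_loc, *c_loc = calloc((size_t)B * B, sizeof(double));
    MPI_Win win_a, win_b;
    MPI_Win_allocate((MPI_Aint)(B * B * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &a_loc, &win_a);
    MPI_Win_allocate((MPI_Aint)(B * B * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &b_loc, &win_b);
    /* ... fill a_loc and b_loc with this process's blocks of A and B ... */
    MPI_Barrier(MPI_COMM_WORLD);    /* blocks must be filled before any get */

    double *a_buf = malloc((size_t)B * B * sizeof(double));
    double *b_buf = malloc((size_t)B * B * sizeof(double));

    MPI_Win_lock_all(0, win_a);
    MPI_Win_lock_all(0, win_b);

    if (rank < p * p) {
        for (int s = 0; s < p; s++) {
            /* Skew the step index by the process coordinates so that, at
             * each step, different processes fetch from different owners --
             * a simple schedule in the spirit of the contention avoidance
             * the abstract mentions. */
            int k = (s + row + col) % p;
            int owner_a = row * p + k;   /* holds block A(row, k) */
            int owner_b = k * p + col;   /* holds block B(k, col) */

            /* One-sided fetches; the owners do not post matching calls. */
            MPI_Get(a_buf, B * B, MPI_DOUBLE, owner_a, 0, B * B, MPI_DOUBLE, win_a);
            MPI_Get(b_buf, B * B, MPI_DOUBLE, owner_b, 0, B * B, MPI_DOUBLE, win_b);
            MPI_Win_flush(owner_a, win_a);
            MPI_Win_flush(owner_b, win_b);

            dgemm_block(a_buf, b_buf, c_loc);  /* C(row,col) += A(row,k)*B(k,col) */
        }
    }

    MPI_Win_unlock_all(win_b);
    MPI_Win_unlock_all(win_a);

    free(a_buf); free(b_buf); free(c_loc);
    MPI_Win_free(&win_a);
    MPI_Win_free(&win_b);
    MPI_Finalize();
    return 0;
}
```

A real implementation would overlap the fetch of the next pair of blocks with the multiplication of the current pair (e.g. using nonblocking gets), which is the kind of communication/computation overlap the nonblocking ARMCI operations mentioned above are intended to enable.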
