
Parallel Matrix Multiplication: 2D and 3D

FLAME Working Note #62

Martin Schatz

Jack Poulson

Robert van de Geijn

The University of Texas at Austin

Austin, Texas 78712

June 11, 2012

Abstract

We describe an extension of the Scalable Universal Matrix Multiplication Algorithms (SUMMA) from 2D to 3D process grids; the underlying idea is to lower the communication volume through storing redundant copies of one or more matrices. While SUMMA was originally introduced for block-wise matrix distributions, so that most of its communication was within broadcasts, this paper focuses on element-wise matrix distributions, which lead to allgather-based algorithms. We begin by describing an allgather-based 2D SUMMA, describe its generalization to 3D process grids, and then discuss theoretical and experimental performance benefits of the new algorithms.

1 Introduction

The parallelization of dense matrix-matrix multiplication is a well-studied subject. Cannon's algorithm (sometimes called roll-roll-compute) dates back to 1969 [7], and Fox's algorithm (sometimes called broadcast-roll-compute) dates back to the 1980s [12]. Both suffer from a number of shortcomings:

- They assume that $p$ processes are viewed as an $r \times c$ grid, with $r = c = \sqrt{p}$. Removing this constraint on $r$ and $c$ is nontrivial.

- They do not deal well with the case where one of the matrix dimensions becomes relatively small. This is the most commonly encountered case in libraries like LAPACK [3] and libflame [22, 23], and their distributed-memory counterparts: ScaLAPACK [9], PLAPACK [21], and Elemental [17].

- They do not treat the other common cases of matrix-matrix multiplication: $C := A^T B + C$, $C := A B^T + C$, and $C := A^T B^T + C$.¹

These shortcomings were overcome by SUMMA. The original paper [20] gives four algorithms:

- For $C := AB + C$, SUMMA casts the computation in terms of multiple rank-k updates, an algorithm already independently reported in [1]. This algorithm is sometimes called the broadcast-broadcast-multiply algorithm, a label we will see is somewhat limiting. We also call this algorithm "stationary C" for reasons that will become clear later. By design, this algorithm continues to perform well in the case where the width of $A$ is small relative to the dimensions of $C$ (a small serial sketch of this rank-k view follows this list).

- For $C := A^T B + C$, SUMMA casts the computation in terms of multiple panel-matrix multiplies, so performance is not degraded in the case where the height of $A$ is small relative to the dimensions of $B$. We have also called this algorithm "stationary B" for reasons that will become clear later.

- For $C := A B^T + C$, SUMMA casts the computation in terms of multiple matrix-panel multiplies, and so performance does not deteriorate when the width of $C$ is small relative to the dimensions of $A$. We call this algorithm "stationary A" for reasons that will become clear later.

- For $C := A^T B^T + C$, the paper sketches an algorithm that is actually not practical.

¹ cf. the general form of dgemm [10]
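As an aside, the rank-k-update view of $C := AB + C$ is easy to make concrete. The following is a minimal serial sketch (plain NumPy; it illustrates only the blocked view of the computation, not SUMMA's communication pattern, and the block size b_alg is an arbitrary illustrative choice):

```python
import numpy as np

def gemm_by_rank_k_updates(A, B, C, b_alg=64):
    """Compute C := A B + C as a sequence of rank-b_alg updates.

    This mirrors the "stationary C" view: C stays in place while a column
    panel of A and the corresponding row panel of B are combined, one block
    of the k-dimension at a time.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and C.shape == (m, n)
    for k0 in range(0, k, b_alg):
        k1 = min(k0 + b_alg, k)
        # One rank-(k1 - k0) update contributes to all of C.
        C += A[:, k0:k1] @ B[k0:k1, :]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 200))
    B = rng.standard_normal((200, 250))
    C = rng.standard_normal((300, 250))
    assert np.allclose(gemm_by_rank_k_updates(A, B, C.copy(), b_alg=48), A @ B + C)
```

In the parallel algorithm, each such update of $C$ is preceded by communication (broadcasts in the original formulation, allgathers in the element-wise formulation considered here) that brings the corresponding column panel of $A$ and row panel of $B$ to the nodes that need them.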

In [14], it was shown how stationary A, B, and C algorithms can be formulated for each of the four cases of matrix-matrix multiplication, including for $C := A^T B^T + C$. This then yielded a general, practical family of matrix-matrix multiplication algorithms, all of which were incorporated into PLAPACK and Elemental, and some of which are supported by ScaLAPACK.

In the 1990s, it was observed that for the case where matrices were relatively small (or, equivalently, a relatively large number of nodes were available), better theoretical and practical performance resulted from viewing the $p$ nodes as a $\sqrt[3]{p} \times \sqrt[3]{p} \times \sqrt[3]{p}$ mesh, yielding a 3D algorithm [2]. It was recently observed that the nodes could instead be viewed as an $r \times c \times h$ mesh, with $r = c$ but general $h$, building on Cannon's algorithm. This was labeled a 2.5D algorithm [19] since it was believed that $h$ would always be chosen to be less than $\sqrt[3]{p}$ for theoretical optimality reasons. It was subsequently observed by the authors of [19] that, by using the SUMMA algorithm instead, at the expense of a slightly higher latency cost, the constraint that $r = c$ could be removed.

This paper attempts to give the most up-to-date treatment yet of practical parallel matrix-matrix multiplication algorithms. It does so by first discussing how collective communication is related to matrix-vector multiplication ($y := Ax$) and rank-1 update ($C := y x^T + C$). This is then used to link matrix distribution on a 2D mesh of nodes to the communications required for these matrix operations. This then leads naturally into various practical algorithms for all cases of matrix-matrix multiplication on 2D meshes of nodes. We finally generalize the notion of 2.5D algorithms to yield a general family of 3D algorithms that build on 2D stationary C, A, and B algorithms. We call this general class of algorithms General 3D algorithms, or G3D algorithms for short.

The paper makes a number of contributions:

- It systematically exposes the link between collective communication, matrix and vector distribution, and matrix-vector operations like matrix-vector multiplication and rank-1 update.

- It next shows how parallel algorithms for these matrix-vector operations systematically yield a family of matrix-matrix multiplication algorithms that view the nodes as a 2D mesh.

- It exposes a set-based notation for describing the distribution of matrices and vectors that is used by the Elemental library for dense matrix computations on distributed-memory architectures.

- It shows how the family of 2D algorithms can be used to build a family of algorithms that view the nodes as a 3D mesh.

- It provides performance results that illustrate the benefits of the family of 2D and 3D algorithms.

The exposition builds the reader's understanding so that he/she can easily customize the algorithms for other matrix distributions.

2 Of Matrix-Vector Operations and Distribution

In this section, we discuss how matrix and vector distribution can be linked to parallel 2D matrix-vector multiplication and rank-1 update operations, which then allows us to eventually describe the stationary C, A, and B 2D algorithms for matrix-matrix multiplication that are part of the Elemental library.

2.1 Collective communication

Collectives are fundamental to the parallelization of dense matrix operations. Thus, the reader must be, or become, familiar with the basics of these communications and is encouraged to read Chan et al. [8], which presents collectives in a systematic way that dovetails perfectly with the present paper.

Operation (before → after, illustrated on four nodes):

- Broadcast: before, node 0 owns $x$; after, every node owns a copy of $x$.
- Reduce(-to-one): before, node $j$ owns a contribution $x^{(j)}$; after, node 0 owns $\sum_j x^{(j)}$.
- Scatter: before, node 0 owns all pieces $x_0, x_1, x_2, x_3$; after, node $j$ owns piece $x_j$.
- Gather: before, node $j$ owns piece $x_j$; after, node 0 owns all pieces $x_0, x_1, x_2, x_3$.
- Allgather: before, node $j$ owns piece $x_j$; after, every node owns all pieces $x_0, x_1, x_2, x_3$.
- Reduce-scatter: before, node $j$ owns contributions $x_0^{(j)}, \ldots, x_3^{(j)}$; after, node $i$ owns $\sum_j x_i^{(j)}$.
- Allreduce: before, node $j$ owns a contribution $x^{(j)}$; after, every node owns $\sum_j x^{(j)}$.
- All-to-all: before, node $j$ owns pieces $x_0^{(j)}, \ldots, x_3^{(j)}$; after, node $i$ owns pieces $x_i^{(0)}, \ldots, x_i^{(3)}$.

Figure 1: Collective communications considered in this paper.
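To make the before/after pictures in Figure 1 concrete, here is a small, self-contained Python sketch that simulates the data movement of a few of the collectives on a list of per-node values (purely illustrative; it is not an MPI implementation and ignores cost):

```python
# Pure-Python simulation of the data movement of several collectives.
# Entry j of each input list holds node j's local data.

def broadcast(data, root=0):
    """Before: only data[root] matters. After: every node holds a copy."""
    return [data[root] for _ in data]

def allgather(pieces):
    """Before: node j holds piece x_j. After: every node holds [x_0, ..., x_{p-1}]."""
    return [list(pieces) for _ in pieces]

def reduce_scatter(contributions):
    """Before: node j holds [x_0^(j), ..., x_{p-1}^(j)].
    After: node i holds sum_j x_i^(j)."""
    p = len(contributions)
    return [sum(contributions[j][i] for j in range(p)) for i in range(p)]

def allreduce(contributions):
    """Before: node j holds x^(j). After: every node holds sum_j x^(j)."""
    total = sum(contributions)
    return [total for _ in contributions]

if __name__ == "__main__":
    print(allgather([10, 11, 12, 13]))          # each node gets all four pieces
    print(reduce_scatter([[1, 2, 3, 4],
                          [1, 2, 3, 4],
                          [1, 2, 3, 4],
                          [1, 2, 3, 4]]))       # node i gets 4 * (i + 1)
```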

Communication     Latency          Bandwidth         Computation       Cost used for analysis
Broadcast         ⌈log2(p)⌉ α      n β               –                 log2(p) α + n β
Reduce(-to-one)   ⌈log2(p)⌉ α      n β               (p−1)/p · n γ     log2(p) α + n β + n γ
Scatter           ⌈log2(p)⌉ α      (p−1)/p · n β     –                 log2(p) α + (p−1)/p · n β
Gather            ⌈log2(p)⌉ α      (p−1)/p · n β     –                 log2(p) α + (p−1)/p · n β
Allgather         ⌈log2(p)⌉ α      (p−1)/p · n β     –                 log2(p) α + (p−1)/p · n β
Reduce-scatter    ⌈log2(p)⌉ α      (p−1)/p · n β     (p−1)/p · n γ     log2(p) α + (p−1)/p · n β + (p−1)/p · n γ
Allreduce         ⌈log2(p)⌉ α      2(p−1)/p · n β    (p−1)/p · n γ     2 log2(p) α + 2(p−1)/p · n β + (p−1)/p · n γ
All-to-all        ⌈log2(p)⌉ α      (p−1)/p · n β     –                 log2(p) α + (p−1)/p · n β

Figure 2: Lower bounds for the different components of communication cost. Conditions for the lower bounds are given in [8] and [6]. The last column gives the cost functions that we use in our analyses. For architectures with sufficient connectivity, simple algorithms exist with costs that remain within a small constant factor of all but one of the given formulae. The exception is the all-to-all, for which there are algorithms that achieve the lower bounds for the α and β terms separately, but it is not clear whether an algorithm that consistently achieves performance within a constant factor of the given cost function exists.
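The cost functions in the last column of Figure 2 are used repeatedly in the analyses that follow, so it can be convenient to encode them. Below is a small sketch assuming the usual model in which α is the per-message latency, β the per-element transfer time, and γ the per-element compute time; the machine constants in the example are made-up placeholders:

```python
from math import log2

def broadcast_cost(p, n, alpha, beta):
    # log2(p) alpha + n beta
    return log2(p) * alpha + n * beta

def allgather_cost(p, n, alpha, beta):
    # log2(p) alpha + (p-1)/p n beta
    return log2(p) * alpha + (p - 1) / p * n * beta

def reduce_scatter_cost(p, n, alpha, beta, gamma):
    # log2(p) alpha + (p-1)/p n beta + (p-1)/p n gamma
    return log2(p) * alpha + (p - 1) / p * n * (beta + gamma)

def allreduce_cost(p, n, alpha, beta, gamma):
    # 2 log2(p) alpha + 2 (p-1)/p n beta + (p-1)/p n gamma
    return 2 * log2(p) * alpha + (p - 1) / p * n * (2 * beta + gamma)

if __name__ == "__main__":
    p, n = 64, 10_000
    alpha, beta, gamma = 1e-6, 1e-9, 1e-10   # hypothetical machine constants
    print(f"allgather : {allgather_cost(p, n, alpha, beta):.3e} s")
    print(f"allreduce : {allreduce_cost(p, n, alpha, beta, gamma):.3e} s")
```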

To make this paper self-contained, Figure 1 (similar to Figure 1 in [8]) summarizes the collectives. In Figure 2 we summarize lower bounds on the cost of the collective communications, under basic assumptions explained in [8] (see [6] for an analysis of the all-to-all), and the cost expressions that we will use in our analyses.

2.2 Motivation: matrix-vector multiplication

Suppose $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n}$, and $y \in \mathbb{R}^{m}$, and label their individual elements so that
$$A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{pmatrix},\quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix},\quad\text{and}\quad y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}.$$
Recalling that $y = Ax$ (matrix-vector multiplication) is computed as
$$\begin{array}{ccl} \psi_0 &=& \alpha_{0,0}\chi_0 + \alpha_{0,1}\chi_1 + \cdots + \alpha_{0,n-1}\chi_{n-1} \\ \psi_1 &=& \alpha_{1,0}\chi_0 + \alpha_{1,1}\chi_1 + \cdots + \alpha_{1,n-1}\chi_{n-1} \\ & \vdots & \\ \psi_{m-1} &=& \alpha_{m-1,0}\chi_0 + \alpha_{m-1,1}\chi_1 + \cdots + \alpha_{m-1,n-1}\chi_{n-1}, \end{array}$$
we notice that element $\alpha_{i,j}$ multiplies $\chi_j$ and contributes to $\psi_i$. Thus we may summarize the interactions of the elements of $x$, $y$, and $A$ by
$$\begin{array}{c|cccc} & \chi_0 & \chi_1 & \cdots & \chi_{n-1} \\ \hline \psi_0 & \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \psi_1 & \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ \psi_{m-1} & \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{array} \qquad (1)$$
which is meant to indicate that $\chi_j$ must be multiplied by the elements in the $j$th column of $A$ while the $i$th row of $A$ contributes to $\psi_i$.
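To make the interaction pattern in (1) concrete, the following NumPy sketch (illustrative only) computes $y = Ax$ two ways: row-wise, where row $i$ of $A$ produces $\psi_i$, and column-wise, where $\chi_j$ scales column $j$ of $A$. These are exactly the two views that matter when deciding how to distribute $A$, $x$, and $y$:

```python
import numpy as np

def matvec_row_wise(A, x):
    # psi_i = sum_j alpha_{i,j} * chi_j : row i of A contributes to psi_i.
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        y[i] = A[i, :] @ x
    return y

def matvec_column_wise(A, x):
    # y = sum_j chi_j * A[:, j] : chi_j multiplies the j-th column of A.
    m, n = A.shape
    y = np.zeros(m)
    for j in range(n):
        y += x[j] * A[:, j]
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 4))
    x = rng.standard_normal(4)
    assert np.allclose(matvec_row_wise(A, x), A @ x)
    assert np.allclose(matvec_column_wise(A, x), A @ x)
```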

[Figure 3 depicts the $3 \times 3$ mesh of nodes, showing the submatrix of $A$ and the subvectors of $x$ and $y$ stored by each node under the distributions described in Section 2.3.]

Figure 3: Distribution of $A$, $x$, and $y$ within a $3 \times 3$ mesh. Notice that redistributing a column of $A$ in the same manner as $y$ requires simultaneous gathers within rows of nodes, while redistributing a row of $A$ consistently with $x$ requires simultaneous gathers within columns of nodes. In the notation of Section 3, the distributions of $x$ and $y$ here are given by $x(V_R, \ast)$ and $y(V_C, \ast)$, respectively, and that of $A$ by $A(M_C, M_R)$. While the presented mesh of nodes is square, none of the results depend on the mesh being square.

2.3 Two-Dimensional Elemental Cyclic Distribution

It is well established that scalable implementations of dense linear algebra operations require nodes to be logically viewed as a two-dimensional mesh. It is also well established that to achieve load balance for a wide range of matrix operations, matrices should be cyclically "wrapped" onto this logical mesh. We start with these insights and examine the simplest of matrix distributions that result. Denoting the number of nodes by $p$, an $r \times c$ mesh must be chosen such that $p = r c$.

Matrix distribution. The elements of $A$ are assigned using an elemental cyclic (round-robin) distribution where $\alpha_{i,j}$ is assigned to node $(i \bmod r,\ j \bmod c)$. Thus, node $(s,t)$ stores submatrix
$$A(s:r:m-1,\ t:c:n-1) = \begin{pmatrix} \alpha_{s,t} & \alpha_{s,t+c} & \cdots \\ \alpha_{s+r,t} & \alpha_{s+r,t+c} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix},$$
where the left-hand side of the expression uses the MATLAB convention for expressing submatrices, starting indexing from zero instead of one. This is illustrated in Figure 3.
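For concreteness, this assignment rule is easy to write down; the following NumPy sketch (an illustration of the distribution rule only, not of Elemental's data structures) computes the owning node of an element and the local block stored by a node:

```python
import numpy as np

def owner(i, j, r, c):
    """Node (s, t) that stores element alpha_{i,j} under the elemental cyclic distribution."""
    return (i % r, j % c)

def local_block(A, s, t, r, c):
    """Submatrix A(s:r:m-1, t:c:n-1) stored by node (s, t), zero-based as in the text."""
    return A[s::r, t::c]

if __name__ == "__main__":
    r, c = 3, 3
    A = np.arange(81).reshape(9, 9)
    # Element (4, 7) lives on node (4 mod 3, 7 mod 3) = (1, 1).
    assert owner(4, 7, r, c) == (1, 1)
    # Node (1, 1) stores rows 1, 4, 7 and columns 1, 4, 7 of A.
    print(local_block(A, 1, 1, r, c))
```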
[Figure 4 depicts the $3 \times 3$ mesh of nodes, showing which copies of the elements of $x$ and $y$ each node holds after the redistributions described in the caption.]

Figure 4: Vectors $x$ and $y$ respectively redistributed as row-projected and column-projected vectors. The column-projected vector $y(M_C, \ast)$ here is to be used to compute local results that will become contributions to a column vector $y(V_C, \ast)$, which will result from adding these local contributions within rows of nodes. By comparing and contrasting this figure with Figure 3, it becomes obvious that redistributing $x(V_R)$ to $x(M_R)$ requires an allgather within columns of nodes, while $y(V_C)$ results from scattering $y(M_C)$ within process rows.

Column-major vector distribution. A column-major vector distribution views the $r \times c$ mesh of nodes as a linear array of $p$ nodes, numbered in column-major order.

A vector is distributed with this distribution if it is assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector $y$. Its element $\psi_i$ is assigned to node $(i \bmod r,\ (i/r) \bmod c)$, where $/$ denotes integer division. Or, equivalently in MATLAB-like notation, node $(s,t)$ stores subvector $y(u : p : m-1)$, where $u(s,t) = s + t r$ equals the rank of node $(s,t)$ when the nodes are viewed as a one-dimensional array, indexed in column-major order. The distribution of $y$ is illustrated in Figure 3.

Row-major vector distribution. Similarly, a row-major vector distribution views the $r \times c$ mesh of nodes as a linear array of $p$ nodes, numbered in row-major order. The vector is then assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector $x$. Its element $\chi_j$ is assigned to node $(j \bmod c,\ (j/c) \bmod r)$. Or, equivalently in MATLAB-like notation, node $(s,t)$ stores subvector $x(v : p : n-1)$, where $v(s,t)$ equals the rank of node $(s,t)$ when the nodes are viewed as a one-dimensional array, indexed in row-major order.
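The vector distribution rules can be checked the same way. The sketch below (again only illustrative) encodes the column-major case and verifies that the owning node of element $i$ is the node whose column-major rank equals $i \bmod p$; the row-major case is the analogous assignment with the nodes ranked row by row:

```python
def colmajor_owner(i, r, c):
    """Node (s, t) owning vector element i under the column-major vector distribution."""
    return (i % r, (i // r) % c)

def colmajor_rank(s, t, r):
    """Rank of node (s, t) when the r x c mesh is ordered column by column."""
    return s + t * r

if __name__ == "__main__":
    r, c = 3, 3
    p = r * c
    for i in range(20):
        s, t = colmajor_owner(i, r, c)
        # The owner is exactly the node whose column-major rank equals i mod p.
        assert colmajor_rank(s, t, r) == i % p
    # Node (s, t) therefore stores y(u : p : m-1) with u = s + t*r; e.g. on a 3 x 3 mesh,
    # node (1, 2) stores elements 7, 16, 25, ... of y.
    print(colmajor_owner(7, r, c))   # -> (1, 2)
```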