Parallel Matrix Multiplication: 2D and 3D
FLAME Working Note #62
Martin Schatz
Jack Poulson
Robert van de Geijn
The University of Texas at Austin
Austin, Texas 78712
June 11, 2012
Abstract
We describe an extension of the Scalable Universal Matrix Multiplication Algorithms (SUMMA) from 2D to 3D process grids; the underlying idea is to lower the communication volume through storing redundant copies of one or more matrices. While SUMMA was originally introduced for block-wise matrix distributions, so that most of its communication was within broadcasts, this paper focuses on element-wise matrix distributions, which lead to allgather-based algorithms. We begin by describing an allgather-based 2D SUMMA, describe its generalization to 3D process grids, and then discuss theoretical and experimental performance benefits of the new algorithms.

1 Introduction
The parallelization of dense matrix-matrix multiplication is a well-studied subject. Cannon's algorithm (sometimes called roll-roll-compute) dates back to 1969 [7], and Fox's algorithm (sometimes called broadcast-roll-compute) dates back to the 1980s [12]. Both suffer from a number of shortcomings:
- They assume that p processes are viewed as an r × c grid, with r = c = √p. Removing this constraint on r and c is nontrivial.
- They do not deal well with the case where one of the matrix dimensions becomes relatively small. This is the most commonly encountered case in libraries like LAPACK [3] and libflame [22, 23], and their distributed-memory counterparts: ScaLAPACK [9], PLAPACK [21], and Elemental [17].
- They do not treat the other common cases of matrix-matrix multiplication: C := A^T B + C, C := A B^T + C, and C := A^T B^T + C.¹

¹ cf. the general form of dgemm [10]
These shortcomings were overcome by SUMMA. The original paper [20] gives four algorithms:
- For C := AB + C, SUMMA casts the computation in terms of multiple rank-k updates, an algorithm already independently reported in [1]. This algorithm is sometimes called the broadcast-broadcast-multiply algorithm, a label we will see is somewhat limiting. We also call this algorithm "stationary C" for reasons that will become clear later. By design, this algorithm continues to perform well in the case where the width of A is small relative to the dimensions of C.
- For C := A^T B + C, SUMMA casts the computation in terms of multiple panel-matrix multiplies, so performance is not degraded in the case where the height of A is small relative to the dimensions of B. We have also called this algorithm "stationary B" for reasons that will become clear later.
- For C := A B^T + C, SUMMA casts the computation in terms of multiple matrix-panel multiplies, and so performance does not deteriorate when the width of C is small relative to the dimensions of A. We call this algorithm "stationary A" for reasons that will become clear later.
- For C := A^T B^T + C, the paper sketches an algorithm that is actually not practical.
In [14], it was shown how stationary A, B, and C algorithms can be formulated for each of the four cases of matrix-matrix multiplication, including for C := A^T B^T + C. This then yielded a general, practical family of matrix-matrix multiplication algorithms, all of which were incorporated into PLAPACK and Elemental, and some of which are supported by ScaLAPACK.
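To make the panel-by-panel structure of the stationary-C formulation concrete, here is a minimal serial sketch of C := AB + C computed as multiple rank-b updates. This is only our own illustration, not the distributed algorithm: the panel width b and all names are invented, and in SUMMA each panel of A would be broadcast within process rows and each panel of B within process columns.

```c
#include <stdio.h>

/* Serial illustration of casting C := A*B + C as multiple rank-b updates:
   for each panel p, C += A(:, p:p+b-1) * B(p:p+b-1, :).  In SUMMA's
   stationary-C algorithm each such panel of A and B is communicated;
   here everything is local. */
#define M 4
#define N 4
#define K 6
#define B_PANEL 2 /* panel width b (hypothetical choice; must divide K here) */

static void rank_b_update(double C[M][N], const double A[M][K],
                          const double B[K][N], int p, int b) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int l = p; l < p + b; l++)
                C[i][j] += A[i][l] * B[l][j];
}

int main(void) {
    double A[M][K], B[K][N], C[M][N] = {{0}};
    for (int i = 0; i < M; i++) for (int l = 0; l < K; l++) A[i][l] = 1.0;
    for (int l = 0; l < K; l++) for (int j = 0; j < N; j++) B[l][j] = 1.0;
    for (int p = 0; p < K; p += B_PANEL)   /* loop over panels of width b */
        rank_b_update(C, A, B, p, B_PANEL);
    printf("C[0][0] = %g (expected %d)\n", C[0][0], K);
    return 0;
}
```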
In the 1990s, it was observed that for the case where matrices were relatively small (or, equivalently, a relatively large number of nodes was available), better theoretical and practical performance resulted from viewing the p nodes as a ∛p × ∛p × ∛p mesh, yielding a 3D algorithm [2]. It was recently observed that the nodes could instead be viewed as an r × c × h mesh, with r = c but general h, building on Cannon's algorithm. This was labeled a 2.5D algorithm [19], since it was believed that h would always be chosen to be less than ∛p for theoretical optimality reasons. It was subsequently observed by the authors of [19] that, by using the SUMMA algorithm instead, at the expense of a slightly higher latency cost, the constraint that r = c could be removed.

This paper attempts to give the most up-to-date treatment yet of practical parallel matrix-matrix multiplication algorithms. It does so by first discussing how collective communication is related to matrix-vector multiplication (y := Ax) and rank-1 update (C := yx^T + C). This is then used to link the distribution of matrices on a 2D mesh of nodes to the communications required for these matrix operations. This then leads naturally into various practical algorithms for all cases of matrix-matrix multiplication on 2D meshes of nodes. We finally generalize the notion of 2.5D algorithms to yield a general family of 3D algorithms that build on the 2D stationary C, A, and B algorithms. We call this general class of algorithms General 3D algorithms, or G3D algorithms for short.
The paper makes a number of contributions:
- It systematically exposes the link between collective communication, matrix and vector distribution, and matrix-vector operations like matrix-vector multiplication and rank-1 update.
- It shows how parallel algorithms for these matrix-vector operations systematically yield a family of matrix-matrix multiplication algorithms that view the nodes as a 2D mesh.
- It exposes a set-based notation for describing the distribution of matrices and vectors that is used by the Elemental library for dense matrix computations on distributed-memory architectures.
- It shows how the family of 2D algorithms can be used to build a family of algorithms that view the nodes as a 3D mesh.
- It provides performance results that illustrate the benefits of the family of 2D and 3D algorithms.
The exposition builds the reader's understanding so that he/she can easily customize the algorithms for other matrix distributions.

2 Of Matrix-Vector Operations and Distribution

In this section, we discuss how matrix and vector distribution can be linked to parallel 2D matrix-vector multiplication and rank-1 update operations, which then allows us to eventually describe the stationary C, A, and B 2D algorithms for matrix-matrix multiplication that are part of the Elemental library.

2.1 Collective communication
Collectives are fundamental to the parallelization of dense matrix operations. Thus, the reader must be, or become, familiar with the basics of these communications and is encouraged to read Chan et al. [8], which presents collectives in a systematic way that dovetails perfectly with the present paper. To make this paper self-contained, Figure 1 (similar to Figure 1 in [8]) summarizes the collectives. In Figure 2 we summarize lower bounds on the cost of the collective communications, under basic assumptions explained in [8] (see [6] for an analysis of the all-to-all), and the cost expressions that we will use in our analyses.
[Figure 1: Collective communications considered in this paper: before/after views, on four nodes, of broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, allreduce, and all-to-all.]
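For readers who want to experiment, the collectives of Figure 1 correspond directly to MPI calls. The following minimal sketch (our own; the buffer size is arbitrary) exercises broadcast and allgather, the collective most central to this paper's algorithms.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal illustration of two collectives from Figure 1 using MPI.
   Compile with mpicc; run with, e.g., mpirun -np 4 ./a.out */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Broadcast: node 0's value x ends up on all nodes. */
    double x = (rank == 0) ? 3.14 : 0.0;
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Allgather: each node contributes x_i; every node receives
       (x_0, ..., x_{p-1}).  Fixed buffer assumes p <= 64. */
    double xi = (double)rank, all[64];
    MPI_Allgather(&xi, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d: bcast=%g, allgather[last]=%g\n", rank, x, all[p - 1]);
    MPI_Finalize();
    return 0;
}
```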
Communication    Latency        Bandwidth       Computation     Cost used for analysis
Broadcast        ⌈log2(p)⌉ α    n β             -               log2(p) α + n β
Reduce(-to-one)  ⌈log2(p)⌉ α    n β             ((p-1)/p) n γ   log2(p) α + n β + n γ
Scatter          ⌈log2(p)⌉ α    ((p-1)/p) n β   -               log2(p) α + ((p-1)/p) n β
Gather           ⌈log2(p)⌉ α    ((p-1)/p) n β   -               log2(p) α + ((p-1)/p) n β
Allgather        ⌈log2(p)⌉ α    ((p-1)/p) n β   -               log2(p) α + ((p-1)/p) n β
Reduce-scatter   ⌈log2(p)⌉ α    ((p-1)/p) n β   ((p-1)/p) n γ   log2(p) α + ((p-1)/p) n β + ((p-1)/p) n γ
Allreduce        ⌈log2(p)⌉ α    2((p-1)/p) n β  ((p-1)/p) n γ   2 log2(p) α + 2((p-1)/p) n β + ((p-1)/p) n γ
All-to-all       ⌈log2(p)⌉ α    ((p-1)/p) n β   -               log2(p) α + ((p-1)/p) n β

Figure 2: Lower bounds for the different components of communication cost. Conditions for the lower bounds are given in [8] and [6]. The last column gives the cost functions that we use in our analyses. For architectures with sufficient connectivity, simple algorithms exist with costs that remain within a small constant factor of all but one of the given formulae. The exception is the all-to-all, for which there are algorithms that achieve the lower bound for the α and β terms separately, but it is not clear whether an algorithm that consistently achieves performance within a constant factor of the given cost function exists.
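As a sanity check on how these expressions behave, the following sketch encodes the "cost used for analysis" column as C functions in the usual α-β-γ model (α = per-message latency, β = per-element transfer time, γ = per-flop compute time); the machine parameters in main are invented for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Cost functions from the last column of Figure 2.  n is the message size
   in elements, p the number of nodes.  Note the analysis costs drop the
   ceiling on log2(p). */
static double lg(double p) { return log2(p); }

double bcast_cost(double p, double n, double a, double b) {
    return lg(p) * a + n * b;
}
double allgather_cost(double p, double n, double a, double b) {
    return lg(p) * a + (p - 1) / p * n * b;
}
double reduce_scatter_cost(double p, double n, double a, double b, double g) {
    return lg(p) * a + (p - 1) / p * n * (b + g);
}
double allreduce_cost(double p, double n, double a, double b, double g) {
    return 2 * lg(p) * a + 2 * (p - 1) / p * n * b + (p - 1) / p * n * g;
}

int main(void) {
    double a = 1e-6, b = 1e-9, g = 1e-10; /* illustrative machine parameters */
    double p = 1024, n = 1e6;
    printf("broadcast: %.3e s  allgather: %.3e s  allreduce: %.3e s\n",
           bcast_cost(p, n, a, b), allgather_cost(p, n, a, b),
           allreduce_cost(p, n, a, b, g));
    return 0;
}
```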
2.2 Motivation: matrix-vector multiplication
Suppose A ∈ R^{m×n}, x ∈ R^n, and y ∈ R^m, and label their individual elements so that

A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{pmatrix}, \quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}.

Recalling that y = Ax (matrix-vector multiplication) is computed as

\begin{array}{ccl} \psi_0 &=& \alpha_{0,0}\chi_0 + \alpha_{0,1}\chi_1 + \cdots + \alpha_{0,n-1}\chi_{n-1} \\ \psi_1 &=& \alpha_{1,0}\chi_0 + \alpha_{1,1}\chi_1 + \cdots + \alpha_{1,n-1}\chi_{n-1} \\ \vdots & & \vdots \\ \psi_{m-1} &=& \alpha_{m-1,0}\chi_0 + \alpha_{m-1,1}\chi_1 + \cdots + \alpha_{m-1,n-1}\chi_{n-1} \end{array}

we notice that element α_{i,j} multiplies χ_j and contributes to ψ_i. Thus we may summarize the interactions of the elements of x, y, and A by

\begin{array}{c|cccc} & \chi_0 & \chi_1 & \cdots & \chi_{n-1} \\ \hline \psi_0 & \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \psi_1 & \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ \psi_{m-1} & \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{array} \qquad (1)

which is meant to indicate that χ_j must be multiplied by the elements in the jth column of A while the ith row of A contributes to ψ_i.
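The elementwise formula above translates directly into code; the following minimal serial sketch of y := Ax (our own illustration, with arbitrary dimensions) makes explicit that α_{i,j} multiplies χ_j and accumulates into ψ_i.

```c
#include <stdio.h>

/* Serial y := A x, following the elementwise formula above:
   psi_i = sum_j alpha_{i,j} * chi_j. */
enum { M = 3, N = 4 };

static void matvec(const double A[M][N], const double x[N], double y[M]) {
    for (int i = 0; i < M; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];   /* alpha_{i,j} * chi_j -> psi_i */
    }
}

int main(void) {
    double A[M][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
    double x[N] = {1,1,1,1}, y[M];
    matvec(A, x, y);
    for (int i = 0; i < M; i++)
        printf("psi_%d = %g\n", i, y[i]);   /* prints 10, 26, 42 */
    return 0;
}
```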
Figure 3: Distribution of A, x, and y within a 3 × 3 mesh of nodes. Notice that redistributing a column of A in the same manner as y requires simultaneous gathers within rows of nodes, while redistributing a row of A consistently with x requires simultaneous gathers within columns of nodes. In the notation of Section 3, the distributions of x and y here are given by x(V_R, ∗) and y(V_C, ∗), respectively, and that of A by A(M_C, M_R). While the presented mesh of nodes is square, none of the results depend on the mesh being square.

2.3 Two-Dimensional Elemental Cyclic Distribution
It is well established that scalable implementations of dense linear algebra operations require nodes to be logically viewed as a two-dimensional mesh. It is also well established that, to achieve load balance for a wide range of matrix operations, matrices should be cyclically "wrapped" onto this logical mesh. We start with these insights and examine the simplest of the matrix distributions that result. Denoting the number of nodes by p, an r × c mesh must be chosen such that p = rc.

Matrix distribution. The elements of A are assigned using an elemental cyclic (round-robin) distribution in which α_{i,j} is assigned to node (i mod r, j mod c). Thus, node (s,t) stores the submatrix

A(s:r:m-1, t:c:n-1) = \begin{pmatrix} \alpha_{s,t} & \alpha_{s,t+c} & \cdots \\ \alpha_{s+r,t} & \alpha_{s+r,t+c} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix},

where the left-hand side of the expression uses the MATLAB convention for expressing submatrices, except that indexing starts from zero instead of one. This is illustrated in Figure 3.
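A short program makes the assignment rule easy to verify; the sketch below (mesh and matrix sizes are arbitrary illustrative choices of ours) prints the owning node (i mod r, j mod c) of every element and the global indices stored by one node.

```c
#include <stdio.h>

/* Elemental cyclic distribution: alpha_{i,j} lives on node (i mod r, j mod c).
   Conversely, node (s,t) stores the submatrix A(s:r:m-1, t:c:n-1), i.e.
   global rows s, s+r, s+2r, ... and global columns t, t+c, t+2c, ... */
int main(void) {
    int r = 2, c = 3, m = 6, n = 6;   /* 2x3 mesh, 6x6 matrix (arbitrary) */
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++)
            printf("(%d,%d) ", i % r, j % c);  /* owner of alpha_{i,j} */
        printf("\n");
    }
    /* Local-to-global: list the global indices stored on node (s,t) = (1,2). */
    int s = 1, t = 2;
    printf("node (%d,%d) stores rows:", s, t);
    for (int i = s; i < m; i += r) printf(" %d", i);
    printf(" and cols:");
    for (int j = t; j < n; j += c) printf(" %d", j);
    printf("\n");
    return 0;
}
```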
Column-major vector distribution. A column-major vector distribution views the r × c mesh of nodes as a linear array of p nodes, numbered in column-major order. A vector is distributed with this distribution if it is assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector y. Its element ψ_i is assigned to node (i mod r, (i/r) mod c), where / denotes integer division. Or, equivalently, in MATLAB-like notation, node (s,t) stores the subvector y(u:p:m-1), where u(s,t) = s + tr equals the rank of node (s,t) when the nodes are viewed as a one-dimensional array, indexed in column-major order. The distribution of y is illustrated in Figure 3.

Row-major vector distribution. Similarly, a row-major vector distribution views the r × c mesh of nodes as a linear array of p nodes, numbered in row-major order. The vector is then assigned to this linear array of nodes in a round-robin fashion, one element at a time. In other words, consider vector x. Its element χ_j is assigned to node (j mod c, (j/c) mod r). Or, equivalently, node (s,t) stores the subvector x(v:p:n-1), where v(s,t) = sc + t equals the rank of node (s,t) when the nodes are viewed as a one-dimensional array, indexed in row-major order.

Figure 4: Vectors x and y respectively redistributed as row-projected and column-projected vectors. The column-projected vector y(M_C, ∗) here is to be used to compute local results that will become contributions to a column vector y(V_C, ∗), which will result from adding these local contributions within rows of nodes. By comparing and contrasting this figure with Figure 3, it becomes obvious that redistributing x(V_R) to x(M_R) requires an allgather within columns of nodes, while y(V_C) results from scattering y(M_C) within process rows.
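Both vector distributions reduce to simple index arithmetic; the sketch below (our own, again with an arbitrary 2 × 3 mesh) prints the owner of each ψ_i under the column-major distribution and of each χ_j under the row-major distribution, exactly as the two rules above state them.

```c
#include <stdio.h>

/* Owners under the two vector distributions on an r x c mesh (p = r*c):
   column-major: element psi_i of y goes to node (i mod r, (i/r) mod c);
   row-major:    element chi_j of x goes to node (j mod c, (j/c) mod r),
   where / is integer division.  r = 2, c = 3 is an arbitrary example. */
int main(void) {
    int r = 2, c = 3, len = 12;
    for (int i = 0; i < len; i++)
        printf("psi_%-2d -> node (%d,%d)\n", i, i % r, (i / r) % c);
    for (int j = 0; j < len; j++)
        printf("chi_%-2d -> node (%d,%d)\n", j, j % c, (j / c) % r);
    return 0;
}
```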