
Performance Portability in SPARC - Sandia's Hypersonic CFD Code for Next-Generation Platforms

U.S. Department of Energy

23 Aug 2017 - DOE COE Performance Portability Meeting

Micah Howard, SNL, Aerosciences Department

& the SPARC Development Team

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525. SAND NO. 2017-5964 C / SAND2017-8900 C

Motivation: Hypersonic Reentry Simulation

[Figure: annotated reentry-vehicle flowfield with a Mach contour scale from 0 to 10]

• Unsteady, turbulent flow
• Flowfield radiation
• Maneuvering RVs: shock/shock & shock/boundary-layer interaction
• Laminar/transitional/turbulent boundary layer
• Gas-surface chemistry
• Surface ablation & in-depth decomposition
• Gas-phase thermochemical non-equilibrium
• Atmospheric variations
• Random vibrational loading

SPARC Compressible CFD Code

• State-of-the-art hypersonic CFD on next-gen platforms
• Production: hybrid structured-unstructured finite volume methods
• R&D: high-order unstructured discontinuous collocation element methods
• Perfect and thermo-chemical non-equilibrium gas models
• RANS and hybrid RANS-LES turbulence models
• Enabling technologies
  - Scalable solvers
  - Embedded geometry & meshing
  - Embedded UQ and model calibration
• Credibility
  - Validation against wind tunnel and flight test data
  - Visibility and peer review by external hypersonics community
• Software quality
  - Rigorous regression, V&V and performance testing
  - Software design review and code review culture

Performance Portability - Kokkos

[Diagram: Kokkos 2.0 provides performance portability for C++ applications and libraries (Trilinos, LAMMPS, Albany) across multi-core, many-core, CPU+GPU and APU architectures]

Performance Portability

The problem on heterogeneous architectures (e.g. ATS-2): C++ virtual functions (and function pointers) are not (easily) portable.

• Answers?

1. Kokkos support for portable virtual functions

2. C++ standard support for portable virtual functions

3. Run-time -> compile-time polymorphism

SPARC has taken the 'run-time -> compile-time polymorphism' approach. With this approach, we needed a mechanism to dispatch functions dynamically (run-time) or statically (compile-time). Dynamic dispatch is possible on GPUs, but it requires the object to be created for each thread or team on the GPU.


Performance Portability

Now we need a mechanism to convert run-time polymorphism to compile-time polymorphism so that we can dispatch functions statically. A cleaned-up reconstruction of the dispatcher shown on this slide:

    template <typename Type>
    struct Dispatcher {
      static void my_func(const MyClass* obj) {
        static_cast<const Type*>(obj)->Type::my_func();
      }
    };

Enter the rt2ct chain...

A "Create" chain is used to piece together compile-time instantiations of classes. The end of the chain (which is all compile-time) is handed to a Kokkos kernel. In this way, we can arbitrarily handle combinations of physics models (GasModels, FluxFunctions, BoundaryConditions) for (efficient) execution on GPUs.

Threaded Assembly/Solves

Threaded Assembly on Structured Grids: MeshTraverserKernel

• MeshTraverserKernel allows a physics code (think flux/flux-Jacobian computation and assembly) to operate on a structured (i, j, k) block
  - implements a multi-dimensional range policy for Kokkos::parallel_for
  - provides i, j, k line traversal (CPU/KNL) and 'tile' traversal (GPU)
  - class PhysicsKernel : public MeshTraverserKernel
• Array4D: node-level multi-dimensional data for a structured block
  - wraps a Kokkos::DualView
• Graph coloring (red-black) to avoid atomics during assembly
• Threaded solves provided through Tpetra/Belos (point-implicit, GMRES)
  - OpenMP used for SPARC's native point-implicit and line-implicit solvers
• Net result of FY16 work: SPARC is running, end-to-end (equation assembly + solve), on the GPU

Performance Portability

SPARC is running on all testbed, capacity & capability platforms available to SNL, notably:
• Knights Landing (KNL) testbed
• Power8+GPU testbed
• Sandy Bridge & Broadwell CPU-based 'commodity clusters'
• ATS-1 - Trinity (both Haswell and KNL partitions)
• ATS-2 - Power8+P100 'early access' system

SPARC vs Sierra/Aero Performance

For the Generic Reentry Vehicle use-case...

Investigation of CPU-only, MPI-only performance

[Table: per-step timings for SPARC vs Sierra/Aero (EA t/s = Equation Assembly time/step; ES t/s = Equation Solve time/step; T/S = Total Time/Step); speedup factors of roughly 1.4x to 2.8x in favor of SPARC]

- SPARC performing ~2x faster than Sierra/Aero
- Parallel efficiency is better than Sierra/Aero
- Even higher performance from SPARC for CPU-only systems will come with continued investment in NGP performance optimization
- Structured vs unstructured performance...

SPARC: Strong Scaling Analysis

For the heaviest kernel during equation assembly...

Compute Residual: Interior Faces

First...

lower = faster; this is a log2 scale (log2 time per equation assembly [s])

[Figure: strong scaling of the Compute Residual: Interior Faces kernel vs number of compute nodes or GPUs; series: Broadwell 32x1, Haswell 32x1, KNL 16x16, KNL 32x8, KNL 64x1, KNL 64x4, P100]

- Threaded KNL >1.5x faster than MPI-only KNL
- Threading on KNL is important
- P100 GPUs 1.5-2x faster than HSW/BDW
- Higher GPU performance still possible
- HSW/BDW 1.25-1.5x faster than threaded KNL
- Higher KNL assembly performance may come from SIMD vectorization
- Vectorization a FY18 deliverable

SPARC: Strong Scaling Analysis

For one critical MPI communication during equation assembly...

Halo Exchange

[Figure: strong scaling of the halo exchange vs number of compute nodes or GPUs; series: Broadwell 32x1, Haswell 32x1, KNL 16x16, KNL 32x8, KNL 64x1, KNL 64x4, P100]

- Something is amiss with GPU-GPU MPI on P8/P100 systems
- Apparently this will be fixed with P9/Volta?
- Halo exchange for CPU good, KNL okay
- Higher performance for low rank/high thread count KNL

SPARC: Strong Scaling Analysis

For the linear equation solve...

[Figure: strong scaling of the linear equation solver (log2 time [s]) vs number of compute nodes; series: Broadwell 32x1, Haswell 32x1, KNL 16x16, KNL 32x8, KNL 64x1, KNL 64x4]

- Solves on threaded KNL ~2x faster than HSW/BDW
- Higher performance on KNL still possible with recent compact BLAS work by the KokkosKernels team
- Higher performance at scale for low rank/high thread count KNL
- Superlinear behavior a DDR/HBM effect
- GPU-based solves not shown
- GPU-based solver performance analysis and optimization investment needed

SPARC: Weak Scaling Analysis

For the heaviest kernel during equation assembly...

Recall: lower = faster; this is a log2 scale.

[Figure: weak scaling of the Compute Residual: Interior Faces kernel vs number of compute nodes or GPUs; series: Broadwell 32x1, Haswell 32x1, KNL 16x16, KNL 32x8, KNL 64x1, KNL 64x4, P100]

- Similar trend as strong scaling: threaded KNL >1.5x faster
- Again, threading on KNL is important
- HSW/BDW 1.25-1.5x faster than threaded KNL
- Again, vectorization may help
- P100 GPUs 1.5-2x faster than HSW/BDW

SPARC: Weak Scaling Analysis

For one critical MPI communication during equation assembly...

Halo Exchange

[Figure: weak scaling of the halo exchange (log2 time)]
