English
Log in
English

An Introduction to Message Passing Interface (MPI)

By Chloe Allison 10 September 2020

What is MPI?

MPI is a de facto standard outlining method of passing information across a distributed network. In 1994, the first specification for MPI-1.0 was published by the MPI Forum, a group of Academic researches and industrial representatives spanning 20+ years of experience in parallel computing. The standard has now reach version 3.1 (June 2015) and a scope of version 4.0 is already in the works.

MPI enables vendors such as Intel and IBM or even the open-source communities to implement libraries to work across a wide range platforms and hardware to solve the most challenging scientific/engineering problems.

When we hear people talking about MPI, in most cases, this is an implementation associated with an application using MPI. It is widely used by most legacy simulation packages and aimed at those lucky enough to have High Performance Computing (HPC) hardware within their company to run their MPI application.

So, if this technology has been around for such a long time and people are already using it, you may be wondering why you should be excited by this now? Cloud technology has reached a turning point as the services has become more accessible and with near infinite compute resource (think Amazon Web Services or Google Cloud Platform). If we can take advantage of these immense cloud infrastructures and implement MPI effectively, we can tackle the range of simulation pain points that no one has been able to do before.

We have developed our MPI solvers to work seamlessly with the Cloud allowing engineers to attempt new simulations that have never been possible before.

When to use MPI

MPI is specifically used for solving very large simulations very quickly.

It is difficult to define what a ‘very large’ simulation is in terms of numbers as it is dependent not only on model size but different types of physics being solved within the model. A model may have double the amount of degrees of freedom (DOF)* but not all DOF have the same computational impact.

*DOF is the number of unknown quantities that are solved in the model. A node typically has 6 DOF in 3D space (Translation X,Y,Z, Rotation RX,RY,RZ). For a typical 3D element in OnScale, each node has 3 of the translational DOFs. Each additional physics (electrical, thermal etc.) add an additional DOF to the assigned nodes.

MPI

For example, solving a piezoelectric model will be much more computationally intensive than a purely mechanical model of the same size.

Rough guidelines of when to use MPI (any combination of the following):

  • Large piezoelectric or non-linear (electrostatics) simulations
  • Model size is in the range of 100+ millions of Degrees of Freedom
  • Solve times are 12+ hours when using 16+ CPU Cores

What can MPI be used for:

MPI can be used to solve a range of physics used across various applications:

*Electrostatic solve must be contained within part

**Require flow data to be partitioned

What solver features are not compatible with MPI

Particular physics that are coupled through a node from one MPI part to another may not be available:

  • Electrostatics on the MPI boundaries (piez esta) is not supported.
  • Electrostatic solve window must be contained within an MPI part (not touching the MPI boundary)
  • This will be added to the MPI solver in future releases.
  • Data input
  • Imported flow must be partitioned prior to be loaded into the correct part otherwise, is not supported
  • This will be automatically done in future releases.
  • Commands not supported:
  • Extrapolation (extr)
  • Echo (echo)
  • Mode shape (shap)
  • symb #get { }
  • nodal search commands
  • Piezoelectric (piez)
  • wndo auto piez
  • Outputs not supported:
    • Animations can be difficult as the model is partitioned into many parts

How to use MPI

MPI Partitioning

Partitioning is the process of sub-dividing the mesh into smaller sections or parts so that a set number of CPU cores can be focussed on solving that section.

To partition a model we need to use the part command:

grid 200 10 200

part
    max 2
    sdiv 1 1 100 1 10 1 200
    sdiv 2 100 200 1 10 1 200
    end

geom
    …

The example above defines 2 MPI parts subdivided along the I-direction.

All elements in the model must be associated to a part and must not overlap into multiple parts.

The part command must be defined in between the grid and geom.

max subcommand

The max command defines the total number of parts for MPI which is equivalent to how many cores will be used to solve the model. The example above shows a total number of 2 parts used.

sdiv subcommand

The sdiv subcommand is used to subdivide the mesh into each part ready for MPI. The arguments are the typical nodal locations used in OnScale. The nodes are defined in order of IJK.

Partitioning can applied along all 3 axes if necessary. Typically, partitioning along 2 axes is sufficient as the majority models are much wider than they are thick.

Using nested do loops, we can automate the partition process and control using variables to set the total number of parts and parts along one axis:

c Setting Up Variables
symb nptot = 25
symb npi = 5
symb npj = nptot / $npi
symb xpi = $npi
symb ypj = $npj
symb xndgrd = $indgrd
symb yndgrd = $jndgrd
symb xistep = $indgrd / $xpi
symb xjstep = $jndgrd / $ypj

part
    max $nptot

    symb ifr = 1 /* Intialise Starting Index

/* Loop through parts and partition model
    do loopI I 1 $npi 1
       
        symb jfr = 1 /* Re-initialise J Starting Index

        do loopJ J 1 $npj 1

            /* Calculate part number
            symb ip = $J + ( $I – 1 ) * $npj

            /* Calculate i and j indices
            symb xito = $I * $xistep
            symb ito = $xito
            symb xjto = $J * $xjstep
            symb jto = $xjto

/* Check for last part
            if ( $I eq $npi ) then
                symb ito = $indgrd
            endif
            if ( $J eq $npj ) then
                symb jto = $jndgrd
            endif

            /* Define part
            sdiv $ip $ifr $ito $jfr $jto $k1 $kndgrd
            symb jfr = $jto

        end$ loopJ

symb ifr = $ito
end$ loopI

There are various ways to partition models, use a method you can understand clearly.

Important note: The value set in the max subcommand must be a single integer value and not by a mathematical operation between 2 or more variables.

GOOD max subcommand definition

symb npart = 100

part
    max $npart
    …

BAD max subcommand definition

symb a = 10
symb b = 10
symb npart = $a * $b

part
    max $npart

Memory requirements

Memory is set as the total memory for the simulation. This value is entered in the Cloud Scheduler at time of job submission. Each part will be assigned a portion of the the total memory:

RAM Per Part = Total Memory / Number of Parts

Memory required for MPI will typically use less memory than regular shared memory multi-processing models using the mp omp commands. It is difficult to say exactly how much memory is required so a good starting point is typically around 500 MB per part.

We recommend using minimal amount of RAM as possible.

Best practices

  • Check the model is functional first without any MPI partitioning code
  • When testing MPI:
    • Reduce meshing for quicker tests
    • Run the model for a small number of time steps to check the MPI code is functional
    • Start with 500 MB RAM per Part and increase as required (trial & error)
    • Remove mp omp command from input deck

Model estimation

Our Cloud estimation does not support estimating MPI simulations at the moment.

Best way to estimate an MPI simulation is to execute for a number of time steps and extrapolate for the full length of the simulation. This can be down with following execution code snippet placed after the prcs command:

prcs /* Process command

symb #get { step } timestep
nexec = $full_simulation_time / $step

symb ts = 100
symb #get { time1 } wtime
exec $ts
symb #get { time2 } wtime
symb t_solve = $time2 – $time1
symb t_est = ( $nexec / $ts ) * $t_solve

The variable values are logged in the flex print file (.flxprt) which can be downloaded from Storage. The value t_est will store an estimate of the full simulation solve time in seconds. This can be converted into hours and multiplied by the total number of MPI parts to give an indication of Core-Hours usage.

Considerations when running piezoelectric MPI simulations

With piezoelectric simulations, there are more setup considerations to take into account.

Full Electric Solve

When using MPI for models with full 3D electric solves (e.g. 1-3 Piezocomposites or SAW resonators), if the electric window is continuous through parts, then we need to use the distributed solver (dstr) to be able to solve this:

non-MPI Model

piez
    slvr pard
    …

MPI Model

piez
    slvr dstr /* Must be first command
    …

Important note the distributed solver must be the first subcommand used in the piez section of the code for full 3D electric solve models.

1D Electric Solve

For models using a 1D electric solve approximation (e.g. FBARS), the solver command is defined in the usual manner:

non-MPI & MPI models

piez
slvr drct * * j

Mutltiple Circuits for Arrays

For MPI simulations, the circuit solve by default is calculated on the master part.

For simulations containing many circuits (e.g PMUT and CMUT arrays), it is more efficient to solve them across each individual part.

piez
    cslvr lapack dstr
    slvr drct
    …

The cslvr subcommand is used with the dstr option. This must be the first command when using the distributed circuit solve option.

Important note, this can only be used when the electric window is fully contained within an MPI part (i.e. the piezoelectric window is smaller than the nodal boundaries of the MPI part).

Ways of executing MPI models

Running on the Cloud

The cloud scheduler can be used to execute the MPI models. Click on the Run on Cloud icon to open the Cloud Scheduler:

  1. Select Single precision
  2. Single precision can support simulations up to 2.5 billion elements, otherwise switch to Double precision
  3. Enter the total amount of RAM for the full simulation
  4. Select Run

The Cloud Scheduler auto-detects an MPI simulation when the part command is present in the input file. The total number of parts/cores is determined by the value set by the max subcommand.

 

Chloe Allison
Chloe Allison

Chloe Allison is an Application Engineer at OnScale. She received her MA in Electrical and Electronics Engineering from the University of Strathclyde. As part of our engineering team Chloe assists with developing applications, improving our existing software and providing technical support to our customers.