Introduction
MPI (Message Passing Interface) is an inter-process communication standard used to design parallel applications. Its uses span many areas of high-tech industry and research: you'll find MPI behind software that helps design cars and planes, high-tech consumer electronics, research for new drugs, complex mathematical calculations, and more. Maybe because of that background it's not very easy to find examples of MPI programs that deal with more mundane concepts. With that in mind, I decided to create a sample program that illustrates some basic MPI concepts through an easy-to-understand subject.
Everybody knows about the lottery. Many people play it every week, and it's an easy concept to understand. We're going to use it to build a sample program that illustrates the use of MPI to create a parallel execution program. The basic idea of the program is to generate 6 random numbers and validate whether those six numbers match a given user play. What does parallelization have to do with that? It's no news that it's very hard to win the lottery. The odds of winning a six-number lottery with a single play are 1 in 9,366,819. In other words, playing the lottery twice a week could take 89,759 years to match a draw. That's very hard! The proposal for the program is to see how many draws a computer program requires to find a winner. As we'll soon discover, the computer's odds are no better than ours.
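If you want to double-check those figures, here is a quick sanity check in C++. The assumptions behind it come from the numbers above: a 6-number draw from a pool of 46 (which is what 1 in 9,366,819 corresponds to) and two plays a week.

#include <iostream>

// C(n, k): the number of ways to choose k numbers out of n.
unsigned long long choose(unsigned long long n, unsigned long long k)
{
    unsigned long long result = 1;
    for (unsigned long long i = 1; i <= k; i++)
        result = result * (n - k + i) / i; // stays an exact integer at every step
    return result;
}

int main()
{
    unsigned long long odds = choose(46, 6);     // 9,366,819 possible draws
    double playsPerYear = 2.0 * 365.2425 / 7.0;  // two plays a week
    std::cout << "1 in " << odds << ", about "
              << odds / playsPerYear << " years" << std::endl;
    return 0;
}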
The Lottery MPI program
The Lottery MPI is a sample command-line program. The command line accepts six lucky numbers (a play); with the six numbers in hand, the program executes a loop generating a sequence of 6 random numbers (a draw) and subsequently checks whether the draw matches the play. We count the number of times the loop executes until the six numbers are matched. By running this program a number of times we can calculate the average number of draws required to match any given play.
When you run the program in serial mode (non-parallel), a single instance of the executable loads in memory and executes the logic until a match is found.
In parallel mode you execute a number of instances of the program that generate random numbers in parallel until one of the running instances produces a draw that matches the play.
A properly coded MPI program has a single logic that allows execution either serially or in parallel. A parallel version of the same program should be capable of generating many draws (six-number sequences) in parallel and validating whether any of them matches the play. The draws are generated in parallel until one of the instances finds a match.
What I want to demonstrate is that a parallel program is capable of finding the matching numbers much more rapidly than a serial program. The hypothesis: as more random draws are generated concurrently, the time to find a match will be less than in serial execution. In other words, the more random sequences being generated, the more quickly a match will be found. That just makes sense. So let's look at the structure of the program and then validate whether the hypothesis holds true.
Program structure
The sample program illustrates the use of some basic MPI features in light of an easy-to-understand concept.
The program has two main blocks; an if decides which part of the program will be executed.
if (myRank > 0)
{ // workers' code
    …
}
else
{ // master's code
    …
}
When we execute an MPI program, a rank is assigned to each running instance of the program.
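Here is a minimal sketch of that structure – a stripped-down version of the full Main.cpp listed at the end of this article:

#include <mpi.h>
#include <cstdio>

// Every instance runs this same binary; MPI assigns each one a rank,
// and the rank decides which branch of the program it executes.
int main(int argc, char* argv[])
{
    int myRank, numProcs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    if (myRank > 0)
        printf("Worker %d of %d\n", myRank, numProcs);
    else
        printf("Master (rank 0) coordinating %d workers\n", numProcs - 1);

    MPI_Finalize();
    return 0;
}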
Process rank 0: this is the communication hub for all other instances of the program. Its basic functions are:
· Receives the play numbers from the command line
· Distributes the numbers among the processes
· Waits for a process to communicate that it has found the winning sequence
· Notifies all other processes that a winner was found
· Sums the total number of draws of each process
· Saves the result to a sequential file
Process rank 1 to n: these are the worker processes. Their basic functions are:
· Receive the six play numbers
· Generate 6 random numbers (a draw)
· Check whether the random sequence matches the play
· Count the number of draws
· Communicate with process rank 0 when a match is found
· Receive the notification from rank 0 that a process found a match
· Coordinate with the master process to provide the total number of draws, then terminate
Remember: the more worker processes, the more draws are generated, and the quicker we find a matching sequence – a winner!
The demand for inter-process communication in the Lottery MPI application is not very high. Inter-process communication happens at five distinct moments:
1. The master process sends the play numbers to all processes in the pool – MPI_Send:

//sends the play numbers to all processes
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Send(Play, 6, MPI_INT, proc, 0 /* tag */, MPI_COMM_WORLD);
}
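The sample keeps the distribution explicit with point-to-point sends. As a side note – this is not what the sample does – the idiomatic MPI collective for this one-to-all pattern would be a single broadcast executed by every rank:

//hypothetical alternative: rank 0's Play array is copied to every
//other rank in one collective call, replacing the send loop above
MPI_Bcast(Play, 6, MPI_INT, 0 /* root */, MPI_COMM_WORLD);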
2. The master process waits asynchronously for a communication from the worker processes about a possible winner, using the non-blocking version of MPI_Recv – MPI_Irecv:

//async requests awaiting a winner process to communicate
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Irecv(msg, 400, MPI_CHAR, proc, tag, MPI_COMM_WORLD, &rreqs[proc-1]);
}
//wait to receive a notification from the process with a winner draw
MPI_Waitany(numProcs-1, rreqs, &index, stat);

MPI_Waitany blocks until any one of the outstanding receive requests completes, and returns its index.
3. The worker processes communicate back to the master process when a winner is found – MPI_Send:

sprintf_s(msg, 400, "Process %d at %s found a Winner after: ", myRank, host);
MPI_Send(msg, (int) strlen(msg)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
4. The master process notifies all processes in the pool that a winner was found and they can terminate now – MPI_Send/MPI_Recv:

sprintf_s(terminate, 12, "TERMINATED");
//notify all processes that a winner was found. They should terminate now
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Send(terminate, 12, MPI_CHAR, proc, 0 /* tag */, MPI_COMM_WORLD);
}
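On the worker side, the terminate message is noticed by polling with MPI_Iprobe at the end of every draw, so a worker never blocks while it still has work to do (condensed from Main.cpp below):

//worker side: after each draw, check without blocking whether rank 0
//has sent the terminate message, and receive it once it has arrived
MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
if (flag)
    MPI_Recv(terminate, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);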
5. The master process then computes the total number of draws as the sum of each worker process's draw count – MPI_Reduce (a collective call that every rank, workers included, must execute):

MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
6. After collecting the total number of draws from all processes, the data is saved to a file:

SaveNumOfDraws(msg, recvbuf);
That's pretty much what the program does. The rest of the code is self-explanatory.
The beauty of MPI is that once you design your program, the same logic is capable of scaling from a single process to multi-core, many-core, and multi-node configurations. It's a single approach that addresses everything from serial to parallel across different computing configurations.
Running the program
The MPI inter-communication happens when a program is executed by MPIEXEC. MPIEXEC can be found in the Windows HPC Server 2008 SDK – see the resources below. You can install the SDK, then compile and execute your program using Windows Vista, XP, or Server.
The command line to execute the Lottery MPI program on a single computer:

MPIEXEC -n 3 LotteryMPI.exe 45 23 1 17 8 35
The n parameter tells MPIEXEC how many instances of the program to execute. In this case 3 instances will run on my 2-core notebook. One instance is the master process (rank 0); the other two instances are the worker processes that loop until a match is found. Since the master process only takes CPU time to start the communication and send the play to the workers, it doesn't consume much CPU. This way we can use the cores (two, in my notebook) to their fullest to concentrate on the search for a match.
The six numbers after LotteryMPI.exe are the play that we want the computer to match.
Running LotteryMPI without MPIEXEC is the same as running with MPIEXEC -n 1: only the process with rank 0 – the master code – will be executed.
Lottery program being executed using 2 cores – the master process (rank 0) waits without using any CPU.
In order to execute this program on a cluster we're going to need a script. Windows HPC Server offers a very productive GUI for job creation and monitoring – productive, but only for creating a handful of jobs. To create the number of jobs required for these tests (1,000 jobs) you need a more automated approach. Windows HPC Server 2008 offers different options for interfacing with the job scheduler: the command line, the PowerShell command line, VBScript, and PowerShell scripting. We used a PowerShell script to create the 1,000 jobs. The job's command line is the same as in the serial version; the only change is the number of cores. For the results published below we used a four-node cluster with 2 cores per node.

$strCommandLine = "mpiexec -n 9 c:\mpiapp.exe"
The results:

SERIAL          | Number of Draws
Luckiest Run:   | 2,565
Unluckiest Run: | 52,367,408
Average Runs:   | 8,153,606
Total:          | 8,153,606,043

PARALLEL        | Number of Draws
Luckiest Run:   | 18,941
Unluckiest Run: | 71,475,764
Average Runs:   | 9,927,989
Total:          | 9,908,132,764
The computer-generated random numbers for the 1,000 draws tested were not any better than the average player's – they were actually worse. Maybe with a higher number of runs the numbers would come closer to the calculated probability. Or maybe the randomness of the computer creates some distortion here.
Performance comparison

Configuration     | Lottery Draws Per Second | Relative to 1 core
1 Core (1 Node)   | 9,595                    | 100%
2 Cores (1 Node)  | 18,900                   | 197%
4 Cores (2 Nodes) | 32,000                   | 334%
6 Cores (3 Nodes) | 44,000                   | 459%
8 Cores (4 Nodes) | 59,339                   | 618%
The scalability of the solution is almost linear up to 8 cores (4 nodes). The performance improvement from 1 core to 8 cores was 618%.
Conclusion:
So you liked the concepts and want to try them for yourself. Don't worry if you don't have a cluster of parallel computers; you can still test most of the scenarios using a virtual cluster. The article below explains how to create a cluster out of virtual machines.
You're going to need a copy of Windows HPC Server 2008 to create the cluster. You can download an evaluation version of Windows HPC here:
Resources:
To compile and run this program you're going to need the Windows HPC Server 2008 SDK. The SDK is freely available and can be downloaded here:
HPC Pack 2008 SDK
For a good introduction to MPI programming, plus instructions on how to configure Visual Studio to compile, link, debug, and execute this program, check:
Classic HPC Development
http://alt.pluralsight.com/drjoe/sdk/ClassicHPCDevC++.pdf
For descriptions of all MPI functions, including examples, check:
/* Main.cpp */
#include <windows.h>
#include <mpi.h>
#include <time.h>
#include <list>
#include <algorithm>
#include <iostream>
#include <stdio.h>
using namespace std;
void SaveNumOfDraws(char draw[], unsigned long count);
//
// Globals:
//
int myRank;
int numProcs;
char host[256];
int main(int argc, char* argv[])
{
    char msg[400];
    char terminate[12];
    MPI_Status status, stat[10];
    int src = MPI_ANY_SOURCE;   // receive from any worker
    int tag = MPI_ANY_TAG;      // tag is being ignored
    int Play[6] = {0,0,0,0,0,0};
    int flag = 0;
    int index;
    int bTerminate = 0;
    int range_min, range_max;
    unsigned long recvbuf;
    unsigned long count = 0;
    list <int> Draw;
    list <int>::iterator Iter;
    list <int>::iterator result;
    range_min = 1;
    range_max = 46;

    MPI_Init(&argc, &argv);                    // initiate the MPI environment
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);  // number of processes involved in run
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    // my process id: 0 <= myRank < numProcs
    gethostname(host, sizeof(host)/sizeof(host[0])); // machine we are running on
    if (myRank > 0)
    {   // workers' code
        int dest = 0;   // process 0
        int tag = 0;    // any value will do
        src = 0;
        int n = 0;
        int num;
        bool winner = false;
        MPI_Barrier(MPI_COMM_WORLD);
        //receive player numbers
        MPI_Recv(Play, 6, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        // Seed the random-number generator with the current time so that
        // the numbers will be different every time we run; the Sleep
        // staggers the seeds so each worker draws a different sequence.
        Sleep(myRank * 1000);
        srand((unsigned) time(NULL));
        do
        {
            // build a draw of six distinct numbers
            // (32767 + 1 assumes RAND_MAX == 32767, which holds for MSVC)
            Draw.push_back((double)rand() / (32767 + 1) * (range_max - range_min) + range_min);
            do
            {
                num = ((double)rand() / (32767 + 1) * (range_max - range_min) + range_min);
                result = find(Draw.begin(), Draw.end(), num);
                if (result == Draw.end()) {
                    Draw.push_back(num);
                    n++;
                }
            } while (n < 5);
            // check whether all six play numbers appear in the draw
            result = find(Draw.begin(), Draw.end(), Play[0]);
            if (result != Draw.end()) {
                result = find(Draw.begin(), Draw.end(), Play[1]);
                if (result != Draw.end()) {
                    result = find(Draw.begin(), Draw.end(), Play[2]);
                    if (result != Draw.end()) {
                        result = find(Draw.begin(), Draw.end(), Play[3]);
                        if (result != Draw.end()) {
                            result = find(Draw.begin(), Draw.end(), Play[4]);
                            if (result != Draw.end()) {
                                result = find(Draw.begin(), Draw.end(), Play[5]);
                                if (result != Draw.end()) {
                                    // winner: tell the master who found it
                                    sprintf_s(msg, 400, "Process %d at %s found a Winner after: ", myRank, host);
                                    MPI_Send(msg, (int) strlen(msg)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                                }
                            }
                        }
                    }
                }
            }
            Draw.clear();
            count++;
            n = 0;
            // poll for the terminate message from the master without blocking
            MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
        } while (!flag);
        MPI_Recv(terminate, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        //cout << terminate << endl;
        // contribute this worker's draw count to the global sum on rank 0
        MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    else
    {   // master's code
        int numargs;
        //read command-line arguments
        for (numargs = 1; numargs < argc; numargs++)
            Play[numargs-1] = atol(argv[numargs]);
        MPI_Request *rreqs = (MPI_Request *)malloc((numProcs-1) * sizeof(MPI_Request));
        MPI_Barrier(MPI_COMM_WORLD);
        sprintf_s(msg, 256, "Master process %d is running on '%s'.", myRank, host);
        cout << msg << endl;
        //sends the play numbers to all processes
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Send(Play, 6, MPI_INT, proc, 0 /* tag */, MPI_COMM_WORLD);
        }
        //async requests awaiting a winner process to communicate;
        //note: the requests share one msg buffer, which relies on only
        //the first winner message being consumed before termination
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Irecv(msg, 400, MPI_CHAR, proc, tag, MPI_COMM_WORLD, &rreqs[proc-1]);
        }
        //wait to receive a notification from the process with a winner draw
        MPI_Waitany(numProcs-1, rreqs, &index, stat);
        sprintf_s(terminate, 12, "TERMINATED");
        //notify all processes that a winner was found. They should terminate now
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Send(terminate, 12, MPI_CHAR, proc, 0 /* tag */, MPI_COMM_WORLD);
        }
        // rank 0 contributes count == 0; recvbuf receives the workers' total
        MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        SaveNumOfDraws(msg, recvbuf);
        free(rreqs); // release the request array allocated above
        bTerminate = 1;
    }
    MPI_Finalize();
    return 0;
}
void SaveNumOfDraws(char draw[], unsigned long count)
{
    char c = '\n';
    FILE *stream;
    errno_t err;
    // Open for append
    err = fopen_s(&stream, "numberofDraws.txt", "a+");
    if (err != 0)
    {
        printf("The file 'numberofDraws.txt' was not opened\n");
        return; // don't write through a NULL stream
    }
    fprintf_s(stream, "%s%u%c", draw, count, c);
    fclose(stream);
}