Introduction
MPI (Message Passing Interface) is an inter-process communication standard used to design parallel applications. Its uses span many areas of high-tech industry and research: you'll find MPI behind software that helps design cars and planes, high-tech consumer electronics, research for new drugs, complex mathematical calculations, and more. Maybe because of that background it's not very easy to find examples of MPI programs that deal with more mundane concepts. With that in mind, I decided to create a sample program that illustrates some basic MPI concepts through an easy-to-understand subject.
Everybody knows about the lottery. Many people play it every week, and it's an easy concept to understand. We're going to use it to build a sample program that illustrates the use of MPI to create a parallel execution program. The basic idea of the program is to generate 6 random numbers and validate whether those six numbers match a given user play. What does parallelization have to do with that? It's no news that it's very hard to win the lottery. The odds of winning a six-number lottery with a single play are 1 in 9,366,819. In other words, playing the lottery twice a week could take 89,759 years to match a draw. That's very hard! The proposal for the program is to see how many draws a computer program requires to find a winner. As we'll soon discover, the computer's odds are no better than ours.
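If you want to double-check those figures, here is a quick sanity check in C++. The assumptions behind it come from the numbers above: a 6-number draw from a pool of 46 (which is what 1 in 9,366,819 corresponds to) and two plays a week.

#include <iostream>

// C(n, k): the number of ways to choose k numbers out of n.
unsigned long long choose(unsigned long long n, unsigned long long k)
{
    unsigned long long result = 1;
    for (unsigned long long i = 1; i <= k; i++)
        result = result * (n - k + i) / i; // stays an exact integer at every step
    return result;
}

int main()
{
    unsigned long long odds = choose(46, 6);     // 9,366,819 possible draws
    double playsPerYear = 2.0 * 365.2425 / 7.0;  // two plays a week
    std::cout << "1 in " << odds << ", about "
              << odds / playsPerYear << " years" << std::endl;
    return 0;
}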
The Lottery MPI program
The Lottery MPI is a sample command-line program. The command line accepts six lucky numbers (a play); with the six numbers in hand, the program executes a loop generating a sequence of 6 random numbers (a draw) and subsequently checks whether the draw matches the play. We count the number of times the loop executes until the six numbers are matched. By running this program a number of times we can calculate the average number of draws required to match any given play.
When you run the program in serial mode (non-parallel), a single instance of the executable loads in memory and executes the logic until a match is found.
In parallel mode you execute a number of instances of the program that generate random numbers in parallel until one of the running instances produces a draw that matches the play.
A properly coded MPI program has a single logic that allows execution either serially or in parallel. A parallel version of the same program should be capable of generating many draws (six-number sequences) in parallel and validating whether any of them matches the play. The draws are generated in parallel until one of the instances finds a match.
What I want to demonstrate is that a parallel program is capable of finding the matching numbers much more rapidly than a serial program. The hypothesis: as more random draws are generated concurrently, the time to find a match will be less than in serial execution. In other words, the more random sequences being generated, the more quickly a match will be found. That just makes sense. So let's look at the structure of the program and then validate whether the hypothesis holds true.
Program structure
The sample program illustrates the use of some basic MPI features in light of an easy-to-understand concept.
The program has two main blocks; an if decides which part of the program will be executed.
if (myRank > 0)
{ // workers' code
    …
}
else
{ // master's code
    …
}
When we execute an MPI program, a rank is assigned to each running instance of the program.
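Here is a minimal sketch of that structure – a stripped-down version of the full Main.cpp listed at the end of this article:

#include <mpi.h>
#include <cstdio>

// Every instance runs this same binary; MPI assigns each one a rank,
// and the rank decides which branch of the program it executes.
int main(int argc, char* argv[])
{
    int myRank, numProcs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    if (myRank > 0)
        printf("Worker %d of %d\n", myRank, numProcs);
    else
        printf("Master (rank 0) coordinating %d workers\n", numProcs - 1);

    MPI_Finalize();
    return 0;
}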
Process rank 0: this is the communication hub for all other instances of the program. Its basic functions are:
· Receives the play numbers from the command line
· Distributes the numbers among the processes
· Waits for a process to communicate that it has found the winning sequence
· Notifies all other processes that a winner was found
· Sums the total number of draws of each process
· Saves the result to a sequential file
Process rank 1 to n: these are the worker processes. Their basic functions are:
· Receive the six play numbers
· Generate 6 random numbers (a draw)
· Check whether the random sequence matches the play
· Count the number of draws
· Communicate with process rank 0 when a match is found
· Receive the notification from rank 0 that a process found a match
· Coordinate with the master process to provide the total number of draws, then terminate
Remember: the more worker processes, the more draws are generated, and the quicker we find a matching sequence – a winner!
The demand for inter-process communication in the Lottery MPI application is not very high. Inter-process communication happens at five distinct moments:
1. The master process sends the play numbers to all processes in the pool – MPI_Send:

//sends the play numbers to all processes
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Send(Play, 6, MPI_INT, proc, 0 /* tag */, MPI_COMM_WORLD);
}
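The sample keeps the distribution explicit with point-to-point sends. As a side note – this is not what the sample does – the idiomatic MPI collective for this one-to-all pattern would be a single broadcast executed by every rank:

//hypothetical alternative: rank 0's Play array is copied to every
//other rank in one collective call, replacing the send loop above
MPI_Bcast(Play, 6, MPI_INT, 0 /* root */, MPI_COMM_WORLD);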
2. The master process waits asynchronously for a communication from the worker processes about a possible winner, using the non-blocking version of MPI_Recv – MPI_Irecv:

//async requests awaiting a winner process to communicate
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Irecv(msg, 400, MPI_CHAR, proc, tag, MPI_COMM_WORLD, &rreqs[proc-1]);
}
//wait to receive a notification from the process with a winner draw
MPI_Waitany(numProcs-1, rreqs, &index, stat);

MPI_Waitany blocks until any one of the outstanding receive requests completes, and returns its index.
3. The worker processes communicate back to the master process when a winner is found – MPI_Send:

sprintf_s(msg, 400, "Process %d at %s found a Winner after: ", myRank, host);
MPI_Send(msg, (int) strlen(msg)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
4. The master process notifies all processes in the pool that a winner was found and they can terminate now – MPI_Send/MPI_Recv:

sprintf_s(terminate, 12, "TERMINATED");
//notify all processes that a winner was found. They should terminate now
for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
{
    MPI_Send(terminate, 12, MPI_CHAR, proc, 0 /* tag */, MPI_COMM_WORLD);
}
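On the worker side, the terminate message is noticed by polling with MPI_Iprobe at the end of every draw, so a worker never blocks while it still has work to do (condensed from Main.cpp below):

//worker side: after each draw, check without blocking whether rank 0
//has sent the terminate message, and receive it once it has arrived
MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
if (flag)
    MPI_Recv(terminate, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);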
5. The master process then computes the total number of draws as the sum of each worker process's draw count – MPI_Reduce (a collective call that every rank, workers included, must execute):

MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
6. After collecting the total number of draws from all processes, the data is saved to a file:

SaveNumOfDraws(msg, recvbuf);
That's pretty much what the program does. The rest of the code is self-explanatory.
The beauty of MPI is that once you design your program, the same logic is capable of scaling from a single process to multi-core, many-core, and multi-node configurations. It's a single approach that addresses everything from serial to parallel across different computing configurations.
Running the program
The MPI inter-communication happens when a program is executed by MPIEXEC. MPIEXEC can be found in the Windows HPC Server 2008 SDK – see the resources below. You can install the SDK, then compile and execute your program using Windows Vista, XP, or Server.
The command line to execute the Lottery MPI program on a single computer:

MPIEXEC -n 3 LotteryMPI.exe 45 23 1 17 8 35
The n parameter tells MPIEXEC how many instances of the program to execute. In this case 3 instances will run on my 2-core notebook. One instance is the master process (rank 0); the other two instances are the worker processes that loop until a match is found. Since the master process only takes CPU time to start the communication and send the play to the workers, it doesn't consume much CPU. This way we can use the cores (two, in my notebook) to their fullest to concentrate on the search for a match.
The six numbers after LotteryMPI.exe are the play that we want the computer to match.
Running LotteryMPI without MPIEXEC is the same as running with MPIEXEC -n 1: only the process with rank 0 – the master code – will be executed.
Lottery program being executed using 2 cores – the master process (rank 0) waits without using any CPU.
In order to execute this program on a cluster we're going to need a script. Windows HPC Server offers a very productive GUI for job creation and monitoring – productive, but only for creating a handful of jobs. To create the number of jobs required for these tests (1,000 jobs) you need a more automated approach. Windows HPC Server 2008 offers different options for interfacing with the job scheduler: the command line, the PowerShell command line, VBScript, and PowerShell scripting. We used a PowerShell script to create the 1,000 jobs. The job's command line is the same as in the serial version; the only change is the number of cores. For the results published below we used a four-node cluster with 2 cores per node.

$strCommandLine = "mpiexec -n 9 c:\mpiapp.exe"
The results:

SERIAL          | Number of Draws
Luckiest Run:   | 2,565
Unluckiest Run: | 52,367,408
Average Runs:   | 8,153,606
Total:          | 8,153,606,043

PARALLEL        | Number of Draws
Luckiest Run:   | 18,941
Unluckiest Run: | 71,475,764
Average Runs:   | 9,927,989
Total:          | 9,908,132,764
The computer-generated random numbers for the 1,000 draws tested were not any better than the average player's – they were actually worse. Maybe with a higher number of runs the numbers would come closer to the calculated probability. Or maybe the randomness of the computer creates some distortion here.
Performance comparison

Configuration     | Lottery Draws Per Second | Relative to 1 core
1 Core (1 Node)   | 9,595                    | 100%
2 Cores (1 Node)  | 18,900                   | 197%
4 Cores (2 Nodes) | 32,000                   | 334%
6 Cores (3 Nodes) | 44,000                   | 459%
8 Cores (4 Nodes) | 59,339                   | 618%
The scalability of the solution is almost linear up to 8 cores (4 nodes). The performance improvement from 1 core to 8 cores was 618%.
Conclusion:
So you liked the concepts and want to try them for yourself. Don't worry if you don't have a cluster of parallel computers; you can still test most of the scenarios using a virtual cluster. The article below explains how to create a cluster out of virtual machines.
You're going to need a copy of Windows HPC Server 2008 to create the cluster. You can download an evaluation version of Windows HPC here:
Resources:
To compile and run this program you're going to need the Windows HPC Server 2008 SDK. The SDK is freely available and can be downloaded here:
HPC Pack 2008 SDK
For a good introduction to MPI programming, plus instructions on how to configure Visual Studio to compile, link, debug, and execute this program, check:
Classic HPC Development
http://alt.pluralsight.com/drjoe/sdk/ClassicHPCDevC++.pdf
For descriptions of all MPI functions, including examples, check:
/* Main.cpp */
#include <windows.h>
#include <mpi.h>
#include <time.h>
#include <list>
#include <algorithm>
#include <iostream>
#include <stdio.h>
using namespace std;
void SaveNumOfDraws(char draw[], unsigned long count);
//
// Globals:
//
int myRank;
int numProcs;
char host[256];
int main(int argc, char* argv[])
{
    char msg[400];
    char terminate[12];
    MPI_Status status, stat[10];
    int src = MPI_ANY_SOURCE;   // receive from any worker
    int tag = MPI_ANY_TAG;      // tag is being ignored
    int Play[6] = {0,0,0,0,0,0};
    int flag = 0;
    int index;
    int bTerminate = 0;
    int range_min, range_max;
    unsigned long recvbuf;
    unsigned long count = 0;
    list <int> Draw;
    list <int>::iterator Iter;
    list <int>::iterator result;
    range_min = 1;
    range_max = 46;

    MPI_Init(&argc, &argv);                    // initiate the MPI environment
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);  // number of processes involved in run
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    // my process id: 0 <= myRank < numProcs
    gethostname(host, sizeof(host)/sizeof(host[0])); // machine we are running on
    if (myRank > 0)
    {   // workers' code
        int dest = 0;   // process 0
        int tag = 0;    // any value will do
        src = 0;
        int n = 0;
        int num;
        bool winner = false;
        MPI_Barrier(MPI_COMM_WORLD);
        //receive player numbers
        MPI_Recv(Play, 6, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        // Seed the random-number generator with the current time so that
        // the numbers will be different every time we run; the Sleep
        // staggers the seeds so each worker draws a different sequence.
        Sleep(myRank * 1000);
        srand((unsigned) time(NULL));
        do
        {
            // build a draw of six distinct numbers
            // (32767 + 1 assumes RAND_MAX == 32767, which holds for MSVC)
            Draw.push_back((double)rand() / (32767 + 1) * (range_max - range_min) + range_min);
            do
            {
                num = ((double)rand() / (32767 + 1) * (range_max - range_min) + range_min);
                result = find(Draw.begin(), Draw.end(), num);
                if (result == Draw.end()) {
                    Draw.push_back(num);
                    n++;
                }
            } while (n < 5);
            // check whether all six play numbers appear in the draw
            result = find(Draw.begin(), Draw.end(), Play[0]);
            if (result != Draw.end()) {
                result = find(Draw.begin(), Draw.end(), Play[1]);
                if (result != Draw.end()) {
                    result = find(Draw.begin(), Draw.end(), Play[2]);
                    if (result != Draw.end()) {
                        result = find(Draw.begin(), Draw.end(), Play[3]);
                        if (result != Draw.end()) {
                            result = find(Draw.begin(), Draw.end(), Play[4]);
                            if (result != Draw.end()) {
                                result = find(Draw.begin(), Draw.end(), Play[5]);
                                if (result != Draw.end()) {
                                    // winner: tell the master who found it
                                    sprintf_s(msg, 400, "Process %d at %s found a Winner after: ", myRank, host);
                                    MPI_Send(msg, (int) strlen(msg)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                                }
                            }
                        }
                    }
                }
            }
            Draw.clear();
            count++;
            n = 0;
            // poll for the terminate message from the master without blocking
            MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
        } while (!flag);
        MPI_Recv(terminate, 12, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        //cout << terminate << endl;
        // contribute this worker's draw count to the global sum on rank 0
        MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    else
    {   // master's code
        int numargs;
        //read command-line arguments
        for (numargs = 1; numargs < argc; numargs++)
            Play[numargs-1] = atol(argv[numargs]);
        MPI_Request *rreqs = (MPI_Request *)malloc((numProcs-1) * sizeof(MPI_Request));
        MPI_Barrier(MPI_COMM_WORLD);
        sprintf_s(msg, 256, "Master process %d is running on '%s'.", myRank, host);
        cout << msg << endl;
        //sends the play numbers to all processes
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Send(Play, 6, MPI_INT, proc, 0 /* tag */, MPI_COMM_WORLD);
        }
        //async requests awaiting a winner process to communicate;
        //note: the requests share one msg buffer, which relies on only
        //the first winner message being consumed before termination
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Irecv(msg, 400, MPI_CHAR, proc, tag, MPI_COMM_WORLD, &rreqs[proc-1]);
        }
        //wait to receive a notification from the process with a winner draw
        MPI_Waitany(numProcs-1, rreqs, &index, stat);
        sprintf_s(terminate, 12, "TERMINATED");
        //notify all processes that a winner was found. They should terminate now
        for (int proc = 1; proc < numProcs; proc++) // for each of the workers:
        {
            MPI_Send(terminate, 12, MPI_CHAR, proc, 0 /* tag */, MPI_COMM_WORLD);
        }
        // rank 0 contributes count == 0; recvbuf receives the workers' total
        MPI_Reduce(&count, &recvbuf, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        SaveNumOfDraws(msg, recvbuf);
        free(rreqs); // release the request array allocated above
        bTerminate = 1;
    }
    MPI_Finalize();
    return 0;
}
void SaveNumOfDraws(char draw[], unsigned long count)
{
    char c = '\n';
    FILE *stream;
    errno_t err;
    // Open for append
    err = fopen_s(&stream, "numberofDraws.txt", "a+");
    if (err != 0)
    {
        printf("The file 'numberofDraws.txt' was not opened\n");
        return; // don't write through a NULL stream
    }
    fprintf_s(stream, "%s%u%c", draw, count, c);
    fclose(stream);
}