In distributed computing, data processing is often spread across multiple nodes in a network. In an MPI programming environment, the root node plays a crucial role: it distributes data to the non-root nodes, gathers their computed results, and consolidates them to produce the final output. But how do these components work together to achieve the final result?
One key MPI API that enables this process is MPI_Reduce(). This blog explores how global computations are efficiently performed across multiple nodes in a distributed computing environment using MPI_Reduce().
If you’re transitioning from single-core or even multi-core shared-memory programming to distributed computing, the shift can be challenging. Unlike multi-core environments, where threads share memory, distributed systems rely on explicit communication between independent processes using message passing. To bridge this gap, we’ll start with a simple example, then break down MPI_Reduce() step by step and explain how it orchestrates global computations in a distributed system.
Exploring MPI_Reduce()
The MPI_Reduce() prototype is as follows:
int MPI_Reduce(
    const void *sendbuf,   // Pointer to local data (input)
    void *recvbuf,         // Pointer to result buffer (valid only in root process)
    int count,             // Number of elements to be reduced
    MPI_Datatype datatype, // Data type of elements (e.g., MPI_INT, MPI_DOUBLE)
    MPI_Op op,             // Reduction operation (e.g., MPI_SUM, MPI_MIN, MPI_MAX)
    int root,              // Rank of process that receives the final result
    MPI_Comm comm          // MPI communicator (usually MPI_COMM_WORLD)
);
In the above prototype, sendbuf is the pointer to your local data. Think of it as the data stored on an individual node. The function MPI_Reduce() will operate on this data. Each process contributes its own value(s) through this buffer.
Then, we have recvbuf. It is the pointer to the variable where the final global result will be stored. This happens after MPI_Reduce() processes data from all nodes. However, keep in mind that this buffer is only relevant for the root node—non-root nodes don’t need to access it.
Next, count represents the number of data elements that each node is contributing to the reduction. This is crucial because MPI_Reduce() needs to know how much data to aggregate from each process.
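For instance, if each node contributed a small array rather than a single value, count would be the array length and the reduction would be applied element by element. Here is a minimal sketch of that idea (it assumes MPI has already been initialized; the variable names and values are purely illustrative):
// Each process contributes 3 doubles; the reduction is applied element-wise,
// so the root ends up with 3 sums, one per array position.
double local_vals[3] = {1.0, 2.0, 3.0};   // illustrative local data
double global_sums[3] = {0.0, 0.0, 0.0};  // meaningful only on the root

MPI_Reduce(local_vals, global_sums, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);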
The datatype parameter is fairly self-explanatory. It defines the type of data being reduced. Examples include MPI_INT for integers or MPI_DOUBLE for floating-point numbers.
Now, here’s where things get interesting—the op field. This is where we specify what operation we want MPI to perform on the collected data. Do we want the sum of all values? Use MPI_SUM. Looking for the minimum value? Go with MPI_MIN. Need the maximum? That’s MPI_MAX. Other operations like MPI_PROD (for product) are also available, depending on what you need.
The root parameter denotes the rank of the root node—the one that will receive and store the final computed result. Most often, this is rank 0, but it can be any process you choose.
Finally, we have comm, the communicator that defines the group of processes involved in the reduction. In most cases, we use MPI_COMM_WORLD, which means all processes in the MPI environment are participating.
In short, MPI_Reduce() is a smart aggregator. It collects data from multiple nodes, applies the desired operation, and delivers the final result to the root node. This makes global computation in distributed systems both efficient and scalable.
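Before we dive into a fuller example, here is a minimal sketch that shows how all of these arguments fit together. In this toy program, every process contributes its own rank and rank 0 ends up with the sum (this is just an illustration, not part of the example we develop below):
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int rank = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local_value = rank;   // each process contributes its own rank
    int total = 0;            // meaningful only on the root after the call

    // Sum the contributions from every process and deliver the result to rank 0.
    MPI_Reduce(&local_value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("sum of ranks = %d\n", total);
    }

    MPI_Finalize();
    return 0;
}
With 4 processes, rank 0 would print 6 (0 + 1 + 2 + 3), while the other ranks would print nothing.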
Now, let's work through a fuller example to understand the API in more detail.
How MPI_Reduce() Works in Action
Now, let’s say we need to compute the global minimum (MIN) across multiple nodes in an MPI programming environment. Each node contributes a single value to the computation. But how do we make this work? Let’s break it down step by step.
First, we need to define the data structures involved. If you’ve been following the MPI_Reduce() prototype, you know that sendbuf is where each node stores its local data. In our case, each node will generate a random integer, store it in sendbuf, and pass it to MPI_Reduce().
Next, we need a variable—let’s call it global_min—that will store the final result of the MIN operation. But here’s something important: only the root node (rank 0) will receive this final value. The other nodes won’t need to access global_min. MPI_Reduce() is designed to deliver the computed result only to the root process.
Since each node contributes just one integer value, the count parameter is set to 1. We’ll keep things simple by working with integer data, so we use MPI_INT as the datatype.
Now comes the most critical part—the operation we want to apply. Since we’re trying to find the smallest value across all nodes, we set the op parameter to MPI_MIN. This ensures that MPI_Reduce() will compare all values and return the minimum to the root node.
The root parameter defines which process will receive and store the final result. Here, we assign it to rank 0, meaning process 0 will hold the global minimum after reduction.
Finally, the comm parameter specifies the MPI communicator, which determines the group of processes involved in the reduction. In most cases, we use MPI_COMM_WORLD, ensuring that all available nodes participate in the computation.
So, in summary, MPI_Reduce() will gather values from all nodes, apply the MPI_MIN operation, and store the global minimum in the root process (rank 0). This simple yet powerful function allows distributed systems to aggregate results efficiently, whether you’re running on thousands of nodes or just a few.
[Figure: a visual representation of how MPI_Reduce() operates.]
Developing the Source Code
Now that we understand how this API works, let’s move on to writing the code that puts it into action.
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int size = 0;
    int rank = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand(time(NULL) ^ (rank * 12345));
    int num = rand();
    int global_num = 0;

    /*
     * int MPI_Reduce(const void *sendbuf,   // Pointer to local data (input)
     *                void *recvbuf,         // Pointer to result buffer (output, valid only in root process)
     *                int count,             // Number of elements to be reduced
     *                MPI_Datatype datatype, // Data type of elements (e.g., MPI_INT, MPI_DOUBLE)
     *                MPI_Op op,             // Reduction operation (e.g., MPI_SUM, MPI_MIN, MPI_MAX)
     *                int root,              // Rank of process that receives the final result
     *                MPI_Comm comm)         // MPI communicator (usually MPI_COMM_WORLD)
     */
    MPI_Reduce(&num, &global_num, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);

    printf("rank = %d: random number generated = %d\n", rank, num);
    if (rank == 0) {
        printf("global_num = %d\n", global_num);
    }

    MPI_Finalize();
    return 0;
}
The program works as follows:
Initializing MPI
The MPI environment is initialized using:
MPI_Init(NULL, NULL);
This sets up the MPI execution environment, allowing processes to communicate.
Then, the program retrieves:
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size() gets the total number of processes (nodes) involved in the computation. MPI_Comm_rank() determines the unique rank (ID) of the current process, starting from 0 up to size-1.
Each process will now execute its own copy of the program independently.
Generating a Random Number
Each process generates a random number using:
srand(time(NULL) ^ (rank * 12345));
int num = rand();
srand(time(NULL) ^ (rank * 12345)) ensures that different processes have different random seeds, reducing the chance of duplicate values. rand() generates a random number for each process, which will be used in the reduction operation.
Calling MPI_Reduce()
The program then calls:
MPI_Reduce(&num, &global_num, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
This function collects values from all processes and performs a global reduction operation (MPI_MIN) to find the minimum number.
&num: Each process sends its local random number.
&global_num: Only rank 0 will store the computed minimum value.
1: The number of elements per process (one integer).
MPI_INT: Data type of elements being reduced.
MPI_MIN: Reduction operation, selecting the smallest number among all processes.
0: The root process (rank 0) that will receive and store the final result.
MPI_COMM_WORLD: The group of all participating processes.
Cleanup
The program ends by calling:
MPI_Finalize();
This shuts down the MPI environment, ensuring all MPI-related resources are released.
Compiling and running the program
To compile the program, you can use the following command on the command line:
mpicc -g -Wall mpi_reduce.c -o mpi_reduce
Once the program is compiled, you can run it with the following command:
~/mpi_programming$ mpiexec -n 4 ./mpi_reduce
rank = 2: random number generated = 1250365241
rank = 3: random number generated = 319227418
rank = 0: random number generated = 161145521
global_num = 161145521
rank = 1: random number generated = 596437620
~/mpi_programming$ mpiexec -n 4 ./mpi_reduce
rank = 2: random number generated = 477326442
rank = 3: random number generated = 1231909895
rank = 0: random number generated = 308265036
global_num = 273317725
rank = 1: random number generated = 273317725
~/mpi_programming$ mpiexec -n 4 ./mpi_reduce
rank = 3: random number generated = 1821623307
rank = 1: random number generated = 1019906681
rank = 2: random number generated = 1108919976
rank = 0: random number generated = 19695501
global_num = 19695501
~/mpi_programming$ mpiexec -n 4 ./mpi_reduce
rank = 3: random number generated = 1048236445
rank = 0: random number generated = 787833090
global_num = 241236950
rank = 1: random number generated = 241236950
rank = 2: random number generated = 1872877795
This program has been tested with 4 nodes, but you can easily scale it to run on more. Each node prints its own computed values. Only the root node (rank 0) is responsible for displaying the global minimum. This ensures efficient data aggregation and keeps the output structured. Feel free to experiment with different node counts to see how the program adapts to larger-scale computations!
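For example, to try a larger run, only the process count on the command line changes; the source code stays the same (assuming your machine or cluster can host that many processes):
mpiexec -n 8 ./mpi_reduce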
Difference between MPI_Reduce() and MPI_Allreduce()
If you try to print global_num on any node other than the root, you’ll notice that it still prints 0, the value it was initialized with. This happens because MPI_Reduce() delivers the final computed result only to the root node (rank 0); the receive buffer is left untouched on all other ranks. Let’s put this to the test with a simple program and see the behavior in action!
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int size = 0;
    int rank = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand(time(NULL) ^ (rank * 12345));
    int num = rand();
    int global_num = 0;

    /*
     * int MPI_Reduce(const void *sendbuf,   // Pointer to local data (input)
     *                void *recvbuf,         // Pointer to result buffer (output, valid only in root process)
     *                int count,             // Number of elements to be reduced
     *                MPI_Datatype datatype, // Data type of elements (e.g., MPI_INT, MPI_DOUBLE)
     *                MPI_Op op,             // Reduction operation (e.g., MPI_SUM, MPI_MIN, MPI_MAX)
     *                int root,              // Rank of process that receives the final result
     *                MPI_Comm comm)         // MPI communicator (usually MPI_COMM_WORLD)
     */
    MPI_Reduce(&num, &global_num, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);

    printf("rank = %d: random number generated = %d\n", rank, num);
    printf("global_num = %d\n", global_num);

    MPI_Finalize();
    return 0;
}
Notice that we’ve removed the if (rank == 0) check before printing global_num. Now, when you compile and run the program, you’ll observe something interesting in the output. Every node prints global_num. However, only the root node (rank 0) has the actual computed result. All other nodes display 0. Here’s what that looks like in action:
mpiexec -n 4 ./mpi_reduce
rank = 3: random number generated = 1013327285
global_num = 0
rank = 1: random number generated = 1342460008
global_num = 0
rank = 2: random number generated = 42255666
global_num = 0
rank = 0: random number generated = 1967676264
global_num = 42255666
This highlights a key difference between MPI_Reduce() and MPI_Allreduce(). With MPI_Reduce(), only the root node receives the final computed result, while all other nodes remain unaware of it. If you want every node to have access to the global computation output, you’ll need to use MPI_Allreduce() instead. Let’s modify our program and see MPI_Allreduce() in action!
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int size = 0;
    int rank = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand(time(NULL) ^ (rank * 12345));
    int num = rand();
    int global_num = 0;

    /*
     * int MPI_Allreduce(const void *sendbuf,   // Pointer to local data (input)
     *                   void *recvbuf,         // Pointer to result buffer (valid on every process)
     *                   int count,             // Number of elements to be reduced
     *                   MPI_Datatype datatype, // Data type of elements (e.g., MPI_INT, MPI_DOUBLE)
     *                   MPI_Op op,             // Reduction operation (e.g., MPI_SUM, MPI_MIN, MPI_MAX)
     *                   MPI_Comm comm)         // MPI communicator (usually MPI_COMM_WORLD)
     */
    MPI_Allreduce(&num, &global_num, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

    printf("rank = %d: random number generated = %d\n", rank, num);
    printf("global_num = %d\n", global_num);

    MPI_Finalize();
    return 0;
}
Now, when you compile and run the program, you’ll notice an output similar to this:
$ mpiexec -n 4 ./mpi_all_reduce
rank = 3: random number generated = 487201772
global_num = 487201772
rank = 0: random number generated = 839115512
global_num = 487201772
rank = 1: random number generated = 815822249
global_num = 487201772
rank = 2: random number generated = 1331345596
global_num = 487201772
$ mpiexec -n 4 ./mpi_all_reduce
rank = 2: random number generated = 395844147
global_num = 337914633
rank = 3: random number generated = 337914633
global_num = 337914633
rank = 0: random number generated = 1323879165
global_num = 337914633
rank = 1: random number generated = 1005005007
global_num = 337914633
Now, as you can see, the final computed result is printed by all nodes, not just the root node. This is exactly what MPI_Allreduce() enables. It ensures that the global computation result is available to every node in the MPI environment.
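As a side note, you can think of MPI_Allreduce() as roughly an MPI_Reduce() followed by a broadcast of the result from the root, although MPI implementations are typically smarter about it internally. The sketch below illustrates that equivalence, reusing the num and global_num variables from the programs above:
// Roughly equivalent to:
// MPI_Allreduce(&num, &global_num, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
MPI_Reduce(&num, &global_num, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);   // result lands on rank 0
MPI_Bcast(&global_num, 1, MPI_INT, 0, MPI_COMM_WORLD);                   // rank 0 shares it with everyone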
Now that we’ve laid the groundwork for understanding MPI_Reduce(), it’s time to tackle a more complex problem. In the next post, we’ll dive deeper into solving a more advanced challenge using MPI. Stay tuned!
