TIL MPI_Comm_spawn

(This post assumes prior experience with MPI.)

I read a blog post some time ago where the author talked about how he wanted to write more blog posts. One of his strategies was to take some new thing he had learned, even if it was something as small as a simple language or library feature, and write about it as a ‘Today I Learned (TIL) xyz’ post. I decided to take a page from his book and write a little about a feature of MPI I learned about called MPI_Comm_spawn.

Learning MPI (Message Passing Interface) is a necessity if you’re working with supercomputers, since the vast majority of the code that runs on these systems uses the MPI library to distribute computations across the thousands of nodes that make up any particular supercomputer. If you’re interested, there’s a beginner tutorial here.

Before I explain MPI_Comm_spawn itself, I need to talk about spawning processes in general. You may already be familiar with the pattern of using fork and exec to spawn a new process from the current program. If you’re not, this Wikipedia page has a good summary. Essentially, you can make your program clone itself at a particular point, creating the “child” process, and then replace the clone with a completely different program, erasing any trace of the original. It’s also possible to establish communication with the child process through some inter-process communication method, even though the child is now running a new program. This is generally how a shell works, and also how Python’s subprocess module works. The example below forks the process, execs the ls program, and returns its output, with all of this handled internally by the subprocess module for our convenience.

import subprocess

output = subprocess.run(["ls", "-al"], capture_output=True)
print(output)

There’s a pretty good tutorial explaining fork and exec in a lot more detail, including how to communicate between the parent and child processes, with examples and exercises to help you get a better understanding. For our purposes, what I’m trying to establish here is that MPI_Comm_spawn functions on the same principles, but with all the MPI infrastructure available to it.

MPI_Comm_spawn lets your MPI ranks spawn a new, separate set of MPI ranks as children. These children are all started as peers of one another and can talk to each other through their own MPI_COMM_WORLD communicator, which is unconnected to the parent’s MPI_COMM_WORLD.

Here’s what an MPI_Comm_spawn call looks like:

MPI_Comm_spawn(
    argv[0],       // program to spawn (argv[0] spawns the currently running
                   // program; this can also be the file name of any other
                   // executable)
    MPI_ARGV_NULL, // the spawned program's arguments
    5,             // number of child processes to spawn
    MPI_INFO_NULL, // optional key-value hints that MPI_Comm_spawn can use
    parent_rank,   // the "root" rank: the call is collective, so every rank
                   // in comm must make it, but only the root's arguments are
                   // used for the spawn
    comm,          // the communicator containing parent_rank that is made
                   // visible to the spawned children. Usually called the
                   // intracommunicator.
    &child_comm,   // the communicator that the parent and its peers can use
                   // to send and receive messages from the children. This
                   // value is populated when MPI_Comm_spawn returns. Usually
                   // called the intercommunicator.
    spawn_error    // array of error codes, the same length as the number of
                   // children to spawn. Each entry is populated with either
                   // MPI_SUCCESS or an error code depending on whether the
                   // corresponding child spawned correctly
);

In each spawned child, you can call MPI_Comm_get_parent, which returns the intercommunicator we specified in the parent’s MPI_Comm_spawn call. This allows any of the children to send messages to, and receive messages from, the original parent rank and all the parent’s peers in that communicator by specifying their rank.

Below is a full example of MPI_Comm_spawn in action, with messages being sent and received between the children and the parent (and its peers).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm parent_comm;
  int rank;
  MPI_Comm_get_parent(&parent_comm);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (parent_comm == MPI_COMM_NULL) {
    // We have no parent communicator so we have been spawned directly by the
    // user
    if (rank == 0) {
      printf("Initial world size: %d\n", size);
    }
    MPI_Comm child_comm;
    int spawn_error[5];
    printf("We are processes spawned directly by you. I am rank %d. \n", rank);
    int parent_rank = 2;

    MPI_Comm_spawn(
        argv[0],       // program to spawn (argv[0] spawns the currently
                       // running program; this can also be the file name of
                       // any other executable)
        MPI_ARGV_NULL, // the spawned program's arguments
        5,             // number of child processes to spawn
        MPI_INFO_NULL, // optional key-value hints that MPI_Comm_spawn can use
        parent_rank,   // the "root" rank: the call is collective, so every
                       // rank in comm must make it, but only the root's
                       // arguments are used for the spawn
        comm,          // the communicator containing parent_rank that is made
                       // visible to the spawned children. Usually called the
                       // intracommunicator.
        &child_comm,   // the communicator that the parent and its peers can
                       // use to send and receive messages from the children.
                       // This value is populated when MPI_Comm_spawn returns.
                       // Usually called the intercommunicator.
        spawn_error    // array of error codes, the same length as the number
                       // of children to spawn. Each entry is populated with
                       // either MPI_SUCCESS or an error code depending on
                       // whether the corresponding child spawned correctly
    );

    MPI_Comm_rank(comm, &rank);
    if (rank == 3) {
      // Receive message from child rank 1 via the child communicator
      int hello_from_child = 0;
      int child_sender = 1;
      MPI_Recv(&hello_from_child, 1, MPI_INT, child_sender, 0, child_comm,
               MPI_STATUS_IGNORE);
      printf("number sent from child: %d\n", hello_from_child);

      // Send message to child rank 2 via the child communicator
      int hello_from_parent = 56;
      int child_receiver = 2;
      MPI_Send(&hello_from_parent, 1, MPI_INT, child_receiver, 0, child_comm);
    }
  } else {
    // We are children spawned from an MPI process
    if (rank == 0) {
      printf("child world size: %d\n", size);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("I am a child from MPI_Comm_spawn. I am rank %d.\n", rank);
    int child_sender = 1;
    int hello_from_child = 22;
    if (rank == child_sender) {
      // Send message to rank 3 in parent's communicator from child rank 1
      MPI_Send(&hello_from_child, 1, MPI_INT, 3, 0, parent_comm);
      printf("among children: Sent number from child to parent\n");
    }

    // children communicating with each other
    int child_to_child_number = 0;
    if (rank == 0) {
      // Send message to child rank 1 from child rank 0 via the spawned MPI
      // processes' MPI_COMM_WORLD
      printf("among children: sending from rank 0 to 1\n");
      child_to_child_number = 73;
      MPI_Send(&child_to_child_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    if (rank == 1) {
      // Receive message from child rank 0 via the spawned MPI processes'
      // MPI_COMM_WORLD
      MPI_Recv(&child_to_child_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      printf("among children: received from rank 0: %d\n",
             child_to_child_number);
    }
    if (rank == 2) {
      // Receive message from rank 3 in parent's communicator
      int hello_from_parent = 0;
      int parent_sender = 3;
      MPI_Recv(&hello_from_parent, 1, MPI_INT, parent_sender, 0, parent_comm,
               MPI_STATUS_IGNORE);
      printf("among children: number received from parent : %d\n",
             hello_from_parent);
    }
  }

  MPI_Finalize();

  return EXIT_SUCCESS;
}

The output would look something like the below. I’ve cleaned up the output for clarity; the actual lines would be interleaved, since the various ranks, including the spawned ones, all run in parallel:

$ mpirun -np 4 ./example
Initial world size: 4
We are processes spawned directly by you. I am rank 0. 
We are processes spawned directly by you. I am rank 2. 
We are processes spawned directly by you. I am rank 3. 
We are processes spawned directly by you. I am rank 1. 
child world size: 5
I am a child from MPI_Comm_spawn. I am rank 2.
I am a child from MPI_Comm_spawn. I am rank 3.
I am a child from MPI_Comm_spawn. I am rank 4.
I am a child from MPI_Comm_spawn. I am rank 1.
I am a child from MPI_Comm_spawn. I am rank 0.
among children: Sent number from child to parent
number sent from child: 22
among children: sending from rank 0 to 1
among children: received from rank 0: 73
among children: number received from parent : 56

MPI_Comm_spawn may be useful in cases where you have a different MPI program that you want to quickly spawn off to perform some task, without needing to integrate it into your current MPI code. I can’t really point to any real-world use cases, unfortunately, because at my workplace we use job launchers like IBM’s jsrun and SLURM’s srun on the supercomputers, which don’t support MPI_Comm_spawn (you’ll just get an incomprehensible runtime error if you try to launch the above program with srun or jsrun). My examples worked with OpenMPI and the mpirun launcher.

I went down this MPI_Comm_spawn rabbit hole because I was trying to figure out a weird bug in MPI4Py (a Python interface for MPI with support for various MPI library implementations, plus some additional Pythonic features), and my initial hypothesis was that it was using MPI_Comm_spawn somehow. I’m working on a blog post covering that particular saga in more detail (in fact, this post was broken off from a section I was writing in that one, because I figured it was long enough as it is and this would’ve been a long tangential section). I can tell you the issue was not MPI_Comm_spawn in that case; MPI4Py doesn’t use this feature at all. But it was pretty neat to learn about all this.

Here’s some additional documentation: OpenMPI documentation on MPI_Comm_spawn, RookieHPC page that’s a little more digestible.