We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.
Kubeflow is interesting, but it solves a slightly different problem of scheduling/coordinating ML workflows on top of Kube. It doesn't get involved with how an ML job communicates within itself cross-node.
OP was probably referring to the MPIOperator, TFOperator, PyTorchOperator, ... They live under the Kubeflow org but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling, cross-node communication, ...
One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
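For context, this maps cleanly onto PyTorch's standard `env://` rendezvous: the operator injects `MASTER_ADDR`/`MASTER_PORT`/`RANK`/`WORLD_SIZE` into each replica pod, with `MASTER_ADDR` being the DNS name of the rank-0 pod's headless service. A minimal sketch of how a worker might read those values (the service DNS name in the comment is a hypothetical example, and the fallbacks are just for local runs):

```python
import os

def rendezvous_config():
    """Collect the torch.distributed env:// rendezvous settings that the
    training operator injects into each replica pod."""
    return {
        # MASTER_ADDR resolves via the headless service for the rank-0 pod,
        # e.g. "myjob-master-0.myjob.default.svc.cluster.local" (hypothetical)
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }

# In a real worker these values feed the process-group init, e.g.:
# torch.distributed.init_process_group("nccl", init_method="env://")
```

Because rendezvous goes through ordinary cluster DNS, the job needs no MPI-style hostfile plumbing; the headless service per replica is what makes the pod addresses stable.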