
Yeah exactly... This seems closer to an HPC problem than a "cloud" problem.

Related comment from 6 months ago about Kubernetes use cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_with_kub...

Summary: scale has at least 2 different meanings. Scaling in resources doesn't really mean you need Kubernetes. Scaling in terms of workload diversity is a better use case for it.

Kubernetes is basically a knockoff of Borg, but Borg is designed (or evolved) to run diverse services (search, maps, gmail, etc.; batch and low latency). Ironically most people who run their own Kube clusters don't seem to have much workload diversity.

On the other hand, HPC is usually about scaling in terms of resources: running a few huge jobs across many nodes. A single job will occupy entire nodes (often thousands of them), which is what's happening here.

I've never used these HPC systems, but it looks like they are starting to run on the cloud. Kubernetes may still have been a defensible choice for other reasons, but as someone who used Borg for a long time, it's weird what it's turned into. Sort of like how protobufs now have a weird "reflection service". Huh?

https://aws.amazon.com/blogs/publicsector/tag/htcondor/

https://aws.amazon.com/marketplace/pp/Center-for-High-Throug...



Exactly, we migrated to k8s not because we needed better scaling (EC2 auto scaling groups were working reasonably well for us) but because we kept inventing our own ways to do rolling deploys or run scheduled jobs, and had a variety of ways to store secrets. On top of that, developers were increasingly running their own containers with docker-compose to test services talking to each other.

We migrated to k8s to A) have a standard way to run containerized builds and get the benefit of "it works on my laptop" matching how it works in production (at least functionally), and B) adopt a common set of patterns for managing deployed software.

Resource scheduling only became of interest after we migrated, when we realized that aggregating our workloads let us use things like spot instances without jeopardizing availability.


Condor and the like are for independent jobs ("throughput computing"), but the authors here are using MPI for tightly-coupled jobs. SLURM and Flux are actively developed schedulers for these kinds of jobs.

https://slurm.schedmd.com/

https://flux-framework.readthedocs.io/en/latest/
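
To make "tightly coupled" concrete: in an MPI job every rank hits a collective each step, so the whole job advances at the pace of its slowest node, which is why these jobs want gang scheduling and a fast interconnect rather than a pure throughput scheduler. A minimal illustrative sketch with mpi4py (not from the article, just an example):

    # Minimal sketch of a tightly-coupled MPI job (illustrative only).
    # Run with e.g.: mpirun -n 4 python tightly_coupled.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    local = float(rank)          # each rank computes its own piece of the work
    for step in range(10):
        # Collective: every rank must reach this point before anyone proceeds,
        # so one slow or preempted node stalls the entire job.
        total = comm.allreduce(local, op=MPI.SUM)
        local = total / size     # feed the global result back into local work

    if rank == 0:
        print("final value:", local)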


SLURM hits a nice sweet spot when you have a very traditional cluster: very homogeneous nodes (both hardware and software), standard logins (e.g. some kind of LDAP/AD), shared NFS filesystems, and trusted code. It's an absolute pain with:

- Lots of different kinds of nodes

- anything more complex dependency wise than a handful of shared Conda envs

- anything involving docker

- anything vaguely untrusted

- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability

- anything more complex than 3-5 priority levels of scheduling

It's great if you hit that niche, but it frankly struggles with the complexities of even moderately heterogeneous workloads.

It's also just a bit dated-feeling. Even though Kube is complex, it's a joy to work with compared to SLURM. Hashicorp's stack is even better imho.


hmm, I'd like to push back on a few of these

>- Lots of different kinds of nodes

well, that's not a problem with Slurm (which will happily start your process on all the nodes) but with typical MPI programming. And if you are running something computationally intensive over multiple nodes today, you are still using MPI.

>- anything more complex dependency wise than a handful of shared Conda envs

you can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node, it behaves 100% like running with a special login shell on OS XYZ, so I don't know what problems would happen with dependencies. The main gap is that it doesn't include any "service discovery" beyond OpenMPI.

>- anything involving docker

have not used it, but there's enroot/singularity, the first of which is apparently dogfooded at Nvidia. You'd probably need some adjustments to base images (because of MPI)... As I don't know the policy within these companies running 5k+ node clouds: can employees just execute any random image from Docker Hub there? That seems a little dangerous...

> anything vaguely untrusted

linked to the docker case? Does Kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require a shared account (because of the shared fs), Slurm should afaik merrily execute your programs without any shared account other than slurm's own. You can attack this obviously, but you can attack Kubernetes too, and while it gets more scrutiny, it's also a byzantine collection of FAANG-style requirements.

EDIT: What you can't work around is Slurm needing a comms channel back to the controller, though you could just firewall that off (jobs don't use Slurm to communicate...). Since each job can execute a prolog script, you can even quite simply allow traffic to flow only between the allocated nodes.
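
A rough sketch of such a prolog (hypothetical, in Python; assumes SLURM_JOB_NODELIST is exported to the prolog environment and that scontrol and iptables are on PATH; the matching epilog would delete the same rules):

    #!/usr/bin/env python3
    # Hypothetical Slurm prolog sketch: allow traffic only from the other nodes
    # allocated to this job. Assumes SLURM_JOB_NODELIST is available in the
    # prolog environment; an epilog would remove these rules again (iptables -D).
    import os
    import socket
    import subprocess

    job_id = os.environ["SLURM_JOB_ID"]
    nodelist = os.environ["SLURM_JOB_NODELIST"]

    # Expand the compressed nodelist (e.g. "node[01-04]") into hostnames.
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for host in hosts:
        ip = socket.gethostbyname(host)
        # Insert an ACCEPT rule for each peer node, tagged with the job id
        # so the epilog can find and remove exactly these rules.
        subprocess.run(
            ["iptables", "-I", "INPUT", "-s", ip, "-j", "ACCEPT",
             "-m", "comment", "--comment", f"slurm-job-{job_id}"],
            check=True,
        )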

>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability

that's indeed the case

>- anything more complex than 3-5 priority levels of scheduling

what kind of scheduling does Kubernetes implement? I guess you could write a Slurm plugin that does the same

> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.

except that your points didn't really pertain to this (except maybe the dependencies one, if you mean actual service dependencies), I fully agree


All very good points!

> you can put whatever dependencies you want on your NFS (or copy them to your node).

This is exactly what we do currently. For non-controlled data, this works. However, this gets really thorny when you involve CUI (controlled unclassified information), precisely because of the aforementioned shared fs.

Both SLURM and Kube let you write schedulers, but just getting SLURM to talk to the DB was a tough affair; some very poorly documented bugs were at play.

I haven't been on this project in a bit so I don't recall the exact details, and maybe it's a lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based on Hashicorp's stack.


yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD route, there's the auks plugin for running with NFSv4 (not sure how long ticket renewal works, though). And you can always just sbcast a zip of your tree and completely forego the NFS (if you store your data somewhere else). Normally you should also be able to write GRES plugins to "share" these resources.


The problem with Slurm is how it's typically used: ssh into a shared login node with a shared file system, authorization tightly coupled to the Linux users on that node, jobs submitted with sbatch. Kubernetes deployments feel much more modern and safe.

I have worked with containers + Slurm, where the vendor libmpi is injected by a container runtime hook [1], which gives you close to bare-metal performance with some container goodness in terms of isolation and deployment.

[1] https://github.com/eth-cscs/sarus


Slurm should be the answer but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge) and no one liked it. The situation devolved to sshing into machines and running jobs. You are right that slurm is a good fit for HPC ... I just don't think DL workloads are exactly that.

P.S. I also think the K8s scheduler isn't great.


One FAANGUAMLetc engineer told me they use SSH and Slurm, and track experiments by telling their manager which parameters were best the day before. This was very strange given that this company has a machine learning platform, so either this engineer did not use it, or they did not use it that much.

We were talking about our machine learning platform and taking it for a spin. We do have long-running notebook scheduling[0] but we wanted to be able to watch the notebook's output from multiple devices as it was running, and for it to survive closed tabs or network disconnections, not just get the results once it's done. We also wanted to be able to do that right from the notebook's interface, instead of SSH'ing and all that, as this was tedious and some of our users aren't that comfortable doing that.

- [0]: https://iko.ai/docs/notebook/#long-running-notebooks


It may be an HPC problem, but I'm not sure the available solutions come close to k8s in terms of functionality, and I'm not talking about scheduling.

I used to work in HPC/grid (it's been a while), but I do remember Condor being clunky even though it had its uses.

And the commercial grid offerings couldn't scale to almost 10k nodes back then (I'm not sure about now, or whether they even exist anymore).


Condor is clunky, but still in use in high energy physics, for example (LHC CMS detector data processing).

For greenfield deployments, I would recommend Hashicorp's Nomad before Kubernetes or Condor if your per server container intent is ~1 (bare metal with a light hypervisor for orchestration), but still steer you to Kubernetes for microservices and web-based cookie cutter apps (I know many finance shops using Nomad, but Cloudflare uses it with Consul, so no hard and fast rules).

Disclosure: Worked in HPC space managing a cluster for high energy physics. I also use (free version) Nomad for personal cluster workload scheduling.


I admit that Nomad is a fair middle ground due to its clean DSL and also because of the homogeneity of their workloads.

The team at OpenAI used the k8s API to build extensions around multi-tenancy (across teams) to saturate available allocations, along with task-specific scheduling modifications that were not supported by the stock k8s scheduler.

I don't know if Nomad has this kind of extensibility; its plugins were limited to device plugins and task drivers when I last looked at it.
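
To give a feel for that k8s extensibility: below is a hypothetical, stripped-down sketch (not OpenAI's code) of a custom scheduler written against the Kubernetes API with the official Python client. Pods opt in via spec.schedulerName, the scheduler watches for pending pods and binds each one to a node of its choosing; the multi-tenancy or gang-scheduling policy would all live in pick_node().

    # Hypothetical minimal custom scheduler against the Kubernetes API
    # (illustrative sketch only, not OpenAI's implementation).
    from kubernetes import client, config, watch

    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()

    def pick_node(pod):
        # Placeholder policy: real logic (team quotas, gang scheduling,
        # topology awareness) would go here.
        nodes = v1.list_node().items
        return nodes[0].metadata.name if nodes else None

    w = watch.Watch()
    # Only consider pending pods that asked for this scheduler by name.
    for event in w.stream(v1.list_pod_for_all_namespaces,
                          field_selector="spec.schedulerName=my-scheduler,status.phase=Pending"):
        pod = event["object"]
        node = pick_node(pod)
        if node is None:
            continue
        binding = client.V1Binding(
            metadata=client.V1ObjectMeta(name=pod.metadata.name),
            target=client.V1ObjectReference(api_version="v1", kind="Node", name=node),
        )
        # _preload_content=False sidesteps a known response-deserialization
        # quirk in the Python client for this endpoint.
        v1.create_namespaced_binding(pod.metadata.namespace, binding,
                                     _preload_content=False)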


Another pro for Kubernetes is that it has a lot of inertia at the moment, with a large contributing community and a large pool of engineers with experience using it. It's a guess, but I would assume the talent pool for HPC stuff isn't as big.

And yeah, I like the ability to easily support a diverse set of workloads on the same cluster. It's a simpler and easier-to-understand architecture compared to my previous experience with Hadoop.


Not sure that's a pro if your use case is just a platform for long-running, compute-intensive jobs. The platform's goals may diverge even further from yours in the future, for example if a cloud provider's use case becomes the reason for a big rewrite.

A small part of said inertia is perhaps the CADT model of software development in action up close, where functionality can be redeveloped multiple times because someone is not satisfied with the outcome.


> Ironically most people who run their own Kube clusters don't seem to have much workload diversity.

This has not been my experience at all, but most of my clients are big corporations/enterprises. It's not uncommon to have a cluster with hundreds or thousands of different services running, from front-end static file servers to CRUD apps to machine learning. Even the startups I've worked with had at least a handful of different services they ran on K8s.


Can you go into a bit more detail about what you mean by the protobuf "reflection service"?


A better term to search for is "gRPC reflection service". I can't find the link, but I thought I saw people saying that this was idiomatic to use in many cases for Google Cloud, rather than compiling the schemas into the binary.

That feels weird to me because compilation was always the traditional / intended use of protobufs, whereas dynamic reflection was for a few non-critical debugging tools. I guess my point is that Google has a very specific computing environment and set of problems, and when they get exported to the outside world, they start to get warped because of the different set of problems. That feels like what happened with Borg/Kubernetes as well. I seem to see a lot of regrets about Kubernetes lately, from primary authors and operators.
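
For anyone unfamiliar with it: turning reflection on is a small server-side addition, and then tools like grpcurl can list and call methods without having the .proto files. A rough sketch with the Python grpcio-reflection package (helloworld_pb2 and the Greeter servicer are hypothetical stand-ins for whatever service you actually compiled):

    # Sketch: a gRPC server with the reflection service enabled.
    # helloworld_pb2 / helloworld_pb2_grpc and Greeter are placeholders
    # for your own compiled protos and service implementation.
    from concurrent import futures
    import grpc
    from grpc_reflection.v1alpha import reflection

    import helloworld_pb2
    import helloworld_pb2_grpc

    class Greeter(helloworld_pb2_grpc.GreeterServicer):
        def SayHello(self, request, context):
            return helloworld_pb2.HelloReply(message=f"Hello, {request.name}")

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)

    # Advertise both the application service and the reflection service itself,
    # so clients can discover the schema at runtime instead of compiling it in.
    service_names = (
        helloworld_pb2.DESCRIPTOR.services_by_name["Greeter"].full_name,
        reflection.SERVICE_NAME,
    )
    reflection.enable_server_reflection(service_names, server)

    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()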



