I think it's a tall order for another SCM to challenge git. I can't imagine how it could be any more entrenched in the industry.
Further, I'm happy with git. I played with Mercurial years ago, long enough to work with it day-to-day, and just didn't find any relevant advantages versus git.
I love that people are still out there trying to improve things, and certainly don't want that to stop, but it's difficult for me to imagine switching at this point.
I'm also happy with git, but there are three main things that could be improved over git, IMO:
1) Better handling of large files than git-lfs. As in 10+ GB repos. This is needed for game development (currently studios tend to use Perforce or PlasticSCM); the git-lfs baseline is sketched after this list.
2) Sparse checkout via file system integration (like Eden has)
3) Build system integration, so unchanged files and modules don't even need to be fetched to be compiled, because cached builds can be fetched from a build server instead (this requires proper modularization, so that e.g. C++ macro expansion doesn't render everything uncacheable)
These are all features that primarily have value for repos that push the limits on size, like big monorepos (with a huge amount of files) or game development (with big asset files). But get it right, and you could massively cut down the time it takes to check out a branch and build it.
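For reference on point 1, the git-lfs baseline being compared against looks roughly like this (the tracked patterns are just an example):

    git lfs install
    git lfs track "*.uasset" "*.psd"     # patterns are recorded in .gitattributes
    git add .gitattributes
    git commit -m "track large assets with LFS"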
This is by no means a perfect match for your requirements, but I'll share a CLI tool I built, called Dud[0]. At the least it may spur some ideas.
Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off of Git LFS after a couple failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].
With Dud I focused on speed and simplicity. To your three points above:
1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.
2) Dud checks out binaries as links by default, so it's super fast to switch between commits.
3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when outputs are up to date and skip executing a pipeline stage.
I hope this helps, and I'd be happy to chat about it.
I'd be curious to see if you've tried git-annex, I use it instead of git-lfs when I need to manage big binary blobs. It does the same trick with a "check out" being a mere symlink.
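From memory, the basic git-annex flow looks roughly like this (check the docs for exact behaviour):

    git annex init
    git annex add big-dataset.tar      # the file becomes a symlink into .git/annex/objects/...
    git commit -m "add dataset"
    git annex get big-dataset.tar      # fetch the content from a remote that has it
    git annex drop big-dataset.tar     # free local space; content remains on other remotes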
I haven't used it, no. Around the time Git LFS was released, my read from the community was that Git LFS was favored to supersede git-annex, so I focused my time investigating Git LFS. Given that git-annex is still alive and well, I may have discounted it too quickly :) Maybe I'll revisit it in the future. Thanks for sharing!
Neither is favored; git-annex solves problems that git LFS doesn't even try to address (distributed big files), at the cost of extra complexity.
Git LFS is intended more for a centralized "big repo" workflow, git annex's canonical usage is as a personal distributed backup system, but both can stretch into other domains.
In this case git-annex seems to have a feature that git LFS doesn't have that would be useful to you.
I work in games, and we use PlasticSCM with Unreal Engine.
> Better handling of large files than git-lfs.
PlasticSCM does really really well with binary files.
> Sparse checkout via file system integration
Windows only [0], but Plastic does this. I've been working through some issues, but it's usable as a daily driver with a sane build system.
> Build system integration
UnrealBuildTool is simultaneously the coolest and the most frustrating build system I've ever used. It's _not_ a general purpose build system for anyone and everyone: it's tailored to Unreal, running the build system itself is slow, and the implementation is shaky at times, but some of its features are incredible. Two standout features are Unity Builds and Adaptive Unity Builds. Unity builds are common now [1], but adaptive unity is a game changer. It uses the source control integration to check which files are modified and removes them from the unity blobs, meaning you're only ever rebuilding and relinking what's changed, and you get the best of both worlds.
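To make the adaptive part concrete, the idea is roughly the following. This is only a hedged sketch, not UnrealBuildTool's actual implementation, and it uses git for the "what changed" query purely as an illustration:

    # Pool unmodified sources into one big "unity" translation unit;
    # locally modified sources are dropped from the pool and compiled on their own.
    modified=" $(git status --porcelain -- '*.cpp' | awk '{print $2}' | tr '\n' ' ') "
    : > Unity_Module.cpp
    for f in Module/Private/*.cpp; do
      case "$modified" in
        *" $f "*) : ;;                                           # modified: compile separately
        *) printf '#include "%s"\n' "$f" >> Unity_Module.cpp ;;  # unmodified: into the unity blob
      esac
    done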
ClearCase (claimed to) support points two and three way back in the 90s. It never really worked right; it was always trying to "wink-in" someone else's binary, despite the fact that it was built using a different version of the source.
> These are all features that primarily have value for repos that push the limits on size, like big monorepos (with a huge amount of files) or game development (with big asset files). But get it right, and you could massively cut down the time it takes to check out a branch and build it.
Would this be a good fit for large machine learning data sets?
Every time a new model comes out that touts having been trained on something like a quarter of a billion images, I ask myself "how the heck does someone manage and version a pile like that?"
Would you ever need to version individual images? At a high level, you could version the collection as a whole by filtering by timestamp, moving deleted files to a "deleted" directory, or maintaining filenames and timestamps in a database. I'm sure there are lots of corner cases that would come up when you actually tried to build such a system, but I don't think the overall scheme needs to be as conceptually complex as source code version control.
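As a very rough sketch of that idea (heavily simplified; at a quarter-billion files you'd shard this or keep it in a database rather than one text file):

    # Version a small manifest in git instead of the images themselves.
    # (Optionally filter by timestamp, e.g. with GNU find's -newermt.)
    find images/ -type f ! -path "images/deleted/*" | sort > manifest.txt
    git add manifest.txt
    git commit -m "dataset snapshot"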
Not handling large binary files is a feature from my perspective. Git is for source code and treating it like a disk or a catch all for your project is how people get into scaling trouble. I don't see any reason why versioned artifacts can't be attached to a build, or better, in a warm cache. I get that it's easier to keep everything together, but we can look at the roots of git and it becomes fairly obvious, Linus wasn't checking in 48GB 4K mpeg files for his next FPS.
> I don't see any reason why versioned artifacts can't be attached to a build, or better, in a warm cache
Because now you've got two versioning systems with different quirks and tools that behave slightly differently, and they _must_ be kept in sync.
> I get that it's easier to keep everything together,
Easier is a bit of an understatement. Putting binary files outside of the project in another versioning system means you need two version control systems, _plus_ custom tooling to glue the two together. Also, the people who interact with these assets the most are not technical at all (they're artists or designers), and they're the most likely to get caught when something falls between the cracks.
If you want an SCM with direct build system integration there's always GNU make with its support for RCS :)
More seriously, can you describe what "build system integration" would look like? Basically like what GNU make does with RCS? I.e. considering the SCM "files" as sources.
How would such a system build a dumb tarball snapshot?
At my work we have such a system for our monorepo. For each compilation unit (roughly, each subdirectory), it takes the compilation inputs, flags, etc. and checks them against a remote cache. If they match, it just pulls the results (binaries, but also compiler warnings) from the remote instead of compiling anything. A fresh clean build is made every few commits to the main branch to populate the cache.
In practice it means that if I clone the repo and build, the compiler is never invoked, so the build takes only a few minutes instead of an hour.
Bazel has a similar system, but I haven't used it myself.
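Very roughly, per compilation unit our system has the shape below. This is a hedged sketch with made-up names and URL; the real thing also handles linking, warning replay, and cache population policy:

    # Key = hash of the translation unit, the headers it pulls in, and the flags.
    # On a cache hit, download the object file instead of compiling.
    deps=$(g++ -MM src/foo.cpp | cut -d: -f2 | tr -d '\\')
    key=$( { cat $deps; echo "$CXXFLAGS"; } | sha256sum | cut -d' ' -f1 )
    if curl -sf "https://build-cache.example.com/$key.o" -o foo.o; then
        echo "cache hit for foo.o"        # stored compiler warnings would be replayed here too
    else
        g++ $CXXFLAGS -c src/foo.cpp -o foo.o
        curl -sfT foo.o "https://build-cache.example.com/$key.o"   # populate the cache for everyone else
    fi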
That seems really useful, but how is this different from what ccache and various other compilation caches that cache things as a function of their input do?
The GP talks about "build system integration" being "a game changer", and I can see that being a useful thing for e.g. "priming" your cache, but that's surely just a matter of say:
    ccache-like-tool --prime $(git rev-parse HEAD:)
I.e. give it an arbitrary "root" key and it would pre-download the cached assets for that given key. In this case the key is git-specific, but it could also be any other opaque key, e.g. a hash of the build inputs.
> how is this different from what ccache and various other compilation caches that cache things as a function of their input do?
Vertical integration, basically. There is value in these things working, and working well. As an example of the issues with ccache et al., I've yet to find a compilation cache that works well on Windows for large projects.
I've never worked on a project with a large remote ccache, but I would guess it would be pretty much the same, yes.
The "automation" of our in-house system is what really makes the difference but then again we have a team of developers that focus on tooling so it's not so much automated as it is maintained...
Coming from Darcs, Git has a horrible user interface. Something like Jujutsu has a chance to disrupt Git.
It can use the Git data store, so developers can in theory start using it without the whole team adopting it. Then it addresses the big problem with Git: the user interface.
I'm not suggesting that "jj" in particular will disrupt Git, but I think a tool which supports the git data store with a better user interface could eventually take hold.
Git usage for most developers is 3-4 commands, plus, once in a blue moon when they fuck up badly, saving a copy of their changes and resetting hard. There aren't enough user interface improvements possible to get people to switch.
If you go to work for Google or Facebook, it quickly becomes apparent that switching SCMs is much cheaper than trying to use ordinary Git or Mercurial at scale. (Though it is clear that Google, Facebook and Microsoft are all trying to maximize the amount of familiarity and not reinvent the wheel too much; they all have been working on tools either building on, utilizing, or based on already existing SCM tools.)
This looks like it’s supposed to be more appropriate for very, very big repos. Which current Git doesn’t support and isn’t fundamentally designed to support.
So rather than use Git + Git Annex or something like that (maybe more), you’ll just use this alternative SCM.
(I keep hearing about how Git will eventually support big repos, but it’s still not on the horizon. Big repos with Git still seems to be limited to big corps who have the resources to make something bespoke. Personally I don’t use big repos so I have no skin in this game.)
Big repos work in Git today if you are able to play the sparse checkout dance (roughly sketched after the list below). There are definitely more improvements to be made:
- Protocol for more efficiently widening sparse checkouts.
- Virtual filesystem or other solution for automatically widening sparse checkouts to greatly improve the UX.
- Ideally changing sparse checkouts from a repo state to just some form of lazy loading. Otherwise as you touch more files your performance will slowly degrade.
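For reference, the dance today is roughly partial clone plus cone-mode sparse checkout; exact flags vary a bit by git version, and the URL and paths here are made up:

    git clone --filter=blob:none --sparse https://example.com/big-monorepo.git
    cd big-monorepo
    git sparse-checkout set services/payments libs/common   # widen later as you touch more code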
Yeah, I think the directions are quite different. Both are improving the user experience (commands/behaviour with pijul, speed with Eden), but pijul is distributed, with some effort to improve algorithms and a bigger focus on improving semantics (making it more natural in one sense and more correct in another), whereas Eden is more centralising, with a focus on massive size. (The thing large companies want for their repo is branching, not decentralisation, but DVCSes give the former mostly via the latter: branches are a first-class thing in git, yet much of their technical implementation follows from things git must have to be distributed.)
One thing I recall was an effort from pijul to make blaming a file run in O(n log h), where n is the size of the file and h is the size of the (repo or file, I'm not sure) history. I wonder if Eden will also have improved blame performance. I noticed they mentioned large histories, but maybe it is still linear in the history size of the file. (The way hg works gives, I think, O(nh), where h is the size of the file history, which it stores per-file rather than per-commit like git.)
The biggest differentiator for me between Git and Mercurial is that Mercurial is far better for code reviews, because it makes managing stacks of "as small as possible" changes much easier. The git workarounds I've tried to replicate 'hg histedit' and 'hg absorb' are ... not good.
Similarly, I think Git(hub) has succeeded in open source because bundling more complete changes into PRs works well for unevenly distributed contributions.
I used Meta's Mercurial, having previously used primarily git (and SVN, and CVS before that). It has a number of very cool improvements over git, and it's well integrated into the rest of their infrastructure.
You know the feeling of having to use SVN after using Git? This is what it feels like to use Git after getting used to Meta's Mercurial. I wish I could go into the details, but I don't know how much of it was ported back to Mercurial.
I don't think it's trying to compete with git, it's not decentralized or meant to support big distributed open source project development. This looks like a nice tool for Big Company to manage its internal, private code repositories.
The decentralized part of Git and Mercurial is nice (e.g. no need for GitHub et al.), but I think most software projects using Git or Mercurial do have a centralized server/hub...
I've been keeping an eye on Sturdy [1], a more performant, high level, and opinionated version control system. As a bonus, it seems to be compatible with Git.
OK, slightly off-topic but maybe the right minds are here. We have been developing an introductory CS curriculum committed to thinking-with-powerful-tools, including the command line, real programming languages, and git. It's great until it isn't. We intentionally maintain a simplified workflow, but still get the occasional merge conflict or local state blocking a pull. I keep thinking there must be a simplified wrapper over git which maintains the conceptual power while avoiding the sharp edges, even if at the cost of robustness. I'd be more interested in an abstraction than a GUI, but would be interested to hear whatever others have come up with.
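To make "simplified wrapper" concrete, something even as small as this hypothetical script is roughly what I have in mind (command names made up):

    #!/bin/sh
    # save: snapshot all local work; sync: reconcile with the class remote; undo: discard local edits.
    case "$1" in
      save) git add -A && git commit -m "${2:-checkpoint}" ;;
      sync) git pull --rebase --autostash && git push ;;
      undo) git restore . ;;
      *)    echo "usage: $0 {save|sync|undo} [message]" ;;
    esac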
The git user interface just sucks. hg supposedly has a better UI, and darcs is apparently better again (except that merges could sometimes run in exponential time). Pijul is meant to give you a darcs-like UI with good performance. But none of those things are git, which is maybe the important thing to teach.
One possibility could be to use some kind of git UI. I only know about magit (which is built on/in Emacs), but I'm sure others exist.
The biggest problem I have with Git is the strong commit ordering. This leads to a lack of tracking through cherry-picks, which causes very real friction for a fairly common workflow.
It solves scale issues that git can't solve at the moment.
FB monorepos are huge.
So for most people/companies this issue is not critical to solve, and git is great.
Actually Git isn't quite so entrenched as you think. Perforce is still the norm in the games industry, for example, partly because it's more artist-friendly.