Looks like a pretty good first attempt at a distributed filesystem. Initial impression is HDFS with a distributed NameNode/Nameserver. The first diagram also shows a Metaserver layer that's not mentioned at all in the more recent of the two design docs but "separate Metaerver from Nameserver" appears (unchecked) in the roadmap. All operations using access methods other than their own SDK seem to get funneled through the NameServer cluster, which will severely limit throughput. Not clear how they do replication, though weakly implied that it's driven from the client (like Gluster) or NameServer rather than the first ChunkServer (like Ceph, HDFS, everything else). No mention of how they handle consistency or repair. Likewise no information about performance or security. Not clear if it's anywhere near POSIX compliant (probably not).
FUSE support is in the diagrams, but not checked off on the roadmap. Slow-node detection and avoidance seemed like one of the most interesting features from the design, but is not checked off either. Other things not even on the roadmap, using Gluster not as a fair comparison but as a handy list of possibilities: multiple replication levels, tiering, erasure coding, NFS/SMB, caching, quota, snapshots.
As I said, looks like a good first attempt. Better than most I've seen, with lots of potential, but as of today it seems rather bare-bones. Many hard problems remain to be solved, and I wish them well.
> Looks like a pretty good first attempt at a distributed filesystem.
You are damn right. It is.
3 years ago, the most widely used DFS in Baidu was Peta, which is similar to HDFS V2. We have migrated to AFS now.
Looks nice. I know Raft better than most of the other pieces, so that's where I started; I didn't see code for dynamic membership changes nor log truncation. I can understand getting by with a fixed membership, but log truncation seems like a requirement for a production system. Would be interested to hear whether this is planned or whether there is a clever way around it!
Well, there's another project in the same organization, named iNexus, that has implemented log truncation. It uses leveldb as its underlying storage, and the leveldb is slightly modified to clean out outdated data when compacting. Maybe BFS will do something similar.
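To make the term concrete, here is a minimal sketch of what truncating a Raft log prefix means. It is an in-memory illustration only, not the leveldb-compaction approach described above and not code from BFS or iNexus; `RaftLog`, `LogEntry` and `TruncatePrefix` are hypothetical names. The assumption is that every entry up to the truncation index has already been applied and captured in a snapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical in-memory Raft log (illustrative names, not from BFS or iNexus).
struct LogEntry {
    uint64_t term;
    std::string command;
};

class RaftLog {
public:
    // Appends an entry and returns its 1-based log index.
    uint64_t Append(uint64_t term, std::string command) {
        entries_.push_back(LogEntry{term, std::move(command)});
        return LastIndex();
    }

    uint64_t FirstIndex() const { return snapshot_index_ + 1; }
    uint64_t LastIndex() const { return snapshot_index_ + entries_.size(); }
    uint64_t SnapshotIndex() const { return snapshot_index_; }
    uint64_t SnapshotTerm() const { return snapshot_term_; }

    // Returns the entry at a log index that is still retained.
    const LogEntry& At(uint64_t index) const {
        if (index < FirstIndex() || index > LastIndex()) {
            throw std::out_of_range("index not in retained log");
        }
        return entries_[static_cast<std::size_t>(index - FirstIndex())];
    }

    // Drops every entry up to and including `index`. The caller is assumed to
    // have applied those entries and saved a snapshot covering them first.
    void TruncatePrefix(uint64_t index) {
        if (index <= snapshot_index_) return;  // already truncated
        assert(index <= LastIndex());
        // Remember the term of the last dropped entry; Raft's log consistency
        // check still needs it after the entry itself is gone.
        snapshot_term_ = At(index).term;
        entries_.erase(entries_.begin(),
                       entries_.begin() +
                           static_cast<std::ptrdiff_t>(index - snapshot_index_));
        snapshot_index_ = index;
    }

private:
    std::deque<LogEntry> entries_;
    uint64_t snapshot_index_ = 0;  // index of the last entry covered by the snapshot
    uint64_t snapshot_term_ = 0;   // term of that entry
};
```

After three appends, calling `TruncatePrefix(2)` leaves only index 3 in memory while the indices of later entries stay unchanged, which is the property a follower catching up from a snapshot relies on.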
For the source code, please refer to https://github.com/baidu/ins
And I'm sorry for the lack of English documents in this repo. We are working on it.
Thank you - excited to see this! I do think there is a lack of a C++ library for Raft that stands alone (and you have two projects just within Baidu that could share code). I'd be excited to help with a standalone project! And I'm sorry for my lack of non-English, but it seems that the variable names are still in English so I can follow the code :-)
(It is a pity that Chrome doesn't automatically translate github pages that contain different languages - not sure why that isn't happening.)
Looking through the code it supports fuse, but the documentation in ENG is sparse. It also looks to underpin Tera: the Baidu distributed DB.
I think a low read/write latency dfs suitable for real time applications would be a game changer. I'm hoping they up the documentation from here and engage the English speaking community.
PS: if your DFS works within a docker container you'll have a very strong differentiator since the rest don't. You'd also possibly solve the "how to do storage in a container cloud without resorting to NAS or separate clusters" problem.
Oh, sorry, didn't realize we were playing the "move the goalposts" game. If you were to google for "gluster" and "containers" you'd get everything from slick marketing stuff to a presentation at the recent Gluster developer summit in Berlin. I have no idea if any of those would meet your next set of standards but, frankly, meh.
Container hosting with a homogeneous cluster constraint was a real requirement for me that I could not find a solution for amongst existing options, but since you're a Gluster dev you'd probably know better whether it's possible, so I'm happy to stand corrected. Thanks for the correction, and no offence intended.
It is certainly possible. The first user I know of who did this was using Mesos. Nowadays the push is more around doing it with Kubernetes and OpenShift; I know there was at least one presentation on it at Red Hat Summit. I'm a core-infrastructure guy, so that's kind of not my bailiwick, but if there's nothing in Gluster's own documentation about such things there might be something in one of those other communities.
I spent a good bit of time yesterday just throwing all the docs into google translate. Tera looks really interesting, but currently there is no way I can use it unless there is documentation in my native language :/
The code is not very self-documenting either. There are >50 line long functions with mixed levels of abstraction, and error handling code is completely mixed with logic. I find such code quite hard to read and lack of comments doesn't help.
Based on the design[1], it has a leader / follower pattern (although you should have multiple leaders with Raft consensus to avoid having a single point of failure), where the leader is called "nameserver" and decides where to put each piece of data and metadata among a set of chunk servers and metadata servers.
That design is very reminiscent of CephFS's cluster monitors, metadata servers, object storage devices.
Sorry, ambiguous choice of term. The set of nameservers "lead" the rest of the server cluster. Within the set of nameservers, they make decisions by electing a leader among them.
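For readers less familiar with Raft, the election among nameservers comes down to a majority rule. The following is a minimal sketch of that rule only, with hypothetical helper names (`QuorumSize`, `WinsElection`, `ToleratedFailures`); it is not taken from the BFS code and leaves out terms, timeouts and log comparison.

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of votes that forms a strict majority of the cluster.
inline std::size_t QuorumSize(std::size_t cluster_size) {
    assert(cluster_size > 0);
    return cluster_size / 2 + 1;
}

// A candidate becomes leader once a majority of the cluster has voted for it.
inline bool WinsElection(std::size_t votes_received, std::size_t cluster_size) {
    assert(votes_received <= cluster_size);
    return votes_received >= QuorumSize(cluster_size);
}

// Number of servers that can be down while a majority is still reachable.
inline std::size_t ToleratedFailures(std::size_t cluster_size) {
    return cluster_size - QuorumSize(cluster_size);
}
```

With three nameservers two votes are enough and one failure is tolerated; with four, two votes are not a majority, which is why odd cluster sizes are the usual choice.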
Good fucking god. This is insane. And anyone who opposed the proposal, even while pointing out the fallacy of the core idea, got downvoted to hell too. This gives me a lot of context for what I saw in the last season of South Park.
Leader/Follower is actually something entirely different, but the linked Chinese document talks about a master/client approach.
Sadly, in the past years, due to some political movements, the term "master/slave" has been declared problematic, and GitHub actively warns that projects using such language can and will be excluded from the service.
There have been previous discussions about this on HN.
Wow, that's very interesting. I thought GitHub delegates moderation to the repo owners.
There was actually a huge debate about this on Reddit caused by Swift merging a rename change PR into master. The Swift team was so excited about the change for some reason that they didn't even run tests before the merge...
As I mentioned in other subthreads (sadly I can’t edit the original comment anymore, so I have to duplicate content), there is this very famous example of several repos getting banned, and another getting threatened to be banned, for using the word "retard": https://news.ycombinator.com/item?id=9966118
- first of all, impl in C++ (JVM/GC is a pain in the ass)
- clear arch (only master and dataserver)
- very concise config file and easy to deploy
- most important, 10k nodes scalability without federation design of namespace
Lack of good documentation, no tests and possibly undefined behaviour in a few places. The code also doesn't look any cleaner than HDFS and uses some weird mix of C (*printf, error codes) and C++ (vectors, smart pointers, RAII etc).
For the distributed FS people out there I've got a complicated question. In my job I need to poll and collect data from many remote sensor devices, log all the output, and process that. Not only do I do this, but MANY of my colleagues do this too, and each has a different way to manage the process. Can a distributed file system help with this case?
Is there any file system that would be able to sync what amounts to text/binary data across many hosts and allow me to aggregate the data off the network for more secure storage?
I was thinking about using IPFS for this but this also seems better. I'd hopefully like to have a private network for this use case so that other people can't post up a device on this file system and introduce fake data.
If you look at Baidu's infrastructure, it's almost like a parallel universe where the names are identical or almost identical to Google's: BFE, GTC, GSLB. And BFS does look a lot like GFS2 aka Colossus.
Seems that most of Baidu's C++ open source projects have this pattern as well, albeit with minor variations on the Google C++ style such as 4 spaces for indentation.
More like a C++ clone of HDFS than most people are likely hoping. While you seem to be able to mount it with FUSE I imagine it's primarily meant to be programmed against directly.
Using Raft over a dependency on an external consensus system is nice. Definitely makes the namenode architecture much better.
It looks like a faster version of HDFS since it's written in C++ (vs Java).
Another important aspect is that it is using SSD + SATA (I suppose), which could be a better option than standard SATA/SSD or LV cache using SATA + SSD.
Even if it's just a new thing, if it proves to be faster it may be implemented in Hadoop ecosystem in the future. HDFS has a lot of features being a mature piece of software but it lacks on the response time.
During non-GC periods, probably true. But having a realtime filesystem service that is prone to stop-the-world GC pauses is a showstopper for many applications.
Also, a C++ implementation is likelier to use far less memory than a Java implementation, assuming the skills of both programmers are roughly equal.
The underlying local filesystem on each node is not truly realtime, so a "realtime distributed file system" is already quite a stretch. Also JVM is perfectly fine with pause times below a few tens of ms worst-case (when using properly tuned G1, CMS GC), which is lower than worst-case latency induced by network + I/O.
As for using less memory - you don't allocate buffers for file data on the JVM heap. You allocate them in native memory exactly as you'd do it in C++. Therefore it is possible to create a JVM-based file system that handles petabytes of data with just as little as 100 MB heap, used mostly for small temporary objects.
Also, the code here is using mutexes a lot to synchronize threads and lock out whole objects. Therefore I think these "realtime" claims are quite exaggerated.
You're using the academic version of realtime, not the one that anybody cares about. HDFS's biggest problem is, and has always been, that it's literally impossible to tune it to give anything like reliable performance, mostly because the nameserver is a single point of lag for the entire system. "Worst case network and IO" latency is a huge stretch. Network performance is predictably sub-ms if you're using a network designed for modern distributed computing. (A real stretch, I know, since almost all HDFS installations are on old-school core-router-tree infrastructure.) The IO operations are incredibly unpredictable, but only for one client at a time. Having individual servers that have 10-20ms worst-case performance hiccoughs is nowhere near as bad for a system as all of your clients hiccoughing for even 5ms at the same time.
HDFS's biggest problem is its SPOF master-slave architecture, not the JVM or GC. With a truly distributed shared-nothing system Java GC would not be a problem, because servers can now run with no major GC for hours or days. So two servers or clients doing GC at the same time is very unlikely. And even if some of them do, the pauses from GC are much more predictable than the pauses from I/O, which on a loaded system can take seconds, not milliseconds.
Also if GC was such a huge problem, exchanges or HFT companies wouldn't use Java for their low latency stuff, and there definitely are companies which do.
> Also if GC was such a huge problem, exchanges or HFT companies wouldn't use Java for their low latency stuff, and there definitely are companies which do.
Sure, and the memory use of this C++ DFS is probably huge compared to many hand-crafted assembly or C programs from the 1980s. But who cares? 100 MB or even 1 GB is really tiny for today's server hardware. And the Java runtime itself is really just a few MB. What takes most memory in many Java programs (e.g. IDEs) is code and libraries.
Size can lead to a tremendous difference in performance on modern CPUs, particularly if you can take advantage of L2/L3 instruction and data caches. It still matters, even on modern "big memory" systems where gigabytes of installed RAM are the norm.
Technically correct, but filesystems are mostly about I/O. For example this Baidu filesystem copies blocks of data into userland memory and transfers them in RPC messages - any system using proper zero copy approach would easily beat it even if coded in Python or JS. Baidu also seems to use threads, locks and SEDA instead of more efficient (but much harder to code) thread-per-core async architecture. Threadpools and lock based synchronization are terrible for latency.
The fact that something is in C++ doesn't make it automatically efficient. And particularly, if we're talking about milliseconds, not nanoseconds here, in Java or C# you can do just about everything you can do in C++, performance-wise.
"cs启动太慢" means "cs start-up too slow," where 启动 is start-moving and likely a verb-result construction, a pattern in Chinese that to my knowledge doesn't exist in Germanic languages. The second one is more accurately translated to "Other SDK writing strategies."
Not commenting on the BDFS so much as it's really cool to see large Chinese companies contributing to open source. Does anyone know of other large projects outside of the main Android forks? Pardon the ignorance.
Also wonder if there will be larger skepticism toward integrating Chinese O/S in regards to potential influence by the government (like the NSA has tried to influence in the past)
This looks extremely promising and good. I work on distributed system, in particular on databases (so one abstraction layer above file systems). This looks like it would make for a really nice storage engine for https://github.com/amark/gun . Also it is nice to see non-English projects! Very exciting work.
In this case, I think it's fine. Chinese has a named number for 10000 (wàn/万), so they used that. Since English doesn't, they used 'thousands'. In either case, the idea is that the code would run on a large number of servers.
For instance, Hindi has special names for 100000 (lakh), 10M (crore/karod) etc. so a similar translation to Hindi would use those even if it meant introducing a factor of 10 in the literal interpretation.
i wanted to write awesome too, so i'll be more detailed :)
Seems (have not tried it yet) awesome because:
- another big party offering such software, the more choices the merrier for the users/sysadmins
- sandboxed
- scalable to 10k nodes
- no single point of failure
- ssd and traditional disk usage via the disk manager
I see many positive comments that are not downvoted, so when you downvote someone saying "awesome", I suspect it is because you disagree, not because it was a low value post, which would be the reason why you would downvote. Also, your response was "explain why"; again, I don't see people usually questioning each acclamation.
"The" IPFS announcement? IPFS itself has been around for quite a while. Could you be more specific to which announcement you're referring, and how that relates to BFS?
Sure, nothing ever stays the same, but I think that for the foreseeable future, we can reasonably expect English to stay the universal language of software development. It's the default foreign language people learn in their home countries for a lot of reasons, so a lot of people who want to get into the field already have at least a basic understanding of English. This aids them in learning and communication with other developers, and it's just too convenient to be displaced any time soon.
Maybe both kinds of comments are being downvoted by those of us who like technical conversations not to be full of people bitching about other people's choice of language. I don't find that strange at all.
Is it hypocrisy? English is not my native language but I consider it the default CS language. There must be a way for us to share knowledge, and that turned out to be English.
Yes, but it is a problem. Unlike computer languages, human languages do not only convey pure meanings, i.e. pure descriptions of relations between entities. They embed a full baggage of culture, so even if it is convenient and pragmatic to use English as the main language in CS, it is not neutral; it is both an effect and a cause of the Anglo-Saxon cultural, economic and military hegemony over the world.
I don't think they stick to Mandarin because they want to. English is considered hip and intellectual in China, especially in the first tier cities which I suppose the developers of such modern technology are. But the problem is that English is really, really hard for Chinese native speakers, since grammar, words, culture and pronunciation are so different from all the Chinese languages.
So as a team leader or project manager in China I would probably also stick with Chinese since it is much easier to find really good and not too expensive employees that way. Let them try to use English, support the ambition, but don't enforce it.
And I don't know much about India, but from what I've heard it is more a cultural issue that India lags behind. Everything is (so I heard) still very traditional and backward focussed, while China as a country spent 20-30 years becoming more open to new ideas and approaches. How true is that from other people's perspective here?
There are systematic faults, which prevent much change, if the current policies are kept up (note: India's literacy rate is ~78 %, since literacy (except in English) brings no great advantage).
Imagine China, with only the expensive class of engineers, for instance. Or at least, one with this class, and another class that was educated in English, but barely knows the language, let alone possesses any usable skill.
There are now villages, driven by this economics, where rural-children are being taught in English. Considering how bad the Japanese/Chinese are with English, it shouldn't be hard to interpolate how disabling this is when everything is being taught in a foreign language.
That India is a linguistic-apartheid state doesn't contradict with the fact that the creme de la creme is quite good (very apparent from the population in the US).
English is the most common script in India, why would we use anything else. Unless you are Hindian trying to impose your minority language on the rest of us.
- English is hardly the "most common script". Just because the retainers in Delhi impose the colonial apparatus on us, doesn't automagically give it "statistical power" as well.
- Every state has a (poor, uneducated, illiterate) captive linguistic population more than that of Korea; no reason they ought to use Hindi, nor even Nagari (script != language, in case your education didn't tell you that).
- Among the rich, yes, English is most common, and this is really what matters in the end, aint it ?
This is precisely why India will never be able to work in its own language, and also precisely why it is doomed to eternal poverty and continued illiteracy. Probably will remain a hub exporting little other than people, for the next couple centuries.
And this is why I admire China. It's not democratic, it has a paranoid regime, but at least they aren't run by hypocrites who'd use a "socialist democracy" as cover for continued colonization.
> trying to impose your minority language on the rest of us.
> Note the debate you are having with the GP is in English.
You don't say ?
> For better or worse, English is the common language for software development (and aviation, etc.)
English in India is more dense than the feudal castes of medieval Europe; hardly a professional thing this.
The Human-rights wallahs don't complain precisely because the current state of affairs benefits the nations that control them; much as it does their native retainers.
I imagined that. I am more like "let me tell you exactly what I want" rather than relying on behavior defined elsewhere. It would not pass review where I work.
Not to be rude, but I'm glad I don't work where you work.
How would that be consistent with your own classes? "Oh no, you can't just use a 'Tree' object, you need to explicitly set that there are no leaves yet, no branches yet, no squirrels yet, etc… etc…"
Do you .clear() your vectors before you use them?
This sounds like newbies that do:
#define TRUE (1 == 1)
It's not about initialization, it's more about specifying clearly what happens in all cases.
Anyway, the reason it would not pass review is that today you use one compiler, tomorrow you have to use another, and then you have to review all these little details again. It's about saving money more than anything, and you do that by not relying on compiler behavior.
with the added bonus of being able to add "const" to that, for the benefit of the reader.
This is not relying on compiler implementation! Can you name one language that has strings that initialise to anything but a valid object containing an empty string?
This is not an obscure side-effect. This is like assuming "std::vector<int> v;" creates an empty vector, not an undefined-state vector container.
(I don't want someone coding C++ as if all objects are references. Coding in one language as if it were another is a well-known antipattern)
Read the whole comment. Still don't see anything that invalidates my answer to "Can you name one language that has strings that initialise to anything but a valid object containing an empty string?".
> the reason for which it would no pass review is that today you use one compiler, tomorrow you have to use another
Good thing then that it's mandated by the language reference, and not up to the compiler to decide. According to C++11, §21.4.2/1, a default-constructed std::string is an object of class std::basic_string with non-null data and a size of 0.
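The guarantee is easy to check for yourself. A minimal sketch, with hypothetical helper names, that tests the default-constructed state of both `std::string` and the `std::vector<int> v;` case mentioned earlier in the thread:

```cpp
#include <cassert>
#include <string>
#include <vector>

// True if a default-constructed std::string is a valid, empty string.
inline bool DefaultStringIsEmpty() {
    const std::string s;  // no explicit initializer; the default constructor runs
    return s.empty() && s.size() == 0 &&
           s.c_str() != nullptr && s.c_str()[0] == '\0';
}

// True if a default-constructed std::vector<int> is a valid, empty container.
inline bool DefaultVectorIsEmpty() {
    std::vector<int> v;  // same rule: class types are default-constructed
    return v.empty() && v.size() == 0 && v.begin() == v.end();
}
```

Both functions return true on any conforming implementation, because the behaviour comes from the class's default constructor and not from the compiler. This is different from `int x;` at block scope, which really is left indeterminate.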
Except there's already a 'true'. When I see this I know that whoever wrote it is completely incompetent (as in "does not know programming", not "is stupid").
It's clever, yes. The bad kind of clever that's also misguided.
Sometimes there is a 'true', sometimes there isn't. Sometimes the code is new, sometimes it is legacy code. You are making too many assumptions. When I see this I know that...
The point of defining true to (1==1) is that it's "future proof" in case implicit typecasting to bool works in a world where 0 evaluates to "true".
That would break approximately ALL C code.
You're being ridiculous. You might as well try to protect against the meaning of "if" changing.
I've seen amateur code that tries to protect against "stdio.h" going away and therefore reimplementing everything in it. This is like that.
Believing that the meaning of everything can change means that you cannot use anything you didn't code yourself. If you can't trust documented APIs, then you're some sort of programmer NIH nihilist.
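For anyone who wants to see why the macro buys nothing, here is a small sketch in C++11 (the macro itself comes from the comment above; `TrueAsInt` is a hypothetical helper). The same values the macro tries to "protect" are fixed by the language, so the checks below hold at compile time.

```cpp
#include <cassert>

#define TRUE (1 == 1)  // the macro under discussion

// The result of a comparison is exactly `true`, which converts to the integer 1.
static_assert(TRUE == true, "(1 == 1) is the bool value true");
static_assert(static_cast<int>(TRUE) == 1, "true converts to 1");

// The language also fixes the other direction: zero is false, non-zero is true.
static_assert(!static_cast<bool>(0), "0 converts to false");
static_assert(static_cast<bool>(42), "any non-zero value converts to true");

// Runtime view of the same fact, for use in ordinary code.
inline int TrueAsInt() {
    return static_cast<int>(TRUE);
}
```

If these rules ever changed, every `if (ptr)` and `while (n)` would change with them, so the macro would not be the thing that saves the code.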