This endeavor looks largely orthogonal to what the objectives of an online encyclopedia should be. Creating as many stub articles as possible and filling them with "formulaic, generic, and reusable templated sentences with spots for specific information" seems more like a recipe for an automated content farm than for "disseminating the sum of human knowledge."
It would be most interesting to know what the 148 active Cebuano Wikipedia users think of the 5,331,028 articles the bot created, ostensibly for them. Too bad nobody apparently cared to ask.
In particular, since Cebuano speakers are likely to be fluent in Tagalog and/or English as well, they can easily use one of the other Wikipedia editions too. Without the hyperactive bot, the much smaller Cebuano Wikipedia would arguably be more relevant, reflecting topics truly of interest to the community.
While the number of articles is a convenient way of comparing Wikipedia language editions, it only works as such to the extent that the articles are kept to a certain standard. What we are observing here seems to be yet another instance of Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
The counterpoint is that automatically-created stub articles serve to encourage community editing. It's much easier to edit an existing article than create a new one from scratch. This is one of the key principles behind the Gene Wiki project[1], which creates stub articles for human genes for this reason:
> Basic articles (called “stubs”) were systematically created based on content extracted from structured databases. These stubs are then edited by the broader Wikipedia community, while “bots” keep the structured content in sync with the source databases.
(The "structured content" mentioned is the info box on the right-hand side of a gene article. Nowadays I believe this is populated directly from Wikidata[2].)
Note: I am a member of the lab that runs Gene Wiki, but my work is unrelated.
This seems fine as long as the articles are clearly marked as machine-generated. Machine translation regularly garbles the meaning of text while still producing readable output with correct sentence structure, which is a major problem in an encyclopedia.
The subtitle of the article makes it sound like the text in question is machine-translated, but it is created by filling a template with structured data. So long as the template is correct and the data source is accurate, the meaning won't be garbled.
Why not? I tried a random article and got one about a park: https://ceb.wikipedia.org/wiki/Atokad_Park Note that there's no English article about that park that could have served as the source for a machine translation. The article mostly lists a bunch of facts about the location that are easily available in public databases. I don't know how many entries GeoNames.org has, but this park has the number 5063315, so there should be material for quite a lot of articles.
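For illustration, GeoNames exposes those facts through a public JSON web service, so fetching the raw material for such an article is a one-call affair. A sketch only; "demo" is GeoNames' heavily rate-limited sample account, and a real username would go there:

```python
# Sketch: fetch the structured record behind a GeoNames id. The field
# names (name, countryName, fcodeName) are standard in GeoNames responses.
import requests

resp = requests.get("http://api.geonames.org/getJSON",
                    params={"geonameId": 5063315, "username": "demo"})
place = resp.json()
print(place.get("name"), "|", place.get("countryName"),
      "|", place.get("fcodeName"))
```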
I'm fluent in Visayan (Cebuano & Waray-Waray). The articles are hard to read, but no harder than, say, the translated Bible. I honestly didn't realize they were bot-constructed. Very cool!
How often are they inaccurate though? I worked for a content-heavy startup that thought it was a good idea to machine-translate all our content. A brief skim through the translations easily found many instances where the meaning of sentences was entirely reversed.
Indeed, I found myself very curious about why the bot of "Swedish physicist Sverker Johansson" was writing articles in Cebuano (a language mainly spoken in the southern Philippines). It turns out that Cebuano is his wife's native language.
I'm curious about the actual quality of these articles. The OP says "the majority [of a random 1000] were surprisingly well constructed", but it's not clear what that even means. Does the author of the article know Cebuano (or Swedish or Waray-Waray, which are other languages this bot writes in) well enough to judge? Or does it just mean the articles looked like regular articles (ie they have all the infoboxes and other trappings of human-editor-driven articles)?
"Well constructed" sounds like a measure that a content farm would use. The measure for an encyclopedia should be "accurate", and based on the number of times I've seen machine translation completely garble meaning (while maintaining correct structure), I doubt these articles would score highly on that measure.
Having some basic information already present means that potential editors won't have to start from an empty page, making it easier to see where content can be added.
Of course if what you're interested in is actually how much humans contribute to Wikipedia in a given language, you'll need to ignore bot activity, but that doesn't make it useless.
There are many viewpoints. As a counterexample, consider the article for one of the topics you mention ("truly of interest to the community"). A user could come across a concept (or person, place, etc.) that they aren't familiar with, click a link, find a short and formulaic but truthful summary, then return to continue reading - all within their own Wikipedia. That's useful.
The bot in question essentially takes a good external database of facts, and populates a stub article with a fill-in-the-blank method. It's not that complicated.
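A hypothetical sketch of that fill-in-the-blank step (the template text and field names are made up for illustration, not taken from the actual bot or a real GeoNames record):

```python
# Hypothetical fill-in-the-blank stub generator; template and data are
# illustrative, not taken from Lsjbot or a real database row.
TEMPLATE = ("{name} is a {feature_type} in {country}. "
            "It lies at an elevation of {elevation} metres above sea level.")

record = {                      # one row from a structured source
    "name": "Example Park",
    "feature_type": "park",
    "country": "Exampleland",
    "elevation": 17,
}

stub = TEMPLATE.format(**record)
print(stub)
# -> Example Park is a park in Exampleland. It lies at an elevation of
#    17 metres above sea level.
```

As long as the template sentence is grammatical and the source data is accurate, every stub it emits is grammatical and accurate too, which is what distinguishes this from machine translation.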
I always thought it was a bit bizarre that different language editions of Wikipedia contain different information. It seems the focus should be more on translation than content creation. Maybe that isn’t practical with the current structure, but surely the aim should be a definitive knowledge graph rather than a disparate and unevenly duplicated set of articles. Just my two cents – I am sure many have put a lot of thought into how to best tackle this.
But then you'd lose all the interesting cultural insights you can infer from comparing the way different language versions describe the same topic. The texts are supposed to be neutral, but even neutral can mean quite different things. And when you have a topic that is prone to be suffering from edit wars in one language (e.g. a company you suspect of paying someone to "clean their wiki"), escaping to a different version can often be worthwhile if you have the language skills.
What you describe is indeed the downside. The upside is that if we automatically translated English content (for, say, mathematics) into other languages, then non-English speakers would have access to the highest-quality Wikipedia content.
I do think this is an interesting point: the current Wikipedia policy of allowing each language's Wikipedia to evolve independently risks missing an opportunity to free non-English speakers from some of the unfairness of not having been born an English speaker.
(With no disrespect to the Wikipedias of other major languages, I'm sure they have many extremely high-quality articles. But I assume that in, say, math, physics, and computer science, the English Wikipedia is the highest quality.)
> translated English content (for, say, mathematics) into other languages, then non-English speakers would have access to the highest-quality Wikipedia content.
Why would you think that English content in mathematics is the highest quality in comparison to other languages? Your thinking is flawed.
Because English Wikipedia is by far the version with the most readers, and therefore the one where most time has been invested in making an article good.
In my experience, non-English Wikipedia articles are great for "insider knowledge" about topics and/or local figures who are deemed not relevant enough to the English Wikipedia editors. But outside these niches, their texts tend to be both shorter and less polished.
Because academics around the world who are native speakers of non-English languages do much of their professional work in English, read English-language journals, and, if they contribute to Wikipedia, are at least as likely to do so in English as in their native language. As a result, English Wikipedia (for, say, mathematics) has vastly more attention and contributions, and is more comprehensive, than the other language editions.
I don't know about math and other STEM pages, but modern history and politics pages (the last 100-150 years) in many non-English languages are highly questionable or even downright dishonest.
Often people don't agree even on the most basic facts. For example, in the case of the Polish-Soviet war (1919-1920), "who attacked who?" is still a controversial topic.
Valid point. But why delineate differences of opinion along language lines when there are all sorts of reasons to disagree? Whether the disagreement is political, philosophical, regional, or cultural, using language as the key differentiator is a fuzzy proxy. You need to find a way to resolve these differences if the content is going to be trustworthy.
How do you mean? I'm fine with the fact that the Greek Wikipedia doesn't contain an article about the Boston Tea Party, but I like that it contains an article about the 1821 rebellion. Requiring the information to be the same across languages would mean that either both should be translated, or, if no translator can be found, one should be deleted.
EDIT: Or do you mean contain the same information between different languages of a specific article?
I think they mean 2 articles in 2 languages with the same content, or as close as a translator can get. Very difficult to keep updated without automation but that seems like something they want to steer away from until no longer reliant on machine translations.
Why should something be deleted? If I was Greek studying the American Revolution it would be relevant. If there were French Wikipedia articles about cooking topics translated to English I’d be interested.
I'm guessing the task is just too hard, so this is the next best option. For all of the versions to contain the same content, every edit would have to be made to at least the English version, and optionally to another version as well. What happens when someone who only knows a non-English language wants to make an edit? Does the site ping a user who knows both languages to translate it? It's just easier to let the versions be split.
You know, I never knew that that sidebar was a link to the same entry in different languages - thanks! Still, it makes me wonder if there is still a way to open up more content in other languages, so that those who contribute more in-depth can somehow have that content be shared on other language pages more transparently. But, I never studied library science and I'm sure finer minds than mine have considered this problem.
The tricky thing is that any text content has to be translated. You might be able to get away with not translating maps, since place names tend to be more stable (or at least generally pretty easy to work out) across different languages. For example, "Pologne-Lituanie" is going to be within the capability of most English speakers to work out, even if they've never heard of "Poland-Lithuania".
It is possible to share images and other media across languages via Wikimedia Commons, and my understanding is that Wikipedia does push people to do this.
It links to articles in over 80 languages. So on one hand it does a really good job at cross linking. On the other hand, missing out on linking to the languages you mention seems like a huge error.
It does link to all those languages. I think what the grandparent was referring to is that there's no way to indicate that an article in another language might be interesting in some way.
Such as, for Carnival, being written in a language spoken by a people who celebrate it, or for Alexander the Great, highlighting the languages spoken in territories he conquered.
It's an interesting proposal, but I get a headache just thinking about the politics of implementing it.
The problem is that humans are very fond of doublespeak, rewriting history, and just plain lying. That is why certain countries engage in government-sponsored rewriting of local-language wikis. I say we write off all non-English wikis as a lost cause and strive to maintain at least one wiki that is as objective as rationally possible. (PS: my native language is not English, so I'm not just defending the "easiest for me" solution.)
Having machine-translated content is powerful for SEO, but I don't know how practical that is for Cebuano. It would be nice for English to no longer be practically required for people to become computer literate.
Genuinely curious about your last point (I don't know much about the topic). Is English intrinsically better at this, or is it because of the presence of jargon? Is it a studied phenomenon, or is it something most people feel?
English is usually shorter than other Latin-based languages. It's longer than ideogram-based ones, but you don't have to learn 100,000 symbols to express yourself in it.
It also has a very simple grammar compared to most languages. Take this sentence:
"I would like not to go to school today"
The french equivalent would be:
"Je voudrais ne pas aller à l'école aujourd'hui."
"would like" is a simple combination of two words, but in french you need to know the precise conjugation of it.
"not" is actually expressed as 2 words with "ne pas", which can be positioned in several ways.
The infinitive, as with "to go", is simple in English: just add "to". In French, each verb is different, like "aller".
Then you have "the" in all circumstances in English, but the "l'" could also be "le", "la", or "les" depending on the word after it. Also remember that each noun is either feminine or masculine in French, even a stone or the sun.
Then "à" and "école" got an accent. French has many of them, you need to know the right one, where to place it, how to pronunciation it and type it on the keyboard.
Finally, "today" vs "aujourd'hui". I know which one is easier to type in a bug report.
Not to say English doesn't have weird traps, but it's very, very relaxing compared to the rest. And much more efficient.
Also, describing a view of the countryside with it feels a bit limiting. But I'm not Shakespeare :)
Hangeul is an alphabet whose letters are grouped into syllable blocks, so anyone can learn to read it, just like the Latin script. It was specifically designed in response to peasants not being able to read Chinese characters (hanja).
Things like the subject and topic can be, and often are, omitted when superfluous. (The sentence above omits the subject and subject particle; there is no topic here.) English almost always needs to specify the subject, aside from very casual speech/slang. Korean, like English, doesn't need to specify the gender of nouns, and it also doesn't need "a" or "the" markers. The location particle above could be dropped too.
However, Korean does have complex honorifics and formality conjugations, which typically get longer the more formal/polite the speech is. Above we have the plain or dictionary form, which is usually the shortest form as well.
"I don’t want to go" and "I would like not to go" express different things.
One is expressing opposition right now, the other one is expressing desire or even a request, potentially while the action is already engaged. The first one is definitive, the second one is wishful thinking or negotiation.
I don’t see any practical difference other than using more polite, indirect language. In the end, “would like not to” expresses the same opposition.
I don’t think there’s a 1:1 translation of “would like not to”. I’d probably say something like “it’d be good if I did / didn’t do X”. Which is less direct than the equivalent “I do / don’t want to do X”.
What French really needs is a comprehensive spelling/grammar reform. In Haitian Creole, "ne pas aller à l'école aujourd'hui" is written simply "pa ale lekòl jodi"! And one can just write "mwen" instead of having to differentiate "je" and "moi". (There's also a marker word "ta" that's entirely isomorphic to the English "would" in that it introduces a conditional modality.) Not coincidentally, English is basically a creole language as well. That's why it's so incredibly simple.
English has something like 17 tenses; many languages have only 3 without any loss of expressivity. When learning English you need to learn each word twice (once to write, once to speak); that's not the case with many phonetically spelled languages.
English isn't inherently better as a world language than anything else; it's just there, so it benefits from the network effect (it's more beneficial to learn the language that has the most users, so it gets even more users).
In colloquial Japanese that would be three words: "Kyou gakkou ikitakunai." The first word is "today" (unconjugated), the second is "school" (unconjugated), and the third is the negative desiderative ("want to") form of "to go" (iku > ikitai > ikitakunai). The subject (the speaker) is implied, and neither the time nor the topic/object needs to be explicitly tagged.
I can see that being true. Many of the web sites I build are multi-lingual, and when designing for many languages, you have to take into consideration that certain languages take many more characters or words to express an idea than in English.
Off the top of my head, I believe we factor in 15% more text space for Spanish. German is something like 60% more.
When localizing software, it's also a general rule of thumb that your on-screen UI spaces need to be something like 40% bigger than the English text that goes in them, since the German equivalent is always going to be way bigger. It's common to discover half-way through localization that some of your text doesn't fit anymore.
(FWIW, this also happens when swapping from ideographic languages to English - western localizations of Japanese video games often end up with very small text as a result.)
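Those rules of thumb are easy to encode when reserving label space. A sketch using the ballpark factors mentioned in these comments (15% for Spanish, 60% for German, 40% as a generic cushion); these are rough figures from the thread, not measured constants, and real projects measure rather than guess:

```python
# Sketch: reserve extra label width per target language, using the
# ballpark expansion factors from the comments above (not measured data).
EXPANSION = {"es": 1.15, "de": 1.60}
DEFAULT_CUSHION = 1.40  # generic "make UI 40% bigger than English" rule

def min_label_width(english_px: int, lang: str) -> int:
    """Minimum pixel width to reserve for a localized UI label."""
    return round(english_px * EXPANSION.get(lang, DEFAULT_CUSHION))

print(min_label_width(120, "de"))  # 192 px for the German string
print(min_label_width(120, "fr"))  # 168 px under the generic cushion
```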
In my opinion, jargon helps, but English is just easier to learn and shorter to communicate.
My first language is Portuguese, and English was not so hard for me to learn. I studied a bit of French and it was OK, but not as easy as English - and that's considering French should have been a bit easier for me, since it's closer to Portuguese.
Now I'm studying German and... it's way harder than English or French. At the least, it's far more verbose, with more complex grammar rules and conjugations.
English has an incredibly simple grammar, comparatively speaking. There are only three verb conjugations (if you don't count auxiliary verbs) and one gender. Even naming variables is easier!
Yes. In the end, we need one common language so that we can communicate with people around the globe, and that's not only about computers, but everything.
> It would be nice for English to no longer be practically required for people to become computer literate.
That's already the case in other mission-critical industries, like aviation. It's hard to build businesses with cross-border collaboration without using English. (This is also how I learned English in the first place; it was a good motivator!)
I like this, because the growth and progress of a knowledge base, regardless of language or hosting platform, is incremental and cumulative. Wikipedia shows this effectively in its English edition because it happened so quickly, but even the legacy encyclopedias did this over centuries. Whether a bot lays the groundwork from other reference points or dedicated humans do it is sort of immaterial, I think, because in the very long run this benefits the people who speak the language.
In an age when languages are dying with their last speakers, Visayan speakers have done much to preserve theirs: although it is not a written/codified language, volunteers give radio broadcasts in it, books are published in it (here the lack of codification shows in the variance in spelling, verb conjugation, and sentence structure), and so on. Thank you to this Wikipedian for doing something to preserve a wonderful language (I mention in another comment that I am fluent and miss speaking it regularly).
That's sad. In the end, all this would (if it hasn't already) make them just go for the English version. I already do this (I'm Brazilian), as the Portuguese version is nowhere near the international (English) version in terms of completeness and being up to date.
BTW, do you have a link for their terms on "relevance"?
The English version isn't particularly free. I attempted to add a page about a file format that is fairly well used but doesn't have much information about it online. The only real source is a zip file on a company's website containing a PDF with the file spec and some example programs. Unfortunately, the editors decided that due to the lack of referenceable sources, they would rather no article exist at all.
I understand it for some cases where the mods just need to stop people making up random crap on topics that don't exist or can't be verified. But in this case a single reference is more than enough to write the whole page because the spec is literally the only source of truth on the topic.
Unfortunately, I think the mods may be so passionate about "protecting the integrity of Wikipedia" that they let legitimate content be deleted. It also doesn't help that the Wikipedia UI for disputes and edits is really confusing; I had a hard time working out what was going on or how to communicate with the moderator. The whole system is designed for power users only.
It's important to note that there are not only "the mods", but two opposing factions in Wikipedia: the Deletionists and the Inclusionists. [1]
I too agree that we don't need articles on someone's cat, but I've had articles deleted as not notable on indie web comics and indie role-playing games with hundreds or thousands of readers or copies sold.
I thought that the fact that the RPG was published and publicly available, and was being discussed in RPG forums would make it notable enough, especially when it was mentioned as an inspiration for rules in more traditional RPGs. But since they hadn't been mentioned in any published articles they were deleted and there was no real way for me to fight it. I had added stuff like the list of contributors, publishing year, and overview of the rules and setting, with no personal discussion of the game.
The result was that I stopped trying to improve Wikipedia, because I don't have the time or interest to fight people with an infinite amount of time who delete my additions. My main contribution to Wikipedia wouldn't have been to the articles on Barack Obama or World War II anyway, as those already have experts adding information. I could have brought information on my specialized topics of interest, but realized that it would all seem non-notable to someone who's not interested in the same things and would be deleted.
Fileformats.info is a thing. You could try to add it to their wiki; then you would have a "reliable" source for the English Wikipedia. Specialized websites are usually considered "reliable" for that purpose.
I actually did reference a user-created page on the OSM wiki about the format, which matched what the official docs said, but was told that wikis are not a referenceable source for Wikipedia.