Hacker News
The second largest version of Wikipedia is written mostly by one bot (vice.com)
140 points by jxub on Feb 24, 2020 | hide | past | favorite | 84 comments


This endeavor looks largely orthogonal to what the objectives of an online encyclopedia should be. Creating as many stub articles as possible and filling them with "formulaic, generic, and reusable templated sentences with spots for specific information" seems more like a recipe for an automated content farm than for "disseminating the sum of human knowledge."

It would be most interesting to know what the 148 active Cebuano Wikipedia users think of the 5,331,028 articles the bot created, ostensibly for them. Too bad nobody apparently cared to ask.

In particular, since Cebuano speakers are likely to be fluent in Tagalog and/or English as well, they can easily use one of the other Wikipedia editions too. Without the hyperactive bot, the much smaller Cebuano Wikipedia would arguably be more relevant, reflecting topics truly of interest to the community.

While the number of articles is a convenient way of comparing Wikipedia language editions, it only works as such to the extent that the articles are kept to a certain standard. It seems to me that what we are observing here is yet another example of the situation that when a measure becomes a target it ceases to be a good measure.


The counterpoint is that automatically-created stub articles serve to encourage community editing. It's much easier to edit an existing article than create a new one from scratch. This is one of the key principles behind the Gene Wiki project[1], which creates stub articles for human genes for this reason:

> Basic articles (called “stubs”) were systematically created based on content extracted from structured databases. These stubs are then edited by the broader Wikipedia community, while “bots” keep the structured content in sync with the source databases.

(The "structured content" mentioned is the info box on the right-hand side of a gene article. Nowadays I believe this is populated directly from Wikidata[2].)

Note: I am a member of the lab that runs Gene Wiki, but my work is unrelated.

---

[1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5944608/

[2]: https://www.ncbi.nlm.nih.gov/pubmed/26989148


This seems fine as long as the articles are clearly marked as machine-generated. Machine translation regularly garbles the meaning of text, while producing readable text that has correct sentence structure etc. This is a major problem in an encyclopedia.


The subtitle of the article makes it sound like the text in question is machine-translated, but it is created by filling a template with structured data. So long as the template is correct and the data source is accurate, the meaning won't be garbled.


I can't really see how the number of translated articles could be as huge as it is using only that approach.


Why not? I tried a random article and got one about a park: https://ceb.wikipedia.org/wiki/Atokad_Park Note that there's no English article about that park that could have served as the source for a machine translation. The article mostly lists a bunch of facts about the location that are easily available in public databases. I don't know how many entries GeoNames.org has, but this park has the number 5063315, so there should be material for quite a lot of articles.


I guess it depends on how much information is in Wikidata, assuming that's the data source.


I'm fluent in Visayan (Cebuano & Waray-Waray). The articles are hard to read, but no harder than, say, the translated Bible. I honestly didn't realize they were bot-constructed. Very cool!


How often are they inaccurate though? I worked for a content-heavy startup that thought it was a good idea to machine-translate all our content. A brief skim through the translations easily found many instances where the meaning of sentences was entirely reversed.


The articles I have read have generally been history / geography. They tend to be okay.

Can't speak to tech / science / art in the Visayan space.


Indeed, I found myself very curious about why a bot run by "Swedish physicist Sverker Johansson" was writing articles in Cebuano (a language mainly spoken in the southern Philippines). It turns out that Cebuano is his wife's native language.

I'm curious about the actual quality of these articles. The OP says "the majority [of a random 1000] were surprisingly well constructed", but it's not clear what that even means. Does the author of the article know Cebuano (or Swedish or Waray-Waray, which are other languages this bot writes in) well enough to judge? Or does it just mean the articles looked like regular articles (ie they have all the infoboxes and other trappings of human-editor-driven articles)?


"Well constructed" sounds like a measure that a content farm would use. The measure for an encyclopedia should be "accurate", and based on the number of times I've seen machine translation completely garble meaning (while maintaining correct structure), I doubt these articles would score highly on that measure.


Many Wikipedia articles start out as just a few sentences and then slowly accumulate content. To give an example I noticed recently, compare https://en.wikipedia.org/w/index.php?title=Pu-Xian_Min&oldid... in 2007 vs. now https://en.wikipedia.org/wiki/Pu-Xian_Min

Having some basic information already present means that potential editors won't have to start from an empty page, making it easier to see where content can be added.

Of course if what you're interested in is actually how much humans contribute to Wikipedia in a given language, you'll need to ignore bot activity, but that doesn't make it useless.


There are many viewpoints. As a counterexample, consider the article for one of the topics you mention ("truly of interest to the community"). A user could come across a concept (or person, place, etc.) that they aren't familiar with, click a link, find a short and formulaic but truthful summary, then return to continue reading - all within their own Wikipedia. That's useful.


How can you ensure that it's truthful? Machine translation garbles meaning all the time.


The bot in question essentially takes a good external database of facts, and populates a stub article with a fill-in-the-blank method. It's not that complicated.
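For illustration, that fill-in-the-blank approach can be sketched in a few lines of Python. The template wording, field names, and the elevation value below are invented for the example; the real bot draws on structured sources such as GeoNames and species databases.

```python
# A minimal sketch of bot-style stub generation: fill a fixed sentence
# template with fields from a structured database record. No machine
# translation is involved, so the meaning can't be garbled as long as
# the template and the data are correct.

TEMPLATE = (
    "{name} is a {feature} in {country}. "
    "It lies {elevation} metres above sea level."
)

def make_stub(record):
    """Render one stub article from a database record."""
    return TEMPLATE.format(**record)

stub = make_stub({
    "name": "Atokad Park",
    "feature": "park",
    "country": "the United States",
    "elevation": 820,  # illustrative value, not from GeoNames
})
print(stub)
```

With one template per feature type (parks, rivers, species, ...) and millions of database rows, the article count scales with the size of the source database, which is how a handful of templates can yield five million pages.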


> since Cebuano speakers are likely to be fluent in Tagalog and/or English as well

Likely, but not always. I've encountered areas in that region, for example Bohol, where only the local language was well understood.

And "Taglish" is about as understandable as Japlish - lots of similarities to English, but what does it mean?

Same in Indonesia. Outside of the main cities on Sumatra, especially with older people, only the local language is used daily.


I always thought it was a bit bizarre that different language editions of Wikipedia contain different information. It seems the focus should be more on translation than content creation. Maybe that isn’t practical with the current structure, but surely the aim should be a definitive knowledge graph rather than a disparate and unevenly duplicated set of articles. Just my two cents – I am sure many have put a lot of thought into how to best tackle this.


But then you'd lose all the interesting cultural insights you can infer from comparing the way different language versions describe the same topic. The texts are supposed to be neutral, but even "neutral" can mean quite different things. And when you have a topic that is prone to edit wars in one language (e.g. a company you suspect of paying someone to "clean their wiki"), escaping to a different version can often be worthwhile if you have the language skills.


What you describe is indeed the downside. The upside is that if we automatically translated English content for (say mathematics) into other languages then non-English speakers would have access to the highest quality Wikipedia content.

I do think this is an interesting point: the current Wikipedia policy of allowing each language's Wikipedia to evolve independently risks missing an opportunity to free certain non-English speakers from some of the unfairness that results from happening to have not been born an English speaker.

(With no disrespect to the Wikipedias of other major languages, I'm sure they have many extremely high quality articles. But I assume that in, say Math, Physics, Computer Science, the English Wikipedia is the highest quality.)


> translated English content for (say mathematics) into other languages then non-English speakers would have access to the highest quality Wikipedia content.

Why would you think that English content in mathematics is the highest quality in comparison to other languages? Your thinking is flawed.


Because English Wikipedia is by far the version with the most readers, and therefore the one where most time has been invested in making an article good.

In my experience, non-English Wikipedia articles are great for "insider knowledge" about topics and/or local figures who are deemed not relevant enough to the English Wikipedia editors. But outside these niches, their texts tend to be both shorter and less polished.


Because academics around the world who are native speakers of non-English languages do much of their professional work in English, read English language journals, and, if they contribute to Wikipedia, they are at least as likely to do so in English than their native language. As a result, English Wikipedia (for, say, mathematics) has vastly more attention and contributions, and is more comprehensive, than other languages.


> Because academics around the world who are native speakers of non-English languages do much of their professional work in English

Source?


I don't know about Maths and other STEM pages, but modern history and politics (last 100-150 years) pages in many non-English languages are highly questionable or even downright lying.


Indeed. Plus there’d be tugs of war, for example, the Sino-Japanese war, or the Russo-Japanese war, or the Israeli-Palestinian conflict.

There is no way to reconcile those histories.


What? Why not? An objective and comprehensive recollection of facts, mutually identified responses, and associated outcomes can't be hashed out?


Often people don't agree even on the most basic facts. For example, in the case of the Polish-Soviet war (1919-1920), "who attacked who?" is still a controversial topic.


Valid point. But why delineate differences of opinion along language lines when there are all sorts of reasons to disagree? Whether the disagreement is political, philosophical, regional, or cultural, using language as the key differentiator seems like a fuzzy proxy. You need to find a way to resolve these differences if the content is going to be trustworthy.


Exactly.


How do you mean? I'm fine with the fact that the Greek Wikipedia doesn't contain an article about the Boston Tea Party, but I like that it contains an article about the 1821 rebellion. Requiring the information to be the same across languages would mean that either both should be translated, or, if no translator can be found, one should be deleted.

EDIT: Or do you mean contain the same information between different languages of a specific article?


I think they mean 2 articles in 2 languages with the same content, or as close as a translator can get. Very difficult to keep updated without automation but that seems like something they want to steer away from until no longer reliant on machine translations.


Yeah, they meant: why does the Barack Obama article in English differ in content from the Swedish one?


Why should something be deleted? If I was Greek studying the American Revolution it would be relevant. If there were French Wikipedia articles about cooking topics translated to English I’d be interested.


I'm guessing the task is just too hard, so this is the next best option. For all of the versions to contain the same content, you would have to make every edit to at least the English version and then mirror it in every other version. What happens when someone who only knows a non-English language wants to make an edit? Does the site ping a user who knows both languages to translate it? It's just easier to let the versions be split.


I'd rather see Wikipedia find a way to link these different sites in more interesting ways, for example if I go to the entry for Carnival (https://en.wikipedia.org/wiki/Carnival), why doesn't it link me to the Brazilian (https://pt.wikipedia.org/wiki/Carnaval), Spanish (https://es.wikipedia.org/wiki/Carnaval) or Italian (https://it.wikipedia.org/wiki/Carnevale) entries for which I might learn more, using auto-translate?


All of those languages and more are linked in the sidebar, what would you prefer to see?


You know, I never knew that that sidebar was a link to the same entry in different languages - thanks! Still, it makes me wonder whether there is a way to open up more content in other languages, so that those who contribute more in-depth content can somehow have it shared on other language pages more transparently. But I never studied library science, and I'm sure finer minds than mine have considered this problem.


The tricky thing is that any text content has to be translated. You might be able to get away with not translating maps, since place names tend to be more stable (or at least generally pretty easy to work out) across different languages. For example, "Pologne-Lituanie" is going to be within the capability of most English speakers to work out, even if they've never heard of "Poland-Lithuania".

It is possible to link images and other things via Wikimedia, and my understanding is that Wikipedia does push for people to do this.


It links to articles in over 80 languages. So on one hand it does a really good job at cross linking. On the other hand, missing out on linking to the languages you mention seems like a huge error.


It does link to all those languages. I think what the grandparent was referring to is that there's no way to indicate that an article in another language might be interesting in some way.

Such as, for Carnival, being written in a language spoken by a people who celebrate it, or for Alexander the Great, highlighting the languages spoken in territories he conquered.

It's an interesting proposal, but I get a headache just thinking about the politics of implementing it.


The problem is that humans are very fond of doublespeak, rewriting history, and just plain lying. That is why certain countries engage in government-sponsored rewriting of local-language wikis. I say we just write off all non-English wikis as a lost cause and strive to maintain at least one wiki that is as objective as rationally possible. (PS: my native language is not English, so I'm not just defending the "easiest for me" solution.)


I discovered this in 2018, when comparing lists of languages supported by different software and the number of speakers.

https://peterburk.github.io/i2018n/#wikipedia

Having machine-translated content is powerful for SEO, but I don't know how practical that is for Cebuano. It would be nice for English to no longer be practically required for people to become computer literate.


> It would be nice for English to no longer be practically required for people to become computer literate.

French here. We are terrible at English in my country.

Still, the fact that most information in computing is shared in English is a godsend. Sure, you have to learn it, but then:

- no need to search for it in so many languages

- no need to produce translations of tutorials/docs/comments in so many languages

- the community to share and communicate with is huge and diverse

- English is way more efficient than French, Spanish, German or Chinese for talking about technical stuff


Genuinely curious about your last point (I don't know much about the topic). Is English intrinsically better at this, or is it because of the presence of jargon? Is it a studied phenomenon, or is it something most people feel?


English is usually shorter than other Latin-script languages. It's longer than ideogram-based ones, but you don't have to learn 100,000 symbols to express yourself in it.

It also has a very simple grammar compared to most languages. Take this sentence:

"I would like not to go to school today"

The french equivalent would be:

"Je voudrais ne pas aller à l'école aujourd'hui."

"would like" is a simple combination of two words, but in French you need to know its precise conjugation.

"not" is actually expressed as two words with "ne pas", which can be positioned in several ways.

The infinitive, as in "to go", is simple in English: just add "to". In French, each verb is different, like "aller".

Then you've got "the" in all circumstances in English, but the "l'" could also be "le", "la", or "les" depending on the word after it. Also remember that each word is either feminine or masculine in French, even a stone or the sun.

Then "à" and "école" have accents. French has many of them; you need to know the right one, where to place it, how to pronounce it, and how to type it on the keyboard.

Finally, "today" vs "aujourd'hui". I know which one is easier to type in a bug report.

Not to say English doesn't have weird traps, but it's very, very relaxing compared to the rest. And much more efficient.

Also, describing a view of the countryside with it feels a bit limiting. But I'm not Shakespeare :)


And then you have Korean.

오늘 학교에 가기 싫다 (I don’t want to go to school today)

Hangeul is an alphabet written in syllable blocks, so anyone can learn to read it, just like Latin script. It was specifically designed in response to peasants not being able to read Chinese characters (hanja).

Things like the subject and topic can be omitted when they're superfluous. (The above sentence omits the subject and subject particle. There is no topic here.) English almost always needs to specify the subject, aside from very casual speech/slang. Korean, like English, doesn't need to specify the gender of nouns, and it also doesn't need "a" or "the" markers. The location particle above could be dropped too.

However Korean does have complex honorifics and formality conjugations, which typically get longer the more formal/polite it is. Above we have the plain or dictionary form, which is usually the shortest form as well.


How would you express "would like"?

"I don’t want to go" and "I would like not to go" express different things.

One is expressing opposition right now, the other one is expressing desire or even a request, potentially while the action is already engaged. The first one is definitive, the second one is wishful thinking or negotiation.


I don’t see any practical difference other than the use of more polite, indirect language. The effect of “would like not to” is the same in the end: opposition.

I don’t think there’s a 1:1 translation of “would like not to”. I’d probably say something like “it’d be good if I did / didn’t do X”. Which is less direct than the equivalent “I do / don’t want to do X”.


Actually 가기(가) is the topic here unless I’m mistaken. I wrote this too fast last night.


I painfully agree with you, as a French. So much so that I comment my code in English for brevity.

Regarding "aujourd'hui", this monstrosity literally means "on the day of this day"...

Our dear language needs a reform, or it will survive only as a Romance dialect.


What French really needs is a comprehensive spelling/grammar reform. In Haitian Creole, "ne pas aller à l'école aujourd'hui" is written simply "pa ale lekòl jodi"! And one can just write "mwen" instead of having to differentiate "je" and "moi". (There's also a marker word "ta" that's entirely isomorphic to the English "would" in that it introduces a conditional modality.) Not coincidentally, English is basically a creole language as well. That's why it's so incredibly simple.


English has like 17 tenses; many languages have only 3, without any loss in expressivity. When learning English you need to learn each word twice (once to write, once to speak) - that's not the case with many phonetically spelled languages.

English isn't inherently better as a world's language than anything else, it's just there, so it benefits from the network effect (it's more beneficial to learn language that has the most users, so it gets even more users).


In colloquial Japanese that would be three words: Kyou gakkou ikitakunai. The first word is "today" (unconjugated), the second is "school" (unconjugated), and the third is the negative "wish" (volitional?) form of "to go" (iku > ikitai > ikitakunai). The subject (the speaker) is implied, and neither the time nor the topic/object needs to be explicitly tagged.


I can see that being true. Many of the web sites I build are multi-lingual, and when designing for many languages, you have to take into consideration that certain languages take many more characters or words to express an idea than in English.

Off the top of my head, I believe we factor in 15% more text space for Spanish. German is something like 60% more.
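As a rough sketch, those rules of thumb can be encoded directly when budgeting label widths. The language codes and factors below are the illustrative figures from this thread (15% for Spanish, 60% for German, with a generic 40% fallback), not measured constants.

```python
# Rough sketch: reserve extra character width for translated UI labels,
# using rule-of-thumb expansion factors relative to English.

EXPANSION = {
    "en": 1.00,  # baseline
    "es": 1.15,  # Spanish: ~15% longer than English
    "de": 1.60,  # German: ~60% longer
}

def reserved_width(english_chars: int, lang: str) -> int:
    """Characters to reserve for a label translated from English."""
    # Unknown languages fall back to a generic 40% allowance.
    return round(english_chars * EXPANSION.get(lang, 1.40))

print(reserved_width(20, "de"))  # reserve 32 characters for German
```

A design like this only budgets space up front; it doesn't replace checking the actual translated strings, which is why text still overflows halfway through localization.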


When localizing software, it's also a general rule of thumb that your on-screen UI spaces need to be something like 40% bigger than the English text that goes in them, since the German equivalent is always going to be way bigger. It's common for it to turn out halfway through localization that some of your text doesn't fit anymore.

(FWIW, this also happens when swapping from ideographic languages to English - western localizations of Japanese video games often end up with very small text as a result.)


In my opinion, jargon helps, but English is just easier to learn and shorter to communicate.

My first language is Portuguese, and English was not so hard for me to learn. I studied a bit of French and it was OK, but not as easy as English - and this despite the fact that French should be a bit easier for me, since Portuguese is my first language.

Now I'm studying German and... it's way harder than English and French. At the least, it's way more verbose, with more complex grammar rules and conjugations.


English has an incredibly simple grammar, comparatively speaking. There are only three verb conjugations (if you don't count auxiliary verbs) and one gender. Even naming variables is easier!


> english is way more efficient than french, spanish, german or chinese to talk about technical stuff

That's your opinion or do you have supporting research to link to?


> It would be nice for English to no longer be practically required for people to become computer literate.

In turn, you'll get "every other language becomes practically required for people to be able to communicate about computers"


Yes. In the end, we need one common language so that we can communicate with people around the globe, and that's not only about computers, but everything.


> It would be nice for English to no longer be practically required for people to become computer literate.

That's already the case in other mission-critical industries, like aviation. It's hard to build businesses with cross-border collaboration without using English. (This is also how I learned English in the first place; it was a good motivator!)


I like this, because the growth and progress of a knowledge base, regardless of language or hosting platform, is incremental and cumulative. Wikipedia shows this effectively in the English edition because it happened so quickly. But even the legacy encyclopedias did this over centuries. Whether a bot lays the groundwork from other reference points or dedicated humans do it is sort of immaterial, I think, because in the very long run this benefits the people who speak the language.

In an age when languages are dying with their last speakers, much has been done to preserve Visayan's diversity: although it is not a codified written language, volunteers give radio broadcasts in it, books are published in it (here the lack of codification shows in variance in spelling, verb conjugation, and sentence structure), and similar. Thank you to this Wikipedian for doing something to preserve a wonderful language (I mention in another comment that I am fluent and miss speaking it regularly).


Clicking on "Random article" at https://ceb.m.wikipedia.org/wiki/Espesyal:Random#/random , it looks like every article is about either a tree, an animal, an insect, or a place...


And looking at one whole initial article it generated:

https://ceb.wikipedia.org/w/index.php?title=Klakkabekken_(su...

it describes where the place is and gives the citation as "found in the Geonames.org database".


The bot in question was originally created for plants and animals, so that makes sense.


So they mean to tell us "insignificant" facts and articles must be deleted?


The German Wikipedia would be twice as big if the mods weren’t obsessed with some made-up criteria of relevance.


That's sad. In the end, all this would (if it doesn't already) make them just go for the English version. I already do this (I'm Brazilian), as the Portuguese version is nowhere near the international (English) version in terms of completeness and being up to date.

BTW, do you have a link for their terms on "relevance"?


The English version isn't particularly free either. I attempted to add a page about a file format that is fairly well used but doesn't have a huge amount of information online about it. The only real source is a zip file from a company's website which contains a PDF with the file spec and some example programs. Unfortunately, the editors decided that due to the lack of citable sources, they would rather no article exist at all.


This bullshit policy drives me mad. I will start donating regularly once it's cancelled. Not sooner, nor later.


I understand it for some cases where the mods just need to stop people making up random crap on topics that don't exist or can't be verified. But in this case a single reference is more than enough to write the whole page because the spec is literally the only source of truth on the topic.

Unfortunately, I think the mods may be so passionate about "protecting the integrity of Wikipedia" that they let legitimate content be deleted. It also doesn't help that the Wikipedia UI for disputes and edits is really confusing; I had a hard time trying to work out what was going on or how to communicate with the moderator. The whole system is designed for power users only.


It's important to note that there are not only "the mods", but two opposing factions in Wikipedia: the Deletionists and the Inclusionists. [1]

I too agree that we don't need articles on someone's cat, but I've had articles deleted as not notable on indie web comics and indie role-playing games with hundreds or thousands of readers or copies sold.

I thought that the fact that the RPG was published and publicly available, and was being discussed in RPG forums, would make it notable enough, especially when it was mentioned as an inspiration for rules in more traditional RPGs. But since they hadn't been mentioned in any published articles, they were deleted, and there was no real way for me to fight it. I had added things like the list of contributors, the publishing year, and an overview of the rules and setting, with no personal discussion of the game.

The result was that I stopped trying to improve Wikipedia, because I don't have the time or interest to fight people with an infinite amount of time who delete my additions. My main contribution to Wikipedia wouldn't have been to the articles on Barack Obama or World War II anyway, as those already have expert contributors adding information. I could have brought information on my specialized topics of interest, but realized that they would all seem non-notable to someone who isn't interested in the same thing and would be deleted.

[1] https://en.wikipedia.org/wiki/Deletionism_and_inclusionism_i...


Out of curiosity, what file format?


.fit. It stores GPS logs along with sensor readings for things like power/heart rate/etc. It's very commonly used in cycling computers.


Fileformats.info is a thing. You could try to add it to their wiki; then you would have a "reliable" source for the English Wikipedia. Specialized websites are usually considered "reliable" for that purpose.


I actually did reference a user-created page on the OSM wiki about the format, which matched what the official docs said, but was told that wikis are not a citable source for Wikipedia.


> BTW, do you have a link for their terms on "relevance"?

The German ones? You can find them here: https://de.wikipedia.org/wiki/Wikipedia:Relevanzkriterien

Some might make sense, some are rather arbitrary and depend on the person that is going to delete your post.

I tried getting into Wikipedia and supporting the work, but the atmosphere in Germany is rather toxic. Let it die and look at the English version.


Thank you very much! That's what I was curious about.

But you are absolutely right, it's just better to stick with the English version.


[flagged]


All articles that fail to meet criteria are to be purged.


Slightly pedantic but the largest "Wikipedia" (depending on how you define it) is http://wikidata.org/ and it's also primarily written by bots.


That’s a wiki, not a Wikipedia.


but both are wikimedia projects


And it's a data model that's actually suited to being written by bots. Instead of … whatever this is.



