So Long Surrogates: How We Moved to UTF-8 in Haskell (channable.com)
140 points by wofo on April 27, 2022 | hide | past | favorite | 131 comments


If only Windows, Java and JavaScript could also move away from internal usage of UTF-16. It's purely a legacy format, and the worst of both worlds (UTF-32 and UTF-8). Even worse, Unicode itself, which should in theory be an encoding-independent list of codes for glyphs, modifiers and other script-related values, had to reserve some codes as "surrogates" just for the UTF-16 encoding. UTF-8 doesn't need such a thing...


I have an old saw about UTF-16 not being an irredeemable format and UTF-8 eating the world being bad, and I'm happy to dig it out again.

UTF-16 is great for lots of East Asian languages, which billions of people use. In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.

The main benefit of UTF-8 if you're say, Chinese, is interop. Everything else is worse.

You might think "but BOMs are super evil." Checking a BOM is extremely, extremely easy. Furthermore, you don't get to bail out of checking anything just by using UTF-8, you have to check to ensure you have _valid_ UTF-8. That's right, you gotta scan the whole bytestream anyway, so you may as well just check the 2-byte BOM at the beginning too.

You might also think "what about ASCII compatibility?" ASCII compatibility is an anti-feature. You should never be indexing into UTF strings (you always have to iterate, or save the results of an iteration), upper/lowercasing isn't addition/subtraction, etc. etc. You also can't just forget about encodings as a result--you can store ASCII in something expecting UTF-8, but you definitely can't store UTF-8 in something expecting ASCII. So if you're sniffing/decoding/tagging a format anyway, you may as well be agnostic.
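A rough Python sketch (Python just for illustration) of why indexing into encoded text is a trap and iteration is the safe operation:

```python
# Byte offsets in UTF-8 (like code-unit offsets in UTF-16) don't
# correspond to characters, so "index into the string" misleads.
s = "naïve"                       # 5 code points
utf8 = s.encode("utf-8")          # 'ï' alone takes 2 bytes

print(len(s), len(utf8))          # 5 code points, 6 bytes
print(utf8[2:3].decode("utf-8", errors="replace"))  # half of 'ï' -> U+FFFD

# The safe operation is iterating over decoded code points:
print(list(s))                    # ['n', 'a', 'ï', 'v', 'e']
```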

You might also think "OK OK, you could be right, but what about HTML, which is mostly ASCII and would nearly double in size if it went from UTF-8 to UTF-16." Practically all HTML is gzipped, so the difference is pretty small, plus the majority of text isn't HTML (almost anything stored in a database, almost anything in a file on your computer, etc.)

Different encodings are good at different things. There's no one superior encoding for all uses. What we need is text encoding agnosticism.

---

In fairness, I will say I've heard that UTF-8 is pretty popular in countries with exactly the kind of languages I'm talking about, so the issue is mostly moot at this point. I just think UTF-16 gets a really bad rap, and I think we shouldn't just gloss over UTF-8 having won because it's good for European languages.


The main evil of UTF-16 is that it's usually not UTF-16, it's UCS-2, which is to say, a fixed-width format which has been shoehorned into a variable-width format to support non-BMP code points. As a consequence, systems that are built on "UTF-16" (i.e., really UCS-2) use designs centered around the ability to random-access Unicode strings despite the fact that doing so reliably isn't possible.
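A small Python sketch (illustrative only) of why random access by code unit breaks down once non-BMP characters appear:

```python
# UTF-16 is variable-width too: a non-BMP code point such as U+1F600
# takes two 16-bit code units (a surrogate pair), so indexing by code
# unit can land in the middle of a character.
s = "a\U0001F600b"                          # 3 code points
b = s.encode("utf-16-le")
units = [int.from_bytes(b[i:i+2], "little") for i in range(0, len(b), 2)]

print(len(s))                               # 3 code points
print(len(units))                           # 4 code units
print(hex(units[1]), hex(units[2]))         # 0xd83d 0xde00: the pair
```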

> ASCII compatibility is an anti-feature.

Actually, it's not. One of the main reasons it's important is...

> What we need is text encoding agnosticism

... that bit that you want. If you're supporting multiple encodings, you need a reliable way of signalling that encoding. And file metadata tends to be insufficiently reliable, so it ends up being in-band for text (e.g., HTML files). If all the encodings you need to support are ASCII-compatible, and the directive to set the charset is itself pure ASCII, that means it is perfectly safe to parse the file as ASCII to figure out how to parse the file. If the charset isn't ASCII-compatible... well, that historically has led to security vulnerabilities. Ergo, a non-ASCII-compatible charset has heightened security risks.

That said, my own experience when dealing with the mess that is charsets is that supporting multiple charsets is itself a painful process that isn't worth it.


> systems that are built on "UTF-16" (i.e., really UCS-2) use designs centered around the ability to random-access Unicode strings despite the fact that doing so reliably isn't possible.

100% this. In javascript you can accidentally cut characters in half. This lets you accidentally construct strings which have invalid encodings - and that can lead to all sorts of weird problems. In comparison, using rust's safe API, it's impossible to construct a string which contains invalid unicode.

For example, if I put an emoji like this U+1F970 in javascript, I can then split it in half and put one half into a JSON string - like this: "\"\\ud83e\"". Javascript will treat this as "one character". Rust will treat this as an encoding error, and will refuse to parse it back into a string.

Another implication is that javascript's .length field looks useful, but it's almost never what you actually want. How many unicode characters are in this string: "\ud83e\udd70"? Trick question - even though javascript says the string's length is 2, there's only 1 unicode character in that string. And how many bytes does this string (with JS length 1) "\u270C" take up over the wire? Javascript says the length is 1, but it takes 2 bytes to send as UTF-16 and 3 as UTF-8.
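For anyone who wants to poke at this, here's a sketch of the same mismatch in Python terms (chosen only because Python exposes code points, UTF-16 code units and bytes separately):

```python
rose = "\U0001F970"                        # the emoji from above

print(len(rose))                           # 1 code point
print(len(rose.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (JS .length)

print(len("\u270C".encode("utf-8")))       # 3 bytes over the wire as UTF-8
print(len("\u270C".encode("utf-16-le")))   # 2 bytes as UTF-16

# And a lone surrogate half really is invalid on its own:
try:
    "\ud83e".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```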

What a mess.


This feels less like UTF-16 being a bad format and more like JS being a bad language.


You're right, but in practice most of the systems I've dealt with that claim to support "UTF-16" really mean they support UCS-2, and if you need to do anything outside the BMP you can help yourself by checking surrogate characters.

At least I haven't seen languages claim UTF-8 support by just giving you raw byte arrays.

Before the advent of emoji (which I'm personally grateful for), the vast majority of developers weren't even aware that UCS-2 is not UTF-16, that Javascript's String functions are totally broken outside the BMP, and that you actually have to check surrogates yourself to properly decode strings.

And of course, it's not only Javascript... all systems/languages that mistakenly adopted UCS-2 ended up in similar places.


Yeah and I don't mean to gloss over these problems. Like, how are you supposed to realistically use UTF-16 in JS? Super clumsily apparently.

IDK, again probably you can goof using UTF-8, but it does seem harder, like writing a bad byte or whatever. It would be better if string libraries actually worked though, maybe in 2023.


I'm pretty sure C#, java and C++ (using UCS2 strings) all have the exact same problem.


At least Java I doubt it. char holds any valid UTF-16 value, so without casts I think it's probably pretty hard. Not an expert here though.


A Java char can hold any valid UCS-2 character, i.e. U+0000 to U+FFFF. If you need to use, say, U+1F4A9, you can't represent that with a char, so you need to use an int to hold that codepoint. A cursory glance at Java's string and character APIs reveals an awful lot of duplication between char-based "character at" and int-based "codepoint at" methods, with the former being a) broken and b) largely undeprecated despite being clearly broken.


Yeah I think mostly you're right. The `charAt` stuff assumes UCS-2. I think the `codePointAt` stuff is fine though? Unless you think we'll get to 2 billion code points I guess (then, I've always thought Java having no unsigned types was bananas)


In a UTF-16 string, as implemented in java, you can end up with a char that represents only half a surrogate pair, that's not a full unicode scalar. In order to build the unicode scalar, you need the other half of the surrogate pair, which is another java "char".


Yeah this is probably true. I haven't checked it in repl.it, but my other experiments basically showed that UTF-16 as UCS-2 is fine, getting above 0xFFFF (bigger than a Java char) causes problems.

It's really disappointing, honestly. Go has rune, which is 32-bit, and adding a new rune-based API seems doable. But hey, what do I know.


Maybe. But the pattern seems to be much more common in UTF-16 systems than in UTF-8 systems.


Idk this seems vague, also it's probably pretty easy to write invalid UTF-8 (or not validate it, or get DOS'd because you panic on invalid UTF-8 in an environment variable or user input, etc.)


> it's probably pretty easy to write invalid UTF-8

In which language(s)?

Its quite hard to write invalid UTF-8 in Rust, because the UTF8 encoding is checked when the string is created. I think this is the right choice.

Javascript, C# and java should work like rust, since a string should always ... y'know, contain a correctly encoded string. The current situation in these languages feels like a weird middleground where String classes are actually arrays of 16 bit integers wearing a fur coat.
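A sketch of that "validate at the boundary" design in Python terms (Python's str happens to behave like Rust's String here: the bytes-to-string conversion is where the check lives):

```python
good = b"caf\xc3\xa9"        # valid UTF-8 for "café"
bad = b"caf\xc3"             # truncated multi-byte sequence

print(good.decode("utf-8"))  # café

try:
    bad.decode("utf-8")      # malformed bytes never become a string
except UnicodeDecodeError:
    print("rejected at creation")
```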


> Its quite hard to write invalid UTF-8 in Rust, because the UTF8 encoding is checked when the string is created. I think this is the right choice.

Oh 100%. You gotta validate every time. Again I'm not a C# or Java expert; if their stdlibs let you build invalid UTF-16 strings then that's unfortunate, but again a problem with an implementation, not an encoding.

In other words, if Rust used UTF-16 internally it would behave exactly the same as you're describing.


> In other words, if Rust used UTF-16 internally it would behave exactly the same as you're describing.

This is a good point. I suppose the real problem here isn't UTF-16. It's that all these languages and systems assumed unicode would never grow past 65536 characters. They weren't designed to use UTF16. They were designed to use UCS2. It was only after unicode grew past UCS2's limit that they added awful hacks to their systems in order to (badly) support UTF16. The problems I'm complaining about aren't UTF16's fault. They're problems that arise from the hacks these languages added in order to support surrogate pairs.

You could build a string type in rust today which used UTF16 strings internally. It would work fine, and be just as safe as the UTF8 strings in the standard library. I just really don't see the point. The verdict is in. UTF8 has won the war to be the world's interoperable format for unicode text. UTF8 is usually smaller than UTF16 (except apparently in some parts of Asia). And it doesn't need byte-order marks or any of that guff. It's a relief that this point is basically settled.

I hope that a future generation of programmers can grow up without needing to understand UCS2, UTF-16, byte-order-marks, surrogate pairs or any of this junk.


Without some specific way of knowing what encoding you're dealing with, you can have this problem with any number of encodings (including ASCII suddenly having UTF-8 bytes in it). As soon as you're in an "assume an encoding and maybe be surprised" situation, you're asking to be surprised.

Totally agree these discussions circle around to "you need some kind of OOB way to know what encoding you're dealing with", which puts us back to everything being encoding agnostic, with some way to configure it. Maybe that's locales, maybe that's a 32-bit type tag at the beginning, dunno.


No, the best way is not to have people specify what encoding they intend to use--it is instead to not give people a choice. When you give people a choice, given that one encoding is far dominant above all others, many people will start assuming that is the only viable choice regardless of any actual declarations to the contrary.

An example of the latter effect in action is the system charset of Linux and most other Unixen... in practice, virtually every system is using UTF-8 for things like pathnames and the like. However, the mechanism to indicate that this is the case is frequently enough unreliable that it's often less bug-prone to assume the system is UTF-8 anyways.

Of course, there's a few more advantages to always using UTF-8:

* You eliminate the code path of "I'm sorry, I can't save the text that you wrote," since every character is usable in UTF-8, which is a feature very rare outside of UTF-* charsets. And, honestly, that's a wonderful code path to eliminate.

* One of the features of UTF-8 is that virtually any text that isn't UTF-8 will fail to parse as UTF-8. This allows you to build systems that produce errors on bad input, which means that maybe, just maybe, the people who produced the broken input will discover their problem and fix it.


> in practice, virtually every system is using UTF-8 for things like pathnames and the like.

I mean, *nix filenames are a pretty weird case where you have these requirements of 0x2F always being a path separator and 0x00 always meaning "end of path". In truth, we can't guarantee anything about filename encoding other than those two things. So, for example, you might be dealing with UTF-8 or EUC-JP. In practice (as a sibling comment pointed out) all you usually need to do is scan for 0x2F and stop at 0x00, so this works out OK, but if you ever need to do anything else, welcome to guessing the encoding based on locales, environment variables, or other (OOB) configurations.

> However, the mechanism to indicate that this is the case is frequently enough unreliable that it's often less bug-prone to assume the system is UTF-8 anyways.

My argument, which I guess looking at it I've not articulated at all in this thread, is that UTF-8's ASCII compatibility and *nix filename compatibility have allowed us to be slack in fixing this. It oughta be possible to configure an encoding when creating a filesystem, just like creating a database. There's nothing inherently buggy about configuring an encoding, we just haven't gotten our systems programming act together, and UTF-8's amenability to the existing situation is probably at fault here.

> since every character is usable in UTF-8

I guess you mean like, vs. EUC-JP or something? Sure but UTF-16 is fine here too.

> One of the features of UTF-8 is that virtually any text that isn't UTF-8 will fail to parse as UTF-8.

IDK the "brittleness" of an encoding feels hard to quantify. Besides, I think if you're doing the (always required) validation step, this goes for any input. Look at things like weirdo MySQL/Oracle/MSSQL database encodings for example.

---

I should say I see where you're coming from. But I think we'd be in just as good a place if we specified a filename/text encoding at filesystem creation (OOB configuration) and quit relying on magic 0x2F/0x00 bytes. I don't know why this makes people tear their hair out as some kind of impossible problem to solve.


> including ASCII suddenly having UTF-8 bytes

People have been dealing with 8-bit ASCII extensions for longer than Unicode has been around. It isn't overwhelmingly difficult to have 8-bit clean text handling without any awareness of the encoding.


If you care about text size, you should compress your text; that'll save much more space, since it can optimize for what's actually used in the document.

> ASCII compatibility is an anti-feature.

ASCII compatibility is extremely useful if you're working with, for instance, filenames or programming languages. You can lex UTF-8 and handle separators like `/` or quotes like `"` and `'`, because those bytes can never occur otherwise.
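A quick Python sketch of that lexing property (the filenames are illustrative):

```python
# Bytes below 0x80 in valid UTF-8 only ever encode ASCII characters, so
# scanning raw bytes for '/' can't produce a false hit inside a
# multi-byte character.
path = "ディレクトリ/ファイル.txt".encode("utf-8")

print([p.decode("utf-8") for p in path.split(b"/")])

# Every byte of a multi-byte sequence is >= 0x80:
print(all(b >= 0x80 for b in "ディレクトリ".encode("utf-8")))  # True
```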


Sure it's useful, but it's not at all a guarantee. You can't assume the byte stream is UTF-8, and UTF-8 has a NUL in it.

Reading about it, filename encoding looks like a mess, GLib apparently assumes UTF-8 unless you set it with an environment variable, which I think is wild. Qt uses the locale, which feels reasonable (ignoring that locales are themselves bananas). But assuming UTF-8 and relying on ASCII compatibility and no NULs feels risky.


ASCII has NUL too. You can't put it in a filename either way.


Yeah but there's no reason to put it in strings because it's not printable. When you get to the 2/3/4 byte-wide parts of UTF-8 they can have zeroes in there. This is a long-winded way of saying UTF-8 can't really be NULL-terminated because it can contain NULLs as a matter of course. That's where you get modified UTF-8, which is indistinguishable from regular UTF-8 other than its NUL encoding, which if you don't know it's the modified derivative will sneak up on you, just like UCS-2 vs UTF-16.

This stuff just has to be OOB configurable. ASCII-compatibility lets us think we can do stuff like this, but you're always just assuming and hoping it doesn't blow up.


> This is a long-winded way of saying UTF-8 can't really be NULL-terminated because it can contain NULLs as a matter of course.

UTF-8 "can contain NULLs as a matter of course" in exactly the same way that ASCII can.

Any system that doesn't allow null bytes in ASCII can just as easily not allow them in UTF-8.

A system that does allow them can just as easily allow them in both.

There is no difference between the two here.

-

Edit: Oh, wait, maybe you misunderstand UTF-8. The 2+ byte encodings never use bytes below 0x80. They never cause a null byte to happen. The only way you get a null byte is if it's an ASCII-compatible null character, code point 00000.


Oop, sorry it's late here and I forgot that. OK, I'll grant you UTF-8 (or I guess any NULL-terminatable encoding like Shift JIS or EUC-JP) is better for filenames.


UTF-8 doesn't contain zero bytes, except for code point zero. Anything above code point 127 will be encoded as multiple bytes, all of which have bit 7 (0x80) set and are therefore not zero bytes. So NULL-termination works fine for UTF-8.
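This is easy to check exhaustively in Python, if anyone wants to convince themselves:

```python
# Every code point except U+0000 (surrogates excluded, since they can't
# be encoded at all) produces a UTF-8 sequence with no zero byte in it.
everything = "".join(
    chr(cp) for cp in range(1, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)
print(b"\x00" in everything.encode("utf-8"))   # False
print("\u0000".encode("utf-8"))                # b'\x00' -- only U+0000
```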


I'm Chinese, and I prefer UTF8 to UTF16.

> You might think "but BOMs are super evil." Checking a BOM is extremely, extremely easy. Furthermore, you don't get to bail out of checking anything just by using UTF-8, you have to check to ensure you have _valid_ UTF-8. That's right, you gotta scan the whole bytestream anyway, so you may as well just check the 2-byte BOM at the beginning too.

BOM breaks substring. Every time you take a substring, you need to prepend a BOM if you want to serialize it somewhere else. If you want to concat two strings with BOMs, you need to remove one of them. All of these are unnecessary pains.
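You can watch this happen with Python's BOM-writing "utf-16" codec (just as a demonstration; any BOM-prefixing serializer has the same issue):

```python
# Each encode() writes its own BOM, so naive concatenation of two
# serialized chunks plants a stray U+FEFF mid-string on decode.
a = "Hello, ".encode("utf-16")
b = "world".encode("utf-16")

joined = (a + b).decode("utf-16")
print(repr(joined))     # 'Hello, \ufeffworld' -- the second BOM survives
print(len(joined))      # 13 characters, not 12
```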

> You might also think "OK OK, you could be right, but what about HTML, which is mostly ASCII and would nearly double in size if it went from UTF-8 to UTF-16." Practically all HTML is gzipped, so the difference is pretty small, plus the majority of text isn't HTML (almost anything stored in a database, almost anything in a file on your computer, etc.)

This just contradicts your own reasoning that UTF-16 is better for East Asians due to size savings.


That is a minor nuisance when dealing with raw bytes. Any sane language representation would probably store it as a separate field, even if not on the wire.

I would probably do BOM, LUT, BYTES, where LUT is a lookup table of indices for every 64 chars so that you could do O(1) random access for the vast majority of cases.


Substrings just get the BOM too. EZ.

Re: HTML, people usually bring it up because it's a lot of ASCII that'll all double up under UTF-16. Mostly my argument is that most text isn't HTML, so size savings on other text files still matter.


Which standard string library ever does that?


Are there UTF-16 string libraries that let you split a UTF-16 string into a string with a BOM and a string without it? That would feel like a bug to me.


> In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.

Meanwhile in lots of other countries, UTF-8 lets you use 1 byte to store each character instead of 2; which is also a huge savings.

The question is, is it better for the world to pick one encoding and have all programs use it, or pick several encodings and switch between them depending on the context / country? Picking one encoding means we can reuse our code more easily. Picking multiple encodings means we get slightly smaller file sizes.

For my money, the best approach is to use UTF-8 everywhere. This lets us reuse code. Then if you're worried about file size on disk or over the wire, compress your text content with LZ4 or snappy or something. That'll halve the size of text for everyone, and LZ4 is so fast it's essentially free from a computational standpoint.
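A rough demonstration using stdlib zlib as a stand-in for LZ4/snappy (the sample text is made up and deliberately repetitive; real ratios vary):

```python
import zlib

text = "压缩可以节省空间。" * 200          # repetitive Chinese sample

utf8 = text.encode("utf-8")               # 3 bytes per character
utf16 = text.encode("utf-16-le")          # 2 bytes per character
packed = zlib.compress(utf8)

print(len(utf8), len(utf16), len(packed))
print(len(packed) < len(utf16))           # True: compression dwarfs the gap
```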

As this blog post demonstrates, quite a lot of code needs to deeply understand text encoding to work (be that UTF-8 or UTF-16 or whatever). It's a big savings in programming hours if we only have to write this code once.


I mostly disagree. We already have lots of code that deals with tons of different encodings, UTF-16 is pretty easy and there's lots of libraries already, etc. But if we ignore all that, your argument is against UTF-8, which succeeded UTF-16.


> UTF-16 is great for lots of East Asian languages, which billions of people use. In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.

"Only ever" is an exaggeration. The vast majority of standard Chinese (aka. Putonghua) characters are in the BMP, but occasionally there are characters outside of it. In particular a lot of Hong Kong characters are outside of the BMP, and before emoji forced software to fix things, systems claiming to support UTF-16 (but were instead UCS-2) kept failing for say 1% of the characters.

I've had much better luck with UTF-8 systems than purported UTF-16 ones. The problem with UTF-16 is not necessarily in its inherent technical issues, but rather, it's how often systems with UCS-2 support just claim they support UTF-16, and leave users/developers with a crappy experience when working with characters outside the BMP. There's also the encoding agnosticism thing -- with UTF-16 you run into issues about byte order marks (with or without), endian problems, and generally incompatibility with ASCII.
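The failure mechanism is easy to see with any character outside the BMP (using the CJK Extension B ideograph U+20000 here just as an example):

```python
c = "\U00020000"                        # a CJK ideograph outside the BMP

print(len(c.encode("utf-8")))           # 4 bytes
print(len(c.encode("utf-16-le")))       # 4 bytes: the UTF-16 savings vanish
print(len(c.encode("utf-16-le")) // 2)  # 2 code units: not representable in UCS-2
```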

As I said, the situation probably improved quite a bit after emoji became popular. But if I had the choice, I'd choose UTF-8 over UTF-16 any time.

Also, although it's generally true that you "can't store UTF-8 in something expecting ASCII", in practice a lot of systems are somewhat more graceful than that. For example, if I grep a UTF-8 text file, grep doesn't need to know the text encoding if all you're trying to find is an ASCII string. Similarly it's possible to edit a UTF-8 text file as an ASCII file if the editor preserves the "binary" bits.

In conclusion, "different encodings are good at different things" is only true when considered in isolation. The historical baggage of UCS-2, and separately, of ASCII-compatibility tilts the balance strongly in favor of UTF-8 IMHO. The alleged 33% saving in Chinese text doesn't really matter that much in the grand scheme of things -- which is perhaps why people use UTF-8 in spite of the "advantages" you mentioned.


Yeah I mean, I broadly agree with what you're saying. Text encoding is a mess, and pragmatic solutions are needed. And when jamming text in various places, jamming UTF-8 is typically more successful than jamming in UTF-16, because of NULL bytes and the shortsightedness of UCS-2.

But I think that 99% of the things you have to do for UTF-8 (validate, encode/decode, no indexing) you have to do for UTF-16, and if we're arguing for systems that are UTF-aware, UTF-16 does great. If we're arguing for trying to jam encoded bytes into systems, well it's a crapshoot. UTF-8 being better at winning that crapshoot doesn't feel like a recipe for robustness to me, and certainly doesn't strike me as a perfect solution. OOB configuration does, though.

> For example, if I grep a UTF-8 text file, grep doesn't need to know the text encoding if all you're trying to find is an ASCII string.

Well, this is a weird example. grep uses locales, so yeah you're probably fine if you're grepping for ASCII using an ASCII-compatible locale. But if you're not, then what you feed to grep has to be encoded in your locale's encoding (I think). So I think this is actually an example in favor of "this should be OOB configurable", and an example of "ASCII-compatibility lulls into a false sense of security".


> I have an old saw about UTF-16 not being an irredeemable format and UTF-8 eating the world being bad, and I'm happy to dig it out again.

> UTF-16 is great for lots of East Asian languages, which billions of people use. In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.

That doesn't make UTF-16 not a terrible format, though. East Asian countries use better formats instead. For example, in China they use GB18030.

When your case for "UTF-16 is not an irredeemable format" is "it's better than UTF-8 at a task where nobody would use either format anyway", you're not making a strong case.


> When your case for "UTF-16 is not an irredeemable format" is "it's better than UTF-8 at a task where nobody would use either format anyway", you're not making a strong case.

Haha this is pretty fair; I know I'm being pedantic. But I wouldn't say any GB* encoding is better than UTF-16. Another commenter pointed out that you really want the ability to--say--paste arbitrary text into an editor and for that editor to be using an encoding that can handle it.

Or even for the web, if we actually took the Accept-Language header [0] seriously, that could also be a big savings.

[0]: https://www.rfc-editor.org/rfc/rfc7231#section-5.3.5


> You might think "but BOMs are super evil." Checking a BOM is extremely, extremely easy.

Checking the BOM is not enough, you also have to handle it. Non-LE BOMs (like surrogates, at least before the emojipocalypse) are rare enough that many "UTF-16" based tools simply only support UTF-16LE. They are also stupid because you need to know that you are dealing with UTF-16LE or UTF-16BE in the first place - either through heuristics or because it is specified, and then you can also guess/specify the byte order.

And like surrogates, the BOM is another wasted Unicode character that is only needed for the UTF-16 mess.

> Furthermore, you don't get to bail out of checking anything just by using UTF-8, you have to check to ensure you have _valid_ UTF-8. That's right, you gotta scan the whole bytestream anyway, so you may as well just check the 2-byte BOM at the beginning too.

You don't need validation for most UTF-8 tasks, GIGO is often more reasonable for things entered by humans.

> you definitely can't store UTF-8 in something expecting ASCII.

Unix tools would disagree.

$ echo "Hello Wörld!" | tr '!' '?'
Hello Wörld?

Many tools only care about finding substrings. UTF-8 guarantees that if the byte sequence making up the encoding of a Unicode character (or sequence) appears in valid UTF-8 encoded text, then it also decodes to that Unicode sequence.
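In Python terms (with illustrative strings), that guarantee looks like this:

```python
# Because UTF-8 lead bytes and continuation bytes occupy disjoint
# ranges, a byte-level substring match can only occur at a real
# character boundary -- no false positives.
haystack = "Grüße aus Köln".encode("utf-8")

print("ö".encode("utf-8") in haystack)   # True: there is a real 'ö'
print("é".encode("utf-8") in haystack)   # False: no accidental byte match
```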

> Practically all HTML is gzipped

Not when it is being parsed it isn't.


> Non-LE BOMs (like surrogates, at least before the emojipocalypse) are rare enough that many "UTF-16" based tools simply only support UTF-16LE.

Gymnastics like this always occur when you don't know the encoding. If you didn't know you were getting UTF-8, welcome to heuristics or whatever.

> And like surrogates, the BOM is another wasted Unicode character that is only needed for the UTF-16 mess.

This isn't a big deal at all.

> You don't need validation for most UTF-8 tasks, GIGO is often more reasonable for things entered by humans.

You can't build robust systems this way. You always have to validate, and in order to do that, you gotta know what encoding you're getting.

> Unix tools would disagree.

Most of those use locales, or they don't expect ASCII they expect bytes. Again you can get lucky with this, but you can't build a robust system with luck.

> Many tools only care about finding substrings.

You have to scan through the string or save offsets (after an initial scan) either way, UTF-8 or 16.

> Not when it is being parsed it isn't.

Fair point!


> The main benefit of UTF-8 if you're say, Chinese, is interop. Everything else is worse.

Well, for HTML it turns out to be a wash.

While Han characters are typically three bytes in UTF-8, markup is ASCII and so only one byte — whereas in UTF-17, both the character data and the markup are two bytes. When you add them together the average cost per character is more or less two bytes, whether your Chinese-content HTML is encoded as UTF-8 or UTF-17.
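The arithmetic behind the wash, assuming a 50/50 split of markup and Han text (the ratio is the assumption; real pages vary):

```python
markup_chars = 1000    # ASCII markup: 1 byte in UTF-8, 2 in UTF-16
han_chars = 1000       # Han characters: 3 bytes in UTF-8, 2 in UTF-16

utf8_total = markup_chars * 1 + han_chars * 3
utf16_total = markup_chars * 2 + han_chars * 2

print(utf8_total / (markup_chars + han_chars))    # 2.0 bytes per character
print(utf16_total / (markup_chars + han_chars))   # 2.0 bytes per character
```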


This is probably true for HTML, but not true for lots of other text like whatever's stored in databases or on hard drives.


(LOL. How did I type "UTF-17", twice?)


Off-topic but yeah, 6 is so far away. I really notice it doing bitwise stuff with '^'.


> UTF-16 is great for lots of East Asian languages, which billions of people use. In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.

Not really. That's just an excuse to be contrarian.

Any significant length of text will be compressed and/or have lots of latin characters.

> you definitely can't store UTF-8 in something expecting ASCII

That's not true.


> Any significant length of text will be compressed and/or have lots of latin characters.

A reasonable counterargument is Asian books on Project Gutenberg [0]. I guess you can say they're compressed on the wire, but they're not compressed in my RAM or cache.

>> you definitely can't store UTF-8 in something expecting ASCII

> That's not true.

Well alright, you can jam bytes pretty much into anything. And you get luckier with UTF-8 than you do with probably any other encoding. But lucky isn't robust, is all I'm saying.

[0]: https://www.gutenberg.org/browse/languages/zh


> I guess you can say they're compressed on the wire, but they're not compressed in my RAM or cache.

You don't really need to have very much of a book decompressed at once, and in your average program handling that book there's going to be much more RAM dedicated to layout and display than the actual character bytes.

> Well alright, you can jam bytes pretty much into anything. And you get luckier with UTF-8 than you do with probably any other encoding. But lucky isn't robust, is all I'm saying.

It won't work in all systems but it'll work in most of them. And you can check beforehand.


In truth I think these are mostly arguments for letting apps/users pick their encoding. If GBK works for your app and user base, go for it. If UTF-8 ends up being more efficient, have at it. We can sit around hypothesizing about a book reading app or a grammar analyzing app or blah blah blah, but I'm pretty sure UTF-8 won't be the best encoding for every circumstance. Python at least used to let you do this at compile time, for example.


Allowing choice is a big overhead with almost no benefit. It's not worth that little bit of optimization in certain circumstances.


Honestly, not to get too high-roady here, I feel like we shouldn't be in the business of not "allowing choice" unless there's a very compelling reason, and "I'm against OOB encoding info" isn't compelling, at least not to me. Should we do this for all file types? Any kind of document or image? Everything's XML, sure it's not optimal, but hey at least we can define any image format we want in it.

I don't know why we treat text encodings any differently than any other binary format. We have OOB format info for all of those. People are just downright religious about UTF-8, and it's my goal in these threads to push (hopefully gently) against that. UTF-8 isn't always the best, you always have to validate, you shouldn't be indexing into strings, we should have OOB encoding info (whether in a filesystem config or whatever). This is the only possible path to robust handling of encodings: "always assume UTF-8 and hope non-UTF-8 systems get updated or fall into disuse" is a recipe for dealing with broken systems for decades (what we're doing now).


The first thing I think of there is browsing the web.

Different image formats have different use cases and substantial advantages over each other. And it's often impossible to convert without losing quality.

But everything gets to be HTML. Or the compatible variant of XHTML. Nothing else has any real support, and when it did, like flash web pages, that was not good.

For text unicode wins out easily, with UTF-8 and UTF-16 far above the other encodings, and the differences are minor enough that we should just use the better one.

> we should have OOB encoding info (whether in a filesystem config or whatever). This is the only possible path to robust handling of encoding

I don't know about that. Old files, which represent almost all the non-unicode non-ascii text we have, are unlikely to ever be properly tagged. For new files, are those tags really needed? If you have a firewall where you mandate tagging behind it, you're already mandating a format for attaching tags, so why not mandate utf-8 instead?


(author here)

You are absolutely right! If your use case demands storage of CJK strings, UTF-8 is probably not your best bet.

But our data (mostly stringly typed, think product descriptions or links) is >99% Basic Latin characters and we are getting to a point where memory is actually becoming an issue. So it's neat that Haskell allows us to "upgrade" to UTF-8. With the data being in memory I think compression would not be very helpful either.

Edit: I also kind of agree that UTF-16 gets too much hate :D


Besides the surrogate characters there are also some other noncharacters: https://www.unicode.org/faq/private_use.html#noncharacters

Because of modifier characters, control characters (e.g. for bidi), stuff like soft hyphens and ligatures, locale-dependent semantics (upper/lowercase, collation etc.), the general discordance between glyphs and characters, and so on and so forth, Unicode is so complex, and always requires such careful processing of code point (or code unit) sequences, that honestly the surrogate encoding doesn't make that much of a difference. It's just an additional wrinkle in a sea of wrinkles.


I still find the surrogates different. Bidi, private use, ligatures, ... are script or locale related.

Unicode uses numeric values from 0 to 1112063. You can invent all kinds of methods to encode numbers from 0 to 1112063 (variable length, fixed length, decimal, hexadecimal, anything else). But most ways I can think of to encode these numbers, including variable length ones that would use 8 bit or 16 bit primitives, don't require me to actually reserve some of those to-be-encoded numbers themselves for a special meaning. Yet for UTF-16 they managed to do it. Imagine that all other encodings out there would also want to reserve some Unicode values for their own purpose!
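The surrogate mechanism itself is simple arithmetic over the reserved ranges: subtract 0x10000, split the remaining 20 bits in half, and offset the halves into 0xD800.. and 0xDC00... A sketch in Python (the function name is mine; the algorithm is the standard UTF-16 definition):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a supplementary-plane code point (U+10000..U+10FFFF)
    as a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000               # 20 bits remain
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

This is exactly why U+D800..U+DFFF had to be carved out of the code point space: those 2048 values must never appear as characters, or the encoding would be ambiguous.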


You always have to work with sequences of code units anyway (instead of just single code points), so the individual reasons for that don't make much of a difference. It seems your rejection is more on aesthetic than on practical grounds.


As far as valid Unicode is concerned, you care about the distinction between code points and scalar values, and surrogates are the only difference.

Noncharacters can be represented in any Unicode encoding. Surrogate code points cannot, but can be found in unvalidated UTF-16 (which is most UTF-16).

Dealing with Unicode text semantically requires that you be aware of a great many factors such as those you name, but you don’t need to be aware of those for just storing and transferring Unicode text. But with surrogates, UTF-16 managed to break it for everyone: any part of the system that uses unvalidated UTF-16 can introduce errors that you must care about. Hence surrogates are a special kind of atrocity.


> unvalidated UTF-16 (which is most UTF-16).

I prefer the name WTF-16 [0]

[0] https://simonsapin.github.io/wtf-8/#ill-formed-utf-16
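Python makes the distinction easy to poke at: its strict UTF-16 codec refuses lone surrogates, while the `surrogatepass` error handler emits them anyway, producing exactly the kind of ill-formed "WTF-16" data most UTF-16 systems pass around (a small demo, not from the linked spec):

```python
lone = "\ud800"  # a lone high surrogate: a code point, but not a scalar value

try:
    lone.encode("utf-16-le")           # strict codec: rejected
except UnicodeEncodeError as e:
    print("rejected:", e.reason)

raw = lone.encode("utf-16-le", "surrogatepass")  # ill-formed "WTF-16"
print(raw.hex())  # 00d8
```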


How is that different from unvalidated UTF-8? Both are variable-length encodings that require decoding and have failure cases.


To begin with, the difference is one of convention. UTF-16 is almost never checked for well-formedness: not a single one of the major users of UTF-16 that I am familiar with validates (Windows wide API functions, JavaScript, Qt); I don’t know of any system that exposes UTF-16 strings as some kind of main string type that does check well-formedness. UTF-8, on the other hand, is normally validated. Not always, to be sure, but normally. (Probably the most significant “largely UTF-8 but not necessarily” thing is *nix paths and other related syscalls—which is not dissimilar from the Windows situation, except that one uses 16-bit code units and the other 8-. This would seem to weaken my point, but my point is again strengthened by the number of languages that can’t equanimously interact with non-UTF-8 paths on Linux, which is not such a problem for non-UTF-16 paths on Windows, for all that the latter are very much rarer.)

But beyond that, the difference lies in Unicode having been compromised for the sake of one particular encoding, and now there are all kinds of places that have to deal with potentially ill-formed UTF-16, more than potentially ill-formed UTF-8, I would say. Look at what Python 3 did: its string type is not a Unicode string, but rather a sequence of Unicode code points. That is, it can encode surrogates. Blech.
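That Python behavior is easy to demonstrate (hypothetical snippet): a `str` happily holds surrogate code points, it just can't be serialized as real UTF-8:

```python
s = "\ud83d\ude00"  # two lone surrogates; in a str they stay two code points
print(len(s))       # 2, not one emoji

try:
    s.encode("utf-8")  # real UTF-8 must not contain surrogates
except UnicodeEncodeError as e:
    print("not encodable:", e.reason)
```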


I’ll grant that the surrogate code points prevent round-tripping of arbitrary code point sequences (though BOM characters arguably do as well).

> Unicode having been compromised for the sake of one particular encoding

They really didn’t have much choice though, because UCS-2 existed and was in major use at the time Unicode was extended beyond 16 bits (which wasn’t part of the original Unicode design), and too many code points were already used up to turn it into an UTF-8-like encoding. Not supporting the additional planes with the 16-bit encoding would have been a nonstarter. Switching to 32-bit in-memory representation (UCS-4) even less so, given the resource constraints of the time. The reason platforms like Windows and Java didn’t add strict input validation was due to backwards compatibility, because they previously were UCS-2 where arbitrary sequences are valid.

One can lament the chronology that led to those technical choices, but they each were reasonable and appropriate choices under the constraints present at each point in time.

More importantly, surrogate characters rarely create any actual problems in practice. I haven’t encountered any in the past 20+ years (other than missing support for characters beyond U+FFFF). Other parts of Unicode cause significantly more complexity.


Missing support for characters beyond U+FFFF is the main problem caused by surrogates (their existence, even if indirect)—it normally comes of some kind of UCS-2/UTF-16 confusion. It’s not fair to disqualify them. The only (class of) case that I’m aware of for a long time where it’s not linked to that is with MySQL’s idiotic utf8 → utf8mb3 type.

You may not have encountered such bugs, but I’m very familiar with surrogate-related bugs, because I use a Compose key extensively. I haven’t been using Windows for the last year, but from time to time I would definitely encounter bugs that are certainly due to surrogates. On the web, I found bugs a few times, all but once in Rust WebAssembly things, such as https://github.com/Pauan/rust-dominator/issues/10. And even now I’m back on Linux, I know of one almost certainly surrogate-related bug: I can’t type astral plane characters in Zoom at all; pretty sure I had this problem back on Windows, too. Copy and paste, sure, but type, no, they become REPLACEMENT CHARACTER.

The history is unfortunate but I strongly refute that they had not much choice. UCS-2 should have been abandoned as a failed experiment. Certainly there had been significant investment into it in the last few years, but with the benefit of hindsight, switching to UTF-8 (which was invented before they decided on surrogates) would have made everyone’s life much easier, especially given its ASCII-compatibility, which would have allowed the Windows API to retire the misbegotten A/W split and return to sanity a few decades early.

Ah, BOMs. Haven’t seen one in years. Good riddance.


Windows is slowly but surely moving to UTF-8.

https://docs.microsoft.com/en-us/windows/apps/design/globali...

> As of Windows Version 1903 (May 2019 Update), you can [...] use UTF-8 as the process code page.

> Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps.

Of course if you use this, your app will only run on very recent Windows versions. But that's how it goes with OS features. We'll start reaping the benefits 10-20 years from now.


Microsoft is making improvements in their UTF-8 support. Getting rid of the `W` APIs will take forever. Java and JavaScript are even more stuck with UTF-16.


UTF-8 support for filenames would be a great start, to support windows filenames in a multiplatform way in C!


Filenames are tricky. You need to interpret them for display and when interfacing with other systems, but fundamentally they are not Unicode strings on either Windows or Linux; rather, they are sequences of WCHAR or char respectively (with some restrictions). That means that a UTF-8 API can never support the full range of filenames that might be present on a valid Windows filesystem. But you can have your cake and eat it too by having that API accept WTF-8 [0], which is a superset of UTF-8 that is specifically designed for this kind of interoperability.

[0] https://simonsapin.github.io/wtf-8/
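For comparison, Python solves the mirror-image problem on POSIX (byte filenames that aren't valid UTF-8) with the `surrogateescape` error handler rather than WTF-8, but the motivation is the same: smuggle arbitrary filename data through a nominally-Unicode API without loss. A sketch with a made-up filename:

```python
raw = b"caf\xe9.txt"                           # Latin-1 byte, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")  # bad byte -> lone surrogate U+DCE9
print(ascii(name))                             # 'caf\udce9.txt'

roundtrip = name.encode("utf-8", "surrogateescape")
print(roundtrip == raw)                        # True: lossless round trip
```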


But why do you care how the file names are stored on disk, as long as you can read directories and traverse paths using UTF-8?


The only standard C functions that manage files use zero-terminated byte strings as filenames. I'm not saying C made a great design choice here either, but the combination of this and Windows using UTF-16 makes it impossible to write code that opens a file on both Windows and other OSes without OS-specific #ifdefs.


There is a lot of C/C++ code out there that breaks with Windows unicode filenames, although most mature software projects don't just naively use fopen. Some C libraries allow you to use UTF-8 filenames for all platforms and do the right thing internally, and I'm a big fan of that.


Is UTF-16 part of the JVM or the Java language, or both?


Oh wow. That is really not very much pain, as described.

I have to say, I never thought that the benefit of Haskell having a horrible native string type would be "you can just upgrade strings like any other dependency," which is really kinda slick. You think about how much pain there was for Py2 -> Py3 where one of the big sticking factors was all of the distinctions around strings and encoding and byte arrays... this is comparatively quite nice. Makes me wonder how much of a programming language can be hotswappable.


> Makes me wonder how much of a programming language can be hotswappable.

For a research language that can make a lot of sense, not so much for a language to be used in industry.

The downside is that different libraries will have different string representations, so you can end up being forced to do a lot of conversion if you're using different libraries that have made different choices from each other, or from your own code.

There are at least 5 commonly used string types - String (linked list of Char), ByteString lazy & strict, and Text lazy and strict. The latter two have a good rationale for being different - byte strings are not necessarily text - but, for various reasons, they're often used to represent text anyway.

These five also have corresponding `readFile` functions - see https://www.snoyman.com/blog/2016/12/beware-of-readfile/ . As Snoyman recommends in that post, it's probably best to "Stick with Data.ByteString.readFile for known-small data, use a streaming package (e.g, conduit) if your choice for large data, and handle the character encoding yourself. And apply this to writeFile and other file-related functions as well."

The first comment on that post starts out with "This problem extends well beyond readFile." Having the string handling more standardized at the language level can make life quite a bit simpler for developers.


> String (linked list of Char)

That... sounds awful for performance. Is that a real thing?


It is a real thing, but it's a design decision that dates back to around 1990 when Haskell was very much purely a research language.

Haskell's type class capability (similar to traits or interfaces) was still new/experimental at that time, and one benefit of strings as lists is it allowed them to easily be manipulated using the same syntactic and semantic machinery - pattern matching, recursive processing etc. - as other list data.

These days real, non-trivial code uses much more optimized string representations, but the original String type still exists and is used in various standard library functions, like "error". But as the original commenter pointed out, "you can just upgrade strings like any other dependency," so if you want, you can always import some other library that uses e.g. the Text type for error messages, if you care for some reason.


Yes and it's even worse than you think... The characters are not even raw characters, they are “boxed” into data structures and then the linked-list code guards them with a “thunk,” so “If I have computed this char return the cached char, else compute it and save it to the cache and then return it.”

The performance is good enough for teaching and learning and mocking out examples... But the reason for the user libraries is that real string processing workloads need something way better as their default.


While it is surely bad for performance, it is not that bad in practice, because Haskell has many optimizations for dealing with these lazy, possibly infinite sequences.


Absolutely; that's how Erlang represents strings. (Technically, a linked list of integers; there's no such thing as a Char.)

It generalizes really well to Unicode; encoding is unnecessary. :p

On the other hand, I think the name of the function that converts an integer to its own string representation, integer_to_list, could have been chosen better.


Utf8 vs utf16 as the internal representation of the Unicode string type is mostly just an implementation detail.

This is very different from going from python2, which conflated bytes and ascii strings, to python3, which intentionally changed the api to propely distinguish sequences of bytes and strings.
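A two-line Python illustration of that distinction:

```python
b = b"caf\xc3\xa9"     # bytes: raw octets, no implied meaning
s = b.decode("utf-8")  # str: a sequence of code points, "café"

print(s.encode("utf-8") == b)  # True: decoding and encoding are explicit steps
```

In Python 2, `"café"` and its byte representation shared one type, and implicit coercions between them were a common source of latent `UnicodeDecodeError`s.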


This site saves 26 "statistics" cookies and 99 "marketing" cookies.

Really? Is all that necessary?


Your user agent saves the cookies. If you don't like it, change it.


Ignoring the privacy bit - 125 cookies is quite a bit of per-request overhead, especially in HTTP/1.1 where they are not compressed. I would say it's poor website design.


Heh, so I actually do this.

An incredible amount of the web just breaks. Twitter, Reuters, Imgur. Like it's one thing if, when I attempt to log in, your log in fails (and usually, logins fail to handle the error & will just loop back to the start, that's at least a start) but a lot of the web will have a flash-of-text and then nothing, & JS has crashed.


I do think cookies get unfair treatment.

They are things that your browser happily rebroadcasts back to the server with no real UI for it outside of the shitty devtool bar made for devs, even after all this outcry about cookies.

It reminds me of the meme of the guy riding a bicycle, throwing a branch into the spokes (rebroadcasting cookies), and then roaring in pain on the ground about how evil websites/advertisers are tracking him with cookies.

That said, what a lame HN thread on a post about Haskell.


I have come to accept it and just ignore it. Many times there would be a long thread having to do absolutely nothing about the topic at hand. Not a tangent but like completely unrelated, why are we even discussing this here kind of thing.

I wish there was a good way to visually differentiate where a new top-level comment starts, other than squinting at the indentation whitespace on the left of the mobile screen - more painful than necessary, I presume.


I couldn't help myself. Always desperate to find an opportunity to shove my 2 cents into the world. So imagine my glee when you provided me with another one!

Yeah, I think both (A) defaulting to auto-expanded threads and (B) making them annoying to collapse make HN worse than it could be.

You tend to read the top-level thread because it's already there. And then it ends up being longer than you expected, or you're trapped in a subtree that just won't end, or you just want to see what other people are saying. And there's no good way to move past it.

Would be nice to click the indentation to collapse the thread anywhere inside the tree.


I just scroll to the top and use the “next” link on the top comment (added with the prev and context links around October 27–28th last year I think).


Thanks a lot for sharing boogies, much simpler now.


Use HackerWeb. Top-level comments are highlighted, and it automatically collapses threads to show only top-level comments when there are a lot of comments.

https://hackerweb.app/#/item/31181595


Why shift the burden on the user and the user agent? The website is the only one to blame here.


Blaming the website for your own agent doing something you don't want it to is learned helplessness.

Every marketing cookie generates revenue for the website in some way or another. The website wants revenue, so it asks the user agent to maintain those cookies. The user agent agrees. Then the operator of the user agent gets upset that the website asked their agent to store the cookies? Get upset that your agent agreed, not that a request was made.

Or better yet, don't get upset at all and just solve the darned problem yourself. Is this Hacker News or Complier News?


Blaming the user-agent for accepting an abusive amount of cookies set by the website is outright bad faith.

The only entity with any real power to decide which cookies the website uses is the website itself.

Asking the user or the user agent to comb through cookies and decide, one by one, which ones seem marketing-related and which ones are technically required, and then block them, is way too much to ask from a regular internet user.

I have tried, but fail to see good faith in your reply.


The browser is the one who stores and sends cookies. It would be trivial to make that action explicit and only at the users request. That wouldn't even be a new feature, that used to be how things worked 20 years ago. Lynx is however the only browser left that I know that still asks you before storing cookies.

You don't even have to sift through cookies for this to work; you can just reject all by default until the user explicitly requests them to be stored (or use a whitelist, or wait until the user tries to log in, which would necessitate a cookie, etc.) Lots of possibilities.

> is way too much to ask from a regular internet user.

That's kind of the point. By making it all transparent and seamless browser makers played into the hand of marketing companies. If cookies had a cost and would degrade the user experience, they might be thinking twice before putting hundreds of them on a site.

Marketing companies are just making use of the tools they are given. And browser manufacturers gave them a lot of tools, while taking control away from the user.


There are many different yet legitimate uses for cookies. It's impractical to expect the user to sift through to find the ones that are necessary and the ones that aren't. Even if the browser requests them beforehand, how is the user supposed to know whether the request is for a marketing cookie or a functional cookie?

> That's kind of the point. By making it all transparent and seamless browser makers played into the hand of marketing companies. If cookies had a cost and would degrade the user experience, they might be thinking twice before putting hundreds of them on a site.

Cookies do have a cost, namely the bad PR from people complaining about the unnecessary tracking cookies. If you think that's not enough, then you are free to reject cookies as well to degrade your own experience. But they aren't mutually exclusive. Complaints and bad PR can also drive users away from the site and enact change.


For cookies to have a cost they would need to be visible first. Brave does that right, by not only blocking lots of them out of the box, but also by showing you how many it blocked straight in the address bar, without any extra clicks. Firefox in contrast doesn't do that. It doesn't even give an easy way to inspect the cookies, it just has a "Clear cookies and site data" button that doesn't even tell you what it has stored or what it is going to delete.

Simply put, browsers could do a lot better job of preventing this.

As for legitimate use, I don't really see much. Login handling is the obvious one, but I'd argue that login handling itself is in dire need of a rework and should be handled by a proper Web standard, not site specific hacks and "Save password" guesswork.


That's fair, I would love for browsers to give more transparency on the tracking front.

As for legitimate use cases, I think shopping carts on most online marketplaces use cookies.


> The browser is the one who stores and sends cookies.

The website is the one who decides which cookies to send in the first place. The browser never invents a cookie out of thin air.

> you can just reject all by default until the user explicitly request them to be stored

Which cookies should the user "request to be stored" and which cookies can the user safely ignore? How does the user tell them apart? Why should the user have to bother?

> If cookies had a cost and would degrade the user experience

Cookies are already degrading my user experience; you may have noticed the cookie consent popups on many sites. Those popups exist because cookies were being abused (ie. non-consensually) for purposes that are not essential to the functioning of the website. Such uses are now banned in the EU.

> And browser manufacturers gave them [marketing companies] a lot of tools

Browser manufacturers did not build those tools for the sake of marketing companies.


> The website is the one who decides which cookies to send in the first place.

I can't fault websites for making use of functions the browser offers them.

> Which cookies should the user "request to be stored"

Have a simple toggle button for "Save state for this website" and discard everything when that button isn't pressed. Most websites I visit I don't care about and have no need to keep any state. The few that I need to log into, I can just press that button. Knit that together with the "Save Passwords" function and it might be pretty much automatic most of the time.

> Those popups exist because cookies were being abused

Those popups exist because browsers failed to do their job. If the users wants warning for cookies, that's something the browsers can do just fine by itself, yet few do (e.g. Lynx).

> Browser manufacturers did not build those tools for the sake of marketing companies.

I'd disagree on that. Google makes their money with ads, so of course they'll optimize both Chrome and Search for maximum ad friendliness. Meanwhile Firefox is also run on Google ad money, so they can't step too far out of line either. There aren't many browsers that are built for the user first. The "you are the product" quote applies to browsers just as much as it does to Facebook.


> The only entity with any real power to decide which cookies the website uses is the website itself.

I have JS locked down and third-party cookies disabled. This site only managed to set one cookie for me because of my power to decide. Despite that, all content was readable.


Cookies as a mechanism are useful and required for a solid modern web experience. However, tracking cookies are arguably the opposite of that. A typical modern website with marketing comes with, I don't know, 100s of cookies. Are you really arguing that the user should be required to vet each individual cookie whenever following a link with unvetted cookies?

Or how do you solve this problem? Personally, the most I can be arsed to do is install some Adblock Plugin. I did that only a few months ago and I'm not even sure that it improved my experience by a lot.


> and required for a solid modern web experience

Absence of cookies doesn't make things unstable (non-solid?), and fuck knows what 'modern' is supposed to mean, or why it's good.

> Or how do you solve this problem?

Block all cookies except for rare moments like posting on HN, which then immediately get deleted. And no JS, which means CPU is trivial (so no burn-a-core-for-every-open-tab which is so common with page-sized pointless animations). Many problems can be solved if you want them to be.


But you realize you're the oddball that considers the problem solved like that? I'm not sure that being a "hacker" means to straight out refuse things. You're missing out on a lot of fun and inspiring information (and yes, many many hours wasted to irrelevant content).


You make your choices and I make mine. Should a person make the informed choice to immerse themselves in the web as-is with all its problems & risks, ok, but most people just pick the easy path then bitch after. I'm not one of them, and straight out refusal is in fact a viable option for me.

If I do need anything more, there's VMs. BTW what 'fun and inspiring information' do you refer to? Shadertoy is a loss I grant, but what else?


If you miss Shadertoy it won't be hard to imagine other similar things, of which there are plenty. Anything that requires interactivity beyond the one provided by HTML & CSS will obviously require Javascript. Any personalized experience (not only suggestions which yes are evil, but also personal storage) will obviously require cookies to function.

Deleting Cookies on exit (and/or at regular intervals) will probably not help much in terms of avoiding tracking, especially if you log back in using your reinitialized cookies.


> it won't be hard to imagine other similar things, of which there are plenty

which again you don't give.

> Anything that requires interactivity ... obviously require Javascript

jeez, no shit, I get it.

> (some defeatist blah about cookies)

Whatever.

You just persistently don't get it. These are my choices. I made them carefully. They suit me. They may not suit you. We could even compromise if you made an effort to see what I'm after but you won't/can't. Now please try to understand I'm not you, and just back off!


That escalated quickly.


How exactly will sites remember that you are logged in? And how would we have any web apps that aren't horrendous without JS?

Also, where is this burn-a-core-for-every-open-tab stuff? Many websites are highly optimized and do not use much CPU. Not enough to be noticed without actually looking at the numbers anyway.

What sites have page size animations these days?


> How exactly will sites remember that you are logged in?

I don't want them to. I log back in if necessary (browser remembers id/pswd). For those few I need to stay logged in, I use a VM and save the state - I'm more concerned about controlling JS than cookies in such cases.

> And how would be have any web apps that aren't horrendous without JS?

I don't use web apps. My tradeoff.

> Also, where is this burn-a-core-for-every-open-tab stuff? Many websites are highly optimized and do not use much CPU.

Oddly, it seems to be corporate bullshit sites that are the worse offenders. Can't find one but you're right, it's not all by any means. I retract.


You might be right on corporate bullshit sites, there are a few that can burn CPU (usually without any actual content worth viewing...). I guess they are meant to be shown at a meeting on a high-end business laptop?

But I think the vast majority of people would be upset if sites didn't keep you logged in and there were no web apps.

It's even worse if you prefer FOSS and use web apps, since Chromium no longer has password sync, Brave and FF block advanced features, and if you use BitWarden it takes a few extra clicks.


> Usually without any actual content worth viewing

Yeah. The less info the more clipart/general crap you'll get. Weird innit.

As to your other points, I can't argue. I accept a higher level of inconvenience for a higher level of security, that's just my choice. I won't inflict it on others who make different tradeoffs.


There is no problem to solve, the cookies can't hurt you and the website needs to stay afloat.


To state the obvious, some people don't love the extensive profiles that are created of them.


Those people should be able to avoid the profiling, but any solution should be aimed at protecting those people without impacting the 95% who don't care enough to give up convenience or pay much for private services.


Maybe my view is warped (I'm from Germany) but 95% seems a tad high...


It might be. I actually have no idea how to assess the real number.

The Cisco survey(https://iapp.org/news/a/new-cisco-study-emphasizes-consumer-...) says 79% are willing to invest time or money to protect their privacy, but a lot less seem to actually do anything about it.

Almost everyone I know is on Facebook and Gmail, most seem to use Chrome, etc.

It seems to vary a lot with subculture. Programmers always seem to be more willing to sacrifice convenience, and people who watch porn seem to be more interested in privacy than most.

I suspect there's a pretty large segment that only cares in theory, at most, and only then just on principle because of the other people who have more interesting data.

Maybe not 95% but probably 90% of certain subsets at least.


Blaming others for making legitimate complaints about pervasive bad practices is learned assholishness.

We should all complain loudly and far more than we do about the creeping tendency of many companies to do so many obviously shitty things, instead of merely shrugging our shoulders.


Word. Tired of these "I don't want this but I won't spend any time or money on fixing it so someone else should do it" posts.

Hint: it's under Tools|Preferences in firefox/palemoon


No, it's not under "Tools|Preferences."

There is no setting anywhere, in any web browser, to "retain cookies that are technically necessary and reject marketing cookies" which is the desirable behaviour.


Define marketing cookie for me - do you mean 3rd party?

(Some possible control via Tools|Preferences|Exceptions... button allows you to customise by website, although I've never used it. Or just disallow all, which is what I do)

---

Edit: answer the question please, there may be an easy solution to what you want.

Edit2: No reply because god forbid there's an actual way you could take control, that would simply ruin everything (in a parallel universe, man complains the streets are rife with face stabbing but when presented with proof they're not, stabs self in face to prove otherwise).

Biggest problem with learned helplessness is that they like it that way. Gives them something to be angrily resentful about.


Easy: enable only cookies for the things you want (maintaining your session with the first-party website, plus core functionality like payments). Everything else is marketing cookies.

I used umatrix for years but gave up. The guessing what to enable to get a site to work got tiresome, and IIRC there was also problem with browser support.


Definition of cookies I don't consent to: any cookie that is not mandatory for the site to technically work.


You didn't answer my question, then used the vague term "technically work" to ensure I can't give you useful info. tl;dr: you don't want to be helped.


I'm sorry that the correct answer to your question is vague. Such is the nature of the Internet. Not my choice, not my fault.


If you don't specify, I can't help. If marketing = 3rd party, then you can block these using the hosts file at the domain level, which I do. This blocks >95% of crap cookies. A clear question gets a clear answer.
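For anyone unfamiliar with the technique: hosts-file blocking points tracker domains at a non-routable address so the browser can never reach them. A minimal sketch (the domains below are made-up examples, not a real blocklist):

```
# /etc/hosts (C:\Windows\System32\drivers\etc\hosts on Windows)
# 0.0.0.0 is non-routable, so lookups for these domains fail immediately,
# and no requests (or cookies) ever reach them.
0.0.0.0 tracker.example.com
0.0.0.0 ads.example.net
```

Note this operates at the whole-domain level: it can't distinguish a marketing cookie from a functional one served by the same domain.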


Marketing cookies can be same-site or third-party.

The only entity that can specify whether a cookie is marketing-related or not is the website. No one else can.


You're right of course, but 3rd-party cookies seem to be the great majority, so blocking them is a big easy win. Also, only 3rd-party cookies can track between domains/sites, so that stops that, or so I believe.


I don't understand how surrogates in UTF-16 are a problem solved by UTF-8. From the article, it seems the two main improvements were a smaller memory footprint (due to mostly-ASCII data) and a better algorithm, which would have improved performance even in UTF-16. I had to deal with surrogates in UTF-16 and it is much easier than dealing with the variable-length encoding in UTF-8. A naive UTF-8 decoder is easy and fast, but if one takes the time to fully validate each UTF-8 code point it becomes much more difficult and a lot slower... and that's just going forward; trying to move backwards in UTF-8 is again much harder than in UTF-16. The main disadvantage of UTF-16 is memory usage when dealing with ASCII data, but if that is the case then just use an array of bytes and don't worry about Unicode.
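On the moving-backwards point, the two mechanisms can be compared directly: stepping back one code point in UTF-8 means skipping continuation bytes (pattern 0b10xxxxxx), while in UTF-16 it means checking whether the previous unit is a low surrogate. A rough illustration (my own sketch, not from the article):

```python
def utf8_prev(buf: bytes, i: int) -> int:
    """Index of the code point that ends just before byte offset i.

    UTF-8 continuation bytes match 0b10xxxxxx (i.e. byte & 0xC0 == 0x80),
    so we skip backwards over them until we hit a lead byte.
    """
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

def utf16_prev(units: list[int], i: int) -> int:
    """Index of the code point that ends just before unit offset i.

    In UTF-16, a low surrogate (0xDC00..0xDFFF) means the code point
    started one unit earlier, with the high surrogate.
    """
    i -= 1
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF:
        i -= 1
    return i

s = "aé😀"
b = s.encode("utf-8")  # 1 + 2 + 4 = 7 bytes
raw = s.encode("utf-16-le")
u = [int.from_bytes(raw[j:j + 2], "little") for j in range(0, len(raw), 2)]

print(utf8_prev(b, len(b)))   # 3: lead byte of the emoji's 4-byte sequence
print(utf16_prev(u, len(u)))  # 2: the emoji's high surrogate
```

Both are constant-time-ish per step; the UTF-8 version just loops up to 3 extra bytes, whereas UTF-16 checks at most one extra unit.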


I am always extremely doubtful of these types of blog posts that take a well-known algorithm and somehow beat all others (including academia, bioinformatics tools, etc.) with a fancy implementation in <insert cool programming language 2022>


The article is about how they moved an existing (fast) UTF-16 implementation in Haskell to an even faster implementation by switching to UTF-8. This is stated in the first paragraph.

The post they reference is also very honest: "..., the fastest Haskell implementation of the Aho–Corasick string searching algorithm, which powers string search in Channable."

Basically the blog posts show that if you want to program in Haskell and still optimise, this is how you can do it. I think both posts are great resources and don't overstate their claims.


(author here)

I wrote this article during a short internship at Channable. Not to be apologetic, but I think these kinds of articles are so prevalent because young or unpopular languages usually have worse documentation than established ones (naturally). I basically wrote down the things I learned during my internship that I found noteworthy.


I was taught Haskell at university and I'm old. Looking at its wiki page, it's a 32-year-old language, not that much younger than 37-year-old C++.



