Hacker News

> Non-LE BOMs (like surrogates, at least before the emojipocalypse) are rare enough that many "UTF-16" based tools simply only support UTF-16LE.

Gymnastics like this always occur when you don't know the encoding. If you don't know you're getting UTF-8, welcome to heuristics or whatever there, too.
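To make "heuristics" concrete, here's a rough sketch of the cheapest one, BOM sniffing. The function name `sniff_bom` is just something I made up for illustration. It only tells you anything when a BOM is actually present; without one you're back to guessing.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so the UTF-32 checks have to come first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known BOM, else None.

    Hypothetical helper: a BOM is only a hint, and None means
    "no BOM seen", not "this is UTF-8".
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Note the inherent ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 looks exactly like a UTF-32LE BOM, which is one more reason to be told the encoding instead of guessing it.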

> And like surrogates, the BOM is another wasted Unicode character that is only needed for the UTF-16 mess.

This isn't a big deal at all.

> You don't need validation for most UTF-8 tasks, GIGO is often more reasonable for things entered by humans.

You can't build robust systems this way. You always have to validate, and in order to do that, you gotta know what encoding you're getting.
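As a minimal sketch of what I mean by validating at the boundary, assuming you've already been told the input is supposed to be UTF-8 (the helper name `is_valid_utf8` is mine, not from any library):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8.

    Python's strict decoder rejects truncated sequences, overlong
    encodings, and encoded surrogates, so a successful decode is
    the validation.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

In practice you'd usually keep the decoded string rather than just the boolean, and reject or quarantine the input at that point instead of passing garbage downstream.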

> Unix tools would disagree.

Most of those use locales, or they don't expect ASCII at all; they expect bytes. Again, you can get lucky with this, but you can't build a robust system on luck.

> Many tools only care about finding substrings.

You have to scan through the string or save offsets (after an initial scan) either way, UTF-8 or UTF-16.
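A quick sketch of why, assuming both inputs are already validated (the function names are hypothetical). The raw search hands you a byte or code-unit offset in either encoding, and turning that into a character index means walking the prefix in both cases:

```python
from typing import Optional, Tuple


def find_utf8(haystack: bytes, needle: bytes) -> Optional[Tuple[int, int]]:
    """Return (byte offset, code point index) of needle, or None."""
    byte_off = haystack.find(needle)
    if byte_off < 0:
        return None
    # Converting the byte offset to a character index scans the prefix.
    return byte_off, len(haystack[:byte_off].decode("utf-8"))


def find_utf16le(haystack: bytes, needle: bytes) -> Optional[Tuple[int, int]]:
    """Return (code unit offset, code point index) of needle, or None."""
    start = 0
    while True:
        byte_off = haystack.find(needle, start)
        if byte_off < 0:
            return None
        if byte_off % 2 == 0:  # only accept matches aligned to code units
            # Surrogate pairs mean code units != code points, so this
            # conversion also scans the prefix.
            prefix = haystack[:byte_off].decode("utf-16-le")
            return byte_off // 2, len(prefix)
        start = byte_off + 1


text = "naïve 😀 café"
print(find_utf8(text.encode("utf-8"), "café".encode("utf-8")))
print(find_utf16le(text.encode("utf-16-le"), "café".encode("utf-16-le")))
```

With the emoji in front of the match, the code unit offset and the code point index disagree in UTF-16 just like the byte offset and the index disagree in UTF-8.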

> Not when it is being parsed it isn't.

Fair point!


