In Swedish I *always* want e to match é, and I would *never* want a to match ä. ...

zzo38computer · on June 15, 2020

Yes, that is correct. HTML can declare the language of the text (which is also useful for automatically selecting fonts), so that declaration should be used for searching too, in addition to font selection, indexing, etc.

Another idea would be a convention for Swedish text: "é" is always represented in decomposed form, while "ä" is always represented in precomposed form, so that you can tell the difference.

lokedhs · on June 16, 2020

That would be incorrect, as Unicode defines the two forms to be canonically equivalent, and so interchangeable. You'd normalize before doing a search anyway, in which case both forms become the same code points (which ones depends on the normalization form used).
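To make that concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `nfd_equal` is made up for illustration; the point is only that the two spellings of "é" compare unequal as raw strings but equal after normalization.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

def nfd_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(precomposed == decomposed)           # raw comparison: False
print(nfd_equal(precomposed, decomposed))  # after normalization: True
```

Once a search does this, a convention based on which form the author typed can no longer be observed.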

The correct way to handle this is by tagging the text with a language tag, as has been mentioned in other replies to the parent post.
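As a rough sketch of what language-tagged matching could look like: the table `DISTINCT_LETTERS` and the functions `fold` and `matches` below are hypothetical, not any browser's actual behavior, and the Swedish entry only encodes the rule from the top of the thread (e matches é, a does not match ä). Falling back to stripping every accent for an unknown language is also just an assumption.

```python
import unicodedata

# Hypothetical table: letters each language treats as distinct from their base letter.
DISTINCT_LETTERS = {
    "sv": set("\u00e5\u00e4\u00f6"),  # å, ä, ö
    "en": set(),
}

def fold(text, lang):
    """Strip accents, except on letters that are distinct in the given language."""
    keep = DISTINCT_LETTERS.get(lang, set())  # assumption: unknown language strips all
    out = []
    # NFC first, so precomposed and decomposed input are treated the same way.
    for ch in unicodedata.normalize("NFC", text).casefold():
        if ch in keep:
            out.append(ch)
        else:
            parts = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in parts if not unicodedata.combining(c)))
    return "".join(out)

def matches(query, text, lang):
    """Substring search on the folded forms."""
    return fold(query, lang) in fold(text, lang)
```

With this, `matches("ide", "idé", "sv")` is true while `matches("bar", "bär", "sv")` is false; tagging the same text as `"en"` makes both match.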

anoncake · on June 15, 2020

Making search work differently depending on the specified language just makes it unpredictable since not everything is marked up correctly.

necovek · on June 16, 2020

Do you have a better incentive in mind to mark text up properly?

I'd even suggest that browsers mess up the formatting of non-English letters (by using wildly different fonts) to encourage better semantic markup, but that is a bit hard-core, and everyone would shout "compatibility breakage" at them :)

zzo38computer · on June 17, 2020

If text is not marked up properly, then the user should be allowed to override that setting, but when it is marked up properly, such overriding should not be necessary.

As long as the user can configure which fonts to use for which language, I do not think that doing so breaks compatibility. Actually, I think it is a fairly good idea. The document should only specify the language and the style (e.g. bold, emphasis, normal, fixed pitch, heading, etc.), and that combination is then mapped to a font in the browser. (If the user has enabled use of CSS fonts, and such fonts are specified, then they would override those specified by the user. If the user has not enabled use of CSS fonts, then the user's fonts are always used.) This would be needed anyway because of the Han unification that Unicode does (and Unicode is very messy in general). (I mentioned before that Unicode can be good for searching, and if that is what you are doing with Unicode rather than writing and displaying documents, then Han unification is probably desirable, although again the Duocode that I mentioned before may help even more.)

sbierwagen · on June 15, 2020

What do you do on a page that's mostly English, but contains a single Swedish quote?

zzo38computer · on June 15, 2020

I suppose that you can write:

  <blockquote lang="sv">...</blockquote>
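A search tool could then work out which language applies to each piece of text. Here is a minimal sketch using Python's standard `html.parser`; the class name `LangRuns`, the `"en"` default, and the short `VOID` list are my own assumptions, and it expects every non-void element to have a closing tag.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not disturb the language stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class LangRuns(HTMLParser):
    """Collect (language, text) pairs, inheriting lang from enclosing elements."""

    def __init__(self, default_lang="en"):
        super().__init__()
        self.stack = [default_lang]
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang inherits the enclosing one.
        self.stack.append(lang or self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))

parser = LangRuns()
parser.feed('<p>He said:</p><blockquote lang="sv">en id\u00e9</blockquote><p>Done.</p>')
parser.close()
print(parser.runs)
```

Each run can then be searched with the rules of its own language, so the single Swedish quote does not change how the English text around it is matched.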