
Is there any decisive criticism against using UTF-8 everywhere as standard?

submitted by SithEmpire to programming 2.0 yearsMay 13, 2022 15:04:47 ago (+1/-0)     (programming)

Infogalactic has UTF-8 accounting for 85% of websites in 2015, and Wikipedia claims 98% presently. The latter whores itself out and has no criticism of UTF-8 at all, and the former only really covers the obvious points: some operations are mildly less convenient with a variable-width encoding, and an Asian glyph sometimes takes 3 bytes instead of 2. Everyone recommends it as a standard, even Microsoft.
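For a rough sense of that size trade-off, here is a small Python sketch comparing encoded byte counts. The sample characters and the helper name `encoded_sizes` are illustrative picks, not taken from either wiki.

```python
# Compare how many bytes one character needs in UTF-8 versus UTF-16.
# The samples are illustrative: ASCII, accented Latin, CJK, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F44D"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for one character, without a BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

The CJK sample is the "3 bytes instead of 2" case; plain ASCII goes the other way and costs 1 byte in UTF-8 against 2 in UTF-16.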

I have no direct problem supporting UTF-8 and running tests against UTF-8 input and those icon characters it has, but its dominance has me very suspicious. If it has the blessing of that much of the world, that must include communists and degenerates. This almost never happens unless it's a format lock-in to secure future compliance of the developer, or incur royalties if a piece of a program happens to be able to encode it (MP3 is that way).

I don't doubt its usefulness, I just get concerned at the lack of anyone presenting any real contrary position at all. Is my suspicion unfounded, or does UTF-8 have some deep rabbit-hole beneath the surface?




[ - ] x0x7 0 points 1.7 yearsAug 9, 2022 23:00:32 ago (+0/-0)*

Here is one. In NodeJS, and likely in other languages, every HTTP response the programmer supplies as a string blocks the process for a significant period while the string is converted to a buffer. This conversion isn't very efficient: going from the UTF-32 or UTF-16 that most languages use internally to represent Unicode to the actual bytes for the network means branching on every single character (the if statements prevent things like a SIMD/AVX memcpy). If servers were using ASCII, the conversion wouldn't even be necessary.

This means every server in the world is 5x slower at some of the most common tasks, even if the developer never intends to deliver Unicode at all.
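To make the per-character branching concrete, here is a naive hand-written UTF-8 encoder in Python. It is only a sketch of the bit layout: the function name `encode_code_point` is made up for illustration, surrogate code points are not rejected, and real runtimes may well use smarter paths than this.

```python
def encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes, one branch per length class."""
    if cp < 0x80:          # 1 byte: plain ASCII
        return bytes([cp])
    if cp < 0x800:         # 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def encode_text(text):
    """Encode a whole string by branching on every character."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)
```

For ASCII-only input only the first branch is ever taken, but the check itself still runs for each character in this naive version.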

Node brags about being fast, so they wouldn't do that if it wasn't necessary, which means everything from Python to PHP is employing equivalent or worse solutions.

So if you code your server directly in C or C++, only touch ASCII, and pipe your strings straight to hardware with no conversion (literally just telling the system call where in memory the data is, with zero copy), your server will be faster. Higher-level languages could use the same strategy under the hood if they weren't dealing with UTF-8. In other words, first-class UTF-8 in modern programming languages has nerfed all web servers unless you want to code servers in C, essentially writing nginx from scratch, which is overkill for a simple site you just want to be fast.

The alternative would be that if you want to display a thumbs up on a web page, you would just write some HTML that displays a thumbs up: i class="fa fa:thumbup" (can't write proper HTML in voat without it breaking). That would slow that specific web page's delivery a whole 0.15%, but because that would be cumbersome, let's slow down the delivery of all pages in the world!

[ - ] SithEmpire [op] 0 points 1.7 yearsAug 19, 2022 13:37:08 ago (+0/-0)

Interesting stuff, though I think that is almost entirely a UTF-16 problem with JS, something I noticed a bit later before posting that. I would have thought unless the server is actually processing the HTML content somehow, it shouldn't need to care about UTF-8, just deliver the bytes and it will work.
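A minimal sketch of the "just deliver the bytes" idea in Python, assuming a static page: encode once at startup and hand the same bytes object to every request. `PAGE_TEXT`, `PAGE_BYTES` and `handle_request` are hypothetical names, not part of any framework.

```python
# Hypothetical static page; a real server would read this from disk.
PAGE_TEXT = "<html><body>Hello \U0001F44D</body></html>"

# One-time conversion at startup instead of once per response.
PAGE_BYTES = PAGE_TEXT.encode("utf-8")

def handle_request():
    """Hypothetical handler: returns the pre-encoded body unchanged."""
    return PAGE_BYTES

body = handle_request()
print(len(body), "bytes ready to write to the socket")
```

The handler never looks inside the body, so whether it contains ASCII or emoji makes no difference to the per-request cost.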

I also found out that Python 3 checks constantly whether non-ASCII characters are put into a string, and resizes the char width of the entire string if so, presumably blighting normal programs and server backends alike.
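A quick way to see that width change, assuming CPython 3.3 or later (this is a CPython implementation detail, and the exact sizes vary by version). As far as I understand it, the width is picked when a string object is created, since strings are immutable.

```python
import sys

# Four 1000-character strings that differ only in their last character.
ascii_s = "a" * 1000
latin_s = "a" * 999 + "\u00e9"       # fits in 1 byte per character
bmp_s = "a" * 999 + "\u4e2d"         # forces 2 bytes per character
astral_s = "a" * 999 + "\U0001F44D"  # forces 4 bytes per character

for name, s in [("ascii", ascii_s), ("latin", latin_s),
                ("bmp", bmp_s), ("astral", astral_s)]:
    print(name, sys.getsizeof(s))
```

One wide character makes the whole string roughly two or four times larger, which is the resizing described above.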

[ - ] chrimony 7 points 2.0 yearsMay 13, 2022 15:26:17 ago (+7/-0)

It fills a need. That's why it took over. It has been around for decades. I've never heard of any patents on it. And the best thing about UTF-8 is that it's compatible with ASCII. So stop worrying about stupid shit.
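The ASCII compatibility can be checked in a few lines of Python. This is only a sketch; `has_ascii_byte` is a made-up helper name.

```python
# Any pure-ASCII byte string is already valid UTF-8 with the same meaning.
ascii_bytes = b"Hello, world"
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

def has_ascii_byte(ch):
    """True if the UTF-8 encoding of ch contains any byte below 0x80."""
    return any(b < 0x80 for b in ch.encode("utf-8"))

# Multi-byte characters never contain ASCII-range bytes, so old code that
# scans for '/' or '\n' byte by byte still works on UTF-8 data.
print(has_ascii_byte("A"), has_ascii_byte("\u4e2d"))
```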

[ - ] SithEmpire [op] 0 points 2.0 yearsMay 13, 2022 15:43:01 ago (+0/-0)

I know most of that; the point was to check whether it could be a trap. Good to know there have been no patent problems though!

[ - ] SecretHitler 1 point 2.0 yearsMay 13, 2022 16:18:16 ago (+1/-0)

Your suspicion is unfounded and weird. I think America and Europe should be standardized on White people because we're superior. We can also standardize on UTF-8 because it's superior. Does that help you feel better about it?

[ - ] SithEmpire [op] 1 point 2.0 yearsMay 13, 2022 17:03:51 ago (+1/-0)

Yes! I am very happy for merit alone to drive UTF-8's first class status.

[ - ] SecretHitler 1 point 2.0 yearsMay 13, 2022 17:41:41 ago (+1/-0)

Good. While I'm at it let me say /r/n is for niggers and boomers

[ - ] Yargiyankooli 1 point 2.0 yearsMay 13, 2022 16:17:24 ago (+1/-0)

I am happy to be off Python 2.7 and to have everything UTF-8 encoded. It saves me so much time figuring out encodings.

[ - ] SithEmpire [op] 0 points 2.0 yearsMay 13, 2022 17:01:22 ago (+0/-0)

Indeed, the apt description of Python 2 encoding is "kicking and screaming", whereas Python 3 works almost too well with how its string characters are codepoints rather than bytes.
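A short Python 3 illustration of the codepoints-versus-bytes distinction; the sample string is arbitrary.

```python
# "naïve" plus a thumbs-up emoji, written with escapes to avoid ambiguity.
s = "na\u00efve \U0001F44D"

code_points = len(s)                  # str length counts code points
utf8_bytes = len(s.encode("utf-8"))   # bytes length counts encoded bytes

print(code_points, utf8_bytes)        # prints: 7 11
print(s[2])                           # indexing returns a whole character
```

Indexing and slicing work on characters, and bytes only appear once you call `encode`.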

[ - ] lord_nougat 1 point 2.0 yearsMay 13, 2022 15:28:52 ago (+1/-0)

It has introduced some weird errors in very old databasen that I had to go and modify. And it was easy, so I'm not even complaining.

[ - ] mikenigger 1 point 2.0 yearsMay 13, 2022 15:06:39 ago (+2/-1)

If it has the blessing of that much of the world, that must include communists and degenerates

boomer logic

[ - ] SithEmpire [op] 1 point 2.0 yearsMay 13, 2022 15:44:48 ago (+1/-0)

It was a true statement and you know it, so go be a nigger some place else.