
UTF-16 has been nothing but a 30-year mistake which blights the entire Unicode text system.

submitted 1.9 years ago (Jul 6, 2022 15:55:35) by SithEmpire to programming (+28/-0)

Standard ASCII is one byte per character (only seven bits actually used), ending at 0x7F, which gives just about enough space for control codes, the space, punctuation, digits and uppercase/lowercase Latin letters.
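
A quick sanity check, in Python purely because it makes a handy scratchpad for byte-level poking (nothing Python-specific is being claimed here):

    # Plain ASCII: one byte per character, and nothing above 0x7F.
    text = "Hello, world!"
    data = text.encode("ascii")
    assert len(data) == len(text)   # one byte each
    assert max(data) <= 0x7F        # the whole encoding stops at 0x7F
    # "café".encode("ascii")        # would raise UnicodeEncodeError: 'é' has no ASCII byte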

Extended ASCII uses the upper half for some accented letters, and there have been attempts to use the dreaded "code pages" to switch the alphabet entirely, but Unicode is the proper international solution... in return for accepting that the encoding is no longer one byte per character, and may even be variable width.
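
For anyone who never had to live through the code-page era, here is roughly what it looked like: the same byte means a different letter depending on which legacy table you guess (byte 0xE4 picked arbitrarily for illustration):

    # One byte, three different "extended ASCII" interpretations.
    b = bytes([0xE4])
    print(b.decode("latin-1"))   # 'ä'  (Western European)
    print(b.decode("cp1251"))    # 'д'  (Cyrillic)
    print(b.decode("cp437"))     # 'Σ'  (old DOS code page)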

UTF-8 is proudly variable-width, matching ASCII in most documents while using only those upper half byte values to encode non-ASCII, a robust system which reduces impact when an unaware program reads it as ASCII. The variable width is an issue for a few types of program such as word processors, though most operations (copying, sorting, searching) need not care.
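
Roughly how that plays out in practice (Python again as a scratchpad; the byte counts are a property of UTF-8 itself):

    # UTF-8 widths: ASCII stays at 1 byte, everything else uses bytes >= 0x80.
    print(len("A".encode("utf-8")))     # 1 byte
    print(len("é".encode("utf-8")))     # 2 bytes
    print(len("€".encode("utf-8")))     # 3 bytes
    print(len("😀".encode("utf-8")))    # 4 bytes (code point above 0xFFFF)
    # A pure-ASCII document is already valid UTF-8, byte for byte.
    assert "plain ascii".encode("ascii") == "plain ascii".encode("utf-8")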

Then some butthurt happened in the early 1990s, someone just had to insist on a fixed byte width, and out popped UCS-2, the fixed-width encoding that later got rebadged as UTF-16. Two bytes per character, costing a load of extra storage for most files, and absolutely no plan for what happens when the code points reach 0xFFFF. Java, C# and JavaScript all signed up to it for their fixed-width string storage.
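
The storage cost is easy to see: every character pays for two bytes whether it needs them or not (Python once more, purely for illustration):

    # UCS-2/UTF-16: two bytes per character, even for plain ASCII text.
    print("hello".encode("utf-16-le"))        # b'h\x00e\x00l\x00l\x00o\x00'
    print(len("hello".encode("utf-16-le")))   # 10 bytes
    print(len("hello".encode("utf-8")))       # 5 bytes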

The Unicode character space started getting flooded with every alphabet and script in existence, from Arabic through the CJK ideographs, plus symbols for mathematics, music and such. It looks to me that they got up to around 0xD000 before admitting that the 16-bit space was going to run dry.

The solution? Reserve 0xD800 to 0xDFFF... for using pairs of code units in that range (so-called surrogate pairs) to represent 0x10000 and above. So, a variable-width encoding after all.
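
The mechanics, for anyone who has never seen a surrogate pair spelled out: the code point loses 0x10000, then splits into a 10-bit high half and a 10-bit low half (to_surrogate_pair below is just an illustrative helper, not any library function):

    import struct

    def to_surrogate_pair(cp):
        """Split a supplementary code point into a UTF-16 high/low surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)      # top 10 bits
        low  = 0xDC00 + (cp & 0x3FF)    # bottom 10 bits
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])   # ['0xd83d', '0xde00']
    # Matches what a real UTF-16 encoder emits for U+1F600:
    print([hex(u) for u in struct.unpack("<2H", "😀".encode("utf-16-le"))])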

This is literally as absurd as having to haul a gas-powered generator around to recharge an electric car.

Even worse, UTF-8 as originally designed could store code points up to 0x7FFFFFFF (using sequences of up to six bytes), but the UTF-16 surrogate extension only reaches 0x10FFFF, and now Unicode itself specifies 0x10FFFF as the limit just to appease UTF-16.
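
You can see the ceiling directly: one code point past 0x10FFFF and the whole system refuses (behaviour shown is CPython's, but the 0x110000 limit is Unicode's own):

    # 0x10FFFF is the last legal code point, purely so surrogate pairs can still reach it.
    last = chr(0x10FFFF)       # accepted: the final valid code point
    try:
        chr(0x110000)          # one past the UTF-16-imposed ceiling
    except ValueError as err:
        print(err)             # chr() arg not in range(0x110000)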

UTF-16 has fucked everything up for zero benefit in return. Even C++ has an entire wide-string type (std::wstring) and a parallel string library suite for it, which is widely known as a massive trap; the usual advice is to avoid it and put UTF-8 bytes inside normal strings instead. Windows purports to use wide strings for filenames, but even that is broken: a filename is really just a sequence of 16-bit units and does not have to be valid UTF-16 at all.
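
The breakage is the lone-surrogate problem: an unpaired surrogate fits happily in 16-bit units, so it can appear in a filename, yet it is not valid UTF-16 text and strict UTF-8 refuses to round-trip it. This sketch is not run against the Windows API; it just shows the lone-surrogate half of the problem using Python's surrogatepass error handler:

    # A lone high surrogate: fine as 16-bit units, but not valid Unicode text.
    lone = "\ud800"
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError as err:
        print(err)                                      # surrogates not allowed
    print(lone.encode("utf-16-le", "surrogatepass"))    # b'\x00\xd8'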

Python 2 tried to "help" with Unicode in the same way that cats "help" ensure that everything is knocked successfully onto the floor. Python 3 sort of behaves itself, while in the background it insists that if a single character in a string falls outside the one-byte Latin-1 range (or outside the BMP), then the entire fucking string gets stored in a fixed-width representation wide enough for that character (PEP 393). Strictly that is a different manifestation of the insistence on fixed width rather than UTF-16 specifically, but that type of thinking is still the problem.
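
If you want to watch it happen (exact byte counts vary between CPython versions, but the jumps are the point):

    # PEP 393: one wide character widens the storage of the entire string.
    import sys
    print(sys.getsizeof("a" * 100))           # baseline: about 1 byte per character
    print(sys.getsizeof("a" * 99 + "é"))      # still 1 byte per char (Latin-1 range)
    print(sys.getsizeof("a" * 99 + "€"))      # whole string jumps to 2 bytes per char
    print(sys.getsizeof("a" * 99 + "😀"))     # whole string jumps to 4 bytes per char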

UTF-16 must be destroyed.


24 comments

headfire 0 points 1.9 years ago

Here’s my thing about Python: I use it for algorithm prototyping because it’s quick and trivial to write prototypes/experiments with it (and really, it has to be, or nobody would bother with it at all).

Having done the POC, I then write the real thing in C++.

Every over-hyped FOSS framework “published” in Python gets treated like a revolution when it’s really just a toy, and no amount of enthusiasm will change that.

Anyone pushing this glorified typists’ assistant of a language for production code is kidding themselves.