
UTF-16 has been nothing but a 30-year mistake which blights the entire Unicode text system.

submitted by SithEmpire to programming on Jul 6, 2022 15:55:35 (+28/-0)

Standard ASCII is one byte per character (with only seven bits actually used), ending at 0x7F, which gives just about enough space for control codes, spaces, punctuation, digits and uppercase/lowercase Latin letters.

Extended ASCII uses the upper half to include some accented letters, and there have been attempts to use dreaded "code pages" for switching alphabet entirely, but Unicode is the proper international solution... in return for accepting that characters are no longer one byte each, and moreover may be variable width.

UTF-8 is proudly variable-width, matching ASCII in most documents while using only those upper half byte values to encode non-ASCII, a robust system which reduces impact when an unaware program reads it as ASCII. The variable width is an issue for a few types of program such as word processors, though most operations (copying, sorting, searching) need not care.
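
To make the variable width concrete, here is a rough Python sketch (illustration only, nothing from any spec) of how many bytes a few characters take and why an ASCII-only parser never trips over the continuation bytes:

```python
# Rough illustration of UTF-8's variable width.
samples = ["A", "é", "€", "𝄞"]  # U+0041, U+00E9, U+20AC, U+1D11E
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Every byte of a multi-byte sequence has the high bit set (>= 0x80), so an
# ASCII-unaware program never mistakes part of "é" for a letter or a slash.
assert all(b >= 0x80 for b in "é€𝄞".encode("utf-8"))
```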

Then some butthurt happened in the early 1990s, and someone just had to insist on a fixed two-byte width, and out popped UCS-2, later rebranded as UTF-16. Two bytes per character, costing a load of extra storage for most files, and absolutely no plan for what happens when we reach 0xFFFF. Java, C# and JavaScript all signed up to it for their fixed-width string storage.

The Unicode character space started getting flooded with every alphabet in existence with all the durka durkas and the ching chang chongs, and symbols for mathematics, music and such. It looks to me that they got up to 0xD000 before admitting that it is going to run dry.

The solution? Reserve 0xD800 to 0xDFFF... so that a pair of code units from that range (a high surrogate from 0xD800-0xDBFF followed by a low one from 0xDC00-0xDFFF) represents a character at 0x10000 and above. So, a variable-width encoding after all.
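
The surrogate arithmetic itself is simple enough; a quick sketch (my own working, not quoted from anywhere) for one code point above 0xFFFF:

```python
# Surrogate-pair arithmetic for a code point above 0xFFFF (sketch).
cp = 0x1D11E                    # MUSICAL SYMBOL G CLEF
v = cp - 0x10000                # 20-bit value, split 10/10 across two units
high = 0xD800 + (v >> 10)       # high surrogate: 0xD800..0xDBFF
low  = 0xDC00 + (v & 0x3FF)     # low surrogate:  0xDC00..0xDFFF
print(hex(high), hex(low))      # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
import struct
assert struct.pack("<HH", high, low) == chr(cp).encode("utf-16-le")
```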

This is indeed literally as retarded as having to use a gas-powered generator to recharge an electric car.

Even worse, the UTF-8 scheme as originally designed (sequences of up to six bytes) could store code points up to 0x7FFFFFFF, but that UTF-16 extension only takes it up to 0x10FFFF, and now Unicode itself specifies 0x10FFFF as the limit just to appease UTF-16.
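
The 0x10FFFF ceiling falls straight out of that surrogate scheme; the arithmetic (again my own working, nothing official) looks like this:

```python
# 10 payload bits in the high surrogate, 10 in the low one.
supplementary = 0x400 * 0x400          # 1,048,576 code points above the BMP
print(hex(0xFFFF + supplementary))     # 0x10ffff, the ceiling UTF-16 imposes

# The original UTF-8 design (sequences up to six bytes) carried 31 bits.
print(hex(2**31 - 1))                  # 0x7fffffff

# CPython enforces the UTF-16-imposed limit:
try:
    chr(0x110000)
except ValueError as err:
    print(err)                         # chr() arg not in range(0x110000)
```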

UTF-16 has fucked everything up for zero benefit in return. Even C++ has an entire wide-string type and string library suite for it, which is widely known as a massive trap; the standard advice is to avoid it and put UTF-8 bytes inside normal strings instead. Windows purports to use wide strings for files, but it is broken.

Python (a.k.a. programming for niggers) 2 tried to "help" with Unicode in the same way that cats "help" ensure that everything is knocked successfully onto the floor. Python 3 sort of behaves itself, while in the background it insists that if a single string character would be multi-byte in UTF-8 then the entire fucking string gets converted to a fixed width encoding big enough for that. Actually that is really a different manifestation of insistence on fixed width rather than UTF-16 specifically, but that type of thinking is still the problem.
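
For anyone who wants to watch that widening happen, it is CPython's "flexible string representation"; a rough sketch of how to observe it (exact byte counts vary by interpreter version):

```python
import sys

# CPython stores each string at the width of its widest character:
# 1 byte/char if everything fits in Latin-1, otherwise 2, otherwise 4.
ascii_only  = "a" * 1000
with_bmp    = "a" * 999 + "\u20ac"      # one euro sign: 2 bytes per char
with_astral = "a" * 999 + "\U0001F600"  # one emoji: 4 bytes per char

for s in (ascii_only, with_bmp, with_astral):
    print(len(s), sys.getsizeof(s))     # same length, very different sizes
```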

UTF-16 must be destroyed.


24 comments


[ - ] bonghits4jeebus 9 points Jul 6, 2022 16:19:42 (+9/-0)

Clearly the solution is UTF-32

People who program in languages like Python don't care how many bytes everything takes up, or probably even what happens in the background. They just want to write a minimal amount of code that's essentially glue to make an app.

[ - ] Tallest_Skil 4 points Jul 6, 2022 17:25:20 (+4/-0)

UTF-Qbit will solve the problem by allowing every character that is physically possible to exist to be encoded within a single Q-bit.

[ - ] SithEmpire [op] 1 point Jul 6, 2022 21:10:04 (+1/-0)

I bet at most 1% of those programs actually derive any advantage from being able to seek in the strings by character index.

[ - ] bonghits4jeebus 1 point Jul 6, 2022 21:23:48 (+1/-0)

If you're trying to do something like that, you're probably not handling localization well. There are many "tricks" you can do on ASCII strings, but if you need to localize you're better off treating strings as atomic labels on which you can perform abstract operations like drawing and querying the size. Parsing localized input is another chore. You can't count on the thing you're inputting having spaces between words or whatever.

Not that I actually know anything about it :P

[ - ] SithEmpire [op] 0 points Jul 7, 2022 03:19:02 (+0/-0)

My point was that the storage cost of UTF-32 only really gives fast seeking in return, which almost nothing actually needs (as you noted). Actual specifics or use-cases need not matter; 8 or 32 will work for anything, just with a different storage/compute tradeoff.
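
A rough sketch of that tradeoff on an ordinary mostly-ASCII string (sizes are illustrative, the exact text doesn't matter):

```python
text = "Mostly ASCII prose with the occasional naïve café and a 10 € price"

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc:10s} {len(text.encode(enc))} bytes")

# UTF-32 buys O(1) indexing by code point; UTF-8 needs a linear scan to find
# the n-th character, but copying, searching and byte-wise sorting (which
# preserves code point order) all work directly on the UTF-8 bytes.
```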

[ - ] CoronaHoax 0 points Jul 6, 2022 22:37:45 (+0/-0)

Why write anything that isn't machine learning in Python?

I guess they’ve got the selenium library too?

It does seem to do a good job of being cross-platform, I guess, unlike Java, whose claim to fame was exactly that but which was shit at it.

[ - ] headfire 4 points Jul 6, 2022 21:34:55 (+4/-0)

I could not agree more.

Python (a.k.a. programming for niggers)

lol Good god yes.

[ - ] SithEmpire [op] 1 point Jul 7, 2022 03:27:58 (+1/-0)

As with Python, niggers don't understand punctuation. You also have to think like a nigger before it becomes possible to write `X if condition else Y` rather than `condition ? X : Y`.

I also like actually being able to embed variable assignments within expressions and lambdas rather than having to turn it into a separate function.

[ - ] headfire 0 points Jul 8, 2022 02:44:56 (+0/-0)

Here’s my thing about python: I use it for algorithm prototyping because it’s quick and trivial to write prototypes/experiments with it (and really, it has to be, else niggers couldn’t use it at all).

Having done the POC, I then write the real thing in C++ (a.k.a. programming for White Men, not walking african faggot apologies).

Every pasty, fuckfaced, little purple-haired man-bun who has ever “published” some FOSS shit in python about their cool, new, zOMfG fRamEWoRk fOr NiGGerZ needs to be set on fucking fire. It’s a virtually useless fucking toy, nothing more. There is no intensity of airhead masturbation which will change this.

Python literally attracts niggers. It’s a nigger attractor, of all kinds. Faggot ass-toys who get horny about their supper awesome typists’ assistant programming language for production code should be bound and thrown off a building in some shitholistan hellhole.

[ - ] Splooge 3 points Jul 6, 2022 19:09:48 (+3/-0)

Incredibly based. I didn't understand any of that, but I could tell.

[ - ] BrokenVoat 1 point Jul 7, 2022 02:29:51 (+1/-0)

It was some American who came up with UTF-16, thinking 2 bytes is enough for every language. All US company stores are tragic because they don't understand languages other than English. If you want to read reviews, you may not appreciate auto-translated ones, because US English speakers think more than one language is too many.
UTF-16 is the product of an arrogant Anglosphere who think they know how to program; many don't.
Btw: Python is one of the few mainstream object-oriented languages that lets you actually hide implementation details (unlike Java).

[ - ] SithEmpire [op] 0 points Jul 7, 2022 03:31:31 (+0/-0)

Python can start by not having a global interpreter lock that prevents real threading, before it's allowed to start flaunting trivial virtues.

[ - ] BrokenVoat 0 points Jul 7, 2022 04:00:37 (+0/-0)

You have never used Java and heard about aspect-oriented programming? It shows Java isn't a good OO language.
The global lock isn't an issue in Python when running network code, and unlike Java, Python works well with C extensions.

[ - ] SithEmpire [op] 0 points Jul 7, 2022 15:00:31 (+0/-0)

I use every language I mentioned, and many others besides. Python is total ass.

Yes, of course Python's shit attempt at threading can at least handle blocking I/O, but it can't actually multi-thread and you know it. That is fucking shit.
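
A minimal sketch of what "can't actually multi-thread" means in practice (CPython behaviour; the numbers will differ per machine):

```python
import threading, time

def burn(n=5_000_000):
    # Pure-Python CPU work; the GIL lets only one such thread run at a time.
    while n:
        n -= 1

def timed(workers):
    threads = [threading.Thread(target=burn) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# On CPython, four CPU-bound threads take roughly four times as long as one;
# blocking I/O is the exception because it releases the GIL while waiting.
print(timed(1), timed(4))
```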

I could list over 9000 things I hate about Java, but they will be lost on a PyNigger. C integration isn't even on that list (JNI isn't exactly great but honestly Java has more pertinent problems).

I just wanted to dump on UTF-16, but it sounds like Python needs a second round.

[ - ] x0x7 1 point Jul 6, 2022 20:00:31 (+1/-0)*

Part of me even dislikes the idea of UTF-8 a bit. I'm 50:50 on it. I see the advantage, but I also see the disadvantage, and I recognize the reality that any application that wants richer text can get it out of an ASCII document. I kind of wish text-enriching were application-specific and various applications could use standards.

One may say that UTF-8 is exactly that, but in some ways it goes beyond it. It's treated as a first-class text system on par with, if not propped up ahead of, ASCII. Treating text as a series of bytes, with a raw character consistently being one byte even in higher-level languages, has a lot of value. Whatever GUI system sits on top can map those bytes to different display characters.

It's responsible for a whopping amount of slowdown in Node.js. When you profile a Node.js server and you've knocked out most of the things you should, the largest and most consistent blocking code is converting string to buffer before sending it over the network. In Node you can write to an HTTP response object either with a string or a buffer, but under the hood it's just a buffer. Unfortunately the conversion from one to the other is not a simple task and can't take advantage of compiler optimizations like replacing the copy with a plain memcpy or with AVX/SIMD operations, because the offset can go off by one at any point. Quite likely, if strings weren't UTF-8, not only would such a translation from string to buffer be faster, it would likely be unnecessary entirely. Just point your OS at the location in memory to send to the network and it's done in O(1) time as far as your process's timing is concerned.

So basically your Node.js server can handle a fifth of the traffic it should be able to, all because of UTF-8, whether it's being used or not.

UTF-8 should be a front-end concept only, and if you need to know the virtual character length of a string accurately on the backend you should use a library, not slow down every server on the planet. C#, Python, PHP all likely have the same problem. If you are writing strings to the network, which most HTTP servers do, you have to convert every string as it goes out to hardware. Being able to index and substring UTF-8 strings correctly requires either inefficient internal data structures or algorithms, or strings that require translation before being written to any byte-based interface.
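
The shape of that cost is easy to show even outside Node; a rough Python analogue (purely illustrative, not what Node actually does internally) of transcoding on every send versus already having bytes:

```python
import timeit

body = "x" * 64_000                      # stand-in for a response payload
pre_encoded = body.encode("utf-8")       # what actually goes on the wire

per_send_encode = timeit.timeit(lambda: body.encode("utf-8"), number=10_000)
bytes_reused    = timeit.timeit(lambda: len(pre_encoded), number=10_000)

print(f"encode on every send: {per_send_encode:.3f}s")
print(f"bytes kept around:    {bytes_reused:.3f}s")

# If strings were already stored as UTF-8 (or the framework accepted bytes),
# the per-send transcode simply would not exist.
```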

[ - ] SithEmpire [op] 0 points Jul 6, 2022 20:59:17 (+0/-0)

If I followed what your service was doing, surely much of the actual problem there is that the strings in Node.js are UTF-16, so it ends up converting UTF-8 input to UTF-16 internally, then back to UTF-8 for output rather than just outputting existing UTF-8.

Substring gets closer to causing the problem of a split sequence of course, though even then the most common substring usage follows a starts/ends test or an index-of, which all still work without needing to consider variable width.

Usually, the only times it actually needs to seek would be if the server is editing the text or outputting just part of it.

[ - ] Wahaha 1 point Jul 6, 2022 18:32:01 (+1/-0)

I assumed they had too much space left and that's why they added all the stupid emojis to UTF-8. If that's not even the case then adding emojis is even more retarded.

[ - ] SithEmpire [op] 0 points Jul 7, 2022 03:38:24 (+0/-0)

Definitely too much credit there, I doubt emojis and capacity are connected. UTF-16 will be modified or killed and Unicode range expanded when the present limit is reached.

I can just see it happening, some high range around U+100000 to U+10FFFF is reserved so that UTF-16 can have "surrogate quads", two pairs which each point into that range, together meaning something even higher.

[ - ] mannerbund 1 point Jul 6, 2022 18:11:27 (+1/-0)

As a proper nigger (who's most proficient in Python), I approve of this message.

Although, I'd say python is more of a jewish language, willing to accept whatever and do whatever with no effort so long as it gets what it wants in the end.

[ - ] ScheduledSuicide 0 points Jul 6, 2022 18:18:54 (+0/-0)

?nohtyP

[ - ] SithEmpire [op] 0 points Jul 6, 2022 21:07:19 (+0/-0)

A bit of both. I call it nigger programming because trying to express a programmatic intent, at least coming from something like C or Java, is like trying to explain it to a nigger.

I can see its other role as a cheap whore though.

[ - ] CoronaHoax 0 points Jul 6, 2022 22:34:32 (+0/-0)

Python is a language that can be used in command-line form like MATLAB, no?

Seems its main purpose is a MATLAB alternative that is free and open source.

Then everyone made an f-ton of libs for it, effectively making it possible to do anything. But at the end of the day it's basically an everything-thrown-at-it script language run on C code.

[ - ] chrimony 0 points Jul 6, 2022 22:58:11 (+0/-0)

Using Python for a Matlab-like program came years after Python was already established as a scripting language.

[ - ] CoronaHoax 0 points Jul 6, 2022 22:30:42 (+0/-0)

I certainly hate any utf-16 niggers and especially any python nigger (kidding), and that includes fucksequel (MySQL) for piss poorly supporting it.