Standard ASCII is a 7-bit code stored as one byte per character, ending at 0x7F, which gives just about enough space for control codes, space, digits, punctuation and uppercase/lowercase Latin.
The various "extended ASCII" schemes use the upper half (0x80 to 0xFF) to include some accented letters, and there have been attempts to use dreaded "code pages" for switching alphabet entirely, but Unicode is the proper international solution... in return for accepting that a character is no longer one byte, and moreover may be variable width.
UTF-8 is proudly variable-width, matching ASCII in most documents while using only those upper half byte values to encode non-ASCII, a robust system which reduces impact when an unaware program reads it as ASCII. The variable width is an issue for a few types of program such as word processors, though most operations (copying, sorting, searching) need not care.
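To make the compatibility claim concrete, here is a minimal sketch in Python (picked only for illustration; the helper names and sample strings are made up). It checks that pure ASCII text encodes to the same bytes, and that every byte of a non-ASCII character lands in the upper half:

```python
# Sketch: UTF-8 leaves ASCII bytes untouched and uses only bytes >= 0x80
# for everything else. Helper names and sample strings are illustrative.

def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8."""
    return text.encode("utf-8")

def non_ascii_bytes_are_high(text: str) -> bool:
    """True if every byte of every non-ASCII character is >= 0x80."""
    for ch in text:
        if ord(ch) > 0x7F and any(b < 0x80 for b in ch.encode("utf-8")):
            return False
    return True

plain = "hello"
mixed = "naïve €5"

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
assert utf8_bytes(plain) == plain.encode("ascii")

# Mixed text: ASCII characters stay single bytes, the rest become
# multi-byte sequences made entirely of upper-half byte values.
print(utf8_bytes(mixed).hex(" "))
print(len(mixed), "characters,", len(utf8_bytes(mixed)), "bytes")
```

An ASCII-only program reading those bytes still sees every ASCII character in the right place; it only fails to make sense of the high bytes.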
Around the same time, in the early 1990s, someone just had to insist on fixed byte width, and out popped UCS-2, the 16-bit encoding that later became UTF-16. Two bytes per character, costing a load of extra storage for most files, and absolutely no plan for what happens when we reach 0xFFFF. Java, C# and JavaScript all signed up to it for their fixed width string storage.
The Unicode character space started filling up with every script in existence, plus symbols for mathematics, music and such. It looks to me that they got up to around 0xD000 before admitting that it was going to run dry.
The solution? Reserve 0xD800 to 0xDFFF... for using pairs of 16-bit units in that range (a "surrogate pair") to represent 0x10000 and above. So, a variable width encoding.
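For anyone who has not seen the mechanics, a rough sketch of the surrogate pair arithmetic in Python (the function names are mine, not from any library): subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
# Sketch of UTF-16 surrogate pair encoding and decoding.
# Function names are illustrative, not a library API.

def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert "\U0001F600".encode("utf-16-be") == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF]
)
```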
This is indeed literally as backwards as having to use a gas-powered generator to recharge an electric car.
Even worse, the original UTF-8 design could encode Unicode up to 0x7FFFFFFF using sequences of up to six bytes, but that UTF-16 extension only takes it up to 0x10FFFF, and now Unicode itself specifies 0x10FFFF as the limit (with UTF-8 cut back to four bytes to match) just to appease UTF-16.
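The two limits fall straight out of the bit counts. A small sketch of the arithmetic in Python (the constant names are mine):

```python
# Sketch: where 0x10FFFF and 0x7FFFFFFF come from. Constant names are illustrative.

# UTF-16: a surrogate pair carries 10 + 10 payload bits on top of a 0x10000 base.
SURROGATE_PAYLOAD_BITS = 10 + 10
UTF16_MAX = 0x10000 + ((1 << SURROGATE_PAYLOAD_BITS) - 1)

# Original UTF-8: the six-byte form has a lead byte 1111110x (1 payload bit)
# followed by five continuation bytes 10xxxxxx (6 payload bits each).
UTF8_SIX_BYTE_PAYLOAD_BITS = 1 + 5 * 6
UTF8_ORIGINAL_MAX = (1 << UTF8_SIX_BYTE_PAYLOAD_BITS) - 1

print(hex(UTF16_MAX))
print(hex(UTF8_ORIGINAL_MAX))

# Python itself enforces the Unicode limit: chr() rejects anything above it.
try:
    chr(UTF16_MAX + 1)
except ValueError as exc:
    print("rejected:", exc)
```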
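UTF-16 has screwed everything up for zero benefit in return. Even C++ has an entire wide-string type (std::wstring over wchar_t) and string library suite for it, which is widely known as a massive trap; the usual advice is to avoid it and put UTF-8 bytes inside normal strings instead. Windows purports to use wide strings for file names, but it is broken: as far as I can tell the names are just arrays of 16-bit values that nobody checks for being valid UTF-16.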
Python 2 tried to "help" with Unicode in the same way that cats "help" ensure that everything is knocked successfully onto the floor. Python 3 sort of behaves itself, while in the background CPython insists that if a single character in a string does not fit in one byte (anything above U+00FF), then the entire string gets stored in a fixed width encoding big enough for it: two bytes per character, or four if anything is above U+FFFF. Actually that is really a different manifestation of insistence on fixed width rather than UTF-16 specifically, but that type of thinking is still the problem.
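This can be observed from inside Python. The sketch below assumes CPython specifically (other interpreters store strings differently) and uses a helper name I made up; it estimates the per-character storage by comparing the sizes of two freshly built repetitions of the same text, so the fixed header cancels out.

```python
# Sketch, CPython only: measure how many bytes per character a str occupies.
# storage_width is an illustrative helper, not a standard function.
import sys

def storage_width(text: str) -> int:
    """Estimate bytes per character CPython uses to store `text`.

    text * 3 and text * 2 share the same fixed header size, so their size
    difference divided by len(text) is the per-character width.
    """
    if not text:
        raise ValueError("need a non-empty string")
    return (sys.getsizeof(text * 3) - sys.getsizeof(text * 2)) // len(text)

padding = "a" * 999
samples = {
    "ASCII only": padding + "a",
    "one U+00E9": padding + "\u00e9",
    "one U+20AC": padding + "\u20ac",
    "one U+1F600": padding + "\U0001F600",
}
for label, text in samples.items():
    print(label, "->", storage_width(text), "byte(s) per character")
```

A single wide character at the end is enough to change the storage width of all thousand characters.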
UTF-16 must be destroyed.
bonghits4jeebus 9 points 2.8 years ago (Jul 6, 2022 16:19:42) (+9/-0)
People who program in Python don't care how many bytes everything takes up. Or probably even what happens in the background. They just want to write a minimal amount of code that's essentially glue to make an app.
Tallest_Skil 4 points 2.8 years ago (Jul 6, 2022 17:25:20) (+4/-0)
SithEmpire [op] 1 point 2.8 years ago (Jul 6, 2022 21:10:04) (+1/-0)
bonghits4jeebus 1 point 2.8 years ago (Jul 6, 2022 21:23:48) (+1/-0)
Not that I actually know anything about it :P
SithEmpire [op] 0 points 2.8 years ago (Jul 7, 2022 03:19:02) (+0/-0)
CoronaHoax 0 points 2.8 years ago (Jul 6, 2022 22:37:45) (+0/-0)
I guess they've got the Selenium library too?
It does seem to do a good job of being cross-platform, I guess, unlike Java, whose claim to fame was exactly that but which was bad at it.
headfire 4 points 2.8 years ago (Jul 6, 2022 21:34:55) (+4/-0)
lol Good god yes.
SithEmpire [op] 1 point 2.8 years ago (Jul 7, 2022 03:27:58) (+1/-0)
I also like actually being able to embed variable assignments within expressions and lambdas rather than having to turn it into a separate function.
headfire 0 points 2.8 years ago (Jul 8, 2022 02:44:56) (+0/-0)
Having done the POC, I then write the real thing in C++.
Everyone who has ever "published" some FOSS thing in Python as their cool new framework and then pushed it as production-ready needs a reality check. It's a virtually useless toy, nothing more. There is no intensity of hype which will change this.
Python attracts people who get excited about a typists' assistant of a programming language and then want to run production code on it, which is exactly where it does not belong.
Splooge 3 points 2.8 years ago (Jul 6, 2022 19:09:48) (+3/-0)
mannerbund 1 point 2.8 years ago (Jul 6, 2022 18:11:27) (+1/-0)
Although, I'd say Python is more of an anything-goes language, willing to accept whatever and do whatever with no effort so long as it gets the result it wants in the end.
ScheduledSuicide 0 points 2.8 years ago (Jul 6, 2022 18:18:54) (+0/-0)
SithEmpire [op] 0 points 2.8 years ago (Jul 6, 2022 21:07:19) (+0/-0)
I can see its other role as a cheap whore though.
CoronaHoax 0 points 2.8 years ago (Jul 6, 2022 22:34:32) (+0/-0)
Seems its main purpose is to be a MATLAB alternative that is free and open source.
Then everyone made a ton of libs for it, effectively making it possible to do anything. But at the end of the day it's basically an everything-thrown-at-it scripting language running on C code.
chrimony 0 points 2.8 years ago (Jul 6, 2022 22:58:11) (+0/-0)
Wahaha 1 point 2.8 years ago (Jul 6, 2022 18:32:01) (+1/-0)
SithEmpire [op] 0 points 2.8 years ago (Jul 7, 2022 03:38:24) (+0/-0)
I can just see it happening, some high range around U+100000 to U+10FFFF is reserved so that UTF-16 can have "surrogate quads", two pairs which each point into that range, together meaning something even higher.
x0x7 1 point 2.8 years ago (Jul 6, 2022 20:00:31) (+1/-0)*
One may say that UTF-8 is exactly that, but in some ways it goes beyond. It's treated as a first-class text system on par with ASCII, if not propped ahead of it. Treating text as a series of bytes, and a raw character as one byte, consistently, even in higher-level languages, has a lot of value. Whatever GUI system sits on top can have different display characters.
It's responsible for a whopping amount of slowdown in Node.js. When you profile a Node.js server and you've knocked out most of the things you should, the largest and most consistent blocking code is converting strings to buffers before sending them over the network. In Node you can write to an HTTP response object with either a string or a buffer, but under the hood it's always a buffer. Unfortunately the conversion from one to the other is not a trivial task, and it can't take advantage of compiler optimizations like replacing the copy with memcpy or with AVX/SIMD operations, because the offset can go off by one at any point. Quite likely, if strings were plain bytes, such a translation from string to buffer would not only be faster but unnecessary entirely. Just point your OS at the location in memory that should go to the network and it's done in O(1) time as far as your process's timing is concerned.
So basically your Node.js server can handle 1/5th of the traffic it should be able to, all because of UTF-8, whether it's being used or not.
UTF-8 should be a front-end concept only, and if you need to know the character length of a string accurately on the backend you should use a library rather than slow down every server on the planet. C#, Python and PHP all likely have the same problem. If you are writing strings to the network, which most HTTP servers do, you have to convert every string as it goes out to hardware. Being able to index and substring UTF-8 strings correctly requires either inefficient data structures or algorithms internally, or strings that need translation before they can be written to any byte-based interface.
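The indexing cost mentioned here is easy to show. This is only a sketch in Python with a made-up helper name, not how any particular runtime does it: to find where the n-th character starts inside UTF-8 bytes, you have to walk the bytes and count every byte that is not a continuation byte (10xxxxxx), which is O(n) instead of a simple multiply.

```python
# Sketch: locating the n-th character in UTF-8 bytes needs a linear scan.
# byte_offset_of_char is an illustrative helper, not a library function.

def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the byte offset at which character number char_index starts.

    char_index == number of characters returns len(data) (one past the end).
    """
    if char_index < 0:
        raise IndexError("negative index")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(data)
    raise IndexError("character index out of range")

data = "a€b😀c".encode("utf-8")   # 1 + 3 + 1 + 4 + 1 bytes
for i in range(6):
    print(i, "->", byte_offset_of_char(data, i))
```

With fixed-width storage the same lookup is just index times width, which is the trade-off the runtimes in question chose.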
SithEmpire [op] 0 points 2.8 years ago (Jul 6, 2022 20:59:17) (+0/-0)
Substring gets closer to causing the problem of a split sequence of course, though even then the most common substring usage follows a starts/ends test or an index-of, which all still work on the raw bytes without needing to consider variable width.
Usually, the only times it actually needs to seek by character would be if the server is editing the text or outputting just part of it.
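A minimal sketch of that in Python, working purely on bytes (the helper name and sample text are made up). Because UTF-8 lead bytes and continuation bytes never overlap, a match for a validly encoded needle can only begin at a character boundary, so slicing at the found offset cannot split a sequence:

```python
# Sketch: prefix tests, searching and slicing on raw UTF-8 bytes, without
# ever decoding. split_at and the sample text are illustrative.

def split_at(haystack: bytes, needle: bytes) -> tuple[bytes, bytes]:
    """Split haystack just before the first occurrence of needle."""
    pos = haystack.find(needle)
    if pos < 0:
        raise ValueError("needle not found")
    return haystack[:pos], haystack[pos:]

body = "Grüße, 世界! price=€5".encode("utf-8")
needle = "世界".encode("utf-8")

# Plain byte-wise prefix test, no awareness of character widths needed.
assert body.startswith("Grüße".encode("utf-8"))

head, tail = split_at(body, needle)

# Both halves are still valid UTF-8, so the cut fell on a character boundary.
print(head.decode("utf-8"))
print(tail.decode("utf-8"))
```

The decode calls are only there to prove the point; the search and the slicing themselves never looked at character widths.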
BrokenVoat 1 point 2.8 years ago (Jul 7, 2022 02:29:51) (+1/-0)
UTF-16 is the product of an arrogant Anglosphere that thinks it knows how to program; many don't.
Btw: Python is one of the few mainstream object-oriented languages that lets you actually hide implementation details (unlike Java).
SithEmpire [op] 0 points 2.8 years ago (Jul 7, 2022 03:31:31) (+0/-0)
BrokenVoat 0 points 2.8 years ago (Jul 7, 2022 04:00:37) (+0/-0)
It shows Java isn't a good OO language.
The global lock isn't an issue in Python when running network code, and unlike Java, Python works with C extensions.
SithEmpire [op] 0 points 2.8 years ago (Jul 7, 2022 15:00:31) (+0/-0)
Yes, of course Python's poor attempt at threading can at least handle blocking I/O, but with the global lock it can't actually run Python code on more than one core at a time, and you know it. That is just bad.
I could list over 9000 things I hate about Java, but they would be lost on a Python fan. C integration isn't even on that list (JNI isn't exactly great, but honestly Java has more pertinent problems).
I just wanted to dump on UTF-16, but it sounds like Python needs a second round.
CoronaHoax 0 points 2.8 years ago (Jul 6, 2022 22:30:42) (+0/-0)