and for the record, i was not kidding about the orangutang
Is it in Unicode?
yes
all emoji are by definition unicode
I was thinking that emoji was small pictures used to convey feelings, but I think you are technically correct
the best kind of correct
Ok, mysql does something crazy with strings by default
which ignores caps, and collapses weird accent marks
and that is what I want to use for normalizing name strings
but I have no idea what the hell it is
that’s called collation
do not confuse that with unicode normalization
cool
collation is just for searching and sorting
see also: LC_COLLATE
it’s for locale specific stuff like ’should a and ä sort as the same character’
yeah
it encodes culture based expectations of software behaviour, not actual byte level representations of characters
If someone registers Fräd, I want that to be the same as if someone registered frad
not very globally inclusive
To be clear, people can use the exact string they register
i’d not like to see that be restricted to latin scripts either
it is only for the uniqueness bucketing that it would be collated down
you’ll disappoint a lot of nordic and german people that way - some genuinely different names will collapse to one that way
understood
plus everyone who is not on a flat latin script, so the majority of the global population
I don't think I would be preventing anyone from using any unicode string they like
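For reference, a minimal sketch of the "uniqueness bucketing" idea discussed above, assuming the bucket key is built by decomposing, stripping combining marks, and lowercasing. This only approximates what an accent- and case-insensitive MySQL collation does; `BucketKeySketch` and `bucketKey` are hypothetical names, not anything from the actual codebase.

```java
import java.text.Normalizer;
import java.util.Locale;

final class BucketKeySketch {
    // Hypothetical helper: builds a key used only for uniqueness checks,
    // never for display. The registered string itself is stored untouched.
    static String bucketKey(String name) {
        // Decompose so that "ä" becomes "a" followed by U+0308 (combining diaeresis).
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Drop every combining mark (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Fold case without depending on the JVM's default locale.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(bucketKey("Fr\u00e4d")); // frad
        System.out.println(bucketKey("frad"));      // frad
    }
}
```

The objection raised above shows up directly: "Jörg" and "Jorg" land in the same bucket. Stripping category M also removes vowel signs in scripts such as Devanagari, so this key is much lossier outside Latin text than it looks.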
so what do you collapse the orangutang to?
probably just orangutang
I'll absolutely put that in the unit tests for it
for hard mode, try family emojis with mixed skin tone modifiers
oh god
yeah, good idea
i like unicode
me too
yet you want to collapse to ASCII
i’ll be ASCII 0x0B then
only the things that collapse to ASCII
I suspect I am communicating poorly because I don't know the terminology in this space
my argument: treat user input just as a pile of bytes, do less, allow for more
I mostly agree with that. I am just trying to avoid people impersonating others easily by picking strange letters
with one caveat: normalize to NFC so people cannot get identical renderings from multiple inputs
I know I won't be able to prevent that completely
you can
NFC or NFKC?
K is always crap and allows for nonsense
pick NFC or NFD
composed is ’the shortest sequence of code points which can represent it’ and decomposed is the opposite
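A small sketch of the NFC/NFD point, using the JDK's `java.text.Normalizer`. The class name `NfcNfdSketch` and its helpers are made up for illustration; the two inputs render identically but are different strings until they are normalized.

```java
import java.text.Normalizer;

final class NfcNfdSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "Fr\u00e4d"; // "ä" as a single code point
        String combining = "Fra\u0308d";  // "a" followed by a combining diaeresis

        System.out.println(precomposed.equals(combining));           // false
        System.out.println(nfc(precomposed).equals(nfc(combining))); // true
        System.out.println(nfc(combining).length());                 // 4
        System.out.println(nfd(precomposed).length());               // 5
    }
}
```

Either form works as a canonical representation as long as one is picked and applied consistently before storing or hashing.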
or just hash the input and that’s the read id?
The hashing is being done, mostly to get me a fixed length item in the table
gitstyle short hash as discordstyle tail and done?
tail too short, easy to brute force a collision
poc attack demo required to continue discussion
ha, I might misunderstand how the tails are made
if you are counting entries and everyone has the same count, no problem
iirc git just takes beginning and end of the sha
if it is the hash of something, then someone can generate keys until it hashes to the same short tail
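To make the brute-force concern concrete, here is a rough sketch, assuming the tail is simply the first few hex digits of SHA-256 over the key. That tail construction is an assumption for illustration, not how any particular system derives it, and `ShortTailSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ShortTailSketch {
    // Assumed tail: the first `hexChars` hex digits of SHA-256 over the input.
    static String tail(String input, int hexChars) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.substring(0, hexChars);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }

    // Tries candidates "key-0", "key-1", ... until one shares the target's tail.
    // Returns null if nothing matches within maxTries.
    static String findCollision(String target, int hexChars, long maxTries) {
        String wanted = tail(target, hexChars);
        for (long i = 0; i < maxTries; i++) {
            String candidate = "key-" + i;
            if (!candidate.equals(target) && tail(candidate, hexChars).equals(wanted)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A 4-hex-digit tail has 65,536 possible values, so a match is expected quickly.
        System.out.println(findCollision("fireduck", 4, 5_000_000L));
    }
}
```

Each extra hex digit multiplies the expected work by 16, which is the trade-off between short tails for humans and long tails for hard auth.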
can play silly games like tail hash = hmac(tx_id, block_id, name)
so it is hard to know what the tail will be unless you are also the miner
and willing to throw away good blocks
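One possible reading of `tail hash = hmac(tx_id, block_id, name)`, sketched with the JDK's `javax.crypto.Mac`. How the three inputs are split between HMAC key and message is not specified in the discussion; using `blockId` plus `txId` as the key and the NFC-normalized name as the message is purely an assumption, as are the class and parameter names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.text.Normalizer;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class HmacTailSketch {
    // Assumed layout: key = blockId + ":" + txId, message = NFC(name).
    static String tail(String txId, String blockId, String name, int hexChars) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            byte[] key = (blockId + ":" + txId).getBytes(StandardCharsets.UTF_8);
            mac.init(new SecretKeySpec(key, "HmacSHA256"));

            // Normalize first so identical renderings get identical tails.
            String nfcName = Normalizer.normalize(name, Normalizer.Form.NFC);
            byte[] digest = mac.doFinal(nfcName.getBytes(StandardCharsets.UTF_8));

            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.substring(0, hexChars); // hexChars must be at most 64
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 should always be available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Fr\u00e4d#" + tail("tx-123", "block-456", "Fr\u00e4d", 8));
    }
}
```

Because the block id is not known until the block is mined, the registrant cannot grind for a specific tail in advance; only a miner willing to discard valid blocks could.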
yes, but an attack requires identical rendering input AND a hash collision
*prefix suffix hash collision
Think any of those collator modes will help?
that’s ultimately more to do with how to build search engines and human facing sortable tables of any sort
seems java would like to decompose and sort over more bytes - can make sense
and the strength values are for very complex tiered rulesets of sorting, such as some multi lingual academic library (books on shelves kind) would need
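For anyone following along, a rough sketch of what the strength tiers do in `java.text.Collator`, assuming the English locale and canonical decomposition. `CollatorStrengthSketch` and `sameAt` are hypothetical names used only here.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorStrengthSketch {
    // Returns true when the collator treats a and b as equal at the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Decompose accented characters before comparing them.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY looks at base letters only, so accents and case are ignored.
        System.out.println(sameAt(Collator.PRIMARY, "Fr\u00e4d", "frad"));
        // SECONDARY also distinguishes accents, but still ignores case.
        System.out.println(sameAt(Collator.SECONDARY, "Fr\u00e4d", "frad"));
        System.out.println(sameAt(Collator.SECONDARY, "Frad", "frad"));
        // TERTIARY distinguishes case as well.
        System.out.println(sameAt(Collator.TERTIARY, "Frad", "frad"));
    }
}
```

Note that `compare` only answers "do these sort as equal for this locale"; it never changes the stored string, which is why it suits search and sorting better than identity.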
and i’ve been against using collation as a tool, just normalize to NFC and tail tag with a hash
gitstyle, hard auth with long, most use cases good enough with short
and an input collision is fine as the human facing ones are still unique?
and an attack poc welcome on a prefix-suffix collision on something which’d render the same
the important thing to remember is that if it doesn't have a tail, it's not a monkey
f̬r̸̹ạ̴̳̱̻͔̗̞͠d͇̝̮̠̮̟̪̹́
you also have to remember that the naive user you are trying to protect would not pick anything with staggering diacritics
good, cause my collation currently can't handle them. :wink:
I'm thinking about what you've said, it makes a lot of sense
but I do want to collapse case at least, I think
Assert.assertEquals("fireduck", ForBenefitOfUtil.normalize("fireduck"), ForBenefitOfUtil.normalize("𝓕ire𝐃uc𝐤"));
That one is pretty obviously different, but if the font rendering is a little wonky, it could look quite similar.
Especially things like bold small letters vs regular
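A minimal sketch of a `normalize` that would satisfy the assertion above, assuming NFKC plus lowercasing is acceptable for the uniqueness key only. This is not the actual `ForBenefitOfUtil.normalize`; `NormalizeSketch` is a stand-in, and using the K form is exactly the choice argued against earlier in the discussion.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NormalizeSketch {
    // Hypothetical stand-in for ForBenefitOfUtil.normalize.
    static String normalize(String name) {
        // NFKC maps compatibility characters, such as the mathematical bold
        // and script letters, onto their plain counterparts.
        String folded = Normalizer.normalize(name, Normalizer.Form.NFKC);
        // Collapse case without depending on the JVM's default locale.
        return folded.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Script capital F (U+1D4D5), bold capital D (U+1D403), bold small k (U+1D424),
        // written as surrogate pairs so the source file stays ASCII.
        String fancy = "\uD835\uDCD5ire\uD835\uDC03uc\uD835\uDC24";
        System.out.println(normalize(fancy));      // fireduck
        System.out.println(normalize("fireduck")); // fireduck
    }
}
```

This catches styled-letter lookalikes but not cross-script ones: a Cyrillic "а" is a distinct letter with no compatibility mapping, so it survives NFKC unchanged.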