about utf8sten

A bit of history

if you've ever typed or pasted a large amount of text on a social media platform, you probably know how frustrating it is when your text is only a few symbols over the character limit.

it happened to me when i wrote a bio for my telegram account and it came out a little too long. and then i remembered utf-8 encoding and unicode.

More technical part

an ascii character takes just 1 byte (8 bits), but unicode has way more characters, like a LOT more than ascii, so one unicode codepoint can take more than 1 byte when encoded as utf-8. BUT, in most messengers a unicode character still counts as 1, the same as a single-byte ascii character. and then i thought: "why don't i just store multiple ascii characters/bytes in one codepoint?", and that was it!
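a quick way to see the difference between "characters" and "bytes" is the small check below. it is only a sketch: it assumes the platform counts codepoints the way python's len() does, which is not guaranteed for every messenger. the variable names are made up for the example.

```python
# U+8000 is the first codepoint of the range used later in this text.
symbol = "\u8000"
ascii_char = "a"

# what a codepoint-based character counter sees
symbol_chars = len(symbol)
ascii_chars = len(ascii_char)

# what the same text costs in bytes once encoded as utf-8
symbol_bytes = len(symbol.encode("utf-8"))
ascii_bytes = len(ascii_char.encode("utf-8"))

print(symbol_chars, symbol_bytes)  # 1 3
print(ascii_chars, ascii_bytes)    # 1 1
```

both count as one character, but the unicode symbol has room for more than one byte of payload.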

i found the range of codepoints U+8000 - U+8FFF, where all codepoints are valid. that's 4096 codepoints, so each one can carry 12 bits, i.e. 1.5 bytes. awesome, now we can store 3 bytes in 2 unicode symbols, so the formula for the approximate output length is "y=(x*2)/3", where x is the number of input bytes and y is the number of output symbols.
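the packing idea can be sketched roughly like this. it is not utf8sten's actual code: the function names are made up, and the handling of leftover bytes (inputs whose length is not a multiple of 3) is my own assumption. here the final partial symbol is taken from two hypothetical extra ranges, U+7000 - U+70FF for 8 leftover bits and U+7100 - U+710F for 4 leftover bits, so the decoder knows how many bits the last symbol carries.

```python
BASE = 0x8000   # start of the 4096-codepoint range from the text (12 bits each)
TAIL8 = 0x7000  # assumption: final symbol carrying 8 leftover bits
TAIL4 = 0x7100  # assumption: final symbol carrying 4 leftover bits


def encode(data: bytes) -> str:
    """Pack every 3 bytes (24 bits) into 2 symbols of 12 bits each."""
    out = []
    full = len(data) - len(data) % 3
    for i in range(0, full, 3):
        n = int.from_bytes(data[i:i + 3], "big")
        out.append(chr(BASE + (n >> 12)))
        out.append(chr(BASE + (n & 0xFFF)))
    tail = data[full:]
    if len(tail) == 1:
        # 8 leftover bits fit in one tail symbol
        out.append(chr(TAIL8 + tail[0]))
    elif len(tail) == 2:
        # 16 leftover bits: one full 12-bit symbol plus a 4-bit tail symbol
        n = int.from_bytes(tail, "big")
        out.append(chr(BASE + (n >> 4)))
        out.append(chr(TAIL4 + (n & 0xF)))
    return "".join(out)


def decode(text: str) -> bytes:
    """Reverse of encode(): collect bits and emit whole bytes."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE <= cp <= BASE + 0xFFF:
            value, width = cp - BASE, 12
        elif TAIL8 <= cp <= TAIL8 + 0xFF:
            value, width = cp - TAIL8, 8
        elif TAIL4 <= cp <= TAIL4 + 0xF:
            value, width = cp - TAIL4, 4
        else:
            raise ValueError(f"unexpected codepoint U+{cp:04X}")
        bits = (bits << width) | value
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    return bytes(out)


encoded = encode(b"hello world")
print(len(b"hello world"), len(encoded))  # 11 8
print(decode(encoded))                    # b'hello world'
```

with this tail scheme the output is always ceil(2x/3) symbols, which matches the approximate formula above.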

but we can go further: we can compress (e.g. with gzip/zstd) our data/text and then send it, right? well, not directly. messengers usually sanitize your text so it doesn't contain non-text characters, but now we can store these raw bytes as unicode text! but why not just use base64? we can, but it takes more characters than the compressed bytes (4 characters per 3 bytes), while utf8sten takes fewer characters (but more bytes) than the number of bytes you encode.
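to make the base64 comparison concrete, here is a rough size estimate using only the python standard library. it doesn't call utf8sten itself: the utf8sten numbers are computed from the 12-bits-per-symbol idea and assume every symbol takes 3 bytes in utf-8, which holds for the U+8000 - U+8FFF range. the sample text and variable names are made up.

```python
import base64
import gzip

# hypothetical input: some repetitive text that compresses well
sample = ("this is my very long bio text. " * 20).encode("utf-8")
packed = gzip.compress(sample)

# base64: 4 ascii characters per 3 input bytes, 1 byte per character
b64_chars = len(base64.b64encode(packed))
b64_bytes = b64_chars

# utf8sten-style packing: about 2 symbols per 3 input bytes, 3 bytes per symbol
sten_chars = (2 * len(packed) + 2) // 3  # ceil(2x/3)
sten_bytes = sten_chars * 3

print("compressed bytes:", len(packed))
print("base64:   chars =", b64_chars, " bytes =", b64_bytes)
print("utf8sten: chars =", sten_chars, " bytes =", sten_bytes)
```

the character count goes down compared to base64, while the byte count goes up, which is exactly the trade-off described above.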

utf8sten will not, and cannot, replace base64/32 because its output takes more space in bytes. it was made specifically to produce fewer unicode characters/symbols than input bytes. just use the right tool.

TL;DR

with utf8sten you can overcome content length limits on some social media platforms
the approximate formula for the number of output symbols is "y=(x*2)/3", where x is the number of input bytes and y is the approximate number of symbols holding the encoded data
if you want to keep it simple you can just encode your text
if you need or want to squash/compress as much as possible, you can use gzip/zstd or any other compression tool and then encode its output using utf8sten
you can encode whatever binary data you want
utf8sten is not a replacement for base64/32, just use the right tool for your needs
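the length formula can also be turned around to answer "how many bytes fit under a given limit?". the helpers below are a small sketch with made-up names, and the limit of 70 is just an example value, not a claim about any particular platform. rounding up in symbols_needed is an assumption about how leftover bytes are handled.

```python
def symbols_needed(num_bytes: int) -> int:
    """y = (x*2)/3, rounded up, since each symbol carries 12 bits."""
    return (num_bytes * 2 + 2) // 3


def bytes_that_fit(symbol_limit: int) -> int:
    """Inverse direction: each symbol holds 1.5 bytes, rounded down."""
    return symbol_limit * 3 // 2


limit = 70  # hypothetical character limit
print(bytes_that_fit(limit))  # 105
print(symbols_needed(100))    # 67
```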

and now i can finally fit my text in a message and a bio