Better in what sense? I put some thought into this when designing an object serialization library modelled like a binary JSON.
When it came to string encoding, I had to decide between null-terminated and length + data. The former is very space-efficient, particularly when you have a huge number of short strings. And let’s face it, that’s a common enough scenario. But it’s nice to have the length beforehand when you are parsing the string out of a stream.
What I did in the end was come up with a variable-length integer encoding that somewhat resembles what they do in UTF-8. It means that for strings shorter than 128 characters, the length is a single byte. Longer than that and more bytes get used as necessary.
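To make the trade-off concrete, here is a minimal Python sketch of the two framings for short strings. The function names are made up for illustration, and the length-prefixed version only covers the single-byte case (payloads under 128 bytes, counted after UTF-8 encoding, which is an assumption on my part); it is not the library's actual format.

```python
import io


def write_cstring(text: str) -> bytes:
    """Null-terminated framing: payload followed by a single 0x00 byte."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("null-terminated strings cannot contain NUL bytes")
    return data + b"\x00"


def read_cstring(stream) -> str:
    """The reader has to scan byte by byte until it hits the terminator."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended before the terminator")
        if b == b"\x00":
            return buf.decode("utf-8")
        buf += b


def write_short_prefixed(text: str) -> bytes:
    """Length + data framing, single-byte length only (hypothetical helper)."""
    data = text.encode("utf-8")
    if len(data) >= 128:
        raise ValueError("this sketch only handles payloads under 128 bytes")
    return bytes([len(data)]) + data


def read_short_prefixed(stream) -> str:
    """The reader knows the size up front and can fetch the payload in one read."""
    header = stream.read(1)
    if not header:
        raise EOFError("stream ended before the length byte")
    length = header[0]
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside the payload")
    return data.decode("utf-8")
```

For short strings both framings cost one byte of overhead, so under these assumptions the length prefix gives up nothing on size while letting the parser allocate and read the payload in one go.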
a variable-length integer encoding that somewhat resembles what they do in UTF-8. It means that for strings shorter than 128 characters, the length is a single byte. Longer than that and more bytes get used as necessary.
What you used might be similar to unsigned LEB128, which is used in DWARF, WebAssembly, Android’s DEX format, and protobuf. It encodes 7 bits of the number in each byte, least significant group first, with the high bit set to 1 in every byte except the last one.
Unlike UTF-8, though, the number’s length isn’t encoded in the first byte but is instead implied by the final byte, which arguably makes the number’s encoding similar to a terminated string.
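For anyone who hasn't seen it, unsigned LEB128 is short enough to sketch in a few lines of Python. The function names here are just for illustration, and this may or may not match the encoding the parent comment actually ended up with.

```python
from typing import Tuple


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow: set the high bit
        else:
            out.append(byte)         # final group: high bit stays clear
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one value starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):        # a clear high bit marks the last byte
            return result, offset
        shift += 7
```

Values below 128 fit in one byte, so `encode_uleb128(127)` is `b"\x7f"` while `encode_uleb128(128)` is `b"\x80\x01"`. The decoder only learns where the number ends when it reaches a byte with the high bit clear, which is the "terminated" behaviour described above.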