Tokyo Cabinet offers a performant on-disk b+tree implementation that I'm using as a cornerstone for a number of projects. The b+tree is interesting compared to TC's more popular hash database because it supports range queries and key ordering. TC's implementation even allows you to choose a number of different comparison functions, or to write your own.
I had some trouble with the built-in comparison functions' documentation. Does
tccmplexical mean sorting in alphabetic order? Can it be applied to non-ASCII keys? What are the edge behaviors? I dug into TC's source to find some clarity.
tccmplexical sorts keys in order of unsigned char value, one char at a time, for each key. 'aaa' comes before 'aab', and '\x01\x02' comes after '\x01\x01', where the "\x" means hex-format. Further, partial ties got to the longer key, so 'aaa' comes before 'aaaa'. I had assumed these were the case, but am glad to confirm that there isn't anything crazy going on. Refer to Tokyo Cabinet source version 1.4.46, myconf.h, line 531.
tccmpint64 don't work exactly as I'd hoped. These functions do not take into account anything but the first 32 or 64 bits of the key, respectively. Let that sink in for a moment. The keys '\x00\x00\x00\x00abc' and '\x00\x00\x00\x00' are judged equal, using
tccmpint32, as are '\x00\x00\x00\x00\x01\x02\x03\x04' and '\x00\x00\x00\x00\x05\x06\x07\x08'. This really kills the useful of key prefix searches.
Further, the comparison only happens in native-byte order. If you have a 1 byte key - '\x01' - the key will be zero-filled to
sizeof(int32_t) - '\x01\x00\x00\x00' - and on some machines will be interpreted as 1, on other 2^24, or 16,777,216. Kind of a bummer. I don't recommend these functions if you can avoid them. To see for yourself, refer to the Tokyo Cabinet source, version 1.4.46, tcutil.c, line 9793.
tccmpdecimal is an interesting fellow. The function attempts to read the string key as a decimal. First, it checks for both keys' signs, then reads the entire integer part of each (until the end of string, or the decimal is found). It returns the comparison of the integer parts, or continues on equality. The function then reads fractional part of the key, sticking each key's fractional part into a float, and returns the comparison of the two.
Aside - this seems like a strange choice. This solution introduces the error associated with floating point arithmetic without any obvious gain. Instead, a lexical digit-only search after the decimal place would make more sense. This should work (unlike a lexical sort for the entire decimal number) because we know the digit after the decimal represent a fixed place.
Anyway, if the function finds any unusual characters in unusual places, it resorts to a lexical sort of the entire key.
To judge the code for yourself, refer to the Tokyo Cabinet source, version 1.4.46, tcutil.c, line 9706.
Why does this matter?
For most projects, the standard lexical sort works perfectly- but for those of us using Tokyo Cabinet to power another DBMS, or for use in tandem with other systems that rely on numerical ids, it might be worth it to write our own comparison functions.
If you liked this post, give me some love on Hacker News. Corrections or critique? Comment!