Why exactly would it be so bad to just put a suitable index on the table containing strings? The time complexity of the resulting search would be the same, so I assume there will be some constant factor slowdowns. Is it that indices over string fields are stored inefficiently on disk? (If so, can that not be fixed in the db engine directly?) Or is this fine today but wasn't fine 15 years ago?
I think the article describes the solution that was used and not necessarily the best solution. If they made the “obvious” mistake in the first case they mightn’t arrive at the best solution in the second. Adding the single index and using the four tables have the same worst case complexity for reads, O(log n), with different constants and massively beat scanning at O(n).
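To make the "single index beats scanning" point concrete, here's a small sketch using SQLite as a stand-in (the table and column names are hypothetical, not from the article): with a multi-column index in place, the query plan shows an index search rather than a full table scan.

```python
import sqlite3

# Hypothetical single-table layout with one composite index over all
# four lookup columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mail_log (ip TEXT, helo TEXT, m_from TEXT, m_to TEXT)"
)
conn.execute(
    "CREATE INDEX idx_lookup ON mail_log (ip, helo, m_from, m_to)"
)
conn.executemany(
    "INSERT INTO mail_log VALUES (?, ?, ?, ?)",
    [(f"10.0.0.{i}", "mx.example.com", "a@example.com", "b@example.com")
     for i in range(1000)],
)

# EXPLAIN QUERY PLAN confirms the lookup is an index SEARCH, not a SCAN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM mail_log "
    "WHERE ip = ? AND helo = ? AND m_from = ? AND m_to = ?",
    ("10.0.0.5", "mx.example.com", "a@example.com", "b@example.com"),
).fetchone()
print(plan[3])  # e.g. "SEARCH mail_log USING COVERING INDEX idx_lookup ..."
```

The same idea applies to MySQL with `EXPLAIN`; the point is just that either design gets you O(log n) reads instead of O(n).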
It may be that the second solution wasn’t the best, or that it was better for write performance or memory usage—this was 2002 after all. It may also be the case that the cardinality of some columns was low in such a way that the four-table solution was better, but maybe getting the columns in the right order for a multi-column index could do that too.
Either way, I don’t think it really matters that much and doesn’t affect the point of the article.
Early InnoDB* had pretty strict limits on varchar indexes and was not the most efficient. I don't remember the details, but it's entirely possible the single-table format Rachel described ran head-on into those limitations.
Also remember indexes take space, and if you index all your text columns you'll balloon your DB size; and this was 2002, when that mattered a lot more even for text. Indexes also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.
Finally, normalizing your data while using the file-per-table setting (also not the default back then) can additionally provide better write & mixed throughput due to the writes being spread across multiple file descriptors and not fighting over as many shared locks. (The locking semantics for InnoDB have also improved massively in the last 19 years.)
* Assuming the author used InnoDB and not MyISAM. The latter is a garbage engine that doesn't even provide basic crash/reboot durability, and was the default for years; using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.
> indexes take space, and if you index all your text columns you'll balloon your DB … also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.
Indexes didn’t go away with these extra tables; they live on the ‘id’ field of each one. They also probably had a UNIQUE constraint on the ‘value’ field, spending time on exactly what the second half of my citation describes.
I mean, that should have saved some space for non-unique strings, but all other machinery is still there. And there are 4 extra unique constraints (also an index) in addition to 4 primary keys. Unless these strings are long, the space savings may turn out to be pretty marginal.
Right, indexes don't "go away" as you normalize, but the amount of data indexed shifts drastically, and four unique indexes are pretty much guaranteed to be smaller than one massive combinatorial index of (all, four, possible, values). Not just where duplicates exist, but simply in terms of the database's overhead in things like field delimiters and page metadata. While the specifics vary a lot between DB implementations, with varchars involved you can usually assume a worst case in terms of field delimiting (and from there sometimes a worst case in average page size and how that impacts B-tree balancing).
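Some back-of-envelope arithmetic makes the size difference vivid (all row counts, cardinalities, and average lengths below are made-up illustrative numbers, not measurements):

```python
# Hypothetical workload: 1M rows, average string lengths per column,
# and modest distinct-value counts per column.
rows = 1_000_000
avg_len = {"ip": 15, "helo": 25, "m_from": 28, "m_to": 28}
distinct = {"ip": 50_000, "helo": 40_000, "m_from": 80_000, "m_to": 120_000}

# One composite index repeats every row's four full strings:
composite_index_bytes = rows * sum(avg_len.values())

# Four per-table unique indexes store each distinct string once, plus a
# 4-byte id per column per row in the main table's composite-id index:
normalized_bytes = (
    sum(distinct[c] * avg_len[c] for c in avg_len)  # unique string indexes
    + rows * 4 * len(avg_len)                        # index over four ids
)
print(composite_index_bytes, normalized_bytes)  # 96000000 23350000
```

Even ignoring page metadata and delimiter overhead entirely, the normalized layout's index footprint here is roughly a quarter of the composite one's.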
In general, indexes alone can't save you from a normalization problem. From certain points of view an index is itself another form of denormalization, and while it is a very handy form of denormalization, heavily relying on big composite indexes (and the "natural keys" they represent) makes normalization worse, and you have to keep such trade-offs in mind just as you would any other normalization planning.
It is theoretically possible a database could do something much smarter and semi-normalized with, for instance, trie-based "search" indexes, but specifically in 2002, so far as I'm aware, such indexes were never the default in most databases of the time, didn't have great multi-column support, and would have been expensive to compute if you did turn them on. Even such "smart" indexes would likely still have suggested you normalize first.
To me it seems that spreading it over four tables would lead to a lot more potential read locks: the big combined table waits on a join against each of the others while some process is trying to insert into the others and link them to the main table. This assumes they were using foreign keys and that the main table's 4-column index was unique.
InnoDB uses row-level locking, and foreign keys are (usually) a great feature to ensure data integrity. But using multiple foreign keys from tables `a`,`b` as a composite index for table `x` can cause deadlock if both are being updated in rapid succession, because an update on `a` gets a lock on `x`, which needs a read lock on `b` which is waiting for the lock from `a` to be released. I try to never structure multiple foreign keys as a composite index.
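The deadlock shape described above is the classic one: two writers taking the same locks in opposite orders. InnoDB row locks behave differently from in-process mutexes, but the standard fix is the same in both worlds, namely a fixed global lock ordering. A toy sketch with Python threads (an analogy only, not InnoDB semantics):

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()

def update(locks, results, name):
    # Acquire in a fixed global order (here: by id()) so two concurrent
    # updaters can never each hold one lock while waiting for the other.
    for lock in sorted(locks, key=id):
        lock.acquire()
    try:
        results.append(name)  # stand-in for the row updates on a, b, x
    finally:
        for lock in locks:
            lock.release()

results = []
# Each caller names the locks in a different order, but the sort above
# means they are always taken in the same order, so neither can deadlock.
t1 = threading.Thread(target=update, args=([lock_a, lock_b], results, "t1"))
t2 = threading.Thread(target=update, args=([lock_b, lock_a], results, "t2"))
t1.start()
t2.start()
t1.join()
t2.join()
print(sorted(results))  # ['t1', 't2']
```

In SQL terms the equivalent discipline is updating parent tables in a consistent order across all transactions, or avoiding the composite foreign-key index entirely as the comment suggests.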
Indexes take space, except for the clustered index.
An important distinction.
If the point of this table is always selecting based on IP, m_from, m_to, then clustering on those columns in that order would make sense.
Of course, if it's expected that a lot of other query patterns will exist then it might make sense to only cluster on one of those columns and build indexes on the others.
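As a rough sketch of what clustering on those columns means: in SQLite, a WITHOUT ROWID table is stored ordered by its primary key, which is loosely analogous to an InnoDB clustered index (the schema below is hypothetical). A lookup on the leading column then searches the table itself, with no separate index structure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table's storage IS the (ip, m_from, m_to, ts) B-tree.
conn.execute(
    "CREATE TABLE mail_log (ip TEXT, m_from TEXT, m_to TEXT, ts INTEGER, "
    "PRIMARY KEY (ip, m_from, m_to, ts)) WITHOUT ROWID"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM mail_log WHERE ip = ?", ("192.0.2.1",)
).fetchone()
print(plan[3])  # a SEARCH on the primary key, no secondary index needed
```

In InnoDB the table is always clustered on the primary key, so "clustering on those columns" just means choosing them as the primary key in that order.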
But IP addresses are actually _numbers_, and you can also store them as bytes: 32 bits for IPv4, 128 bits for IPv6 addresses. (PostgreSQL has a native datatype 'inet' which has supported IPv6 since 7.4; if you're using Postgres and save an IP address in a character string, you're doing it wrong.) If you stored the IP as a number and put an index on it, all your queries would be blazing fast!
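The conversion is trivial in most languages; in Python the standard-library `ipaddress` module does it both ways:

```python
import ipaddress

# IPv4 fits in 32 bits, IPv6 in 128: both compare and index as plain ints.
ip4 = ipaddress.ip_address("192.0.2.1")
ip6 = ipaddress.ip_address("2001:db8::1")
n4, n6 = int(ip4), int(ip6)

print(n4)                        # 3221225985
print(ipaddress.ip_address(n4))  # round-trips back to 192.0.2.1
assert ip4.packed == (3221225985).to_bytes(4, "big")
```

MySQL has had INET_ATON()/INET_NTOA() for the same IPv4 conversion since well before 2002, so this option was available to the article's team too.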
> using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.
The author mentions that these events took place in 2002. At that time the site was likely still running MySQL 3, and InnoDB was very new and considered experimental. Sure, MySQL 4.0 had shipped in 2002 and InnoDB was a first-class citizen, but back then upgrades from one version of MySQL to another weren't trivial tasks. You also tended to wait until the .1 release before making that leap.
So in fairness for the folks who originally set up that database MyISAM was likely their only realistic option.
I looked after and managed a fleet of MySQL servers from back then and for many years afterwards and even then it wasn't until MySQL 5 that we fully put our trust in InnoDB.
I had the same initial reaction in reading the post. My assumption was that indexing would have sufficiently sped up the query speed problem.
However, normalizing the fields that contain repetitive data into separate tables could create significant space savings, since the full text for each column would not need to be stored for each row. Instead of 20-30 bytes per email address, a 4-byte (assuming the 32-bit era) OID is stored in its place.
It's pretty easy to imagine how quickly the savings would add up, and that it would be very helpful in the era before SSDs or even 1TB HDDs existed.
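Running the numbers from the comment above (the row and distinct-address counts are hypothetical, only the ~28-byte address and 4-byte id figures come from the comment):

```python
rows = 10_000_000
distinct_addresses = 200_000
addr_bytes = 28  # a typical email address length

# Denormalized: the full string is stored in every row.
denormalized = rows * addr_bytes
# Normalized: a 4-byte id per row, plus each distinct string stored once.
normalized = rows * 4 + distinct_addresses * addr_bytes

print(denormalized // 2**20, "MiB vs", normalized // 2**20, "MiB")
```

For one column alone that's roughly 267 MiB shrinking to about 43 MiB under these assumptions, which on 2002-era disks (and in 2002-era buffer pools) was a very real saving.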
Normalization is important for deduplication, not only for indexing and comparing a few short numbers instead of a few long strings: those host names and email addresses are long and often repeated.
But if it’s guaranteed to be 1:1, why? The two implementations (normalized and denormalized) in that case should be completely isomorphic, quirks of the DB engine aside.
Yeah, while this is not optimal (the IP should be converted to an integer, the time should be a timestamp), the table would be small (as old entries could be safely deleted). The only real issue is the lack of indices.
Thanks! I was really surprised to find it myself. I used HN in some sample code for analyzing how SSL tickets are reused and was surprised to see an IPv4 address come back.
Yes. A proper mail implementation won't randomize it, so it should be consistent. Depending on the design of the validation stages, it might only need a case-normalized lookup into the string index table, plus that id as part of the final multi-column unique index; in that last case, the unique index exists only to convey to the database the expectations to optimize around (and to require the data to adhere to).
Let's say the article is a critique of some system by someone who is not much better at databases. As many people have also pointed out, a bad solution was replaced with another bad solution that works better for someone.