Lookup tables provide the essential function of helping you maintain data integrity in your database environment. For example, if you have users entering their gender into a data item, the table that contains the Gender item can reference a lookup table to verify that only the value M or F is used.
Most database environments have one lookup table to support each domain—a simple, scalable model. But recently I've read some articles proposing a new approach to lookup tables. Instead of one lookup table to support each domain, some database designers are advancing the idea of a single large lookup table that contains all valid values for every domain. The table uses some sort of indicator column so you know when and where to use the value in a column. Readers have asked me whether the one-table approach is better than the many-table technique. Let's take a look at how lookup tables function and why, in this case, breaking the rules of database design isn't a good idea.
Maintaining Integrity
First, let's review how you can use lookup tables throughout the database to restrict data entry while enforcing domain integrity. One of the most common uses for a lookup table is verifying the two-character state code. The entity relationship diagram (ERD) in Figure 1 shows a frequently used two-column version of the State lookup table. The first column in the State table is the two-character StateCode; the second column is StateLongName. While this is a good way to avoid mistakes on data entry ("Let's see, is Alaska AL or AK?"), it's also a way to enforce domain integrity. The domain of StateCode is the set of 50 two-character state abbreviations recognized by the US Postal Service. The domain of StateLongName is the 50 state names, from Alabama to Wyoming. No other combination of letters, numbers, and special characters is valid for state name.
Another name I use for lookup tables is reference tables because their data is referenced by columns in other tables. The traditional one-to-many relationship of the parent reference table, State, to the child table, Address, is also shown in Figure 1. Figure 2 shows the Joins tab of the Reference Properties window; you can see that the Address table's StateCode column is constrained by the set of values in the State table's StateCode column. So, for example, if you want to enter the value AK into the Address table's State-Code column, the value AK must already exist in the State lookup table's StateCode column. That's how a lookup or reference table works. Selecting the Mandatory parent check box on the Integrity tab, as Figure 3, shows, ensures that this relationship is enforced within the database.
Typically, more than one table in the database uses the same lookup table. In addition to the Address table, any other table that has a column containing the two-character state code could reference the State lookup table. For example, say this database also has a ShippingRate table that contains rates by state. If the ShippingRate table has a mandatory relationship with the State table, values in the State column of the ShippingRate table are constrained by the set of valid values for the StateCode column of the State lookup table.
Breaking the Rules
Now back to the original question of whether a one-table approach is better than using separate tables for each domain. While the single-table design might be a clever solution, it's not one that I would use for the simple reason that I like to design scalability into my databases. I much prefer individual lookup tables that enforce integrity in each distinct domain. By dumping all lookup values into one table, then adding a column (or using some other device) to decipher what the individual lookup values refer to, you violate three rules of designing for performance and data integrity:
- Always normalize your database.
- Design for performance and scalability.
- Simplify data management.
The following problems can arise when you break these rules.
You denormalize the lookup table. Remember that by the rule of normalization, each table contains data about one thing (one entity) and one thing only. While you might argue that a lookup table is only about reference data, and thus contains only one type of data, I think this is one level of abstraction too high for the real world. In the traditional approach, the data in a lookup table references only gender, only states, only room types, or whatever might be in a list of related items that you use to maintain the domain integrity of your database.
You create a performance bottleneck and limit scalability. Putting all lookup values, indicators, and codes into the same table requires every application that uses a drop-down or list box or that does a compare against the lookup table to access the same lookup table. This approach creates a "hot spot" on the disk on which the lookup table is located. That hot spot results in higher levels of locking and blocking, increased wait times, decreased performance, and an inability to scale beyond a certain point. "But," you say, "I'm running SQL Server 2005, and I can horizontally partition this large lookup table." However, SQL Server 2005 horizontal partitioning is meant for very large data sets. On smaller data sets (an all-in-one lookup table would be considered a small data set), horizontal partitioning might even be counterproductive. (I can't say for sure, because I haven't run any SQL Server 2005 partitioning benchmark trials yet.) Also partitioning isn't an option unless you're running SQL Server 2005 Enterprise Edition.