Use this technique to improve system performance and accuracy

Database normalization, or data normalization, is a technique to organize the contents of the tables for transactional databases and data warehouses. Normalization is part of successful database design; without normalization, database systems can be inaccurate, slow, and inefficient, and they might not produce the data you expect.

Following a successful SQL Server installation (or any database management system installation), you'll have to create a database to hold the data (see SQL Server Books OnlineBOL—for more information about how to create a database). After you've created the database framework, you must organize the files in such a way that you can easily manage them. The primary tool to help organize the data is the table, which looks like a two-dimensional structure that encompasses rows and columns.

A database table looks like a spreadsheet. One item of data, such as a first name, a last name, a phone number, or a street address, appears in each box, or cell, at each intersection of a row and column. Database designers use the term atomicity to describe this organization of data into one data item in each cell.

When you normalize a database, you have four goals: arranging data into logical groupings such that each group describes a small part of the whole; minimizing the amount of duplicate data stored in a database; organizing the data such that, when you modify it, you make the change in only one place; and building a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage.

Data normalization helps you design new databases to meet these goals or to test databases to see whether they meet the goals. Sometimes database designers refer to these goals in terms such as data integrity, referential integrity, or keyed data access. Ideally, you normalize data before you create database tables. However, you can also use these techniques to test an existing database.

Data normalization is primarily important in the transactional, or online transactional processing (OLTP), database world, in which data modifications (e.g., inserts, updates, deletes) occur rapidly and randomly throughout the stored data. In contrast, a data warehouse contains a large amount of denormalized and summarized data—precalculated to avoid the performance penalty of ad hoc joins. In a data warehouse, updates happen periodically under extremely controlled circumstances. End users' updates to data in data warehouses are uncommon. This article addresses the normalization of OLTP data.

Data and Information
Data and information are terms people use interchangeably in everyday speech, but they mean different things. Data are raw facts, such as a name, a number, or a date. Information is organized, processed data. A data item (e.g., the date 7/15/99) means little. When you associate the data with other data items, such as a deadline and a subject, you can create information. For example, the deadline for your next project might be 7/15/99. You store data in a database; you retrieve information from the database.

One cornerstone of database design and data normalization is that data organization for storage differs from the information most people want to see. For example, a manager of a sporting goods supply company might want to see, for one sale, who the customer was, the destination of the order, the billing address, the contact phone number, the placement time of the order, the order's shipping destination, when and how delivery occurred, what articles the order included, and which of the company's employees was the key person in the sale. The organization of data in the company's database differs from the particular information the manager wants. Screen 1 (page 66) shows a diagram of the autogenerated Order Entry database from Microsoft Access 97. I'll use this database to illustrate the concepts in this article.

Business Rules
You need to conduct research at your company before you can begin to normalize a database. You need to perform a requirements analysis, which will identify policies and procedures and will list the business rules for them. You must have consensus on what the rules mean. By consensus, I mean that everyone who uses the database must agree on the definition and the use of these data items. Without consensus, if you ask three people in the company to define what customer means, you might get three different answers. To one person, a customer is the company that buys products and services. To a second person, the customer is the contact person for the company who buys product and services. To a third person, the customer is someone who might be interested in buying products and services. Some terms are standard, but under no circumstances can you assume the definition or meaning of a term. Confirm meanings of terms, and confirm how your company uses these terms.

You can use schemes, or methodologies, to guide you through the requirements-analysis phase of a database design. Think of schemes as playbooks or recipes for database design. If you are using a methodology and like it, continue to use it, no matter how small or insignificant the database design project. If you don't have a favorite methodology, you might want to explore techniques such as Bachman, IDEF1X (favored by the Department of Defense), or the new object-oriented Unified Modeling Language (UML—the scheme for the Microsoft Repository). If you're a Visual Basic (VB) programmer, you can download a free entry-level Visual Modeler from http://www.microsoft.com/vstudio. Using a scheme helps you analyze and design your database.

Relationships
After you know what tables you need, the next step in data normalization is to understand relationships and how they pertain to the data you're working with. A database is an organized, integrated collection of data items. The integration is important; data items relate to other data items, and groups of related data items—called entities—relate to other entities. The relationships between entities can be one of three types, one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N).

Binary relationships, those relationships that involve two entities, are the basis for all other relationship sets. Figure 1 displays these relationships through a technique called entity relationship modeling (ERM). Entities are of two types: noun-type entities and verb-type entities. Noun-type entities represent people, places, and things. Verb-type entities represent actions and interactions between the noun-type entities.

ERM is a way to graphically represent the architecture of a database and to model the informational requirements. You can choose among many entity-modeling schemes. Personal preference, flexibility, functionality of the scheme (what it will let you do and how it will limit you) and established corporate standards will help you choose a scheme. Figure 1 uses the crow's foot scheme, or methodology. Screen 1 uses Microsoft's proprietary ERM methodology.

So many different entity modeling methodologies exist that there is not a true standard. However, one commonly accepted rule for good modeling is that entity names are always expressed in the singular. You can identify the type of relationship between singular entity names more easily than you can determine the relationship between plural entity names. A second, less common, rule is that entity names are always capitalized.

The result of your entity modeling efforts is an entity-relationship diagram (ERD). Figure 1 shows three small ERDs, each of which represents a binary relationship between two entities. A rectangle represents an entity. A line with symbols at each end joins two entities. The symbols are cardinality indicators, each of which shows how many of one entity relates to how many of another entity. The symbols differ from one methodology to another.

The one-to-one (1:1) relationship means that each instance of one entity (CUSTOMER) relates to one instance of a second entity (CREDIT_CHECK). Each CREDIT_CHECK relates back to one CUSTOMER.

The one-to-many (1:M) relationship means that each instance of one entity (ORDER) relates to one or more instances of a second entity (PAYMENT). Each PAYMENT relates back to one ORDER.

The many-to-many (M:N) relationship means that many instances of one entity (ORDER) relate to many instances of a second entity (PRODUCT). Stated a different way, an ORDER can include many PRODUCTS, and a PRODUCT can be part of many ORDERS. The M:N relationship is an interaction between two noun-type entities, and it results in the creation of a verb-type entity (INCLUDES). This verb-type entity is a gerund. A gerund is a verb or action word that takes a noun form. A 1:M relationship exists between each of the noun-type entities and the verb-type entity. A M:N relationship is a pair of 1:M relationships, one in either direction. The verb-type entity (INCLUDES) captures the data that results from the two noun-type entities interacting with each other.

Entities in an ERD eventually become tables in a database. You use relationships between data items and groups of data items to test the level of normalization in a database design.

Primary and Foreign Keys
A primary key (pkey) is an attribute (column) in a table that serves a special purpose. The data items that make up this attribute are unique; no two data item values are the same. The pkey value serves to uniquely identify each row in the table. You can select a row by its pkey value for an operation. Although the SQL Server environment doesn't enforce the presence of a pkey, Microsoft strongly advises that each table have an explicitly designated pkey.

Each table has only one pkey, but the pkey can include more than one attribute. You can create a pkey for the table in Screen 2 by combining CustID, OrderID, and ProductName. We call this combination a concatenated pkey.

A foreign key (fkey) is an attribute that forms an implied link between two tables that are in a 1:M relationship. The fkey, which is a column in the table of the many, is usually a pkey in the table of the one. Fkeys represent a type of controlled redundancy.

First Normal Form
You now have a basic familiarity with data, relationships, information, and business rules. Now let's understand the first three levels of normalization.

Screen 2 shows a simulated, poorly normalized table, which I'll call AllData. AllData is data from the OrderEntry database that I have reorganized for this example. A fictitious manager of a sporting goods supply company requested one order, customer identity, the order's shipping destination, billing address, contact phone number, placement time of the order, how and when delivery occurred, what articles were in the order, and which of the company's employees was the key person in the sale. Screen 2 shows the data rows in the ORDER table in a poorly normalized database. The level of redundancy is high.

When you normalize, you start from the general and work toward the specific, applying certain tests along the way. Users call this process decomposition. Decomposition eliminates insertion, update, and deletion anomalies; guarantees functional dependencies; removes transitive dependencies, and reduces non-key data redundancy. We'll decompose the sample data in Screen 2 from First Normal Form (1NF) through Second Normal Form (2NF) into Third Normal Form (3NF).

For a table to be in 1NF you need to ensure that the data is atomic, having no repeating groups. A concatenated pkey characterizes a 1NF table.

Atomic data is a form of minimalism for data items. A data item is atomic if only one item is in each cell of a table. For example, in the AllData table in Screen 2, the attribute ShippedTo encompasses a street address, a city, a state, a postal code, and a country abbreviation. To render this data atomic, you separate this single attribute into several—ShipAddr, ShipCity, ShipState, ShipZip, and ShipCountry. You must do the same separation with the BilledTo attribute. Then, the data in the AllData table is atomic.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

I like the meat of the article. The author has something to say. For me, I'm always looking at defining normalization in it's simplest terms to myself and others in the business world. What did detract from the article was that 1NF, 2NF, and 3NF figures and screens did not display primary key and the text was ambiguous as to the primary key for many of the illustrations. Primary keys are a pivotal point of normalization. Confuse the reader here and you've lost them for the rest of the article. Consider updating this article for clarity. It has good content.

Mark Solomon

It amazes me that after all this data normalization, city and state remain in the address table, when they're dependent on the zip code.

Wil Cooley

Great Article! I would recommend Michelle present an article in the SQL Server Magazine each month about a different Database Design. To many people use the Customer/Orders Database Design as an example.

Steve

i am so gland when i get it.i make my assingment from this so it is so help full from me. thanks who make that notes.

ABDUL REHMAN

Very useful information was given. This will help many people in the world.

Anonymous User

Article Rating 5 out of 5

thanks...i can finish my assignment...thanks for the useful info....really....its availablity makes me really interesting in SQL Server....

Anonymous User

Article Rating 5 out of 5

nbnm

Anonymous User

Article Rating 4 out of 5

 
 

ADS BY GOOGLE