William Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26(2), Feb. 1983, 120-125. Also IBM Technical Report TR03.159, Aug. 1981. Also presented at SHARE 62, March 1984, Anaheim, California. Also in A.R. Hurson, L.L. Miller and S.H. Pakzad, Parallel Architectures for Database Systems, IEEE Computer Society Press, 1989. [12 pp]
Copyright 1996 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm
William Kentept 1982
> 1 INTRODUCTION . . . 2
> 2 FIRST NORMAL FORM . . . 2
> 3 SECOND AND THIRD NORMAL FORMS . . . 2
>> 3.1 Second Normal Form . . . 2
>> 3.2 Third Normal Form . . . 3
>> 3.3 Functional Dependencies . . . 4
> 4 FOURTH AND FIFTH NORMAL FORMS . . . 5
>> 4.1 Fourth Normal Form . . . 6
>>> 4.1.1 Independence . . . 8
>>> 4.1.2 Multivalued Dependencies . . . 9
>> 4.2 Fifth Normal Form . . . 9
> 5 UNAVOIDABLE REDUNDANCIES . . . 12
> 6 INTER-RECORD REDUNDANCY . . . 13
> 7 CONCLUSION . . . 13
> 8 ACKNOWLEDGMENT . . . 14
> 9 REFERENCES . . . 14
1 INTRODUCTION
The normalization rules are designed to prevent update anomalies and data inconsistencies
2 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields.
First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
We deal now only with "single-valued"
3.1 Second Normal Form
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
- The warehouse address is repeated in every record that refers to a part stored in that warehouse.
- If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
- Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
- If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
- The department's location is repeated in the record of every employee assigned to that department.
- If the location of the department changes, every such record must be updated.
- Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
- If a department has no employees, there may be no record in which to keep the department's location.
To satisfy third normal form, the record shown above should be decomposed into the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued)
3.3 Functional Dependencies
----------------------------------------------
| PERSON | ADDRESS |
-------------+--------------------------------
| John Smith | 123 Main St., New York |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------
---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------
----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------
However, in formal terms, there is no functional dependency here between FATHER'S-ADDRES
4 FOURTH AND FIFTH NORMAL FORMS
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.
4.1 Fourth Normal Form
The term "independent" will be discussed after considering an example.
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
(1) A disjoint format, in which a record contains either a skill or a language, but not both:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
(2) A random mix, with three variations:
(a) Minimal number of records, with repetitions:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
(b) Minimal number of records, with null values:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
- If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
- Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
- Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.
Fourth normal form minimizes such update problems.
4.1.1 Independence
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Thus the employee:skill and employee:langua
4.1.2 Multivalued Dependencies
- An employee uses certain skills on certain projects.
- An employee uses certain languages on certain projects.
If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.
4.2 Fifth Normal Form
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- | GM | truck | -------------------
---------------------
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Totota | car |
| Brown | Totota | bus |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------
- Jones sells cars and GM makes cars, but Jones does not represent GM.
- Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
- Brown represents Ford and Brown sells buses, but Ford does not make buses.
In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-pre
5 UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown Section 4.1.1, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.
6 INTER-RECORD REDUNDANCY
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------
Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].
7 CONCLUSION
- Single-valued vs. multi-valued facts.
- Dependency on the entire key.
- Independent vs. dependent facts.
- The presence of mutual constraints.
- The presence of non-unique or non-singular representations
.
And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.
8 ACKNOWLEDGMENTS
I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.
9 REFERENCES
- E.F. Codd, "A Relational Model of Data for Large Shared Data Banks", Comm. ACM 13 (6), June 1970, pp. 377-387.
The original paper introducing the relational data model.
- E.F. Codd, "Normalized Data Base Structure: A Brief Tutorial", ACM SIGFIDET Workshop on Data Description, Access, and Control, Nov. 11-12, 1971, San Diego, California, E.F. Codd and A.L. Dean (eds.).
An early tutorial on the relational model and normalization.
- E.F. Codd, "Further Normalization of the Data Base Relational Model", R. Rustin (ed.), Data Base Systems (Courant Computer Science Symposia 6), Prentice-Hall, 1972. Also IBM Research Report RJ909.
The first formal treatment of second and third normal forms.
- C.J. Date, An Introduction to Database Systems (third edition), Addison-Wesley,
1981. An excellent introduction to database systems, with emphasis on the relational.
- R. Fagin, "Multivalued Dependencies and a New Normal Form for Relational Databases", ACM Transactions on Database Systems 2 (3), Sept. 1977. Also IBM Research Report RJ1812.
The introduction of fourth normal form.
- R. Fagin, "Normal Forms and Relational Database Operators", ACM SIGMOD International Conference on Management of Data, May 31-June 1, 1979, Boston, Mass. Also IBM Research Report RJ2471, Feb. 1979.
The introduction of fifth normal form.
- W. Kent, "A Primer of Normal Forms", IBM Technical Report TR02.600, Dec. 1973.
An early, formal tutorial on first, second, and third normal forms.
- T.-W. Ling, F.W. Tompa, and T. Kameda, "An Improved Third Normal Form for Relational Databases", ACM Transactions on Database Systems, 6(2), June 1981, 329-346.
One of the first treatments of inter-relationa
l dependencies.
No comments:
Post a Comment