Roles and Activities > Developer Role Set > Database Designer > Database Design
The steps presented below assume a relational data model. The steps for an object database are similar in nature but differ in the details.
The reader is assumed to be familiar with database concepts and terminology, as covered by references such as [DAT99].
Once the structure of the persistent design classes (design classes whose persistent property is set) stabilizes, map them into the data model. The guidelines presented above describe a coarse-grained strategy for performing the mapping, but ultimately a framework and tools are needed to perform the conceptual mapping and to generate the calls that carry it out at run-time. Tools and run-time data access frameworks become essential once the work becomes highly iterative, as it is impractical to manage the mapping process and the related interface code generation without a reliable framework and tool set.
Objects which are retrieved together can be stored together for improved performance.
In the case of a relational data model, the initial mapping generally yields a simple class-to-table mapping. If objects from different classes need to be retrieved at the same time, the RDBMS uses an operation called a table join to retrieve the rows related to the objects of interest. For frequently accessed data, join operations can be very computationally expensive. To eliminate the cost of the join, a standard relational technique is de-normalization.
De-normalization combines columns from two or more tables into a single table, effectively pre-joining the information. It trades more expensive update operations for less expensive retrieval operations. De-normalization also slows queries that are interested in the attributes of only one of the objects effectively joined in the de-normalized table, since all attributes are retrieved on every query. But where the application normally wants all of the attributes, the performance improvement can be significant.
De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the cost of non-join queries. Limiting de-normalization to two tables is a good policy unless strong and convincing evidence can be offered on the benefits.
De-normalization can be inferred from the Design Classes in cases where classes are nested. Nested classes should always be mapped to a de-normalized table.
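The trade-off above can be sketched concretely. The following is a minimal illustration using SQLite as a stand-in RDBMS; the table and column names (customer, orders, orders_denorm) are invented for the example and do not come from the process description.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized: retrieving an order with its customer name requires a join.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Acme')")
cur.execute("INSERT INTO orders VALUES (100, 1, 250.0)")
joined = cur.execute(
    "SELECT o.id, c.name, o.total FROM orders o "
    "JOIN customer c ON c.id = o.customer_id"
).fetchone()

# De-normalized: the customer name is pre-joined into the order row,
# trading cheaper reads for more expensive updates (the name is duplicated
# on every order row and must be changed everywhere if it changes at all).
cur.execute(
    "CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, customer_name TEXT, total REAL)"
)
cur.execute("INSERT INTO orders_denorm VALUES (100, 'Acme', 250.0)")
flat = cur.execute("SELECT id, customer_name, total FROM orders_denorm").fetchone()

assert joined == flat == (100, 'Acme', 250.0)
```

Both queries return the same row; the de-normalized form simply avoids the join at the cost of redundant storage and costlier updates.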
Some object databases allow a concept similar to de-normalization, in which related objects are clustered together on disk and retrieved in single operations. The concept in use is similar: to reduce the object retrieval time by reducing the work the system must do to retrieve related objects from the database.
In some cases, optimizing the data model may unmask problems in the Design Model, either performance bottlenecks, poor modeling, or incomplete designs. In this event, discuss the problems with the Designer of the class, triggering change requests where appropriate.
Once the table structure has been designed, determine the types of queries that will be performed against the data. Indexing is used by the database to speed access. The following indexing strategies should be considered:
Indexes are not free, however; the more indexes on a table, the longer inserts and updates take to process. The following precautions should temper the use of indexes:
Many databases offer a choice of index types. The most common include:
The choice of indexing strategy can have a large impact on performance, as can the timing of index creation. Bulk data-loads should be performed without indexes (drop the index, load the data, then re-create the index). The reason is that as each row is added, the index structure is re-balanced; since subsequent rows change the optimal index structure, the work done re-balancing the index on each insert is largely wasted. It is faster and more efficient to load the data without indexes and re-create the index once the load is done. Some databases provide bulk data-loaders that do this automatically.
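The load-then-index sequence can be sketched as follows. This is an illustrative sketch against SQLite (real bulk loaders and DBMS-specific utilities differ); the table and index names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
rows = [(i, i * 0.5) for i in range(50_000)]

# Load all the data first, with no index in place, so no per-row
# re-balancing work is done during the load...
cur.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# ...then build the index in a single pass over the loaded data.
cur.execute("CREATE INDEX idx_measurements_id ON measurements (id)")
con.commit()

# The index is now available to speed up subsequent queries.
hit = cur.execute("SELECT value FROM measurements WHERE id = 42").fetchone()
assert hit == (21.0,)
```

Reversing the two steps (index first, then the 50,000 inserts) produces the same final state but pays the re-balancing cost on every row.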
Databases perform I/O not on rows or records, or even whole tables, but on disk blocks. The reason for this is simple: block I/O operations are usually optimized in software and hardware on the system. As a result, the physical organization of the tables and indexes in the database can have a dramatic impact on the performance of the system. There are several dimensions:
Disk Page Density
The density of disk pages depends on the extent to which the data is expected to change over time. Without getting into the details, a less-dense page is more capable of accepting changes in values or additions of data over time, while a fuller data page provides better read performance, since more data is retrieved per block read.
To simplify disk management, group tables by the extent to which they tend to change. Three groups is a good start: highly dynamic, somewhat dynamic, and mostly static. The highly dynamic tables should be mapped onto disk pages that have a good deal of empty space in them (perhaps 30%), the somewhat dynamic onto pages with less empty space (perhaps 15%), and the mostly static onto pages with very little empty space (perhaps 5%). The indexes for the tables should be mapped similarly.
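The grouping guideline above amounts to a simple lookup. The helper below is a hypothetical sketch (the group names, the 8 KB page size, and the function name are assumptions for illustration; the percentages are those suggested above).

```python
# Target free space per disk page for each change-frequency group,
# using the percentages suggested in the guideline above.
FREE_SPACE_TARGET = {
    "highly_dynamic": 0.30,
    "somewhat_dynamic": 0.15,
    "mostly_static": 0.05,
}

def reserved_bytes_per_page(group: str, page_size: int = 8192) -> int:
    """Bytes to leave empty on each page for future inserts and updates."""
    return int(page_size * FREE_SPACE_TARGET[group])

# 30% of an assumed 8 KB page is reserved for a highly dynamic table:
assert reserved_bytes_per_page("highly_dynamic") == 2457
```

The same targets would be applied to the indexes of each group's tables.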
Disk Page Location
Next, determine where to put the disk pages. The goal here is to try to balance the workload across a number of different drives and heads to reduce or eliminate bottlenecks. Some guidelines to consider:
The primary rule is that there are never enough drives over which to spread the I/O. I/O balancing is an iterative, experimental process. Data access performance prototyping during the elaboration phase, coupled with appropriate instrumentation to monitor physical and logical I/O, will highlight performance problems early while there is still time to adjust the database design.
Disk Space Allocation
Using the characteristics of the persistence Design Mechanism, estimate the number of objects that need to be stored. The amount of disk space required to store the objects will vary from DBMS to DBMS, but the size estimate typically is a function of:
size = (fixed overhead for the table) + (number of rows * (avg. row size/avg. data density))
where the avg. data density is the page density percentage. In addition, be sure to account for the size of the indexes as well. The Database Administrator (DBA) manual for the product will provide the precise size estimation formulae. When calculating disk space, make sure to account for growth due to additions of data.
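The estimate above can be expressed directly as code. This sketch adds a growth factor to cover the "growth due to additions of data" mentioned above; the parameter names and the sample numbers are invented for illustration, and the DBA manual for the actual product remains the authority on the precise formula.

```python
def estimated_table_size(fixed_overhead: int, n_rows: int,
                         avg_row_size: float, avg_data_density: float,
                         growth_factor: float = 1.0) -> float:
    """Disk space estimate from the formula above.

    avg_data_density is the page density as a fraction (e.g. 0.85 for
    pages kept 85% full); growth_factor adds headroom for future data.
    """
    return (fixed_overhead + n_rows * (avg_row_size / avg_data_density)) * growth_factor

# Example: 1,000,000 rows of ~120 bytes on pages kept 85% full,
# with 25% headroom for growth -- roughly 176 MB.
size = estimated_table_size(8192, 1_000_000, 120, 0.85, growth_factor=1.25)
assert 176_000_000 < size < 177_000_000
```

Remember that the indexes need a corresponding estimate of their own.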
Use the disk space allocation calculations along with the page location to place data as well. Some databases allow tables to span physical drives, some do not. If yours does not, you may have to sub-optimize the placement of data to accommodate the data volumes. Another alternative is to use RAID ('redundant array of independent disks') drives, which allow many independent disk drives to appear as a single large, easily expanded disk. RAID drives effectively stripe the data automatically, but offer less control over the physical placement of data, making fine optimization more difficult.
Often there are standard look-up tables, validation tables, or reference tables used throughout the project. Since the data in these tables tends to be frequently accessed but changes infrequently, it is worth special consideration. In the design model, these tables hold things like standard product codes, state or province codes, postal or zip codes, tax tables, area code validation tables, or other frequently accessed information. In financial systems this might be lists of policy codes, insurance policy rating categories, or conversion rates. Look in the design model for classes which are primarily read, providing validation information for a large number of clients.
To tie-in earlier steps, the data should be located on a fast drive, in relatively packed data pages. If it is small, don't bother indexing it, since for small tables indexing actually adds more overhead. A small frequently accessed table will also tend to remain in memory, as the least-recently-used (LRU) caching algorithms tend to keep these tables pegged in the data cache.
Make sure the database cache is large enough to keep all reference tables in memory, if possible, plus normal "working set space" for queries and transactions. Often, the secret to increasing database performance is reducing I/O. Relatively speaking, memory is inexpensive, so if you need more, buy it.
Once the reference table structures are defined, determine a strategy for populating them. Since these tables are accessed early in the project, determining the reference values and loading the tables often needs to occur relatively early as well. While the Database Designer is not responsible for obtaining the data, she is responsible for determining how and when the reference tables will be refreshed.
Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where these ranges can be identified, the database can enforce them. (This is not to say that data validation should not be done in the application, only that the database can serve as a 'validator of last resort' in the event that the application does not work correctly.) Where data validation rules exist, define database constraints to enforce them.
In addition, referential integrity rules can be enforced by the database. These ensure that for every foreign key value there is a primary key value in a referenced table. Definition of foreign key constraints is also often used by the query optimizer to accelerate query performance. Based on associations in the Design Model, create foreign key relationship constraints on the associated tables in the Data Model. In many cases, the foreign key enforcement rules will use the reference tables discussed in the previous step.
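Both kinds of rule can be sketched together. The following uses SQLite as a stand-in (note that SQLite only honors foreign keys when `PRAGMA foreign_keys` is enabled per connection); the state-code and address tables are invented to echo the reference-table examples above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
cur = con.cursor()

# A reference table, a data-integrity rule (CHECK), and a referential-
# integrity rule (FOREIGN KEY); the names here are illustrative.
cur.execute("CREATE TABLE state_code (code TEXT PRIMARY KEY)")
cur.execute("""
    CREATE TABLE address (
        id    INTEGER PRIMARY KEY,
        zip   TEXT CHECK (length(zip) = 5),
        state TEXT REFERENCES state_code (code)
    )
""")
cur.execute("INSERT INTO state_code VALUES ('CA')")
cur.execute("INSERT INTO address VALUES (1, '94110', 'CA')")  # passes both rules

# The database acts as a validator of last resort: bad rows are rejected
# even if the application failed to validate them.
bad_rows = [
    (2, '941', 'CA'),    # violates the CHECK constraint (zip too short)
    (3, '10001', 'XX'),  # violates the FOREIGN KEY (no such state code)
]
rejected = 0
for row in bad_rows:
    try:
        cur.execute("INSERT INTO address VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected += 1
assert rejected == 2
```

Here the foreign key enforcement uses the reference table from the previous step, exactly as the text suggests.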
Most databases support a stored procedure capability. A stored procedure is executable code that runs within the process space of the database management system and provides the ability to perform database-related actions on the server without having to transfer data across a network. Judicious use of stored procedures can improve the performance of the system.
Stored procedures usually come in two flavors: actual procedures and triggers. Procedures are executed explicitly by an application, generally have parameters, and can provide an explicit return value. Triggers are invoked implicitly when some database event occurs (inserting a row, updating a row, deleting a row, etc.), have no parameters (since they are invoked implicitly), and do not provide explicit return values.
In database systems that lack constraints, triggers are often used to enforce referential and data integrity. Otherwise, they tend to be used when an event needs to trigger (or cause) another event.
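An event-causes-event trigger can be sketched as follows. This is an illustrative example in SQLite; the account and audit tables and the trigger name are invented for the sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
cur.execute("""
    CREATE TABLE audit_log (
        account_id  INTEGER,
        old_balance REAL,
        new_balance REAL
    )
""")

# The trigger fires implicitly on every balance UPDATE; it takes no
# parameters and returns nothing -- the update event simply causes a
# second event, the audit insert.
cur.execute("""
    CREATE TRIGGER trg_account_audit AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
    END
""")

cur.execute("INSERT INTO account VALUES (1, 100.0)")
cur.execute("UPDATE account SET balance = 75.0 WHERE id = 1")
entry = cur.execute("SELECT * FROM audit_log").fetchone()
assert entry == (1, 100.0, 75.0)
```

The application never inserts into the audit log explicitly; the database does it as a side effect of the update.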
The Design classes should be examined to see if they have operations which should be implemented using the stored procedure or trigger facility. Candidates include:
Remember that improving database performance usually means reducing I/O; as a result, if performing a computation on the DBMS server will reduce the amount of data passed over the network, the computation should probably be performed on the server.
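The reduce-data-over-the-network principle is easy to show with an aggregate. The sketch below uses SQLite, which runs in-process, so no network is actually involved; it simply contrasts shipping every row to the client with letting the DBMS compute the result. Table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sale (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sale VALUES (?, ?)",
                [("east", 10.0), ("east", 20.0), ("west", 5.0)])

# Anti-pattern: fetch every matching row and compute the total in the client.
client_total = sum(a for (a,) in cur.execute(
    "SELECT amount FROM sale WHERE region = 'east'"))

# Better: let the DBMS compute the aggregate; only one value is returned.
(server_total,) = cur.execute(
    "SELECT SUM(amount) FROM sale WHERE region = 'east'").fetchone()

assert client_total == server_total == 30.0
```

With a real client/server DBMS and millions of rows, the second form passes one value over the network instead of the whole result set.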
Work with the Designer of the class to discuss how the database can be used to improve performance. The Designer will update the operation method to indicate that one or more stored procedures can or should be used to implement the operation.
Continuously throughout the activity, consider the Checkpoints: Data Model to assess the completeness and quality of the effort.
Rational Unified Process