by Craig S. Mullins
Data is not sedentary. Once data has been created, organizations tend to move it around to support many different purposes—different applications, different geographies, different users, different computing environments, and different DBMSs.
Rarely is a single copy of any piece of data good enough. Data is copied and transformed and cleansed and duplicated and stored many times throughout the organization. Different copies of the same data are used to support transaction processing and analysis; test, quality assurance, and operational systems; day-
The DBA uses many techniques and technologies to facilitate data movement and distribution. This chapter discusses some of the primary tools that are used to move data from place to place and to support data at multiple locations.
One of the simplest ways for the DBA to move data from one place to another is to use the LOAD and UNLOAD utilities that come with the DBMS. The LOAD utility is used to populate tables with new data, and the UNLOAD utility is used to read data from a table and put it into a data file. A LOAD utility is used to perform bulk inserts of data into database tables. It typically can support
It is important to note that each DBMS may call the actual LOAD utilities by different names, but the functionality is the same or similar from product to product. For example, Microsoft SQL Server and Sybase provide the BCP utility, and Oracle offers SQL*Loader.
When loading data, the DBA must take into account many factors to ensure success. Is the LOAD utility restartable in the event it fails? Although it takes more time to implement a restartable load process, it is easier to support. For the load process to be restartable, the DBA must ensure that any work files used are allocated and that the LOAD utility can restart from where it left off.
It is usually a good idea to create all required indexes before loading data into a table. The LOAD utility is usually more efficient at populating the indexes during the load process than creating new indexes for a fully populated table. Of course, the DBA should verify this to be the case for the DBMS and version in use.
If the LOAD utility is capable of performing tasks in parallel, the DBA should take advantage of this when large amounts of data are being loaded. The LOAD utility might be capable of accepting multiple input files for concurrent loading into different segments or table partitions, or it might be able to build multiple indexes in parallel rather than building each sequentially. Another performance-
If the LOAD utility provides an option to turn off logging, consider using it to speed up the load process and minimize overhead. However, the DBA will have to back up the data after the LOAD completes to ensure recoverability. If logging is turned off, the DBMS cannot recover the table without an image copy backup.
Database information frequently needs to be moved or copied to other locations. For example, you may want to move data to a different database, from a table to a sequential file for external processing, or possibly to another relational database system or platform. Certain database schema changes require database objects to be dropped and recreated—and when the objects are dropped, so is the data. Therefore, you need to unload the data before making database object changes. Perhaps you just want to extract a subset of rows from a table for use as test data. Even to reorganize a database object typically requires the data to be unloaded, optimized, and then reloaded.
The purpose of the UNLOAD utility is to read data from a database and write it to an output data file. Without an UNLOAD utility, database users are forced to use SQL SELECT statements issued by an interactive SQL facility, report writer, or application program in order to unload data. However, these methods are error prone and slow for large quantities of data.
Many modern UNLOAD utilities are capable of unloading data from an image copy backup. Such a capability is useful because it enhances concurrent data access. Unloading from an image copy backup can be beneficial because the live data is unaffected—meaning no locks are taken on the live data nor is any data read from the actual table on disk. Because the UNLOAD utility reads the data from an image copy backup, the performance and availability of applications running against the live data will be unaffected by the concurrent unload operation.
Similar to an UNLOAD utility, an EXPORT utility reads data from a table and places it into an external file. An IMPORT utility reads an external file created by the EXPORT utility and inserts the data into a table. IMPORT and EXPORT facilities typically work with more than just the data, though. Sometimes an EXPORT data file contains the schema for the table along with the data. In such cases, the IMPORT utility can create the table and import the data using just the EXPORT data file. Sometimes the EXPORT file contains more than just a single table. Some EXPORT facilities enable the DBA to specify a single table, and then follow the relationships for that table to extract all of the related files and data.
Some IMPORT/EXPORT facilities provide UNLOAD-
Although UNLOAD, LOAD, IMPORT and EXPORT are the most common methods used by DBAs to move large amounts of data from place to place, there are other methods for moving large quantities of data. ETL software can be used to move large amounts of data. ETL stands for extract, transform, and load. ETL software is primarily used to populate data warehouses and data marts from other databases and data sources.
Another method of moving data is through replication and propagation. When data is replicated, one data store is copied to one or more data stores, either locally or at other locations. Replication can be implemented simply by copying entire tables to multiple locations. Alternatively, replicated data can be a subset of the rows and/or columns. Replication can be set up to automatically refresh the copied data on a regular basis.
Propagation, on the other hand, is the migration of only changed data. Propagation can be implemented by scanning the transaction log and applying the results of data modification statements to another data store. Initial population of a data warehouse can be achieved by replication, and subsequent population of changes by either replication (if the data is very dynamic) or propagation.
Messaging software, also known as message queuing software or application integration, is another popular form of data movement. When using a message queue, data is placed onto the queue by one application or process; the data is read from the queue by another application or process. Messaging software works by providing APIs to read and write formatted messages to and from a queue. An application can read or write messages to and from the queue from any platform supported by the software.
Of course, many other methods exist for moving data—from the simple, such as using a table editing tool to highlight and copy data, to the complex, such as writing programs to read the database and write to external files or directly to other databases. But whatever methodology you use, one thing is sure, data sure moves around a lot these days!
From Database Trends and Applications, May 2013.
© 2013 Craig S. Mullins,