Business intelligence is a cornerstone of every successful enterprise. Business-critical processes rely on business intelligence environments: demand-planning applications, marketing campaigns, personalized Web site content, and more are driven by knowledge obtained from business intelligence systems.
Access to business intelligence has gradually grown from tens of users to hundreds and even thousands, including fully automated processes that drive day-to-day operations. The largest business intelligence databases have grown from about 5 TB in 1998 to almost 30 TB in 2003 (as independently verified by Winter Corporation), and individual databases may well run into the hundreds of terabytes. Managing these large database environments, and guaranteeing their availability and performance with shrinking budgets, presents several challenges for IT.
We explore those demands, and suggest approaches to tackle them, in this article.
The Problem
Business intelligence is about using data to obtain competitive advantage. The more data you have, the better informed your decisions can be, and you may want to include all relevant dependencies in the analysis in order to make the right decisions. As a result, there is a natural tendency to consolidate data in one place.
Consolidation results in:
* More users accessing the (single) data set
* Queries that access more data, and hence use more resources, in order to support well-qualified decisions
If you don't carefully manage this multi-dimensional growth, you may end up with an overloaded system that benefits no one.
The Challenges
From a manageability perspective, every IT department faces three major challenges:
Performance is key to the end users' experience when they use their business intelligence tools or applications to run queries. How long does it take to retrieve results?
Scalability is key to supporting ever-growing data volumes as well as growing numbers of users running complex queries. Your end-user population must be served appropriately despite growing data volumes and increased system demand.
Availability is also key, since more and more business-critical processes rely on consolidated information in data warehouses. Availability of business intelligence depends on the reliability of infrastructure components (servers, networks, etc.) and, of course, on access to the information stored within data warehouses.
In every business intelligence deployment, these three requirements go hand in hand. Every IT organization managing a large-scale business intelligence solution must ensure that all three requirements are met while keeping one important factor in mind: cost. Despite increasing data volumes and a growing end-user population, IT organizations don't get additional budget to address the requirements.
This article explains how organizations can implement today's technologies to address the requirements for a large-scale business intelligence deployment with a flat (or even reduced) IT budget.
Performance
At a high level, there are two ways to address performance requirements in a large-scale business intelligence implementation:
* Add resources to the individual queries or data loads so they can finish more quickly.
* Be smart about the amount of resources used to satisfy the queries or perform the data loads.
Add Resources: Parallel Execution
Parallelism is the ability to apply multiple CPU and I/O resources to the execution of a single command. In a parallel query, the data is distributed across multiple CPUs that perform computations independently before the data set is combined to perform any remaining operations and be presented to the end user.
In a single-server environment, parallelism is implemented on multiple CPUs in the same server, each contributing to the execution of the command. In a multi-server clustered environment, multiple CPUs across multiple servers in the cluster can contribute to the execution of the command.
Today's database technology can automatically and dynamically set the degree of parallelism for a query, depending on query complexity, table size, hardware configuration, and the active system load. By executing complex queries and workloads as multiple parallel processes (as opposed to a single process), query execution, data loading, and other database operations can be executed much more rapidly.
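To make the divide-and-combine pattern concrete, here is a minimal, hypothetical sketch in Python (not a real database API): a coordinator splits a scan across worker processes and combines their partial results, which is essentially what a parallel query coordinator does with its parallel execution servers.

```python
# Conceptual sketch of intra-query parallelism (illustrative only, not a real
# database API): a "query" that sums sales amounts is split across worker
# processes, each scanning one slice of the data, and the partial results are
# combined at the end.
from multiprocessing import Pool

def scan_slice(rows):
    """Worker: compute a partial aggregate over one slice of the data."""
    return sum(rows)

def parallel_sum(all_rows, degree_of_parallelism=4):
    """Coordinator: split the data, fan out to workers, combine the results."""
    size = max(1, len(all_rows) // degree_of_parallelism)
    slices = [all_rows[i:i + size] for i in range(0, len(all_rows), size)]
    with Pool(processes=degree_of_parallelism) as pool:
        partials = pool.map(scan_slice, slices)
    return sum(partials)

if __name__ == "__main__":
    sales = list(range(1_000_000))            # stand-in for a fact table column
    print(parallel_sum(sales, degree_of_parallelism=4))
```

In a clustered database, the same fan-out happens across servers as well as across CPUs within a server; the coordinator-plus-workers structure is the same.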
Be Smart: Query Optimization
Query optimization techniques are crucial for good performance. Common optimization techniques include the use of indexes and summary tables. Queries that would otherwise scan or access massive amounts of data to produce a result can use summary tables and indexes to reduce total resource consumption, saving computation resources as well as I/O throughput.
Databases use internal statistics, such as the number of rows in a table and the estimated number of rows a query would retrieve, to choose the execution strategy. As a result, queries can be dynamically rewritten if the database optimizer decides that using summary tables and indexes is more efficient than retrieving the data directly from tables.
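As a rough illustration of that cost-based decision, the hypothetical sketch below compares estimated row counts from statistics and picks a summary table when it can answer the query far more cheaply. Real optimizers weigh many more factors; the table names and statistics here are invented.

```python
# Illustrative sketch (not a real optimizer): choose between the detail table
# and a pre-aggregated summary table by comparing estimated rows scanned.
BASE_TABLE_ROWS = 500_000_000      # statistic: rows in the detail table
SUMMARY_TABLE_ROWS = 36            # statistic: one row per month of summary

def choose_access_path(query_needs_detail: bool) -> str:
    """Pick the cheaper data source for a monthly-sales query."""
    if query_needs_detail:
        return "sales"                         # detail required: scan base table
    if SUMMARY_TABLE_ROWS < BASE_TABLE_ROWS:
        return "sales_monthly_summary"         # rewrite to the summary table
    return "sales"

print(choose_access_path(query_needs_detail=False))  # -> sales_monthly_summary
```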
Be Smart: Partitioning
Partitioning large tables and indexes provides several benefits for query performance. Consider, for example, a table that contains sales data for the last three years, and assume the data has been partitioned by month. A query that retrieves sales figures for September 2004 will access only one out of 36 partitions, with a performance improvement of up to 36 times. Also, note that it does not matter whether the table contains three years of data versus five or 10 years—the query to retrieve September 2004 sales data will only access the partition that contains the September 2004 data.
Partitioning complements parallel processing extremely well. When a database assigns resources to queries, it takes partitions into account, and queries run faster when separate processes access separate partitions.
Further scalability and performance gains come from queries that join multiple tables partitioned on the same attributes: the join condition lets the database eliminate partitions on both sides, so less data is accessed and the query completes faster. Different query requirements typically call for different partitioning schemes to give the most granular access to individual partitions. Among the common techniques, range partitioning organizes data by value ranges (such as months), while hash partitioning ensures an even distribution of data across partitions.
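The sketch below illustrates partition pruning under an assumed monthly naming scheme (the partition names are made up): a September 2004 predicate maps to exactly one partition, no matter how much history the table holds.

```python
# Illustrative sketch of partition pruning: a query for September 2004 only
# touches the partition that holds that month, regardless of how many years
# of history the table keeps. Partition names are hypothetical.
from datetime import date

def partitions_to_scan(start: date, end: date) -> set[str]:
    """Collect only the monthly partitions the query's date range touches."""
    parts, year, month = set(), start.year, start.month
    while (year, month) <= (end.year, end.month):
        parts.add(f"sales_{year}_{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return parts

# A September 2004 query scans exactly one partition out of the 36.
print(partitions_to_scan(date(2004, 9, 1), date(2004, 9, 30)))
```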
Be Smart: Data Compression
Data compression is a readily available technology for cost-effectively storing large data volumes online. Databases can store data in a compressed format to reduce its volume. The compression comes at a small performance cost, but the hit is taken only when the data is loaded in a compressed format. Query retrieval actually benefits from the data being compressed, because less data needs to be read from the disks, and in most business intelligence applications disk I/O is the dominant performance bottleneck.
Data compression works hand in hand with partitioning. Data in business intelligence environments typically remains active for a certain period, after which it does not change any more. For example, a retailer may allow a 30-day return period after the purchase, after which the records will not change. Once data is “frozen,” it can be compressed and made available for query-only purposes. A time-based partitioning scheme helps in identifying the data set that can be compressed.
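A minimal sketch of this idea, assuming a 30-day return window and using ordinary zlib compression as a stand-in for the database's own compression:

```python
# Illustrative sketch: once a monthly partition is past the 30-day return
# window ("frozen"), compress it once; later reads pay only a cheap
# decompression and move far fewer bytes from disk. Not a real database API.
import json
import zlib
from datetime import date, timedelta

def is_frozen(partition_month_end: date, return_window_days: int = 30) -> bool:
    return date.today() > partition_month_end + timedelta(days=return_window_days)

def compress_partition(rows: list) -> bytes:
    """One-time cost, paid when the partition is compressed."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))

def read_partition(blob: bytes) -> list:
    """Queries read fewer bytes from disk, then decompress in memory."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

rows = [{"order_id": i, "amount": 19.99} for i in range(10_000)]
if is_frozen(date(2004, 9, 30)):
    blob = compress_partition(rows)
    print(len(json.dumps(rows)), "bytes raw ->", len(blob), "bytes compressed")
```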
Be Smart: Cut Costs
In a typical business intelligence system, active and less-active data can be distinguished. Costs can be reduced if less-active data is stored on low-cost, typically lower-performing storage systems. Active (and typically more recent) data can still sit on more expensive, high-performance storage to satisfy end-user queries. As data becomes less active, it can be moved to low-cost storage.
It’s also wise to consider compressing the data on low-cost storage to reduce the I/O bandwidth requirements when the data is still accessed.
If the less-active data is identified using the partitioning scheme, the partition can be used as a unit of data for movement and compression. For example, if a monthly partition scheme is used and data older than a year is considered less active, then every month it can be moved and compressed to a one-month partition.
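The following sketch shows what such a monthly housekeeping job might look like. The storage paths, tier names, and partition naming are assumptions for illustration, not a real utility.

```python
# Illustrative monthly housekeeping job (paths and tier names are made up):
# partitions older than 12 months are moved to low-cost storage and
# compressed, using the partition as the unit of movement.
import gzip
import shutil
from datetime import date
from pathlib import Path

HOT_TIER = Path("/storage/high_performance")   # hypothetical mount points
COLD_TIER = Path("/storage/low_cost")

def is_less_active(partition_name: str, today: date, months: int = 12) -> bool:
    """Partition names look like 'sales_2004_09'; older than `months` is less active."""
    _, year, month = partition_name.split("_")
    age_in_months = (today.year - int(year)) * 12 + (today.month - int(month))
    return age_in_months > months

def archive_partition(partition_file: Path) -> None:
    """Move one partition file to the cold tier and compress it on the way."""
    target = COLD_TIER / (partition_file.name + ".gz")
    with partition_file.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    partition_file.unlink()                    # drop the hot-tier copy

def monthly_job(today: date) -> None:
    for part in HOT_TIER.glob("sales_*"):
        if is_less_active(part.stem, today):
            archive_partition(part)
```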
Scalability
Scalability in a business intelligence environment covers two distinct but related aspects:
* Number of users
* Data volume
Number of Users
Database systems can handle hundreds or thousands of concurrent users. Database technology has progressed in several areas to make sure these users get all the resources they need when they need them.
Scale up or scale out. The terms "scale up" and "scale out" typically describe the differences between a single-server environment and a clustered system.
In a single-server environment, the system can be scaled by adding more CPUs or replacing single-core CPUs with dual-core CPUs. This strategy is called scaling up. The same approach can be used in a clustered environment, but there it is also possible to add servers to the cluster to increase computing resources. This strategy is called scaling out. Both scaling up and scaling out are common strategies in business intelligence environments.
The limits of the scale-up approach are determined by the hardware limits of the system: the number of CPUs, number of dual-core CPUs, and how much memory fits in the server until it must be replaced. Most business intelligence systems have not outgrown the hardware restrictions of today's biggest servers, but large servers (as well as components for these servers) are expensive.
Clustered environments involve more—and different—hardware components. In general, the limits in a clustered environment are more software- than hardware-related. For example, how well does the system scale when nodes are added? How many servers can be included in a cluster?
Today's database technology is definitely ahead of the current requirements for business intelligence environments. A clustered database can provide a cost-effective infrastructure for large-scale business intelligence implementations.
Automatic memory management. Good memory management is key to supporting large numbers of users running concurrent, complex queries, often in conjunction with data loading. Database engines require memory to perform query and loading operations, and every query or load process needs memory to execute. Traditionally, DBAs would tune memory parameters and allocate memory for database operations and memory for every process that executes inside the database. This isn't the best approach because some processes require more memory than others. Besides, some users or processes may be more important than others.
Today, databases have built-in techniques to allocate memory to processes dynamically based on the system workload. DBAs define coarse limits for the maximum amount of memory that the engine is allowed to consume, and the database guarantees optimal execution within these boundaries. The result is more efficient memory consumption and better service (i.e., faster response times) to end users running business intelligence queries.
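A toy sketch of the idea, with a made-up "memory broker" standing in for the database's internal memory manager: the DBA sets only a coarse global cap (and, in this simplified model, an expected concurrency), and per-query work areas are granted within those boundaries.

```python
# Illustrative sketch of automatic memory management (not a real engine):
# the DBA sets only a coarse cap; each query asks the broker for a work area
# and receives a grant bounded by its fair share and the remaining headroom.
class MemoryBroker:
    """Toy stand-in for a database's automatic memory manager."""

    def __init__(self, total_mb: int, expected_concurrency: int = 1):
        self.total_mb = total_mb                    # coarse cap set by the DBA
        self.expected_concurrency = expected_concurrency
        self.allocated_mb = 0

    def grant(self, requested_mb: int) -> int:
        """Grant the smallest of: the request, the fair share, the headroom."""
        fair_share = self.total_mb // self.expected_concurrency
        headroom = self.total_mb - self.allocated_mb
        granted = min(requested_mb, fair_share, headroom)
        self.allocated_mb += granted
        return granted

    def release(self, granted_mb: int) -> None:
        self.allocated_mb -= granted_mb

broker = MemoryBroker(total_mb=8192, expected_concurrency=4)
print(broker.grant(4096))   # 2048: fair share with four concurrent queries
print(broker.grant(1024))   # 1024: small requests are honored in full
```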
Resource management. DBAs need tools and utilities to manage the relative importance of users and processes. Resource managers can indicate that certain users or processes have to be restricted in their resource consumption for other processes to be serviced appropriately.
Resource managers also provide query governors to monitor ongoing activity in the database. Limits can be set for particular queries; if a query goes beyond its limits, an error is returned. Resource management tools should be applied to control memory allocations as well as CPU and I/O resources, enabling DBAs to automatically put system resources where they are most needed.
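As a rough illustration, the hypothetical governor below assigns each consumer group a cost ceiling and rejects a query whose estimated cost exceeds it; the group names and limits are invented for the example.

```python
# Illustrative sketch of a query governor (names and limits are made up):
# each consumer group gets a cost ceiling, and a query whose estimated cost
# exceeds its group's ceiling is rejected before it can swamp the system.
GROUP_LIMITS = {
    "executive_dashboards": 1_000_000,   # high priority, generous ceiling
    "ad_hoc_analysts": 50_000,           # restricted so other work stays responsive
}

class QueryRejected(Exception):
    pass

def admit(query_id: str, consumer_group: str, estimated_cost: int) -> None:
    limit = GROUP_LIMITS[consumer_group]
    if estimated_cost > limit:
        raise QueryRejected(
            f"{query_id}: estimated cost {estimated_cost} exceeds "
            f"{consumer_group} limit of {limit}"
        )
    # ...otherwise the query runs with its group's share of CPU and I/O.

admit("q1", "executive_dashboards", estimated_cost=750_000)    # allowed
try:
    admit("q2", "ad_hoc_analysts", estimated_cost=900_000)     # rejected
except QueryRejected as err:
    print(err)
```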
Data Volume
Today’s database technology can easily handle almost any imaginable data volume. The restrictions are not in data size, but rather in the type of analysis performed. From a data-size perspective, DBAs want to make sure they do not waste time managing the data volume.
Data management for a large business intelligence system is not really different from data management for any other system. As a general rule, data should be striped across many disks in order to get the throughput needed for queries. Storage software (or even database software) can take care of storage striping across disks. Data can automatically be rebalanced when disks are added (or removed). DBAs only need to tell the system that a disk or logical volume should be added to the storage pool and the software automatically does the rebalancing.
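A minimal sketch of that striping-and-rebalancing behavior, with disks and extents represented as plain Python objects rather than real storage software:

```python
# Illustrative sketch of automatic striping and rebalancing (not a real
# volume manager): extents are spread round-robin across disks, and adding
# a disk simply re-spreads them so every disk holds a roughly equal share.
def stripe(extents: list, disks: list) -> dict:
    layout = {disk: [] for disk in disks}
    for i, extent in enumerate(extents):
        layout[disks[i % len(disks)]].append(extent)
    return layout

extents = [f"extent_{i:03d}" for i in range(12)]
before = stripe(extents, ["disk1", "disk2", "disk3"])
after = stripe(extents, ["disk1", "disk2", "disk3", "disk4"])  # DBA adds disk4

print({d: len(e) for d, e in before.items()})  # 4 extents per disk
print({d: len(e) for d, e in after.items()})   # now 3 extents per disk
```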
Data compression is also an important factor in keeping storage costs under control. Data that is not frequently accessed does not have the same throughput requirements as frequently accessed data, so it can be packed densely onto the available storage; frequently accessed data should sit on many disks in order to get sufficient throughput.
Backup and recovery. Data size also poses a challenge for backup and recovery strategies. Organizations want to make sure they can recover from a disaster, but they don’t want to back up their full 100+ TB business intelligence database. Consider the amount of resources necessary to perform a full system backup of that size within a reasonable time!
Today's database backup and recovery utilities need to take a full database backup only once. From that point, the utility tracks the incremental changes so that another full backup is never required. A restore operation restores the initial full backup and rolls forward any changes.
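The sketch below illustrates the mechanism with a toy block-level model (not a real backup utility): writes are tracked, incremental backups copy only the changed blocks, and a restore rolls the increments forward over the full backup.

```python
# Illustrative sketch of incremental backup with change tracking: a full copy
# is taken once, changed blocks are tracked, and each incremental backup
# contains only those blocks; a restore replays full + incrementals in order.
class ChangeTrackedDatabase:
    def __init__(self, blocks: dict):
        self.blocks = dict(blocks)
        self.changed_since_backup = set()

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data
        self.changed_since_backup.add(block_no)   # change tracking

    def full_backup(self) -> dict:
        self.changed_since_backup.clear()
        return dict(self.blocks)                  # taken only once

    def incremental_backup(self) -> dict:
        delta = {b: self.blocks[b] for b in self.changed_since_backup}
        self.changed_since_backup.clear()
        return delta                              # only the changed blocks

def restore(full: dict, incrementals: list) -> dict:
    image = dict(full)
    for delta in incrementals:                    # roll the changes forward
        image.update(delta)
    return image
```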
Availability
Availability in the context of a business intelligence system has two dimensions. First, the system must be available to end users. Second, data must be available when users want to access it.
System Availability
Clustering technologies have always been praised for the high availability achieved through server redundancy. If a server in a cluster fails, other servers take over the workload, and with transparent application fail-over capabilities, the impact to end users is minimized.
The impact of a server failure is inversely proportional to the number of servers in the cluster. If one server in a two-node cluster fails, 50 percent of the computing power is lost; if one server in an eight-node cluster fails, only 12.5 percent of the computing resources are lost.
High-availability solutions also include complete fail-over sites. Modern commercial databases include utilities to set up a fail-over site using data replication. Fail-over can also be used in single-server environments to increase system availability, although this is generally not the most cost-effective approach. More recently, multi-node business intelligence systems running Linux on low-cost components have been implemented, proving that highly available clusters can be built at an extremely attractive cost.
Data Availability
Data availability is key to a successful business intelligence system: business decisions must take all relevant parameters into account.
Clusters: Shared-Disk versus Shared Nothing. At a high level, there are two common storage architectures for clustered databases:
* Shared disk: In a shared-disk environment, all servers can access the full data set.
* Shared nothing: In a shared-nothing environment, each server is responsible for a particular slice of the data. A hash-based distribution is typically applied to achieve an even spread of data across all servers (see the sketch below). Note that queries in a shared-nothing architecture have a minimum degree of parallelism equal to the number of servers that own the data being accessed.
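The sketch referenced above is a minimal, illustrative model of hash distribution (server names and keys are invented): it shows both the even spread of data and the exposure created when one node's slice becomes unavailable.

```python
# Minimal sketch of hash-based distribution in a shared-nothing layout
# (illustrative only): each row's key is hashed to decide which server owns
# it, which spreads data evenly but ties each slice to one server.
import hashlib

SERVERS = ["node1", "node2", "node3", "node4"]

def owner(customer_id: str) -> str:
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

counts = {server: 0 for server in SERVERS}
for key in (f"cust_{i}" for i in range(100_000)):
    counts[owner(key)] += 1
print(counts)   # roughly 25,000 rows per server

# If node3 fails and its slice is not mirrored, about a quarter of the data
# is unreachable until that node is recovered.
```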
Overall, system availability benefits from a cluster of servers. If one server fails, the remaining server(s) continue to serve queries and other operations. In a shared-disk architecture, all that is lost is the computing power the failed server provided. In a shared-nothing architecture, however, the data slice that the failed server was responsible for can become unavailable; depending on the query, the full data set requested may not be accessible.
Shared-nothing architectures typically do have an answer to high availability for the entire data set, by keeping a mirror of each server's data slice on another server; however, the disk capacity would have to be doubled in order to implement such a solution.
Disk Failure. Disks (and any other component) may fail in a business intelligence infrastructure. As more disk space (perhaps in low-cost storage units) is used, the likelihood of a failure increases. It's important that the data be safe even if that happens; the system should continue to run without interruption.
The most common approach to handling disk failure is a data-mirroring technique. A mirror obviously requires additional disk space, but the extra disks are generally needed anyway to satisfy throughput requirements. When a disk goes bad, it must be taken offline, data must be rebalanced across the remaining disks, and the system should continue to operate without interruption. Today's storage-vendor and database software can handle such a scenario.
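A simplified sketch of that behavior, assuming a two-copy mirror and using plain Python structures in place of real storage software:

```python
# Illustrative sketch of mirroring and rebalancing after a disk failure:
# every extent is kept on two different disks, so losing one disk leaves a
# surviving copy that can be re-mirrored onto the remaining disks.
import itertools

def mirror(extents: list, disks: list) -> dict:
    """Place each extent on two distinct disks, cycling through disk pairs."""
    pairs = itertools.cycle(list(itertools.combinations(disks, 2)))
    return {extent: set(next(pairs)) for extent in extents}

def handle_failure(placement: dict, failed: str, disks: list) -> dict:
    """Drop the failed disk and re-mirror any extent left with one copy."""
    survivors = itertools.cycle([d for d in disks if d != failed])
    for copies in placement.values():
        copies.discard(failed)
        while len(copies) < 2:                 # restore two-copy redundancy
            copies.add(next(survivors))
    return placement

layout = mirror([f"extent_{i}" for i in range(6)], ["d1", "d2", "d3", "d4"])
layout = handle_failure(layout, failed="d2", disks=["d1", "d2", "d3", "d4"])
print(layout)   # every extent again lives on two surviving disks
```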
Data Volume. Sheer data volume is a challenge for a large business intelligence implementation, yet business decisions benefit from the availability of large amounts of data. Trend analysis may not be useful if only one year of data is available, but with five or more years' worth of history it can suddenly yield valuable insights.
Data compression, the ability to implement low-cost storage solutions, and partition-based data management all enable organizations to give users access to more data.
Backup and Recovery. In an ideal world, systems never break. Clustering eliminates server failure as a common cause of unplanned downtime, but downtime can still occur during power outages or natural disasters. The reality is that systems do fail, and businesses need to be prepared: they must implement a backup and recovery strategy that efficiently manages the protection and restoration of the business intelligence solution.
Because of the sheer volume of data, this task deserves careful attention—something database vendors know. For the successful management of large-scale business intelligence backup and recovery processes, the IT department must consider:
* Large business intelligence systems should not be taken offline unnecessarily. The system has to remain available while the backup is running, even if the backup is scheduled during less active hours. Online logging capabilities can track the smaller changes that occur during the backup cycle, so a complete environment can be restored from the last backup and a point-in-time recovery can reapply changes made since the last data load. Note that some data loads are reproducible; such loads may not need to write logs for point-in-time recovery, because one can restore the backup and simply re-run the load process to reach the same state.
* Make partitions of the database read-only. Data in business intelligence environments typically remains active for some time, after which it does not change. Unchanged data only needs to be backed up once. Backup utilities take the read-only property into account during a backup-and-restore scenario.
* In conjunction with read-only capabilities, data compression reduces the volume of data to be written to backup devices.
In clustered environments, backups can be managed by assigning some servers in the cluster to perform backups while the other servers remain busy servicing the needs of end users. Whichever strategy or best practice is implemented, be aware that backing up and recovering tens of terabytes is a much greater challenge than doing so for (smaller-volume) OLTP systems.
Cost-Saving Opportunities
Business intelligence solutions are valuable to businesses. However, an organization must avoid letting its business intelligence solution become so expensive that it has to be scaled down, or the effort abandoned altogether.
This article has explored several approaches to reducing expenses:
* Implement clusters on low-cost components
* Use low-cost storage solutions for less active data
* Implement data compression to reduce the data volume
* Take advantage of the infrastructure's self-managing capabilities
In addition, there are cost-saving opportunities that are less obvious at first glance but may well be feasible in the mid- to long-term. The opportunities below reduce the number of components to manage:
* Take advantage of the database as an ETL engine. Database technologies have advanced: while a separate ETL engine was the obvious choice a number of years ago, you may now be able to cover the ETL requirements with the database engine itself and eliminate an entire environment that would otherwise have to be managed and maintained.
* Databases have become smarter and smarter, and provide sophisticated analytical capabilities. Why not use these instead of extracting data into a separate environment in order to perform analytic calculations or mine the data?
* Take advantage of database technology to manage data storage, high availability, and backups.
Conclusion
Managing large-scale business intelligence solutions can be a difficult and challenging task. Business intelligence environments have three major requirements: performance, scalability, and availability. Nothing is impossible if there is an unrestricted budget; but realistically, IT departments still face cost restrictions.
When planning large-scale business intelligence deployments, size matters. Think carefully about available database-related technologies capable of meeting performance, scalability, and availability requirements.