1 Teradata and Data Warehouse Architectures
1.1 Evolution to Active Data Warehousing
1.1.1 Data Warehouse Usage Evolution
There is an information evolution happening in the data warehouse environment today. Changing business requirements have placed demands on data warehousing technology to do more things faster. Data warehouses have moved from back room strategic decision support systems to operational, business-critical components of the enterprise. As your company evolves in its use of the data warehouse, what you need from the data warehouse evolves, too.
Stage 1 Reporting: The initial stage typically focuses on reporting from a single source of truth to drive decision-making across functional and/or product boundaries. Questions are usually known in advance, such as a weekly sales report.
Stage 2 Analyzing: Focus on why something happened, such as why sales went down or discovering patterns in customer buying habits. Users perform ad-hoc analysis, slicing and dicing the data at a detail level, and questions are not known in advance.
Stage 3 Predicting: Sophisticated analysts heavily utilize the system to leverage information to predict what will happen next in the business to proactively manage the organization's strategy. This stage requires data mining tools and building predictive models using historical detail. As an example, users can model customer demographics for target marketing.
Stage 4 Operationalizing: Providing access to information for immediate decision-making in the field enters the realm of active data warehousing. Stages 1 to 3 focus on strategic decision-making within an organization. Stage 4 focuses on tactical decision support. Tactical decision support is not focused on developing corporate strategy, but rather on supporting the people in the field who execute it. Examples: 1) inventory management with just-in-time replenishment, 2) scheduling and routing for package delivery, 3) altering a campaign based on current results.
Stage 5 Active Warehousing: The larger the role an ADW plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes. You can automate decision-making when a customer interacts with a web site. Interactive customer relationship management (CRM) on a web site or at an ATM is about making decisions to optimize the customer relationship through individualized product offers, pricing, content delivery, and so on. As technology evolves, more and more decisions are executed with event-driven triggers that initiate fully automated decision processes. Example: determine the best offer for a specific customer based on a real-time event, such as a significant ATM deposit.
1.1.2 Active Data Warehouse
Data warehouses are beginning to take on mission-critical roles supporting CRM, one-to-one marketing, and minute-to-minute decision-making. Data warehousing requirements have evolved to demand a decision capability that is not just oriented toward corporate staff and upper management, but actionable on a day-to-day basis. Decisions such as when to replenish Barbie dolls at a particular retail outlet may not be strategic at the level of customer segmentation or long-term pricing strategies, but when executed properly, they make a big difference to the bottom line. We refer to this capability as "tactical" decision support.
Tactical decisions are the drivers for day-to-day management of the business. Businesses today want more than just strategic insight from their data warehouse implementations; they want better execution in running the business through more effective use of information for the decisions that get made thousands of times per day.
The origin of the active data warehouse is the timely, integrated store of detail data available for analytic business decision-making. It is only from that source that the additional traits needed by the active data warehouse can evolve. These new "active" traits are supplemental to data warehouse functionality. For example, the work mix in the database still includes complex decision support queries, but expands to take on short, tactical queries, background data feeds, and possibly event-driven updates all at the same time. Data volumes and user concurrency levels may explode upward beyond expectation. Restraints may need to be placed on the longer, analytical queries in order to guarantee tactical work throughput. While accessing the detail data directly remains an important opportunity for analytical work, tactical work may thrive on shortcuts and summaries, such as operational data store (ODS) level information. And for both strategic and tactical decisions to be useful to the business, today's data, this hour's data, even this minute's data has to be at hand.
Teradata is positioned exceptionally well for stepping up to the challenges related to high availability, large multi-user workloads, and handling complex queries that are required for an active data warehouse implementation. Teradata technology supports the evolving business requirements by providing high performance and scalability for:
- Mixed workloads (both tactical and strategic queries) for mission critical applications
- Large amounts of detail data
- Concurrent users
Teradata provides 7x24 availability and reliability, as well as continuous updating of information so data is always fresh and accurate.
1.1.3 Evolution of Data Processing
Traditionally, data processing has been divided into two categories: on-line transaction processing (OLTP) and decision support systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.
An RDBMS is used in the following main processing environments:
- DSS
- OLTP
- OLAP
Decision Support Systems (DSS)
In a decision support environment, users submit requests to analyze historical detail data stored in the tables. The results are used to establish strategies, reveal trends, and make projections. A database used as a decision support system (DSS) usually receives fewer, but very complex, ad-hoc queries that may involve numerous tables. Decision support systems include batch reports, which roll up numbers to give the business the big picture, and over time they have evolved:
- Instead of pre-written scripts, users now require the ability to do ad-hoc queries, which are unpredictable in their processing and which allow "what if" types of questions. These types of questions are essential for long-range planning.
- DSS systems are able to read through huge volumes of data. DSS is an area where Teradata has traditionally excelled with its parallel architecture.
On-line Transaction Processing (OLTP)
Unlike the DSS environment, an on-line transaction processing (OLTP) environment typically has users accessing current data to update, insert, and delete rows in the data tables. OLTP is typified by a small number of rows (or records) from a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM: once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. It also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly; they must occur in real time.
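To make the shape of such a transaction concrete, here is a minimal sketch in plain Python. The account data and function names are invented for illustration; a real ATM debit would be a short SQL transaction submitted through a Teradata client, not Python code.

    accounts = {"checking-001": 500.00}   # stand-in for a single-row account table

    def atm_withdraw(account_id, amount):
        """One logical unit of work: validate the request, then debit one row."""
        balance = accounts[account_id]              # read the current balance
        if amount > balance:
            raise ValueError("insufficient funds")  # transaction aborts; no change
        accounts[account_id] = balance - amount     # the update: one row, little I/O
        return accounts[account_id]

    print(atm_withdraw("checking-001", 200.00))     # 300.0, completed in real time

The point of the sketch is the workload profile: a single logical unit of work that touches one row, performs almost no I/O, and must finish quickly.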
On-line Analytical Processing (OLAP)
OLAP is the kind of processing that takes place in many data warehouses or data marts. Here, the user may be looking for historical trends, sales rankings, or seasonal inventory fluctuations for the entire corporation. Usually, this requires a lot of detail data to be retrieved, processed, and analyzed, so response time may be seconds or minutes. The most sophisticated OLAP systems can make automated purchasing or inventory decisions without any human intervention.
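As a toy illustration of the workload profile, the Python below computes a sales ranking by quarter from detail rows. The data and column layout are invented; a real OLAP request would scan far more detail and would typically be expressed in SQL with GROUP BY.

    from collections import defaultdict

    detail = [  # one invented row per sale: (product, quarter, amount)
        ("widget", "Q1", 120.0), ("widget", "Q2", 180.0),
        ("gadget", "Q1", 300.0), ("gadget", "Q2", 90.0),
    ]

    totals = defaultdict(float)
    for product, quarter, amount in detail:   # read through the detail data
        totals[(product, quarter)] += amount  # aggregate as we go

    # Rank products within each quarter by total sales (highest first)
    for (product, quarter), total in sorted(
            totals.items(), key=lambda kv: (kv[0][1], -kv[1])):
        print(quarter, product, total)

Unlike the OLTP example, this request must read every detail row before it can answer, which is why OLAP response times are measured in seconds or minutes rather than sub-seconds.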
1.1.4 Advantages of Using Summary Data
Until recently, most business decisions were based on summary data. The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled out: a peak that falls at the boundary between two reporting periods, for example, is cut in half and disappears into both periods' totals.
Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.
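The toy Python below makes the contrast concrete. The statement data and check numbers are invented: the summary totals cannot say whether a particular check cleared, but the detail rows answer the question directly.

    detail = [  # every cleared item on the account this month (invented)
        {"type": "deposit", "amount": 1500.00, "check_no": None},
        {"type": "check",   "amount": -82.50,  "check_no": 1041},
        {"type": "check",   "amount": -310.00, "check_no": 1042},
    ]

    # Summary view: totals only -- these cannot say whether check 1042 cleared.
    total_deposits    = sum(r["amount"] for r in detail if r["amount"] > 0)
    total_withdrawals = sum(r["amount"] for r in detail if r["amount"] < 0)

    # Detail view: the question is answerable directly.
    check_1042_cleared = any(r["check_no"] == 1042 for r in detail)
    print(total_deposits, total_withdrawals, check_1042_cleared)  # 1500.0 -392.5 True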
Decision support (answering business questions) is the real purpose of databases. To answer business questions, decision-makers must have four things:
- The right data
- Enough detail data
- Proper data structure
- Enough computer power to access and produce reports on the data
Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?
1.2 The Data Warehouse
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems. Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.
Many data warehouses get their data directly from operational systems so that the data is timely and accurate. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.
Data Warehousing is a process, not a product. It is a technique to properly assemble and manage data from various sources to answer business questions not previously possible or known.
1.3 Data Marts
A data mart is a special purpose subset of enterprise data used by a particular department, function or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community.
Independent Data Marts
Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized, or summarized historically, as the requirements of the data mart dictate.
Logical Data Marts
Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology, it removes the need for massive data loading and transforming, making a single data store available for all user needs.
Dependent Data Marts
Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for performance-critical user queries.
1.4 Data Mart Pros and Cons
Independent Data Marts
Independent data marts are usually the easiest and fastest to implement and their payback value can be almost immediate. Some corporations start with several data marts before deciding to build a true data warehouse. This approach has several inherent problems:
- While data marts have obvious value, they are not a true enterprise-wide solution and can become very costly over time as more and more are added.
- A major problem with proliferating data marts is that, depending on where you look for answers, there is often more than one version of the truth.
- They do not provide the historical depth of a true data warehouse.
- Because data marts are designed to handle specific types of queries from a specific type of user, they are often not good at "what if" queries like a data warehouse would be.
Logical Data Marts
Logical data marts overcome most of the limitations of independent data marts. They provide a single version of the truth. There is no historical limit to the data, and "what if" querying is entirely feasible. The major drawback to logical data marts is the lack of physical control over the data. Because data in the warehouse is not pre-aggregated or dimensionalized, performance against the logical mart will not usually be as good as against an independent mart. However, use of parallelism in the logical mart can overcome some of the limitations of the non-transformed data.
Dependent Data Marts
Dependent data marts provide all advantages of a logical mart and also allow for physical control of the data as it is extracted from the data warehouse. Because dependent marts use the warehouse as their foundation, they are generally considered a better solution than independent marts, but they take longer and are more expensive to implement.
1.5 A Teradata System
A Teradata system contains one or more nodes. A node is a processing unit under the control of a single operating system; it is where the processing occurs for the Teradata Database. There are two types of Teradata systems:
- Symmetric multiprocessing (SMP) - An SMP Teradata system has a single node that contains multiple CPUs sharing a memory pool.
- Massively parallel processing (MPP) - Multiple SMP nodes working together comprise a larger, MPP implementation of Teradata. The nodes are connected using the BYNET, which allows multiple virtual processors on multiple nodes to communicate with each other.
To manage a Teradata system, you use:
- SMP system: System Console (keyboard and monitor) attached directly to the SMP node
- MPP system: Administration Workstation (AWS)
To access a Teradata system, a user typically logs on through one of multiple client platforms (channel-attached mainframes or network-attached workstations). Client access is discussed in the next module.
1.6 Node Components
A node is a basic building block of a Teradata system, and contains a large number of hardware and software components. A conceptual diagram of a node and its major components is shown below. Hardware components are shown on the left side of the node and software components are shown on the right side.
1.7 Shared Nothing Architecture
The Teradata vprocs (the PEs and AMPs) share the components of the node (memory and CPUs). The defining characteristic of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk), and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs, so they can all work in parallel for high overall system performance.
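A simplified sketch of this ownership model appears below. The hashing scheme (zlib.crc32 modulo the AMP count) is only a stand-in for Teradata's actual hashing algorithm and hash maps, and the customer IDs are invented; the point is that every row belongs to exactly one AMP's vdisk.

    import zlib

    NUM_AMPS = 4
    vdisks = {amp: [] for amp in range(NUM_AMPS)}   # each AMP's private storage

    def owning_amp(primary_index_value):
        """Map a primary index value to the single AMP that owns the row."""
        return zlib.crc32(primary_index_value.encode()) % NUM_AMPS

    for cust_id in ["C1001", "C1002", "C1003", "C1004", "C1005"]:
        vdisks[owning_amp(cust_id)].append(cust_id)  # each row lands on one vdisk only

    # Each AMP can now scan its own rows in parallel with the others.
    for amp, rows in vdisks.items():
        print(f"AMP {amp}: {rows}")

Because no two AMPs ever contend for the same rows, adding AMPs adds independent units of work rather than contention.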
1.8 Using the BYNET
The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate. It has several unique features:
- Scalable: As you add more nodes to the system, the overall network bandwidth scales linearly. This linear scalability means you can increase system size without performance penalty -- and sometimes even increase performance.
- High performance: An MPP system typically has two BYNET networks (BYNET 0 and BYNET 1). Because both networks in a system are active, the system benefits from having full use of the aggregate bandwidth of both the networks.
- Fault tolerant: Each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1.
- Load balanced: Traffic is automatically and dynamically distributed between both BYNETs.
1.9 BYNET Hardware and Software
The BYNET hardware and software handle the communication between the vprocs and the nodes.
- Hardware: The nodes of an MPP system are connected with the BYNET hardware, consisting of BYNET boards and cables.
- Software: The BYNET software is installed on every node. This BYNET driver is an interface between the PDE software and the BYNET hardware.
SMP systems do not contain BYNET hardware. The PDE and BYNET software emulate BYNET activity in a single-node environment. The SMP implementation is sometimes called "boardless BYNET."
1.10 Communication Between Nodes
The BYNET hardware can carry the following types of messages between nodes:
- Broadcast message to all nodes
- Point-to-point message from one node to another node
1.11 Communication Between Vprocs
On an MPP system, BYNET hardware is used to first send the communication across nodes (using either the point-to-point or broadcast messaging described previously).
On an SMP system, this first step is unnecessary since there is only one node.
Once a node receives a communication, vproc communication within the node is done by the PDE and BYNET software using the following types of messaging:
- Point-to-point
- Multicast
- Broadcast
Point-to-Point Messages
With point-to-point messaging between vprocs, a vproc can send a message to another vproc on:
- The same node (using PDE and BYNET software)
- A different node using two steps:
- Send a point-to-point message from the sending node to the node containing the recipient vproc. This is a communication between nodes using the BYNET hardware.
- Within the recipient node, the message is sent to the recipient vproc. This is a point-to-point communication between vprocs using the PDE and BYNET software.
Point-to-Point Message on the Same Node
Point-to-Point Message on a Different Node
Multicast Messages
A vproc can send a message to multiple vprocs using two steps:
- Send a broadcast message from the sending node to all nodes. This is a communication between nodes using the BYNET hardware.
- Within the recipient nodes, the PDE and BYNET software determine which, if any, of the node's vprocs should receive the message and deliver it accordingly. This is a multicast communication between vprocs within the node, using the PDE and BYNET software.
Broadcast Messages
A vproc can send a message to all the vprocs in the system using two steps:
- Send a broadcast message from the sending node to all nodes. This is a communication between nodes using the BYNET hardware.
- Within each recipient node, the message is sent to all vprocs. This is a broadcast communication between vprocs using the PDE and BYNET software.
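The toy Python model below pulls the three message types together. The node layout, vproc names, and delivery rules are simplified, invented stand-ins for the BYNET hardware (step 1, between nodes) and the PDE/BYNET software (step 2, within a node).

    nodes = {  # node id -> vprocs hosted on that node (invented layout)
        0: ["amp-0", "amp-1", "pe-0"],
        1: ["amp-2", "amp-3", "pe-1"],
    }

    def deliver(kind, recipients, payload):
        if kind == "point-to-point":
            # Step 1: BYNET hardware carries the message to the one target node.
            targets = {n for n, vps in nodes.items() if recipients & set(vps)}
        else:
            # Multicast and broadcast: step 1 is a hardware broadcast to all nodes.
            targets = set(nodes)
        for n in targets:
            # Step 2: PDE/BYNET software on each node decides which local
            # vprocs, if any, should receive the message.
            for vp in nodes[n]:
                if kind == "broadcast" or vp in recipients:
                    print(f"node {n}: {vp} <- {payload}")

    deliver("point-to-point", {"amp-3"}, "row request")
    deliver("multicast", {"amp-0", "amp-2"}, "group step")
    deliver("broadcast", set(), "all-AMP step")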
1.12 Cliques
A clique (pronounced, "kleek") is a group of nodes that share access to the same disk arrays. Each multi-node system has at least one clique. The cabling determines which nodes are in which cliques -- the nodes of a clique are connected to the disk array controllers of the same disk arrays.
1.13 Cliques Provide Resiliency
In the rare event of a node failure, cliques provide continued data access through vproc migration. When a node fails, the following happens to the AMPs:
- When the node fails, the Teradata Database restarts across all remaining nodes in the system.
- The vprocs from the failed node migrate to the operational nodes in its clique.
- Processing continues while the failed node is being repaired.
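A sketch of this migration, with the resulting per-node load factor, is shown below. The clique layout and the round-robin placement of the orphaned vprocs are invented for illustration; the actual migration is handled by the Teradata Database during the restart.

    clique = {  # node id -> AMP vprocs currently running there (invented)
        "node-1": ["amp-0", "amp-1"],
        "node-2": ["amp-2", "amp-3"],
        "node-3": ["amp-4", "amp-5"],
    }
    original_size = len(clique)

    def fail_node(clique, failed):
        """Migrate the failed node's vprocs to the clique's surviving nodes."""
        orphans = clique.pop(failed)
        survivors = sorted(clique)
        for i, vproc in enumerate(orphans):
            clique[survivors[i % len(survivors)]].append(vproc)  # spread the orphans

    fail_node(clique, "node-2")
    print(clique)  # amp-2 and amp-3 now run on node-1 and node-3

    # Each survivor now carries N/(N-1) of its normal load: 3/2 = 1.5x here,
    # so a larger clique spreads the same loss more thinly.
    print(f"load factor per surviving node: {original_size / (original_size - 1):.2f}")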
1.14 Cliques in a System
Vprocs are distributed across all nodes in the system. When a system has multiple cliques, each clique should have the same number of nodes.
The diagram below shows three cliques. The nodes in each clique are cabled to the same disk arrays. The overall system is connected by the BYNET. If one node goes down in a clique, its vprocs migrate to the other nodes in the clique, so data remains available. However, system performance decreases due to the loss of a node. The degree of degradation depends on clique size: in a clique of N nodes, the N-1 surviving nodes absorb the failed node's vprocs, so each carries roughly N/(N-1) of its normal workload. The larger the clique, the smaller the impact of losing a single node.
1.15 Software Components
A Teradata node requires three distinct pieces of software: the operating system, the Parallel Database Extensions (PDE), and the Teradata Database software itself. These are described in the sections that follow.
For each node in the system, you need both of the following:
- Operating system license (UNIX or Microsoft Windows)
- Teradata software license
1.16 Operating System
The Teradata Database can run on the following operating systems:
- UNIX MP-RAS
- Microsoft Windows 2000
1.17 Parallel Database Extensions (PDE)
The Parallel Database Extensions (PDE) software layer was added to the operating system by NCR to support the parallel software environment.
1.18 Trusted Parallel Application (TPA)
A Trusted Parallel Application (TPA) uses PDE to implement virtual processors (vprocs). The Teradata Database is classified as a TPA. The four components of the Teradata TPA are:
- AMP (Top Right)
- PE (Bottom Right)
- Channel Driver (Top Left)
- Teradata Gateway (Bottom Left)
1.19 Teradata Software: PE
A Parsing Engine (PE) is a vproc that manages the dialogue between a client application and the Teradata Database, once a valid session has been established. Each PE can support a maximum of 120 sessions. The PE handles an incoming request in the following manner:
- The Session Control component verifies the request for session authorization (user names and passwords), and either allows or disallows the request.
- The Parser does the following:
- Interprets the SQL statement received from the application.
- Verifies SQL requests for the proper syntax and evaluates them semantically.
- Consults the Data Dictionary to ensure that all objects exist and that the user has authority to access them.
- The Optimizer is "parallel aware," meaning that it has knowledge of the system components (how many nodes, vprocs, etc.), which enables it to determine the fastest way to process the query. To maximize throughput and minimize resource contention, the Optimizer must know about the system configuration, the available units of parallelism (AMPs and PEs), and the data demographics. The Teradata Optimizer is robust and intelligent, and enables Teradata to handle multiple complex, ad-hoc queries efficiently.
- The Dispatcher controls the sequence in which the steps are executed and passes them on to the BYNET for execution by the AMPs. After the AMPs process the steps, the PE receives their responses over the BYNET. The Dispatcher then builds a response message and sends it back to the user.
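The heavily simplified Python sketch below traces a request through these four components in order. The credentials, the toy grammar, and the single-step "plan" are all invented stand-ins; the real Session Control, Parser, Optimizer, and Dispatcher are far more sophisticated.

    VALID_USERS = {"dbc": "dbc"}   # invented credentials
    DATA_DICTIONARY = {"orders"}   # objects this user is allowed to access

    def handle_request(user, password, sql):
        # Session Control: verify authorization before anything else.
        if VALID_USERS.get(user) != password:
            return "failure: not authorized"
        # Parser: check syntax, then verify objects against the Data Dictionary.
        words = sql.lower().split()
        if words[:3] != ["select", "*", "from"] or len(words) != 4:
            return "failure: syntax error"    # toy grammar: SELECT * FROM <table>
        if words[3] not in DATA_DICTIONARY:
            return "failure: object does not exist"
        # Optimizer: choose execution steps (trivially, a full-table scan here).
        steps = [f"full-table scan of {words[3]} on all AMPs"]
        # Dispatcher: send steps to the AMPs over the BYNET, collect their
        # responses, and build the response message for the user.
        responses = [f"executed: {step}" for step in steps]
        return "; ".join(responses)

    print(handle_request("dbc", "dbc", "SELECT * FROM orders"))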
1.20 Teradata Software: AMP
The AMP is a vproc that controls its portion of the data on the system. The AMPs work in parallel, each AMP managing the data rows stored on its vdisk. AMPs are involved in data distribution and data access in different ways.
Data Distribution
When data is loaded, inserted, and updated, the AMP:
- Receives incoming data from the PE.
- Formats rows and distributes them on its vdisk.
Data Access
When data is accessed, the AMP retrieves the rows requested by the PE in the following manner:
- The database management subsystem receives the steps from the Dispatcher over the BYNET.
- The database management subsystem processes the steps. The subsystem on the AMP can:
- Lock databases and tables
- Create, modify, or delete definitions of tables
- Join tables
- Insert, delete, or modify rows within tables
- Sort, aggregate, or format data
- Retrieve information from definitions and rows from tables
1.21 Teradata Software: Channel Driver
Channel Driver software is the means of communication between an application and the PEs assigned to channel-attached clients. There is one Channel Driver per node.
In the diagram below, the blue dots show the communication from the channel-attached client, to the host channel adapter in the node, to the Channel Driver software, to the PE, and back to the client.
1.22 Teradata Software: Teradata Gateway
Teradata Gateway software is the means of communication between an application and the PEs assigned to network-attached clients. There is one Teradata Gateway per node.
In the diagram below, the blue dots show the communication from the network-attached client, to the Ethernet card in the node, to the Teradata Gateway software, to the PE, and back to the client.