Data Warehouse Architecture: Traditional vs. Cloud – Panoply

A data warehouse is an electronic system that collects data from a wide range of sources within a company and uses the data to support management decision-making.

Businesses are increasingly moving towards cloud-based data warehouses instead of traditional on-premises systems.

Cloud-based data warehouses differ from traditional warehouses in the following ways:

  • There is no need to purchase physical hardware.
  • It’s faster and cheaper to set up and scale data warehouses in the cloud.
  • Cloud-based data warehousing architectures typically perform complex analytic queries much faster because they use massively parallel processing (MPP).

The remainder of this article covers traditional data warehouse architecture and introduces some architectural ideas and concepts used by the most popular cloud-based data warehousing services.

For more information, see our page on data warehouse concepts in this guide.

Traditional Data Warehouse Architecture

The following concepts highlight some of the established ideas and design principles used to build traditional data warehouses.

Three-tier architecture

The traditional data warehouse architecture employs a three-tier structure composed of the following tiers:

  • Bottom tier: This tier contains the database server used to extract data from many different sources, such as transactional databases used for front-end applications.

  • Middle tier: The middle tier hosts an OLAP server, which transforms the data into a structure better suited for analysis and complex queries. The OLAP server can function in two ways: as an extended relational database management system that maps operations on multidimensional data to standard relational operations (relational OLAP), or by using a multidimensional OLAP model that directly implements multidimensional data and operations. A sketch of the relational mapping follows this list.
  • Top tier: The top tier is the client layer. It contains the tools used for high-level data analysis, query reporting, and data mining.
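To make the relational OLAP idea concrete, here is a minimal sketch that expresses a multidimensional roll-up (total sales by year and quarter) as ordinary relational operations. The tables, columns, and values are hypothetical, and SQLite is used only as a convenient stand-in relational engine.

```python
import sqlite3

# Minimal ROLAP sketch: a multidimensional roll-up expressed as plain
# relational JOIN + GROUP BY. All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INT, quarter INT);
    CREATE TABLE sales_fact (date_key INTEGER, sales_amount REAL);
    INSERT INTO dim_date   VALUES (1, 2023, 1), (2, 2023, 2);
    INSERT INTO sales_fact VALUES (1, 100.0), (1, 250.0), (2, 75.0);
""")

rollup = conn.execute("""
    SELECT d.year, d.quarter, SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f JOIN dim_date d ON f.date_key = d.date_key
    GROUP  BY d.year, d.quarter
    ORDER  BY d.year, d.quarter
""").fetchall()
print(rollup)  # [(2023, 1, 350.0), (2023, 2, 75.0)]
```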

Kimball vs. Inmon

Two data warehousing pioneers named Bill Inmon and Ralph Kimball had different approaches to data warehouse design.

Ralph Kimball’s approach emphasized the importance of data marts, which are repositories of data belonging to particular lines of business. The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. Kimball’s data warehouse design uses a “bottom-up” approach.

Bill Inmon considered the data warehouse to be the centralized repository for all business data. In this approach, an organization first creates a normalized data warehouse model. Next, dimensional data marts are created based on the warehouse model. This is known as a top-down approach to data warehousing.

In a traditional architecture there are three common models of data warehousing: virtual warehouse, data mart, and enterprise data warehouse:

  • A virtual data warehouse is a set of separate databases, which can be queried together, so that a user can effectively access all data as if it were stored in a data warehouse.
  • A data mart model is used for line-of-business-specific reporting and analysis. In this data warehouse model, data is aggregated from a variety of source systems relevant to a specific business area, such as sales or finance.
  • An enterprise data warehouse model prescribes that the data warehouse contain aggregated data that spans the entire organization. This model sees the data warehouse as the heart of the enterprise information system, with integrated data from all business units.

Star schema vs. snowflake schema

The star schema and the snowflake schema are two ways to structure a data warehouse. The star schema has a centralized data repository, stored in a fact table.

The schema surrounds the fact table with a series of denormalized dimension tables. The fact table contains aggregated data that will be used for reporting purposes, while the dimension tables describe the stored data.

Denormalized designs are less complex because the data is grouped. The fact table uses only one link to join each dimension table. The simpler design of the star schema makes it much easier to write complex queries.

The snowflake schema is different because it normalizes the data. Normalization means efficiently organizing the data so that all data dependencies are defined and each table contains minimal redundancy. Dimension tables therefore branch out into separate sub-dimension tables.

The snowflake schema uses less disk space and better preserves data integrity. The main disadvantage is the complexity of the queries needed to access the data: each query must work through multiple table joins to reach the relevant data.
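To illustrate the structural difference, here is a minimal sketch of the same product dimension modeled both ways; all table and column names are hypothetical, and SQLite is used only as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: the fact table joins directly to denormalized dimension tables.
conn.executescript("""
    CREATE TABLE dim_product (product_key   INTEGER PRIMARY KEY,
                              product_name  TEXT,
                              category_name TEXT);  -- category kept inline (denormalized)
    CREATE TABLE fact_sales  (product_key   INTEGER,
                              sales_amount  REAL);
""")

# Snowflake schema: the product dimension is normalized, so it branches out
# into a separate category table; reaching category_name now needs an extra join.
conn.executescript("""
    CREATE TABLE dim_category   (category_key  INTEGER PRIMARY KEY,
                                 category_name TEXT);
    CREATE TABLE dim_product_sf (product_key   INTEGER PRIMARY KEY,
                                 product_name  TEXT,
                                 category_key  INTEGER REFERENCES dim_category);
""")
```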

ETL vs. ELT

ETL and ELT are two different methods for loading data into a data warehouse.

Extract, Transform, Load (ETL) first extracts data from a group of data sources, which are typically transactional databases. The data is saved in a temporary staging database. Transformation operations are then performed to structure and convert the data into a form suitable for the target data storage system. Structured data is loaded into the warehouse, ready for analysis.

With Extract, Load, Transform (ELT), data is loaded immediately after it is extracted from the source systems. There is no staging database: the data is loaded directly into the single, centralized repository and transformed within the data warehouse system for use with business intelligence tools and analytics.
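As a minimal sketch of the ordering difference, the snippet below models both flows in plain Python; the extract and transform steps and the in-memory "warehouse" are hypothetical stand-ins for real systems.

```python
# Hypothetical extract and transform steps; an in-memory dict stands in for
# the data warehouse.

def extract():
    # Pretend these rows come from a transactional source database.
    return [{"event_id": 1, "country": "de"}, {"event_id": 2, "country": "US"}]

def transform(rows):
    # Example transformation: normalize country codes to upper case.
    return [{**row, "country": row["country"].upper()} for row in rows]

def etl(warehouse):
    rows = extract()                      # 1. extract from the source
    staged = transform(rows)              # 2. transform in a staging area
    warehouse["events"] = staged          # 3. load the structured result

def elt(warehouse):
    warehouse["raw_events"] = extract()   # 1. extract and 2. load immediately
    # 3. transform inside the warehouse (in practice this would be SQL run by
    #    the warehouse engine, e.g. CREATE TABLE ... AS SELECT ...).
    warehouse["events"] = transform(warehouse["raw_events"])

dw = {}
etl(dw)   # or elt(dw)
```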

Organizational maturity

The structure of an organization’s data warehouse also depends on its current situation and needs.

The basic structure allows warehouse end users to directly access summary data derived from source systems and perform analysis, reporting, and data mining on that data. This structure is useful when data sources derive from the same types of database systems.

A warehouse with a staging area is the next logical step in an organization with disparate data sources with many different data types and formats. The staging area converts the data into a structured summary format that is easier to refer to with analysis and reporting tools.

A variation of the staging structure is the addition of data marts to the data warehouse. Data marts store summary data for a particular line of business, making the data easily accessible for specific forms of analysis. For example, adding data marts can allow a financial analyst to more easily perform detailed queries about sales data, to make predictions about customer behavior. Data marts facilitate analysis by tailoring data specifically to meet end-user needs.

New data warehouse architectures

In recent years, data warehouses have been moving to the cloud. The new cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse offering has a unique architecture.

This section summarizes the architectures used by two of the most popular cloud-based data warehouses: Amazon Redshift and Google BigQuery.

Amazon Redshift

Amazon Redshift is a cloud-based representation of a traditional data warehouse.

Redshift requires compute resources to be provisioned and configured in the form of clusters, which contain a collection of one or more nodes. Each node has its own CPU, storage, and RAM. A leader node compiles queries and transfers them to compute nodes, which execute the queries.

At each node, data is stored in chunks, called slices. Redshift uses columnar storage, which means that each block of data contains values from a single column across multiple rows, rather than a single row with values from multiple columns.
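As a toy illustration of the difference, the following sketch lays the same three made-up records out row by row and column by column:

```python
# Row-oriented layout: each record keeps all of its column values together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 19.99},
    {"order_id": 2, "region": "US", "amount": 5.00},
    {"order_id": 3, "region": "EU", "amount": 42.50},
]

# Column-oriented layout: each block holds many values of a single column,
# so an aggregate over one column only has to read that column's blocks.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [19.99, 5.00, 42.50],
}
total = sum(columns["amount"])  # touches only the "amount" column
```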


Redshift uses an MPP architecture, which divides large data sets into chunks that are mapped to slices within each node. Queries run faster because the compute nodes process each slice simultaneously. The leader node aggregates the results and returns them to the client application.

Client applications, such as BI and analysis tools, can connect directly to Redshift using open source PostgreSQL JDBC and ODBC drivers. In this way, analysts can perform their tasks directly on Redshift data.

Redshift can only load structured data. You can load data into Redshift using pre-integrated systems, including Amazon S3 and DynamoDB, by sending data from any local host with SSH connectivity, or by integrating other data sources using the Redshift API.
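As a rough sketch of what loading from S3 can look like, the snippet below connects to a cluster with a standard PostgreSQL driver and issues a COPY command; the cluster endpoint, credentials, table, bucket, and IAM role are all hypothetical placeholders.

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# All connection details, table names, and ARNs below are hypothetical.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # COPY is Redshift's bulk-load command; here it pulls CSV files from S3.
    cur.execute("""
        COPY sales_fact
        FROM 's3://example-bucket/sales/2024/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
        FORMAT AS CSV;
    """)
```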

Google BigQuery architecture

BigQuery is serverless, which means that Google dynamically manages the allocation of machine resources. Therefore, all resource management decisions are hidden from the user.

BigQuery allows customers to upload data from Google Cloud Storage and other readable data sources. The alternative option is to stream data, which allows developers to add data to the data warehouse in real time, row by row, as it becomes available.

BigQuery uses a query execution engine called Dremel, which can scan billions of rows of data in just a few seconds. Dremel uses massively parallel querying to scan data in the underlying Colossus file management system. Colossus distributes files in 64-megabyte chunks among many computing resources called nodes, which are grouped into clusters.

Dremel uses a columnar data structure, similar to Redshift. A tree architecture distributes queries across thousands of machines in seconds.


Simple SQL commands are used to query the data.
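For illustration, the sketch below streams a few rows into a table and then queries them with the google-cloud-bigquery Python client; the project, dataset, table, and fields are hypothetical, and the table is assumed to already exist with a matching schema.

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials are picked up from the environment

# Hypothetical fully qualified table name: project.dataset.table
table_id = "my-project.analytics.events"

# Streaming insert: rows become available for querying almost immediately.
rows = [
    {"event_id": 1, "country": "DE"},
    {"event_id": 2, "country": "US"},
]
errors = client.insert_rows_json(table_id, rows)
assert errors == [], errors

# A simple SQL query over the same table.
query = f"SELECT country, COUNT(*) AS events FROM `{table_id}` GROUP BY country"
for row in client.query(query).result():
    print(row["country"], row["events"])
```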

Panoply

Panoply provides end-to-end data management as a service. It makes it easy to connect all your data to a central data warehouse, reducing time from data to value.

Panoply’s cloud data platform includes the following features:

  • Code-free data integrations: Connect to all your data sources without complicated code.
  • Low-maintenance cloud storage: Keep a copy of your data in the cloud so it’s ready for analysis when you are.

  • Simple SQL-based views: Create and apply core business logic for consistent downstream metrics.

Beyond cloud data warehouses

Cloud-based data warehouses are a big step up from traditional architectures. However, users still face several challenges when configuring them:

  • Loading data into cloud data warehouses is nontrivial and, for large-scale data pipelines, requires configuring, testing, and maintaining an ETL process. This part of the process is usually done with third-party tools.
  • Updates, enhancements, and deletions can be complicated and must be done carefully to avoid degradation in query performance.
  • Semi-structured data is unwieldy: it must be normalized into a relational database format, which requires automation for large data flows.
  • Nested structures are typically not supported in cloud data warehouses. You must flatten nested tables into a format that the data warehouse can understand.
  • Backup and recovery: While data warehouse providers offer numerous options for backing up your data, they are not trivial to set up and require monitoring and attention.

Panoply takes care of all the above complex tasks, saving valuable time and helping you reduce the time from data to information. Learn more about Panoply’s cloud data platform.

Learn more about data warehouses

  • Data Warehouse Concepts: Traditional vs. Cloud
  • Database vs. Data Warehouse
  • Data Mart vs. Data Warehouse
  • Amazon Redshift Architecture