A data warehouse is an electronic system that collects data from a wide range of sources within a company and uses the data to support management decision-making.
Businesses are increasingly moving towards cloud-based data warehouses instead of traditional on-premises systems.
Cloud-based data warehouses differ from traditional warehouses in the following ways:
- There is no need to purchase physical hardware
- It is faster and cheaper to set up and scale data warehouses in the cloud
- Cloud-based data warehouse architectures typically run complex analytic queries much faster because they use massively parallel processing (MPP)
The remainder of this article covers traditional data warehouse architecture and introduces some architectural ideas and concepts used by the most popular cloud-based data warehousing services.
For more information, see our page on data warehouse concepts in this guide.
Traditional Data Warehouse Architecture
The following concepts highlight some of the established ideas and design principles used to build traditional data warehouses.
Three-tier architecture
The traditional data warehouse architecture employs a three-tier structure composed of the following tiers:
- Bottom tier: This tier contains the database server used to extract data from many different sources, such as transactional databases used for front-end applications.
- Middle tier: The middle tier hosts an OLAP server, which transforms the data into a structure better suited for analysis and complex queries. The OLAP server can work in two ways: as an extended relational database management system that maps operations on multidimensional data to standard relational operations (relational OLAP), or by using a multidimensional OLAP model that directly implements multidimensional data and operations.
- Top tier: The top tier is the client layer. It contains the tools used for high-level data analysis, query reporting, and data mining.
Kimball vs. Inmon
Two data warehousing pioneers, Bill Inmon and Ralph Kimball, had different approaches to data warehouse design.
Ralph Kimball’s approach emphasized the importance of data marts, which are repositories of data belonging to particular lines of business. The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. Kimball’s data warehouse design uses a “bottom-up” approach.
Bill Inmon saw the data warehouse as the centralized repository for all business data. In this approach, an organization first creates a normalized data warehouse model. Next, dimensional data marts are created based on the warehouse model. This is known as a top-down approach to data warehousing.
Data warehouse models
In a traditional architecture there are three common data warehouse models: virtual warehouse, data mart, and enterprise data warehouse:
- A virtual data warehouse is a set of separate databases, which can be queried together, so that a user can effectively access all data as if it were stored in a data warehouse.
- A data mart model is used for line-of-business-specific reporting and analysis. In this data warehouse model, data is aggregated from a variety of source systems relevant to a specific business area, such as sales or finance.
- An enterprise data warehouse model prescribes that the data warehouse contain aggregated data that spans the entire organization. This model sees the data warehouse as the heart of the enterprise information system, with integrated data from all business units.
Star schema vs. snowflake schema
The star schema and the snowflake schema are two ways to structure a data warehouse. The star schema has a centralized data repository, stored in a fact table. The schema splits the fact table into a series of denormalized dimension tables. The fact table contains aggregated data to be used for reporting purposes, while the dimension tables describe the stored data.
Denormalized designs are less complex because the data is grouped. The fact table uses only a single join to link to each dimension table. The star schema's simpler design makes it much easier to write complex queries.
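As a minimal sketch of what a star schema can look like in SQL (the table and column names are illustrative, not taken from any particular system):

```sql
-- Illustrative star schema: one fact table joined directly to denormalized dimensions
CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,
    year      INT,
    month     INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)   -- denormalized: category lives directly on the product dimension
);

-- Central fact table holding the measures, keyed to each dimension
CREATE TABLE fact_sales (
    date_key    INT REFERENCES dim_date (date_key),
    product_key INT REFERENCES dim_product (product_key),
    units_sold  INT,
    revenue     DECIMAL(12,2)
);

-- Each dimension is reached with a single join from the fact table
SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date    d ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
```

Because the category attribute sits directly on the product dimension, the report reaches every dimension through a single join.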
The snowflake schema is different because it normalizes the data. Normalization means effectively organizing data so that all data dependencies are defined and each table contains minimal redundancies. Therefore, single-dimensional tables branch into tables of separate dimensions.
The snowflake schema uses less disk space and better preserves data integrity. The main disadvantage is the complexity of the queries needed to access the data: each query must drill through multiple levels of dimension tables to reach the relevant data.
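Continuing the illustrative example above, a snowflake version normalizes the product dimension, so the same report needs an extra join:

```sql
-- Snowflake variant of the product dimension (names are illustrative).
-- The category attributes move into their own table, so reaching them
-- requires one more join than in the star schema sketched earlier.
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category (category_key)
);

SELECT c.category_name, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_category c ON p.category_key = c.category_key
GROUP BY c.category_name;
```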
ETL vs. ELT
ETL and ELT are two different methods for loading data into a data warehouse.
Extract, Transform, Load (ETL) first extracts data from a group of data sources, which are typically transactional databases. The data is saved in a temporary staging database. Transformation operations are then performed to structure and convert the data into a form suitable for the target data storage system. Structured data is loaded into the warehouse, ready for analysis.
With Extract, Load, Transform (ELT), data is loaded immediately after it is extracted from the source systems. There is no staging database: the data is loaded directly into the single, centralized repository. The data is then transformed within the data warehouse system for use with business intelligence tools and analytics.
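As a rough sketch of the ELT pattern, the transformation step is just SQL run inside the warehouse against a raw table that was loaded as-is; the schema and table names below are hypothetical:

```sql
-- ELT-style transformation performed inside the warehouse itself.
-- raw.events is a hypothetical table loaded unchanged from a source system.
CREATE TABLE analytics.daily_signups AS
SELECT
    CAST(event_timestamp AS DATE) AS signup_date,
    COUNT(*)                      AS signups
FROM raw.events
WHERE event_type = 'signup'
GROUP BY signup_date;
```

In an ETL pipeline, the same aggregation would instead be performed in the staging system before the data ever reaches the warehouse.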
Organizational maturity
The structure of an organization's data warehouse also depends on its current situation and needs.
The basic structure allows warehouse end users to directly access summary data derived from source systems and perform analysis, reporting, and data mining on that data. This structure is useful when data sources derive from the same types of database systems.
A warehouse with a staging area is the next logical step in an organization with disparate data sources with many different data types and formats. The staging area converts the data into a structured summary format that is easier to refer to with analysis and reporting tools.
A variation of the staging structure is the addition of data marts to the data warehouse. Data marts store summary data for a particular line of business, making the data easily accessible for specific forms of analysis. For example, adding data marts can allow a financial analyst to more easily perform detailed queries about sales data, to make predictions about customer behavior. Data marts facilitate analysis by tailoring data specifically to meet end-user needs.
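As a hedged illustration, a simple data mart can be little more than a summary view over the warehouse tables sketched earlier; the view name below is a placeholder:

```sql
-- Hypothetical sales data mart exposed as a summary view over the warehouse tables
CREATE VIEW sales_mart_monthly_revenue AS
SELECT d.year, d.month, p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date    d ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, d.month, p.category;
```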
New data warehouse architectures
In recent years, data warehouses have been moving to the cloud. These new cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse offering has a unique architecture.
This section summarizes the architectures used by two of the most popular cloud-based data warehouses: Amazon Redshift and Google BigQuery.
Amazon Redshift
Amazon Redshift is a cloud-based representation of a traditional data warehouse.
Redshift requires compute resources to be provisioned and configured in the form of clusters, which contain a collection of one or more nodes. Each node has its own CPU, storage, and RAM. A leader node compiles queries and transfers them to compute nodes, which execute the queries.
On each node, data is stored in chunks, called slices. Redshift uses columnar storage, which means that each block of data contains values from a single column across multiple rows, rather than a single row with values from multiple columns.
Redshift uses an MPP architecture, which breaks large data sets into chunks that are assigned to slices within each node. Queries run faster because the compute nodes process queries on every slice simultaneously. The leader node aggregates the results and returns them to the client application.
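As an illustration of how this distribution is controlled in practice, a Redshift table can declare a distribution key and sort key when it is created; the table and key choices below are hypothetical:

```sql
-- Illustrative Redshift table definition controlling how rows are spread across slices
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY           -- distribute rows by the hash of a column
DISTKEY (customer_id)   -- rows with the same customer_id land on the same slice
SORTKEY (sale_date);    -- sort blocks on disk to speed up range-restricted scans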
Client applications, such as BI and analysis tools, can connect directly to Redshift using open source PostgreSQL JDBC and ODBC drivers. In this way, analysts can perform their tasks directly on Redshift data.
Redshift can only load structured data. You can load data into Redshift using pre-integrated systems, including Amazon S3 and DynamoDB, by sending data from any local host with SSH connectivity, or by integrating other data sources using the Redshift API.
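For example, loading a CSV file staged in Amazon S3 is typically done with Redshift's COPY command; the bucket path and IAM role ARN below are placeholders:

```sql
-- Load structured data into Redshift from Amazon S3 (placeholder bucket and role)
COPY sales
FROM 's3://example-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
FORMAT AS CSV
IGNOREHEADER 1;
```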
The Google BigQuery architecture
BigQuery is serverless, which means that Google dynamically manages the allocation of machine resources. Therefore, all resource management decisions are hidden from the user.
BigQuery allows customers to upload data from Google Cloud Storage and other readable data sources. The alternative option is to stream data, which allows developers to add data to the data warehouse in real time, row by row, as it becomes available.
BigQuery uses a query execution engine called Dremel, which can scan billions of rows of data in just a few seconds. Dremel uses massively parallel querying to scan data in the underlying Colossus file management system. Colossus distributes files in 64-megabyte chunks across many computing resources called nodes, which are grouped into clusters.
Dremel uses a columnar data structure, similar to Redshift. A tree architecture distributes queries across thousands of machines in seconds.
Simple SQL commands are used to query the data.
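For example, a typical BigQuery standard SQL query looks like the following; the project, dataset, and table names are placeholders:

```sql
-- Illustrative BigQuery standard SQL query against a placeholder table
SELECT
    product_id,
    SUM(amount) AS total_sales
FROM `example-project.sales_dataset.fact_sales`
WHERE sale_date >= DATE '2024-01-01'
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 10;
```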
Panoply
Panoply provides end-to-end data management as a service. It makes it easy to connect all your data to a central data warehouse, reducing the time from data to value.
Panoply's cloud data platform includes the following features:
- Code-free data integrations: Connect to all your data sources without complicated code
- Low-maintenance cloud storage: Keep a copy of your data in the cloud so it’s ready for analysis when you are
- Simple SQL-based views: Create and apply core business logic for consistent downstream metrics
Beyond cloud data warehouses
Cloud-based data warehouses are a big step up from traditional architectures. However, users still face several challenges when configuring them:
- Loading data into cloud data warehouses is nontrivial and, for large-scale data pipelines, requires configuring, testing, and maintaining an ETL process. This part of the process is usually done with third-party tools.
- Updates, enhancements, and deletions can be complicated and must be done carefully to avoid degradation in query performance.
- Semi-structured data is unwieldy: it must be normalized into a relational database format, which requires automation for large data flows.
- Nested structures are typically not supported in cloud data warehouses. You need to flatten nested data into a format the data warehouse can understand.
- Backup and recovery: While data warehouse providers offer numerous options for backing up your data, they are not trivial to set up and require monitoring and attention.
Panoply takes care of all of these complex tasks, saving valuable time and helping you reduce the time from data to value. Learn more about Panoply's cloud data platform.
Learn more about data warehouses
- Data Warehouse Concepts: Traditional vs. Cloud
- Database vs. Data Warehouse
- Data Mart vs. Data Warehouse
- Amazon Redshift Architecture