Data warehouses have been the most powerful enterprise application for company decision-making and analysis for over a decade. Many other applications supporting data warehouses have also been developed.
These include conventional OLAP and ETL tools from countless hardware and software vendors, data warehouse appliances that combine databases and servers, in-memory databases, and virtualization tools.
The Hadoop ecosystem, which is developing along with big data environments, also has a set of various technologies for data warehouse construction.
For example, Apache Hive is a data warehouse option within the Hadoop ecosystem that can process large data sets stored in distributed storage using SQL. Even major vendors of data warehouse tools and databases tend to reinforce their connection to Hadoop by adding Hadoop connectors to their products.
Although there are many documented cases of multi-dimensional modeling on databases from Oracle, IBM, and Microsoft, data modeling on the Hadoop ecosystem is still not widely known. In the near future, however, Hadoop may become a new platform for commercial data warehouse construction as big data technologies advance.
Today, I’d like to introduce what multi-dimensional modeling needs to consider when newly constructing a data warehouse on, or migrating one to, a Hadoop ecosystem.
Multi-dimensional modeling is a modeling technique used for data warehouse construction. It creates an environment in which users can extract data from a data warehouse more easily and quickly.
Multi-dimensional modeling, commonly known as the Star Schema, is the most widely used modeling method because it helps users understand the data and returns query results quickly.
The main components of multi-dimensional modeling include a fact table and a dimension table. Fact tables list numerical values (sales numbers, payments, etc.) generated in the business process.
One of the most important characteristics of a fact table is granularity: every row in a fact table must describe the business process at the same level of detail. As long as this rule is followed, calculations over the fact table keep their integrity regardless of which measures or mathematical operations are used.
The dimension table is the other major component; it gives the fact table context, such as the date, branch, and client. Multi-dimensional models built from fact and dimension tables are called Star Schemas or Snowflake Schemas, and their entities are converted into physical tables used for data loading and analysis.
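As a minimal sketch, a star schema in Hive might pair a sales fact table with date and store dimensions (all table and column names here are hypothetical, chosen only to illustrate the structure):

```sql
-- Hypothetical star schema: one fact table referencing two dimensions.
CREATE TABLE dim_date (
  date_key   INT,      -- surrogate key, e.g. 20240115
  full_date  STRING,
  month      INT,
  year       INT
);

CREATE TABLE dim_store (
  store_key  INT,
  store_name STRING,
  region     STRING
);

-- Fixed grain: each row records the sales of one store on one day.
CREATE TABLE fact_sales (
  date_key     INT,
  store_key    INT,
  sales_amount DECIMAL(18,2),
  quantity     INT
);
```

The fact table holds only numeric measures and foreign keys; all descriptive context lives in the dimensions.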
Multi-dimensional modeling needs to provide information that helps users understand how data is organized, because users first have to grasp the data itself in order to search and analyze it.
For this reason, multi-dimensional modeling should be performed first when creating a Hadoop based data warehouse, and the result must be provided to users. In other words, Hadoop based data warehouses also require the same multi-dimensional modeling, and this model is there to help users understand data more easily.
This process is similar to RDBMS-based data warehouse design in that additional physical design steps are needed to reflect the characteristics of Hadoop, so that acceptable performance can be secured for the many unpredictable queries that arrive once the multi-dimensional model is converted into a physical model.
Just as multi-dimensional modeling is performed to exploit the capabilities of an RDBMS-based data warehouse, on Hadoop it corresponds to the logical modeling phase. Physical modeling then redesigns the result, combining and separating entities to deliver optimal query performance.
In addition, physical modeling has to enhance the general performance of a wide range of queries rather than improving only one or a few of them, because user query patterns are unpredictable. Let’s look at what to consider during physical modeling once the logical model, the multi-dimensional model, has been designed.
The process of physical model implementation for Hadoop-based data warehouses is as follows: starting from the multi-dimensional model, the physical model is completed through the following stages.
※ The following modeling process is based on HDFS/Hive.
① De-normalization
De-normalization is commonly used to improve performance when implementing a Star Schema model on the Hadoop Distributed File System (HDFS). Unlike an RDBMS, Hadoop can tolerate the data duplication caused by de-normalization, thanks to its relatively cheap hardware and software.
However, propagating changes to a de-normalized dataset is not straightforward, since a modification must be applied to every duplicated copy of the data.
Therefore, the decision to de-normalize datasets that change often should be made carefully, weighing query performance against loading and maintenance cost. The following are de-normalization methods that can be used for Hadoop-based data warehouse construction.
A. Dimension Integration
Multiple dimensions with small row counts are bundled into a single dimension dataset so that the number of joins against the fact dataset can be reduced. With Hadoop, it’s difficult to optimize queries that involve many joins between dimension and fact datasets.
Therefore, small dimension datasets that are commonly used together in queries are combined into a single dataset through a Cartesian product (set A × set B), and that dataset is joined to the fact dataset.
Because a small number of large files performs better than a large number of small files on Hadoop, one large dimension is processed faster than multiple small dimensions. This method is similar to materializing an RDBMS star join as a physical dataset.
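As an illustration (the region and channel dimensions are hypothetical), two small dimensions could be pre-combined with a Cartesian product so that user queries join the fact table against a single merged dimension:

```sql
-- Merge two small dimensions into one via a Cartesian product
-- (CROSS JOIN), so the fact table needs only one dimension join
-- at query time instead of two.
CREATE TABLE dim_region_channel AS
SELECT
  r.region_key,
  r.region_name,
  c.channel_key,
  c.channel_name
FROM dim_region  r
CROSS JOIN dim_channel c;
```

Since both inputs are small, the product stays manageable while the number of joins in every downstream query drops.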
B. Fact-Dimension Integration
Dimension and fact datasets that are used together in queries are combined into a single dataset by performing the join during the loading process instead of in the user query.
Datasets are joined at the point of data loading to reduce the number of joins and minimize the number of datasets touched by queries. Everything is the same as dimension integration except that the integration is between fact and dimension.
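A sketch of this load-time join, reusing the hypothetical star schema names from earlier:

```sql
-- Join the fact table with its dimensions once, at load time,
-- and store the de-normalized wide table for user queries.
CREATE TABLE fact_sales_wide AS
SELECT
  f.date_key,
  d.year,
  d.month,
  s.store_name,
  s.region,
  f.sales_amount,
  f.quantity
FROM fact_sales f
JOIN dim_date  d ON f.date_key  = d.date_key
JOIN dim_store s ON f.store_key = s.store_key;
```

User queries then scan `fact_sales_wide` directly, with no joins at all.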
C. Dataset Generation for Deployment
Because, given how HDFS works, it’s generally faster to regenerate an entire dataset than to update individual records, datasets required for the loading process are added according to the characteristics of the data.
If a dataset reflecting the current state is needed, a dataset holding the entire history is added at the loading stage rather than repeatedly updating records in a data mart; the final data mart can then be regenerated from it.
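For example, a "current state" mart could be rebuilt in one pass from a hypothetical history table, instead of being updated in place:

```sql
-- Rebuild the current-state mart from the full history in one pass.
-- customer_history and mart_customer_current are hypothetical tables.
INSERT OVERWRITE TABLE mart_customer_current
SELECT customer_id, status, updated_at
FROM (
  SELECT customer_id, status, updated_at,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY updated_at DESC) AS rn
  FROM customer_history
) t
WHERE rn = 1;   -- keep only the latest record per customer
```

`INSERT OVERWRITE` replaces the whole mart atomically, which fits HDFS's write-once storage model better than record-level updates.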
② File Storage Methods
Storage options for HDFS can be chosen per dataset according to its characteristics. Sequence files suit a row-based data access pattern, while columnar formats such as Parquet and RCFile suit a column-based pattern and can be used to enhance performance.
In particular, when a data warehouse workload extracts and uses only a subset of columns from a dataset, a columnar format that reads only the required columns greatly improves query performance.
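In Hive, the storage format is declared per table; a minimal sketch (table names hypothetical):

```sql
-- Columnar storage for analytic scans that touch a few columns.
CREATE TABLE fact_sales_parquet (
  date_key     INT,
  store_key    INT,
  sales_amount DECIMAL(18,2)
)
STORED AS PARQUET;

-- Row-oriented SequenceFile for data where whole rows are
-- read and written together, e.g. a staging area.
CREATE TABLE staging_events (
  event_id STRING,
  payload  STRING
)
STORED AS SEQUENCEFILE;
```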
③ Compression Codec
Compression helps enhance processing performance by reducing not only the size of stored data, but also the amount of disk I/O. It’s better to use a compression codec which allows splittable processing so that nodes can be processed in parallel in a distributed environment.
If the compression codec is not splittable, a splittable container format such as Avro or sequence files is needed on top of it. As with file storage formats, different compression codecs can be used for each partition within a single dataset, according to the characteristics of the dataset and partition.
A codec with a higher compression ratio suits data that hasn’t been used for a long time, while a faster codec is better for frequently used data.
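One way to express this in Hive is to set the Parquet compression codec per table via table properties (a sketch; table names and the date cutoff are hypothetical):

```sql
-- Fast codec (Snappy) for hot, frequently queried data.
CREATE TABLE fact_recent (
  date_key     INT,
  sales_amount DECIMAL(18,2)
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");

-- Higher-ratio codec (gzip) for cold, rarely queried data.
CREATE TABLE fact_archive (
  date_key     INT,
  sales_amount DECIMAL(18,2)
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "GZIP");
```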
④ Dataset partitioning strategy
Datasets are partitioned according to data search patterns, so that results can be found without searching for the entire dataset. Partitioning helps enhance performance since it narrows down the objects for query scans and reduces I/O.
Once a dataset is divided into partitions according to some criterion such as dates which reflect user search patterns, only one daily partition has to be scanned when extracting data from a certain date, making it much more efficient compared to a full scan on the entire dataset.
The partitioning method is similar to that used when constructing an RDBMS-based data warehouse. One difference is that while an RDBMS typically defines a single way to partition a dataset, on Hadoop multiple partitioning strategies can be created to match users’ search patterns.
Hadoop copies the original dataset into multiple datasets, each partitioned under a different strategy. In other words, Hadoop spends considerably more disk space on duplicated datasets in order to enhance performance.
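A sketch of date-based partitioning in Hive, with the partition pruning it enables (names hypothetical):

```sql
-- Partition by date so a query for one day scans only that partition.
CREATE TABLE fact_sales_p (
  store_key    INT,
  sales_amount DECIMAL(18,2)
)
PARTITIONED BY (dt STRING);

-- Partition pruning: only the dt='2024-01-15' directory is read,
-- not the entire dataset.
SELECT SUM(sales_amount)
FROM fact_sales_p
WHERE dt = '2024-01-15';
```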
⑤ Dataset Bucketing Strategy
Bucketing divides a dataset into buckets, clustered groups of rows, based on the values of a chosen column. A bucket can contain many different column values, but any given value of the bucketing column lands in exactly one bucket.
Bucketing can be combined with partitioning, so a partitioned dataset can be divided into smaller units (much like the RDBMS hash partitioning strategy). The benefits resemble those of partitioning: only the relevant buckets, rather than the entire dataset, are joined in parallel by Hadoop MapReduce, which can improve performance dramatically.
Another benefit is that bucketing makes datasets small enough for a map-side join, which loads a bucket into memory for fast processing.
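In Hive, bucketing is declared with `CLUSTERED BY` at table creation time; a sketch (names and bucket count hypothetical):

```sql
-- Bucket by customer_id: each id hashes to exactly one of 32 buckets.
-- Tables bucketed the same way on the join key qualify for
-- bucketed map-side joins.
CREATE TABLE fact_orders_b (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(18,2)
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- Allow the optimizer to use bucketed map-side joins.
SET hive.optimize.bucketmapjoin = true;
```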
⑥ Indexing Strategy
On Apache Hive, indexes that play similar roles to RDBMS indexes can be generated. Hive supports a compact index, bitmap index, and so on. It’s important to first analyze user query patterns in order to generate indexes reflecting these patterns (like in the RDBMS indexing strategy).
When using an index, statistical data should be collected regularly to help the optimizer find the best route.
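In older Hive versions (indexes were removed in Hive 3.0), a compact index could be declared and maintained like this (index and table names hypothetical):

```sql
-- Compact index on a frequently filtered column (Hive < 3.0 only).
CREATE INDEX idx_sales_store
ON TABLE fact_sales (store_key)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate or refresh the index after loading new data.
ALTER INDEX idx_sales_store ON fact_sales REBUILD;

-- Collect column statistics so the optimizer can find the best plan.
ANALYZE TABLE fact_sales COMPUTE STATISTICS FOR COLUMNS;
```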
⑦ Other Physical Modeling Options
When constructing data warehouses, HBase can be considered for lookup datasets. An HBase table can be defined as a Hive external table and joined in Hive.
In addition, the modeler needs to understand the role of the resource manager (YARN or Mesos) used in the Hadoop ecosystem, and set suitable resource manager options according to user query patterns, since allocation and reclamation of resources within a cluster can affect query performance.
Frequently used datasets can be kept in memory (e.g. with Apache Ignite). This improves performance, since disk access becomes unnecessary.
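A sketch of mapping an HBase table into Hive via the HBase storage handler (the table, column family, and column names are hypothetical):

```sql
-- Expose an existing HBase table "customer" as a Hive external table,
-- mapping the row key and two columns of the "info" column family.
CREATE EXTERNAL TABLE customer_lookup (
  customer_id STRING,
  name        STRING,
  segment     STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:name,info:segment"
)
TBLPROPERTIES ("hbase.table.name" = "customer");
```

Once defined, `customer_lookup` can be joined against Hive fact tables like any other table.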
Until very recently, data warehouses were considered the conventional, and only, systems capable of managing large volumes of corporate data. With advancements in the Hadoop ecosystem and a growing emphasis on big data, many companies are utilizing Hadoop to complement conventional data warehouses or are switching their data warehouses to Hadoop-based ones.
A data warehouse is an application from the Hadoop ecosystem’s point of view. The fundamental value of a data warehouse is the same even if it is constructed as an application within a Hadoop ecosystem.
This means the basic role of a data warehouse, providing analyzed data flexibly and immediately to help companies make the best decisions, should be maintained no matter how the underlying technology changes. Fortunately, established data warehouse construction methodology can be applied without much alteration when building Hadoop-based ones.
As long as the characteristics of the Hadoop ecosystem are reflected in the physical modeling process, your data warehouse will be prepared for the era of big data, where structured and unstructured data will be integrated.
Written by Dongsik Jung, LG CNS
RDBMS: A relational database management system (RDBMS) is a database management system (DBMS) based on the relational model invented by E. F. Codd of IBM’s San Jose Research Laboratory. In 2016, many of the databases in widespread use are based on the relational model. (Source: https://en.wikipedia.org/wiki/Relational_database_management_system)