Have you heard of the big data technology called Hadoop? Hadoop is commonly employed to analyze data that used to be thrown away, such as logs and video from social media sites like Facebook and Twitter. However, many believed that Hadoop was not suited for sensitive corporate data.
For this reason, most important corporate data was handled by a company's database (hereafter DB). Such a misconception arose because Hadoop was mostly used to process unstructured data, such as documents and logs, which the DB could not handle.
These days, Hadoop is also being employed to analyze critical corporate data, and one of the areas where it has proven useful is the data warehouse (hereafter DW).
A DW is a type of DB that converts the data accumulated across a company's many systems into a standard form to support corporate decision making. In other words, it is a warehouse of the data a company needs to manage itself better.
As you see in [Image 1], a DW collects data from corporate source systems into one place (staging), then processes and integrates it so that it can be easily analyzed (data warehouse). It then integrates and summarizes the data according to the purpose of analysis (data mart).
The data created this way is used for user analysis. This process, in which data is collected, processed, and loaded for the next step, is called ETL (Extract, Transform, and Load).
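The extract, transform, and load steps can be sketched in a few lines of Python. This is only an illustration of the pattern, not any particular ETL tool; all names (`extract`, `transform`, `load`, the sample customer records) are hypothetical.

```python
# A minimal ETL sketch: extract raw records from a source system,
# transform them into a standard form, and load them into a
# "warehouse" store. All names and data here are illustrative.

def extract(source_rows):
    """Pull raw records from a source system (here, a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Normalize into a standard form: tidy names, parsed amounts."""
    return [
        {"customer": r["customer"].strip().title(),
         "amount": float(r["amount"])}
        for r in rows
    ]

def load(warehouse, rows):
    """Append the cleaned rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
raw = [{"customer": "  kim  ", "amount": "120.50"},
       {"customer": "lee", "amount": "75"}]
load(warehouse, transform(extract(raw)))
```

In a real DW, each stage would read from and write to staging tables rather than in-memory lists, but the collect-clean-load flow is the same.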
It used to be the database management system for analysis (hereafter DBMS) that handled this type of task. Once we entered the age of big data, however, the existing technology began to show its limits.
According to market research firm IDC, 40% of surveyed companies said their data was growing by 50% each year, whereas DW capacity was growing by only 18% per year.
In other words, although the amount of data companies must handle keeps growing, and demand for the DW with it, DW development is not keeping up with either.
Processing big data means handling hundreds of terabytes to petabytes, about 85% of which is unstructured. A DBMS is suited for structured data and not flexible enough to process unstructured data.
There are other issues as well, such as limits on massive data storage and the high cost of storing and processing such data in a DB. This is why the DBMS alone is considered insufficient in the age of big data.
To solve this issue, the DW industry launched high-performance DW appliances (hardware and software combined) that can process massive data faster. Vendors also advanced their unstructured data processing functions and combined their products with Hadoop to make massive data processing more cost-efficient.
Hadoop is an open-source distributed system consisting of a distributed file system (Hadoop Distributed File System, HDFS) and a parallel data processing framework (MapReduce). It is known to be cost-effective because it runs on x86 servers, which are cheaper than existing DW systems.
It also has a flexible architecture for high-performance parallel distributed processing that handles both structured and unstructured data at massive scale. This makes it an attractive alternative for companies looking for cost-efficient ways to store and process both kinds of data.
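The MapReduce model at the heart of Hadoop is easy to see in miniature: map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group. The sketch below simulates that flow in plain Python with the classic word-count example; it is a toy illustration of the programming model, not Hadoop's actual API.

```python
# A toy simulation of the MapReduce model: map -> shuffle -> reduce.
# In real Hadoop, map and reduce tasks run in parallel across a
# cluster over files stored in HDFS; here everything runs in-process.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, by summing)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores logs", "hadoop processes logs"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"hadoop": 2, "stores": 1, "logs": 2, "processes": 1}
```

Because each map and reduce task is independent, Hadoop can spread them across many cheap x86 nodes, which is where its cost advantage for massive data comes from.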
DW providers are making efforts to meet this demand by launching appliances dedicated to Hadoop or by grafting Hadoop technology onto their DW technology.
Can Hadoop replace the entire DW system, then? To answer that, we first need to look at the characteristics of the DBMS for DW and of Hadoop.
As you see in [Table 1], the DBMS for DW was developed specifically to search structured data quickly. Because of this, it focuses on advanced data search functions (SQL, query optimizer, indexes) as well as concurrent user support.
Hadoop, however, was created as a general platform for diverse workloads such as batch processing, data search, and streaming, rather than for one specific kind of work. Reflecting the characteristics of big data, its focus is on advanced unstructured data processing.
Hadoop is especially strong at batch processing, as it started with MapReduce. From these characteristics we can see that the two technologies complement each other rather than replace each other.
The industry is developing each technology to cover its weaknesses while keeping its strengths, so we may see the two competing for certain tasks in the near future.
Let's revisit the DW diagram I shared at the beginning of this post and see what's different in the image below.
[Image 2] shows a hybrid DW which incorporates both the DB and Hadoop. The parts marked with the yellow elephant are where Hadoop can be applied. Once the hybrid DW system is built, it can collect unstructured data from inside and outside the company and process it for DW tasks using Hadoop.
Using Hadoop, which is known for its batch-processing capability, to run the ETL batch jobs that convert and load massive data can create a cost-effective, high-performance ETL system.
Data retention periods also get longer with this method. A DB can keep only a minimum amount of data and holds the rest as backup files on tape, because extending DB capacity is either difficult or too expensive. This means files that exceed the retention period must be restored from tape and loaded back into the DB before they can be analyzed, which takes a long time, not to mention the inconvenience.
Building an online backup system with Hadoop for data past its DB retention period makes it possible to bring old data back for analysis cheaply, whenever needed. The data mart for user analysis, on the other hand, can be built on the DB rather than Hadoop, so that it can support fast data searches by many concurrent users.
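The tiering logic behind this hybrid layout is simple to express. The sketch below splits records by a retention cutoff: recent ("hot") rows stay in the DB for fast queries, while older ("cold") rows move to the cheap Hadoop-backed archive. The retention period, field names, and sample dates are all hypothetical.

```python
# A sketch of the tiered-storage idea: rows older than the DB
# retention period go to a Hadoop-backed archive, recent rows
# stay in the DB. All names and dates are illustrative.
from datetime import date, timedelta

RETENTION_DAYS = 365  # hypothetical DB retention period

def tier_rows(rows, today):
    """Split rows into hot (keep in DB) and cold (archive on Hadoop)."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    hot = [r for r in rows if r["created"] >= cutoff]   # stays in the DB
    cold = [r for r in rows if r["created"] < cutoff]   # goes to the archive
    return hot, cold

rows = [
    {"id": 1, "created": date(2013, 1, 10)},
    {"id": 2, "created": date(2011, 6, 1)},
]
hot, cold = tier_rows(rows, today=date(2013, 3, 1))
```

The key difference from the tape-backup approach is that the cold tier remains online: archived rows can be queried with a Hadoop batch job at any time instead of being restored to the DB first.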
Looking at the resource utilization of existing DW systems, data integration (ETL) takes about 80% of total resources while data search uses less than 20%. In other words, many companies have bought expensive DB systems and then spent 80% of those resources on tasks their DBs were not even designed for.
James Kobielus, a senior analyst at Forrester Research, predicted that Hadoop-based DW appliances will become the most popular platform in the future and that many IT vendors will embrace Hadoop.
LG CNS has developed HIA (Hybrid Information Architecture), a next-generation architecture that combines a DW platform, a big data platform, and data virtualization, and has applied it in actual projects.
Adopting HIA improves the accuracy of analysis by utilizing both structured and unstructured data, lowers the cost of the analysis infrastructure, and enables timely decision making thanks to shortened data processing times.
Many companies that have already adopted a hybrid DW report positive effects such as fast data processing and low cost. The hybrid DW, combining the merits of Hadoop and the DB, will become the new standard in the age of big data.
Written by Hye Hwa Moon, Software Architecture Advisory at LG CNS Big Data Business Group