As Internet services have advanced, storing and processing the data created by users has become a necessity, and so have big cache systems.
A B2C service system with 8 million users had long been concerned about database (hereafter DB) errors and expansion costs, but was finally able to lower its CPU usage rate to the target level by building a cache system based on the open source software Redis and HBase.
The Go match between AlphaGo and Se-Dol Lee shocked many of us while firing our imaginations about a new world. What made AlphaGo possible was the technology called deep learning, and the parallel distributed computing behind it, through which 1,202 CPUs and 176 GPUs could execute operations at the same time.
Even though the term ‘big data’ has been around for quite a while, it still feels somewhat abstract, as if it were only being used in a few advanced countries. When big data comes up in the media or in conversation, the discussion usually centers on its ‘analysis’ function and on concerns about ‘big brother’ systems. Big data is another technology backed by parallel distributed computing.
For businesses looking for more ways to apply big data, the first step toward familiarity is understanding that big data is, at its core, parallel distributed computing. It is time to set aside the word ‘analysis’ and think about how to apply distributed processing. Once you find ways to apply it, big data will accumulate naturally; only then is it time to talk about analysis.
Today, I’ll introduce an architecture that reduces DB load by using parallel distributed computing. It is unrelated to analysis, but projects based on this architecture have successfully gathered large amounts of data.
Although parallel distributed computing is quite common among online businesses, public institutions and private enterprises are still sticking to conventional architecture built on commercial software. The biggest problem they experience with conventional architecture is related to their DB.
Because an RDBMS (Relational Database Management System) is typically commercial software, the cost of maintenance and expansion is tremendous. Suppose a company goes through an expansion despite the cost: not only will it continue to have errors, it will also need another expansion soon after. Even bigger problems stem from the limitations of the RDBMS itself. Even if a company devises an advanced new service such as a personalized recommendation system, conventional architecture and RDB expansion cannot support it.
In 2013, LG CNS proposed a big cache system based on the open source software Redis and a big table system based on HBase to a client, an IPTV service company, which was concerned about DB errors and expansion costs amid an explosion in services and users. It was decided that the data caching system would be applied first to the cell phone service, where the largest number of transactions was being made. The company began the work in August 2013, and opened the service to its 8 million clients on June 25th, 2014.
① Limits of RDB Architecture
Users request data from the application server through terminals such as cell phones, set-top boxes, and PCs. The application server serves a small part of the requests from a file cache on the local file system, but mostly provides the service by reading data from ORACLE.
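The read path above is the classic cache-aside pattern. Below is a minimal Python sketch of that flow; a plain dict stands in for the Redis cache, and `query_oracle` is a hypothetical stand-in for the real DB call, so the names and data are illustrative rather than the production code.

```python
# Cache-aside read path: check the cache first, fall back to the DB on a miss.
# `redis_cache` is a plain dict standing in for Redis; `query_oracle` is a
# hypothetical stand-in for a real SELECT against ORACLE.

redis_cache = {}
db_reads = {"count": 0}  # track how often the DB is actually hit

def query_oracle(key):
    """Hypothetical stand-in for a query against ORACLE."""
    db_reads["count"] += 1
    return f"row-for-{key}"

def get(key):
    if key in redis_cache:        # cache hit: no DB load at all
        return redis_cache[key]
    value = query_oracle(key)     # cache miss: read from the DB...
    redis_cache[key] = value      # ...and populate the cache for next time
    return value

get("user:42")   # miss: hits ORACLE once
get("user:42")   # hit: served from the cache, no DB load
```

Every repeat read served from the cache is one less logical read against the DB server, which is exactly where the CPU load discussed below comes from.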
② Improvement Issues
The defining characteristic of this service is that a large number of users may access it at the same time. While there are multiple application servers to suit different services and users, the same is not true for the DB, where all access converges on a single spot. With over 8 million users, it is quite difficult to predict the transaction density, and its scale is beyond RDBMS capacity. Here we see the limits of the RDBMS. Different improvement issues appear in the read and write paths.
The CPU usage rate of the DB server is the most important parameter in judging the stability of the service and the need for expansion. The CPU usage rate tracks LRB closely: the lower line in Fig. 1 is LRB and the higher one is the CPU usage rate. Changes in PRB correlate with neither LRB nor CPU usage; compared to LRB, PRB is close to zero.
Fig. 3 shows that LRB is about 3,510 times PRB. This is quite different from the usual case, where increases in CPU usage accompany higher PRB. It means that virtually all of the data being searched resides in the ORACLE SGA, a part of system memory.
Because the execution frequency is so high, CPU load rises from logical reads alone. Here, ORACLE is effectively already a memory DB, which is why some dismissed the suggestion to adopt one. What is needed instead is a distributed architecture, and data cache and big table systems provide exactly that.
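The LRB-to-PRB relationship above can be checked with simple arithmetic. The sketch below uses illustrative counter values chosen to reproduce the article's ~3,510:1 ratio (the real numbers come from AWR reports, not from this code):

```python
# Buffer-cache effectiveness from AWR-style counters (illustrative values
# chosen to match the ~3,510:1 ratio cited in the article).
lrb = 3_510_000   # logical reads: blocks read from the SGA (memory)
prb = 1_000       # physical reads: blocks read from disk

ratio = lrb / prb                 # how many memory reads per disk read
hit_ratio = (lrb - prb) / lrb     # fraction of reads that never touched disk

print(ratio)                   # 3510.0
print(round(hit_ratio, 4))     # 0.9997
```

A hit ratio this close to 1 is why adding a memory DB in front of ORACLE would change little: the reads are already in memory, and only distributing them across more machines reduces the per-server CPU load.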
There is a big difference in CPU usage between the busy hour and other times, but the company has to plan expansion conservatively, based on the rate during the busiest hour, to ensure stable operation. This means resources are wasted during less busy hours, and these idle resources can become quite a burden on the business.
Unlike Fig. 2, which shows a predictable increase, Fig. 4 presents an unpredictable one. Because there are a large number of users and usage can swing significantly with a program’s popularity, events, holiday seasons, or the Olympics, expanding the system based only on busy hour statistics may not be enough.
A large number of writes take place on the watch and purchase history tables, on average about 2 million per table every day. Data more than a week old is automatically deleted for performance and cost reasons. Long-term data storage would make the service more useful, but this goal is hard to satisfy because of the DB’s limits.
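In a cache layer, this weekly purge maps naturally onto per-key expiry (what Redis exposes as a TTL via EXPIRE), instead of bulk DELETEs against the DB. The sketch below is an illustrative in-process model of that idea; the key names, the dict-based store, and the one-week window are assumptions for the example, not the production design:

```python
import time

# Write-side sketch: each history entry expires automatically after a
# retention window, the way a per-key TTL (e.g. Redis EXPIRE) would handle
# the weekly purge without a mass DELETE on the DB.

RETENTION_SECONDS = 7 * 24 * 3600   # one week, as in the article

store = {}  # key -> (value, expires_at); illustrative stand-in for Redis

def put(key, value, now=None):
    now = time.time() if now is None else now
    store[key] = (value, now + RETENTION_SECONDS)

def get(key, now=None):
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is None or entry[1] <= now:   # missing, or past its TTL
        store.pop(key, None)               # lazily evict the expired entry
        return None
    return entry[0]

put("watch:42", "episode-7", now=0)
get("watch:42", now=1)                      # within the week -> "episode-7"
get("watch:42", now=RETENTION_SECONDS + 1)  # after a week -> None
```

Pushing expiry into the cache keeps the deletion cost off the DB server, where, as described next, writes are already a source of contention.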
Although not as high as those from read concentration, surges in CPU usage rates are also caused by writes. In this case, the ORACLE wait events known as ‘enq: TX - index contention’ and ‘gc buffer busy’ occur excessively when write delay errors happen.
The ORACLE objects suffering write contention are the timestamp indexes on the history tables. These are right-hand indexes, where multiple processes insert into a single block. Since the service requests recent search history, creating an index on the timestamp is unavoidable. Fig. 5 shows the actual list of right-hand indexes.
Load concentrates on the rightmost leaf block, where the index entries for the most recent data live (Fig. 6), and a large amount of ‘gc buffer busy’ waits arise as processes on the other RAC nodes contend for it. To solve this problem, one must either give up the timestamp index or move to hash partitioning.
Hash partitioning solves the right-hand index problem, but then the old data has to be deleted row by row because TRUNCATE can no longer be used; this consumes a lot of resources and may destabilize the service. Conversely, if the TRUNCATE issue is handled with range partitioning, the right-hand index problem does not go away.
Adding a column that divides users by a criterion such as region, combining it with the timestamp column, and applying composite partitioning may alleviate the problem, but it cannot resolve the fundamental issue under severe write concentration.
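Why hash partitioning relieves the rightmost hot block can be shown with a small Python sketch: timestamp-ordered inserts all target one slot, while hashing a key scatters the same inserts across partitions. The partition count, key shape, and hash choice below are illustrative assumptions, not the client's actual DDL:

```python
import hashlib

NUM_PARTITIONS = 8   # illustrative; real systems tune this to the load

def partition_for(user_id):
    """Map a key to a partition by hashing it, as hash partitioning does."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# With a timestamp index, 10,000 consecutive inserts all land in the one
# "rightmost" leaf block. Hashed by user, the same inserts spread out:
counts = [0] * NUM_PARTITIONS
for user_id in range(10_000):
    counts[partition_for(user_id)] += 1

# Every partition receives a share of the inserts, so no single block
# becomes the point of contention between RAC nodes.
```

The trade-off described above still applies: once rows are hashed across partitions, time-ordered bulk removal via TRUNCATE of a single partition is lost, which is what pushes the design toward a distributed cache and big table system instead.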
Written by Euijin Lim, LG CNS
LRB (Logical Read Blocks): an internal ORACLE metric for the number of blocks read from the SGA.
PRB (Physical Read Blocks): an internal ORACLE metric for the number of blocks read from disk.
AWR (Automatic Workload Repository): statistics on the workload, collected automatically by ORACLE.