At Arches National Park in Utah, you can see huge arch-shaped rocks of many forms. These arches were carved by years of weathering, and even now new rock may be slowly washed out and chipped away into an arch.
Just as hard rock facing the wind eventually ends up with holes in it over long periods of time, corporate security may also develop holes under constant threats. Now that these security threats are becoming more intelligent and advanced, a single system or a short window of security data can no longer efficiently detect external intrusions and internal leaks.
Big data technology is being actively adopted to detect these complex threats that develop over extended periods of time. As a result, we can finally build an analytics environment with high-performance data processing at low investment cost.
The Korean market is also starting to build more integrated security log analytics systems based on big data. Today, let's look at how data is collected from various source systems and processed in an integrated security log system.
Security Log Collection and Analytics
The process of collecting and analyzing security logs generally proceeds through scenario definition, data collection, analysis, and monitoring. Let me walk you through each step in detail.
Scenario Definition

This step is about forming a hypothesis on who might attempt an intrusion or internal leak, and through what channel. Depending on the case, scenario definition can also be done together with or after data collection. By defining scenarios first, you avoid wasting resources on collecting unnecessary data, saving time and money while building the system.
On the other hand, if data collection is done before scenario definition, it provides an environment where you can freely explore diverse data sources. This makes it possible to discover scenarios and patterns that hadn't been anticipated beforehand.
Scenarios can be divided into single, complex, and pattern scenarios. A single scenario detects abnormal activities within a single data source, whereas a complex scenario detects them across two or more data sources. A pattern scenario detects abnormal activities that repeat over a certain period of time or that deviate from general patterns. These scenarios are usually defined by the security staff in charge, sometimes with the help of a security consultant.
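As a minimal illustration of a single scenario, the sketch below flags unusually large file transfers made outside business hours within one data source. The field names, threshold, and log records are made up for this example:

```python
from datetime import datetime

# Hypothetical single scenario: flag file transfers larger than 100 MB
# that occur outside business hours (9:00-18:00). Field names and the
# threshold are assumptions for illustration only.
THRESHOLD_BYTES = 100 * 1024 * 1024

def is_suspicious(event):
    ts = datetime.fromisoformat(event["timestamp"])
    after_hours = ts.hour < 9 or ts.hour >= 18
    return event["bytes_sent"] > THRESHOLD_BYTES and after_hours

logs = [
    {"user": "u1001", "timestamp": "2016-05-02T23:10:00", "bytes_sent": 250_000_000},
    {"user": "u1002", "timestamp": "2016-05-02T14:05:00", "bytes_sent": 250_000_000},
    {"user": "u1003", "timestamp": "2016-05-02T22:45:00", "bytes_sent": 1_024},
]

alerts = [e["user"] for e in logs if is_suspicious(e)]
print(alerts)  # ['u1001']
```

A complex scenario would combine such a rule with a second data source (for example, VPN access logs), and a pattern scenario would compare each user's activity against their own historical baseline.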
Data Collection

After defining a scenario, it's time to collect the data needed to implement it. The collected data generally includes emails sent externally as well as web access, file transfer, and VPN (Virtual Private Network) access logs. In other words, the security logs directly related to security are collected along with basic information, such as personnel and asset data, to identify the actor.
Analyzing the current state of the system, planning how to collect data, defining the data storage policy, and deciding how to link the system with identification data can also be included in the data collection stage.
Data collection can take various forms depending on infrastructural conditions such as the log generation cycle of the source system, the transmission protocol, the requirements of scenario analytics, and in-house network capacity. The collection cycle and method are determined according to these variables, and once they are set, the actual data collection begins. The following are the major tools commonly used for data collection.
The Hadoop ecosystem tools most commonly used for Hadoop-based big data collection are Apache Flume (hereafter Flume) and Apache Sqoop (hereafter Sqoop).
Flume supports both the agent method, which requires an agent to be installed on the source system for data collection, and the agentless method, which collects data by sending it directly to the storage server through Syslog or FTP (File Transfer Protocol) without an agent.
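The agentless idea can be sketched in a few lines: the source system pushes log lines over the network and a collector process receives them, with nothing installed on the source. The sketch below uses a plain UDP socket as a stand-in for a Syslog source; the message content and port handling are illustrative assumptions, not Flume's actual implementation:

```python
import socket

# Agentless collection sketch: a "source system" pushes a log line over
# UDP (standing in for Syslog) and a collector receives it, with no
# agent installed on the source.
def collect_one_message(sock):
    data, _addr = sock.recvfrom(4096)
    return data.decode("utf-8")

# Collector binds an ephemeral localhost port.
collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
collector.bind(("127.0.0.1", 0))
port = collector.getsockname()[1]

# The source system sends a syslog-style line to the collector.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"<134>May 2 10:00:00 host app: login failed for u1001",
              ("127.0.0.1", port))

message = collect_one_message(collector)
print(message)
collector.close()
sender.close()
```

In a real Flume deployment, a syslog source on the collector side would feed a channel and sink writing into HDFS, rather than a bare socket.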
Unlike Flume, which supports various protocols for data collection, Sqoop is optimized for collecting data with SQL (Structured Query Language) from databases such as relational databases (RDBs). It is considered an optimal tool for big data processing because it can collect data in parallel, scaling with the number of servers in the Hadoop cluster.
Companies usually choose either Flume or Sqoop based on their own data collection requirements. Message queues such as Apache Kafka, RabbitMQ, and ZeroMQ can also be added to the configuration for more stable collection and processing.
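Why a queue helps can be shown with a small producer/consumer sketch: the collector keeps accepting log events even while the storage writer works at its own pace, so bursts are buffered instead of dropped. Here Python's `queue.Queue` stands in for a broker like Kafka; the sentinel-based shutdown is just one illustrative convention:

```python
import queue
import threading

# Buffering sketch: the queue (standing in for Kafka/RabbitMQ) decouples
# fast log producers from a slower storage writer.
buffer = queue.Queue(maxsize=1000)
stored = []

def storage_writer():
    while True:
        msg = buffer.get()
        if msg is None:   # sentinel: shut down the writer
            break
        stored.append(msg)

writer = threading.Thread(target=storage_writer)
writer.start()

for i in range(100):      # a burst of incoming log events
    buffer.put(f"event-{i}")

buffer.put(None)          # signal end of stream
writer.join()
print(len(stored))  # 100
```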
At this data collection stage, you also need to decide how to store the data. There are three approaches: adding identification data, filtering only the data needed for analysis, and saving the entire data set as a whole.
Data collection is the step that takes the longest and requires the most effort in the entire log integration and analytics process. Ease of analysis and performance can be greatly affected by how well data collection and storage have been planned and built against the quality requirements. This step therefore demands close consideration and careful planning.
Analytics (Scenario Implementation)
After collecting the data, the scenario is implemented using it. The scenario can be implemented with the tools best suited to various requirements, such as the monitoring cycle, the range of data for concurrent processing, and keyword search.
Apache Storm (hereafter Storm), Apache Spark (hereafter Spark), and Apache HBase (hereafter HBase) can be used to implement real-time monitoring. It's important to choose the right tool for implementation, as Storm, Spark, and HBase have different features.
Real-time analytics usually takes a lot of memory and therefore requires relatively high-performance hardware. Careful consideration is needed as to whether real-time analytics is truly a mandatory requirement, to avoid the waste of provisioning unnecessarily high-performance hardware.
For batch processing, which doesn't require real-time analytics, Hadoop MapReduce (hereafter MapReduce), Apache Hive (hereafter Hive), and Apache Pig (hereafter Pig) can be used. Implementing with MapReduce means writing Java programs against Hadoop's distributed parallel processing framework.
With the MapReduce framework, the developer can focus on the business logic because the general functions needed for distributed parallel processing are already taken care of. Developing in Java with MapReduce, however, still takes quite a lot of study and work, so script-based distributed parallel processing tools like Hive and Pig were created to complement it.
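The MapReduce model itself can be shown in miniature: map emits key-value pairs, a shuffle groups values by key, and reduce aggregates each group. The toy below counts web-access log hits per status code; the log format is made up, and real Hadoop distributes these phases across a cluster rather than running them in one process:

```python
from collections import defaultdict

# Toy MapReduce: map emits (key, value) pairs, shuffle groups values
# by key, reduce aggregates each group.
def map_phase(line):
    # Emit (status_code, 1) for each access-log line (made-up format).
    parts = line.split()
    yield (parts[-1], 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))

logs = [
    "GET /index.html 200",
    "GET /admin 403",
    "GET /index.html 200",
]
pairs = [p for line in logs for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'200': 2, '403': 1}
```

A Hive query such as `SELECT status, COUNT(*) FROM access_log GROUP BY status` would compile down to essentially this map/shuffle/reduce plan.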
Given a script in the form of SQL (Structured Query Language), the language used for relational databases (RDBs), Hive interprets it and turns it into a distributed parallel processing program such as a MapReduce job. In other words, through Hive and Pig, the developer can use a familiar interface like SQL to implement distributed parallel processing logic.
Other tools, such as Impala and Apache Tajo, play a similar role to Hive and Pig. These tools enable more stable and faster relational analysis across various types of data, as well as long-term analysis of massive data.
These days, search tools like Apache Solr (hereafter Solr) and Elasticsearch are also used for faster searches over massive data. Solr and Elasticsearch provide an environment to search data quickly with a search engine similar to Google or Naver. You can search through months or even years of big data in less than a minute.
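The speed comes largely from the inverted index: instead of scanning every log line for each query, every term maps directly to the documents containing it. The sketch below builds a tiny inverted index over made-up log lines; it illustrates the principle only, not Solr's or Elasticsearch's actual data structures:

```python
from collections import defaultdict

# Inverted-index sketch: each term maps to the set of document IDs that
# contain it, so a query is a set intersection instead of a full scan.
docs = {
    0: "failed login from 10.0.0.5",
    1: "file upload by u1001",
    2: "failed vpn login by u1001",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return IDs of documents containing all the given terms."""
    sets = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(search("failed", "login"))  # [0, 2]
print(search("u1001"))            # [1, 2]
```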
In the analysis step, you not only implement the already defined scenarios but also refine them and find new patterns by exploring the data with search queries or scripts.
Monitoring

It is very important to visualize the results of the analysis and to generate notifications on events through regular monitoring so that threats can be recognized and responded to quickly. Having someone repeatedly search and run the analysis results that must be monitored every day, though, is quite inefficient.
This is why the system needs to support effective visualization and automation, so that staff can focus on their real work. By providing an environment where abnormal patterns and symptoms can be checked immediately, the person in charge of security can detect diverse risk patterns and new risk factors.
Major tools for visualization and automation include commercial software such as Tableau, Pentaho, and QlikView, as well as web dashboards built with D3.js and Highcharts. Software is chosen or web screens are developed according to the visualization requirements and budget.
For regular data monitoring, you can use the scheduling functions built into the commercial tools, or implement scheduling yourself with Apache Oozie or Spring Batch. For more advanced analysis, you can also consider statistical tools such as R, or pattern-analysis scenarios that detect abnormal symptoms.
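The regular-monitoring idea boils down to a scheduler that re-runs the detection job at a fixed interval and records whatever it finds. The sketch below uses Python's standard `sched` module as a stand-in for Oozie or Spring Batch, with a trivially short interval so it completes instantly; the job body and interval are illustrative:

```python
import sched
import time

# Scheduled-monitoring sketch: re-run the detection job at a fixed
# interval (the role Oozie/Spring Batch plays in production) and
# record its results.
alerts = []

def run_detection(run_id):
    # Stand-in for re-running the analysis scenarios over new data.
    alerts.append(f"run-{run_id}: no anomalies")

scheduler = sched.scheduler(time.monotonic, time.sleep)
for run_id in range(3):
    # In production the interval would be minutes or hours, not 10 ms.
    scheduler.enter(run_id * 0.01, 1, run_detection, argument=(run_id,))
scheduler.run()

print(len(alerts))  # 3
```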
LG CNS Supporting Integrated Security Log Analytics
So far, we have looked at what to consider when collecting and analyzing security logs, and the main tools for doing so.
Similar analyses of security logs existed under the names SEM (Security Event Management), SIM (Security Information Management), and SIEM (Security Information and Event Management) even before big data-based integrated security log analytics.
The high adoption cost and the performance limitations caused by constant increases in data, however, resulted in higher demand for big data-based integrated security log analysis with higher performance and efficiency.
Although big data based systems have lots of merits including better performance and lower costs, we can’t say that the big data based solutions are simply better than other conventional solutions in all aspects. Therefore, it’s important to carefully consider what the optimal solution is for the company environment taking into account general conditions such as the size and function of data for collection and analysis, performance requirements, maintenance system, and budget.
LG CNS supports diverse forms of integrated security log analytics, covering everything from security analysis based on conventional solutions to developing and distributing big data-based solutions for integrated security log analysis, including data collection, analysis, and monitoring. Smart LAP, shown in the LG CNS Big Data Solutions Suite figure above, is one such solution.
Nothing can be done until all the pieces are put together. Although in the past it was impossible to even gather everything together, now we can easily integrate and save data with the big data technology.
The gathered big data will be useless without the effort to find patterns, monitor them, and advance their analytics. LG CNS will keep working hard to create big value and to lead big data technology.
Written by Donghyun Kim, Software Architecture Specialist at LG CNS Big Data Technology Support Team