Inside IT
How to Organize and Search for Data

We seem to be inseparable from our smartphones these days. What’s your favorite smartphone function?

I use the Internet a lot on my phone: checking the top search results on portal sites, looking up information, and saving interesting pages. One day, I started to wonder how the data I created, searched for, and used was stored and displayed.

All the data produced by countless people is somewhere we can’t physically see, so how is this kind of big data being organized and then shown as search results?

Let’s see how big data is processed and shown when it’s searched for.

Hadoop: Distributed Big Data Processing Technology

Hadoop is most commonly used when processing and analyzing big data.

① What Is Hadoop? 

Hadoop is a system developed in 2004 by Doug Cutting, an American programmer and the developer of Apache Lucene, based on Google's MapReduce model, to improve big data processing. It is a distributed system that stores big data cheaply on low-priced servers and hard disks.

Before Hadoop, large data was processed on expensive equipment like supercomputers. Because of limited storage and resources, high costs, and a shortage of technology and engineers, only data considered important was analyzed. For this reason, there was a consistent demand for a cost-effective way to process existing data.

Hadoop, a technology which binds multiple regular computers as though they’re one to process big data, fulfilled this need.

Hadoop logo (Source: Hadoop website)

Hadoop consists of HDFS, which stores large files across thousands of distributed machines, and MapReduce, a computing platform for quick and easy analysis of the stored files using the CPUs and memory of the distributed servers. Basically, HDFS stores data and MapReduce processes it.
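
To make the "MapReduce processes it" part more concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop itself; the function names and the sample text are made up for illustration, and everything runs on one machine instead of on distributed servers.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(chunks):
    pairs = []
    for chunk in chunks:
        # On a real cluster, each chunk would be mapped on a different server.
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))

# Hypothetical input: two chunks of one larger file.
chunks = ["big data big servers", "data on many servers"]
counts = word_count(chunks)
print(counts)
```

The point of the split is that the map step only ever looks at one chunk, so many chunks can be handled at the same time on different machines.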

Hadoop ended the era in which supercomputers were necessary to store and process large data. It made it possible to combine regular PCs, ten of them for example, so that they act like one supercomputer with a large storage space, which reduced costs tremendously. Doug Cutting made the source code open to the public so other developers could use and enhance this technology as well.

② Characteristics of Hadoop 

The New York Stock Exchange creates 1TB of exchange data daily, and Facebook has 30PB of image files. Companies have realized that storing their big data on Hadoop is better than using expensive external storage equipment, because it's cheaper and easier to set up and operate.

Experts also suggest Hadoop as a more effective way to analyze big data. The Hadoop framework helps analyze large amounts of data quickly and inexpensively.

By using Hadoop, companies can lower the entry cost of big data analysis while taking care of compatibility issues with their existing data systems. Whereas supercomputers once had to run for days, Hadoop can also deliver results far faster by running the analysis in parallel on x86 servers.

For example, Facebook saves 30PB (three times the data held at the U.S. Library of Congress) through Hadoop. Large photo data is constantly divided into smaller pieces and saved across 2,000 servers, so that users can easily upload and download pictures on Facebook or instantly view other people’s photos with a single click.

What made Hadoop really popular is its convenience. Companies could connect PCs in parallel for distributed big data processing with only a little bit of training and could focus their attention on developing new consumer-oriented services.

Lastly, another characteristic of Hadoop is that it can be operated on multiple machines that don’t share memory or disks. Because data is divided into little pieces and these pieces are distributed, processed, and automatically rejoined together when requested, big data can be easily processed anywhere and at any time.
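
The divide, distribute, and rejoin idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the node names, the tiny block size, and the round-robin placement are all made up, and real HDFS also keeps several copies of each block, which is left out here.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, node_names):
    """Place block i on node i % n (simple round-robin, an assumption of this sketch)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        node = node_names[index % len(node_names)]
        placement[node].append((index, block))
    return placement

def rejoin(placement):
    """Collect the pieces from every node and put them back in their original order."""
    indexed = [item for stored in placement.values() for item in stored]
    indexed.sort(key=lambda item: item[0])
    return b"".join(block for _, block in indexed)

# Hypothetical data and nodes; no machine shares memory or disk with another.
data = b"photo-bytes-0123456789"
blocks = split_into_blocks(data, block_size=8)
placement = distribute(blocks, ["node-a", "node-b"])
restored = rejoin(placement)
print(len(blocks), restored == data)
```

Each piece carries its position number, which is what lets the pieces be rejoined correctly no matter which machine held them.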

Lucene: High-Performance Open Source Information Retrieval

Now let’s take a look at Lucene, an information retrieval (IR) library from the same developer as Hadoop.

Lucene logo (Source: Lucene website)

① What Is Lucene? 

Lucene was also developed by Doug Cutting, in 1999, as an extensible, high-performance, open source IR library written in Java. Its best-known functions are indexing, searching, and analysis of full text in multiple languages.

Because Lucene is a software library rather than a separate program, developers first need to build their own search service or application on top of the library.

② Characteristics of Lucene 

Lucene was originally written in Java and has since been ported to languages such as Perl, Python, C++, and .NET. It is designed so that index data created by one language's implementation can be used by the others.

The IT industry also commonly uses Lucene because indexing and search functions can be added to software programs without much specialized knowledge.

In addition, full-text analysis and search for multiple languages is another major feature of Lucene. When developers use the index function, they can index and search across whole documents instead of looking for a simple character string. With Lucene, the entire content of each document is converted into character strings for indexing and searching, rather than being kept as an opaque binary file.

To index diverse document types such as XML, PDF, HTML, and MS Word, the content is first parsed and converted into plain text that Lucene’s analyzer can understand.
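
As a rough idea of what "parsed and turned into text" means, here is a small sketch that pulls the visible text out of an HTML page using Python's standard `html.parser` module. It is not Lucene code; the class and function names are made up, and formats like PDF or MS Word would need their own parsers.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text; skip runs that are only whitespace.
        # Note: script and style contents would also arrive here in this simple sketch.
        text = data.strip()
        if text:
            self.parts.append(text)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page content.
page = "<html><body><h1>Hadoop</h1><p>Stores <b>big</b> data.</p></body></html>"
plain_text = html_to_text(page)
print(plain_text)
```

Only after this step is there plain text that an analyzer can split into words.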

Once the full text (content) has been extracted and split into words by the analyzer, indexing and retrieving search results become much faster. This makes Lucene an efficient way of searching content.
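
The reason word-level indexing is fast can be shown with a tiny inverted index, a structure that maps each word to the documents containing it. The sketch below is plain Python, not the Lucene API; the `analyze` function is a very simplified stand-in for an analyzer, and the sample documents are made up.

```python
import re
from collections import defaultdict

def analyze(text):
    """Simplified analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical documents.
documents = {
    1: "Hadoop stores big data",
    2: "Lucene indexes and searches text",
    3: "Nutch searches the web with Lucene",
}
index = build_index(documents)
matches = search(index, "Lucene searches")
print(sorted(matches))
```

A search only looks up the query words in the index instead of scanning every document, which is why results come back quickly even when there are many documents.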

③ Lucene Development Case 

The cases which applied Lucene for indexing and searching are as follows.

(Image: table of Lucene application cases)

Nutch: Open Source Search Engine

Nutch is one of the Lucene-based open source search engines.

Nutch logo (Source: Nutch website)

① What Is Nutch?

Nutch is an open source web search engine project based on Lucene and implemented in Java. It aims to provide a search service without any commercial aspect, unlike other search engines that are filled with advertisements.

Being an open source search engine is itself a departure from the existing approaches of other companies. The Nutch source code is completely open to the public, and anyone can freely reuse or modify it to better suit their own applications.

② Characteristics of Nutch

Lucene itself consists only of an indexer and a searcher. Nutch extends it with all the other elements required to make web search possible.

Because Nutch uses Lucene and is well modularized, it can index and search hundreds of millions of web pages and can be extended with various plug-ins. It can also save data in a form that doesn’t rely on any specific programming language, even though it’s written in Java.

Nutch becomes even more efficient when run on Hadoop, which lets it be deployed and operated across multiple servers.

③ The Structure of Nutch

The overall structure of Nutch is similar to that of other web-based search systems.

The search process works as follows.

  1. The web server receives a search request from a user.
  2. The request handler processes the search words and sends them to multiple index servers.
  3. Search results from the index servers are ranked according to their search scores.
  4. Index servers that don’t return results within one or two seconds are excluded (so results are guaranteed to come out within two seconds).
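
The four steps above can be sketched in Python with simulated index servers. This is not Nutch code: the server functions, URLs, scores, and the short timeout are made up for illustration, and the delays are scaled down so the example finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def make_index_server(hits, delay):
    """Return a fake index server: a function that answers after `delay` seconds."""
    def server(query):
        time.sleep(delay)   # simulated network and lookup time
        return hits         # list of (url, score); the query is ignored in this sketch
    return server

def handle_request(query, servers, timeout):
    """Send the query to every index server, drop the late ones, rank the rest."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(server, query) for server in servers]
        # Keep only the servers that answered within the timeout.
        done, _late = wait(futures, timeout=timeout)
        merged = [hit for future in done for hit in future.result()]
        # Leaving the `with` block still waits for slow threads to finish;
        # their results are simply not included.
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Hypothetical index servers; the third one is too slow and gets excluded.
servers = [
    make_index_server([("a.example/1", 0.9), ("a.example/2", 0.4)], delay=0.0),
    make_index_server([("b.example/1", 0.7)], delay=0.0),
    make_index_server([("c.example/1", 1.0)], delay=0.75),
]
results = handle_request("hadoop", servers, timeout=0.25)
for url, score in results:
    print(score, url)
```

Excluding slow servers trades a little completeness for a predictable response time, which is the guarantee described in step 4.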

Solr: Targeting Enterprises

The last one to introduce is Solr, a Lucene-based search server that targets enterprises.

Solr logo (Source: Solr website)

① What Is Solr?

Solr is a Lucene-based search server for companies, and its development has focused on adding more diverse and specialized functions. It supports full-text search, faceted search, real-time indexing, clustering, database integration, processing and searching of diverse document types, and distributed indexing.

Solr provides all of its functions over HTTP. These include indexing, searching, deleting, and updating documents, as well as schema changes and replication. Indexing and search requests are made with HTTP POST and GET.

Because everything can be done over HTTP, search applications can also be developed easily with tools like curl.

② Characteristics of Solr

Solr runs as a separate application server and provides a REST API. Clients can submit documents for indexing, then send searches and receive the results as XML, JSON, CSV, or binary.
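
Here is a small sketch of what talking to Solr over HTTP can look like from Python. It only builds the requests and does not send them, so no Solr server is needed to run it. The host, the core name `articles`, and the sample document are assumptions for illustration; the `select` and `update` paths and the `q`, `wt`, `rows`, and `commit` parameters follow Solr's usual conventions, but check them against the version you use.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_BASE = "http://localhost:8983/solr"   # assumption: a local Solr on its default port
CORE = "articles"                          # hypothetical core name

def select_url(query, response_format="json", rows=10):
    """Build the GET URL for a search; `wt` picks the response format (json, xml, csv)."""
    params = urlencode({"q": query, "wt": response_format, "rows": rows})
    return f"{SOLR_BASE}/{CORE}/select?{params}"

def update_request(docs):
    """Build (but do not send) a POST request that submits documents for indexing."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/{CORE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical document and query.
docs = [{"id": "1", "title": "How Hadoop stores big data"}]
request = update_request(docs)
url = select_url("title:hadoop")
print(request.get_method(), request.full_url)
print(url)
```

With a running Solr, the same URL could be opened with `urllib.request.urlopen` or passed to curl, which is what makes building search applications on it so simple.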

I started with simple questions like “where do the pictures and document files I put online go?” and “how can they be shown through search engines?”, then ended up researching the mechanism of these technologies.

I also learned that many convenient functions I’m used to became available thanks to countless developers who never gave up discussing and making updates to create better kinds of technology. As the amount of data keeps increasing, new and diversified functions are also being introduced.

Now that I’ve learned the mechanism behind data storage and processing, I’ll be able to think about how they were developed when using these functions. I have also decided to have a more active attitude by thinking about the mechanisms involved when faced with new technology, rather than using it just because it’s there.

If you have ever wondered how some functions or technologies became possible, find out how they work by searching online. You’ll be able to enjoy them more once you know how they work!

Written by Seowon Cho, LG CNS Student Reporter
