What image comes to mind when thinking of “data visualization”? The images created by blogs, books, or lectures can vary, but there’s one thing shared by all these different images.
That one thing is an interesting story. A visualization might have convinced you to change your mind on something, given you a new perspective, or made you question stereotypes prevalent in society. In the end, what matters is the clear and intuitive story it tells, regardless of its size or form, whether it is an image from a news article or a presentation slide.
Since an uncountable amount of information is flooding the world in the era of big data, people are looking for ways to organize necessary information and make it easier for others to understand the analysis results.
Conventional data visualization relied on graphs that summarized the results of data organized as a table in a spreadsheet. Now, new visualization techniques and methods are being studied that find hidden meanings in big data and tell interesting stories about the analysis process.
Big data visualization is the process of expressing analysis results visually so that they are delivered more effectively. The concept of information visualization helps explain what big data visualization means.
Information visualization means expressing large quantitative and non-quantitative data visually through color, statistical images (diagrams and graphs) and other images. These methods can be subcategorized according to what they focus on—time, proportion, relationship, difference, and spatial relationship.
These methods draw viewers’ attention, reduce the time needed to understand data, and lead to quicker situational judgment. Visualization also spreads information more quickly, helps it stay in people’s memories longer, and above all results in more effective communication.
① Visualizing Over Time
We look at time every day on computers, clocks, and phones. Data related to time seems quite natural to us, since we sort of know when to wake up or go to bed even without a clock. The biggest characteristic of time series data is that it shows trends and tendencies.
People change their opinions, demographics change, and businesses grow over time. These changes can be measured over time to become what is called time series data. To find the pattern of the change, it’s important to focus more on the big picture rather than individual data. More meaningful stories emerge when looking at the context of a section.
One example is a line chart built from people’s Facebook status updates to find peak break-up times. People break up the most in spring and winter, and relatively more often between April Fools’ Day and summer vacation.
What if we analyze this data according to days of the week? We will find out that people spend their weekends in agony, then change their status lines to “broken-up” or “single” on the coming Monday.
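The day-of-week regrouping described above can be sketched in a few lines of Python. This is only an illustration: the dates in `status_changes` are made up, and the helper name `count_by_weekday` is a hypothetical one, not something taken from the original analysis.

```python
from collections import Counter
from datetime import date

# Hypothetical dates on which a status changed to "single" (illustrative only).
status_changes = [
    date(2024, 3, 4),   # Monday
    date(2024, 3, 4),   # Monday
    date(2024, 3, 6),   # Wednesday
    date(2024, 3, 9),   # Saturday
    date(2024, 3, 11),  # Monday
    date(2024, 3, 15),  # Friday
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def count_by_weekday(dates):
    """Return a dict mapping weekday name to event count, in Mon..Sun order."""
    counts = Counter(d.weekday() for d in dates)  # date.weekday(): Monday == 0
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


weekday_counts = count_by_weekday(status_changes)

# A crude text chart: one '#' per event.
for name, n in weekday_counts.items():
    print(f"{name} {'#' * n} {n}")
```

The same events that form a yearly trend line become a weekly pattern once they are grouped by `date.weekday()`, which is all that "analyzing by days of the week" requires.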
Visualizing data over time should capture both the big picture and the details. The analyst should use insight and judgment to ask whether an event produced abnormal figures, and whether there are notable sections, dramatic increases or decreases, or regular patterns.
② Visualizing Proportions
Proportional data is similar to time series data, except that proportional data focuses on categorization, sub-categorization, and the number, meaning possible options or results. Proportional data generally shows maximum, minimum, and overall distribution.
When visualizing the caloric proportion of each meal, the food that supplied the most calories is the maximum and the one that supplied the fewest is the minimum. The overall distribution shows whether calories are coming evenly from all macronutrients, such as fat, protein, and carbohydrate, or whether the figures are imbalanced.
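To make the maximum, minimum, and overall distribution concrete, here is a minimal sketch that turns grams of each macronutrient into calorie shares. The gram amounts are hypothetical, and the energy values per gram are the commonly used approximate factors (9 kcal for fat, 4 kcal for protein and carbohydrate), so treat the result as an illustration rather than a nutritional calculation.

```python
# Approximate energy per gram (commonly used general factors).
KCAL_PER_GRAM = {"fat": 9, "protein": 4, "carbohydrate": 4}

# Hypothetical grams eaten in one day (illustrative only).
grams = {"fat": 70, "protein": 90, "carbohydrate": 250}


def calorie_shares(grams_by_nutrient):
    """Return each nutrient's share of total calories as a fraction of 1."""
    kcal = {k: g * KCAL_PER_GRAM[k] for k, g in grams_by_nutrient.items()}
    total = sum(kcal.values())
    return {k: v / total for k, v in kcal.items()}


shares = calorie_shares(grams)
largest = max(shares, key=shares.get)   # category with the biggest share
smallest = min(shares, key=shares.get)  # category with the smallest share

for nutrient, share in shares.items():
    print(f"{nutrient:<13}{share:6.1%}")
print("maximum:", largest, "| minimum:", smallest)
```

The shares always sum to one, which is what makes this kind of data a natural fit for pie charts and tree maps.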
The image above probably won’t mean anything if you don’t know what it is. It’s a tree map chart showing on what scale and for what purpose billions of dollars were spent throughout the globe. The sizes represent the scale of spending and the colors represent motivation: purple for conflicts, red for donations, and green for income.
The Organization of Petroleum Exporting Countries (OPEC) is earning 790 billion USD, but only spends 3 billion USD to fund climate change related activities. The war in Iraq has cost up to 3 trillion USD, and the sum of the world’s debt caused by the global financial crisis is 11.9 trillion USD.
Within the same time period, we can also intuitively see that the scale of spending through income and through conflict is about the same. As you can see, tree map charts help us understand the proportion of each category through the sizes of squares, and their sub-categories through colors.
③ Visualizing Relationships
Statistics help us find relationships among data. The key question for statistics is whether there’s commonality among different groups or sub-groups within a group. The most well-known relationship in the realm of statistics is correlation.
For example, the relationship between one’s height and weight can be correlated, since taller people tend to be heavier than short people.
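The height and weight example can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below is a plain-Python version under stated assumptions: the function name `pearson_r` is hypothetical and the seven height and weight pairs are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements (illustrative only).
heights_cm = [155, 160, 165, 170, 175, 180, 185]
weights_kg = [52, 58, 57, 66, 70, 69, 80]

r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.2f}")
```

A value close to 1 means taller people in this sample tend to be heavier, but it measures only how well a straight line fits the data.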
However, just as not every tall person is heavy, some data doesn’t follow a linear relationship; it has a more complicated relationship because of multiple options and non-linear patterns. Let’s have a look at an example to see what this means.
The best example of relationship visualization is a bubble chart. The image above analyzed crime rates in each state, showing four different variables at once. The X-axis shows the number of murderers per 100,000 people and the Y-axis shows the number of burglars, while the size of each circle represents the state’s population and the color represents its crime frequency: red means high frequency, and blue means low frequency.
We usually think states with large populations would have more murderers or burglars. This is not always true: although states with large populations like Texas, California, and Florida sit in the top right corner, meaning there are many burglars and murderers, states with small populations like Louisiana and Maryland are even further to the right.
The relationship between population and the number of criminals may not be the same everywhere. Yet the proposition that regions with many burglars also have many murderers is hard to dispute, given the linear trend from bottom left to top right.
We can analyze various relationships that crime rates have with other data through this bubble chart which utilized over 100,000 crime records. It’s a good example to learn that information visualization is a type of compressed knowledge.
Large amounts of knowledge and information extracted from countless sources of data are compressed into a small space. Decisions can be made more quickly, since diverse visualization analysis tools have made interactive analysis and animated visualization possible.
④ Spotting Differences
It’s easy to compare data with a single variable. Some houses are larger than others, and some cats are heavier than others. It becomes more complicated when there are two or more variables to consider, but not impossible.
This house is larger than the other but has a smaller bathroom, and this cat is heavier than the other but has shorter hair. Imagine you have to categorize a hundred houses or a hundred cats. The number of variables will become even larger, and you will have to compare the number of bedrooms, the size of yards, and maintenance costs.
This means the list you have to check becomes the number of objects times the number of variables. This can be a real headache.
The example above is a heat map chart which analyzed statistical data on NBA players in 2008. The X-axis stands for performance and the Y-axis for player names, while the colors represent their scores. Because heat maps involve multiple variables, the starting point needs to be set prior to analysis.
The chart above chose points, the most crucial part of basketball, shown in the third column, as its analysis starting point, then listed all players in descending order. Dwyane Wade was at the top of the list since he scored the most, and Nate Robinson was at the bottom for scoring the least.
The starting point can then be used to see whether other variables are related to players’ scores. You will also notice particular cases such as Dwight Howard with the most rebounds and Chris Paul with the most assists.
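The preparation behind such a heat map can be sketched as two steps: sort the rows by the chosen starting column, then rescale each column so its colors are comparable. The players and numbers below are made up, and the helper names `sort_rows` and `scale_columns` are hypothetical; this is one reasonable way to do it, not necessarily how the chart above was built.

```python
# Hypothetical per-game stats (made-up players and numbers).
players = [
    {"name": "Player A", "points": 18.2, "rebounds": 13.8, "assists": 1.4},
    {"name": "Player B", "points": 30.1, "rebounds": 5.0, "assists": 7.5},
    {"name": "Player C", "points": 22.8, "rebounds": 5.5, "assists": 11.0},
    {"name": "Player D", "points": 17.2, "rebounds": 3.9, "assists": 4.1},
]
STATS = ["points", "rebounds", "assists"]


def sort_rows(rows, key):
    """Order rows by the chosen starting column, highest value first."""
    return sorted(rows, key=lambda r: r[key], reverse=True)


def scale_columns(rows, stats):
    """Min-max scale each column to 0..1 so every stat maps to the same color range."""
    bounds = {s: (min(r[s] for r in rows), max(r[s] for r in rows)) for s in stats}
    scaled = []
    for r in rows:
        cells = {}
        for s in stats:
            lo, hi = bounds[s]
            cells[s] = 0.0 if hi == lo else (r[s] - lo) / (hi - lo)
        scaled.append({"name": r["name"], **cells})
    return scaled


ranked = sort_rows(players, "points")
heat = scale_columns(ranked, STATS)

# The leader of each column corresponds to the darkest cell in that column.
leaders = {s: max(players, key=lambda r: r[s])["name"] for s in STATS}

for row in heat:
    print(row["name"], " ".join(f"{row[s]:.2f}" for s in STATS))
print(leaders)
```

Scaling each column separately matters because points, rebounds, and assists live on different numeric ranges; without it, one color scale would wash out the smaller stats.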
The biggest benefit of a heat map is that it shows all of the data at once. No matter how many variables the data has, it can still be categorized and its exceptions can be found.
By comparing other aspects of two players who have similar scores, we can figure out what the differences and similarities between the two of them are. By comparing differences and similarities, we’ll be able to understand players just like professional sports commentators do.
⑤ Visualizing Spatial Relationships
Maps are a type of information visualization which utilizes intuitiveness at its core. Since navigation services have advanced, we can simply use map applications to avoid getting lost in unfamiliar places. Reading maps is quite similar to reading statistical graphs.
Comparing one spot on a map with another is similar to reading a scatter chart for relationship visualization. The difference is that maps use latitude and longitude instead of an X and Y axis. The connection made between points A and B can be expressed as distance or travel time.
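Expressing the connection between points A and B as a distance can be sketched with the haversine formula, which gives the great-circle distance between two latitude and longitude pairs. This sketch assumes a spherical Earth with a mean radius of 6,371 km, so results are approximate; the function name `haversine_km` is hypothetical and the city coordinates are rounded.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (rounded) for two Korean cities.
seoul = (37.57, 126.98)
busan = (35.18, 129.08)

distance = haversine_km(*seoul, *busan)
print(f"Seoul to Busan: about {distance:.0f} km in a straight line")
```

Travel time would need road or rail data on top of this, since the straight-line distance ignores how people actually move between the two points.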
As you must’ve already noticed, data becomes much more interesting when the aspect of time is added. Combining multiple maps, each reflecting a certain moment, can show what kind of changes a region has gone through over time. This kind of animation or time-switching technique makes it easy to see changes in the real estate market or population over time.
The bubble chart above used a visualization analysis tool to show the population of each administrative district of Korea. The sizes of circles stand for sales scales of each region, and the color is for population.
Unlike the bubble chart I introduced earlier, the locations of these dots actually hold information on different regions. Regions with large populations (large circles) tend to have larger sales scales (red circles). This leads to an intuitive conclusion that sales scales are bigger in major cities with larger populations.
A single map chart with time or gender options can also show the trends over time or differences among different groups of people.
Dots on maps only represent one location. These dots are harder to read when coordinates are dense, because dots will be concentrated in small areas. Since each city, state, country, and continent has borders, territories can be colored instead of using colored dots.
This type of graph, which colors different areas, is called a choropleth chart. However, a choropleth chart can visualize only a single variable, because color is its only means of showing differences.
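Because a choropleth has only color to work with, preparing one mostly comes down to assigning each region's single value to a color class. A minimal sketch, assuming made-up regions, hand-picked class breaks, and an arbitrary light-to-dark palette (all hypothetical):

```python
import bisect

# Hypothetical population per region, in thousands (illustrative only).
population = {"Region A": 120, "Region B": 870, "Region C": 2400, "Region D": 9600}

# Assumed class breaks and a light-to-dark palette; one more color than breaks.
BREAKS = [500, 1000, 5000]
COLORS = ["#fee5d9", "#fcae91", "#fb6a4a", "#cb181d"]


def color_for(value, breaks=BREAKS, colors=COLORS):
    """Pick the color class for a value; a value equal to a break goes to the higher class."""
    return colors[bisect.bisect_right(breaks, value)]


region_colors = {name: color_for(v) for name, v in population.items()}

for name, color in region_colors.items():
    print(f"{name}: {population[name]:>5} -> {color}")
```

Showing a second variable would require another visual channel, such as a symbol drawn on top of each territory, which is why dots and colored areas are sometimes combined.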
In the next posting, I will introduce the range and procedure of big data visualization, as well as its tools more specifically.
Written by Jinwon Park, LG CNS