Looking at the Numbers
The era of big data we find ourselves in is, in the simplest sense, built on collecting data and processing it. The social barriers to collecting data have come down and the networks to transmit it have been built out. Storing all of that data, whether in the cloud or on an owned server, is dramatically cheaper than it has ever been. Most recently, compute power has become more accessible, punctuated by the launch of Frontier earlier this year, the world's first exascale supercomputer.
Despite all of this progress, the way we look at the data hasn't really changed, even as the need for it to grows. As our datasets get larger, more accurate, and higher-dimensional, the way we visualize them is starting to crack under the pressure. Our fundamental ideas about and understanding of data have gone so far beyond what we once believed possible that modern, future-facing data science feels like an entirely new discipline. Is it time to take our data visualizations to the next level too?
How is Data Visualized?
If you've found this page, JDI sincerely hopes that you have seen a graph before. Instead of rattling off the many different types of graphs, let's dive into what governs good data visualizations. Data has a historical reputation for integrity, but visualizations were long seen as disingenuous. The belief was that once an artist "interpreted" the numbers, they lost accuracy and were subject to falsification. Datasets weren't turned into visualizations but into illustrations. Keeping the numbers locked away in datasets was also a way of gatekeeping less educated readers away from mathematical or analytical study.
This was the trend until the early 1960s, when John Tukey revolutionized both how we approach data analysis and how we visualize it. His governing principle turned the old view on its head: data is a way of reinforcing an argument or aiding a story. The best visuals are like a movie or a play: they tell a story without making you think about the constituent parts. You see what the director wants you to see.
Following in Tukey's footsteps were Leland Wilkinson and Edward Tufte, whose principles still stand to this day. Wilkinson established the Grammar of Graphics, still present today in the R package ggplot2. He broke graphs and graphics down into data, aesthetics, scales, statistics, geometries, facets, coordinates, and positioning. This was a much more expansive approach to data modeling than had previously been used, and it put heavy emphasis on readability and viewer understanding.
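To make that grammar concrete, here is a minimal sketch of those layers in code. Since this post's examples lean on Python, it uses plotnine, a Python port of ggplot2, along with its bundled mtcars sample data; the specific columns and aesthetic choices are purely illustrative.

```python
# A minimal sketch of the layered grammar, using plotnine (a Python port of
# ggplot2) and its bundled mtcars sample data. Columns and choices here are
# purely illustrative.
from plotnine import ggplot, aes, geom_point, stat_smooth, facet_wrap, labs
from plotnine.data import mtcars

plot = (
    ggplot(mtcars, aes(x="wt", y="mpg", color="factor(cyl)"))  # data + aesthetics
    + geom_point()                     # geometry: one point per car
    + stat_smooth(method="lm")         # statistic: a fitted trend per panel
    + facet_wrap("gear")               # facets: one panel per gear count
    + labs(x="Weight (1000 lbs)", y="Miles per gallon", color="Cylinders")
)
plot.save("mtcars_grammar.png", width=8, height=3, dpi=150)
```

Each added term is one layer of the grammar, which is exactly the subdivision Wilkinson described: the data and its aesthetic mappings come first, and geometries, statistics, facets, and labels are stacked on top.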
Tufte, who wrote at great length about data visualization, is best known for The Visual Display of Quantitative Information. He advocated for graphs that credit the reader with more intelligence than earlier conventions did. One of his signature techniques is "small multiples," a method of breaking data out into panels for easy comparison. Readers can scan data across different categories, on the same axes and scales, for fast analysis. There are two examples below: one from the Pew Research Center and the other from the Seaborn Python package. The Pew Research example shows different voting demographics and where each stands on a polling topic. The second example shows passenger counts by month for each year, broken out into a grid. Each panel highlights one line, letting readers compare month-to-month and year-to-year trends quickly and simultaneously.
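The grid of line charts can be rebuilt in a few lines with Seaborn's small-multiples machinery. This is a simplified sketch, assuming the familiar bundled flights dataset of monthly airline passenger counts, and it omits the grey background traces of the published gallery version.

```python
# A simplified sketch of the small-multiples grid described above, assuming
# Seaborn's bundled "flights" dataset of monthly airline passenger counts.
import matplotlib.pyplot as plt
import seaborn as sns

flights = sns.load_dataset("flights")  # columns: year, month, passengers

# One small panel per year, all sharing the same axes and scales.
grid = sns.relplot(
    data=flights, x="month", y="passengers",
    col="year", col_wrap=4, kind="line",
    height=2, aspect=1.2,
)
grid.set_titles("{col_name}")
grid.set_axis_labels("Month", "Passengers")
plt.savefig("flights_small_multiples.png", dpi=150)
```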
Another of Tufte's main ideas was the data-ink ratio: the proportion of ink used to show the data itself relative to the total ink used to print the graphic. Tufte urged graph designers to push that value closer and closer to 1.0, within reason. To trim redundant data-ink, he suggested removing duplicate encodings of the same categories or changing the scale. To trim non-data-ink, he suggested making labels more concise and removing added backgrounds and anything redundant, distracting, or extraneous. While narration and storytelling elements can help provide greater context, Tufte's push was toward the cleaner, simpler graphs we use today.
Of the examples above, the Pew Research chart has a fairly high data-ink ratio: the bars support quick visual comparison, while the exact numerical values allow more in-depth analysis. The second example, on the other hand, could be improved by showing only one line per panel, which would remove the need for the different colors as well.
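For readers who build their charts in matplotlib, here is a hedged sketch of what chasing a higher data-ink ratio can look like in practice. The categories and values are made up, and the specific trims (the frame, the tick marks, the y-axis) are just one reasonable set of choices.

```python
# A hedged sketch of trimming non-data-ink from a simple matplotlib bar chart:
# the box spines, tick marks, and y-axis go, and direct value labels come in,
# much like the exact numbers on the Pew chart. Categories and values are made up.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # hypothetical categories
values = [42, 57, 31, 64]           # hypothetical values

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(categories, values, color="steelblue")

# Remove non-data-ink: the frame, the tick marks, and the y-axis.
for spine in ax.spines.values():
    spine.set_visible(False)
ax.yaxis.set_visible(False)
ax.tick_params(bottom=False)

# Label each bar directly so the removed axis costs the reader nothing.
ax.bar_label(bars, padding=3)

fig.tight_layout()
fig.savefig("data_ink.png", dpi=150)
```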
Big Data and Data Visualization
Big data lives up to its name: large, unstructured, and unwieldy. Cleaning the dataset is already enough of a challenge, but getting to a reasonable visualization, one where the reader can make the necessary comparisons or derive insights, requires further transformation and standardization. Put plainly, there is a very real compute cost just to get the dataset to a usable state. On top of that come the computations for translating each numerical value into an individual, unique point within a cohesive visualization.
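As a rough illustration of that transformation cost, here is a minimal pandas sketch of the kind of aggregation step that usually sits between a raw dataset and a plottable one. The file name and column names are hypothetical.

```python
# A minimal, hypothetical sketch of the transformation step: collapsing a large
# raw event log into something small enough to plot. The file name and the
# "timestamp", "region", and "value" columns are made up for illustration.
import pandas as pd

raw = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Standardize, then aggregate: one row per region per day instead of
# millions of individual events.
daily = (
    raw.dropna(subset=["value"])
       .assign(day=lambda df: df["timestamp"].dt.floor("D"))
       .groupby(["region", "day"], as_index=False)["value"]
       .mean()
)
daily.to_parquet("daily_summary.parquet")  # cheap to reload when it is time to plot
```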
As mentioned earlier, exascale computing might be the answer to these data visualization woes. In situ processing and visualization has long been a goal for researchers, and advances in feature tracking now make it possible in up to four dimensions. One feature-tracking toolkit, developed by a team of public researchers, is being used to analyze and measure fusion reactions. Other efforts include richer modeling of the parameter space to better predict positive, attainable outcomes.
Even though most data visualizations are now consumed on a screen rather than on paper, the data-ink ratio is just as important for big data. With so much data available, the temptation is to show it all to the viewer, to tell the whole story. But just because the resolution of a dataset increases does not mean its quality follows. Overloaded visualizations end up obscuring the actual information. A classic example is the heatmap, pictured below. What do the colors mean? Why does the reader need to see a map of the entire United States? Could more, or better, information be conveyed by showing just a single state or area?
More specifically, this is a US National Weather Service map: what if I just want to know if I need a coat today or not? If it will rain or not?
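One remedy for that overload, short of redesigning the map entirely, is to aggregate before drawing. The sketch below uses synthetic data to contrast plotting every point with binning the same points into a readable summary.

```python
# One way to cut the overload: bin a dense point cloud instead of drawing every
# point. The data here is synthetic, purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = 0.6 * x + rng.normal(scale=0.8, size=200_000)

fig, (ax_raw, ax_binned) = plt.subplots(1, 2, figsize=(9, 4))

# Two hundred thousand overlapping points: mostly solid ink, little information.
ax_raw.scatter(x, y, s=1, alpha=0.05)
ax_raw.set_title("Every point")

# Hexagonal binning summarizes the same data at a readable resolution.
hb = ax_binned.hexbin(x, y, gridsize=40, cmap="Blues")
fig.colorbar(hb, ax=ax_binned, label="Points per bin")
ax_binned.set_title("Binned")

fig.tight_layout()
fig.savefig("overplotting.png", dpi=150)
```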
More practically, dynamic data visualizations can be used to create a flow between high-level analysis and granular insight. For an example of a dynamic, and narrative, data visualization, JDI loves the backstory of this conflict of interest map. The reader immediately sees the vast number of connections, then can easily dive into individual data points to learn more.
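As a small sketch of that high-level-to-granular flow, an off-the-shelf library like Plotly Express can attach the detail to hover and zoom rather than printing it all at once. This is not the tooling behind the map linked above; the bundled gapminder dataset simply stands in for a larger corpus.

```python
# A small sketch of high-level overview plus on-demand detail, using Plotly
# Express and its bundled gapminder data (standing in for a larger corpus).
# This is not the tooling behind the conflict of interest map linked above.
import plotly.express as px

df = px.data.gapminder().query("year == 2007")

fig = px.scatter(
    df, x="gdpPercap", y="lifeExp",
    size="pop", color="continent",
    hover_name="country",                       # granular detail appears on hover
    hover_data={"pop": ":,", "gdpPercap": ":.0f"},
    log_x=True,
    title="High-level view; hover and zoom for the details",
)
fig.write_html("drilldown.html")                # pan, zoom, and hover in the browser
```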
Closing the gap between what is possible in data processing and what we can accomplish with visualization is key to unlocking a whole new level of understanding and insight from data. For the weather map above, users could zoom into their town or city to see exactly what the weather will be like that day, or switch between temperature and precipitation chance.
Further work on how different dimensions interact with each other can only help reduce noise and drive better understanding. We live in this big data world, and it should work for all of us. Not everyone can pore over lines of code or rows and rows of a spreadsheet, but we can all look at a well-designed graphic of data and get something out of it. Just as Tufte believed, "What is sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather the task of the designer is to give visual access to the subtle and the difficult."