The Four Pillars of Visual Analytics


INTRODUCTION

Every chart’s selection begins with one common factor: The Data.

When people think about choosing a chart, it is because they are trying to tell a story about their data, but very often they run into problems telling that story. Sometimes it’s incomplete, other times it’s misleading, and often it’s confusing or just plain wrong.

YOU HAVE TO UNDERSTAND YOUR DATA IF YOU WANT TO CHOOSE THE CORRECT VISUALIZATION

Here we will cover, in greater depth, what I consider to be the Four Pillars of Mapping Data to Visualizations. Those four pillars are Data Attributes, Visual Encoding, Usage (relation between data) and Target Audience (data scientists, analysts, CEOs…).

DATA TYPES

When it comes to data attributes, there are two categories: quantitative data and qualitative data. Quantitative data is exactly what it sounds like: a numerical value placed on an ascending scale (e.g. I am 32 years of age, I drank two bottles of water today). Qualitative data refers to values that cannot be measured numerically, but can be described through language (e.g. I came in 3rd place at the swim meet; since I’m always on the run, I prefer a laptop over a desktop).

Within these two categories are a total of four subcategories as well:

Quantitative Data

  • Ratio: Data you can perform arithmetic operations on (add, divide, etc.)
    Example: How much do these clothes I want to buy cost if I add them all together? (cost $10, $20, $30 or age 10 years old, 20 years old, 30 years old)
  • Intervals: Data with a set value that you cannot perform all arithmetic operations on.
    Example: You cannot calculate the sum of temperature during a week but you can calculate the average temperature per day and the high/low for each day. (temperature -5°, 10°, 25° or time 1am, 5am, 9am)

Qualitative Data

  • Ordinal: Data with a fixed ranking with indeterminate distance between the values
    Example: A large soda in Sweden is very different from a large soda in the United States, but I don’t know exactly how much larger (size small, medium, large or position 1st place, 2nd place, 3rd place)
  • Nominal: Data where you can distinguish between values, but not order them.
    Example: The term football can refer to NFL football or English football, and there is no way to rank one above the other… I think I will leave that up to my colleagues in the US and UK! (sports NFL football vs. English football or computers laptop vs. desktop)
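One way to keep these four subcategories straight is to write down which operations are meaningful for each. The Python sketch below is only an illustration of the list above: the names `Level`, `MEANINGFUL_OPS` and `is_meaningful` are assumptions made for this example and do not come from any library.

```python
from enum import Enum
from statistics import mean


class Level(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Hypothetical summary of which operations make sense at each level.
MEANINGFUL_OPS = {
    Level.NOMINAL: {"equal"},
    Level.ORDINAL: {"equal", "rank"},
    Level.INTERVAL: {"equal", "rank", "difference", "mean"},
    Level.RATIO: {"equal", "rank", "difference", "mean", "sum", "ratio"},
}


def is_meaningful(level: Level, operation: str) -> bool:
    """Return True if the operation makes sense for data at this level."""
    return operation in MEANINGFUL_OPS[level]


# Interval example from the list: averaging temperatures is fine, summing is not.
temperatures = [-5, 10, 25]
average_temperature = mean(temperatures)

# Ratio example from the list: costs can be added together.
costs = [10, 20, 30]
total_cost = sum(costs)
```

Here `is_meaningful(Level.INTERVAL, "sum")` is false while `is_meaningful(Level.RATIO, "sum")` is true, which mirrors the temperature and cost examples.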

Based on these classifications, the methods for aggregating and visualizing the data need to adjust accordingly. For example, if you were to map car manufacturing data like the image below, and your data set included year-to-year manufacturing figures, it makes more sense to stick to an annual order. If you sort the values from highest to lowest instead, your readers will have trouble following the order of the years (1978, 1979, 1980, etc.). Likewise, ordinal data should be sorted by its inherent order rather than alphabetically by the names of its values (if you were mapping month by month, for example).
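As a rough sketch of the month-by-month case, the snippet below contrasts alphabetical sorting with sorting by inherent order. The figures in `units_by_month` are made up for illustration, and `MONTH_ORDER` is just a helper list defined for this example.

```python
# Inherent (calendar) order of the ordinal values.
MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Made-up monthly figures, deliberately stored out of order.
units_by_month = {"March": 120, "January": 95, "February": 140}

# Alphabetical sorting scrambles the calendar order.
alphabetical = sorted(units_by_month)

# Sorting by position in MONTH_ORDER keeps the order readers expect.
by_inherent_order = sorted(units_by_month, key=MONTH_ORDER.index)

# Sorting by highest value is the other ordering the text warns against.
by_value = sorted(units_by_month, key=units_by_month.get, reverse=True)
```

Only `by_inherent_order` gives January, February, March; the other two orderings force the reader to hunt for each month along the axis.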

There is much more to cover, but hopefully this post offers a basic guideline to help you determine what type of data you are trying to visualize. In the next installment of this series, I will address visual encoding and how to determine which markers to use to display these data attributes accurately.

This is just one way to classify data attributes, and there are more advanced schemes out there that may be even better to use. For example, it’s hard to classify data that is calculated as a percentage. But I still believe this post is a good start and easy to remember. So now you can start to think about the data and what you can do, but also what you shouldn’t do! Just following some of these guidelines will get rid of some basic mistakes in your visualizations.

In the next post I will also show how we can use the step of classifying the data to better select the appropriate method to represent the data.

VISUAL ENCODING

Once you have identified what type of data you have (nominal, ordinal, interval, ratio) and the axis to map it on, you need to figure out how best to display that data visually using colours, shapes, sizes and position.

For proper perspective on the subject, in 1984 William S. Cleveland and Robert McGill published a landmark piece of research on graphical perception that articulated the standards that many data visualizations abide by today. Their research, which was published in the Journal of the American Statistical Association, concluded that everyone has different perceptions of visualizations but there are a few simple steps that everyone can follow. Cleveland and McGill tested a series of visual encoding theories through experimentation and established a series of guidelines based on which visual markers are more accurate and which are less accurate.

For all data to be mapped to a visualization, these are your basic options of display:

The order of accuracy for these markers is this:

Position is the most accurate marker followed by length and angle, which makes sense if you are mapping data points that we identified in the prior post (cost, age).

It is very important to use the correct encoding in each case. For example, in the first chart we see a visualization trying to indicate cars being sold across various countries, but there is a problem: a nominal attribute (country) is being mapped by length, which does not help us understand the data very well.

Let’s try mapping this data another way. Below, you can see that both attributes have been mapped by position, which allows us to learn more about the data. This is much better.

For more information about How to Choose the Right Chart take a look at this post.

Here you can find some basic rules to select the correct visual encoding for your charts:

  • For Nominal data: No one value is more important than the next. While position is best, shapes such as circles and squares can also be helpful to display your data.
  • For Ordinal data: Because you are trying to map data with an inherent ranking, the light and dark tones of shading will further emphasize your data’s importance.
  • For Interval/Ratio data: You are looking to map numerical values, therefore the best way to measure those values is through position or length.
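These rules can be written down as a small lookup table. The sketch below simply restates the three bullets above; the names `ENCODINGS_BY_TYPE` and `suggest_encodings` are hypothetical and chosen only for this example.

```python
# Suggested visual encodings per data type, restating the rules above.
# Earlier entries in each list are preferred over later ones.
ENCODINGS_BY_TYPE = {
    "nominal": ["position", "shape"],
    "ordinal": ["shading"],
    "interval": ["position", "length"],
    "ratio": ["position", "length"],
}


def suggest_encodings(data_type: str) -> list[str]:
    """Return the suggested encodings for a data type, best first."""
    key = data_type.lower()
    if key not in ENCODINGS_BY_TYPE:
        raise ValueError(f"unknown data type: {data_type!r}")
    return ENCODINGS_BY_TYPE[key]


# Country is nominal, so position is suggested and length is not.
country_encodings = suggest_encodings("nominal")
# Cost is ratio data, so position or length are both reasonable.
cost_encodings = suggest_encodings("ratio")
```

Looking up "nominal" returns position and shape but not length, which is the problem with the first car sales chart described earlier.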

USAGE

The Third Pillar of Mapping Data to Visualizations is Usage, meaning how you are going to use your data.

More accurately: it’s a question of what you want to see in your data. Without knowing what your data set looks like I would say that a bar chart is better than a line chart (unless you’re looking at trends over time). I would also say that it’s very rarely a good idea to use a pie chart – but there is more to it than just a set of guidelines without referencing the data.

Each time you map data to a visualization, the key thing you need to consider is: What are you trying to show the user? It isn’t a question of what looks pretty, anyone can do that: you want a chart that demonstrates value and achieves its purpose in an easily recognizable way.

Remember the infographic included in the How to Choose the Right Chart post, where we break out a list of basic charts into four groups: Comparison, Composition, Distribution and Relationship?

These are the main uses that you can give to your data, and this will guide you to choose the correct visualization. Let’s run through each group:

Comparison Visualizations

Comparison – These visualizations relate to the time and size of your data. You are quite literally comparing multiple values: in some cases the data is timed, in others it’s itemized, as you can see above. Unfortunately there isn’t one chart you can use for all timed data; some situations dictate line charts while others merit bar charts or area charts (for cyclical data).

For fewer categories among items, a bar chart’s length displays the differences in your data better than its angle. That’s the reason why we prefer a bar chart to a pie chart when it comes to comparison. If you remember my blog post about encoding data you can see that length is more accurate than angle on interval/ratio data.

Composition Visualizations

Composition – These visualizations refer to data sets that either change over time or are static (they do not occur over time or are non-spatial). With static data, a pie chart can work; however, there are a host of other options that can tell the same story. With data that changes over time, the number of data points is a critical factor. One should also consider that the axis needs to match the order of the data (e.g. in a stacked bar chart the years 1990-1999 should be listed in order as opposed to by highest value).

Distribution Visualizations

Distribution – Here you are mapping either a single variable or two variables. With this data set, you don’t want scroll bars to toggle through the data: you just want the full picture. As you can see from the choices above, if you have many data points, it’s best to use a line histogram. If you are interested in mapping every point, you want a bar histogram.
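To make the histogram idea concrete, here is a minimal sketch of the counting step behind a bar histogram of a single variable. The `ages` values are made up, and `bin_counts` is a hypothetical helper written for this example rather than a function from any charting library.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin of the given width.

    Each bin is labelled by its lower edge, e.g. 30 covers 30 to 39
    when bin_width is 10.
    """
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up ages for illustration.
ages = [21, 25, 32, 34, 38, 41, 47]
age_histogram = bin_counts(ages, 10)
```

Each key in `age_histogram` would become one bar and each count its height. With many more data points and narrower bins, the bars start to trace a curve, which is where a line histogram becomes easier to read.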

Relationship Visualizations

Relationship – The fourth and final group of visualizations ups the ante from the Distribution charts. Here you are always mapping two or three variables. The best guideline to follow in this case is rather clear cut: if you have two variables and you want to add nominal or ordinal data to categorize your data, then use color. But if you’re adding a third variable that’s interval or ratio you can see that size is better. This also ties back to my second blog post about the best way to encode data.
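The guideline for the extra variable is simple enough to express as a function. This is a sketch of the rule stated above and nothing more; `third_variable_encoding` is a hypothetical name used only for this example.

```python
def third_variable_encoding(data_type: str) -> str:
    """Suggest how to encode a variable added to a two-variable chart."""
    key = data_type.lower()
    if key in ("nominal", "ordinal"):
        # Categories are told apart by color.
        return "color"
    if key in ("interval", "ratio"):
        # Numerical values are shown by marker size.
        return "size"
    raise ValueError(f"unknown data type: {data_type!r}")


# Adding a category (nominal) to a two-variable chart: use color.
category_encoding = third_variable_encoding("nominal")
# Adding a numerical value (ratio) to a two-variable chart: use size.
value_encoding = third_variable_encoding("ratio")
```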

TARGET AUDIENCE

The fourth and final pillar of visual analytics is to know your audience before you start to represent your data.

Have you ever prepared a perfect presentation for the wrong audience? Every visual representation is made for a purpose, and it is important to understand that data scientists do not read visual analytics in the same way that CEOs do.

Take this into consideration when choosing your representations: do not choose overly complex charts if the audience only needs a Total Sales Summary Report. On the other hand, if you need to give detailed information to a group of data scientists, it could be worthwhile to choose a combined representation with different types of charts, drill-down functionality, etc.

CONCLUSION

I hope that this post has opened up a much larger world to you. I encourage you to continue discovering other options for your charts, like maps, slope graphs or box charts.