Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
In this #DataTalk, we had a chance to talk with Dr. Alberto Cairo about data visualizations and ways to avoid displaying misleading data. Make sure to follow Dr. Cairo on Twitter and on his website: The Functional Art.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business. To suggest future data science topics or guests, please contact Mike Delgado.
Here is a complete transcript:
Mike Delgado: I want to welcome everyone to our weekly Data Talk where we feature the smartest people working in data science. Today we’re talking to Alberto Cairo. You can learn more about him by going to AlbertoCairo.com. I highly recommend you check out his website. Later, after this broadcast, I’ll be putting in the URL for AlbertoCairo.com in both the YouTube video and here on Facebook.
Dr. Cairo is the Knight Chair in Visual Journalism at the University of Miami. He is also the author of two books dealing with data visualization and telling the truth with data. That’s the topic of today’s chat. Alberto, it’s an honor to have you today to talk to us about data visualization and the problems that we sometimes make as individuals. I’m not a data journalist or someone who works in data science, but just working with Excel, you can easily manipulate the data to make things tell a story that aren’t true, right?
Alberto Cairo: Yeah, absolutely. That’s not even the most relevant or the greatest problem that we face. I have always been a great believer in the power of visual communication to enlighten people and to illuminate and to highlight trends and patterns in data. I have always believed that since I began a career in this field 20 years ago.
In the past, I believe five years ago, when I started writing the second book “The Truthful Art,” I started thinking about how people, and when I say “people” I mean general citizens, a public who reads graphics and produces graphics, who use graphics when they use tools like Excel, for example, or free visualization tools that are increasingly available all over the internet. How people use them and how people read those things. I started observing the many mistakes that people make when reading graphics, even if the graphic is well-designed. That worries me because it highlights a problem that I described in some of my lectures.
The problem is the following: traditionally, educational systems focus on teaching students literacy and articulacy. Literacy is the ability to read and the ability to write. Articulacy is the ability or the skill to express yourself using spoken language. But the educational systems that I’m familiar with (Spain, Brazil, and the United States) focus much less on numeracy, the ability to think critically about numbers, data, statistics, et cetera. And they focus even less on graphicacy, which is the term that we could use to refer to visual literacy.
Students in schools, primary, elementary, middle school, they learn how to read a bar graph, perhaps a line chart. They take statistics when they get to high school. They learn how to perhaps read a scatter plot. But beyond that, they don’t learn how to interpret maps correctly or data maps correctly and maps can be extremely misleading. That worries me because all the effort that I have seen put into this field by people who write about data visualization, the books that I have read so far, most of the talks that I have witnessed and have enjoyed throughout the years, were aimed mostly at a public who already has an elementary understanding of graphicacy.
What I’m most interested in now is not that. There’s plenty of people out there who are already doing wonderful work trying to push visualization forward, creating new methods of visualization to be used by scientists, statisticians, business analytics people. What I’m interested in right now and what my upcoming lectures and writing is going to be focused on more is spreading this knowledge among the public, not amongst specialists.
Mike Delgado: Alberto, for your incoming students what are some of the common mistakes or things you’re noticing as they’re creating their own data visualizations in your classroom?
Alberto Cairo: The main mistake that they make in relation to the visualization of data per se is what I call working on autopilot. Working on autopilot means that when my students, or people in general, see a data set that has a geographical parameter in it, like longitude, latitude, or regions, or states, or countries, et cetera, they rush to create a map. They don’t even think about whether the map is the best way to display the data or whether the map is the best way to enable the tasks that you’re supposed to enable. They don’t think critically about the question: what is it that I’m trying to show?
A map is obviously a wonderful way of displaying data, but only when you want to highlight geographic patterns in the data. Not when you want to enable other possible tasks like comparing the variables within each one of those states or those regions, or when you want to rank those regions. A map is completely worthless for those tasks, right? Therefore, you need to think about other ways of representing the data that are not necessarily a map.
Or they try to make everything into a bar graph. But depending on what you want to show, the bar graph may not be the most exciting or illuminating or even functional way to display data. There are many other ways.
I encourage my students, and everybody who wants to work in this field, to not just rely on software defaults (the recommendations that the software gives you). “Oh, it seems that you have a univariate data set, let’s create a histogram.” Yeah, a histogram is a great way to show the distribution of data. But there are many other ways to show the distribution of the data. You can use a box-and-whisker plot, you can use a strip plot. There are many other ways, and each one of them has a particular task that it is good at enabling.
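As a rough illustration of looking at one distribution in more than one way, here is a minimal Python sketch. The data values and the function names are invented for this example and do not come from the interview. It computes the counts behind a histogram and the five numbers that a box-and-whisker plot draws, so the two views can be compared side by side.

```python
import statistics
from collections import Counter

# Hypothetical univariate data set, made up for illustration.
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 12, 13, 14, 30]


def histogram_bins(data, width):
    """Count how many values fall into each bin of the given width.

    Keys are the lower edge of each bin; empty bins are omitted.
    """
    counts = Counter((v // width) * width for v in data)
    return dict(sorted(counts.items()))


def five_number_summary(data):
    """Return minimum, first quartile, median, third quartile, maximum.

    These are the values a box-and-whisker plot is built from.
    """
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


if __name__ == "__main__":
    print(histogram_bins(values, 5))
    print(five_number_summary(values))
```

The histogram counts show where most observations pile up, while the five-number summary makes the single extreme value (30) stand out against the quartiles. Neither view is the "right" one; which to use depends on the task the reader needs to perform.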
What I encourage people to do is to try different things and to explore different ways of displaying the data. But always thinking that the main purpose of visualization is to highlight patterns and trends in the data that are interesting to people. Creating something that is visually appealing is another worthy goal. But the main goal is to illuminate people.
Never trust software defaults. Never trust these kinds of handbooks or manuals that tell you, “whenever you want to compare, just use a bar graph.” No. There are many other ways to compare data. There are other ways to display the relationship between variables. Explore critically and creatively other ways of displaying data in a way that enables understanding. That’s always the key thing: a way that enables understanding.
There are other mistakes. I mean, there are mistakes that are made before you even visualize. Another thing that many people do is take data uncritically. By the way, all the mistakes that I always highlight in my books, my writings, my lectures, et cetera, they are all mistakes that I have made myself. So I feel entitled to talk about them because I’ve made them all. You don’t load a beautiful data set from the United Nations or from the World Bank or from the International Monetary Fund or from the Census and then you rush to visualize it.
Another main mistake is not considering where the data comes from and whether the data is measuring what it is supposed to measure. You always verify that information, double-check that information, or those assumptions with people who either have generated the data themselves or people who are experts in that knowledge field. This verification process is important before you even visualize the data.
Mike Delgado: I was watching one of your lectures and you emphasized the importance of before you begin to create your data visualization, it’s important to be asking the right questions. You told the story about when you first moved to Florida, how you were looking at schools for kids and looking at where the better schools are. What I liked about your example, maybe you can share it again briefly, is the questioning process you were going through in your mind, because that is key to creating a great data visualization.
Alberto Cairo: Absolutely. The first one is quite funny and it’s an example of what I meant before about making sure that the thing that you’re showing is measuring what it is supposed to measure. The first variable of school quality included in the data set is the school grade. The county assigns a letter grade to each school and that letter grade is A, B, C, D, et cetera. My assumption was that we have A schools, B schools, C schools, et cetera. If I create a bar graph or a histogram of the distribution of the grades, my guess was going to be that you will have a few As, tons of Bs, tons of Cs, a few Ds, and fewer Fs.
Mike Delgado: Yeah.
Alberto Cairo: I created a bar graph where A is this big, B is this small, C is this small. I said there is something strange about this data. The point that I’m making in lectures is that if a graphic contradicts a pretty reasonable assumption like that, you need to ask questions about why that is happening.
The possible answers to that question could be, for example, either that public schools in Miami-Dade are all wonderful because half of them got As. Which I know isn’t true. I know that there are many not that great schools in Miami-Dade. Or there is something faulty in the way that the county was grading the schools back in 2012 or 2013 when I did the exercise. This is when verification comes in. You discover something is strange in the data, you need to verify that.
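The check described here, charting the grade distribution and noticing that it contradicts a reasonable expectation, can be sketched in a few lines of Python. The grades below are invented for illustration and are not the county's data. The 50% threshold is an assumption taken from the remark that about half of the schools got As, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical letter grades, invented for illustration only.
grades = ["A"] * 10 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["F"] * 1


def grade_shares(grades):
    """Return the share of schools receiving each letter grade."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: counts[grade] / total for grade in sorted(counts)}


def text_bar_chart(shares, width=40):
    """Render the shares as a plain-text bar chart, one line per grade."""
    return "\n".join(
        f"{grade} {'#' * round(share * width)} {share:.0%}"
        for grade, share in shares.items()
    )


def contradicts_expectation(shares, top_grade="A", threshold=0.5):
    """Flag the data for verification if the top grade is suspiciously common.

    The threshold is an assumption for this sketch, not an official rule.
    """
    return shares.get(top_grade, 0.0) >= threshold


if __name__ == "__main__":
    shares = grade_shares(grades)
    print(text_bar_chart(shares))
    if contradicts_expectation(shares):
        print("Unexpected distribution: verify how the grades were assigned.")
```

The flag does not say what is wrong with the data. It only marks the point where, as described above, verification with the people who produced the data should begin.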
The county corrected this. In more recent data sets, the number of As has dropped and the number of Bs has increased. It’s an obvious case of grade inflation. It’s a huge problem in universities. But then I also seek to answer other questions. For instance, other variables in the data set are the percentage of students who can read at grade level and the percentage of the students who can do math at grade level. I explored the data in multiple ways.
I also split the data in several ways because Miami-Dade County is divided into nine board districts. So when I started exploring those variables, I started asking myself: are there differences between these small board districts? I started splitting the data and I discovered that there are huge inequalities between the districts, which match economic inequality.
The point that I made in the book and in lectures is that you can have a school in a relatively rich district on one side of the road. Fifty yards away, on the other side of the road that splits the district in half, you have another school which is less funded, has students who are worse prepared for the schoolwork, and has a lower performance. Literally, 50 yards away.
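Splitting a data set by district and comparing the groups, as described above, can be sketched as follows. The district names and percentages are invented for illustration and are not real Miami-Dade figures; `mean_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (district, percent of students reading at grade level).
schools = [
    ("District 1", 82), ("District 1", 78),
    ("District 2", 45), ("District 2", 51),
    ("District 3", 66), ("District 3", 60),
]


def mean_by_group(records):
    """Group the records by district and average the values in each group."""
    groups = defaultdict(list)
    for district, percent in records:
        groups[district].append(percent)
    return {district: mean(values) for district, values in groups.items()}


by_district = mean_by_group(schools)

# The spread between the best and worst district average.
gap = max(by_district.values()) - min(by_district.values())

if __name__ == "__main__":
    for district, average in sorted(by_district.items()):
        print(district, average)
    print("Gap between districts:", gap)
```

An overall average across all six schools would hide the difference between districts; the split is what makes it visible.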
Mike Delgado: Wow.
Alberto Cairo: Yeah. If you think about how unequal the United States is in terms of income and in terms of school performance, this was not that surprising. I already knew that. I already heard about all these things, but it’s quite striking when you see it visualized.
Mike Delgado: Yeah.
Alberto Cairo: Those are the kinds of little examples that anybody can reproduce at home, regardless of whether they have a training in statistics or data science or visualization that can get the public excited about the possibilities of visualization. It affects their own lives.
The point that I make in lectures is that these kinds of journalistic tools and data tools, they don’t belong to a specific field anymore. They don’t belong to journalists; they don’t belong to data scientists. If you get some basic training, read a little bit about statistics, a couple of books about stats, and then you practice with these tools, you can begin becoming a data visualization designer or journalist yourself. That will increase the general knowledge about graphicacy, it will increase the understanding that you have about your own neighborhood, about your own city. It can also help you become a better communicator, because once I discover this I will share it with my neighbors or with my friends or with my colleagues at the university.
This is the kind of pattern that I would like to spread. This excitement about the possibilities of visualization, and making visualization and data thinking and data reasoning something that anybody and everybody will embrace in the future, not just a specialized field. This doesn’t belong to a particular kingdom. It’s a skill and it’s a craft that anybody can understand and use.
Mike Delgado: Alberto, the way that you’ve described this, what I loved is that you are showing how data visualization and doing it properly is not just the statistics part, not just the numbers part, but also the qualitative, creativity, and asking the right questions. Right?
Alberto Cairo: Absolutely. Not only that, if you think about how I described the before process of creating a visualization, which is you gather the data, you explore the data, and then you visualize it, the step in between is the verification. You can do verification quantitatively. You can extract summaries of data, you can calculate uncertainties, et cetera.
But the real verification comes when you talk to people. When you pick up the phone and talk to the person who generated the data to ask that person about, if it is a survey, the questions that were asked in the survey to see whether the questions biased people in some way or if they were asking the right questions and were they measuring what they are supposed to be measuring. That qualitative aspect of a statistical analysis and data exploration, or data journalism, is obviously fundamental. It’s essential.
Mike Delgado: Do you encourage when people are working on these data visualization projects to work with others to verify their own questioning process to help avoid bias?
Alberto Cairo: Yeah, absolutely. That’s very common. It’s becoming common practice in journalism. It has always been common practice, but even more so now that journalists are using data broadly. If you go to the best practitioners of data visualization and data journalism today, in the United States, for instance, places like ProPublica or the New York Times constantly partner up with experts. ProPublica has people with training in statistics on the staff: a journalistic organization that has statisticians. Same thing with the New York Times. The New York Times graphics desk has people like Amanda Cox, who has a master’s degree in statistics.
For projects, particularly highly ambitious and complex projects, they partner up with professors, with political scientists, with statisticians all the time, just because it is impossible to be an expert on every single kind of data. You need to have domain knowledge of those fields to understand the nuances, the exceptions, the possible problems and limitations of the data that you have.
That’s something that a journalist like myself cannot assess. So, you need help, you need to partner up with people. I have done that in the past when creating projects about population patterns or political changes, et cetera. You need to work with demographers, you need to work with political scientists who can help you put the data in context before you visualize it.
Mike Delgado: I love this advice, Alberto. I work in social media and I see infographics created all the time. A lot of times by graphic designers and sometimes they’re working in their own silo. They’re working for a specific company and they’re not maybe working within a group of people who are analyzing the data that they’re leveraging to create that infographic. I think it’s exciting that you’re encouraging people to use data and create visualization. But you’re also saying be careful because you might draw the wrong conclusions and create the wrong data chart, right?
Alberto Cairo: Absolutely. It’s a matter of being as rigorous as you can. Assume your own limitations, that’s the first thing, and then try to be rigorous. Just pick up the phone or send an email, have a conversation.
I’m going to give you an example. You mentioned my professional website AlbertoCairo.com, but I also have a weblog that carries the title of my first book, “The Functional Art,” at TheFunctionalArt.com. A while ago, I debunked a story that was published by several news organizations like the Washington Times. It was a story in which they said that some statistical projections have shown that there are likely two million undocumented immigrants voting in American elections.
Mike Delgado: Right.
Alberto Cairo: That’s a data point. Where does this number come from?
Mike Delgado: Right?
Alberto Cairo: Where does this number come from? I did what the journalists who wrote that story didn’t do, which was to verify the data and see where the data comes from and where that number came from. I devoted an entire day just to debunk this story. I contacted the authors of the survey, I contacted the so-to-speak expert who misinterpreted the data and made the wrong assumptions and extracted the wrong projections from the data. I verified the data myself. I asked three statistician friends of mine, who obviously know much more than I do about stats, whether what I was writing was right or wrong.
It took me an entire day to do this. But I believe that it’s something that everybody can do. You see a data set or a news story that doesn’t work or that is using data that is looking dubious, I think that we have the responsibility to look deeply into that story. If the story is wrong, we need to write about it, we need to debunk it somehow.
These two million people were an extrapolation from 36 people out of a sample of 1,000 people. One survey, which was not even designed to detect whether there are people voting illegally in the United States, was designed for a completely different thing. It detected that some people in the survey made a mistake to say, “yes, I’m not an American citizen but I still vote.” There were 27 or 30 people who said this but they didn’t clarify whether they vote in American elections or they vote in their country of origin.
Mike Delgado: Right.
Alberto Cairo: Out of those 30 people, they basically extrapolated to the entire population and said there’s probably around 2 million people, with a margin of error of half a million people or something like that. First, it’s completely wrong, and second, it’s unethical.
Mike Delgado: Yes.
Alberto Cairo: It’s unethical. Which leads me to another point, it’s also moral thinking and ethical thinking. That is not well taught in schools, how to think ethically in a systematic manner. How to assess whether the actions that you’re about to take are right or wrong, good or bad. One of the reasons why I think that this is so relevant is that more and more I’m seeing people who adopt an attitude of nothing matters, nothing is true. The only thing that matters is to push my own ideological agenda and persuade people to join my tribe. Rather than spreading truth, they are trying to spread their own ideology, and there’s a difference between those two things as you well know.
That’s another thing that I’m trying to possibly make part of my next book, how we can teach people to think more ethically when dealing with data, when dealing with reasoning and especially when dealing with visualization and graphics.
Mike Delgado: No doubt. I mean, you spread false claims like that and it can make people upset, especially towards immigrants, and it’s all based on false data. A misreading of the data wasn’t ethical as you pointed out because it wasn’t even surveying enough people. Then the very fact it’s a survey is also troubling.
Alberto Cairo: The survey was right. It’s already explained in the article that I wrote about it. The survey, the one that the extrapolation was made from, is a rigorous survey. It was designed to ask Latinos about their political opinions. They asked them whether they are on the right or on the left, whether they liked this candidate or that candidate. But buried in the data, there was this data point, which was not related to the survey at all. It was one of those questions that are asked in surveys: “Are you an American citizen, naturalized, are you not naturalized?” There was a mismatch between the people who said they were citizens.
Out of that mismatch, which was probably just a mistake from the people who answered this survey, they made the extrapolation. They took those 27 people and said that from the survey of 1,000 there were 27 who are not citizens but are voting anyway. If you extrapolate that to your population, there are between one and two million undocumented immigrants voting in the elections. But the survey was not designed for that.
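To make the arithmetic of such an extrapolation concrete, here is a minimal Python sketch. It uses the 27-out-of-1,000 figure mentioned above, the standard normal-approximation margin of error for a proportion, and a population of 10 million that is purely hypothetical and chosen only to keep the numbers easy to follow. It does not reproduce the calculation made in the story being criticized, and the function names are invented for this example.

```python
import math


def proportion_with_margin(hits, sample_size, z=1.96):
    """Sample proportion and its approximate 95% margin of error.

    Uses the normal approximation: z * sqrt(p * (1 - p) / n).
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


def extrapolate(hits, sample_size, population):
    """Scale the sample proportion and its margin up to a population."""
    p, margin = proportion_with_margin(hits, sample_size)
    return round(p * population), round(margin * population)


# 27 respondents out of 1,000, as quoted in the interview.
p, margin = proportion_with_margin(27, 1000)

# Hypothetical population size, an assumption for illustration only.
HYPOTHETICAL_POPULATION = 10_000_000
estimate, plus_minus = extrapolate(27, 1000, HYPOTHETICAL_POPULATION)

if __name__ == "__main__":
    print(f"Sample proportion: {p:.1%}, margin of error: {margin:.1%}")
    print(f"Scaled estimate: {estimate:,} plus or minus {plus_minus:,}")
```

Scaling up multiplies the 27 answers and their sampling error by the same factor, which is how a handful of respondents turns into a headline number. The margin of error only captures sampling variation. It says nothing about whether those 27 answers were themselves mistakes, which is the objection raised here.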
Mike Delgado: Yeah.
Alberto Cairo: Those 27 people were probably just people who mistakenly answered the question saying, “well, I vote.” But they didn’t clarify whether they vote in American elections or in the elections of their own countries.
Mike Delgado: That’s right.
Alberto Cairo: They just say “I voted” or “I vote.” It was like one of the worst cases of data manipulation that I have ever seen.
Mike Delgado: I’m glad you jumped all over that. It’s a lot of work.
Alberto Cairo: Yeah. I had one day off. It’s like ideas are autonomous beings. Once an idea latches on, you have to take it out, right? You need to get it out somehow.
Mike Delgado: No doubt.
Alberto Cairo: I froze everything else that I was doing and needed to do: preparing for classes, writing, et cetera. I needed to devote some time to look into this thing and try to see whether there is some merit in the claims that are being made.
Mike Delgado: Yeah. I get very skeptical of survey data in general. I just finished reading this book called “Everybody Lies.” I’m not sure if you’ve read it yet.
Alberto Cairo: Yeah. I have it, I haven’t read it yet. But, yeah, there is a mismatch between what people say they like and what they search for in Google.
Mike Delgado: Exactly. Alberto, we are coming to a close, and for those who are watching, I highly recommend that you check out AlbertoCairo.com. You can find links to his books “The Truthful Art” and all the work that he’s doing to help others make better data visualizations.
Like Alberto was saying, be careful with the data that you’re using because of hidden bias that you might have. Alberto stressed the importance of working with a team. When you don’t know something or if you know somebody who’s an expert in a certain field, bring them along in your journey as you are working with that data, just like Alberto did.
That’s so important because as a data journalist or somebody who is trying to make a point, you need to be careful about the data you are leveraging and that visualization because you could be spreading false information and getting people upset if you’re not doing it properly.
Alberto, thank you so much for being our guest in Data Talk. It was great chatting with you. I know you’re going to be flying somewhere, but I just want to thank you for your time today.
Alberto Cairo: Thank you. Thank you so much. Nice to be here today. Take care.
Mike Delgado: Nice to meet you, Alberto. Take care.
Dr. Alberto Cairo is the Knight Chair in Visual Journalism at the School of Communication of the University of Miami (UM), where he heads specializations in infographics and data visualization.
He’s also director of the visualization program of UM’s Center for Computational Science and Visualization Innovator-in-Residence at Univisión.
Cairo is the author of the books Infografía 2.0: Visualización interactiva de información en prensa, published exclusively in Spain, and The Functional Art: An Introduction to Information Graphics and Visualization.
Over the past two decades, Cairo has been director of infographics and visualization at news organizations in Spain and Brazil, in addition to consulting with companies and educational institutions in more than 20 countries.