Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.
To keep up with upcoming events, join our Data Science Community on Facebook or check out the archive of recent data science videos. To suggest future data science topics or guests, please contact Mike Delgado.
In this DataTalk, we’re talking with Dr. Megan Price, Executive Director of Human Rights Data Analysis Group, about ways to use data to promote social justice and human rights around the world.
Mike Delgado: Hello, and welcome to Experian’s weekly Data Talk, a show featuring some of the smartest people working in data science today. Today we’re talking with Dr. Megan Price about how data scientists are using statistics to advance social justice and human rights.
Just a little background on Megan. She received her Bachelor of Science and Master of Science in statistics from Case Western Reserve University and then went on to get her Ph.D. in biostatistics at Emory University. Megan, it’s a pleasure to have you on our broadcast today.
Megan Price: Thank you so much. It’s fun to be here.
Mike Delgado: I always like to start off these episodes by asking our guests how they got started with data science. What did your journey look like, both academically and in the work you do now?
Megan Price: I had a fairly linear approach to becoming a data scientist, because as you just described all my degrees are in statistics or biostatistics, and I was always a math nerd. I was a math nerd my whole life. My fourth-grade science fair project was about math.
Mike Delgado: That’s awesome.
Megan Price: Yeah, it was. I have to say, I still think it’s kind of neat.
So I always knew I wanted to do something with math and with data analysis. And I was very fortunate when I was in undergrad to have some great mentors who exposed me to some statistical consulting work in public health, looking at clinical trials, looking at various treatments. When it came time to look at graduate school, they advised me. They said they thought I might enjoy a public health school experience more than just a straight Ph.D. program.
And then it was in public health school that everybody I interacted with wanted to do some kind of social justice work. I was surrounded by this idea that that was the best use of our skills.
It was in public health school that I first learned about my organization, the Human Rights Data Analysis Group, and specifically the work using statistics to analyze human rights. That was the end of the story. I was very fortunate from there to get the job I have now.
I’ve been very lucky to always get to work in this field.
Mike Delgado: That’s really cool. I’m curious about when you were working on your Ph.D. What were some projects you worked on that were really exciting to you?
Megan Price: My adviser in my Ph.D. program worked primarily on clinical trials that involved stroke victims and victims who had had traumatic brain injuries. That was largely what I was studying for my methodological work. The substantive questions were: What kind of treatment are these patients getting, and is it working? And how can we use our statistical analysis to answer those questions?
Mike Delgado: There’s obviously the math component to it, but then as a researcher, there’s also the emotional side, that you want to find solutions, right?
Megan Price: Yeah. Grad school was one of the first places I learned the lesson of how important it is to care about the substantive questions and to tackle important substantive questions, but then also how to balance that with the day-to-day technical and methodological work. Because you do have to be able to come in every day and analyze that data and try to answer your questions using your best technical skills. Finding ways to balance that, and to cope with that, started in grad school.
Mike Delgado: As you’re getting into the data … I mean, these are people’s lives. How did you separate, not get too close and be objective during that whole process?
Megan Price: It is a tricky balance, because you never want to forget that these are people’s lives, but you do need to stay objective. And in grad school, I learned from my adviser, and I modeled after her work. Once I started my current job … I come back, again and again, to this quote from my colleague and HRDAG’s co-founder Dr. Patrick Ball that we have a moral obligation to do the best work that’s technically possible to honor these individual lives who are represented in our data.
That’s what I come back to every day when I’m in the weeds of a technical problem or a methodological problem — remembering that focusing on that piece of the work isn’t disrespectful, and it isn’t putting distance between me and what the data is about. It’s enabling me to do the best work I can to respect the individual lives that the data are about.
Mike Delgado: Was it difficult for you to leave academia to then begin to work for a nonprofit, or do you feel like what you were doing in academia transferred over so easily you felt like you’re doing the same thing?
Megan Price: It wasn’t difficult for me to leave because I was the kind of person who knew I was never going to stay in academia.
Mike Delgado: Okay.
Megan Price: I knew from the get-go I wasn’t on a path to become a professor. But, on the other hand, it was not a seamless transition. I definitely did not feel like I was doing the same thing.
The way I think about grad school, the skills I think grad school gives you, are ways to think and ways to learn things. So, even though the work I do now is methodologically quite different from the work I did in my graduate research, every day I’m using the skills I developed there. Learning how to read papers and figure out if some new technology or some new method that someone has developed applies to my problem and then how to apply it to my problem. So that was a direct transfer.
Mike Delgado: What’s also interesting is that on this show I’ve been able to interview lots of different data scientists. I feel like, as a data scientist, you’re constantly a student.
Megan Price: Yes.
Mike Delgado: You’re constantly learning. You’re always in grad school. It never ends.
Megan Price: That’s very true, and I think that’s part of what makes it a good fit for me. It was something I worried about a little bit when I was in grad school, because I did school for so long. I was really good at it, and I really liked it. I did worry a little bit if I was not going to formally stay in school and be a professor, how was I going to do something else? So that feeling that I’m always learning and I’m always in school is comforting.
Mike Delgado: That’s awesome. Today’s topic is around advancing social justice and human rights and how you’re leveraging statistics and data science to do this. Can you share some different use cases, some things your organization has done, or work you’ve done in the past, to help advance social justice?
Megan Price: The most visible and directly linked way that our work influences social justice is when we’re asked to testify in court cases as expert witnesses. Those are some of the clearest examples, and most of this work has been done, again, by my colleague Dr. Patrick Ball. He typically is the one who testifies and presents our analysis in court cases.
One of the cases with the best outcomes was in 2013, when he testified in Guatemala in the case against General Efraín Ríos Montt, who was the de facto leader of Guatemala in the early ’80s and was charged with acts of genocide. Patrick presented analyses our team did showing that, in some specific regions during this time period, members of the Mayan population had a five to eight times higher risk of being killed by the army than non-Mayan members of the population.
The lawyers argued that that statistical pattern to the violence is consistent with ethnic targeting and with genocide. Ultimately, the judges agreed. The judges found Ríos Montt guilty, and, in fact, they referenced Patrick’s analysis in their verdict.
Again, this is one of the most positive possible outcomes for us, because what the judges said was that the analysis Patrick presented confirmed numerically the stories the individual victims were telling. We view that as our role, to affirm and to amplify the voices of victims and victim communities.
We’re one piece of the puzzle, but we’re hopefully a piece that strengthens some of that advocacy work.
In the case of Guatemala, unfortunately that verdict only stood for 10 days. The constitutional court in Guatemala overturned the verdict on a legal technicality, and a retrial is still working its way through the court system. So, we’re all still waiting to see what’s going to come of that.
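To make the "five to eight times higher risk" comparison above concrete, here is a minimal sketch of how a relative risk (risk ratio) between two groups is computed. The function name and all counts are hypothetical and chosen only for illustration; they are not figures from the Guatemala case, and the conversation does not describe how HRDAG produced its estimates.

```python
def relative_risk(deaths_a, population_a, deaths_b, population_b):
    """Risk of death in group A divided by risk of death in group B."""
    if min(population_a, population_b) <= 0 or deaths_b <= 0:
        raise ValueError("populations and deaths_b must be positive")
    # Same as (deaths_a / population_a) / (deaths_b / population_b),
    # written in cross-multiplied form.
    return (deaths_a * population_b) / (deaths_b * population_a)


# Hypothetical counts, for illustration only (not data from any real case).
ratio = relative_risk(deaths_a=60, population_a=1_000,
                      deaths_b=10, population_b=1_000)
print(f"Group A's risk is {ratio:.1f} times group B's")
```

A ratio of 1 would mean both groups faced the same risk; the further the ratio is above 1, the more the risk was concentrated in group A.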
Mike Delgado: As you are working on a case like that, you’re dealing with a lot of sensitive data. You’re dealing with people’s lives. As you’re working on these very sensitive issues, how are you protecting the privacy of that data and at the same time being transparent with the courts? Can you share a little bit about that process? Because I think a lot of data scientists who are listening to you would be very curious how that works.
Megan Price: We have very strict in-house protocols around keeping the data secure, and they’re largely what you would expect. All of us have encrypted machines; we keep the data locally, on our own server that’s also encrypted. All the conversations, all the movement back and forth between our machines to do analysis, and the server where the data is stored are encrypted.
That’s our in-house data protection policy. In terms of transparency, we do worry about that a great deal, because all of our work has to be transparent, replicable, auditable. But, for us, that’s much more about the code than the data. We write all our code in open source platforms so that anyone can run it, can access it, can interpret it.
And we do all our coding and all our version controlling of code through GitHub, but all the data, again, stays local. We separate out those two pieces.
When it comes to things like court cases and things we might need to share with lawyers or judges, that all has very specific regulations depending upon the specific jurisdiction. Similarly, each of our projects has different parameters depending upon … In most cases, we have not personally collected the data. In most cases, we partner with a local organization that’s done the data collection.
Then we’ll write a data sharing agreement with that organization that not only outlines the security measures we’re going to follow to protect their data, but also the terms under which we would share their data in circumstances like if a lawyer or a judge needed to see it.
Or, in other circumstances, we might have negotiated. Our partners might be interested in research that other groups we work with are doing, and they might be comfortable having their data used for research projects. Or they might not. It really depends on the specific case, the details of the data and what they’re being used for.
Mike Delgado: How do you prep the attorney you’re working with? You’re working in statistics, you’re working with algorithms and models, and you have to be able to translate what you’ve done to help the lawyer — who may not have a strong background in maths — build the case. But he needs to be able to defend it, because the other lawyer, the opponent, might have some criticism about what you’ve done to manipulate the data.
Megan Price: Absolutely. That’s one of our biggest challenges and something that over the years we’ve gotten significantly better at. And one piece of that is just the conversation with whatever lawyer we’re testifying for or with.
That conversation is a lot like any other consulting conversation that any data scientist has, where you have that sort of iterative conversation where a partner, a client or a colleague says, “I think I have this question” or “I think I have this data.”
And you say, “Well, you could probably answer this other question, or if this is the question you want to answer this is what the data would have to look like or the kind of data you would have to have access to.”
That very familiar in-depth conversation to help them understand the link between their substantive expertise, their substantive questions and our methodological expertise and to bring those together. That’s one piece of it.
Another piece of it, obviously, is the judge, who we don’t have a chance to have those kinds of conversations with. Preparing through practice, ourselves and the lawyer we’re working with, to present our analyses in a way that’s very accessible, very readily understandable and very specifically applicable. Again, legal questions are so specific and so narrow. So making sure our analyses are applicable to those questions is another thing we’ve worked on and gotten better at.
Then the last piece is that there is an ongoing challenge for us, and for anyone introducing evidence, specifically in a courtroom, about what gets considered to be evidence and how expert witnesses are evaluated and are determined. This was something we struggled with at the International Criminal Tribunal for the former Yugoslavia, where the defense brought in their own expert with a different specialty who, in our opinion, was not qualified to evaluate the analyses that Patrick was presenting. But it’s up to the judge to decide who is considered an expert witness and who gets that weight.
And that’s slightly outside of our wheelhouse, but it is something that we worry about. How can we help to strengthen this community more broadly by helping judges and lawyers to establish standards around what gets introduced as evidence and how it gets evaluated?
Mike Delgado: The terminology maybe doesn’t translate, so you’d have to do a really good job of explaining how all of this works, maybe even providing visuals, data visualizations, to help explain. And then also think like a lawyer. What arguments could there be against my data?
Megan Price: Yes.
Mike Delgado: Are there any other use cases or examples of how data science has been used to help resolve some of society’s ills?
Megan Price: I’d love to talk about one other example from our work and then a couple others that are not from our team.
One is some new work we’re doing where … Again, this is work Patrick is doing in Mexico, using machine learning models to predict municipios in Mexico that are likely to contain hidden graves that haven’t yet been discovered.
Mike Delgado: Wow.
Megan Price: The thing that’s so interesting to me about this particular project is the reaction of our partners working on the ground. When we showed them the results of the model and said, “These are the municipios that we think probably contain hidden graves,” all the experts said, “That’s not surprising. That’s where the violence is happening.”
In this case, the most valuable result is the fact that it was not surprising, because it’s given the advocacy groups another tool to petition the government to investigate and look for these graves. Because, understandably, in the midst of upheaval, it can be very difficult, and often very dangerous, to look for and investigate these potential hidden graves.
We’re in the midst of that project. I don’t have a good conclusion for you about success on that, but I think it is a relief.
Mike Delgado: Yeah.
Megan Price: … component to the advocacy when these groups can say, “But this statistical model also says we should look here.”
Mike Delgado: No doubt.
Megan Price: Then the couple of other examples that are not our work but I would love to draw attention to. Microsoft Research actually did some really cool stuff a couple of years ago around flash flooding and looking at the gauges that are in rivers and streams and that sometimes can get overwhelmed when there’s a flash flood. Instead of triggering a warning, they instead say the water level has gone down, because they’re overwhelmed and they’re broken.
So Microsoft Research developed these machine learning models to leverage the input from neighboring streams and the sensors that were still working in the other streams to get much better data and much better warning systems about when there might be flooding. I think Microsoft Research is actually doing some really …
Mike Delgado: That’s awesome.
Megan Price: … some unexpected social good work.
Mike Delgado: Yeah. That’s wonderful. We just got a question. Let me put it up on the screen. It’s from Christina, asking about the first use case. What are hidden graves?
Megan Price: They are basically any unofficial graves, so not inside a cemetery or other sort of place where you would expect to find graves, and they are unmarked in some way. Essentially, to talk about them bluntly, it’s a place bodies get dumped. It’s a place where there’s more than one victim, it’s unmarked, and it is in some way undiscovered or hidden in some fashion.
Mike Delgado: Are a lot of those from wars, from illegal activity …
Megan Price: All of the above. In Mexico in particular, it’s very difficult to tease out drug-related violence and other sort of political violence and who the perpetrators might be.
Mike Delgado: It’s fascinating that you have leveraged machine learning to uncover that. Can you explain how that happened?
Megan Price: It’s one of those applications that our team excels at — recognizing that these partner organizations … Data Cívica, which is a Mexican nongovernmental organization, and Iberoamericana University, the human rights program at a university there in Mexico, had information about graves that had been found this way. Essentially locations of bodies that had been found and attributes of those municipios.
And this is the thing that machine learning excels at. Humans could maybe look at some of those attributes and subjectively try to draw some conclusions, but the machine learning algorithm doesn’t really care what any of that information is. It’s just going to do a really good job of classifying areas into likely to contain or not likely to contain.
And that’s essentially what we did. We used the information we had from these groups about graves that had been discovered and then municipios that they were pretty sure did not contain graves. That was our classification problem. Then we fed all the attributes we could possibly think of about these locations into a random forest, and we let it predict for us, and it predicted … With our testing data, it had perfect results. It exactly predicted the testing data. We’re continuing to update that. Every year we get more information from Data Cívica and Iberoamericana. We’ll see. We may tweak the model, we may use a different classifier, but for right now that’s how it’s working.
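The classification setup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not HRDAG's code: the data is synthetic, the three attribute columns and the labeling rule are invented, and scikit-learn's `RandomForestClassifier` stands in for whatever random forest implementation the team used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def make_synthetic_municipios(n=400, seed=0):
    """Build made-up data: one row per municipio, three invented attributes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    # Invented rule: label 1 = "graves found", 0 = "believed to have none".
    score = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
    y = (score > 0).astype(int)
    return X, y


def fit_and_score(seed=0):
    """Train a random forest and evaluate it on held-out municipios."""
    X, y = make_synthetic_municipios(seed=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)  # share of test rows classified correctly
    # Estimated probability of the "graves found" class for each test row.
    probabilities = model.predict_proba(X_test)[:, 1]
    return accuracy, probabilities


if __name__ == "__main__":
    accuracy, probabilities = fit_and_score()
    print(f"held-out accuracy: {accuracy:.2f}")
    print("highest predicted probabilities:", np.sort(probabilities)[-5:])
```

Holding out a test set, as in `train_test_split` here, is what makes a statement like "it exactly predicted the testing data" checkable. The predicted probabilities could then be used to rank municipios with unknown status, though how the real project uses its model output is not described in the conversation.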
Mike Delgado: One more question about that. In that particular study, what percentage was structured data versus unstructured?
Megan Price: To be perfectly honest, I don’t know. I’d have to go back and look at it. My guess is that almost all of it was structured, but I don’t actually know.
Mike Delgado: Okay. That’s a fascinating case study. You mentioned the Microsoft one. Did you have another one you wanted to share?
Megan Price: No, I think that sort of covers it.
Mike Delgado: Okay. Those are beautiful examples of how data is used. It’s beautiful when you see data scientists coming together to work on problems in society, human rights issues, using data to help solve it and uncover it. And the examples you just gave — in Mexico using machine learning to help find these unmarked graves is a beautiful example, because it helps build the case for reasons why certain groups need to go in and protect a certain area to uncover what’s going on.
Megan Price: Yes, for sure.
Mike Delgado: Are there any other issues on your mind? Things you’d like to work on in the future that you think … Maybe there’s not enough data yet, but you’re thinking it’s something on the horizon you would love to work on.
Megan Price: Yeah, a lot. It’s a weird way to think about this, but my wish list is very long. At HRDAG, a lot of our projects are in conflict and postconflict countries, and unfortunately there’s no shortage of those. So, in terms of things I want to work on, or things that I’m thinking about … Basically anywhere in the world where bad things are happening, I want us to be working. We can’t for various reasons, but that’s the short answer.
The more specific answer is we do have a couple of projects where the data we’ve used has specifically been archives; they’ve been documents left over by various bureaucratic institutions. And I happen to know that in both Egypt and the Ukraine, during various moments of political upheaval recently, documents were abandoned.
I have no connection to these. I know nothing about the details of what happened to those documents or where they’re being stored, but I’m so interested. I would love to gain access to them. That is somewhere where some of the analysis we’ve honed on these other projects could really help shed some light on things that were happening in those countries.
And then things that we don’t have enough data on right now but that I … You know, if I could wave my magic wand and get data around sexual violence and human trafficking. I think those are really, really hard questions to answer quantitatively. Even with current, really creative uses of ways to get data and creative methods, I just don’t think those are problems that have good analytical solutions yet, and I wish we did.
Mike Delgado: I know we have to go in about four minutes, but what are some of those challenges? Because that is a huge issue. What challenges, right now, do you have with gathering that really sensitive data?
Megan Price: You’ve gone right to it, especially in terms of sexual violence. One of the biggest challenges is that not only is some of the data hidden, because data is always hidden, we always have incomplete or missing data, but specifically some of the data is hidden because the victims themselves don’t want to disclose what happened for a wide variety of reasons. And there isn’t a good solution to that right now. There isn’t a way, that I know of, to use analytical methods to estimate that missing piece.
So, the sensitivity of it is the big challenge. There are a lot of methods for handling sensitive data and for trying to reach what in public health is called hard to reach populations. This is definitely a problem folks are working on, but it’s a hard one.
Mike Delgado: We did an episode a while back based on the book Everybody Lies, and the premise of that book was around how people use Google as a confessional. How during surveys they won’t admit to things, but to Google they will search for things.
Megan Price: Oh, interesting.
Mike Delgado: It’s very fascinating to hear how, through using Google data, they were figuring out there was a lot of racism where you wouldn’t expect. Because I always thought that racism was a Northern-Southern issue, in the U.S. The racism uncovered through Google is more of an East-West issue, which was fascinating. And they were using racist joke queries in the Northeast, and shocking things were being searched for. Now I’m wondering about people who are victims of sex crimes. Are they maybe searching for certain things in Google that might help uncover that?
Megan Price: That’s a really interesting question. I have a colleague who’s thinking about … She calls it data exhaust, and thinking about exactly that — these sort of breadcrumbs that people might leave.
Mike Delgado: That’s the hardest thing because those are issues no one wants to talk about. No one wants to admit to it, and thankfully we have a MeToo movement happening where things are starting to be discussed, which is beautiful, but there are so many other things that are not being uncovered, right?
Megan Price: Absolutely.
Mike Delgado: Just a small percentage right now is being put out there.
So before we go, we have four final questions we ask everybody.
Megan Price: Sure.
Mike Delgado: What is your favorite programming language and why?
Megan Price: I live in R and Python, and I probably should like Python better, but R is the first programming language I learned. So it’s like any other native tongue; it’s the one I think in.
Mike Delgado: Yeah, from what I’ve read, for everyone who is in statistics, R is the language. Is that fair to say?
Megan Price: I think so, and it’s definitely what I used more in school.
Mike Delgado: Cool. Second question. What advice do you have for people who want to become data scientists?
Megan Price: I love that question, and I think I have a somewhat contradictory answer, because one of the reasons I love statistics and data science is it can be used to answer so many different questions. But I think if you’re looking to get into data science, it’s worth starting with your question. What is it you want to do — not necessarily one specific thing — but what’s the category of thing you’re interested in?
Are you interested in better understanding clients and customers? Or are you interested in better understanding sports? Or some of these social justice questions? Because those are going to lead you toward fairly different skill sets, and as much as you do always need to be learning, it’s useful to start in a place that’s related to your motivation.
Mike Delgado: Nice. I love that.
Okay. Last question. What advice do you have for leaders who are looking to build a great data science team?
Megan Price: My advice to leaders is very similar. Think about what your goal and your motivation are and be really clear about that in recruiting and forming your team. Because, at least in our experience, it’s often quite possible to teach the technical skills, but it’s very difficult to get the commitment to mission. Starting with someone who understands what it is you’re trying to achieve and then spending the time to make sure they have the technical skills to help you achieve that is the right way to go.
Mike Delgado: Wonderful. Well, Megan, thank you so much for being our guest on Data Talk. Where can everyone learn about you and your work?
Megan Price: HRDAG.org and we’re also on Facebook. We’re HRDAG on Twitter as well.
Mike Delgado: For those who are watching the video, either on Facebook or YouTube, we’ll put the URLs in the comments, so you can go there. Also, if you’re listening to the podcast, we’ll have a full transcription along with links on our Experian blog, and the short URL is just ex.pn/MeganPrice.
I want to thank everyone for tuning in to this week’s broadcast of Data Talk. We’ll be back next week. If you want to learn about upcoming and past episodes, you can always go to ex.pn/DataTalk.
Dr. Price, thank you again for your time today.
Megan Price: Awesome. Thank you so much.
As the Executive Director of the Human Rights Data Analysis Group, Megan Price designs strategies and methods for statistical analysis of human rights data for projects in a variety of locations including Guatemala, Colombia, and Syria. Her work in Guatemala includes serving as the lead statistician on a project in which she analyzes documents from the National Police Archive; she has also contributed analyses submitted as evidence in two court cases in Guatemala. Her work in Syria includes serving as the lead statistician and author on three reports, commissioned by the Office of the United Nations High Commissioner for Human Rights (OHCHR), on documented deaths in that country.
Megan is a member of the Technical Advisory Board for the Office of the Prosecutor at the International Criminal Court, on the Board of Directors for Tor, and a Research Fellow at the Carnegie Mellon University Center for Human Rights Science. She is the Human Rights Editor for the Statistical Journal of the International Association for Official Statistics (IAOS) and on the editorial board of Significance Magazine.
Megan earned her doctorate in biostatistics and a Certificate in Human Rights from the Rollins School of Public Health at Emory University. She also holds a master of science degree and bachelor of science degree in Statistics from Case Western Reserve University.