Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.
In this #DataTalk, we have a chance to talk with Matt Dancho (Founder of Business Science) about ways machine learning can help organizations with recruitment efforts. Make sure to follow Matt on Twitter, LinkedIn, and GitHub.
This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business. To suggest future data science topics or guests, please contact Mike Delgado.
Mike Delgado: Hello and welcome to Experian’s weekly Data Talk Show, featuring some of the smartest people working in data science. Today, we’re very excited to talk with Matt Dancho. He’s the founder of Business Science LLC and has successfully implemented cutting-edge data science techniques in finance and marketing. He earned a Master of Business Administration degree from Penn State and a Master of Science degree in industrial engineering from the University of Pittsburgh. Matt, we are so excited to have you as part of our community today. How are you doing?
Matt Dancho: I’m doing great. Thanks for having me. I really appreciate it.
Mike Delgado: I’m so impressed just looking at your academic background — industrial engineering and computer science. You’re all across the board, obviously a huge fan of math and statistics, and we’re going to talk about how you fell into data science. But can you walk us through your journey, how you went from engineering to computer science to what you do now?
Matt Dancho: Sure. It’s pretty unique. I graduated from Penn State in 2006, and my first job right out of school wasn’t in data science. It was in valve engineering because I was a mechanical engineer. The data science aspect didn’t really start until I began my first management role. That wasn’t until 2011. Until then, I’d been working in Excel, doing primarily engineering calculations, a little bit of automation, some VBA. But it started with that first management role.
Imagine 2011 rolls around and you’ve got an engineer being thrust into a management role. So what do you get? An engineer combined with a business position equals analytics. That’s how the equation goes. You’ve got me, who needs to understand, figure out, analyze, and slice and dice, and really wants to understand the business side of it. Sales was completely foreign to me at that time. It started as business intelligence, trying to describe what was going on within our business and the particular areas that I oversaw.
That eventually morphed into prediction. And then you can see it goes down the line to where we are now with machine learning and all the advanced tools. That’s, in a nutshell, how I got into data science professionally. I don’t know if you want to know about my personal side of it, but I also have some roots in finance. I began managing my own portfolio back in 2006 when I graduated school. A lot of people know me for the tidyquant package.
Mike Delgado: That’s interesting.
Matt Dancho: It’s kind of funny. This was back in 2006, again, when I graduated. These are parallel paths, the professional side and the personal side, which is my own interest in finance. It started out in 2008 with the financial bubble. I did horribly. I got smacked around. I had these portfolios that I was managing at the time just in my own personal investments. After the financial crisis happened, I thought, “I need to smarten up. I’m an engineer. I can do this.” I hunkered down and learned the mechanics of finance and time series. This is the other side of where data science gets its roots with what I was doing. I was able to turn my investments around and take them from just doing OK to being excellent performers. This whole process eventually led to the creation of the tidyquant package in R. I have an interesting relationship with R, but it’s been a great ride. Going from Excel, and actually, I initially started in Python, of all places …
Mike Delgado: Which is now the leading programming language.
Matt Dancho: … which is the leading language for machine learning and analytics. I think it gets that leader spot because a lot of the colleges and universities are starting folks learning computer science in Python, and they aren’t doing that in R. Whereas in academia, fields like statistics, biology, and psychology really need very strong statistical analyses. That’s why I gravitated to R. Coming out of engineering, I was strong in statistics, and Python was more foreign because it was more on the coding side.
Eventually, I fell in love with coding, but on the R side. It really spoke to me because at that point in time, a gentleman named Hadley Wickham was developing what wasn’t quite called the Tidyverse yet. I think it was called the Hadleyverse. He was developing packages like ggplot2 and dplyr, the programs that are called libraries or packages in R. They really spoke to me, so I started implementing them in business, and that was like Excel on steroids. I couldn’t get enough of it, so I was applying that both on the professional side and the financial side.
Mike Delgado: I bet your company loved the fact that you were just killing it on Excel, bringing a whole new game.
Matt Dancho: Yeah. It was really cool. It’s just a different way of thinking about being a data-driven company. And this was before data was cool. It was like, “All right, we’ve got these couple of tricks.” It wasn’t machine learning. It wasn’t XGBoost or H2O or anything we’re probably going to touch on in this talk. It was like linear regression. “Can we apply that to make a forecast better?” It turns out you can, and you start to get insights that become wow factors and then enable us to make better decisions.
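Editor’s note: the kind of simple trend forecast Matt describes here can be sketched in a few lines. This is a minimal illustration in Python, not code from the interview or from Business Science. The monthly sales figures and the helper names fit_line and forecast are made up for the example.

```python
# Ordinary least-squares fit of a straight line, written out by hand.
# The monthly sales figures are hypothetical, invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope, intercept, x):
    """Extend the fitted line to a new x value."""
    return slope * x + intercept


months = [1, 2, 3, 4, 5, 6]
sales = [102.0, 108.0, 123.0, 128.0, 141.0, 148.0]  # hypothetical units sold

slope, intercept = fit_line(months, sales)
next_month = forecast(slope, intercept, 7)
print(f"trend: {slope:.2f} units per month, month 7 forecast: {next_month:.1f}")
```

A straight-line trend like this is only a starting point, but it shows how a fitted model turns past numbers into a forward-looking estimate.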
Mike Delgado: Then you’re the only person in the company who can do this, and you can’t leave.
Matt Dancho: Job security, yes. It’s a natural output.
Mike Delgado: That is so funny. This is back in the day — like you said, before data was cool — and all of a sudden there are all these articles about big data. Data science is the hottest new position. What was going on in your field, in your mind, as this big data explosion was happening and you were already doing all this work?
Matt Dancho: For me, big data back then was, “OK, Excel is breaking for some reason. It’s because my file size got above 50 megabytes. Apparently, I need to do something differently.” Then you look for tools to help you do the job, and you’ve got Python, you’ve got R. R was something I was using, and once I went to R, big data for me was, “OK, pulling in a million rows, no problem.”
That wasn’t a challenge, and that was no longer big data. Now, you have these different misconceptions or conceptions of what people even call big data. How I defined it back in the day is no longer what big data is. Big data is terabytes now. You need distributed computing clusters (Hadoop, Azure, you name it) and all the buzzwords. For me and for most businesses, big data is maybe a couple million rows. That’s all manageable in a lot of these packages.
Mike Delgado: That’s fascinating. I know today’s topic is about the work you have done around helping recruiters reduce employee turnover through machine learning. This all started with this article that you wrote that just blew up, went viral. I wanted to talk with you about your process, because it’s just fascinating that this is a side project for you. You just love data science so much, and you’re like, “Oh, this is an interesting area. I want to just dig into this.” You write this lengthy blog post that goes into great detail on how you built out this model to help companies reduce attrition. Can you walk us through that article — why you started to do this and a little bit about your research?
Matt Dancho: First off, I am passionate about learning. I love to learn. That’s why I write these blog articles. If you go to our website, we have a ton of articles both on our software packages and case studies, or we’re getting really application-specific lately. But this HR analytics article actually started as a way to win a client. It was probably not how most people would think — “Oh, we’re just trying to give out information.” We were trying to win some business, and we ended up creating this article because we wanted to show off our capability. We said, “Hey, we’ve got this potential Fortune 500 customer that we need to pull in. We really want to do this job with them, and we think we can do that by writing this article.”
The client, the group within that company, was HR. So we saw a possible way to really show off our skill set if we could find a data set that was available to us. Unfortunately, there aren’t a whole lot of data sets out there in HR that companies are willing to share, so we had to do some searching. IBM Watson’s website had a nice data set. It was related to a very important topic, employee attrition or turnover. Initially we just wanted to beat Watson; again, it was kind of self-serving. We were trying to outperform the competition and see if we could do a better job not only predicting but also explaining employee turnover.
What ended up happening though … With these kinds of articles and any business science or data science problem, you have to do a lot of research, and you have to understand not only the problem at hand but how that influences or applies to the business. Like you said, we ended up with a lengthy article that starts with a business case and introduces the reader to that. But then we go through all these very cool, cutting-edge packages; H2O and Lime are the two we really leveraged. It ended up being a beautiful thing. We got the high accuracy. We got the ability to explain, so we actually beat IBM Watson’s model. I think they had like 85 percent accuracy, and we were around 88 percent with the H2O package. It ended up being a really cool thing.
Most importantly, though, we want to stress — and it’s in the article, the quote from Bill Gates. Doing that research, just understanding that Bill was saying that if you take away our top 20 employees, overnight Microsoft becomes a mediocre company. I think that’s what struck the biggest chord with readers, and that’s one of the reasons it went viral. We’re showing how you can apply data science and help solve a pretty important problem that most companies have.
Mike Delgado: It’s amazing the work that you did to get that data, because like you said, it’s very difficult to get any sort of company data on HR because of privacy concerns. But I love your approach, trying to solve a business problem and starting there. What were some of the initial questions you had about the data?
Matt Dancho: It’s really important to understand that most data is not perfect, especially in any business problem you’re getting ready to try to solve. There’s a big challenge, even with the data set we got from IBM. It was good. It had all sorts of features, things like work-life balance, how long the employee had been in their current position, and those are all really good things. But it’s really making sure that you have the right data to solve the question you’re trying to solve. For us, it was employee turnover. We’re trying to say, “OK, what factors influence employee turnover?”
Things like work-life balance and years in the current role were great. But the things that tend to have the most impact are things like clusters in the data, like groups, things that you can group people on. For example, what’s their job role? Are they a sales rep or a laboratory technician? You can group people on these things, and they typically can help provide insights into the data. These things are most interesting to the model and typically have some predictive capability, but it’s about trying to maximize those features that are going to help you understand that problem.
Mike Delgado: I remember when I was looking at some screenshots of your data set, you had a column for distance from work as one of the factors that you’re looking at. It’s amazing, the amount of data that you had to work with and then finding those clusters, those different positions, because each of those job roles could have different attrition rates.
Matt Dancho: That’s what we saw in the predictive explanation. Job role was one of the key factors. Certain roles had higher attrition rates; certain roles had lower ones. That’s an easy way to group your employees.
Mike Delgado: Do you want to share some of the insights you learned from this research that you think other organizations can find helpful?
Matt Dancho: Keep in mind that this data set from IBM is an artificial data set. So can you take it from the artificial to the actual, applying it to your particular business? You have to be careful, but there are definitely things that you can look at. Two of the features really jumped out to us. We already mentioned one, the job role. Sales reps had a higher rate of attrition. There are a few other job roles with high rates. I think you can generalize that to companies as a way to segment your data into groups.
The other one was the overtime. Again, artificial data set, but overtime was a key factor that came out of this analysis. A significant proportion of people in that attrition group were working overtime versus the nonattrition group. As a business looking to apply something like this, you can look at certain features like that. If you can collect that data, you can potentially use those features as good indicators — or, hopefully, good predictors — of your attrition.
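Editor’s note: the grouping Matt describes, comparing attrition across job roles and between people who do and do not work overtime, comes down to a grouped rate calculation. The sketch below is a minimal Python illustration under stated assumptions: the eight records are invented, they are not the IBM data set, and the helper name attrition_rate_by is hypothetical.

```python
from collections import defaultdict

# Hypothetical employee records, invented for illustration (not the IBM data set).
# Each tuple is (job_role, works_overtime, left_company).
records = [
    ("Sales Representative", True, True),
    ("Sales Representative", True, True),
    ("Sales Representative", False, False),
    ("Sales Representative", True, False),
    ("Laboratory Technician", False, False),
    ("Laboratory Technician", True, True),
    ("Laboratory Technician", False, False),
    ("Laboratory Technician", False, False),
]


def attrition_rate_by(rows, key_index):
    """Share of employees who left, grouped by the field at key_index."""
    left = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = row[key_index]
        total[group] += 1
        if row[2]:  # left_company flag
            left[group] += 1
    return {group: left[group] / total[group] for group in total}


by_role = attrition_rate_by(records, 0)
by_overtime = attrition_rate_by(records, 1)

for role, rate in by_role.items():
    print(f"{role}: {rate:.0%} attrition")
for overtime, rate in by_overtime.items():
    print(f"overtime={overtime}: {rate:.0%} attrition")
```

A gap between groups in a table like this is a descriptive signal only. Whether it predicts attrition in a real company still has to be checked on that company’s own data.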
Mike Delgado: This is such valuable information to help businesses keep their best employees. Going back to your earlier reference to the Bill Gates quote about becoming a mediocre company when you lose your best employees, this type of model that you built, when applied properly, could help organizations identify people who may be on the edge of leaving, like they’re working overtime or are burnt out.
Matt Dancho: Absolutely.
Mike Delgado: Their drive is far. If that can be signaled to their business leaders, like, “Hey, just an FYI.” I don’t know what that even looks like. “FYI, we think so and so …”
Matt Dancho: It’s going to seem a little Big Brother-y if it just comes out of the …
Mike Delgado: Right. I don’t know exactly how that works.
Matt Dancho: If you have an organizational culture that says, “Hey, we’ve got a good data management strategy. We’re looking into these types of things,” you’re communicating that with your business leaders. You can certainly work out systems to help prevent, especially your top talent … The key here really is, “Are they good employees?” And if they are, I mean, that quote from Bill Gates: if the top 20 employees leave, Microsoft becomes mediocre. That was probably in ’95, and it’s probably more like 2,000 or 20,000 by now. But definitely, I think you need to underscore the fact that if you’re losing good people, there’s a big problem, and this is potentially a way to help solve it, or at least head it off.
Mike Delgado: A lot of those things that you’re looking at in your model are quantitative. It was numbers-based.
Matt Dancho: Yes.
Mike Delgado: Have you ever looked at, or considered looking at, employee survey data? Every year, there’s a people survey, like how they’re feeling. It always gets a little tricky with survey data because people can lie. So that’s a little bit harder, I think.
Matt Dancho: Right, yes. Survey data is tough. On the one hand, you can ask everybody what their level of satisfaction is, and are they going to report what they truly believe, or are they going to report what their supervisor wants to hear? That’s one challenge. What we found is really good, and I’ve seen this before, is performance data. A lot of companies will have their performance data in, say, a paper report like a performance review that’s done once a year. If you can start to pull that data out of the unstructured format and get it into a cohesive managed data set, you can start to really get good things. The survey data becomes hit or miss, but if you can get stuff that’s a little bit more concrete or objective, where people have a set of rules that they’re trying to follow, that typically works better.
Mike Delgado: As you were developing this model and developing the blog post to share the insights learned, was there anything that surprised you with the data?
Matt Dancho: Well, the biggest surprises to me were really in the analytics. I’ve used H2O a couple of times, but this was my first time using what they just released. I don’t think it’s on CRAN yet, but it’s in the version that you can download from their website. When you click the install in R, it has some instructions for it, but they have this new algorithm called automated machine learning. AutoML is the function name. Just running that … You have to understand that we normally have to do quite a bit of work to try to tune the models and make sure we’re getting the highest predictive accuracy, so this saved a ton of time, really getting an accurate model very quickly. Again, the accuracy on our model was around 88 percent, which is …
Mike Delgado: Fantastic.
Matt Dancho: Yeah, which was pretty good for this type of data set. The IBM Watson model — I don’t know if they had it tuned for optimal accuracy, because they were trying to also be able to explain the information. But their model was around 85 percent accuracy. Right out of the gate, this thing was very good, so that was a huge surprise to me. The second one was using Lime, which is relatively new. It’s available in both Python and in R, but it’s really a tool for explaining what drives a machine learning classifier. This was my first experience using Lime.
The gentleman, Thomas Lin Pedersen, just ported it over to R quite recently. In the post, we had to write a few functions to get H2O and Lime to play together nicely. But we got it working, and now it’s actually integrated into Lime. I work with Thomas, and we were able to get all the stuff integrated so people don’t have to write those extra functions to get them to connect up. But Lime was incredibly useful. It gave a nice feature visualization of the variable importance plots. That really enabled us to pick out many of the influential variables and really see, “Oh, wow. Job role is a huge factor here. Those sales reps are turning over a lot more frequently than some of the other roles.” That was really cool. That was a huge surprise, just two surprises in one post.
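Editor’s note: the time saving Matt attributes to automated machine learning comes from fitting several candidate models and ranking them on held-out data instead of tuning each one by hand. The sketch below is a toy illustration of that idea in plain Python. It is not H2O’s AutoML and does not use the H2O or Lime APIs. The data rows, the candidate names, and the helpers train_majority, train_one_rule, and accuracy are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows, invented for illustration: ({feature: value}, left_company).
train = [
    ({"role": "sales", "overtime": "yes"}, True),
    ({"role": "sales", "overtime": "yes"}, True),
    ({"role": "sales", "overtime": "no"}, False),
    ({"role": "sales", "overtime": "no"}, True),
    ({"role": "lab", "overtime": "no"}, False),
    ({"role": "lab", "overtime": "no"}, False),
    ({"role": "lab", "overtime": "yes"}, True),
    ({"role": "lab", "overtime": "no"}, False),
    ({"role": "lab", "overtime": "no"}, False),
]
valid = [
    ({"role": "sales", "overtime": "yes"}, True),
    ({"role": "sales", "overtime": "no"}, False),
    ({"role": "lab", "overtime": "yes"}, True),
    ({"role": "lab", "overtime": "no"}, False),
    ({"role": "sales", "overtime": "no"}, True),
]


def train_majority(rows):
    """Baseline: always predict the most common label seen in training."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: most_common


def train_one_rule(feature):
    """Return a trainer that predicts the majority label for each value of one feature."""
    def trainer(rows):
        votes = defaultdict(Counter)
        for features, label in rows:
            votes[features[feature]][label] += 1
        table = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
        fallback = train_majority(rows)  # used for feature values unseen in training
        return lambda features: table.get(features[feature], fallback(features))
    return trainer


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(features) == label for features, label in rows) / len(rows)


candidates = {
    "majority_baseline": train_majority,
    "one_rule_overtime": train_one_rule("overtime"),
    "one_rule_role": train_one_rule("role"),
}

# Fit every candidate on the training rows, score it on the validation rows,
# and sort from best to worst accuracy.
leaderboard = sorted(
    ((name, accuracy(trainer(train), valid)) for name, trainer in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in leaderboard:
    print(f"{name}: {score:.0%}")
```

Real automated tools search over far richer model families and settings, but the ranking step is the same in spirit: the model at the top of the leaderboard is the one carried forward.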
Mike Delgado: Listening to you talk, what I love is how curious you are with data, how hard you worked to clean up that data, to work with that data and then, when you can’t find certain answers, you’re going out of your way to say “What do I need to use?” H2O, Lime, and learning it in real time so you can fix things. Then you’re also contributing it to the community.
Matt Dancho: Yeah. The community is a huge thing. It’s one of the fundamental principles for Business Science, and I personally feel like there’s not a whole lot of data science going on out there in businesses in general. It’s starting to become more prevalent. We really want to spread data science to organizations that don’t have the data science teams or the capability in-house. It’s a big problem, and we’re trying to solve it. We do a lot of stuff for free just to put it out there and expose people to some techniques, possibly at the expense of a little bit of a competitive edge. But we feel that it’s well worth the time and effort and energy that we put into it.
Mike Delgado: Well, you’re doing a phenomenal job, and I love that you’re just sharing this out. For everyone who wants to learn more about Matt, you definitely need to go to business-science.io. I’ll put it on the screen here. Check out his articles. Also make sure that you’re following him on LinkedIn. Before we go, Matt, I have four last questions. These are questions that we get a lot. The first one is: What advice do you have for people who are interested in getting started with data science?
Matt Dancho: It’s a long process, but it’s a rewarding process. I hate to say it, but you’re never done, so it continues in perpetuity. But just start by learning the basics, visualizing data, tidying and wrangling it, then move to the more advanced applications like modeling. When you get into modeling, just start with the basics, linear models. Really understand what drives those, and then move to the XGBoost or the H2Os or some of the other algorithms out there.
The best reference that I can provide at the moment for data science is a free book. It’s developed by Hadley Wickham and Garrett Grolemund at RStudio. I am also pleased to announce that starting in 2018, we’ve got a big push for developing some application-specific courses and really trying to help data scientists who want to apply these concepts and skills in the business environment. We’re going to try to teach those individuals how to do so.
Mike Delgado: That’s awesome. So everyone stay tuned. I’ll make sure to include a link in the about section of these videos, as well as the comments, so you can learn more about what Matt is doing in this space. Matt, another question that we get is: What programming language should I learn first, and what’s your recommendation?
Matt Dancho: That’s a tough one. Naturally, I’m going to say R because that’s the one I use. I think it depends who you are and what you like more. For me and for business, I think R is a little better because it has some tools for visualization such as ggplot2. It’s got a very nice workflow with data science in the Tidyverse, which is the R version of how to slice and dice and work with data. If you want to get into computer vision and those types of things, you might be better off learning Python out of the gate, and that’s perfectly fine. You can always go to a different language. The most important thing is to start learning and start learning now.
Mike Delgado: Great advice. You mentioned that a lot of businesses aren’t implementing data science in their organizations. What advice do you have for leaders who are looking to hire a data science team?
Matt Dancho: This is a tough one. One of the reasons we developed Business Science as a consultancy is to help with this. There are three things. In order to even have a data science team, your organization has to be educated. It has to have a data management strategy, and you have to have that data science capability. For the education aspect, you really don’t want to hire data scientists without first educating senior management on what data science is, how data fits into that equation, and what kind of answers you can get and how data is integrated into that. Because if you don’t have good data, you can’t get good results. The organization needs to be educated on that first.
The other thing is the data management strategy. Unless organizations are capturing that critical data, they can’t do an effective job with a data science team. You can’t even get to that point yet. Once you have that data management strategy developed throughout the organization, that last part is the data science team. You have the culture now, you have the data management strategy, so you just need good people. You hit the nail on the head when you said curiosity early on. That’s a huge part of hiring anyone. You want to make sure they’re curious and passionate about learning, and they really buy into digital marketing or HR or whatever capacity that you’re looking at applying data science.
Get people who are excited about it. Don’t worry about getting unicorns — mythological data scientists who can do everything. They don’t exist, and if they did, you couldn’t afford them. Focus on getting individuals who are good complements to your team, have skills you’re looking for in particular areas, and are excited about learning.
Mike Delgado: Awesome. Before we go, where can everyone go to learn more about you?
Matt Dancho: We have business-science.io. If you go to that website, you’ll find all my content, and all Davis’s content. Davis is our software manager. He’s the creator of tibbletime. You’ll learn about him as well. There’s a wealth of information, so check that out. Other ways that you can find out about me — I’m pretty active on LinkedIn and Twitter. You can get me at MDancho84, which is my Twitter handle. I love connecting, especially with the R community, anything related to AI, machine learning or business. Just feel free to hit me up.
Mike Delgado: Awesome. I’ll make sure to put links to your LinkedIn profile, your Twitter handle, and business-science.io in the about section of the YouTube video and in the comments of the Facebook Live video so everyone can connect with Matt and keep up with what he’s doing. I want to let everyone know that we do this show every week. We talk with different data scientists, and we talk about things that are happening, whether it’s machine learning, artificial intelligence, or the internet of things. Matt, it’s been a blast chatting with you. Thank you so much for sharing your insights with us and all the helpful work you’re doing for our community and the coolest blog post about how HR can use data science to help keep good employees.
Matt Dancho: If you’re interested in HR, definitely check out the blog post. Mike, it’s been a blast talking with you too. I’m really glad we could do this.
Mike Delgado: Awesome. For everyone here, thank you so much for watching. For the podcast listeners, thank you for listening. We’ll be back next week. To learn more about upcoming broadcasts and to see past broadcasts, just go to Experian.com/datatalk. We’ll see you next week.
Matt Dancho is the founder of Business Science LLC and has successfully implemented cutting-edge data science techniques in finance and marketing. He earned a Master of Business Administration degree from Penn State and a Master of Science degree in industrial engineering from the University of Pittsburgh.