Carla Gentry, a self-proclaimed data nerd who even uses the term as her Twitter handle, has been working in data science for 16 years for Fortune 100 and 500 companies such as Johnson & Johnson, Hershey, Kraft, Kellogg’s, and Firestone, as well as the University of Chicago and the University of Tennessee. She has a data science blog and also curates the Analytics paper.li and scoop.it social media channels. In 2013, InformationWeek named her one of “10 IT Leaders to Follow on Twitter.” “If you met me in a bar, you’d never think I was a data scientist,” she says, describing herself as a motorcycle-riding, bird-watching nerd. “Until I sit down and talk zettabytes of data.”
What advice do you give to companies that want to start getting involved in data science or big data? Particularly people who've been focused on the collecting but aren't sure how to proceed to the analysis?
You need to get someone who has a solid analytical background, but also has business knowledge and IT knowledge. If you can get those, it’s the trifecta. You have to be able to not just program, but also be the liaison between IT and business. You need to know about things like servers, nodes, and toggling. If your server is not running efficiently, your program is going to take longer to run. People join all these databases and run this huge query and wonder why it takes three days.
It’s also important to clean and index your data. You might have a field for the date, but someone in data entry has messed up and now you have a row skewed to the right, so it’s going to come up as “NA.” In analysis, NA means “nothing.” If you’ve got 20 million records and 19 million are NAs, you can’t make any conclusions about the 1 million you have.
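As a rough illustration of the point about NAs, here is a minimal Python sketch that measures how much of a date field is unusable before any analysis is run. The field names, the `%Y-%m-%d` date format, and the sample rows are assumptions made up for this example, not details from the interview.

```python
from datetime import datetime


def parse_date(raw):
    """Return a date, or None when the value is empty or malformed."""
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except (ValueError, AttributeError):
        # ValueError: text that is not a date; AttributeError: value is None.
        return None


def missing_share(records, field):
    """Fraction of records whose `field` cannot be parsed as a date."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if parse_date(r.get(field)) is None)
    return missing / len(records)


# Hypothetical sample rows; the third has its values shifted one column over,
# so the date field holds an amount instead of a date.
rows = [
    {"order_date": "2013-04-02", "amount": "19.99"},
    {"order_date": "2013-04-03", "amount": "5.00"},
    {"order_date": "19.99", "amount": ""},
    {"order_date": "", "amount": "12.50"},
]

share = missing_share(rows, "order_date")
print(f"{share:.0%} of order_date values are missing or malformed")
```

Running a check like this on each field first shows whether enough usable records remain to support any conclusion.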
It’s very intensive to run analysis on billions of rows of data. Excel will handle a million rows, but if you do a pivot table and go nine rows deep, most people, unless they have a dual processor and a lot of memory, are going to see a blue screen of death. Not everyone is able to handle large amounts of data. You may even have to go to a mainframe.
I use all forms of media, since data science is not just a Twitter or Facebook thing. I want everyone to see my comments, from the CEO down to the beginning analyst. I use social media as a way to get the word out about data science and women in technology. I want to make sure the ladies out there know it is possible. I was divorced at 23, and I moved to Chicago with two little kids and a Geo Metro hooked to the back of a trailer. Knowing there are other people out there—single moms, in poverty—who came out of it and are successful, lets others know it can be done.
Social media is not as time-consuming as people think. I’ve been doing this for 20 years, so I can get a hundred stories in seconds and Tweet them to people who want to see them. I’m not one of those “follow-back” fans [those who follow people on Twitter just because they follow you]. I specifically have a target audience I’m looking at. That’s how you get listed as a “top 10 data scientist” and a “top 100 thought leader.” You give good content, you’re receptive, you respond to other people, and you engage with them.
What's the biggest mistake you see companies making with data science?
Not having an experienced staff, or people thinking they are data scientists when they are far from it. It takes many years and lots of college to really be able to call yourself a data scientist!
Manipulating and running analysis on the data is hard. It’s not enough when you get in front of VPs to say the analysis was successful. That’s where the business knowledge comes in. It’s not just that “45 percent of our potential customers are female and 55 percent are male.” You have to do analysis that goes 9 or 10 tiers deep, like “My best customer is 45, has a dog, lives in a two-story house and drives a Cadillac.” To really understand the data you’re generating, you need to start at the very beginning of the process. These are the normal steps I follow when starting a data science project:
1. Exploratory: What is it that you want answers to? For example: “Who are my best customers? Where should I advertise my new product? Where should I place my new product (store and product analysis)?”
2. Gap report: What do you have, vs. what do you need? For example, I may have sales information, but since it’s a new product, I need behavioral or demographic data to know what potential sales might be. Gap analysis forces a company to reflect on what it is and ask what it wants to be in the future.
3. Attaining data you don’t have: Open source, purchase data, build your own data from public sources. Factors include cost, time to receive, loading data, analysis, presentation, and so on. We are collecting more data than ever before, yet many organizations are still looking for better ways to obtain value from their data and compete in the marketplace.
4. Analysis: This isn’t something that’s done in Excel. You need an experienced staff and the right hardware and software. Complex analysis may run overnight, so make sure you have enough oomph to do it right. You may need to purchase additional equipment or staff.
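The gap report in step 2 can be sketched as a simple comparison of the data a question needs against the data already on hand. This is only a minimal Python illustration; the function name `gap_report` and the dataset names are hypothetical and do not come from the interview.

```python
def gap_report(needed, available):
    """Compare the datasets a question needs with the ones already held."""
    needed, available = set(needed), set(available)
    return {
        "have": sorted(needed & available),       # needed and already held
        "missing": sorted(needed - available),    # must be obtained (step 3)
        "unused": sorted(available - needed),     # held but not used here
    }


# Hypothetical example: launching a new product with only sales data on hand.
needed_fields = ["sales", "demographics", "behavior"]
available_fields = ["sales", "store_location"]

report = gap_report(needed_fields, available_fields)
for status, fields in report.items():
    print(f"{status}: {', '.join(fields) or 'none'}")
```

The "missing" entries become the shopping list for step 3, where cost and time to receive each dataset are weighed.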
Data available but not used is a wasted resource!