Kirk Borne is a Multidisciplinary Data Scientist and an Astrophysicist. He is Professor of Astrophysics and Computational Science in the George Mason University School of Physics, Astronomy, and Computational Sciences (SPACS). Previously, he spent nearly 20 years in positions supporting NASA projects, including serving as NASA's Data Archive Project Scientist for the Hubble Space Telescope and as Project Manager in NASA's Space Science Data Operations Office. He has extensive experience in Big Data and Data Science, including expertise in scientific data mining. He has advised several federal agencies on data mining and Big Data applications, including the Executive Office of the President, the Library of Congress, the National Weather Service, the FDA Office of Drug Safety, and the NITRD Big Data Senior Steering Group. He is currently working on research, design, and development for the proposed Large Synoptic Survey Telescope (LSST). He established the field of Astroinformatics: Data Science for Astronomy Research and Education.
How do you get started with data mining? Everyone says there are all these nuggets buried in there; how do you go after finding them?
Data mining is the application of machine learning algorithms to large datasets. So, the first thing one needs to do is to learn what those algorithms are, how to use them, when to use which ones, how to prepare data for each case, how to measure the accuracy of and validate models, and how to interpret (make sense of) the results. The data preparation phase is usually the most time-consuming, and so one must spend time with the data (or the data providers) — find out about the data — do some "data profiling"! The latter includes a lot of data visualization, previewing, browsing, statistical measures, and exploration functions.
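As a minimal illustration of the "data profiling" step described above, the sketch below inspects a small synthetic dataset with pandas. The column names, sizes, and the simulated missing values are all invented for this example; they are not from Borne's work.

```python
# Hypothetical data-profiling sketch: dataset, column names, and the
# fraction of missing values are all invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "brightness": rng.normal(20.0, 1.5, 500),
    "color_index": rng.normal(0.6, 0.2, 500),
})
# Simulate gaps in the data, a common issue profiling should surface.
df.loc[rng.choice(500, 25, replace=False), "color_index"] = np.nan

print(df.shape)            # overall size of the dataset
print(df.isna().sum())     # missing-value counts per column
print(df.describe())       # summary statistics (mean, std, quartiles)
```

Profiling like this (sizes, gaps, distributions) is typically done before any mining algorithm is chosen, since it drives the data preparation work the answer calls the most time-consuming phase.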
Once you have some of those items understood, then you can do some experimentation. It is good to start with a plan (a strategy), so I suggest that one experiment with the above algorithms sequentially: first characterization (unsupervised learning), then categorization (pattern detection and segregation), and then classification (supervised learning).
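The suggested progression (unsupervised characterization first, then supervised classification) can be sketched with scikit-learn on a synthetic dataset. Everything here is illustrative: the data generator, the choice of k-means and a random forest, and all parameters are assumptions, not the interviewee's specific methods.

```python
# Illustrative sketch of the suggested experimentation sequence:
# 1) characterization (unsupervised), then 2) classification (supervised).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Step 1: unsupervised characterization — let clustering segment the data
# without using any labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised classification — train on labeled examples and
# validate the model on held-out data, as the answer recommends.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
print(f"held-out accuracy: {acc:.2f}")
```

The held-out split matters: measuring accuracy on data the model never saw is the model-validation step mentioned earlier in the answer.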
How do you know when you need a data scientist?
In addition to the algorithms, strategies, and why/how issues that I mentioned earlier, a successful data scientist must first and foremost be a scientist. That means to be curious, to ask questions, to be a problem-solver, to learn how to develop and test hypotheses, to learn how to model a problem and test the model, and to be a tenacious skeptic! The scientist won't accept the first answer as the final answer — that is almost never the case in science.
So, you know you need a data scientist when you recognize that you don't already have someone on staff with the above skills, knowledge, and talents. Of course, there are plenty of people with a subset of these skills and talents, but there are few people with all of them. If that were not true, then there would be no discussion today about the "data scientist talent shortage." But the reality today proves there is a real data scientist talent shortage, and it's okay to admit that you need help (from a data scientist) to achieve the full benefits of data science: data-to-knowledge transformation, discovery, innovation, evidence-based decision support, and deeper insights into your business.
One more critical activity of the data scientist is to learn about the Big Data owner's subject matter (their data sources, value chain, use cases, customer base, business goals, etc.), to study those things inside-out, to talk with subject matter experts (SMEs), and perhaps to become an SME yourself. Then the data scientist starts doing the data science, while at all times actively communicating and collaborating with the key stakeholders. For this to be successful, that other elusive skill of the scientist is needed: good communication and storytelling with your data!
Can you speak to the importance of recognizing patterns and acting upon them to increase efficiency? How does your research color your perception of how companies can use case studies to understand how to best leverage innovation?
The beauty of data-driven discovery and innovation is that you can automate some of the steps in the process — after you have achieved some degree of confidence that the algorithms you are using are answering the business questions, achieving the business goals, and are readily interpretable, applicable, and usable in appropriate business contexts. Once that level of confidence is reached, then one can automate many of the data processing and knowledge discovery steps, thus increasing efficiency in business processes.
However, I should point out that automation is not required to increase efficiency — the whole data-driven discovery process should lead to much better (and new) answers faster and more efficiently than ever before. Instead of having an army of analysts examining every detail of your data with standard business reporting tools, you can let a well-tuned data mining algorithm in the hands of an expert data scientist do the "walking through your data" for you.
One of the major features of the new Big Data paradigm is that we are moving from the old way of working with business data (for hindsight and oversight) to the new Predictive Analytics: for prediction and foresight! A daily dose of precise just-in-time predictions and actionable intelligence will change any CXO's perception of the power, efficiency, and efficacy of employing data mining and data scientists in their business practice.
The very same types of discoveries that I am attempting to achieve from big astronomy data sets are also applicable to companies of all kinds: finding patterns, trends, correlations, clusters, new classes of behavior, outliers (surprising, novel, unexpected events and behaviors), and more!
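One of the discovery types listed above, finding outliers (surprising, unexpected events), can be sketched with a standard anomaly-detection algorithm. The isolation-forest choice, the synthetic data, and the contamination rate below are all illustrative assumptions, not a description of the LSST pipelines.

```python
# Hedged sketch of outlier detection, one of the discovery types named
# in the interview; the data and parameters are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))    # typical behavior
outliers = rng.uniform(6.0, 8.0, size=(5, 2))   # surprising, novel events
X = np.vstack([normal, outliers])

# fit_predict returns -1 for points judged anomalous, +1 otherwise.
labels = IsolationForest(contamination=0.025, random_state=0).fit_predict(X)
print("flagged as outliers:", int((labels == -1).sum()))
```

The same pattern applies whether the "surprising events" are transient sky objects or unusual customer transactions, which is the point of the answer: the algorithms transfer directly from astronomy to business data.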
The list of possible discoveries and innovations that lead to discoveries is limitless — I would love to be part of that discovery and innovation process anywhere, in any context, for any type of organization. Most (probably all) data scientists feel the same way — discovery is super-exciting and addictive! We are all data-lovers!
Simplicity 2.0 is where we examine the intricate and transitory world of technology—through a Laserfiche lens. By keeping an eye on larger trends, we aim to make software that’s relevant to modern day workers, rather than build technology for technology’s sake.