Over the last few decades, the stream of data available to life sciences companies has grown from a trickle to a tidal wave: genetic and genomic portraits of individual patients, metabolomic and proteomic profiles, real-world data from wearables measuring everything from heart rate variability to blood glucose levels, detailed patient clinical histories from electronic health records.[1] The total volume of health data in the world is expected to soar to 2,314 exabytes by 2020, 15 times what it was in 2013. By some estimates, if this data were stored in a stack of tablet computers, the stack would reach 82,000 miles high.[2]
Data analysis has flourished, too. Alongside classical statistics, powerful artificial intelligence technologies have emerged that can manipulate massive numbers of inputs and curate data stored in non-standard formats. Take, for instance, the more than 700 different ways researchers have historically recorded gender in clinical trials.[3] One branch of AI, machine learning, can identify patterns in data without any starting hypotheses, which means there is no need to make prior assumptions about what surprises might be lurking there.
The new AI tools, combined with the boom in healthcare data and the rise of personalized medicine, will transform clinical trials and drug discovery. The McKinsey Global Institute estimates that AI could add $100 billion in value to the life sciences industry annually.[4] Researchers are already using machine learning tools in combination with statistical analysis to uncover new biomarkers and other patterns in vast repositories of -omics data and clinical histories. Medidata’s Rave Omics application, for example, has uncovered critical insights for rare disease research. Life sciences companies are also beginning to use AI to ensure clinical trials produce regulatory-quality data by sorting and classifying data entry errors, outliers, inconsistencies and misreported adverse events, which should speed up the drug approval process.
And yet, most life sciences companies still aren’t using AI approaches for data analysis to their fullest potential. That’s partly because AI is new, and partly because the FDA hasn’t sanctioned it for drug safety and efficacy approvals. But it may also reflect a lack of understanding about what AI can do and how it differs from statistics.
One succinct way to describe the distinction between the two: statistics accomplishes what is hard for humans and easy for computers, whereas artificial intelligence tackles things that are hard for computers and easy for humans. The former spits out p-values; the latter wrestles with tasks such as speech recognition and image recognition. (Is that a turtle or a gun?) Machine learning combines AI with statistics, tackling the things that are hard for both computers and humans.
What is statistics?
Classical statistical modeling techniques were developed between the 18th and early 20th centuries to study, quantify and describe populations, economies, and moral actions.[5] They were generally designed for much smaller datasets than those currently available, however.[6] The discipline surged in popularity in the 1980s with the resurgence of Bayesian modeling, which lets statisticians update probability estimates as new evidence arrives.
Statistical modeling became essential to drug development after the 1962 Kefauver-Harris Amendments to the U.S. Federal Food, Drug, and Cosmetic Act, which required drugs approved for the market to show proof of efficacy. Today statistics is commonly used to evaluate how much better a therapy works than a placebo or the standard of care in treating a patient population.
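To make the therapy-versus-placebo comparison concrete, here is a minimal sketch of one classical approach, a permutation test on the difference in mean outcomes. The function name, the sample values, and the choice of a one-sided test are assumptions made for illustration only; the numbers are invented, not trial data, and a real analysis would follow a pre-specified statistical plan.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(treated, control, n_permutations=10_000, seed=0):
    """Return (observed difference in means, one-sided p-value)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = _mean(treated) - _mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_treated]) - _mean(pooled[n_treated:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical symptom-score improvements (higher is better).
therapy = [12.1, 9.8, 11.4, 13.0, 10.7, 12.6, 11.9, 10.2]
placebo = [8.9, 9.5, 7.8, 10.1, 8.4, 9.0, 9.9, 8.2]

effect, p = permutation_test(therapy, placebo)
print(f"mean improvement over placebo: {effect:.2f}, p = {p:.4f}")
```

The p-value here estimates how often a difference at least this large would appear if the therapy had no effect at all, which is the kind of question classical statistics is built to answer.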
Statistics is designed to make inferences about the relationship between variables, that is, to determine an input variable’s impact on the output variable. But it is less suited to large data sets with vast amounts of input data where the relationships between variables are unknown: evaluating the statistical significance of each input variable becomes cumbersome and unwieldy. Statistical modeling also requires the statistician to make explicit assumptions about the problem or question being analyzed, especially about data distributions, before the models are run.
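As a small illustration of estimating one input variable's impact on an output, the sketch below fits a straight line by ordinary least squares. The dose and response values are hypothetical, and the linear form of the relationship is itself an assumption chosen before fitting, which is exactly the kind of up-front modeling decision described above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical dose (mg) and measured response; not real data.
dose = [10, 20, 30, 40, 50]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(dose, response)
print(f"estimated change in response per mg: {slope:.3f}")
```

With a single input this is trivial; the difficulty described above arises when thousands of candidate inputs each need their own model term and significance check.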
What is artificial intelligence?
Although artificial intelligence has become something of a buzzword in the past decade, it dates to the invention of modern computing, so it’s no newcomer to the field of analytical modeling. AI aims to understand human intelligence—particularly human skills such as recognizing objects and sounds, speaking, translating, performing social transactions or creative work—in order to replicate this intelligence in machines.
In life sciences, AI can be taught to differentiate cancer cells in a laboratory, to identify patterns in high-quality medical images such as X-rays, and to analyze complex sets of genomic data. AI analytics can also rapidly combine consumer data, treatment data, diagnoses, lab tests, and other information stored in natural language to identify unexpected or novel patterns and to predict treatment responses and patient behavior.
What is machine learning?
Machine learning is a subfield of computer science and artificial intelligence that aims to build systems that can learn from data, rather than just follow explicitly programmed instructions. Machine learning was made possible by cheap computing power and the availability of massive amounts of data from which computers could “learn.”
Machine learning is built on a foundation of statistical inference, but it requires fewer preset assumptions about the form of the relationships in the data. That lets computers discover insights and make classifications that human analysts couldn’t anticipate and, in some tasks, generate predictions more accurate than those of human experts.[7]
There are several types of machine learning, including supervised learning, unsupervised learning and reinforcement learning. With supervised learning, the computer is fed training data that includes the answer to the problem posed by the data set, which teaches it to make predictions about future data sets.[8] With unsupervised learning, no output or answer data is included, but the algorithm can still find patterns and groupings in the data.[9] Reinforcement learning, inspired by behavioral psychology, involves providing rewards and punishments to the computer to teach it to achieve a certain objective.[10] This is among the techniques used by AlphaGo, the program from Google’s DeepMind, to beat a human Go champion.
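A toy example may help make supervised learning concrete. The sketch below uses a k-nearest-neighbors classifier, chosen here only because it is short; the two biomarker measurements, the "responder" labels, and the function name are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a new point by majority vote among its k closest training examples."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Labeled training data: each point is (biomarker A, biomarker B).
# The labels are the "answers" the computer is given up front.
train_points = [(1.0, 1.2), (1.3, 0.9), (0.8, 1.1),
                (3.9, 4.2), (4.1, 3.8), (4.3, 4.0)]
train_labels = ["non-responder"] * 3 + ["responder"] * 3

# Predict the label for two patients the model has not seen before.
prediction_high = knn_predict(train_points, train_labels, (4.0, 4.1))
prediction_low = knn_predict(train_points, train_labels, (1.1, 1.0))
print(prediction_high, prediction_low)
```

No rule for separating responders from non-responders is written by hand; the prediction comes entirely from the labeled examples.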
Unsupervised learning might take the form of processing -omics data to generate relevant clusters or associations in the data. For data quality applications, it could aid in association mapping: looking at an entire database, in an unassisted way, and identifying the relationships between data points. This could be used to identify unanticipated inconsistencies in a data set that could otherwise cause compliance problems.[11]
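The clustering idea can be sketched with k-means, one common unsupervised algorithm (used here as an illustrative assumption, not necessarily what any particular product runs). The samples are hypothetical expression levels for two genes, and no labels are supplied.

```python
import math


def k_means(points, k, n_iterations=20):
    """Group points into k clusters; returns (centroids, cluster index per point)."""
    # Deterministic start: pick initial centroids spread across the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    assignments = [
        min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
    ]
    return centroids, assignments


# Hypothetical (gene A, gene B) expression levels per sample; no labels given.
samples = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.3), (5.3, 4.9), (1.2, 0.8), (4.8, 5.1)]

centroids, assignments = k_means(samples, k=2)
print(assignments)
```

The algorithm recovers the two groups from the measurements alone; deciding what those groups mean biologically remains a job for the analyst.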
With data volumes increasing at an exponential rate, it is becoming increasingly difficult for life sciences companies to keep up. Machine learning algorithms have huge potential to help with analyzing data and deciding which pieces of information are relevant, helping to draw insights from massive data volumes. It’s an approach already being used in other fields and industries[12] and has tremendous potential in clinical research. Expect to see a combination of statistics and machine learning powering the clinical trial of the future.
[2] https://www.cio.com/article/2860072/healthcare/how-cios-can-prepare-for-healthcare-data-tsunami.html
[6] https://www.mckinsey.com/industries/high-tech/our-insights/an-executives-guide-to-machine-learning