World of methodologies | Unstructured data and Machine learning in Real-World Evidence studies
We start this year in the World of Methodologies series with a hot topic: using unstructured data to generate real-world evidence (RWE). This blog post is authored by Medaffcon’s Junior Data Scientist Olivia Hölsä, who has expertise in text mining and machine learning. In this post, we look at the special characteristics of unstructured data and, more specifically, of real-world text data.
In the previous World of Methodologies posts, Medaffcon’s Data Analysis Lead Iiro Toppila has explained the basic concepts of machine learning and machine learning-based classification. Briefly, a machine learning model learns by example, and therefore a large amount of data covering sufficiently varied examples is needed. As real-world data (RWD) stored in registries varies in type, amount, and coverage, the data type plays an important role in machine learning.
“The wide range of real-world data types challenges the conduct of RWE studies.”
Structured data is represented in a tabular or some other standardized format. RWE studies usually rely on structured data because it is cost-effective to obtain and straightforward to use. For example, data on patients’ diagnoses, healthcare contacts, and medicine purchases are available in a structured format.
The potential of unstructured data
Electronic medical records, however, include lots of interesting and useful data in an unstructured format. Clinical notes written by a physician during a patient encounter are an example of unstructured text data. Clinical notes may contain information on patients’ risk factors or treatment responses. Medical imaging examinations are another example of electronic medical record data stored in an unstructured format.
“Unstructured data is a largely unexplored area in the national registries and hospital data lakes.”
Unstructured data is already used in RWE studies. However, the difficulties begin as early as data extraction. Traditionally, unstructured data has been collected manually from electronic medical records for research purposes. Because manual collection is time-consuming, unstructured data is often left out of studies.
Machine learning methods show promise for extracting data that is available only in an unstructured format. Automated, machine learning-based methods are increasingly replacing manual collection of unstructured data. For example, machine learning-based text classifiers can extract smoking status or sites of metastases from clinical notes, removing the need to review the notes manually.
Natural language processing (NLP) and machine learning
In the World of Methodologies blog, we refer to methods that automate data extraction from text as natural language processing (NLP), an umbrella term for all computational methods that process or analyze text. Over the past decade, rapid progress in machine learning has carried over to NLP, where machine learning methods are now widely applied.
Applying machine learning-based NLP methods usually follows three steps:
1. Text preprocessing
Text preprocessing usually precedes the analysis of the text. The most common preprocessing steps are splitting the text into meaningful pieces (e.g., words), removing special characters, converting uppercase letters to lowercase, reducing words to their root form, and removing common words that carry little meaning (such as pronouns, conjunctions, and auxiliary verbs). These steps simplify and normalize the text, because a computer does not understand text but processes it as a sequence of characters. For example, a computer treats “Medicine,” “Medicines,” and “medicine” as different words because their character sequences differ.
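To make the preprocessing step concrete, here is a minimal Python sketch. It is not the pipeline used in any particular study: the stop-word list is a tiny illustrative assumption, and `naive_stem` is a hypothetical, deliberately crude stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are
# much longer and language-specific).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "was",
              "has", "have", "he", "she", "it", "of", "to"}


def naive_stem(token: str) -> str:
    """Crude stand-in for stemming: strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, drop special characters, tokenize, remove stop words, stem."""
    lowered = text.lower()
    # Replace everything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    tokens = cleaned.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("Medicine, Medicines and medicine"))
# -> ['medicine', 'medicine', 'medicine']
```

After preprocessing, the three spellings from the example above collapse into the same token, which is exactly the normalization the step is meant to achieve.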
2. Text vectorization
In text vectorization, a text is converted into a structured, numerical format. The simplest vectorization method is to count the occurrences of words in a text. More complex methods apply machine learning to the vectorization itself, for example by learning a vector space in which similar words lie close to each other.
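The simplest method, counting word occurrences, can be sketched in a few lines of Python. The function names and the two toy documents below are invented for illustration; in practice a library vectorizer would typically be used instead.

```python
from collections import Counter


def build_vocabulary(documents: list[list[str]]) -> list[str]:
    """Collect every distinct token and fix an alphabetical order."""
    return sorted({token for doc in documents for token in doc})


def count_vectorize(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Return one count per vocabulary word; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Two already-preprocessed toy documents (invented for this example).
docs = [
    ["patient", "smoke", "daily"],
    ["patient", "never", "smoke"],
]
vocabulary = build_vocabulary(docs)
vectors = [count_vectorize(doc, vocabulary) for doc in docs]

print(vocabulary)  # -> ['daily', 'never', 'patient', 'smoke']
print(vectors)     # -> [[1, 0, 1, 1], [0, 1, 1, 1]]
```

Each document becomes a row of numbers with one column per vocabulary word, which is the tabular shape that structured-data methods expect.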
3. Selection and training of a machine learning model
After preprocessing and vectorization, the text can be fed to machine learning models in the same way as any other structured data. The model is selected according to the problem to be solved. In other words, the model should solve the defined NLP problem as well as possible.
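As a rough illustration of the training step, the sketch below fits a small multinomial naive Bayes classifier on word counts to predict smoking status. Naive Bayes is only one possible model choice, picked here because it is short to write out. The class name, the four training notes, and the labels are all made up for this example, so it shows the mechanics only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for tokens in documents for t in tokens}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                if token in self.vocabulary:
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, already-tokenized toy notes; not real clinical data.
train_docs = [
    ["patient", "smokes", "daily"],
    ["current", "smoker", "ten", "cigarettes", "daily"],
    ["patient", "never", "smoked"],
    ["non", "smoker", "never", "smoked"],
]
train_labels = ["smoker", "smoker", "non-smoker", "non-smoker"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict(["smokes", "daily"]))   # -> smoker
print(model.predict(["never", "smoked"]))   # -> non-smoker
```

A real application would need far more labeled notes, a held-out test set, and a comparison of several model types before the classifier could be trusted.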
Commonly known machine learning applications in the NLP context are chatbots, language translators, and junk mail sorters. They do not work flawlessly, but properly used, they are excellent tools. In other words, the saying “a good servant but a bad master” is also relevant in the NLP context.
RWE, text data, and NLP
I mentioned above some examples of unstructured RWD. Clinical notes, or at least certain parts of them, are available for research purposes in Finland. Excerpts of clinical notes may be provided to the researcher, who can then apply NLP methods to them, for example, to extract patient characteristics or support cohort formation.
The data controller can also apply NLP methods to create new variables from clinical notes. Converting unstructured data systematically into structured variables could enable new RWE study settings on a larger scale.
Machine learning can be widely used to convert real-world text data into a format that is easier to process and analyze. This could allow new variables to be included in RWE studies and reduce manual data extraction work.
In addition to text data, machine learning may help convert other types of unstructured RWD, such as X-ray and MRI images, into a structured format for use in RWE studies. Upcoming blog posts will cover the other data types and their applications in RWE studies.