Skip to content

Home > All articles > Text mining helped analyse a million entries on smoking

Text mining helped analyse a million entries on smoking

Text mining decreased the work of clinical experts and accelerated the research process.

On the basis of real-world evidence, Medaffcon wanted to find out how smoking affects postoperative surgical complications. Responses to this question were sought in academic cooperation by studying the clinical notes of patients after surgery. Physicians enter a variety of information in the clinical notes, including whether the patient smokes.

However, smoking data is not found in the clinical notes as structured data. It means that there is no separate field in the health records where the physician could enter whether or not the patient smokes. Instead, physicians enter data on the patient’s smoking in free form within the wider patient records. There is no uniform manner to record it, which means that there is a huge amount of different types of formulations.

Million sentences on smoking

When the material was reviewed, a total of million different smoking-related sentences were found. How to review a million sentences and analyse them? There are different possibilities. One possibility is to recruit a huge number of clinical experts to analyse the data. Another is to limit the material drastically in order to reduce the workload.
However, neither of these was used, but Medaffcon developed a classifier based on machine-learning to help analyse the material. In order to teach it, clinical experts classified a total of 20 000 smoking-related sentences. This work was conducted in one day by two clinical experts supported by Medaffcon’s pre-processing and helpful tools. After this, the remaining million sentences were analysed and categorised with the help of an algorithm based on machine learning.

– Without the algorithm, this sort of analysis would have been impossible to conduct. The number of patients would be completely different. Previously, it was possible to include thousands of patients in similar studies, now hundreds of thousands, says Medaffcon’s Data Scientist Juhani Aakko.

The quality of entries is essential

This sort of text mining happens every day in the processing of unstructured data, enabling the use of extensive material. Juhani Aakko estimates that the use of different algorithms based on machine learning will increase in the analysis of healthcare data.

– The use of machine learning methods is limited by the fact that they would require an enormous amount of data to be taught. Healthcare has data, but clinicians should review large amounts of it in order to teach the algorithm. 

No matter how sophisticated data analysis methods would become, one old fundamental remains, which is the good quality of entries.

– We can only assess data that has been entered. It would be good to pay attention to the quality of entries and to harmonised principles of entering data. Hopefully, healthcare gains new manners to facilitate entering data, picking up some of the data into structured format already when entering them, says Aakko.

Takaisin ylös