Submitted by Paras Raj Maheshwari and Sarthak Tripathi, Research Interns on Dengue Fever Surveillance in India Using Text Mining in Public Media.
Statistics show that communicable diseases are among the leading causes of death in Asian and African nations. One reason combating these diseases remains a challenge is that current surveillance systems are not advanced enough to provide accurate and timely data on outbreaks.
The inaccuracy results from underreporting of cases to national health departments. Reporting timelines also vary from institution to institution: some report data from the previous week, while others report data delayed by as much as a year. These reporting issues are a problem for policymakers and have a direct impact on public health.
On closer examination, one way to get more accurate information is to create a surveillance tool that uses data mining to gather news containing local and recent information on the spread of a disease [for the purpose of this article, we will consider the disease to be dengue]. Using text mining would enable the analyst to:
Differentiate topics being discussed in news sources
Uncover the evolution and spread of dengue to aid in monitoring the trends
Assist experts to reduce morbidity and mortality
Text mining cluster analysis of newspaper articles will help in detecting dengue trends in a specific geographical area and in taking preventive action. A study of Asian newspapers will show that the most discussed topics related to dengue are:
Reported dengue cases
Prevention regarding other diseases
What is Dengue fever?
Dengue is a mosquito-borne viral disease transmitted to humans by infected Aedes mosquitoes, a genus found in tropical and subtropical regions throughout the world. The most common symptom of dengue is high-grade fever accompanied by facial flushing, skin erythema, body ache, myalgia, arthralgia or severe headache.
What are public health surveillance systems?
Globalization is one of the most significant reasons why global surveillance of communicable diseases is necessary for both industrialized and developing nations.
Public health surveillance is the systematic, ongoing collection, management, analysis, and interpretation of data followed by the dissemination of these data to public health programs to stimulate public health action. Public health surveillance is the cornerstone for decision-making regarding detection and control of epidemics.
Surveillance systems already exist that mine media sources to reduce information gaps between health ministries, public health institutions, non-governmental organizations, and multinational agencies.
Methods and Conclusions
Text mining is the process of extracting non-trivial, previously unknown information from unstructured data drawn from different sources and converting it into a structured matrix form. Feature selection is an important step in text clustering because of the high dimensionality and data sparsity that result from collecting data from different sources. Several techniques are available to identify information in text, such as classification, clustering and summarization.
An interesting method is text mining cluster analysis, a combination of text mining and cluster analysis that groups similar topics together. This helps identify the main topics discussed in the document collection. Using this method would mean speedy collection of accurate data from various sources such as government reports, medical and scientific institutions, independent reports and research.
After creating a search query for terms such as dengue, DEN-1, DEN-2, DEN-3, DEN-4 and breakbone and restricting it to 2014, an HTML file was created. An HTML parsing tool was then used to extract and process the raw HTML and to transform the file into a dataset in which each row corresponds to a news article for the specific month. This method can ease the study of the disease's statistics and can test commonly held beliefs against data. For instance, the analysis of news articles gives a correlation of 0.92 between rainfall and reported dengue cases. A coefficient that close to 1 indicates a strong association, so it supports rather than defeats the popular notion that the spread of dengue is related to rainfall.
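The HTML-to-dataset step mentioned above could look roughly like the following sketch, which uses only Python's standard-library html.parser. The markup it expects (one div with class "article" per news item, a data-date attribute, an h2 title and p body paragraphs, no nested div elements) is a hypothetical layout chosen for illustration, and the sample text is made up; the actual export format and parsing tool used in the study are not described here.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect one row (a dict) per <div class="article"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished articles, one dict per row
        self._current = None  # article currently being built
        self._field = None    # "title" or "body" while inside <h2> or <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "article":
            self._current = {"date": attrs.get("data-date", ""),
                             "title": "", "body": ""}
        elif self._current is not None and tag == "h2":
            self._field = "title"
        elif self._current is not None and tag == "p":
            self._field = "body"

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field:
            joined = self._current[self._field] + " " + data.strip()
            self._current[self._field] = joined.strip()


def parse_articles(html):
    """Return a list of rows, one per article found in the HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample input in the hypothetical layout described above.
SAMPLE_HTML = (
    '<div class="article" data-date="2014-07-03">'
    "<h2>Dengue cases rise</h2><p>Hospitals report new dengue cases.</p></div>\n"
    '<div class="article" data-date="2014-07-15">'
    "<h2>Fogging drive</h2><p>City begins mosquito control.</p></div>"
)

rows = parse_articles(SAMPLE_HTML)
for row in rows:
    print(row["date"], "|", row["title"], "|", row["body"])
```

Grouping the resulting rows by the month in the date field would then give one dataset per month.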
Next, a structured representation of the information stored in the text documents is created, known as a term-document matrix [hereafter TDM]. To create a TDM, the following steps are followed:
Stemming: find the stem or root form of a term, treating different terms with the same root as equivalent
Stop word removal: words that are common in the text but do not contribute any useful semantic context are removed using a stop list
To make the stop list relevant to the document collection, it was extended to include all city names in Asia, ‘dengue’, ‘dengue fever’ and ‘break bone’. Text parsing not only produces a TDM but also reduces the total number of terms, which improves efficiency and better captures the content of a document by aggregating terms that are semantically similar.
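The parsing steps above can be sketched in a few lines of Python. The stemmer below is a deliberately crude suffix stripper standing in for a real stemmer, and the stop list and sample sentences are made up for illustration; none of them come from the study.

```python
import re
from collections import Counter

# Illustrative stop list: a few common words plus the topic word and one
# city name, mirroring the idea of extending the list for this collection.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "are", "to",
              "dengue", "delhi"}


def crude_stem(word):
    """Strip a few common suffixes (a rough stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_tdm(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


docs = [
    "Dengue cases reported in Delhi",
    "Hospitals are reporting more cases",
    "Mosquito fogging started",
]
vocab, matrix = build_tdm(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:10s} {row}")
```

Because "reported" and "reporting" share the stem "report", they land in the same row, which is how stemming reduces the number of terms.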
Term vector weighting
A weighted TDM is created next by applying term frequency-inverse document frequency (TF-IDF) weighting. The weighted TDM becomes the underlying representation for the collection of documents. Once documents have been converted into a weighted TDM, their vectors can be compared to estimate the similarity between pairs or sets of documents, determine the optimal number of topic clusters, and perform topic clustering.
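As a rough illustration of the weighting, the sketch below multiplies each raw count by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common TF-IDF variant and is an assumption here, since the exact formula used is not stated; the small count matrix is also made up.

```python
import math


def tfidf(count_matrix):
    """Weight a term-document count matrix.

    count_matrix[i][j] is the number of times term i occurs in document j.
    """
    n_docs = len(count_matrix[0]) if count_matrix else 0
    weighted = []
    for row in count_matrix:
        df = sum(1 for c in row if c > 0)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted


counts = [
    [1, 1, 0],  # a term found in two of three documents
    [0, 0, 2],  # a term found twice in a single document
    [1, 1, 1],  # a term found in every document
]
weights = tfidf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

A term that occurs in every document receives a weight of zero under this variant, while a term concentrated in a few documents is weighted up.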
Calculating the pairwise cosine similarity matrix
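As a minimal sketch of this step, the cosine similarity of two document vectors is their dot product divided by the product of their lengths, and the pairwise matrix holds that value for every pair of documents. The vectors below are made-up weights, with each inner list standing for one document (one column of the weighted TDM).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosine(doc_vectors):
    """Return the symmetric matrix of similarities between all document pairs."""
    n = len(doc_vectors)
    return [[cosine(doc_vectors[i], doc_vectors[j]) for j in range(n)]
            for i in range(n)]


doc_vectors = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],  # same direction as the first document
    [0.0, 3.0, 0.0],  # shares no terms with the other two
]
sim = pairwise_cosine(doc_vectors)
for row in sim:
    print([round(s, 3) for s in row])
```

Documents that use the same terms in the same proportions score close to 1 regardless of length, and documents with no terms in common score 0.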
Determining the optimal number of topic clusters
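One way to carry out this step is an elbow method, which this article names further on: cluster the documents for several values of k, record the within-cluster sum of squares, and pick the k after which adding clusters stops helping much. The sketch below is an assumption-laden illustration, not the study's procedure: it uses plain k-means with a deterministic farthest-point start, locates the elbow by the largest second difference, and runs on made-up two-dimensional points rather than real document vectors.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Deterministic start: repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, iters=20):
    """Run plain k-means and return the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow(inertias):
    """inertias[i] belongs to k = i + 1; return the k with the sharpest bend."""
    bends = {
        k: (inertias[k - 2] - inertias[k - 1]) - (inertias[k - 1] - inertias[k])
        for k in range(2, len(inertias))
    }
    return max(bends, key=bends.get)


# Made-up points forming three visibly separate groups.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
inertias = [kmeans_inertia(points, k) for k in range(1, 6)]
best_k = elbow(inertias)
print([round(v, 2) for v in inertias], "elbow at k =", best_k)
```

On real news data the curve is rarely this clean, so the chosen k is usually checked by eye as well.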
After determining the optimal number of topic clusters, we chose to use latent Dirichlet allocation (LDA) to determine the topics in the document collection. LDA is widely adopted to infer topics from text collections and is well suited to learning topics from unstructured text. It is an unsupervised topic modelling algorithm designed to uncover topics (sets of related words) from documents.
Text parsing works like a guided missile, except that in our case the missile is aimed at the betterment of people. The amount of specialised knowledge that can be gathered with this method is remarkable, which makes the collected data more reliable and efficient to use, and these qualities show up as favourable results.
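To give a feel for what LDA does, here is a compact sketch of one standard way to fit it, collapsed Gibbs sampling, written in plain Python. It is an illustration under assumptions rather than the implementation used in the study: the hyperparameters alpha and beta, the iteration count and the tiny tokenized documents are all made up, and a real analysis would normally rely on an established library.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per doc/topic
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per topic/word
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic given all the other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z, wid = assignments[d][i], word_id[w]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return vocab, topic_word


def top_terms(vocab, topic_word, n=3):
    """Return the n most frequent terms assigned to each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Made-up, already tokenized documents.
docs = [
    ["case", "hospital", "patient", "case"],
    ["hospital", "patient", "fever", "case"],
    ["fogging", "mosquito", "larvae", "fogging"],
    ["mosquito", "larvae", "breeding", "fogging"],
]
vocab, topic_word = lda_gibbs(docs, n_topics=2)
topics = top_terms(vocab, topic_word)
for k, terms in enumerate(topics):
    print("topic", k, terms)
```

The top terms printed for each topic are the kind of descriptive terms a domain expert would read to decide what a topic is about.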
Cluster identification labelling
After obtaining the topics using LDA for each month, a set of descriptive terms was used to label each topic. This process allows us to describe a topic and name our final clusters. With the help of a dengue knowledge expert, a label was manually chosen for every cluster for each of the 12 months in 2014 by using the descriptive terms derived by the LDA algorithm. After calculating the optimal number of topics for each month in 2014 using the elbow method, a total of eight different topics were identified in the news articles throughout all of 2014.
Narrowing down the enormous amount of material published every day for a whole year and sorting it into eight topics is itself evidence of the power and efficiency of data mining. It shows how text mining of public media can be highly beneficial. The quality of the data collected through text mining can help a great deal in the surveillance, containment and elimination of diseases.
The benefits of using the data mining method over others are that:
topics extracted from news articles offer not only information on dengue trends in a specific geographic area but also information about other topics, such as prevention, politics, prevention relative to other diseases, and emergency plans
the evolution of topics throughout the year can be used by dengue experts, health care officials, public health policymakers, communicators, and journalists to obtain insight on relationships to a specific communicable disease
although rainfall and the Breteau Index can be used to detect patterns for dengue, this information may not be promptly available, or may not be collected in a specific region
although the interpretation of the clusters may require human input, our analysis can be automated to reduce the delay in receiving official data, and improve the availability of data needed to decrease morbidity and mortality of communicable diseases.
To understand the development, please refer to https://pubmed.ncbi.nlm.nih.gov/29141718/
For queries, mail at firstname.lastname@example.org.