Research area
Biomedical text mining (BioNLP) is a research area concerned with the development of new text mining and machine learning methods that address the unique challenges of biomedical text. BioNLP blends research ideas from natural language processing, machine learning, computational linguistics, and bioinformatics. BioNLP focuses (i) on new methods for extracting knowledge from biomedical texts to reduce redundancy and uncertainty, and (ii) on methods for learning and reasoning over the extracted data.
BioNLP research leverages biomedical texts such as the biological scientific literature and physician notes in medical or health records. The most complete repository of biomedical literature is PubMed, maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It contains more than 30 million citations for biomedical literature, life science journals, and online books. Each record typically contains a title, an abstract, and thematic annotations from the Medical Subject Headings (MeSH). The annotations are drawn from a comprehensive controlled vocabulary thesaurus maintained by expert curators at the NLM. In addition, the NCBI provides a database containing the full text of more than 10 million research articles.
Large-scale biomedical texts present two fundamental challenges: (i) biomedical texts include missing data, repeated measurements, and contradictory observations; and (ii) concepts must be extracted and normalized because, for example, different names are used to refer to the same concept. The challenge is how to computationally operationalize these data to make them amenable to analytics. BioNLP focuses on finding ways to organize and represent biomedical literature as rich knowledge graphs, and then to automatically learn and reason over those knowledge graphs in an effort to provide computational solutions to biomedical problems.
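The normalization challenge can be illustrated with a minimal sketch. It assumes a small hand-written synonym table; the identifiers (CONCEPT:0001, CONCEPT:0002) and the function names are hypothetical placeholders, not entries from a real vocabulary or part of an existing tool. A realistic system would replace the exact-match lookup with a curated thesaurus and fuzzy or learned matching.

```python
from collections import defaultdict

# Hypothetical synonym table: surface form -> canonical concept identifier.
# The identifiers are placeholders, not entries from a real vocabulary.
SYNONYMS = {
    "acetylsalicylic acid": "CONCEPT:0001",
    "aspirin": "CONCEPT:0001",
    "asa": "CONCEPT:0001",
    "tumor protein p53": "CONCEPT:0002",
    "tp53": "CONCEPT:0002",
    "p53": "CONCEPT:0002",
}


def normalize(mention):
    """Return the canonical identifier for a mention, or None if unknown."""
    key = " ".join(mention.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


def group_mentions(mentions):
    """Group raw mentions by the concept they normalize to."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)
```

Mentions that share an identifier can then be merged into a single node of the knowledge graph, while mentions that map to None remain candidates for manual curation.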
Goals
Our project has two main goals, both building on our previous collaboration, which resulted in a large biomedical graph. (1) First, we will develop methods that leverage that biomedical graph to make interpretable predictions using vector space embeddings. Our primary focus will be to develop methods that explain concepts and their combinations through text generation (e.g., automatically generated human-readable text relevant to a particular concept or interaction). Natural applications here are the prediction of drug toxicity, the understanding of disease mechanisms, and the transfer of findings from model species to humans. (2) Second, we will enrich the interpretations with a temporal component. Most existing knowledge bases contain only fact triples without temporal information. We will define time-based constraints following the W3C Time Ontology, and we will extend the interpretable methods to recognize temporal relationships. This will enable the identification of processes within a joint knowledge base. Unlike existing approaches, which predict only biological interactions, we will also extract time-based relationships, which we expect to improve the performance of our algorithms.
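As a rough illustration of the second goal, a fact triple can be extended with a validity interval and checked against a temporal constraint. The sketch below is an assumption for illustration only: the class TemporalFact, the helper functions, and the closed-interval date representation are hypothetical, and the "before" check is only loosely modeled on the interval "before" relation of the W3C Time Ontology.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporalFact:
    """A (subject, predicate, object) triple annotated with a validity interval."""
    subject: str
    predicate: str
    obj: str
    start: date
    end: date

    def __post_init__(self):
        # Reject intervals that end before they start.
        if self.end < self.start:
            raise ValueError("interval end precedes start")


def interval_before(a, b):
    """True if fact a's interval ends strictly before fact b's interval begins."""
    return a.end < b.start


def is_ordered_process(facts):
    """True if every fact in the list is strictly before the next one."""
    return all(interval_before(a, b) for a, b in zip(facts, facts[1:]))
```

A sequence of facts that passes is_ordered_process can be read as one candidate process, for example a hypothetical binding event followed by an inhibition event.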
Knowledge graphs are networks designed to capture the structure of different biological aspects and to represent the relationships among them. The nodes represent entities such as diseases, proteins, and drugs, and the links represent relationships such as gene interactions. In the current wave of statistical learning on graphs, methods typically build knowledge graphs ad hoc, only to propose new nodes or relationships (links), classify existing ones, or uncover hidden structures. Through our line of research, we will design transparent and explainable models for graph-structured data and develop new text-based tools for interpretation.
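To make the idea of proposing new links concrete, the sketch below scores candidate triples with a translation-based embedding model in the style of TransE. This is a stand-in technique chosen for brevity, not the method proposed here. The entity and relation names are hypothetical, and the 2-dimensional vectors are hand-picked rather than learned.

```python
import math

# Toy 2-dimensional embeddings; the values are hand-picked for illustration,
# not learned from data.
ENTITY = {
    "drug_A": (0.0, 0.0),
    "protein_B": (1.0, 0.0),
    "disease_C": (1.0, 1.0),
}
RELATION = {
    "targets": (1.0, 0.0),
    "associated_with": (0.0, 1.0),
}


def score(head, relation, tail):
    """TransE-style plausibility: negative distance between head + relation and tail."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    translated = [hi + ri for hi, ri in zip(h, r)]
    return -math.dist(translated, t)


def rank_tails(head, relation):
    """Rank all other entities as candidate tails, most plausible first."""
    candidates = [e for e in ENTITY if e != head]
    return sorted(candidates, key=lambda tail: score(head, relation, tail), reverse=True)
```

With these toy vectors, rank_tails("drug_A", "targets") places protein_B first. Such a score ranks candidate links but does not by itself explain them, which is the gap the interpretable, text-based methods above are meant to address.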
Organization & efficiency
The laboratory of Machine Learning for Science and Medicine at Harvard Biomedical Informatics and the Broad Institute of MIT and Harvard (HARVARD Group) is led by Prof. Marinka Zitnik. The group develops new data science and machine learning methods for learning and reasoning over rich interaction data and translates these methods into solutions for biomedical problems. Recently, the group has pioneered network embedding methods for rich biomedical graphs. This development has led to numerous ongoing research projects related to (i) representation learning for biomedicine; (ii) network embedding methods; and (iii) fusion of diverse data into knowledge graphs.
The laboratory for Data Technologies (UL Group) focuses on data processing. A research group within the laboratory focuses on text mining and natural language processing, including relationship extraction, coreference resolution, data deduplication, semantic Web, and information retrieval. The laboratory is also involved in several industry projects, for example automating the processing of all the Slovene daily news for the needs of daily media review.
The HARVARD Group is active in biomedical data science and in new machine learning methods for biomedical problems. The UL Group complements this with substantial experience in developing methods for text analysis and mining. Because the groups have strong complementary expertise, they are uniquely qualified to advance text mining research for biomedical domains by sharing and building on each other's ideas. Both groups also have Ph.D. candidates, post-graduate students, and researchers who form a basis for future joint research in the area of this research proposal. By gaining knowledge about biomedical data science, the UL Group will acquire the experience needed to analyze Slovene data from this domain in the future.
Contributions
Collaboration with a strong research group in the U.S. will establish important research connections and open up avenues for future joint research projects. Significant results and findings will be published at top-tier conferences (ACL, EMNLP, ISMB) and in scientific journals (Bioinformatics, Nature Communications), where both PIs already have multiple publications.
We will make all of our datasets publicly available. Further, we will release software for biomedical knowledge graph embedding. Thus, the benefits will be mutual and synergistic for both the HARVARD and UL groups. Finally, the joint work will enable Ph.D. candidates at both institutions to work with their colleagues remotely and to meet with them in person.