Funded Projects

PI Projects 2021 - 2022


Sample-efficient molecular optimization and property prediction in low data settings

Connor Coley

Generative models and machine learning-based property prediction models have the potential to change how we approach the discovery and optimization of new functional molecules, including small molecule therapeutics. However, a number of challenges have relegated generative modeling techniques to the proof-of-concept stage, among them sample efficiency. In real drug discovery applications, one cannot afford (in time or financial cost) to perform the tens or hundreds of thousands of experiments that these techniques currently require to identify a promising molecule. The same issues of sample efficiency have limited the relevance of many quantitative structure-activity/property relationship (QSAR/QSPR) models to low data settings, because state-of-the-art (SOTA) architectures only achieve SOTA performance when trained on thousands of examples. This proposal will pursue three complementary aims to improve the utility of machine learning models for molecular optimization work. Specifically, we will (1) develop a new optimization strategy for generative models to reduce the number of expensive oracle calls through an outer-loop surrogate model; (2) apply meta-learning to QSAR/QSPR modeling for efficient task-specific initialization and fine-tuning; and (3) extend our preliminary work on the use of evidential uncertainty for confidence estimation. Each of these aims will make AI techniques for molecular property prediction and optimization more practical for low data applications where computational assistance is most needed.

Representation Learning to Elucidate the Disease Mechanisms in Atrial Fibrillation

Caroline Uhler

The life sciences are in the midst of a data revolution. Inexpensive and accurate DNA sequencing is a reality, advanced molecular imaging is becoming routine, and single cell genomics is allowing us to profile millions of cells. To practitioners of machine learning (ML), these datasets represent a unique opportunity to apply the breathtaking recent advances in ML to problems in human health. But, unlike in fields such as advertising or recommender systems, in biology we care not so much about predictive accuracy as about causal mechanisms. Our challenge is thus to integrate diverse data modalities to obtain representations that elucidate the underlying disease mechanisms. Here, we propose a focused and directed project in an area that both has widespread public health applications and is well suited to leveraging recent and deep insights from ML. Specifically, we propose a highly collaborative project on Atrial Fibrillation (AF), a common cardiovascular disease associated with an increased risk of heart failure, dementia and stroke.

TomoDRGN: Resolving heterogenous molecular machines in their native cellular environment

Joey Davis

Massive macromolecules perform essential cellular functions including DNA replication and protein synthesis. Megadalton-scale machines such as the ribosome, which drives protein synthesis and is a target of multiple clinically approved antibiotics, are composed of 10s-100s of components that the cell must assemble rapidly, efficiently, and with atomic precision. Understanding how cells assemble macromolecular complexes such as the ribosome is critical for three reasons. First, these highly regulated pathways are essential for homeostasis, and their assembly fidelity is diminished in many human diseases. Second, I posit that a structural understanding of exactly how assembly goes awry in these diseased states will be vital for the rational, structure-guided design and optimization of therapeutics to ameliorate these conditions. Finally, assembly of such structures is a fundamental yet underexplored process that has undergone extensive selection to be rapid and energetically efficient. Deciphering how cells assemble these machines could facilitate the design of analogous man-made complexes with applications in chemical synthesis and drug delivery. The objective of this application is to develop a novel machine learning based cryo-ET analysis pipeline specifically designed to uncover many (10s-100s) related structures from cellular tomograms. This work extends our previously funded J-clinic project to resolve heterogeneous structural ensembles of protein complexes that had been purified away from their native cellular environment and imaged by single particle cryo-EM, a related but distinct imaging modality.

Towards Precision Cardiology: Developing A 3D Single-Cell Resolution Gene Expression Landscape of the Cardiovascular System Using Machine Learning

Elazer Edelman

Despite great advances in understanding and in available treatments, cardiovascular disease (CVD) remains the leading cause of death worldwide. The pathophysiology of CVD involves complex tissue structures, a range of critical cells, versatile specified functions, and sophisticated regulatory mechanisms. Recent advancements in single cell transcriptome technologies have substantially improved our insights into the development of the cardiovascular system and the mechanisms underlying CVD. Ever larger volumes of transcriptomic data, generated by different methods, continue to accrue, and we must now piece the biology together and harness this knowledge to benefit patients. Novel machine learning algorithms provide the opportunity to integrate these data into data-rich 3D transcriptomic models and to help interpret them, so as to understand the role of each cell in the cascade of cardiovascular pathophysiology. Our project will provide a computational platform to reveal building blocks of spatial gene expression profiles in the cardiovascular system.

Mapping Spatiotemporal Dynamics of Single Cells During Carcinogenesis

Jonathan Weissman

Cells are dynamical entities in which transitions between cell states, including tumorigenesis, are accompanied by characteristic changes in gene expression. Determining how cancer cells transit through expression space can thus lead to fundamental insights into disease prevention, diagnosis, and treatment. While the vector field was first introduced by Leonhard Euler 250 years ago, it was only recently used to propose the formal concept of “cancer attractors”, in which cell types are “attractors” (stable cell states) in the vector field, while cancer cells are “abnormal attractors” that can be diverted and returned to a normal state, i.e., a normal “attractor”. However, it has never been possible to learn a high dimensional vector field directly from expression datasets with ML algorithms. Global mapping of the velocity vector field of gene expression would thus be an outstanding achievement, as it would make it possible to predict the entire dynamics of any cell state transition, e.g., carcinogenesis, drug response and resistance, in a way analogous to using Newtonian mechanics to predict how our solar system or the galaxy evolves over time.

Clinical AI

Inadvertent Multimodal Signals As Indicators of Preclinical Cognitive Decline

Randall Davis

We want to demonstrate that multimodal analysis of simultaneous hand and eye movements can both measure a subject’s overt behavior and provide new insights into how they are thinking, i.e., their methods and strategies. Insights into strategies in turn provide opportunities for early diagnosis and early intervention for both neurodegenerative disorders in adults and neurodevelopmental disorders in children. The proposed work is the next step in an ongoing project exploring grapho-motor/cognitive interactions that has been active for more than a decade. The new work adds modalities that provide exceptional opportunities for multimodal interpretation. Throughout that project we have worked on digital cognitive biomarkers, developing novel tests that measure cognition quickly, inexpensively, and with the objectivity and repeatability that come from high resolution measurements and software-defined analytics. Within the larger scope of this ongoing work, the primary focus of the proposed work is the collection, cleaning, ML analysis, and interpretation of multimodal (drawing motion and eye motion) subject data.

Personalized Machine Learning for Improving Mental Health

Rosalind W. Picard

Major Depressive Disorder (MDD) is one of the leading causes of disability worldwide. At MIT, approximately 22% of undergraduates suffer from severe or moderately severe depression. Depression is the main contributor to suicide, and the US has seen suicide rates increase in nearly every state over the period from 1999 to 2016, with half of the states having rate increases of more than 30%. MDD is associated with behavior, mood, sleep and physiological alterations, which can be measured with multi-modal sensors from wearables and smartphones. Studies have shown that multimodal and physiological data from mobile and wearable devices can be used to reliably estimate depression severity as assessed by psychiatrists, detecting depressed and euthymic (remission) phases. Today’s methods have significant limitations that we will address in this proposal by developing novel personalized machine learning techniques for monitoring disease progression and enabling personalized medicine. Our project goals are to develop new personalized Machine Learning (ML) approaches to accurately monitor changes in MDD level and to forecast illness trajectory, identifying the most influential variables and enabling personalized time-adaptive treatments. We have access to clinical data and are working closely with the #1-rated psychiatry department in the USA, at MGH, to shape the clinical impact of this work. Our research will facilitate the implementation of personalized digital medicine by allowing prescription of specific therapy on the basis of the forecasted disease profile identified from passive data, and ultimately increase the effectiveness of selective prevention strategies. This will enable delivery of optimal MDD prevention or treatment to those who currently cannot access it due to time constraints, lack of available services, trust, or cost. This work will also provide new advances in AI/machine learning, which we will disseminate through publications, media, and presentations.
Our default plan is also to open source our personalized machine learning software, to expedite advances across both AI and personalized medicine.

Outsource Training of Sensitive Medical Data: Secure Multi-Party Collaboration

Muriel Medard

Machine learning (ML) algorithms have attracted significant attention for classification tasks in many sensitive medical applications, e.g., diagnosis and prediction. Training an ML classification model with high precision for medical purposes requires advanced computing resources and sufficient training data collected from diverse groups of patients, e.g., diverse in age and ethnicity. In practice, sufficient and diverse training samples do not exist at the same time and location. Thus, many medical centers would like to collaboratively train ML models on their combined datasets for a common benefit. Moreover, advanced computing resources and software (e.g., GPUs, TPUs, cuDNN) are costly and do not usually exist at every medical center. An attractive solution is to outsource the training data from several centers to a cloud server where the ML model is trained. Cloud providers have begun to offer Machine Learning as a Service (MLaaS), a new service paradigm that uses cloud infrastructures to train models and offer online prediction services to clients, e.g., Microsoft Azure, Google Cloud AI. However, sending plaintext sensitive data to third parties for processing is highly undesirable due to privacy concerns and/or business competition, e.g., see the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The proposed coding scheme allows multiple parties to train an ML classification model without revealing their data. The intuition behind the proposed method is to encrypt training data at each medical center via a low-complexity coding scheme, transfer the encrypted data to the data center, and train the ML model with the encrypted data rather than the original raw data.
Our goal is to present a solution that offers encryption of the sensitive datasets such that an information-theoretic privacy bound can be guaranteed, and thus no attacker can possibly infer information about the original data from the encrypted data or from the trained model beyond a certain limit. The model that is trained with encrypted data can then be used for inference at the data center or be transferred back to the local offices. In the proposed scheme, each medical center has its own encryption scheme, i.e., a security key, that is computationally very difficult to guess and does not need to be shared with other centers, organizations, the cloud, etc. Moreover, since the ML model is trained with the encrypted data, only authorized users who have access to a valid security key (the local offices that provided training data) can use the devised classifier. We propose to employ Random MultiLayer Perceptron (RMLP) networks to convert original sensitive samples into a new representation that provides privacy.
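As a rough illustration of the idea of passing samples through a randomly initialized multilayer perceptron whose weights act as a private key, here is a minimal Python sketch. It is not the RMLP scheme proposed in this project and carries no privacy guarantee: the function names, the layer sizes, the `tanh` activation, and the use of an integer seed as the "key" are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def make_random_mlp(key: int, layer_sizes: Sequence[int]) -> List[Matrix]:
    """Build fixed random weight matrices from a secret integer key (hypothetical scheme)."""
    rng = random.Random(key)  # the key fully determines the weights
    weights: List[Matrix] = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)  # keep pre-activations at a moderate magnitude
        weights.append(
            [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        )
    return weights


def encode(sample: Sequence[float], weights: List[Matrix]) -> List[float]:
    """Map a raw sample to a new representation by a forward pass through the random MLP."""
    activation = list(sample)
    for layer in weights:
        activation = [
            math.tanh(sum(w * x for w, x in zip(row, activation))) for row in layer
        ]
    return activation


# Each center would keep its own key; the same key always yields the same encoding.
center_a = make_random_mlp(key=1234, layer_sizes=[4, 8, 3])
encoded = encode([0.2, 1.5, -0.7, 3.0], center_a)
```

In this toy version, a center would send only `encoded` vectors (and labels) to the cloud, and anyone wanting to use the trained classifier would need the same key to encode new samples.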

Causal Inference in Large-Scale Observational Omics Data

Vivek Farias

Recent years have seen an explosion in the availability of large-scale observational datasets where the questions at hand may be viewed as ones of causal inference. As a concrete example, consider the Connectivity Map (CMap) datasets. These datasets describe the cellular impact, expressed in terms of genomic (or proteomic) expression, of systematic perturbations by a large set of genetic and pharmacologic perturbagens. For instance, CMap contains approximately 1.5M gene expression profiles across approximately 100 cell lines perturbed by approximately 5,000 small molecule compounds. These data were made possible by a relatively inexpensive, high throughput gene expression profiling technology dubbed L1000. In a similar vein, the PI has recently helped develop a high throughput platform for unbiased proteomic sampling, which has the potential to provide similar data for proteomic profiling. This platform has already shown promise as a tool to generate proteomic biomarkers at scale for both non-small-cell lung cancer (NSCLC) and Alzheimer's disease.

Learning optical design strategy for point-of-procedure breast cancer imaging

Sixian You

Optimizing the imaging design strategy within the constraints of low-resource clinical environments is crucial for achieving accurate cancer detection. This project aims to develop an AI-based framework that learns an illumination/detection strategy for optimized optical imaging system design within the constraints of point-of-procedure breast cancer imaging. The optical imaging system (illumination and detection scheme) will be jointly optimized with a neural-network-based tumor detection method in an end-to-end (image generation to image interpretation) fashion for efficient co-design.

iBOCA - Phase 2

Kalyan Veeramachaneni

Our mission is to create a digitized early detection cognitive assessment tool that will provide accessible, noninvasive, scalable assessments that are quick and comprehensive. Currently, a combination of multiple diagnostic tools is required to pinpoint cases of cognitive impairment. These tests have reasonable specificity and sensitivity, but require significant time to complete and cannot be used as routine screening exams. The current research standard, the amyloid scan, provides a presymptomatic diagnosis, but it only has a sensitivity of 90% and specificity of 85%, costs over $3000, and is not commercially available. Generally, physicians must rely on patient history, family input, physical exams, and cognitive tests to make a clinical diagnosis of this condition. While there is no known cure for dementia, treatments can slow its progression. We developed iBOCA, an accessible, culturally-competent app that allows users to track their disease progression and clinicians to monitor their patients’ cognitive health. By providing a single, comprehensive tool for early detection, iBOCA enables treatments that can delay cognitive decline and improve overall quality of life. iBOCA will combat the public health disparity caused by expensive diagnostic procedures and reveal new findings about cognitive impairment through fine-grained data collection.

Multimodal Learning for Fluid Status Assessment in Heart Failure Patients

Polina Golland

In this project we will develop novel machine learning algorithms for accurate non-invasive assessment of pulmonary edema in patients with congestive heart failure from chest x-ray images. The proposed methods will jointly model images with other available data such as radiology reports and clinical indicators during learning to improve the image-based assessment at inference time. The resulting approach to image-based edema assessment will be applicable broadly beyond the applications to congestive heart failure that motivate our research. Our deployment plan includes integration into the hospital information systems at our clinical collaborator site and other institutions.

Development of ML Algorithms for Addressing Problems Associated with Missing and Conflicting Data in Telehealth Environments

Peter Szolovits

Missing and conflicting data constitute an increasing risk to patient safety and adequate care. Alerts generated in the telehealth environment focus on surveillance of vital sign data and signal telehealth providers about parameters that exceed configurable thresholds, trends in physiological variables, and specific multi-parameter combinations. For example, two consecutive average heart rate values above 150 beats per minute (bpm) will trigger a default alert, as will a 10-minute median value above 130 bpm. When there are missing data, the alert may be delayed until enough complete data are available, or future alerts may not occur at all. Further, the same patient health parameter may be measured by an electronic mechanism and by a human being. The latter is usually applied and the former is ignored. At one medical center, the difference has been found to be the largest in the cases of patients who soon die. We propose to leverage the concept of Bayesian networks to perform the task of data imputation using test data from MIMIC and other sources in the public domain.
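The two heart rate rules mentioned above, and the way missing readings can delay or suppress an alert, can be sketched in Python. This is only an illustrative sketch, not the telehealth system's actual logic: the function names, the use of `None` for a missing reading, and the `min_fraction` completeness cutoff are assumptions.

```python
from statistics import median
from typing import Optional, Sequence


def consecutive_high_alert(
    averages: Sequence[Optional[float]], threshold: float = 150.0
) -> bool:
    """Return True if two consecutive averaged heart rate values exceed the threshold.

    A missing value (None) breaks the run, so a gap can suppress the alert.
    """
    previous_high = False
    for value in averages:
        high = value is not None and value > threshold
        if high and previous_high:
            return True
        previous_high = high
    return False


def median_alert(
    window: Sequence[Optional[float]],
    threshold: float = 130.0,
    min_fraction: float = 0.8,  # assumed completeness requirement
) -> Optional[bool]:
    """Evaluate the 10-minute median rule on one window of readings.

    Returns None (no decision yet) when too many readings are missing.
    """
    present = [v for v in window if v is not None]
    if not window or len(present) / len(window) < min_fraction:
        return None
    return median(present) > threshold
```

For example, `consecutive_high_alert([155, None, 160])` does not fire because the gap separates the two high values, which is the kind of failure that imputation of the missing reading is meant to address.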

The Doctor’s AIde: Machine Learning to Automatically Surface Clinically Relevant Notes and EHR Data

David Sontag

This proposal seeks to develop machine learning methods to automatically retrieve contextually-relevant clinical data (e.g., notes, labs, imaging) from the patient’s medical history to improve clinical decision making at the point of care. This project is a continuation of a collaboration between MIT and Beth Israel Deaconess Medical Center (BIDMC), previously funded by Jameel Clinic, that has already resulted in the development of a new AI-driven user interface for clinical documentation that has been deployed in the BIDMC Emergency Department.


Fair Organ Allocation Learning

Marzyeh Ghassemi

There are a large number of well-established biases in the health system that adversely affect the quality of care received by minority populations. While the increasing use of algorithms and machine learning (ML) in healthcare holds great promise, it also risks codifying and worsening these disparities. In this project, we intend to study a process that is particularly prone to racial and other inequities: the organ transplant allocation system. Specifically, we will focus on lung transplantation, aiming both to characterize disparities in the current system and suggest ML-based approaches to improve the fairness of outcomes and access.

Quantifying Chronic Itch Using Radio Signals and Machine Learning

Dina Katabi

Itch, a basic sensation that can be evoked by a mosquito bite, becomes a torturous experience when chronic. Chronic itch is so brutal that, in Dante's Inferno, falsifiers were eternally punished by “the burning rage of fierce itching that nothing could relieve”. Today, chronic itch affects up to 13% of the population, and is associated with over $90 billion in annual population-expenditures in the US. It is often as debilitating as chronic pain, and has a profound negative impact on quality of life. Yet, despite much interest from biotech and pharmaceutical companies and active attempts, there are currently no FDA-approved treatments for chronic itch. A key challenge for both drug development and treatment is the lack of an objective measure for quantifying itch. The current clinical standard for quantifying itch is the numerical rating scale (NRS), which is a single-item self-assessment, where patients rate the severity of their itch over the prior 24 hours on a scale from 0 (“no itch”) to 10 (“worst imaginable itch”). A number of other similar instruments exist, including the visual analog scale (VAS) and the 5-D itch scale. Although valuable tools, these methods suffer from three key limitations: 1) Such self-reported scales are highly subjective and hard to generalize across patients. 2) They also lack sensitivity because humans are not tuned to quantify small changes in their condition. This is compounded by the fact that itching and scratching are often subconscious and may occur during sleep, and hence patients may not be fully aware of the morbidity associated with their chronic itch. 3) Finally, such self-assessment is unreliable in children and older adults with cognitive impairment, yet itch is common in those populations.

Wireless Seismocardiography: Enabling Long-Term Non-Contact Cardiovascular Monitoring

Fadel Adib

The past decade has witnessed significant advances in using wireless signals to sense people and their vital signs. Novel algorithms and software-hardware systems have enabled capturing human breathing and heart rates based on the Radio Frequency (RF) signals that bounce off the human body, without requiring any body contact. While this research has demonstrated that RF signals carry impressive information about human vitals, it still cannot capture the level of detail in typical gold standard heart recordings, which are needed to understand and monitor cardiovascular conditions. In this proposal, we ask the following question: Can we use RF signals to wirelessly monitor a person’s seismocardiogram (SCG)? The SCG is a heart recording that is analogous to the more commonly known electrocardiogram (ECG). In contrast to the ECG, which measures the heart’s electrical activity (i.e., voltage), the SCG measures the heart’s mechanical activity (i.e., vibrations). Medical literature has shown that the SCG can be used to precisely time fine-grained heart activities including the opening and closing of valves, which allow blood to flow between the heart chambers and into the blood vessels. These measurements are useful in the detection and diagnosis of several cardiovascular conditions like myocardial infarction (heart attack), coronary heart disease, ischemia, and hemorrhage. The standard approach for measuring SCG signals today relies on accelerometers that capture micro-vibrations of the chest wall. The process typically requires users to take off their shirts, lie in a supine position (i.e., on their back), and affix an accelerometer near the apex of the heart using a chest strap. As a result, measuring SCG today remains intrusive and inconvenient, and it typically needs to be administered by medical practitioners in calibrated medical settings or controlled environments.
We propose Passive-SCG, a wireless, non-contact approach for measuring SCG signals that enables passive, long-term monitoring of users in everyday environments.

Special Projects

Interpretable ML for Understanding, Predicting, and Treating Sepsis in the Age of COVID

Georgia Perakis

The goal of this research project is to develop interpretable machine learning algorithms for treatment of sepsis, with special consideration for sepsis in COVID-19 patients. Sepsis in a patient is defined as an “overwhelming and life-threatening response to infection that can lead to tissue damage, organ failure, and death”. Although most frequently caused by bacterial infections, sepsis can be a result of viral or fungal infections as well. Even before COVID, sepsis and septic shock were leading causes of death in hospitalized patients – between one-third and one-half depending on definitions; over 1.7 million per year in the United States alone. There are often lasting effects for patients who survive, including cognitive, psychological, and medical impairment, as well as physical/functional limitations. Sepsis treatment also represents a major medical expense. Even a small improvement in existing protocols for diagnosing and treating sepsis could therefore have substantial impact on patient health and healthcare costs.

Robust multimodal learning of prognostic biomarkers in sepsis

Joel Voldman

Sepsis is an illness characterized by an over-exuberant and systemic immune response to infection that can produce organ failure, shock, and death. In the US, sepsis is the 3rd leading cause of death, with 1.7M diagnoses each year, costing the healthcare system >$20B/year. The uncontrolled inflammation of sepsis can progress despite well-targeted antibiotics and adequate control of the infection; the response of the host (patient) to infection is key to the progression (and, ultimately, resolution) of the illness. Sepsis can progress quickly (within hours), and thus rapid and frequent assessment of the patient state to predict disease time course is critical to inform treatment decisions and ultimately improve outcomes. Current clinical management of septic patients in the ICU is based on a wide range of sensitive but non-specific inputs (history, demographic data, vital signs, daily complete blood counts, blood chemistry), culminating in calculation of clinical scores (Sequential Organ Failure Assessment, SOFA; Acute Physiology and Chronic Health Evaluation, APACHE II) and treatment decisions (administration/cessation of antibiotics or vasopressors, mechanical ventilation, etc.). Critically, while we have measures of the high-level outcomes of the pathophysiology (heart rate, etc.), our ability to assess the function of the effectors of the host response, i.e., the leukocytes (white blood cells), is limited. Given the importance of the disease, substantial work has gone into developing new diagnostics for sepsis. This ranges from new measurement modalities (T2Bacteria Panel from T2 Biosystems, SeptiCyte Lab from Immunexpress) to classical (SVM) and deep (RNN) learning from EHR data. Equally important, but much less explored, is the development of prognostic biomarkers needed to assess the likely progression of the disease and thus inform treatment decisions.

PI Projects 2019-2020


Machine Learning for Home-Based Ultrasonic Ovarian Cancer Screening

Prof. Yonina Eldar, Prof. Anthony E. Samir

This project is targeted at developing new methods and systems to detect ovarian cancer at the early curable stage before it has spread beyond the ovary in order to drastically improve ovarian cancer survival.

Shaping the Future of Disease Diagnostics

Prof. Eric Alm, Prof. David Sontag, Dr. Andrew Allegretti, Prof. Jordan Smoller

This project aims to improve healthcare decisions in challenging diagnostic areas. We envision the development of a non-invasive molecular test, effective across many disease areas, returning results rapidly.

Causal Experimentation and Modeling with Uncertain Disease Labels

Prof. Caroline Uhler, Dr. Ernest Fraenkel

This project will address the challenges of how data is collected and analyzed on Amyotrophic Lateral Sclerosis (ALS) using algorithms to design experiments on neurons derived from ALS patients.

iBOCA: A Machine Learning App for Early Detection of Cognitive Impairment

Dr. Kalyan Veeramachaneni, Prof. Saman Amarasinghe, Dr. Chun Lim, Dr. Dan Press, Dr. John Torous

This project will develop a machine learning based iPad app for doctors to administer cognitive tests for assessing mild cognitive impairment. Our goals are (1) to develop machine learning based models for early detection and diagnosis, and (2) to enhance the app to collect more detailed data that will in turn enable rapid detection within 6 minutes.

Disease Monitoring

Cardiovascular Data Science for Personalized Medicine

Prof. Collin M. Stultz, Dr. Aaron D. Aguirre

This project will build and test models that identify personalized therapeutic interventions that minimize adverse outcomes in patients who are admitted with congestive heart failure.

Improving the Accuracy of Personalized Machine Learning to Monitor Depression

Prof. Rosalind W. Picard, Dr. Paola Pedrelli

This project will create a novel method to assess depressive symptoms for individuals by using machine learning analytics applied to objective data from phone sensors and wrist-band sensors. This machine learning work thus has the potential to significantly impact treatment of the foremost cause of disability.

Harnessing Signals of Clinical Effectiveness from Electronic Health Records (EHR) Data to Repurpose Existing Drugs for Unmet Medical Needs

Prof. Roy Welsch, Prof. Stan Finkelstein

This project has a clinical goal, to identify signals of clinical effectiveness from EHR data in order to repurpose currently marketed drugs, and a methodological goal, to create machine learning algorithms that use observational data to compare cohorts of patients and thereby emulate clinical trials.

Nocturnal Seizure Detection and Prediction in Epileptic Patients Using Radio Signals 

Prof. Dina Katabi, Prof. Tim Morgenthaler, Dr. Melissa Lipford, Dr. Mithri Junna

In this project, we will augment the Emerald WiFi-like device with new machine learning models to detect epileptic seizures simply by analyzing the surrounding wireless signals. We will also use the Emerald device to monitor the relationship between seizures and the patient’s vital signs and sleep stages. We will further study the feasibility of predicting seizures before they occur.

Preventive Medicine

Deep Generative Models for Cryo-EM Reconstruction of Heterogeneous Biomolecular Structures

Prof. Joey Davis, Prof. Bonnie Berger, Prof. Bridget Carragher, Prof. Clint Potter

In this project we propose to develop revolutionary machine learning based techniques to determine structural ensembles of “misfolded” complexes. This work leverages the Berger Lab’s expertise in such computational approaches, MIT’s recent investments in cryo-EM instrumentation (MIT.nano), and Davis Lab’s expertise in producing, isolating and imaging such molecules.

Machine Learning Based Noninvasive Continuous Absolute Blood Pressure Monitoring

Prof. Hae-Seung Lee, Prof. Charles G. Sodini, Prof. Song Han, Dr. Aaron D. Aguirre

In this project, we investigate a new ML based continuous absolute blood pressure (BP) waveform monitoring method. This is based on our on-going research on pulse pressure monitoring using ultrasonography, which yields a continuous BP waveform instead of just systolic and diastolic pressure levels.

A Deep Learning System to Identify Combinatorial MEG and PET-based Biomarkers for Early Detection and Monitoring of Alzheimer’s Disease

Dr. Dimitrios Pantazis, Prof. Quanzheng Li, Prof. Fernando Maetsu

This proposal aims to first construct and monitor biomarkers using combined brain imaging modalities, then use a deep learning model that integrates connectome-structured data to discover multimodal features that predict an individual’s risk for AD.

Application of AI/Machine Learning to Clinical Mental Health: Depression, Bipolar Disorder, and Anxiety

Prof. John Gabrieli, Prof. Peter Szolovits, Prof. Joseph Biederman, Prof. Mai Uchida, Prof. Dina Hirshfeld-Becker, Prof. Stefan Hofmann

The goal of the present project is to create a new AI / machine learning bridge between one research group in the area of the diagnosis and treatment of mental disorders and another group in the Computer Science and Artificial Intelligence Lab (CSAIL). 

Semi-Supervised Deep Learning for Red Blood Cell Morphological Classification with Applications to Sickle Cell Disease and Hyposplenism

Dr. Ming Dao, Dr. Dimitrios Papageorgiou, Dr. Pierre Buffet, Prof. Olivier Hermine

In this project we propose to develop a robust, fully automated, real-time red blood cell (RBC) classification method with applications to sickle cell disease (SCD) and hyposplenism (impaired spleen function), based on a semi-supervised deep learning approach using a generative adversarial network (GAN).

Project InCyte: Enhancing Histopathological Diagnosis with Digitization and Machine Learning

Prof. Pawan Sinha, Dr. Kyle Keane, Dr. Prerna Tewari

In this project we will digitize the stored collections of pathology slides in Indian hospitals, and then use the resulting collection of images to train computational classifiers to infer disease labels from histo- and cyto-morphology. Additionally, we believe this will allow us to understand the trade-off between the quality of archiving processes and the long-term predictive value of data sources.

Clinical Operations

Interpretable and Transferable Prediction and Extraction Methods for Medical Reports

Prof. Tommi S. Jaakkola, Prof. Kevin Hughes

In this project we focus on realizing and testing automated tools to reason about and extract information from medical reports and records. We design tools that are interpretable, verifiable, and transferable.

AI-Driven Diagnosis with Documentation in the Emergency Department

Prof. David Sontag, Prof. Steven Horng

This project seeks to re-envision clinical documentation as a cognitive aid by creating AI-driven user interfaces that are tightly integrated into the clinical workflow.

Data Privacy

Private and Scalable Collaboration on Medical Images

Prof. Vinod Vaikuntanathan

In this project we propose to apply powerful methods in cryptography, called secure multiparty computation and homomorphic encryption, to design nimble solutions to the privacy-usefulness conundrum, computing together on the distributed data and models, while guaranteeing privacy to the individual stakeholders.
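To make the idea of secure multiparty computation concrete, here is a minimal Python sketch of additive secret sharing, one standard building block of such protocols. It is not the method developed in this project: all parties are simulated in a single process, the names `share` and `secure_sum` and the choice of modulus are assumptions, and a real deployment would need secure channels and protection against misbehaving parties.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1  # public modulus (assumed); inputs must be integers in [0, MODULUS)


def share(value: int, n_parties: int) -> List[int]:
    """Split value into n_parties random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values: List[int]) -> int:
    """Compute the sum of all parties' values without pooling the raw values."""
    n = len(private_values)
    # Row i holds the shares produced by party i; column j is what party j receives.
    all_shares = [share(v, n) for v in private_values]
    # Each party adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


# Three hospitals learn the total patient count without revealing their own counts.
total = secure_sum([12, 30, 7])
```

Each individual share is a uniformly random number, so no single party's view reveals another party's input; only the combined partial sums recover the total.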

Drug Discovery

Deep Learning Peptide Representations for Optimal Cellular Delivery of Gene-Editing Cargo

Prof. Rafael Gomez-Bombarelli, Prof. Bradley L. Pentelute

In this project we will train convolutional neural networks on novel 2D representations of peptides that include variational topological representations of natural and unnatural amino acids.