Machine Learning (ML) models have been successfully developed to solve a wide range of clinical problems. However, most popular ML models rely on large supplies of labeled data, making expert annotation a key bottleneck to their widespread use. Annotating vast amounts of raw clinical data is not only tedious but also expensive and error-prone, forcing researchers to rely on older, static, already-labeled databases despite evolving clinical knowledge and requirements. In this work, we take as an example the task of labeling common clinical findings in electroencephalographic (EEG) waveform data from comatose survivors of cardiac arrest who underwent continuous EEG monitoring to guide clinical care and determine neurological prognosis. We have already created and modeled multiple expert-defined noisy heuristics to:

  1. automatically annotate data (see the sketch after this list).
  2. assess the reliability of the labels that experts assigned to individual data instances in a previous annotation campaign.

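As a rough illustration of this weak-supervision setup, the sketch below combines a few hypothetical heuristics over pre-computed EEG features into per-instance probabilistic labels using a simple independent-votes (naive Bayes) combination. The feature names, thresholds, and accuracies are placeholders, not our actual heuristics, and in practice the per-heuristic accuracies would be estimated by a learned label model rather than assumed.

```python
import numpy as np

ABSTAIN, NEG, POS = -1, 0, 1

# Hypothetical heuristics over pre-computed EEG features; the feature names
# and thresholds below are illustrative placeholders, not our actual rules.
def lf_burst_suppression(segment):
    return POS if segment["suppression_ratio"] > 0.5 else ABSTAIN

def lf_epileptiform_discharges(segment):
    return POS if segment["spike_rate_per_min"] > 1.0 else NEG

def lf_continuous_background(segment):
    return NEG if segment["amplitude_uv"] > 20 else ABSTAIN

LFS = [lf_burst_suppression, lf_epileptiform_discharges, lf_continuous_background]

def apply_lfs(segments):
    """Build the (n_segments x n_heuristics) vote matrix."""
    return np.array([[lf(seg) for lf in LFS] for seg in segments])

def probabilistic_labels(votes, accuracies, prior=0.5):
    """Naive-Bayes combination of heuristic votes into P(y = POS | votes),
    treating heuristics as independent given the true label."""
    probs = []
    for row in votes:
        log_odds = np.log(prior / (1 - prior))
        for v, acc in zip(row, accuracies):
            if v == ABSTAIN:
                continue
            # A correct vote has probability `acc`, an incorrect one 1 - acc.
            log_odds += np.log(acc / (1 - acc)) if v == POS else np.log((1 - acc) / acc)
        probs.append(1.0 / (1.0 + np.exp(-log_odds)))
    return np.array(probs)

# Two toy EEG segments described by illustrative summary features.
segments = [
    {"suppression_ratio": 0.7, "spike_rate_per_min": 2.3, "amplitude_uv": 8.0},
    {"suppression_ratio": 0.1, "spike_rate_per_min": 0.2, "amplitude_uv": 45.0},
]
votes = apply_lfs(segments)
# Assumed per-heuristic accuracies; a label model would estimate these from the votes.
probs = probabilistic_labels(votes, accuracies=[0.85, 0.75, 0.8])
print(probs)  # per-segment probability that the finding is present
```
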
Results on more than 7,000 hours of EEG data show that a model trained on the inferred labels identifies most of the common EEG findings with high sensitivity.
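
For the downstream step, here is a minimal sketch of training a classifier on such probabilistic labels, assuming they are used as soft targets in a cross-entropy loss; the feature representation and architecture are placeholders, not our actual EEG model.

```python
import torch
import torch.nn as nn

# Placeholder feature matrix (e.g., per-segment EEG features) and the
# probabilistic labels produced by the weak-supervision step above.
X = torch.randn(1024, 32)              # 1024 segments, 32 features (illustrative)
p_pos = torch.rand(1024)               # P(finding present | heuristic votes)
soft_targets = torch.stack([1 - p_pos, p_pos], dim=1)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    log_probs = torch.log_softmax(model(X), dim=1)
    # Soft cross-entropy: expected negative log-likelihood under the
    # probabilistic labels rather than hard 0/1 targets.
    loss = -(soft_targets * log_probs).sum(dim=1).mean()
    loss.backward()
    optimizer.step()

# At inference time, the trained model scores new EEG segments directly.
with torch.no_grad():
    pred = torch.softmax(model(X[:5]), dim=1)[:, 1]
```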

Our next steps include:

  1. developing systematic ways to analyze disagreements between the weak-supervision-derived probabilistic labels and expert annotations on the already-labeled data (see the sketch after this list).
  2. comparing our methodology with state-of-the-art methods.
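
As a hypothetical starting point for step 1, the sketch below compares label-model probabilities with expert annotations and flags high-confidence disagreements for re-review; the threshold and variable names are illustrative, not our actual analysis pipeline.

```python
import numpy as np

def disagreement_report(probs, expert_labels, confident=0.9):
    """Flag segments where the label model is confident but disagrees
    with the expert annotation; these are candidates for re-review."""
    ws_labels = (probs >= 0.5).astype(int)
    disagree = ws_labels != expert_labels
    confident_mask = np.maximum(probs, 1 - probs) >= confident
    flagged = np.where(disagree & confident_mask)[0]
    return {
        "disagreement_rate": float(disagree.mean()),
        "confident_disagreements": flagged,
    }

# Illustrative inputs: label-model probabilities and previously collected expert labels.
probs = np.array([0.97, 0.12, 0.55, 0.03, 0.88])
expert = np.array([0, 0, 1, 0, 1])
report = disagreement_report(probs, expert)
print(report["disagreement_rate"], report["confident_disagreements"])
```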

We are looking for candidates who have prior programming experience and have taken a machine learning and/or statistics course. Familiarity with probabilistic graphical models is a bonus.