Since the infamous viral ALS Ice Bucket Challenge, there has been an increased movement toward precision medicine programs in ALS. Personalized or precision medicine is powerful because it treats each person living with a disease as an individual, accounting for variations in genes, environment, and lifestyle. With these tools, researchers can learn as much as possible from each person living with rare diseases like ALS.

The goal of the End ALS Challenge was to invite the AI community to explore ALS datasets and work together to surface insights and findings. In the spirit of this challenge, our team wanted to reflect on our learnings in this blog post. Anyone can review our full open-source winning submission here.

This challenge was led by Roche Canada’s Artificial Intelligence Centre of Excellence (AI CoE), in collaboration with Answer ALS and EverythingALS. It was administered by Kaggle, an online community of data scientists and machine learning practitioners, with the support of the ALS Society of Canada, the Ontario Brain Institute, and NetraMark.

Since this was our team’s first deep dive into personalized medicine, our first challenge was organizing a nimble multidisciplinary team to guide us through the data architecture, genomics, clinical, and machine learning components. Three team leads from Bowhead were joined by three of Bowhead’s advisors for this quest:

  • Cesar Diaz, Chief Technology Officer, Bowhead Health
  • Isay Castañeda, Data Scientist, Bowhead Health
  • Nikol Ricaño, Project Manager, Bowhead Health
  • Dr. Vu Tuan, M.D., Director, ALSA Certified Treatment Center of Excellence, University of South Florida Department of Neurology, USA
  • Juan Caballero, PhD, European Bioinformatics Institute (EMBL-EBI)
  • Jeff Bruce, PhD, Princess Margaret Bioinformatics Lab, Canada 

For any precision medicine program, a large amount of information needs to be collected from each participant, and in many cases, thousands of people participate in each program. Bowhead participated in this open science challenge on the Kaggle platform using a dataset provided by Johns Hopkins University containing de-identified ALS patients’ genomic and clinical records. This data came from a total of 150 datasets involving 1,000 people participating in Answer ALS, reported to be the most comprehensive clinical, genetic, molecular, and biochemical assessment of ALS to date.

Big Data vs. ALS 

ALS, or amyotrophic lateral sclerosis, is one of more than 7,000 identified rare diseases, and impacts approximately 225,000 people across the globe. Rates of ALS are expected to increase by 69 per cent by 2040 as the global population ages. The disease attacks motor neurons, which are responsible for controlling muscle movements like chewing, walking, or talking. Symptoms often develop between the ages of 55 and 75, and as ALS progresses, individuals lose their ability to do basic tasks and eventually become unable to breathe on their own. Most people with ALS die from respiratory failure, usually within 3 to 5 years of symptom onset.

The cause of ALS is not known, and scientists do not yet know why ALS strikes some people and not others. In 1993, scientists discovered that mutations in the SOD1 gene were associated with some cases of familial ALS, which accounts for 5 to 10 per cent of cases. Since then, more than a dozen additional genetic mutations have been identified.

Real-world data, such as the Answer ALS datasets used in this challenge, have the potential to shed light on these open questions: they can help us better understand disease mechanisms, identify gaps in diagnosis and treatment for improved disease management, and potentially uncover life-changing medicines.

Learnings from our data quest

This challenge presented our team with a curated collection of datasets from a number of Answer ALS sources. We were asked to model solutions to key questions that were developed and evaluated by ALS neurologists, researchers, and patient communities. These questions included (1) whether ALS has a single mechanism of action or is caused by different pathways, (2) which mechanisms underlie disease progression, and (3) whether there are genetic and symptom differences between ALS patients who progress faster versus slower.

Technical Deep Dive: The first challenge our data science team faced was making sense of a large set of unstructured clinical data. Dr. Vu brought his expertise as a clinician working with ALS patients to help our data science team see the stories these numbers were telling us. Once we crafted the right hypotheses, we started processes for data correction and normalization to develop a homogeneous model. We began with quantile normalization of the transcriptomic and clinical data, but this did not homogenize the results on its own. We then shifted to batch-effect removal, accounting for the differentiation stages of the iPSCs (induced pluripotent stem cells) in the transcriptomic data, which delivered a homogeneous dataset. With this homogeneous data, we ran an unsupervised hierarchical clustering model. We then compared the first two groups with statistical tests to assess their clinical relevance and to check whether the differences were associated with ALS prognosis.
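To make that sequence of steps concrete, here is a minimal, hypothetical Python sketch of the same shape of pipeline: quantile normalization, a simplified batch-effect adjustment, hierarchical clustering into two groups, and a statistical comparison on a clinical variable. This is not our submission code; the file names, the alsfrs_slope column, and the per-batch mean-centering (a simple stand-in for a dedicated method such as ComBat) are illustrative assumptions.

```python
# Minimal, hypothetical sketch of the analysis steps described above.
# File names, column names, and the simple per-batch centering are
# illustrative assumptions, not our actual submission code.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import mannwhitneyu

# Expression matrix: rows = genes, columns = samples (hypothetical files).
expr = pd.read_csv("transcriptomics.csv", index_col=0)
batch = pd.read_csv("sample_batches.csv", index_col=0)["batch"]  # e.g. iPSC differentiation stage
clinical = pd.read_csv("clinical.csv", index_col=0)              # indexed by the same sample IDs

# 1. Quantile normalization: give every sample the same value distribution.
rank_mean = expr.stack().groupby(expr.rank(method="first").stack().astype(int)).mean()
expr_qn = expr.rank(method="min").stack().astype(int).map(rank_mean).unstack()

# 2. Simplified batch-effect removal: center each gene within its batch
#    (a stand-in for dedicated tools such as ComBat).
expr_bc = expr_qn.copy()
for b in batch.unique():
    cols = batch.index[batch == b]
    expr_bc[cols] = expr_qn[cols].sub(expr_qn[cols].mean(axis=1), axis=0)

# 3. Unsupervised hierarchical clustering of samples, cut into two groups.
Z = linkage(expr_bc.T.values, method="ward")
clusters = pd.Series(fcluster(Z, t=2, criterion="maxclust"), index=expr_bc.columns)

# 4. Compare the two clusters on a clinical progression measure
#    (the alsfrs_slope column name is an assumption for illustration).
g1 = clinical.loc[clusters[clusters == 1].index, "alsfrs_slope"].dropna()
g2 = clinical.loc[clusters[clusters == 2].index, "alsfrs_slope"].dropna()
stat, pval = mannwhitneyu(g1, g2)
print(f"Cluster sizes: {len(g1)} vs {len(g2)}, Mann-Whitney U p-value = {pval:.3g}")
```

In our actual workflow, the batch variable corresponded to the iPSC differentiation stages and the clinical comparison focused on measures tied to prognosis; the sketch above only captures the overall structure of that approach.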

Our team appreciated the opportunity to learn and work with this dataset graciously provided by this program. We also recognized a need for even more data to make conclusive statements. Larger-scale datasets increase the chances of the novel discoveries that move the ALS field forward, and this data gap continues to be a challenge when studying rare diseases like ALS.

What’s next for personalized medicine?

This challenge left us excited about the power of data in rare disease research, and prompted some important questions we need to answer to truly realize this potential:

How might we rethink the ways we capture patient data? 
How might we use new technologies like wearables to capture data remotely? 
What tools could enable patients to securely own, manage, and share this data?
How might ALS patients be included as co-creators in precision medicine research?

Our dream destination is a world where patients are true partners in research, and are notified when their data is being used, even when it is de-identified for open-source challenges. We think this open data sharing would empower more people to share their data, and create a larger dialogue around harnessing the power of data for good in healthcare.

We are very grateful to all the organizers, advisors, and judges (listed below), and especially the anonymous patients who allowed us to sharpen our skills on this small stepping stone towards ending ALS. We believe the future of citizen-powered science is closer than we think.

End ALS Kaggle Challenge Collaborators:
  • Fanny Sie, Director, Roche AI Centre of Excellence
  • Paul Mooney, PhD, Developer Advocate, Kaggle
  • Terri Thompson, PhD, Multi-Omic Program Manager, Answer ALS
  • Emily Baxi, PhD, Assistant Professor, Johns Hopkins University
  • Andrew MacBride, Director, Convergent Genomics; Founder, Station X
  • Steve Finkbeiner, MD, PhD, Neurologist & Neuroscientist, Gladstone; Professor, UCSF
  • Joseph Geraci, PhD, CEO, NetraMark; Associate Professor, Queen's University
  • Antoaneta Vladimirova, PhD, Director of Medical AI, Roche
  • Indu Navar, MSc, CEO, EverythingALS, Peter Cohen Foundation
  • Deb Fabricatore, Advocacy Director, EverythingALS

Keep learning with us!

We’re excited to present our Bowhead Futures Digest: a newsletter for leaders looking to learn, debate, and co-create a future where technology & data empower our health.

We've also interviewed global experts from over 7 countries to understand the trends and tensions in the future of digital health. Explore our growing content library and filter by the topics that impact your work!