Feature Weighted Models (FWM) to address lineage dependency in drug-resistance prediction from Mycobacterium tuberculosis genome sequences

Chang, Yu-Mei; Xia, Dong; Billows, Nina


Abstract

Background: Tuberculosis is caused by members of the Mycobacterium tuberculosis complex (MTBC) and is the second leading infectious killer after COVID-19. The evolution of drug resistance threatens successful treatment and disease eradication. Whole-genome sequencing combined with statistical and machine learning approaches is increasingly being adopted to predict drug resistance and characterise the underlying mutations. However, these approaches may not generalise well in clinical practice because of confounding from the clonal population structure of the MTBC.
Methods: To investigate how population structure affects machine learning prediction, we compared the performance of random forest (RF) models applied to a global dataset comprising 18,396 isolates (lineages 1-7; “global”) and to a subset containing isolates from two major lineages of the MTBC (lineages 2 and 4; n=10,464), modelled either separately (“lineage-specific”) or together (“combined”). To reduce lineage dependency in the models, we derived feature weights from a phylogenetic tree using Fitch’s parsimony and used them as probabilities for splitting nodes in the RF. The performance of feature-weighted RF models was compared with that of unweighted models and a traditional feature selection approach using area under the ROC curve (AUC-ROC), sensitivity, specificity and F1 score. The importance of features driving performance was measured by Gini importance and by the most frequent interactions in the model.
Results: All RF models achieved moderate to high performance (AUC-ROC range: 0.60-0.98). Models for first-line drugs performed better than those for second-line drugs, but performance varied depending on the drug-resistant phenotype and the lineages in the dataset. Lineage-specific models generally had higher sensitivity than global models, which may be underpinned by strain-specific drug-resistance mutations or sampling effects. Feature-weighted RF models had comparable performance to the unweighted models, and both feature weights and traditional feature selection approaches reduced lineage dependency in the model.
Conclusion: We show that predictive performance differs between lineages and that global predictions may not generalise well across all lineages. The application of feature weights mitigated confounding from population structure, but in some cases it reduced the importance of strain-specific drug-resistance mutations and increased confounding from co-occurring phenotypes. This highlights the importance of addressing confounding in machine learning prediction whilst considering the complex genetic interactions underlying drug resistance in tuberculosis.
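
Illustrative sketch (not taken from the article): the abstract states that feature weights were derived from a phylogenetic tree using Fitch's parsimony and used as probabilities for splitting nodes, but it does not give the formula. The Python below is a minimal sketch under stated assumptions: the tree is a rooted binary tree written as nested tuples, each feature is a binary mutation call per isolate, and a feature's weight is assumed to be its Fitch change count divided by the sum over all features, so that mutations arising independently many times receive more weight than mutations that simply mark a clade. The names `fitch_changes`, `feature_weights` and `sample_split_candidates`, and the toy data, are hypothetical.

```python
import random


def fitch_changes(tree, states):
    """Minimum number of state changes (Fitch parsimony score) for one
    binary feature on a rooted binary tree.

    tree   -- nested 2-tuples of leaf names, e.g. (("a", "b"), ("c", "d"))
    states -- dict mapping each leaf name to 0 or 1
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


def feature_weights(tree, feature_table):
    """ASSUMED weighting: Fitch change count normalised to sum to 1."""
    scores = {name: fitch_changes(tree, states)
              for name, states in feature_table.items()}
    total = sum(scores.values())
    if total == 0:                         # no variation: fall back to uniform
        return {name: 1 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


def sample_split_candidates(weights, k, rng):
    """Draw k candidate features for a node split, with replacement,
    using the weights as sampling probabilities."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Toy tree with eight isolates in two clades.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
features = {
    # Present in exactly one clade: a single change on the tree.
    "lineage_marker": {"a": 1, "b": 1, "c": 1, "d": 1,
                       "e": 0, "f": 0, "g": 0, "h": 0},
    # Scattered across both clades: four independent changes.
    "homoplastic_site": {"a": 1, "b": 0, "c": 1, "d": 0,
                         "e": 1, "f": 0, "g": 1, "h": 0},
}
weights = feature_weights(tree, features)
candidates = sample_split_candidates(weights, k=3, rng=random.Random(0))
```

In this toy case the clade-defining mutation scores one change and the scattered mutation scores four, so the scattered mutation is four times as likely to be offered as a split candidate. A real random forest would normally draw candidate features without replacement at each node; sampling with replacement is used here only to keep the sketch short.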

Citation

Chang, Y., Xia, D., & Billows, N. (2023). Feature Weighted Models (FWM) to address lineage dependency in drug-resistance prediction from Mycobacterium tuberculosis genome sequences. Bioinformatics, 39(7), Article btad428. https://doi.org/10.1093/bioinformatics/btad428

Journal Article Type: Article
Acceptance Date: Jul 6, 2023
Publication Date: Jul 10, 2023
Deposit Date: Jul 6, 2023
Publicly Available Date: Jul 25, 2023
Print ISSN: 1367-4803
Electronic ISSN: 1460-2059
Publisher: Oxford University Press
Peer Reviewed: Peer Reviewed
Volume: 39
Issue: 7
DOI: https://doi.org/10.1093/bioinformatics/btad428
Publisher URL: https://academic.oup.com/bioinformatics/article/39/7/btad428/7222183
