Daily Digest | April 24, 2024

Generative models improve fairness of medical classifiers under distribution shifts | Nature Medicine

Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. Researchers hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching the training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. They show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner.

Research paper

 

Single Cell Atlas: a single-cell multi-omics human cell encyclopedia | Genome Biology

Single-cell sequencing datasets are key in biology and medicine for unraveling insights into heterogeneous cell populations with unprecedented resolution. Here, researchers construct a single-cell multi-omics map of human tissues through in-depth characterizations of datasets from five single-cell omics, spatial transcriptomics, and two bulk omics across 125 healthy adult and fetal tissues. They also build a complementary web-based platform, the Single Cell Atlas (SCA, www.singlecellatlas.org), to enable extensive interactive exploration of deep multi-omics signatures across human fetal and adult tissues.

Research paper

 

brainlife.io: a decentralized and open-source cloud platform to support neuroscience research | Nature Methods

Neuroscience is advancing standardization and tool development to support rigor and transparency. Consequently, data pipeline complexity has increased, hindering FAIR (findable, accessible, interoperable and reusable) access. brainlife.io was developed to democratize neuroimaging research. The platform provides data standardization, management, visualization and processing and automatically tracks the provenance history of thousands of data objects.

Research paper

 

Daily Digest | April 23, 2024

Pretraining a foundation model for generalizable fluorescence microscopy-based image restoration | Nature Methods

Fluorescence microscopy-based image restoration has received widespread attention in the life sciences and has led to significant progress, benefiting from deep learning technology. However, most current task-specific methods have limited generalizability to different fluorescence microscopy-based image restoration problems. Here, researchers seek to improve generalizability and explore the potential of applying a pretrained foundation model to fluorescence microscopy-based image restoration. They provide a universal fluorescence microscopy-based image restoration (UniFMIR) model to address different restoration problems, and show that UniFMIR offers higher image restoration precision, better generalization and increased versatility.

Research paper

 

Prediction of protein-RNA interactions from single-cell transcriptomic data | Nucleic Acids Research

Proteins are crucial in regulating every aspect of RNA life, yet understanding their interactions with coding and noncoding RNAs remains limited. Experimental studies are typically restricted to a small number of cell lines and a limited set of RNA-binding proteins (RBPs). Although computational methods based on physico-chemical principles can predict protein-RNA interactions accurately, they often lack the ability to consider cell-type-specific gene expression and the broader context of gene regulatory networks (GRNs). Here, researchers assess the performance of several GRN inference algorithms in predicting protein-RNA interactions from single-cell transcriptomic data, and propose a pipeline, called scRAPID (single-cell transcriptomic-based RnA Protein Interaction Detection), that integrates these methods with the catRAPID algorithm, which can identify direct physical interactions between RBPs and RNA molecules.

Research paper

 

Demographic bias in misdiagnosis by computational pathology models | Nature Medicine

Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, the authors show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas.

Research paper

 

Daily Digest | April 22, 2024

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms | Nature Methods

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.

Research paper

 

DANCE: a deep learning library and benchmark platform for single-cell analysis | Genome Biology

DANCE is the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across the spectrum of benchmark datasets for numerous single-cell analysis tasks. Currently, DANCE supports 3 modules and 8 popular tasks with 32 state-of-the-art methods on 21 benchmark datasets.

Research paper

 

Towards a general-purpose foundation model for computational pathology | Nature Medicine

Quantitative evaluation of tissue images is crucial for computational pathology (CPath) tasks, requiring the objective characterization of histopathological entities from whole-slide images (WSIs). The high resolution of WSIs and the variability of morphological features present significant challenges, complicating the large-scale annotation of data for high-performance applications. To address this challenge, current efforts have proposed the use of pretrained image encoders through transfer learning from natural image datasets or self-supervised learning on publicly available histopathology datasets, but these encoders have not been extensively developed and evaluated across diverse tissue types at scale. Researchers introduce UNI, a general-purpose self-supervised model for pathology, pretrained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.

Research paper

 

Daily Digest | April 21, 2024

A visual-language foundation model for computational pathology | Nature Medicine

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks across a diverse array of diseases and patient cohorts. Researchers introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining. Evaluated on a suite of 14 diverse benchmarks, CONCH can be transferred to a wide range of downstream tasks involving histopathology images and/or text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, and text-to-image and image-to-text retrieval.

Research paper

 

Domain-specific optimization and diverse evaluation of self-supervised models for histopathology | arXiv

Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, researchers describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL).

Research paper

 

Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data | Nature Communications

Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Here researchers systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers.

Research paper

 

Daily Digest | April 20, 2024

Development and validation of a new algorithm for improved cardiovascular risk prediction | Nature Medicine

QRISK algorithms use data from millions of people to help clinicians identify individuals at high risk of cardiovascular disease (CVD). Here, researchers derive and externally validate a new algorithm, QR4, that incorporates novel risk factors to estimate 10-year CVD risk separately for men and women. Health data from 9.98 million and 6.79 million adults from the United Kingdom were used for derivation and validation of the algorithm, respectively.

Research paper

 

Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes | Genome Biology

As most viruses remain uncultivated, metagenomics is currently the main method for virus discovery. Detecting viruses in metagenomic data is not trivial. In the past few years, many bioinformatic virus identification tools have been developed for this task, making it challenging to choose the right tools, parameters, and cutoffs. As all these tools measure different biological signals, and use different algorithms and training and reference databases, independent benchmarking is needed to give users objective guidance. Researchers compare the performance of nine state-of-the-art virus identification tools in thirteen modes on eight paired viral and microbial datasets from three distinct biomes, including a new complex dataset from Antarctic coastal waters.

Research paper

 

spVC for the detection and interpretation of spatial gene expression variation | Genome Biology

Spatially resolved transcriptomics technologies have opened new avenues for understanding gene expression heterogeneity in spatial contexts. However, existing methods for identifying spatially variable genes often focus solely on statistical significance, limiting their ability to capture continuous expression patterns and integrate spot-level covariates. To address these challenges, researchers introduce spVC, a statistical method based on a generalized Poisson model. spVC seamlessly integrates constant and spatially varying effects of covariates, facilitating comprehensive exploration of gene expression variability and enhancing interpretability.

Research paper

 

Daily Digest | April 19, 2024

Benchmarking spatial clustering methods with spatially resolved transcriptomics data | Nature Methods

Spatial clustering, which shares an analogy with single-cell clustering, has expanded the scope of tissue physiology studies from cell-centroid to structure-centroid with spatially resolved transcriptomics (SRT) data. Computational methods have undergone remarkable development in recent years, but a comprehensive benchmark study is still lacking. Here researchers present a benchmark study of 13 computational methods on 34 SRT data (7 datasets). Performance was evaluated on the basis of accuracy, spatial continuity, marker gene detection, scalability, and robustness.

Research paper

 

Prediction of metabolites associated with somatic mutations in cancers by using genome-scale metabolic models and mutation data | Genome Biology

Oncometabolites, often generated as a result of a gene mutation, show pro-oncogenic function when abnormally accumulated in cancer cells. Here researchers report the development of a computational workflow that predicts metabolite-gene-pathway sets. Metabolite-gene-pathway sets present metabolites and metabolic pathways significantly associated with specific somatic mutations in cancers. The computational workflow uses both cancer patient-specific genome-scale metabolic models (GEMs) and mutation data to generate metabolite-gene-pathway sets.

Research paper

 

Interrogations of single-cell RNA splicing landscapes with SCASL define new cell identities with physiological relevance | Nature Communications

RNA splicing shapes the gene regulatory programs that underlie various physiological and disease processes. Here, researchers present the SCASL (single-cell clustering based on alternative splicing landscapes) method for interrogating the heterogeneity of RNA splicing with single-cell RNA-seq data. SCASL resolves the issue of biased and sparse data coverage on single-cell RNA splicing and provides a new scheme for classifications of cell identities.

Research paper

 

Daily Digest | April 18, 2024

Improving microbial phylogeny with citizen science within a mass-market video game | Nature Biotechnology

Citizen science video games are designed primarily for users already inclined to contribute to science, which severely limits their accessibility for an estimated community of 3 billion gamers worldwide. Researchers created Borderlands Science (BLS), a citizen science activity that is seamlessly integrated within a popular commercial video game played by tens of millions of gamers. This integration is facilitated by a novel game-first design of citizen science games, in which the game design aspect has the highest priority, and a suitable task is then mapped to the game design. BLS crowdsources a multiple alignment task of 1 million 16S ribosomal RNA sequences obtained from human microbiome studies. Since its initial release on 7 April 2020, over 4 million players have solved more than 135 million science puzzles, a task unsolvable by a single individual. Leveraging these results, they show that their multiple sequence alignment simultaneously improves microbial phylogeny estimations and UniFrac effect sizes compared to state-of-the-art computational methods.

Research paper

 

Topological benchmarking of algorithms to infer Gene Regulatory Networks from Single-Cell RNA-seq Data | Bioinformatics

In recent years, many algorithms for inferring gene regulatory networks from single-cell transcriptomic data have been published. Several studies have evaluated their accuracy in estimating the presence of an interaction between pairs of genes. However, these benchmarking analyses do not quantify the algorithms’ ability to capture structural properties of networks, which are fundamental, for example, for studying the robustness of a gene network to external perturbations. Here, researchers devise a three-step benchmarking pipeline called STREAMLINE that quantifies the ability of algorithms to capture topological properties of networks and identify hubs.

Research paper

 

Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning | Nature Medicine

Cancer of unknown primary (CUP) site poses diagnostic challenges due to its elusive nature. Many cases of CUP manifest as pleural and peritoneal serous effusions. Leveraging cytological images from 57,220 cases at four tertiary hospitals, researchers developed a deep-learning method for tumor origin differentiation using cytological histology (TORCH) that can identify malignancy and predict tumor origin in both hydrothorax and ascites. They examined its performance on three internal (n = 12,799) and two external (n = 14,538) testing sets. In both internal and external testing sets, TORCH achieved area under the receiver operating characteristic curve values ranging from 0.953 to 0.991 for cancer diagnosis and 0.953 to 0.979 for tumor origin localization. TORCH accurately predicted primary tumor origins, with a top-1 accuracy of 82.6% and top-3 accuracy of 98.9%.
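For readers less familiar with the top-1 and top-3 figures quoted above, the following is a minimal sketch of how a top-k accuracy is typically computed from ranked predictions. The function name, the tissue labels, and the four toy cases are hypothetical and are not taken from the TORCH study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases whose true label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical toy data: each row lists candidate origins, most likely first.
ranked = [
    ["lung", "breast", "ovary"],
    ["stomach", "colon", "lung"],
    ["breast", "ovary", "lung"],
    ["colon", "stomach", "ovary"],
]
truth = ["lung", "colon", "lung", "breast"]

top1 = top_k_accuracy(ranked, truth, k=1)  # only the first case is a top-1 hit
top3 = top_k_accuracy(ranked, truth, k=3)  # the first three cases are top-3 hits
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Top-k accuracy can only stay the same or rise as k grows, which is why a top-3 figure is always at least as high as the corresponding top-1 figure.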

Research paper

 

Daily Digest | April 17, 2024

Foundation model for cancer imaging biomarkers | Nature Machine Intelligence

Foundation models in deep learning are characterized by a single large-scale model trained on vast amounts of data serving as the foundation for various downstream tasks. Foundation models are generally trained using self-supervised learning and excel in reducing the demand for training samples in downstream applications. This is especially important in medicine, where large labelled datasets are often scarce. Here, researchers developed a foundation model for cancer imaging biomarker discovery by training a convolutional encoder through self-supervised learning using a comprehensive dataset of 11,467 radiographic lesions. The foundation model was evaluated in distinct and clinically relevant applications of cancer imaging-based biomarkers.

Research paper

 

PIFiA: self-supervised approach for protein functional annotation from single-cell imaging data | Molecular Systems Biology

Fluorescence microscopy data describe protein localization patterns at single-cell resolution and have the potential to reveal whole-proteome functional information with remarkable precision. Yet, extracting biologically meaningful representations from cell micrographs remains a major challenge. Existing approaches often fail to learn robust and noise-invariant features or rely on supervised labels for accurate annotations. Researchers developed PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data.

Research paper

 

Cell type signatures in cell-free DNA fragmentation profiles reveal disease biology | Nature Communications

Circulating cell-free DNA (cfDNA) fragments have characteristics that are specific to the cell types that release them. Current methods for cfDNA deconvolution typically use disease-tailored marker selection in a limited number of bulk tissues or cell lines. Here, researchers utilize single-cell transcriptome data as a comprehensive cellular reference set for disease-agnostic cfDNA cell-of-origin analysis. They correlate cfDNA-inferred nucleosome spacing with gene expression to rank the relative contribution of over 490 cell types to plasma cfDNA. In 744 healthy individuals and patients, they uncover cell type signatures in support of emerging disease paradigms in oncology and prenatal care. They train predictive models that can differentiate patients with colorectal cancer (84.7%), early-stage breast cancer (90.1%), multiple myeloma (AUC 95.0%), and preeclampsia (88.3%) from matched controls.

Research paper

 

Daily Digest | April 16, 2024

scGHOST: identifying single-cell 3D genome subcompartments | Nature Methods

Single-cell Hi-C (scHi-C) technologies allow for probing of genome-wide cell-to-cell variability in three-dimensional (3D) genome organization from individual cells. Here researchers present scGHOST, a single-cell subcompartment annotation method using graph embedding with constrained random walk sampling. Applications of scGHOST to scHi-C data and contact maps derived from single-cell 3D genome imaging demonstrate reliable identification of single-cell subcompartments, offering insights into cell-to-cell variability of nuclear subcompartments.

Research paper

 

BISCUIT: an efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies | Nucleic Acids Research

Data from both bulk and single-cell whole-genome DNA methylation experiments are under-utilized in many ways. This is attributable to inefficient mapping of methylation sequencing reads, routinely discarded genetic information, and neglected read-level epigenetic and genetic linkage information. Researchers introduce the BISulfite-seq Command line User Interface Toolkit (BISCUIT) and its companion R/Bioconductor package, biscuiteer, for simultaneous extraction of genetic and epigenetic information from bulk and single-cell DNA methylation sequencing.

Research paper

 

PLMSearch: Protein language model powers accurate and fast sequence search for remote homology | Nature Communications

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here researchers propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large amount of real structure similarity data. This enables PLMSearch to capture the remote homology information concealed behind the sequences.

Research paper

 

Daily Digest | April 15, 2024

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data | Nature Biotechnology

Existing methods for gene regulatory network (GRN) inference rely on gene expression data alone or on lower resolution bulk data. Despite the recent integration of chromatin accessibility and RNA sequencing data, learning complex mechanisms from limited independent data points still presents a daunting challenge. Here researchers present LINGER (Lifelong neural network for gene regulation), a machine-learning method to infer GRNs from single-cell paired gene expression and chromatin accessibility data. LINGER incorporates atlas-scale external bulk data across diverse cellular contexts and prior knowledge of transcription factor motifs as a manifold regularization.

Research paper

 

Comprehensive transcriptome analysis reveals altered mRNA splicing and post-transcriptional changes in the aged mouse brain | Nucleic Acids Research

A comprehensive understanding of molecular changes during brain aging is essential to mitigate cognitive decline and delay neurodegenerative diseases. The interpretation of mRNA alterations during brain aging is influenced by the health and age of the animal cohorts studied. Here, researchers carefully consider these factors and provide an in-depth investigation of mRNA splicing and dynamics in the aging mouse brain, combining short- and long-read sequencing technologies with extensive bioinformatic analyses.

Research paper

 

Accurately clustering biological sequences in linear time by relatedness sorting | Nature Communications

Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, the author set out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem.
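The general idea of sorting by relatedness so that each sequence is compared with only a bounded number of neighbours can be illustrated with a toy sketch. This is not the Clusterize algorithm: the k-mer Jaccard similarity, the choice of the first sequence as the sorting reference, the `window` and `threshold` parameters, and all function names are assumptions made purely for illustration. The sort itself costs O(n log n) here; only the number of pairwise comparisons is bounded by `window` per sequence.

```python
def kmer_set(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_sorted(seqs, threshold=0.5, window=2, k=3):
    """Greedy clustering after sorting by similarity to a reference (toy version)."""
    if not seqs:
        return []
    profiles = [kmer_set(s, k) for s in seqs]
    reference = profiles[0]  # assumption: first sequence serves as the reference
    # Sort indices so that sequences related to the reference end up adjacent.
    order = sorted(
        range(len(seqs)),
        key=lambda i: jaccard(profiles[i], reference),
        reverse=True,
    )
    labels = [-1] * len(seqs)
    representatives = []  # (cluster id, k-mer profile) in order of creation
    for i in order:
        # Compare only with the most recent `window` cluster representatives.
        for cluster_id, rep in representatives[-window:]:
            if jaccard(profiles[i], rep) >= threshold:
                labels[i] = cluster_id
                break
        else:
            labels[i] = len(representatives)
            representatives.append((labels[i], profiles[i]))
    return labels


# Hypothetical toy sequences: two pairs of near-identical strings.
toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
print(cluster_sorted(toy))
```

Because every sequence is checked against at most `window` representatives, the comparison step grows linearly with the number of sequences, at the cost of possibly missing matches that fall outside the window.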

Research paper