Efficient integration of heterogeneous single-cell transcriptomes with Scanorama

2022-06-24 00:18:00 【tzc_ fly】

fig1

Contents

Abstract
Main
Results
Summary

Abstract

Integrating single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories, and technologies can reveal richer biological insights, but current scRNA-seq data integration methods are limited by the requirement that the datasets come from functionally similar cells. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. We apply Scanorama to integrate and remove batch effects from 105,476 cells across 26 different scRNA-seq experiments representing 9 different technologies. Scanorama is sensitive to subtle temporal changes within the same cell lineage: it successfully integrates functionally similar cells in time-series data of CD14+ monocytes at different stages of differentiation into macrophages. Finally, we show that Scanorama is several orders of magnitude faster than existing techniques and can integrate 1,095,538 cells in about 9 hours.

Main

Individual single-cell RNA sequencing (scRNA-seq) experiments have been used to discover new cell states and reconstruct cell differentiation trajectories. Through global efforts, researchers are now generating large-scale, comprehensive scRNA-seq datasets that profile a wide variety of cellular functions and promise high-resolution insight into basic biology and disease processes. However, assembling large, unified reference datasets can be compromised by differences in experimental batch, sample donor, or experimental technology. Although recent methods have shown that scRNA-seq data can be integrated across multiple experiments, these methods implicitly assume either that all datasets share at least one common cell type or that gene expression profiles share largely the same correlation structure across all datasets. As a result, these methods are prone to overcorrection, especially when integrating scRNA-seq datasets that differ substantially.

Here we present Scanorama, a strategy for efficiently integrating multiple scRNA-seq datasets even when they are composed of heterogeneous transcriptional phenotypes. Our approach is analogous to computer vision algorithms for panorama stitching, which identify images with overlapping content and merge them into a larger panorama (Fig. 1a). Likewise, Scanorama automatically identifies scRNA-seq datasets that contain cells with similar transcriptional profiles and uses those matches for batch correction and integration (Fig. 1b). Scanorama is robust to different dataset sizes and sources, preserves dataset-specific content, and does not require all datasets to share at least one cell type.
fig2

Fig. 1: Schematic of assembling a "panorama" of datasets.
a: A panorama stitching algorithm finds and merges overlapping images to create a larger composite image.
b: A similar strategy can be used to merge heterogeneous scRNA-seq datasets. Scanorama searches for nearest neighbors to identify the cell types shared between all pairs of datasets. Dimensionality reduction and an approximate nearest neighbor algorithm based on hyperplane locality-sensitive hashing (LSH) and random projection trees greatly accelerate the search. Matched cells are then used to correct batch effects and to merge the datasets, so that the datasets stitched together on the basis of these matches form an scRNA-seq "panorama".

Our approach generalizes mutual nearest neighbors matching, a technique for finding similar elements between two datasets, to find similar elements among multiple datasets. Originally developed for pattern matching in images, mutual nearest neighbors matching has also been used to identify common cell types between two scRNA-seq datasets at a time. To align more than two datasets, however, existing methods select one dataset as a reference and then integrate all other datasets into it in turn, one at a time, which can lead to suboptimal results that depend on the order in which the datasets are considered. Although Scanorama takes a similar approach when aligning a collection of two datasets, on larger collections it is insensitive to order and less prone to overcorrection because it finds matches between all pairs of datasets.
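
The matching idea can be illustrated with a small, self-contained sketch. It is not Scanorama's implementation: the function names, the toy 2-D points, and the use of a single exact nearest neighbor are all assumptions made for illustration. It finds pairs of cells that are each other's nearest neighbor and runs that search over every pair of datasets rather than against one reference.

```python
from itertools import combinations
from math import dist


def nearest(query, reference):
    """Index of the point in `reference` closest to `query` (Euclidean)."""
    return min(range(len(reference)), key=lambda j: dist(query, reference[j]))


def mutual_nearest_neighbors(a, b):
    """Pairs (i, j) where a[i] and b[j] are each other's nearest neighbor."""
    a_to_b = [nearest(p, b) for p in a]
    b_to_a = [nearest(q, a) for q in b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def all_pairs_matches(datasets):
    """Run the pairwise search on every pair of datasets (no reference dataset)."""
    return {
        (s, t): mutual_nearest_neighbors(datasets[s], datasets[t])
        for s, t in combinations(range(len(datasets)), 2)
    }


# Hypothetical toy embeddings; each point stands in for one cell.
batch_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]  # two cells of type X, one of type Y
batch_b = [(0.1, 0.0), (5.1, 4.9)]              # one X cell, one Y cell
batch_c = [(4.8, 5.2)]                          # type Y only

matches = all_pairs_matches([batch_a, batch_b, batch_c])
for pair, found in matches.items():
    print(pair, found)
```

Because the result is keyed by dataset pair, it does not depend on the order in which the datasets are listed. This toy uses one neighbor and no distance cutoff, so a dataset that shares no cell type with another could still produce a spurious match; handling that case is outside the scope of this sketch.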

To make the search for matching cells across all datasets efficient, we introduce two key steps. First, instead of performing the nearest neighbor search in the high-dimensional gene space, we compress the gene expression profile of each cell into a low-dimensional embedding using an efficient randomized singular value decomposition (SVD) of the cell-by-gene expression matrix, which also improves the method's robustness to noise. Second, we use an approximate nearest neighbor search based on hyperplane locality-sensitive hashing and random projection trees to greatly reduce both the asymptotic and the practical nearest neighbor query time.
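
To give a feel for the second step, here is a minimal sketch of hyperplane locality-sensitive hashing on already-reduced embeddings. It is a generic illustration under stated assumptions, not Scanorama's code: the helper names, the number of hyperplanes, and the toy points are hypothetical, and random projection trees are not shown. Each point is hashed to one bit per random hyperplane, and a query is compared only against points that fall in the same bucket.

```python
import random
from collections import defaultdict
from math import dist


def make_hyperplanes(n_planes, dim, seed=0):
    """Random hyperplanes through the origin, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(point, hyperplanes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        sum(w * x for w, x in zip(plane, point)) >= 0.0 for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    """Group point indices by signature."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[signature(p, hyperplanes)].append(idx)
    return buckets


def approx_nearest(query, points, buckets, hyperplanes):
    """Search only the query's bucket; fall back to a full scan if it is empty."""
    candidates = buckets.get(signature(query, hyperplanes)) or range(len(points))
    return min(candidates, key=lambda i: dist(query, points[i]))


# Hypothetical low-dimensional embeddings, one tuple per cell.
points = [
    (1.0, 0.2, -0.3),
    (0.9, 0.1, -0.2),
    (-1.0, 0.5, 0.8),
    (-0.8, 0.4, 1.0),
    (0.1, -1.2, 0.3),
]
planes = make_hyperplanes(n_planes=4, dim=3)
index = build_index(points, planes)
hit = approx_nearest((1.0, 0.2, -0.3), points, index, planes)
print(hit, dict(index))
```

The trade-off is the usual one for approximate search: more hyperplanes give smaller buckets and faster queries, but raise the chance that the true nearest neighbor lands in a different bucket and is missed.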

Scanorama supports both scRNA-seq dataset integration and batch correction. Although batch correction incurs additional computational cost, Scanorama makes it feasible for large datasets, which enables a wider range of downstream analyses. For example, differential expression analysis can be performed on the batch-corrected gene expression data.
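
As a rough intuition for how matched cells can drive batch correction, the sketch below shifts one dataset toward another by the average offset between matched pairs. This is a deliberately simplified stand-in, a single global translation, and should not be read as the correction Scanorama actually applies; the function name and the toy data are hypothetical.

```python
def mean_shift_correct(target, source, matches):
    """Translate `source` toward `target` by the mean offset over matched pairs.

    matches: list of (i, j) meaning target[i] is matched to source[j].
    Returns a new list of tuples; with no matches, `source` is returned unchanged.
    """
    if not matches:
        return [tuple(p) for p in source]
    dim = len(source[0])
    shift = [
        sum(target[i][d] - source[j][d] for i, j in matches) / len(matches)
        for d in range(dim)
    ]
    return [tuple(x + s for x, s in zip(p, shift)) for p in source]


# Toy example: the second batch is the first one offset by (1, 2).
reference = [(0.0, 0.0), (4.0, 4.0)]
shifted = [(1.0, 2.0), (5.0, 6.0)]
pairs = [(0, 0), (1, 1)]

corrected = mean_shift_correct(reference, shifted, pairs)
print(corrected)
```

A single translation can only remove a batch effect that is identical for every cell. Effects that differ by cell type would need a correction that varies across the embedding, which this sketch does not attempt.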

Results

fig3

Fig. 2: Scanorama correctly integrates a simple collection of datasets where other methods fail.
a: We applied Scanorama to integrate three datasets: one consisting entirely of Jurkat cells (n = 3,257 cells), one consisting entirely of 293T cells (n = 2,885 cells), and one 50/50 mixture of Jurkat and 293T cells (n = 3,388 cells).
b: Our method correctly integrates the Jurkat cells (orange) and 293T cells (blue) into two separate clusters.
c and d: Existing scRNA-seq dataset integration methods are sensitive to the order in which they consider the datasets and can incorrectly merge the Jurkat dataset with the 293T dataset, forming clusters that do not correspond to actual cell types. The scran MNN result is shown in c and the Seurat CCA result in d.

fig4

Fig. 3: Panoramic integration of 26 single-cell datasets across nine different sequencing technologies.
a: t-SNE visualization of 105,476 cells after batch correction with our method.
b and c: Other scRNA-seq dataset integration methods (scran MNN and Seurat CCA) are not designed for heterogeneous dataset integration and therefore tend to naively merge all datasets into one large cluster.
d and e: Scanorama integrates the 105,476 cells from 26 datasets in less than 6 minutes and with less than 12 GB of RAM, which is much more efficient than current scRNA-seq integration methods.

fig5

Fig. 4: Scanorama scales to the integration of datasets containing more than one million cells.
a: Scanorama integrates 1,095,538 cells from the mouse brain and spinal cord.
b to j: Marker genes reveal cell-type-specific clusters.

fig6

Fig. 5: Scanorama is sensitive to subtle transcriptional changes in cell state over time.
a to c: The rows and columns of each heat map correspond to the different datasets in a time-course study (including datasets from the same time point). Higher alignment scores (dark blue) tend to lie near the diagonal, indicating greater transcriptional similarity between datasets from closer time points. In every time-series experiment, time difference and alignment score are significantly correlated.
d to f: Cells are visualized by the pseudotime assigned by the Monocle 2 algorithm. Scanorama removes the batch effect in the CD14+ monocytes, whereas after correction with scran MNN, Monocle 2 can no longer recover the correct trajectory. The uncorrected data are shown in d, the Scanorama result in e, and the scran MNN result in f.

Summary

The method proposed in this paper is an efficient solution for integrating scRNA-seq data. Unlike the earlier ingest approach, it does not require a specific reference dataset to be designated; the idea behind such earlier methods is essentially to merge the data into one larger cluster (domain adaptation toward the reference dataset). Scanorama borrows the idea of panorama stitching to integrate multiple datasets simultaneously: it automatically matches the same cell types across different datasets while preserving the differences between distinct cell types, which makes it a more reasonable integration method.

Heterogeneous integration here means integration that retains the real differences between datasets. Scanorama correctly removes the batch effects introduced by technology and experiment.

Original site

Copyright notice
This article was written by [tzc_ fly]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/175/202206232147303496.html