当前位置:网站首页>Efficient integration of heterogeneous single cell transcriptome with scanorama

Efficient integration of heterogeneous single cell transcriptome with scanorama

2022-06-24 00:18:00 tzc_ fly

fig1

Abstract

Integration from multiple experiments 、 Laboratories and different technologies single-cell RNA sequencing(scRNA-seq) The data can reveal more abundant biological problems , But for now scRNA-seq Data integration methods are limited by the requirements of data sets from functionally similar cells . We proposed Scanorama Algorithm , The algorithm can identify and merge the shared cell types among all data set pairs , And accurately integrate scRNA-seq A heterogeneous collection of data . We apply Scanorama Consolidates and eliminates representatives from 9 Of different technologies 26 Different scRNA-seq Experimental 105,476 Batch effect of cells .Scanorama Sensitive to subtle temporal changes within the same cell lineage , Successfully integrated CD14+ monocyte (monocytes) They differentiate into macrophages at different stages of differentiation (macrophages) Cells with similar functions in time series data . Last , We show that Scanorama Several orders of magnitude faster than existing technology , You can make an appointment with 9 Consolidation within hours 1,095,538 Cells .

Main

independent single-cell RNA sequencing(scRNA-seq) Experiments have been used to discover new cell states and reconstruct cell differentiation trajectories . Through the efforts of global scientists , Researchers are currently generating large-scale 、 comprehensive scRNA-seq Data sets , These datasets describe a variety of cellular functions , It is expected to realize high-resolution observation of basic biology and disease processes . However , Due to the experimental batch 、 Differences in sample donors or experimental techniques , Combining large, unified reference datasets may be affected . Although recent methods have shown that , Can be integrated in multiple experiments scRNA-seq, But these methods automatically assume that all datasets share at least one common cell type , Or gene expression profiles share basically the same related structure in all data sets . therefore , These methods are prone to over correction , Especially in integrating scRNA-seq When there are very different data sets .

Here it is , We have put forward Scanorama: One way to effectively integrate multiple scRNA-seq Strategy for data sets , Even though they are composed of heterogeneous transcriptional phenotypes . Our method is similar to the computer vision algorithm for panoramic mosaic , The algorithm recognizes images with overlapping content , And merge these images into a larger panorama ( chart 1a). Again ,Scanorama Automatic identification of cells containing similar transcriptional profiles scRNA-seq Data sets , These matches can be used for batch correction and integration ( chart 1b).Scanorama Strong robustness to different data set sizes and sources , Data set specific content is preserved , And it is not required that all data sets share at least one cell type .
fig2

  • chart 1:" panorama " Schematic diagram of data set composition .
  • a: Panoramic mosaic algorithm finds and merges overlapping images , To create a larger composite image .
  • b: A similar strategy can also be used to merge heterogeneous scRNA-seq Data sets .Scanorama Search for the nearest neighbor , To determine the shared cell type between all data set pairs . Based on hyperplane locally sensitive hash LSH And the dimension reduction technology of random projection tree and approximate nearest neighbor algorithm greatly accelerate the search speed . Linked cells form a matching relationship , It can be used to correct batch effects and combine them , Thus, the data set formed by connecting on the basis of these matches becomes scRNA-seq Of " panorama ".

Our method will match each other's nearest neighbors ( A technique for finding similar elements between two data sets ) Extend to find similar elements among multiple datasets . It was originally developed for pattern matching in images , Finding the nearest neighbor is also used to identify two at a time scRNA-seq Common cell types between datasets . However , To align more than two datasets , Existing methods select a data set as a reference , All other data sets are then integrated into the reference in turn , One at a time , This may lead to suboptimal results , And depends on the order in which the data set is considered . Even though Scanorama A similar approach is used when aligning sets of two data sets , But on larger data sets , It is not sequence sensitive , And it is not easy to make excessive correction , Because it can find a match between all data set pairs .

To optimize the process of searching for matching cells in all data sets , We introduced two key steps . We don't do the nearest neighbor search in the high-dimensional gene space , Instead, the gene expression matrix is used to perform effective random singular value decomposition for each cell (SVD), The gene expression profile of each cell was compressed into low dimensional embedding , This also helps to improve the robustness of the method to noise . Besides , We use approximate nearest neighbor search based on hyperplane locally sensitive hash and random projection tree , To greatly reduce the asymptotic and actual nearest neighbor query time .

Scanorama Can achieve scRNA-seq Data set integration and batch calibration . Even though Scanorama It will bring more computing costs , But it makes batch correction feasible for large data sets , Thus, more extensive downstream analysis can be carried out . for example , We can perform differential expression analysis on batch corrected gene expression data .

result

fig3

  • chart 2:Scanorama Correctly integrated a simple data set set , Other methods have failed .
  • a: We will Scanorama Applied to integrate three data sets : One is completely Jurkat cells (n=3257 Cells )、 One is completely 293T cells (n=2885 Cells ) And a 50/50 Proportionally mixed Jurkat and 293T Cell data set (n=3388 Cells ).
  • b: Our method correctly puts Jurkat cells ( Orange ) and 293T cells ( Blue ) Integrate into two independent clusters .
  • c and d: The existing scRNA-seq The data set integration approach is sensitive to the order in which it considers the data sets , And it's possible that Jurkat Data set and 293T Datasets are incorrectly merged , Forming clusters that do not correspond to the actual cell type :scran MNN The results of the integration are shown in c, and Seurat CCA The results of the integration are shown in d.

fig4

  • chart 3: Across nine different sequencing technologies 26 Panoramic integration of single cell data sets .
  • a: Use our method to 105476 Cells after batch correction t-SNE Distribution .
  • b and c: other scRNA-seq Data set integration approach (scran MNN and Seurat CCA) Not designed for heterogeneous data set integration , Therefore, there is always a naive tendency to merge all data sets into a large cluster .
  • d and e:Scanorama In less than 6 In minutes , Below 12 GB Of RAM in , Integrated 26 Of data sets 105476 Cells , This is better than the current scRNA-seq The integration approach is much more efficient .

fig5

  • chart 4:Scanorama Extensible to include 100 Integration of data sets of more than ten thousand cells .
  • a:Scanorama It integrates the brain and spinal cord of mice 1095538 Cells .
  • b To j: Use marker genes to reveal cell type specific clusters .

fig6

  • chart 5:Scanorama Sensitive to subtle transcriptional changes in cellular state over time .
  • a To c: The rows and columns of the heat map correspond to different data sets in the study of time processes ( Include data sets at the same point in time ). Higher comparison scores ( Navy Blue ) Tends to approach the diagonal , This indicates that the transcriptional similarity between datasets from more recent time points is greater . In every time series experiment , Time differences and alignment scores were significantly correlated .
  • d To f: according to Monocle 2 The pseudo time allocated by the algorithm is visualized ,Scanorama Eliminates the CD14+ Batch effect of monocytes . But in the use of scran MNN After correction ,Monocle 2 The correct track can no longer be recognized . The original data are shown in d、Scanorama see e and scran MNN see f.

summary

The method proposed in this article is to integrate scRNA-seq Efficient solutions , Past ingest Different , There is no need to specify a specific reference data set , In fact, the idea of the past method is to integrate data into a larger cluster ( The domain is adapted to the reference data set ).Scanorama Using the idea of panoramic generation , It can realize the integration of multiple data sets at the same time , It can automatically match the same cell type under different data sets , At the same time, the differences of different cell types under different data sets are preserved , Is a more reasonable integration method .

Heterogeneity refers to the integration of information that retains the real differences between the two data sets .Scanorama Belongs to the realization of the correct elimination of technology , The batch effect brought about by the experiment .

原网站

版权声明
本文为[tzc_ fly]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/175/202206232147303496.html