
Single-cell literature notes (Part 3): DSTG, deconvoluting spatial transcriptomics data through graph-based AI

2022-06-22 06:33:00 【GoatGui】

Learning notes, for reference only. Please point out any mistakes so they can be corrected.

Keywords: spatial transcriptomics; deconvolution; graph-based artificial intelligence; single-cell RNA-seq

Contents

- DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence

DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence

Abstract

Recently developed spatial transcriptomics (ST) can link the spatial information of spots on a tissue section with the RNA abundance of the cells within each spot, which is particularly important for understanding the structure and function of tissues and cells. However, because a spot is usually larger than a single cell, the gene expression measured at each spot comes from a mixture of cells with heterogeneous cell types. The ST data at each spot therefore need to be disentangled to reveal the cell composition at that spatial location. In this study, we propose a novel method, deconvoluting spatial transcriptomics data through a graph-based convolutional network (DSTG), which accurately deconvolutes the observed gene expression at each spot and recovers its cellular composition, achieving high-level segmentation and revealing the spatial architecture of cellular heterogeneity within tissues. DSTG not only shows superior performance on synthetic spatial data generated under different protocols, but also effectively identifies the spatial composition of cells in mouse cortex, hippocampus slices and pancreatic tumor tissues. In summary, DSTG accurately uncovers cell states and subpopulations based on spatial localization. DSTG is available as ready-to-use open-source software (https://github.com/Su-informatics-lab/DSTG) for precise interrogation of spatial organizations and functions in tissues.

Introduction

Different types of cells are spatially and structurally organized within tissues to perform their functions. Revealing the complex spatial architecture of heterogeneous tissues is important for understanding cellular mechanisms and functions in disease. The rapid development of single-cell RNA sequencing (scRNA-seq) has drawn attention to elucidating how heterogeneous cells form [1-4] and to tracing lineages within tissues [5-7]. Unfortunately, lacking spatial information, scRNA-seq cannot identify the structural organization of heterogeneous cells in complex tissues. As a complement to scRNA-seq, transcriptome profiling methods with spatial resolution [8-10] have therefore been introduced. To reveal the spatial cellular structure of tissues, sequencing-based high-throughput spatial transcriptomics (ST) technologies [11-14], such as 10X Genomics Visium [8] and Slide-seq [15, 16], use spatially indexed barcodes with RNA sequencing, allowing the transcriptome of a single tissue section to be quantified with spatial resolution.

Emerging ST technologies can spatially index transcripts and measure expression profiles, which advances our understanding of precise tissue architecture. However, the resolution of ST data is much lower than the single-cell level. The transcripts captured at a specific location by a "spot" [8] or "bead" [15, 16] usually come from a mixture of heterogeneous cells. For example, Visium, a microarray-based ST technology developed by 10X Genomics, uses spots with a diameter of 50 μm, each covering 10-20 cells on average [17]. Even with Slide-seq [15, 16], which quantifies gene expression at high resolution (10 μm), a pixel may still overlap multiple cells. The gene expression measured at a "spot" therefore reflects a mixture of cells, and revealing the cell composition of each spot in ST data is essential for high-resolution investigation of the molecular and cellular architecture of tissues.

To address this problem, only a few tailored methods have been developed so far. SPOTlight [18] is a deconvolution algorithm based on non-negative matrix factorization regression and non-negative least squares, and it has been successfully applied to ST data [16]. Specifically, SPOTlight uses reference scRNA-seq data to identify cell type-specific topic profiles, which are then used to deconvolute the spatial spots. Because it deconvolutes ST data using cell states and subpopulations identified from scRNA-seq data, this approach indicates that well-characterized scRNA-seq data can help and facilitate the exploration of spatial datasets. A major limitation of such ST deconvolution methods is that they cannot effectively learn and exploit the intrinsic topological information of the cell types within spots, which carries information about the observed gene expression patterns and the associated cell types at spots.

In recent years, graph convolutional networks (GCN) [19] have shown a strong ability to exploit the intrinsic topological information of data to improve model performance. Topological relationships within data, such as the similarity between samples, can be represented as a graph. By learning a shared spectral convolution kernel over all nodes in the graph, a semi-supervised GCN model can capture both the local graph structure and the node features, and represent these two kinds of information in a latent space. GCN [19] and its variants [20, 21] have been successfully applied in different settings, including the analysis of cancer patient subtypes using real-world evidence [22], protein prediction [23], drug design [24], and single cells and diseases [25-29]. These works show that, by effectively learning and exploiting latent representations and topological relationships among data, GCN models can significantly improve learning performance.

In this work, we developed a novel graph-based artificial intelligence (AI) model, deconvoluting spatial transcriptomics data through a graph-based convolutional network (DSTG), to decompose cell mixtures in spatially resolved transcriptomics data. Leveraging well-characterized scRNA-seq datasets, DSTG uses a semi-supervised GCN to learn the precise composition of ST data. The performance of DSTG was validated on synthetic ST data and on different experimental ST datasets with well-defined structure, including mouse cortex, hippocampus tissue and pancreatic tumor tissue. In addition, we provide an implementation of DSTG as a ready-to-use Python package that is compatible with current ST profiling datasets and enables accurate cell-type decomposition.

Materials and methods

Variable gene selection

For the scRNA-seq data, we first used ANOVA to identify the genes with the greatest variability across cell types. Based on the adjusted P values with Bonferroni correction, the top 2000 most variable genes in the scRNA-seq data were selected. We then used the scRNA-seq data restricted to these top variable genes to generate synthetic cell mixtures with known cell composition as pseudo-ST data. For simplicity, we use the term "spot" throughout to denote both a synthetic cell mixture in pseudo-ST data and a spot or bead in real-ST data.
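The selection step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the DSTG code: the function names are hypothetical, and the sketch ranks genes by the one-way ANOVA F statistic instead of computing P values. When every gene is tested on the same cells, the degrees of freedom are identical across genes, so ranking by F gives the same order as ranking by Bonferroni-adjusted P value.

```python
import numpy as np

def anova_f_statistic(expr, labels):
    """One-way ANOVA F statistic for every gene.

    expr   : (n_cells, n_genes) expression matrix
    labels : (n_cells,) cell-type label of each cell
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_cells, n_groups = expr.shape[0], groups.size
    grand_mean = expr.mean(axis=0)

    ss_between = np.zeros(expr.shape[1])
    ss_within = np.zeros(expr.shape[1])
    for g in groups:
        block = expr[labels == g]
        group_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean) ** 2).sum(axis=0)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_cells - n_groups)
    with np.errstate(divide="ignore", invalid="ignore"):
        f = ms_between / ms_within
    # Genes that are constant everywhere give 0/0; treat them as uninformative.
    return np.nan_to_num(f, nan=0.0, posinf=np.inf)

def top_variable_genes(expr, labels, n_top=2000):
    """Indices of the n_top genes with the largest F statistic."""
    f = anova_f_statistic(expr, labels)
    order = np.argsort(-f, kind="stable")
    return order[: min(n_top, f.size)]
```

Calling `top_variable_genes(expr, labels)` returns column indices that can be used to subset both the scRNA-seq and the ST matrices to the same genes.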

Pseudo-ST data

Real-ST data from ST assays capture the gene expression of each spot, which covers a mixture of heterogeneous cells. Such cell mixtures can be simulated from scRNA-seq data of the same tissue. Specifically, to mimic the cell mixture at a spot, we select two to eight cells from the scRNA-seq dataset and combine their transcriptomic profiles into one pseudo-ST spot. The number of selected cells is chosen to match the spatial resolution of real ST data. The exact proportion of cell types in each pseudo-ST spot is available because the identities of the selected cells are known. To better mimic real-ST spots, if the total unique molecular identifier (UMI) count of the resulting pseudo-ST data exceeds that of the real-ST data, we downsample accordingly. The pseudo-ST data therefore resemble real-ST data obtained from the same tissue. To further exploit the similarity between pseudo-ST and real-ST data, we learn a link graph that connects pseudo-ST and real-ST spots by similarity; this graph serves as the input graph of DSTG.
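A minimal sketch of how one pseudo-ST spot could be generated is shown below. The function name and the `max_umi` cap are hypothetical; the notes only say that downsampling is applied when the pseudo-ST UMI total exceeds that of real-ST data, so the exact cap and the downsampling scheme (here, sampling UMIs without replacement) are assumptions.

```python
import numpy as np

def make_pseudo_spot(counts, cell_types, type_names, rng,
                     min_cells=2, max_cells=8, max_umi=None):
    """Build one synthetic spot from randomly chosen single cells.

    counts     : (n_cells, n_genes) integer UMI matrix from scRNA-seq
    cell_types : (n_cells,) cell-type label of each cell
    type_names : ordered list of cell types for the proportion vector
    max_umi    : hypothetical cap on the spot's total UMI count
    Returns (gene counts of the spot, cell-type proportions).
    """
    counts = np.asarray(counts)
    cell_types = np.asarray(cell_types)

    # Pick between min_cells and max_cells distinct cells and sum their profiles.
    n_pick = rng.integers(min_cells, max_cells + 1)
    picked = rng.choice(counts.shape[0], size=n_pick, replace=False)
    profile = counts[picked].sum(axis=0)

    # The composition is known because the identities of the cells are known.
    proportions = np.array([(cell_types[picked] == t).mean() for t in type_names])

    # Downsample UMIs without replacement when the spot is deeper than the cap.
    if max_umi is not None and profile.sum() > max_umi:
        profile = rng.multivariate_hypergeometric(profile, max_umi)
    return profile, proportions
```

Repeating this call many times with one `np.random.default_rng(seed)` generator gives a pseudo-ST matrix together with its ground-truth proportions, which play the role of labels in the semi-supervised setup.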

Link graph

For the pseudo-ST and real-ST data, we first normalize the data: the raw UMI count of each gene in a spot is divided by the total count of that spot (library-size normalization), multiplied by a scale factor of 10000, and then log-transformed after adding one. The normalized data are then standardized, namely
$x_{g,i}=\frac{x_{g,i}^0-\overline{x}_g^0}{\rho_g},$
where $x_{g,i}^0$ is the normalized count of gene $g$ in spot $i$, $\overline{x}_g^0$ is the mean of $x_{g,i}^0$ over all spots, and $\rho_g$ is the standard deviation (SD) of $x_{g,i}^0$. Thus $x_{g,i}$ is the standardized gene expression.
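The two steps above can be written compactly in NumPy. This is a sketch with a hypothetical function name. It assumes a genes-by-spots layout to match the indices in $x_{g,i}$ and uses the population SD (`ddof=0`), since the notes do not say which SD estimator is used.

```python
import numpy as np

def normalize_and_standardize(counts, scale=10_000.0):
    """counts: (n_genes, n_spots) raw UMI matrix -> standardized expression."""
    counts = np.asarray(counts, dtype=float)

    # Library-size normalization, scaling and log(1 + x) transform.
    library_size = counts.sum(axis=0, keepdims=True)
    library_size[library_size == 0] = 1.0   # avoid dividing by zero for empty spots
    x0 = np.log1p(counts / library_size * scale)

    # Per-gene standardization across spots: (x0 - mean) / SD.
    gene_mean = x0.mean(axis=1, keepdims=True)
    gene_sd = x0.std(axis=1, keepdims=True)
    gene_sd[gene_sd == 0] = 1.0             # constant genes are mapped to 0
    return (x0 - gene_mean) / gene_sd
```

After this transform every non-constant gene has mean 0 and SD 1 across spots, which puts pseudo-ST and real-ST features on a comparable scale before the dimension reduction.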

After data standardization, we build in DSTG a link graph that contains both the pseudo-ST and real-ST data. The graph is denoted $G = (V, E)$, where the $N = |V|$ nodes represent the spots, $E$ represents the edges, and $A$ is the adjacency matrix. Here, we use canonical correlation analysis [30-33] to reduce the dimension of the pseudo-ST and real-ST data, and then identify nearest neighbors in the reduced-dimension space [23].

First, the pseudo-ST and real-ST data are represented as $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$, where $m$ is the number of variable genes and $n_p$ and $n_r$ are their respective numbers of spots. We project the two datasets onto a lower, $S$-dimensional space through the $n_p$-dimensional canonical correlation vectors $\mu_s$ and the $n_r$-dimensional vectors $v_s$, where $s = 1, \cdots, S$, by maximizing
$\mu_s^T \left ( X_{pseudo}^{m \times n_p} \right )^T X_{real}^{m \times n_r} v_s,$
subject to the constraints $\lVert \mu_s \rVert_2^2 \le 1$ and $\lVert v_s \rVert_2^2 \le 1$. To determine the pairs of canonical correlation vectors, we used singular value decomposition and obtained the $S$ pairs of canonical correlation vectors with the $S$ largest eigenvalues. Each pair $\mu_s$ and $v_s$ projects the original data $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$ onto the $s$-th dimension of the low-dimensional space. For DSTG, we took $S$ as 20 for the reduced-dimension space.
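One way to read the optimization above: the maximizing pairs $(\mu_s, v_s)$ are the leading left and right singular vectors of the cross-product matrix $X_{pseudo}^T X_{real}$. The sketch below follows that reading with a hypothetical function name and a full SVD; for realistic spot counts a truncated solver would likely be preferable, and the actual DSTG implementation may differ.

```python
import numpy as np

def cca_embed(x_pseudo, x_real, n_dims=20):
    """Embed spots of both datasets in a shared low-dimensional space.

    x_pseudo : (m, n_p) standardized pseudo-ST matrix (genes x spots)
    x_real   : (m, n_r) standardized real-ST matrix (genes x spots)
    Returns (n_p, S) pseudo embedding, (n_r, S) real embedding and the
    S leading singular values, with S = min(n_dims, number of singular values).
    """
    cross = x_pseudo.T @ x_real                      # (n_p, n_r)
    u, s, vt = np.linalg.svd(cross, full_matrices=False)
    k = min(n_dims, s.size)
    # Column s of u is mu_s, column s of vt.T is v_s; row i of each matrix is
    # the coordinate of spot i in the reduced space.
    return u[:, :k], vt[:k].T, s[:k]
```

For the first pair, `u[:, 0] @ cross @ v[:, 0]` equals the largest singular value, which is the maximum of the objective under the unit-norm constraints.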

Second, in the low-dimensional space, we identify mutual nearest neighbors between the spots of the pseudo-ST and real-ST data. Specifically, if spot $i$ is a nearest neighbor of spot $j$ and spot $j$ is a nearest neighbor of spot $i$ under KNN ($k = 200$ by default), then spots $i$ and $j$ are mutual nearest neighbors. In this way, we build a link graph between pseudo-ST and real-ST. To further use the information in the real-ST data within the DSTG model, we also identify mutual nearest neighbors within the real-ST data itself. This yields the final link graph, represented by the adjacency matrix $A$: if spot $i$ and spot $j$ are mutual nearest neighbors, $A_{ij}=1$; otherwise, $A_{ij}=0$. This graph captures the intrinsic topology of similarity among all spots.
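The construction of $A$ can be sketched as follows. The function names are hypothetical, Euclidean distance in the reduced space is an assumption (the notes do not name the metric), and the dense pairwise-distance computation is only suitable for small examples.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def knn_mask(dist, k):
    """mask[i, j] is True when column j is among the k nearest of row i."""
    mask = np.zeros(dist.shape, dtype=bool)
    k = min(k, dist.shape[1])
    if k <= 0:
        return mask
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def link_graph(emb_pseudo, emb_real, k=200):
    """Adjacency matrix over [pseudo spots, real spots] from mutual nearest neighbors."""
    n_p, n_r = emb_pseudo.shape[0], emb_real.shape[0]
    adj = np.zeros((n_p + n_r, n_p + n_r), dtype=int)

    # Pseudo <-> real: keep a pair only if each is in the other's k nearest.
    d_pr = pairwise_dist(emb_pseudo, emb_real)
    mutual_pr = knn_mask(d_pr, k) & knn_mask(d_pr.T, k).T
    adj[:n_p, n_p:] = mutual_pr
    adj[n_p:, :n_p] = mutual_pr.T

    # Real <-> real: same rule, with a spot never counted as its own neighbor.
    d_rr = pairwise_dist(emb_real, emb_real)
    np.fill_diagonal(d_rr, np.inf)
    nn_rr = knn_mask(d_rr, min(k, n_r - 1))
    mutual_rr = nn_rr & nn_rr.T
    np.fill_diagonal(mutual_rr, False)
    adj[n_p:, n_p:] = mutual_rr
    return adj
```

The pseudo-pseudo block is left at zero because the text only describes pseudo-to-real and real-to-real links; the result is symmetric by construction.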

DSTG method

We apply a GCN on the link graph $G = (V, E)$ to identify and predict the composition of different cell types in ST data. Each spot is treated as a node, and the cell mixtures in the pseudo-ST data are generated with known composition. The goal of DSTG is to predict the cell-type composition of the real-ST data using not only the features of each spot but also the graph information between the pseudo-ST and real-ST data, which is characterized by the adjacency matrix $A$. Concretely, the DSTG method requires two inputs. One input is the spot similarity graph structure learned above (see the section Link graph). The other is the data matrix of combined pseudo-ST and real-ST data. As above, the pseudo-ST and real-ST data are represented as $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$, where $m$ is the number of variable genes and $n_p$ and $n_r$ are their respective numbers of spots. The input data matrix is
$X=[X_{pseudo}, X_{real}] \in R^{ m \times N},$
where $N=n_p + n_r$.

With these two inputs, $X$ and $A$, DSTG is built from multiple convolutional layers. To train DSTG efficiently, the adjacency matrix $A$ is modified and normalized as
$\tilde{A} = \check{D}^{-1/2} \hat{A} \check{D}^{-1/2},$
where $\hat{A} = A + I$, $I$ is the identity matrix, and $\check{D}$ is the diagonal degree matrix of $\hat{A}$.
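The normalization of $A$ is short enough to check by hand. Below is a dense NumPy sketch with a hypothetical function name; a real link graph with thousands of spots would normally be stored as a sparse matrix.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    a_hat = np.asarray(adj, dtype=float) + np.eye(len(adj))  # add self-loops
    degree = a_hat.sum(axis=1)                               # diagonal of D
    d_inv_sqrt = 1.0 / np.sqrt(degree)                       # degree >= 1 thanks to self-loops
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

For three spots where only the first two are linked, `normalize_adjacency([[0, 1, 0], [1, 0, 0], [0, 0, 0]])` gives `[[0.5, 0.5, 0], [0.5, 0.5, 0], [0, 0, 1]]`: each linked spot has degree 2 after the self-loop is added, and the isolated spot keeps only its self-loop with weight 1.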

Specifically, each graph convolutional layer is defined as