Single-cell literature notes (Part 3): DSTG, deconvoluting spatial transcriptomics data through graph-based AI

2022-06-22 06:33:00 GoatGui

Learning notes, for reference only. If you find a mistake, corrections are welcome.

Keywords: spatial transcriptomics; deconvolution; graph-based artificial intelligence; single-cell RNA-seq



DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence

Abstract

Recently developed spatial transcriptomics (ST) technologies link the spatial information of the spots on a tissue section to the RNA abundance of the cells within each spot, which is particularly important for understanding the structure and function of tissues and cells. However, because a spot is usually larger than a single cell, the gene expression measured at each spot comes from a mixture of cells with heterogeneous cell types. The ST data at each spot therefore need to be disentangled to reveal the cell composition at that spatial location. In this study, we propose a new method, deconvoluting spatial transcriptomics data through graph-based convolutional networks (DSTG), which accurately deconvolutes the observed gene expression at each spot and recovers its cellular composition, thereby achieving high-level segmentation and revealing the spatial architecture of cellular heterogeneity in tissues. DSTG not only performs well on synthetic spatial data generated under different protocols, but also effectively identifies the spatial composition of cells in mouse cortex, hippocampus slices and pancreatic tumor tissue. Overall, DSTG accurately uncovers cell states and subpopulations based on spatial localization. DSTG is available as a ready-to-use open source software (https://github.com/Su-informatics-lab/DSTG) for precise interrogation of spatial organizations and functions in tissues.

Introduction

Different types of cells are spatially and structurally organized in tissues to perform their functions. Revealing the complex spatial architecture of heterogeneous tissues is of great significance for understanding cellular mechanisms and functions in disease. The rapid development of single-cell RNA sequencing (scRNA-seq) has drawn attention to elucidating the formation of heterogeneous cell populations [1-4] and to tracing lineage relationships within tissues [5-7]. Unfortunately, because it lacks spatial information, scRNA-seq cannot resolve the structural organization of heterogeneous cells in complex tissues. As a complement to scRNA-seq, spatially resolved transcriptome profiling methods [8-10] have therefore been introduced. To reveal the spatial cellular structure of tissues, sequencing-based high-throughput spatial transcriptomics (ST) technologies [11-14], such as 10X Genomics Visium [8] and Slide-seq [15, 16], use spatially indexed barcodes with RNA sequencing, allowing the transcriptome of a single tissue section to be quantified with spatial resolution.

Emerging ST technologies can spatially index transcripts and measure expression profiles, which advances our understanding of precise tissue architecture. However, the resolution of ST data is far below the single-cell level. The transcripts captured at a specific location by a "spot" [8] or "bead" [15, 16] usually come from a mixture of heterogeneous cells. For example, Visium, one of the microarray-based ST technologies developed by 10X Genomics, uses spots 50 μm in diameter, each covering 10-20 cells on average [17]. Even with Slide-seq [15, 16], which quantifies gene expression at high resolution (10 μm), a pixel may still overlap multiple cells. The gene expression measured at a spot therefore reflects a mixture of cells, and revealing the cell composition of every spot in ST data is essential for high-resolution investigation of the molecular and cellular architecture of tissues.

To address this problem, only a few dedicated methods have been developed so far. SPOTlight [18] is a deconvolution algorithm based on non-negative matrix factorization regression and non-negative least squares, and it has been successfully applied to ST data [16]. Specifically, SPOTlight uses reference scRNA-seq data to identify cell type-specific topic profiles, which are then used to deconvolute the spatial spots. This approach deconvolutes ST data using cell states and subpopulations identified from scRNA-seq data, which suggests that well-characterized scRNA-seq data can help and facilitate the exploration of spatial data sets. A major limitation of such ST deconvolution methods is that they cannot effectively learn and exploit the intrinsic topological information of the cell types within spots, which is informative about the observed gene expression patterns and the associated cell types at spots.

In recent years, graph convolutional networks (GCN) [19] have shown a strong ability to exploit the intrinsic topological information of data to improve model performance. Topological relationships within data, such as the similarity between samples, can be represented as a graph. By learning a spectral convolution kernel shared across all nodes of the graph, a semi-supervised GCN model can capture both the local graph structure and the node features, and represent both kinds of information in a latent space. GCN [19] and its variants [20, 21] have been applied successfully in different scenarios, including subtyping of cancer patients using real-world evidence [22], protein prediction [23], drug design [24], and single-cell and disease studies [25-29]. These works show that, by effectively learning and exploiting latent representations and topological relationships in data, GCN models can substantially improve learning performance.

In this work, we developed a new graph-based artificial intelligence (AI) model, deconvoluting spatial transcriptomics data through graph-based convolutional networks (DSTG), to decompose cell mixtures in spatially resolved transcriptomics data. Building on well-characterized scRNA-seq data sets, DSTG uses a semi-supervised GCN to learn the precise composition of ST data. The performance of DSTG was validated on synthetic ST data and on different experimental ST data sets with well-defined structure, including mouse cortex, hippocampus tissue and pancreatic tumor tissue. In addition, we provide an implementation of DSTG as a ready-to-use Python software package that is compatible with current ST profiling data sets and achieves accurate cell-type decomposition.

Materials and methods

Variable gene selection

For the scRNA-seq data, we first used analysis of variance (ANOVA) to identify the genes whose expression varies most across cell types. Based on P values adjusted with the Bonferroni correction, the top 2000 most variable gene features in the scRNA-seq data were selected. We then used the scRNA-seq data restricted to these top variable genes to generate pseudo-ST data, i.e. synthetic cell mixtures with known cell composition. For simplicity, we use the term "spot" throughout to denote both a synthetic cell mixture in the pseudo-ST data and a spot or bead in the real-ST data.
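As a rough illustration of the gene ranking step, here is a minimal NumPy sketch of a one-way ANOVA across cell types. It is not the DSTG code; the function names and the cell-by-gene matrix layout are assumptions made for this note. Because every gene shares the same degrees of freedom, ranking by the F statistic gives the same order as ranking by the P value (with or without Bonferroni adjustment), so the sketch skips the P value itself.

```python
import numpy as np

def anova_f_scores(expr, labels):
    """One-way ANOVA F statistic per gene.

    expr:   (n_cells, n_genes) normalized expression (assumed layout)
    labels: (n_cells,) cell type label of each cell
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_cells, n_groups = expr.shape[0], len(groups)
    grand_mean = expr.mean(axis=0)
    ss_between = np.zeros(expr.shape[1])
    ss_within = np.zeros(expr.shape[1])
    for g in groups:
        block = expr[labels == g]
        group_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean) ** 2).sum(axis=0)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_cells - n_groups)
    # Guard against genes with zero within-group variance.
    return ms_between / np.maximum(ms_within, 1e-12)

def top_variable_genes(expr, labels, n_top=2000):
    """Indices of the n_top genes with the largest F statistic."""
    scores = anova_f_scores(expr, labels)
    return np.argsort(scores)[::-1][:n_top]
```

The default `n_top=2000` mirrors the number quoted above; everything else is illustrative.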

Pseudo-ST data

Real-ST data from ST assays capture the gene expression of spots that each cover a mixture of heterogeneous cells. Such cell mixtures can be simulated from scRNA-seq data of the same tissue. Specifically, to mimic the cell mixture at a spot, we selected two to eight cells from the scRNA-seq data set and combined their transcriptomic profiles into a pseudo-ST spot. The number of selected cells is comparable to the spatial resolution of real ST data. Here the exact proportion of cell types in every pseudo-ST spot is available, because the identities of the selected cells are known. To better mimic real-ST spots, if the total unique molecular identifier (UMI) count of a pseudo-ST spot exceeds that of the real-ST data, we downsample accordingly. The pseudo-ST data therefore resemble real-ST data obtained from the same tissue. To further ensure that DSTG exploits the similarity between pseudo-ST and real-ST data, we learned a link graph that captures the similarity between pseudo-ST and real-ST spots; this graph is used as the input graph of DSTG.
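The construction of one pseudo-ST spot can be sketched as follows. This is only my reading of the description above: "combining" profiles is assumed to mean summing raw UMI counts, the downsampling is assumed to draw UMIs without replacement, and `make_pseudo_spot` and `umi_cap` are hypothetical names.

```python
import numpy as np

def make_pseudo_spot(counts, cell_types, n_types, rng, umi_cap=None):
    """Build one synthetic spot from 2 to 8 randomly chosen cells.

    counts:     (n_cells, n_genes) integer UMI counts, n_cells >= 8
    cell_types: (n_cells,) integer labels in [0, n_types)
    umi_cap:    optional total UMI count taken from the real-ST data
    """
    n_pick = rng.integers(2, 9)  # upper bound is exclusive, so 2..8 cells
    picked = rng.choice(counts.shape[0], size=n_pick, replace=False)
    spot = counts[picked].sum(axis=0)  # assumption: profiles are summed
    # Known composition: fraction of picked cells per cell type.
    proportions = np.bincount(cell_types[picked], minlength=n_types) / n_pick
    if umi_cap is not None and spot.sum() > umi_cap:
        # Downsample to umi_cap UMIs, drawn without replacement.
        spot = rng.multivariate_hypergeometric(spot, umi_cap)
    return spot, proportions
```

Calling this repeatedly with `rng = np.random.default_rng(seed)` would give a labeled pseudo-ST matrix together with its known composition matrix.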

Link graph

For both the pseudo-ST and the real-ST data, we first normalized the data: the raw UMI count of a gene in a cell is divided by the total count of that cell (library size normalization), multiplied by a scale factor of 10000, and then log-transformed after adding one. The normalized data are then standardized, i.e.
$$x_{g,i}=\frac{x_{g,i}^0-\overline{x}_g^0}{\rho_g},$$
where $x_{g,i}^0$ is the normalized count of gene $g$ in spot $i$, $\overline{x}_g^0$ is the mean of $x_{g,i}^0$ over all spots, and $\rho_g$ is the SD of $x_{g,i}^0$. Thus $x_{g,i}$ is the standardized gene expression.
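The two preprocessing steps fit in a few lines of NumPy. The sketch below assumes a gene-by-spot matrix, the natural logarithm, and the population SD (`ddof=0`); the notes do not pin down the last two choices, and the function name is mine.

```python
import numpy as np

def normalize_and_standardize(counts, scale=1e4):
    """Library-size normalize, log-transform, then z-score each gene.

    counts: (n_genes, n_spots) raw UMI counts; every spot is assumed
            to contain at least one UMI.
    """
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=0, keepdims=True)
    x0 = np.log1p(counts / library_size * scale)   # log(1 + scaled count)
    mean = x0.mean(axis=1, keepdims=True)          # mean over spots, per gene
    sd = x0.std(axis=1, keepdims=True)             # SD over spots, per gene
    sd[sd == 0] = 1.0  # constant genes stay at zero instead of dividing by 0
    return (x0 - mean) / sd
```

After this step each gene has mean 0 and SD 1 across spots, which is what the formula for $x_{g,i}$ expresses.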

After data standardization, DSTG builds a link graph over the pseudo-ST and real-ST data. The graph is $G = (V, E)$, where the $N = |V|$ nodes represent spots, $E$ represents the edges, and $A$ is the adjacency matrix. Here we use canonical correlation analysis [30-33] to reduce the dimension of the pseudo-ST and real-ST data, and then identify nearest neighbors in the reduced space [23].

First, the pseudo-ST and real-ST data are represented as $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$, where $m$ is the number of variable genes and $n_p$ and $n_r$ are the respective numbers of spots. We project the two data sets onto a lower, $S$-dimensional space through the $n_p$-dimensional canonical correlation vectors $\mu_s$ and the $n_r$-dimensional vectors $v_s$, where $s = 1, \cdots, S$, by maximizing
$$\mu_s^T \left( X_{pseudo}^{m \times n_p} \right)^T X_{real}^{m \times n_r} v_s,$$
subject to the constraints $\lVert \mu_s \rVert_2^2 \le 1$ and $\lVert v_s \rVert_2^2 \le 1$. To determine the canonical correlation vector pairs, we used singular value decomposition and obtained the $S$ pairs of canonical correlation vectors with the $S$ largest eigenvalues. Each pair $\mu_s$ and $v_s$ projects the original data $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$ to the $s$th dimension of the low-dimensional space. For DSTG, we took $S$ as 20 for the reduced dimension space.
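To make the optimization concrete: under unit-norm constraints, the pair that maximizes $\mu^T X_{pseudo}^T X_{real} v$ is the leading pair of left and right singular vectors of the cross-product matrix $X_{pseudo}^T X_{real}$, and the following pairs come from the next singular vectors. A minimal sketch, assuming both matrices are already standardized and that the entries of each vector serve as spot coordinates (`cca_embed` is a hypothetical name):

```python
import numpy as np

def cca_embed(x_pseudo, x_real, n_dims=20):
    """Canonical correlation vectors via SVD of the cross-product matrix.

    x_pseudo: (m, n_p) standardized pseudo-ST matrix
    x_real:   (m, n_r) standardized real-ST matrix
    Returns (n_p, S) and (n_r, S) coordinates plus the S singular values.
    """
    cross = x_pseudo.T @ x_real                      # (n_p, n_r)
    u, s, vt = np.linalg.svd(cross, full_matrices=False)
    n_dims = min(n_dims, s.size)                     # cannot exceed the rank bound
    # Column s of u is mu_s, column s of v is v_s; singular values are sorted
    # in descending order, so the first columns are the strongest pairs.
    return u[:, :n_dims], vt[:n_dims].T, s[:n_dims]
```

The full SVD is fine for small examples; with thousands of spots a truncated solver would be the more practical choice.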

Second, in the low-dimensional space, we identify mutual nearest neighbors between pseudo-ST and real-ST spots. Specifically, if, under KNN (default k = 200), spot $i$ is among the nearest neighbors of spot $j$ and spot $j$ is among the nearest neighbors of spot $i$, then spots $i$ and $j$ are mutual nearest neighbors. In this way we build the link graph between pseudo-ST and real-ST data. To make further use of the information in the real-ST data within the DSTG model, we also identify mutual nearest neighbors within the real-ST data itself. We thus obtain the final link graph, represented by the adjacency matrix $A$: if spots $i$ and $j$ are mutual nearest neighbors, $A_{ij}=1$; otherwise $A_{ij}=0$. This graph captures the intrinsic topology of similarity among all spots.
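A brute-force sketch of the mutual-nearest-neighbor graph is given below. It assumes Euclidean distance in the reduced space and orders the nodes as pseudo-ST spots first, then real-ST spots; both choices, and the function names, are mine rather than something the notes specify.

```python
import numpy as np

def _knn_mask(query, ref, k):
    """Boolean mask: mask[i, j] is True if ref[j] is among the k nearest to query[i]."""
    d2 = ((query[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    mask = np.zeros(d2.shape, dtype=bool)
    mask[np.arange(len(query))[:, None], idx] = True
    return mask

def link_graph(z_pseudo, z_real, k=200):
    """Adjacency matrix A over [pseudo spots, real spots] from mutual nearest neighbors."""
    z_pseudo = np.asarray(z_pseudo, dtype=float)
    z_real = np.asarray(z_real, dtype=float)
    n_p, n_r = len(z_pseudo), len(z_real)
    # pseudo <-> real: keep a pair only if each is in the other's neighbor list.
    cross = (_knn_mask(z_pseudo, z_real, min(k, n_r))
             & _knn_mask(z_real, z_pseudo, min(k, n_p)).T)
    # real <-> real: ask for k + 1 neighbors because a spot is its own nearest one.
    within = _knn_mask(z_real, z_real, min(k + 1, n_r))
    np.fill_diagonal(within, False)
    within = within & within.T
    a = np.zeros((n_p + n_r, n_p + n_r), dtype=int)
    a[:n_p, n_p:] = cross
    a[n_p:, :n_p] = cross.T
    a[n_p:, n_p:] = within
    return a
```

The pairwise distance array grows as (number of queries) x (number of references) x (dimensions), so this only suits small inputs; a tree-based or approximate neighbor search would replace `_knn_mask` in practice.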

DSTG method

We apply a GCN on the link graph $G = (V, E)$ to identify and predict the composition of different cell types in ST data. Every spot is treated as a node, and the cell mixtures in the pseudo-ST data are generated from known compositions. The goal of DSTG is to predict the cell type composition of the real-ST data, using not only the features of each spot but also the graph information of the pseudo-ST and real-ST data, which is characterized by the adjacency matrix $A$. Specifically, the DSTG method takes two inputs. One input is the spot similarity graph structure learned above (see the section Link graph). The other is the data matrix of combined pseudo-ST and real-ST data. As described above, the pseudo-ST and real-ST data are represented as $X_{pseudo}^{m \times n_p}$ and $X_{real}^{m \times n_r}$, where $m$ is the number of variable genes and $n_p$ and $n_r$ are the respective numbers of spots. The input data matrix is
$$X=[X_{pseudo}, X_{real}] \in R^{m \times N},$$
where $N=n_p + n_r$.

With these two inputs, $X$ and $A$, DSTG is built from multiple convolutional layers. To train DSTG efficiently, the adjacency matrix $A$ is modified and normalized as
$$\tilde{A} = \check{D}^{-1/2} \hat{A} \check{D}^{-1/2},$$
where $\hat{A} = A + I$, $I$ is the identity matrix and $\check{D}$ is the diagonal degree matrix of $\hat{A}$.
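The normalization is easy to check numerically. A minimal dense-matrix sketch (the function name is mine; real graphs of this size would normally use sparse matrices):

```python
import numpy as np

def normalize_adjacency(a):
    """Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = np.asarray(a, dtype=float) + np.eye(len(a))  # add self-loops
    degree = a_hat.sum(axis=1)                           # >= 1 because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Scaling rows and columns is equivalent to multiplying by the diagonal matrices.
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

For two spots joined by one edge, every entry of the result is 0.5: each node has degree 2 after the self-loop is added, and $1 / (\sqrt{2}\sqrt{2}) = 0.5$.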

Specifically, each graph convolutional layer is defined as

$$H^{(l+1)} = \sigma \left( \tilde{A} H^{(l)} W^{(l)} \right),$$

where $H^{(l)}$ is the input from the previous layer, $W^{(l)}$ is the weight matrix of the $l$th layer, $\sigma(\cdot) = \mathrm{ReLU}(\cdot)$ is the nonlinear activation function and the input layer $H^{(0)} = X$. The composition of a specific cell type $f$ at a pseudo-spot $i$ is represented as $y_{i,f} \in Y_p$, where $i \in \{1, \cdots, n_p\}$ and cell type $f \in \{1, \cdots, F\}$, $F$ represents the total number of different cell types and $Y_p \in R^{n_p \times F}$ represents the known cell compositions at all spots from the pseudo-ST data.

Specifically, for a three-layer DSTG with $F$ distinct cell types, the forward propagation is realized as $\hat{Y}$

[equation image not available]

The softmax activation function below is used as the activation function in the output layer that learns the cell type proportions,
$$\mathrm{softmax}(\cdot) = \frac{\exp(\cdot)}{\sum \exp(\cdot)}$$
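Putting the pieces together, a forward pass with ReLU hidden layers and a softmax output can be sketched in NumPy as below. This is a generic GCN forward pass and not the DSTG source: the feature matrix is assumed to be arranged with spots as rows (shape $N \times m$) so that the matrix products are defined, and the number and sizes of the weight matrices are left to the caller.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(a_norm, features, weights):
    """Forward pass: ReLU graph convolutions, then a softmax output layer.

    a_norm:   (N, N) normalized adjacency matrix
    features: (N, m) spot-by-gene matrix (assumed orientation)
    weights:  list of weight matrices; the last one maps to F cell types
    """
    h = features
    for w in weights[:-1]:
        h = np.maximum(a_norm @ h @ w, 0.0)       # ReLU(A_norm H W)
    return softmax(a_norm @ h @ weights[-1])      # (N, F), each row sums to 1
```

Each output row can be read as the predicted cell type proportions of one spot, which is why a softmax rather than a hard class label is used at the output.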
The evaluation function is defined as the cross-entropy at pseudo-ST spots, i.e.

[equation image not available]
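Since only the pseudo-ST spots have known compositions, the loss is evaluated on those nodes alone while the graph convolution still sees every spot. A minimal sketch of such a masked cross-entropy, where averaging over the labeled spots (rather than summing) and the small `eps` are my own assumptions:

```python
import numpy as np

def masked_cross_entropy(y_true, y_pred, mask, eps=1e-12):
    """Cross-entropy between known and predicted proportions on labeled spots.

    y_true: (N, F) known compositions (rows of unlabeled spots are ignored)
    y_pred: (N, F) predicted compositions, e.g. a softmax output
    mask:   (N,) boolean, True for labeled pseudo-ST spots
    """
    per_spot = -(y_true * np.log(y_pred + eps)).sum(axis=1)
    return per_spot[mask].mean()
```

Real-ST rows simply get `mask = False`, so they contribute nothing to the loss but still shape the hidden representations through the adjacency matrix.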

During the propagation of each layer, the model will reduce the cross-entropy error on the training data. After training, we had

[equation image not available]

When applying the DSTG model, we randomly split the pseudo-ST data into a training set (80%), a test set (10%) and a validation set (10%); the real-ST data are unlabeled and are the target of prediction. For the ST data, we trained a three-layer DSTG model with the Adam algorithm [34] for at most 200 epochs, with a learning rate of 0.01 and early stopping with a window size of 10. For the dimension of the latent layers, we screened the options 32, 64, 128, 256, 528 and 1024 and chose the best one.
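The split and the stopping rule can be sketched as follows. The notes do not say how the "window size" is used, so the criterion here (stop once the latest validation loss exceeds the mean of the previous `window` validation losses) is only one plausible reading, and both function names are hypothetical.

```python
import numpy as np

def split_indices(n, rng, fractions=(0.8, 0.1, 0.1)):
    """Random train / test / validation split of n pseudo-ST spots."""
    perm = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    train = perm[:n_train]
    test = perm[n_train:n_train + n_test]
    val = perm[n_train + n_test:]          # whatever remains
    return train, test, val

def should_stop(val_losses, window=10):
    """Assumed rule: stop when the newest loss is above the recent average."""
    if len(val_losses) <= window:
        return False
    return bool(val_losses[-1] > np.mean(val_losses[-(window + 1):-1]))
```

In a training loop this would be checked once per epoch, up to the limit of 200 epochs, with the validation loss appended to `val_losses` each time.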

For the evaluation metrics, we used the Jensen–Shannon divergence (JSD) score, which is a symmetrized and smoothed version of the Kullback–Leibler divergence.

[equation image not available]

The JSD score at spot i i i is defined as

[equation image not available]
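For reference, the textbook definitions are $KL(P \Vert Q) = \sum_f P(f) \log \frac{P(f)}{Q(f)}$ and $JSD(P \Vert Q) = \frac{1}{2} KL(P \Vert M) + \frac{1}{2} KL(Q \Vert M)$ with $M = \frac{1}{2}(P + Q)$. The sketch below applies them to the true and predicted composition of a single spot. The base-2 logarithm, which bounds the score between 0 and 1, is my choice and may differ from the one used in the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits; terms with p = 0 contribute nothing."""
    support = p > 0
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))

def jsd(p, q):
    """Jensen-Shannon divergence between two cell type distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)   # the mixture is positive wherever p or q is positive
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical compositions give a score of 0 and compositions with no shared cell type give 1, so lower values mean a better deconvolution of that spot.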

Copyright notice
This article was written by [GoatGui]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/173/202206220544497543.html