当前位置：网站首页>Follow the archiving tutorial to learn rnaseq analysis (IV): QC method for de analysis using deseq2

Follow the archiving tutorial to learn rnaseq analysis (IV): QC method for de analysis using deseq2

2022-06-27 22:47:00 【Wangshixiang】

Follow the archive tutorial RNAseq analysis （ One ）

Follow the archive tutorial RNAseq analysis （ Two ）

Follow the archive tutorial RNAseq analysis （ 3、 ... and ）： Use DESeq2 Perform count standardization

The quality control QC

DESeq2 The next step in the workflow is QC, It includes performing sample level and gene level on count data QC Check the steps , To help us ensure that the sample / Repetition looks good .

img

Sample level QC

RNA-seq A useful initial step in analysis is usually to assess the overall similarity between samples ：

Which samples are similar , What's different ？
Does this meet the expectations of the experimental design ？
What are the main sources of variation in the dataset ？

To explore the similarity of our samples , We will use principal component analysis （PCA） And hierarchical clustering method to perform sample level QC. Our sample level QC Let's see how our repetition comes together , And observe whether our experimental conditions represent the main source of changes in the data . Execute sample level QC Any sample outliers can also be identified , This may require further study , To determine if they need to be in DE Delete... Before analysis .

img

When using these unsupervised clustering methods , Standardized counting log2 Transformation can improve the visual distance / clustering .DESeq2 For sample level QC A regularized logarithmic transformation using normalized counts (rlog), because It adjusts the variance between the means , Thus, clustering is improved .

img

Be careful :DESeq2 Documentation suggests big data sets （100 Samples ） Use variance stable transformation (vst) instead of rlog To perform count conversion , because rlog Functions may run too long , and vst() Function has and rlog Similar properties , Faster .

Principal component analysis PCA[1]

Principal component analysis （PCA） It's a technology , Used to emphasize change , And propose a powerful model in the data set （ Dimension reduction ）. About PCA The details are as follows （ Based on coming from StatQuest The material of [2], If you want a more comprehensive description , We encourage you to explore StatQuest In the video [3] And we Longer courses [4]）.

Suppose we have a data set with two samples and four genes . Based on this expression data , We want to evaluate the relationship between these samples . We can plot the counting relationship between one sample and another sample , sample 1 stay x On the shaft , sample 2 stay y On the shaft , As shown below ：

img

about PCA analysis , The first step is to draw this diagram , And draw a line through the data in the direction representing the most changes . In this case , Changes most along the diagonal . in other words , The largest distribution in the data is between the two endpoints of the line . This is called the first principal component , or PC1. The genes at both ends of this line ( gene B And genes C) It has the greatest influence on the direction of this line .

img

After drawing this line and determining the amount of influence of each gene ,PCA The score for each sample will be calculated . Of each sample PC1 The score is calculated by multiplying the effect by the standardized count and the sum of all genes . We can express data (PC2) The second largest change in the data plot another line , Then calculate the score , Then there is the third line , And so on , Until the total number of samples in the dataset .

Sample1 PC1 score = (read count Gene A * influence Gene A) + 
                    (read count Gene B * influence Gene B) + 
                    .. for all genes

actually , Your actual dataset has a larger dimension （ More samples and more genes ）. Initial sample - Sample map , Will be in n In the dimension n The axes represent the total number of samples . The end result is a two-dimensional matrix , Where rows represent samples , The column reflects the score of each principal component . To evaluate the results of principal component analysis , We usually compare the principal components with each other , Most varied from the interpretation data pc Start .

img

If two samples are correct PC1 The level of gene expression that represents a significant contribution to the variation is similar , Then they will be in PC1 The axes are drawn closely together . therefore , We expect biological repeats to have similar scores ( Because the same genes change ), And gather in PC1 and / or PC2 On , Samples from different treatment groups had different scores . This is the easiest to understand visualization example PCA chart .

explain PCA chart

Below we have a sample data set and some related PCA chart , To understand how to interpret them . The metadata of the experiment is shown below . The main conditions of interest are treatment.

img

When in PC1 and PC2 When visualizing on , We do not see samples separated by processing , So we decided to explore other sources of variation in the data . We hope that we have included all possible known sources of variation in the metadata table , And we can use these factors for PCA Graph coloring .

img

We start from the factor cage Start , but cage Factors do not seem to explain PC1 or PC2 Changes on .

img

then , We color according to gender , It seems that PC2 Separate the sample on the . This is good information worth noting , Because we can use it to explain the changes in the model due to gender , And model it by regression .

img

Next, let's discuss strain factor , Find it can explain PC1 Changes on .

img

We have been able to determine PC1 and PC2 The source of variation , It's great . By taking this into account in our model , We should be able to detect more differentially expressed genes due to treatment .

The worrisome thing is , We see that two samples are not clustered with the correct strain . This will indicate that a sample exchange is possible , And shall be investigated , To determine whether these samples are indeed labeled strains . If we find that there is （ FALSE ） In exchange for , We can exchange samples in metadata . However , If we think they are correctly marked or uncertain , We can delete samples from the dataset .

But we still haven't found , Whether the treatment is strain And the main sources of post sex variation . therefore , We explore PC3 and PC4 Let's see if treatment drives these PC The change represented .

img

We found out PC3 Process the separated sample on the , We treat our DE The analysis is optimistic , Because the conditions we are interested in , Handle , Is in PC3 Upper separated , We can go back to driving PC1 and PC2 Variation of .

How many changes are explained according to the first several principal components , You may want to explore more ( That is, consider more ingredients and draw a pair of combinations ). Even if your sample cannot be clearly separated by the experimental variables , You can still go from DE Biologically relevant results are obtained from the analysis . If you expect a very small amount of effect , Then the signal may be submerged by external sources of change . Where you can identify these sources , It is important to consider these sources in your model , Because it is for detection DE Genetic tools provide more powerful functions .

Heat map of hierarchical clustering

Similar to principal component analysis , Hierarchical clustering is another complementary method for identifying strong patterns and potential outliers in data sets . The heat map shows the correlation of gene expression in all paired samples in the data set . Because most genes are not differentially expressed , Therefore, the correlation between samples is generally high ( Greater than 0.80). lower than 0.80 Your sample may indicate that there are outliers and / Or sample contamination .

The hierarchical tree can indicate which samples are more similar to each other according to the normalized gene expression values . Color blocks represent substructures in the data , You might see the copies of each sample group grouped together as a block . Besides , We want to see the aggregated samples similar to those in PCA The grouping observed in the figure . In the picture below , We will pay close attention to ‘Wt_3’ and ‘KO_3’ The sample of is not clustered with other duplicate samples . We want to explore principal component analysis , See if we see the same sample clustering .

img

Gene level QC

In addition to checking samples / Whether the repeated clustering is good , And more QC step . Before differential expression analysis , It is beneficial to omit genes that have little or no chance of being detected for differential expression . This will improve the ability to detect differentially expressed genes . The missing genes fall into three categories ：

In all samples, the number is 0 Genes
Genes with extreme outliers
Low mean standardized count of genes

img

By default ,DESeq2 This filtering will be performed ： Others DE Tools , Such as edgeR Will not be . Even using limma-voom and / or edgeR Quasi Likelihood method , Filtering is also a necessary step . When using other tools , Please make sure to follow the pre filtration steps , just as Bioconductor As outlined in the user guide on , Because they usually perform better .

Use DESeq2 Conduct Mov10 Quality assessment and exploratory analysis

Now we have a good understanding of what is commonly used in RNA-seq Of QC step , Let's start with the Mov10 Datasets implement them .

Use rlog Convert normalized count

To improve PCA Distance from hierarchical clustering visualization method / clustering , We need to apply... To standardized counts rlog Transform to adjust mean variance .

During the quality assessment , Standardized counting rlog Transformation is only necessary for these visualization methods . We will not use these standard counts downstream .

### Transform counts for data visualization
rld <- rlog(dds, blind=TRUE)

blind=TRUE The parameter will result in an unbiased transformation of the sample condition information . When performing a quality assessment , It is important to include this option .DESeq2[5] Documentation has more details .

rlog The function returns a DESeqTransform object , Another specific DESeq Object type of . The reason you don't just get a matrix of converted values is , Calculation rlog All parameters of the transformation ( The size factor ) Are stored in this object . We use this object to draw the principal component analysis and hierarchical clustering diagram of quality assessment .

Be careful : When you have, for example > 20 One sample ,rlog() Functions can be a little slow . In these cases ,vst() Functions are much faster , And perform a similar transformation , Fit with plotPCA() Use it together . Due to the nature of optimization and transformation , Use vst() It usually takes only a few seconds .

Principal component analysis (PCA)

DESeq2 There is one for drawing PCA Graph's built-in functions , It is used at the bottom ggplot2. It's great , Because it saves us from typing lines of code and fiddling with different things ggplot2 The time of the layer . Besides , It will directly rlog Object as input , Thus, the trouble of extracting relevant information from it is eliminated .

### Plot PCA 
plotPCA(rld, intgroup="sampletype")

plotPCA() The function takes two arguments as input rlog Objects and intgroup( The columns in the metadata we are interested in ).

img

About the similarity of samples , What does this picture tell you ? Does it meet the expectations of the experimental design ? By default , Before using this function 500 The most variable genes . You can do this by adding ntop Parameter and specify how many genes to use to chart to change this .

Be careful ：plotPCA() The function will only return PC1 and PC2 Value . If you want to explore other things in the data pc, Or if you want to be sure about these pc Genes that play a major role , You can use prcomp() function . for example , To draw any pc, We can run the following code ：

# Input is a matrix of log transformed values
rld <- rlog(dds, blind=T)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))

# Create data frame with metadata and PC3 and PC4 values for input to ggplot
df <- cbind(meta, pca$x)
ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))

have access to resources [6] Learn how to use PC Make more complex queries .

Hierarchical clustering

Because in DESeq2 There are no built-in functions for heat maps in , We will use pheatmap In bag pheatmap() function . This function requires a matrix of values / Data frame as input , So the first thing we need to do is start from rld Object ：

### Extract the rlog matrix from the object
rld_mat <- assay(rld)    
## assay() is function from the "SummarizedExperiment" package that was loaded when you loaded DESeq2

Then we need to calculate the pairwise correlation value of the sample . We can use cor() Function to do this ：

### Compute pairwise correlation values
rld_cor <- cor(rld_mat)    ## cor() is a base R function

head(rld_cor)   ## check the output of cor(), make note of the rownames and colnames

Now plot the relevant values as a heat map ：

### Load pheatmap package
library(pheatmap)

### Plot heatmap
pheatmap(rld_cor, annotation = meta)

img

Overall speaking , We observed a fairly high overall correlation (> 0.999), Indicates that there are no outliers . Besides , And PCA The picture is similar to , You can see that the samples are clustered by sample group . All in all , These graphs show us that the data quality is good , We can do differential expression analysis .

Be careful ：pheatmap Functions have many different arguments , We can change the default value to enhance the beauty of the graphics . If you are curious and want to know more , Try running the following code . How your graphics change ？ View the help page (?pheatmap) And determine the contribution of each added parameter to the graph .

heat.colors <- brewer.pal(6, "Blues")
pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10, 
   fontsize_row = 10, height=20)

Want to know RColorBrewer Package provides all available palettes ？ Try typing... In the console display.brewer.all(), See what happens ！

Reference material

[1]

Principal component analysis PCA: https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html

[2]

StatQuest The material of : https://www.youtube.com/watch?v=_UVHneBUBW0

[3]

StatQuest In the video : https://www.youtube.com/watch?v=_UVHneBUBW0

[4]

Longer courses : https://hbctraining.github.io/scRNA-seq/

[5]

DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation

[6]

resources : http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining

原网站

版权声明
本文为[Wangshixiang]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/178/202206271955180185.html