当前位置:网站首页>Follow the archiving tutorial to learn rnaseq analysis (IV): QC method for de analysis using deseq2
Follow the archiving tutorial to learn rnaseq analysis (IV): QC method for de analysis using deseq2
2022-06-27 22:47:00 【Wangshixiang】
Follow the archive tutorial RNAseq analysis ( One )
Follow the archive tutorial RNAseq analysis ( Two )
Follow the archive tutorial RNAseq analysis ( 3、 ... and ): Use DESeq2 Perform count standardization
The quality control QC
DESeq2 The next step in the workflow is QC, It includes performing sample level and gene level on count data QC Check the steps , To help us ensure that the sample / Repetition looks good .
img
Sample level QC
RNA-seq A useful initial step in analysis is usually to assess the overall similarity between samples :
- Which samples are similar , What's different ?
- Does this meet the expectations of the experimental design ?
- What are the main sources of variation in the dataset ?
To explore the similarity of our samples , We will use principal component analysis (PCA) And hierarchical clustering method to perform sample level QC. Our sample level QC Let's see how our repetition comes together , And observe whether our experimental conditions represent the main source of changes in the data . Execute sample level QC Any sample outliers can also be identified , This may require further study , To determine if they need to be in DE Delete... Before analysis .
img
When using these unsupervised clustering methods , Standardized counting log2 Transformation can improve the visual distance / clustering .DESeq2 For sample level QC A regularized logarithmic transformation using normalized counts (rlog), because It adjusts the variance between the means , Thus, clustering is improved .
img
Be careful :DESeq2 Documentation suggests big data sets (100 Samples ) Use variance stable transformation (vst) instead of rlog To perform count conversion , because rlog Functions may run too long , and vst()
Function has and rlog Similar properties , Faster .
Principal component analysis PCA[1]
Principal component analysis (PCA) It's a technology , Used to emphasize change , And propose a powerful model in the data set ( Dimension reduction ). About PCA The details are as follows ( Based on coming from StatQuest The material of [2], If you want a more comprehensive description , We encourage you to explore StatQuest In the video [3] And we Longer courses [4]).
Suppose we have a data set with two samples and four genes . Based on this expression data , We want to evaluate the relationship between these samples . We can plot the counting relationship between one sample and another sample , sample 1 stay x On the shaft , sample 2 stay y On the shaft , As shown below :
img
about PCA analysis , The first step is to draw this diagram , And draw a line through the data in the direction representing the most changes . In this case , Changes most along the diagonal . in other words , The largest distribution in the data is between the two endpoints of the line . This is called the first principal component , or PC1. The genes at both ends of this line ( gene B And genes C) It has the greatest influence on the direction of this line .
img
After drawing this line and determining the amount of influence of each gene ,PCA The score for each sample will be calculated . Of each sample PC1 The score is calculated by multiplying the effect by the standardized count and the sum of all genes . We can express data (PC2) The second largest change in the data plot another line , Then calculate the score , Then there is the third line , And so on , Until the total number of samples in the dataset .
Sample1 PC1 score = (read count Gene A * influence Gene A) +
(read count Gene B * influence Gene B) +
.. for all genes
actually , Your actual dataset has a larger dimension ( More samples and more genes ). Initial sample - Sample map , Will be in n In the dimension n The axes represent the total number of samples . The end result is a two-dimensional matrix , Where rows represent samples , The column reflects the score of each principal component . To evaluate the results of principal component analysis , We usually compare the principal components with each other , Most varied from the interpretation data pc Start .
img
If two samples are correct PC1 The level of gene expression that represents a significant contribution to the variation is similar , Then they will be in PC1 The axes are drawn closely together . therefore , We expect biological repeats to have similar scores ( Because the same genes change ), And gather in PC1 and / or PC2 On , Samples from different treatment groups had different scores . This is the easiest to understand visualization example PCA chart .
explain PCA chart
Below we have a sample data set and some related PCA chart , To understand how to interpret them . The metadata of the experiment is shown below . The main conditions of interest are treatment
.
img
When in PC1 and PC2 When visualizing on , We do not see samples separated by processing , So we decided to explore other sources of variation in the data . We hope that we have included all possible known sources of variation in the metadata table , And we can use these factors for PCA Graph coloring .
img
We start from the factor cage
Start , but cage
Factors do not seem to explain PC1 or PC2 Changes on .
img
then , We color according to gender , It seems that PC2 Separate the sample on the . This is good information worth noting , Because we can use it to explain the changes in the model due to gender , And model it by regression .
img
Next, let's discuss strain
factor , Find it can explain PC1 Changes on .
img
We have been able to determine PC1 and PC2 The source of variation , It's great . By taking this into account in our model , We should be able to detect more differentially expressed genes due to treatment .
The worrisome thing is , We see that two samples are not clustered with the correct strain . This will indicate that a sample exchange is possible , And shall be investigated , To determine whether these samples are indeed labeled strains . If we find that there is ( FALSE ) In exchange for , We can exchange samples in metadata . However , If we think they are correctly marked or uncertain , We can delete samples from the dataset .
But we still haven't found , Whether the treatment is strain
And the main sources of post sex variation . therefore , We explore PC3 and PC4 Let's see if treatment drives these PC The change represented .
img
We found out PC3 Process the separated sample on the , We treat our DE The analysis is optimistic , Because the conditions we are interested in , Handle , Is in PC3 Upper separated , We can go back to driving PC1 and PC2 Variation of .
How many changes are explained according to the first several principal components , You may want to explore more ( That is, consider more ingredients and draw a pair of combinations ). Even if your sample cannot be clearly separated by the experimental variables , You can still go from DE Biologically relevant results are obtained from the analysis . If you expect a very small amount of effect , Then the signal may be submerged by external sources of change . Where you can identify these sources , It is important to consider these sources in your model , Because it is for detection DE Genetic tools provide more powerful functions .
Heat map of hierarchical clustering
Similar to principal component analysis , Hierarchical clustering is another complementary method for identifying strong patterns and potential outliers in data sets . The heat map shows the correlation of gene expression in all paired samples in the data set . Because most genes are not differentially expressed , Therefore, the correlation between samples is generally high ( Greater than 0.80). lower than 0.80 Your sample may indicate that there are outliers and / Or sample contamination .
The hierarchical tree can indicate which samples are more similar to each other according to the normalized gene expression values . Color blocks represent substructures in the data , You might see the copies of each sample group grouped together as a block . Besides , We want to see the aggregated samples similar to those in PCA The grouping observed in the figure . In the picture below , We will pay close attention to ‘Wt_3’ and ‘KO_3’ The sample of is not clustered with other duplicate samples . We want to explore principal component analysis , See if we see the same sample clustering .
img
Gene level QC
In addition to checking samples / Whether the repeated clustering is good , And more QC step . Before differential expression analysis , It is beneficial to omit genes that have little or no chance of being detected for differential expression . This will improve the ability to detect differentially expressed genes . The missing genes fall into three categories :
- In all samples, the number is 0 Genes
- Genes with extreme outliers
- Low mean standardized count of genes
img
By default ,DESeq2 This filtering will be performed : Others DE Tools , Such as edgeR Will not be . Even using limma-voom and / or edgeR Quasi Likelihood method , Filtering is also a necessary step . When using other tools , Please make sure to follow the pre filtration steps , just as Bioconductor As outlined in the user guide on , Because they usually perform better .
Use DESeq2 Conduct Mov10 Quality assessment and exploratory analysis
Now we have a good understanding of what is commonly used in RNA-seq Of QC step , Let's start with the Mov10 Datasets implement them .
Use rlog Convert normalized count
To improve PCA Distance from hierarchical clustering visualization method / clustering , We need to apply... To standardized counts rlog Transform to adjust mean variance .
During the quality assessment , Standardized counting rlog Transformation is only necessary for these visualization methods . We will not use these standard counts downstream .
### Transform counts for data visualization
rld <- rlog(dds, blind=TRUE)
blind=TRUE
The parameter will result in an unbiased transformation of the sample condition information . When performing a quality assessment , It is important to include this option .DESeq2[5] Documentation has more details .
rlog The function returns a DESeqTransform object , Another specific DESeq Object type of . The reason you don't just get a matrix of converted values is , Calculation rlog All parameters of the transformation ( The size factor ) Are stored in this object . We use this object to draw the principal component analysis and hierarchical clustering diagram of quality assessment .
Be careful : When you have, for example > 20 One sample ,rlog()
Functions can be a little slow . In these cases ,vst()
Functions are much faster , And perform a similar transformation , Fit with plotPCA()
Use it together . Due to the nature of optimization and transformation , Use vst()
It usually takes only a few seconds .
Principal component analysis (PCA)
DESeq2 There is one for drawing PCA Graph's built-in functions , It is used at the bottom ggplot2. It's great , Because it saves us from typing lines of code and fiddling with different things ggplot2 The time of the layer . Besides , It will directly rlog Object as input , Thus, the trouble of extracting relevant information from it is eliminated .
### Plot PCA
plotPCA(rld, intgroup="sampletype")
plotPCA()
The function takes two arguments as input rlog Objects and intgroup( The columns in the metadata we are interested in ).
img
About the similarity of samples , What does this picture tell you ? Does it meet the expectations of the experimental design ? By default , Before using this function 500 The most variable genes . You can do this by adding ntop Parameter and specify how many genes to use to chart to change this .
Be careful :plotPCA()
The function will only return PC1 and PC2 Value . If you want to explore other things in the data pc, Or if you want to be sure about these pc Genes that play a major role , You can use prcomp()
function . for example , To draw any pc, We can run the following code :
# Input is a matrix of log transformed values
rld <- rlog(dds, blind=T)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))
# Create data frame with metadata and PC3 and PC4 values for input to ggplot
df <- cbind(meta, pca$x)
ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))
have access to resources [6] Learn how to use PC Make more complex queries .
Hierarchical clustering
Because in DESeq2 There are no built-in functions for heat maps in , We will use pheatmap In bag pheatmap()
function . This function requires a matrix of values / Data frame as input , So the first thing we need to do is start from rld Object :
### Extract the rlog matrix from the object
rld_mat <- assay(rld)
## assay() is function from the "SummarizedExperiment" package that was loaded when you loaded DESeq2
Then we need to calculate the pairwise correlation value of the sample . We can use cor()
Function to do this :
### Compute pairwise correlation values
rld_cor <- cor(rld_mat) ## cor() is a base R function
head(rld_cor) ## check the output of cor(), make note of the rownames and colnames
Now plot the relevant values as a heat map :
### Load pheatmap package
library(pheatmap)
### Plot heatmap
pheatmap(rld_cor, annotation = meta)
img
Overall speaking , We observed a fairly high overall correlation (> 0.999), Indicates that there are no outliers . Besides , And PCA The picture is similar to , You can see that the samples are clustered by sample group . All in all , These graphs show us that the data quality is good , We can do differential expression analysis .
Be careful :pheatmap Functions have many different arguments , We can change the default value to enhance the beauty of the graphics . If you are curious and want to know more , Try running the following code . How your graphics change ? View the help page (?pheatmap) And determine the contribution of each added parameter to the graph .
heat.colors <- brewer.pal(6, "Blues")
pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10,
fontsize_row = 10, height=20)
Want to know RColorBrewer Package provides all available palettes ? Try typing... In the console display.brewer.all()
, See what happens !
Reference material
[1]
Principal component analysis PCA: https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html
[2]
StatQuest The material of : https://www.youtube.com/watch?v=_UVHneBUBW0
[3]
StatQuest In the video : https://www.youtube.com/watch?v=_UVHneBUBW0
[4]
Longer courses : https://hbctraining.github.io/scRNA-seq/
[5]
DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation
[6]
resources : http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining
边栏推荐
- Oracle obtains the beginning and end of the month time, and obtains the beginning and end of the previous month time
- OpenSSL Programming II: building CA
- 资深猎头团队管理者:面试3000顾问,总结组织出8大共性(茅生)
- Web Worker介绍及使用案例
- Penetration learning - shooting range chapter - detailed introduction to Pikachu shooting range (under continuous update - currently only the SQL injection part is updated)
- Teach you how to print your own log -- how to customize log4j2 components
- Is it safe for GF futures to open an account?
- Livox Lidar+APX15 实时高精度雷达建图复现整理
- The karsonzhang/fastadmin addons provided by the system reports an error
- 《7天學會Go並發編程》第7天 go語言並發編程Atomic原子實戰操作含ABA問題
猜你喜欢
Livox lidar+ Hikvision camera real-time 3D reconstruction based on loam to generate RGB color point cloud
从学生到工程师的蜕变之路
资深猎头团队管理者:面试3000顾问,总结组织出8大共性(茅生)
雪糕还是雪“高”?
Ellipsis after SQLite3 statement Solutions for
使用sqlite3语句后出现省略号 ... 的解决方法
初识C语言 第二弹
Fill in the blank of rich text test
信通院举办“业务与应用安全发展论坛” 天翼云安全能力再获认可
Do280openshift access control -- Security Policy and chapter experiment
随机推荐
Hiplot 在线绘图工具的本地运行/开发库开源
DCC888 :Register Allocation
爬虫笔记(1)- urllib
结构化机器学习项目(一)- 机器学习策略
Do you know the full meaning of log4j2 configurations? Take you through all the components of log4j2
Management system itclub (medium)
average-population-of-each-continent
Using the cucumber automated test framework
Introduction to MySQL operation (IV) -- data sorting (ascending, descending, and multi field sorting)
[随笔]ME53N 增加按钮,调用URL
Day 7 of "learning to go concurrent programming in 7 days" go language concurrent programming atomic atomic actual operation includes ABA problem
Learn to go concurrent programming in 7 days go language sync Application and implementation of cond
Azure Kinect DK 实现三维重建 (PC非实时版)
How to prioritize the contents in the queue every second
CUDA error:out of memory caused by insufficient video memory of 6G graphics card
Is it safe for GF futures to open an account?
Introduce you to ldbc SNB, a powerful tool for database performance and scenario testing
网易云“情怀”底牌失守
Ellipsis after SQLite3 statement Solutions for
The karsonzhang/fastadmin addons provided by the system reports an error