当前位置:网站首页>Follow the archiving tutorial to learn rnaseq analysis (III): count standardization using deseq2
Follow the archiving tutorial to learn rnaseq analysis (III): count standardization using deseq2
2022-06-27 22:45:00 【Wangshixiang】
Follow the archive tutorial RNAseq analysis ( One )
Follow the archive tutorial RNAseq analysis ( Two )
Standardization
The first step in the workflow of differential expression analysis is count standardization , This is necessary for accurate comparison of gene expression between samples .
img
Apart from many factors of indifference , Comparison of each gene reads Counting and RNA In direct proportion to the expression of . Standardization is the process of scaling the original count value to account for unrelated factors . In this way , The expression level is between samples and / Or more comparable within the sample .
The main factors often considered in the standardization process are :
- Sequencing depth : The comparison of gene expression among samples needs to consider the sequencing depth . In the following example , Each gene in the sample A It seems that the expression in B The expression of has doubled , But this is a sample A The result of doubling the sequencing depth .
img
Be careful : In the diagram above , Each pink and green rectangle represents a gene aligned read. Connected by dashed lines across introns read.
- Gene length : To compare the expression of different genes in the same sample , Gene length needs to be considered . In this case , gene X And genes Y Have a similar level of expression , But mapping to genes X Of reads Numbers will be more than mapped to genes Y Of reads Much more , Because of genes X Longer .
img
- RNA form : There are some highly different gene expression between samples , Differences in the number of genes expressed between samples , Or there is pollution , These may lead to the deviation of some normalization methods . In order to accurately compare the expression between samples , Suggested consideration RNA form , This is particularly important in differential expression analysis . In this case , Suppose the sample A And the sample B The sequencing depth is similar , except DE Extragene , The expression levels of each gene were similar among samples . because DE Genes make up the sample B Most of , sample B There will be a big bias in the count of . therefore ,B Other genes in the sample seem to be better than A There is less expression of the same gene in the sample .
img
Although standardization is necessary for differential expression analysis , But for exploratory data analysis 、 It is also necessary to visualize data and to study or compare counts between or within samples .
Common standardization methods
There are several common standardized methods to explain these differences :
Standardization method | describe | Considerations | Use recommendations |
---|---|---|---|
CPM (counts per million) | counts scaled by total number of reads | sequencing depth | gene count comparisons between replicates of the same sample group; NOT for within sample comparisons or DE analysis |
TPM (transcripts per kilobase million) | counts per length of transcript (kb) per million reads mapped | sequencing depth and gene length | gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis |
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped) | similar to TPM | sequencing depth and gene length | gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis |
DESeq2’s median of ratios | counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene | sequencing depth and RNA composition | gene count comparisons between samples and for DE analysis; NOT for within sample comparisons |
EdgeR’s trimmed mean of M values (TMM) | uses a weighted trimmed mean of the log expression ratios between samples | sequencing depth, RNA composition | gene count comparisons between samples and for DE analysis; NOT for within sample comparisons |
RPKM/FPKM ( Not recommended for comparison between samples )
although TPM and RPKM/FPKM Both normalization methods consider sequencing depth and gene length , but RPKM/FPKM Not recommended . as a result of RPKM/FPKM The normalized count value output by the method is not comparable between samples .
Use RPKM/FPKM normalization , Of each sample RPKM/FPKM The total number of normalized counts will be different . therefore , You can't compare the standardized count of each gene equally between samples .
RPKM-normalized The counting table
gene | sampleA | sampleB |
---|---|---|
XCR1 | 5.5 | 5.5 |
WASHC1 | 73.4 | 21.8 |
… | … | … |
Total RPKM-normalized counts | 1,000,000 | 1,500,000 |
for example , In the above table , Even if RPKM The count values are the same ,SampleA And XCR1(5.5/1,000,000) The relative counting ratio is also higher than that of sampleB(5.5/1,500,000) The relevant count proportion is higher . therefore , We can't directly compare sampleA and sampleB in XCR1( Or other genes ) Count of , Because the total number of normalized counts between samples is different .
notes :StatQuest This video of [1] Shows in more detail why you should use TPM Instead of RPKM/FPKM, If it is necessary to standardize the sequencing depth and gene length .
DESeq2- Normalized count : Ratio median method
Because the tool of differential expression analysis is to compare the counts between sample groups of the same gene , The analysis tool does not need to consider gene length . However , Analysis does need to consider sequencing depth and RNA form .
For sequencing depth and RNA The composition is normalized ,DESeq2 The median ratio method was used . There is only one step on the client side , But there are multiple steps on the back end , As follows .
Be careful : The following steps describe in detail when you run a single function to get DE When genes are used DESeq2 Some of the steps performed . Basically , For typical RNA-seq analysis , You won't run these steps alone .
step 1: Create pseudo reference samples ( Row geometric mean )
For each gene , Create a pseudo reference sample , It is equal to the geometric mean of all samples .
https://zhuanlan.zhihu.com/p/23809612
gene | sampleA | sampleB | pseudo-reference sample |
---|---|---|---|
EF2A | 1489 | 906 | sqrt(1489 * 906) = 1161.5 |
ABCD1 | 22 | 13 | sqrt(22 * 13) = 17.7 |
… | … | … | … |
step 2: Calculate the ratio of each sample to the reference sample
For each gene in the sample , Calculate the ratio (sample/ref)( As shown in the figure below ). This will be done for each example in the dataset . Because most genes are not differentially expressed , So the proportion of most genes in each sample should be similar .
gene | sampleA | sampleB | pseudo-reference sample | ratio of sampleA/ref | ratio of sampleB/ref |
---|---|---|---|---|---|
EF2A | 1489 | 906 | 1161.5 | 1489/1161.5 = 1.28 | 906/1161.5 = 0.78 |
ABCD1 | 22 | 13 | 16.9 | 22/16.9 = 1.30 | 13/16.9 = 0.77 |
MEFV | 793 | 410 | 570.2 | 793/570.2 = 1.39 | 410/570.2 = 0.72 |
BAG1 | 76 | 42 | 56.5 | 76/56.5 = 1.35 | 42/56.5 = 0.74 |
MOV10 | 521 | 1196 | 883.7 | 521/883.7 = 0.590 | 1196/883.7 = 1.35 |
… | … | … | … |
step 3: Calculate the normalization factor for each sample ( Scale factor ,size factor)
The median of all ratios in a given sample ( Calculated by column in the above table ) As the standardization factor of the sample ( Scale factor ), Calculated as follows . Note that differentially expressed genes should not affect the median :
normalization_factor_sampleA <- median(c(1.28, 1.3, 1.39, 1.35, 0.59))
normalization_factor_sampleB <- median(c(0.78, 0.77, 0.72, 0.74, 1.35))
The figure below illustrates the median distribution of all gene ratios in a single sample ( Frequency at y On the shaft ).
img
The median ratio method assumes that not all genes are differentially expressed ; therefore , The normalization factor should take into account the sequencing depth and RNA form ( A large outlier gene does not represent a median ratio value ). This method is applicable to up regulation / Downregulation of imbalance and a large number of differentially expressed genes has a strong antagonistic effect .
Usually these size factors are in 1 about , If you see a big difference between the samples , It's important to note that , Because this may indicate the existence of extreme outliers .
step 4: Use the normalization factor to calculate the normalized count value
This is achieved by dividing each original count value in a given sample by the standardization factor of the sample to generate a standardized count value . This is for all counts ( Every gene in every sample ) Executive . for example , If SampleA The median ratio of 1.3,SampleB The median ratio of 0.77, The normalized count can be calculated as follows :
SampleA median ratio = 1.3
SampleB median ratio = 0.77
Raw Counts
gene | sampleA | sampleB |
---|---|---|
EF2A | 1489 | 906 |
ABCD1 | 22 | 13 |
… | … | … |
Normalized Counts
gene | sampleA | sampleB |
---|---|---|
EF2A | 1489 / 1.3 = 1145.39 | 906 / 0.77 = 1176.62 |
ABCD1 | 22 / 1.3 = 16.92 | 13 / 0.77 = 16.88 |
… | … | … |
Please note that , The normalized count value is not an integer .
Use DESeq2 Yes Mov10 Data sets are counted and standardized
Now we have learned about the count normalization theory , Next we will use DESeq2 normalization Mov10 Count of data sets . This takes several steps :
- Make sure that the metadata data frame appears with a row name , And in the same order as the column names of the count data frame .
- Create a
DESeqDataSet
object . - Generate normalized count
1. Match metadata and count data
We should always ensure that the example names match between the two files , And the order of the examples is correct . If not ,DESeq2 Will output an error .
### Check that sample names match in both files
all(colnames(txi$counts) %in% rownames(meta))
all(colnames(txi$counts) == rownames(meta))
If the data does not match , have access to match()
Function to rearrange them to match .
2. establish DESeq2 object
Bioconductor Software packages are usually in R Define and use a custom class to store data ( input data 、 Intermediate data and results ). These custom data structures are similar to lists , Because they can contain many different data types / structure . however , Unlike the list , They have pre - specified data slots , Used to store certain types of / Class . The data stored in these pre specified slots can be accessed by using specific package definition functions .
Let's start with creating DESeqDataSet Object start , Then we can talk more about what is stored in it . To create objects , We will need count matrix and metadata table as input . We also need to specify a design formula . The design formula specifies the columns in the metadata table , And how these columns should be used in the analysis . For our dataset , We are only interested in one column , namely ~sampletype
. There are three factor levels in this column , It tells DESeq2, For each gene , We want to evaluate these different levels of gene expression .
Our count matrix input is stored in txi
In the list object , So we use DESeqDataSetFromTximport()
The function passes it , This function extracts the count part and rounds the value to the nearest integer .
## Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(
txi,
colData = meta,
design = ~ sampletype)
Be careful : Because we created a data variable containing count in the previous chapter , We can also use it as input . however , under these circumstances , We want to use DESeqDataSetFromMatrix()
function .
img
If you like , You can use a specific DESeq To access different data slots and retrieve information . for example , Suppose we want the original count matrix , We will use counts()
( Be careful : We nest it in View() Function , So we can see it in the script editor , Instead of printing in the console ):
View(counts(dds))
As we complete the workflow , We will use correlation functions to check what information is stored in the object .
3. Generate Mov10 Standardized count
The next step is to standardize the count data , So that a fair genetic comparison can be made between samples .
img
To perform the normalized ratio median method ,DESeq2 There is one estimateSizeFactors()
function , It will generate a size factor for us . We will use this function in the following example , But in a typical RNA-seq Analyzing , This step is made up of DESeq()
Function automatic , We'll see in the back .
dds <- estimateSizeFactors(dds)
By assigning the result to dds object , We will fill in with the appropriate information DESeqDataSet The slot of the object . We can use the following method to view the normalization factor of each sample :
sizeFactors(dds)
#Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3
# 1.1150371 0.9606366 0.7493552 1.5634128 0.9359082 1.2257749 1.1406863 0.6541689
Now? , From you to dds Retrieve the normalized count matrix , We use counts()
Function and add arguments normalized=TRUE
.
normalized_counts <- counts(dds, normalized=TRUE)
We can save this standardized data matrix to a file for future use :
write.table(normalized_counts, file="data/normalized_counts.txt",
sep="\t", quote=F, col.names=NA)
Be careful :DESeq2 Standardized counting is not actually used , Instead, use raw counts , And in the generalized linear model (GLM) Modeling standardization in . These standardized counts are useful for downstream visualization of results , But not as DESeq2 Or any other tool that uses a negative binomial model for differential expression analysis .
Reference material
[1]
StatQuest This video of : http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
边栏推荐
- 【mysql实战】查询语句实战演示
- Start the start php
- MySQL greater than less than or equal to symbol representation
- 7 jours d'apprentissage de la programmation simultanée go 7 jours de programmation simultanée go Language Atomic Atomic Atomic actual Operation contains ABA Problems
- Day8 - cloud information project introduction and creation
- 批量处理-Excel导入模板1.1-支持多Sheet页
- Azure Kinect DK 实现三维重建 (jetson实时版)
- Memoirs of actual combat: breaking the border from webshell
- Login credentials (cookie+session and token token)
- 二维数组中修改代价最小问题【转换题意+最短路径】(Dijkstra+01BFS)
猜你喜欢
《7天学会Go并发编程》第7天 go语言并发编程Atomic原子实战操作含ABA问题
go语言切片Slice和数组Array对比panic: runtime error: index out of range问题解决
crontab定时任务常用命令
Penetration learning - shooting range chapter -dvwa shooting range detailed introduction (continuous updating - currently only the SQL injection part is updated)
解决本地连接不上虚拟机的问题
结构化机器学习项目(二)- 机器学习策略(2)
Introduction to MySQL operation (IV) -- data sorting (ascending, descending, and multi field sorting)
信通院举办“业务与应用安全发展论坛” 天翼云安全能力再获认可
Ellipsis after SQLite3 statement Solutions for
元气森林的5元有矿之死
随机推荐
Is flush stock trading software reliable?? Is it safe?
Kill the general and seize the "pointer" (Part 2)
[MySQL practice] query statement demonstration
C # QR code generation and recognition, removing white edges and any color
OpenSSL Programming II: building CA
Re recognize G1 garbage collector through G1 GC log
Beijing University of Posts and Telecommunications - multi-agent deep reinforcement learning for cost and delay sensitive virtual network function placement and routing
Gao fushuai in the unit testing industry, pytest framework, hands-on teaching, will do this in the future test reports~
关于davwa的SQL注入时报错:Illegal mix of collations for operation ‘UNION‘原因剖析与验证
Crawler notes (3) -selenium and requests
Basics of operators
[microservices] (16) -- distributed transaction Seata
元气森林的5元有矿之死
月薪3万的狗德培训,是不是一门好生意?
Exclusive interview with millions of annual salary. What should developers do if they don't fix bugs?
同花顺炒股软件可靠吗??安全嘛?
Transformation from student to engineer
网易云“情怀”底牌失守
Typescript learning
Record a list object traversal and determine the size of the float type