当前位置：网站首页>Follow the archiving tutorial to learn rnaseq analysis (III): count standardization using deseq2

Follow the archiving tutorial to learn rnaseq analysis (III): count standardization using deseq2

2022-06-27 22:45:00 【Wangshixiang】

Follow the archive tutorial RNAseq analysis （ One ）

Follow the archive tutorial RNAseq analysis （ Two ）

Standardization

The first step in the workflow of differential expression analysis is count standardization , This is necessary for accurate comparison of gene expression between samples .

img

Apart from many factors of indifference , Comparison of each gene reads Counting and RNA In direct proportion to the expression of . Standardization is the process of scaling the original count value to account for unrelated factors . In this way , The expression level is between samples and / Or more comparable within the sample .

The main factors often considered in the standardization process are ：

Sequencing depth ： The comparison of gene expression among samples needs to consider the sequencing depth . In the following example , Each gene in the sample A It seems that the expression in B The expression of has doubled , But this is a sample A The result of doubling the sequencing depth .

img

Be careful : In the diagram above , Each pink and green rectangle represents a gene aligned read. Connected by dashed lines across introns read.

Gene length ： To compare the expression of different genes in the same sample , Gene length needs to be considered . In this case , gene X And genes Y Have a similar level of expression , But mapping to genes X Of reads Numbers will be more than mapped to genes Y Of reads Much more , Because of genes X Longer .

img

RNA form ： There are some highly different gene expression between samples , Differences in the number of genes expressed between samples , Or there is pollution , These may lead to the deviation of some normalization methods . In order to accurately compare the expression between samples , Suggested consideration RNA form , This is particularly important in differential expression analysis . In this case , Suppose the sample A And the sample B The sequencing depth is similar , except DE Extragene , The expression levels of each gene were similar among samples . because DE Genes make up the sample B Most of , sample B There will be a big bias in the count of . therefore ,B Other genes in the sample seem to be better than A There is less expression of the same gene in the sample .

img

Although standardization is necessary for differential expression analysis , But for exploratory data analysis 、 It is also necessary to visualize data and to study or compare counts between or within samples .

Common standardization methods

There are several common standardized methods to explain these differences ：

Standardization method	describe	Considerations	Use recommendations
CPM (counts per million)	counts scaled by total number of reads	sequencing depth	gene count comparisons between replicates of the same sample group; NOT for within sample comparisons or DE analysis
TPM (transcripts per kilobase million)	counts per length of transcript (kb) per million reads mapped	sequencing depth and gene length	gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)	similar to TPM	sequencing depth and gene length	gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis
DESeq2’s median of ratios	counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene	sequencing depth and RNA composition	gene count comparisons between samples and for DE analysis; NOT for within sample comparisons
EdgeR’s trimmed mean of M values (TMM)	uses a weighted trimmed mean of the log expression ratios between samples	sequencing depth, RNA composition	gene count comparisons between samples and for DE analysis; NOT for within sample comparisons

RPKM/FPKM ( Not recommended for comparison between samples ）

although TPM and RPKM/FPKM Both normalization methods consider sequencing depth and gene length , but RPKM/FPKM Not recommended . as a result of RPKM/FPKM The normalized count value output by the method is not comparable between samples .

Use RPKM/FPKM normalization , Of each sample RPKM/FPKM The total number of normalized counts will be different . therefore , You can't compare the standardized count of each gene equally between samples .

RPKM-normalized The counting table

gene	sampleA	sampleB
XCR1	5.5	5.5
WASHC1	73.4	21.8
…	…	…
Total RPKM-normalized counts	1,000,000	1,500,000

for example , In the above table , Even if RPKM The count values are the same ,SampleA And XCR1(5.5/1,000,000) The relative counting ratio is also higher than that of sampleB(5.5/1,500,000) The relevant count proportion is higher . therefore , We can't directly compare sampleA and sampleB in XCR1( Or other genes ) Count of , Because the total number of normalized counts between samples is different .

notes :StatQuest This video of [1] Shows in more detail why you should use TPM Instead of RPKM/FPKM, If it is necessary to standardize the sequencing depth and gene length .

DESeq2- Normalized count ： Ratio median method

Because the tool of differential expression analysis is to compare the counts between sample groups of the same gene , The analysis tool does not need to consider gene length . However , Analysis does need to consider sequencing depth and RNA form .

For sequencing depth and RNA The composition is normalized ,DESeq2 The median ratio method was used . There is only one step on the client side , But there are multiple steps on the back end , As follows .

Be careful ： The following steps describe in detail when you run a single function to get DE When genes are used DESeq2 Some of the steps performed . Basically , For typical RNA-seq analysis , You won't run these steps alone .

step 1： Create pseudo reference samples （ Row geometric mean ）

For each gene , Create a pseudo reference sample , It is equal to the geometric mean of all samples .

https://zhuanlan.zhihu.com/p/23809612

gene	sampleA	sampleB	pseudo-reference sample
EF2A	1489	906	sqrt(1489 * 906) = 1161.5
ABCD1	22	13	sqrt(22 * 13) = 17.7
…	…	…	…

step 2： Calculate the ratio of each sample to the reference sample

For each gene in the sample , Calculate the ratio (sample/ref)（ As shown in the figure below ）. This will be done for each example in the dataset . Because most genes are not differentially expressed , So the proportion of most genes in each sample should be similar .

gene	sampleA	sampleB	pseudo-reference sample	ratio of sampleA/ref	ratio of sampleB/ref
EF2A	1489	906	1161.5	1489/1161.5 = 1.28	906/1161.5 = 0.78
ABCD1	22	13	16.9	22/16.9 = 1.30	13/16.9 = 0.77
MEFV	793	410	570.2	793/570.2 = 1.39	410/570.2 = 0.72
BAG1	76	42	56.5	76/56.5 = 1.35	42/56.5 = 0.74
MOV10	521	1196	883.7	521/883.7 = 0.590	1196/883.7 = 1.35
…	…	…	…

step 3： Calculate the normalization factor for each sample （ Scale factor ,size factor）

The median of all ratios in a given sample ( Calculated by column in the above table ) As the standardization factor of the sample （ Scale factor ）, Calculated as follows . Note that differentially expressed genes should not affect the median ：

normalization_factor_sampleA <- median(c(1.28, 1.3, 1.39, 1.35, 0.59))

normalization_factor_sampleB <- median(c(0.78, 0.77, 0.72, 0.74, 1.35))

The figure below illustrates the median distribution of all gene ratios in a single sample （ Frequency at y On the shaft ）.

img

The median ratio method assumes that not all genes are differentially expressed ; therefore , The normalization factor should take into account the sequencing depth and RNA form （ A large outlier gene does not represent a median ratio value ）. This method is applicable to up regulation / Downregulation of imbalance and a large number of differentially expressed genes has a strong antagonistic effect .

Usually these size factors are in 1 about , If you see a big difference between the samples , It's important to note that , Because this may indicate the existence of extreme outliers .

step 4： Use the normalization factor to calculate the normalized count value

This is achieved by dividing each original count value in a given sample by the standardization factor of the sample to generate a standardized count value . This is for all counts ( Every gene in every sample ) Executive . for example , If SampleA The median ratio of 1.3,SampleB The median ratio of 0.77, The normalized count can be calculated as follows ：

SampleA median ratio = 1.3

SampleB median ratio = 0.77

Raw Counts

gene	sampleA	sampleB
EF2A	1489	906
ABCD1	22	13
…	…	…

Normalized Counts

gene	sampleA	sampleB
EF2A	1489 / 1.3 = 1145.39	906 / 0.77 = 1176.62
ABCD1	22 / 1.3 = 16.92	13 / 0.77 = 16.88
…	…	…

Please note that , The normalized count value is not an integer .

Use DESeq2 Yes Mov10 Data sets are counted and standardized

Now we have learned about the count normalization theory , Next we will use DESeq2 normalization Mov10 Count of data sets . This takes several steps ：

Make sure that the metadata data frame appears with a row name , And in the same order as the column names of the count data frame .
Create a DESeqDataSet object .
Generate normalized count

1. Match metadata and count data

We should always ensure that the example names match between the two files , And the order of the examples is correct . If not ,DESeq2 Will output an error .

### Check that sample names match in both files
all(colnames(txi$counts) %in% rownames(meta))
all(colnames(txi$counts) == rownames(meta))

If the data does not match , have access to match() Function to rearrange them to match .

2. establish DESeq2 object

Bioconductor Software packages are usually in R Define and use a custom class to store data ( input data 、 Intermediate data and results ). These custom data structures are similar to lists , Because they can contain many different data types / structure . however , Unlike the list , They have pre - specified data slots , Used to store certain types of / Class . The data stored in these pre specified slots can be accessed by using specific package definition functions .

Let's start with creating DESeqDataSet Object start , Then we can talk more about what is stored in it . To create objects , We will need count matrix and metadata table as input . We also need to specify a design formula . The design formula specifies the columns in the metadata table , And how these columns should be used in the analysis . For our dataset , We are only interested in one column , namely ~sampletype. There are three factor levels in this column , It tells DESeq2, For each gene , We want to evaluate these different levels of gene expression .

Our count matrix input is stored in txi In the list object , So we use DESeqDataSetFromTximport() The function passes it , This function extracts the count part and rounds the value to the nearest integer .

## Create DESeq2Dataset object
dds <- DESeqDataSetFromTximport(
  txi,
  colData = meta,
  design = ~ sampletype)

Be careful ： Because we created a data variable containing count in the previous chapter , We can also use it as input . however , under these circumstances , We want to use DESeqDataSetFromMatrix() function .

img

If you like , You can use a specific DESeq To access different data slots and retrieve information . for example , Suppose we want the original count matrix , We will use counts()( Be careful : We nest it in View() Function , So we can see it in the script editor , Instead of printing in the console )：

View(counts(dds))

As we complete the workflow , We will use correlation functions to check what information is stored in the object .

3. Generate Mov10 Standardized count

The next step is to standardize the count data , So that a fair genetic comparison can be made between samples .

img

To perform the normalized ratio median method ,DESeq2 There is one estimateSizeFactors() function , It will generate a size factor for us . We will use this function in the following example , But in a typical RNA-seq Analyzing , This step is made up of DESeq() Function automatic , We'll see in the back .

dds <- estimateSizeFactors(dds)

By assigning the result to dds object , We will fill in with the appropriate information DESeqDataSet The slot of the object . We can use the following method to view the normalization factor of each sample ：

sizeFactors(dds)
#Irrel_kd_1 Irrel_kd_2 Irrel_kd_3 Mov10_kd_2 Mov10_kd_3 Mov10_oe_1 Mov10_oe_2 Mov10_oe_3 
# 1.1150371  0.9606366  0.7493552  1.5634128  0.9359082  1.2257749  1.1406863  0.6541689

Now? , From you to dds Retrieve the normalized count matrix , We use counts() Function and add arguments normalized=TRUE.

normalized_counts <- counts(dds, normalized=TRUE)

We can save this standardized data matrix to a file for future use ：

write.table(normalized_counts, file="data/normalized_counts.txt", 
            sep="\t", quote=F, col.names=NA)

Be careful ：DESeq2 Standardized counting is not actually used , Instead, use raw counts , And in the generalized linear model (GLM) Modeling standardization in . These standardized counts are useful for downstream visualization of results , But not as DESeq2 Or any other tool that uses a negative binomial model for differential expression analysis .