RNA-seq hands-on introduction (I): upstream data download, format conversion, and quality-control cleaning

2022-06-28 00:09:00 Shengxin skill tree

Two recent recruitment posts, "I once brought you 100,000 users, but now I wish you would go out of business" and "Shengxin skill tree knowledge-curation intern recruitment", brought me the good luck of meeting several excellent people. Everyone began working through a series of public datasets by following my NGS omics videos, and a few of them surprised me: with almost no communication or guidance, they quietly completed a full hands-on project.

His previous post was shared earlier. Below are his detailed notes on our Bilibili transcriptome video course.

Overview of this section:

  • 1. Find the GEO accession number in the article and get the SRR run numbers from NCBI
  • 2. On Linux, use the prefetch command to download the SRA files by SRR number
  • 3. Use fasterq-dump/fastq-dump to convert the SRA files to FASTQ format, then compress them with multithreaded pigz (optional)
  • 4. Use fastqc and multiqc to check the quality of the sequencing data
  • 5. Use trim-galore to remove low-quality bases and adapters

This continues from the previous part, RNA-seq hands-on introduction (zero): preparation before the RNA-seq workflow, creating new Linux and R environments

I. Get the SRR run numbers from NCBI

Source article for the data: Formative pluripotent stem cells show features of epiblast cells poised for gastrulation | Cell Research (nature.com). The GEO accession number is listed under Data availability in the article: GSE154290

Go to the NCBI website, search for GSE154290, and open the matching result.

Under Supplementary file, find the SRA Run Selector link.

Common Fields summarizes the basic properties of the dataset; for example, PAIRED in the table means paired-end sequencing data. For this walkthrough, under Found 27 Items tick two runs each of RNA_mESCs and RNA_EpiSCs, switch to the Selected option under Select, download the Accession List, and copy the SRR run numbers from it.


II. SRA data download

1. Create and enter the test project folder, then save the SRR numbers into the idname file

mkdir -p ~/test; cd ~/test
cat > idname <<EOF
SRR12207279
SRR12207280
SRR12207283
SRR12207284
EOF
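A stray blank line or a mistyped ID in idname only surfaces later as a failed download, so a quick format check can save a rerun. The sketch below is one way to do it. The function name check_ids and the rule that a run accession is SRR, ERR, or DRR followed by digits are assumptions made for illustration, not part of the original workflow.

```shell
#!/bin/sh
# check_ids FILE: print each non-empty line of FILE that does not look like a run accession.
# Assumption: a valid run ID is SRR/ERR/DRR followed only by digits.
check_ids() {
  tr -d ' \t\r' < "$1" |            # drop spaces, tabs, and carriage returns from copy-paste
    grep -v '^$' |                  # ignore blank lines
    grep -Ev '^[SED]RR[0-9]+$' |    # keep only lines that do NOT match the pattern
    sed 's/^/suspicious ID: /'
}

# Demo on a throwaway copy of the four IDs used in this tutorial
tmp=$(mktemp)
printf '%s\n' SRR12207279 SRR12207280 SRR12207283 SRR12207284 > "$tmp"
if [ -z "$(check_ids "$tmp")" ]; then
  echo "all IDs look valid"
fi
rm -f "$tmp"
```

On the real file this would be called as check_ids ~/test/idname; no output means every non-empty line matched the pattern.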

2. Create the script for downloading the SRA data

vim 00_prefetch.sh  

This mainly uses the prefetch command from sra-tools to download the SRA data

# script content ################################ 
echo -e "\n \n \n prefetch sra !!! \n \n \n " 
date 
mkdir -p ~/test/raw/sra/ 
cd ~/test/raw/sra/ 
pwd 

# start one background download per run ID, then wait for all of them to finish
while read -r id
do
  prefetch -O ./ "$id" &
done < ~/test/idname
wait
date

3. Run the script in the background with nohup and redirect its output to the log_00 log file

nohup bash 00_prefetch.sh >log_00 2>&1 &

Check the running jobs on the system and the file structure under the test project

The job is running normally. Wait for the downloads to finish and take a break in the meantime ヽ( ̄▽ ̄)ノ. When cat log_00 shows the words "downloaded successfully" for each run, the download is complete. Then check the downloaded data, and once everything is confirmed you can move on to the next step, file format conversion

prefetch.log
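Beyond reading log_00 by eye, the completion check can be scripted. This is a minimal sketch, assuming prefetch -O ./ leaves each run at ID/ID.sra (the layout the next script also relies on when it moves files out of the per-run subfolders). The helper names count_ok and missing_runs are hypothetical.

```shell
#!/bin/sh
# count_ok LOG: how many lines of the log contain "downloaded successfully"
count_ok() {
  grep -c 'downloaded successfully' "$1" || true   # grep exits 1 on zero matches but still prints 0
}

# missing_runs IDFILE SRADIR: print IDs that have no non-empty SRADIR/ID/ID.sra yet
missing_runs() {
  while read -r id; do
    [ -n "$id" ] || continue
    [ -s "$2/$id/$id.sra" ] || echo "$id"
  done < "$1"
}

# Demo with a fake download directory: only the first run is present
demo=$(mktemp -d)
printf '%s\n' SRR12207279 SRR12207280 > "$demo/idname"
mkdir -p "$demo/sra/SRR12207279"
echo "placeholder" > "$demo/sra/SRR12207279/SRR12207279.sra"
missing_runs "$demo/idname" "$demo/sra"    # prints SRR12207280
rm -rf "$demo"
```

With the tutorial's paths, the calls would be count_ok ~/test/log_00 and missing_runs ~/test/idname ~/test/raw/sra. A count equal to the number of IDs and an empty missing list both suggest the download finished.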


III. Convert SRA files to FASTQ format

This mainly uses the fasterq-dump command from sra-tools to convert the files to FASTQ, then multithreaded pigz to compress them into .gz files and save space (this step can be skipped), and then fastqc and multiqc to run quality control on the raw data and summarize it

fasterq-dump/fastq-dump Common parameters
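The script in this section passes three options to fasterq-dump: -3 (--split-3) writes the two mates of paired-end data to _1 and _2 files, with any unpaired reads in a separate file; -e (--threads) sets the number of threads; and -O (--outdir) sets the output directory. pigz -p sets its number of compression processes. The value 12 is hard-coded, which may be more cores than a given machine has. One possible way to cap it, as a sketch (pick_threads is a hypothetical helper, not part of sra-tools):

```shell
#!/bin/sh
# pick_threads MAX: print min(number of online CPU cores, MAX); falls back to 1
pick_threads() {
  max=${1:-12}
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  case $cores in
    ''|*[!0-9]*) cores=1 ;;   # non-numeric answer: be conservative
  esac
  if [ "$cores" -gt "$max" ]; then echo "$max"; else echo "$cores"; fi
}

threads=$(pick_threads 12)
echo "would use $threads threads"
# In the conversion loop this could replace the literal 12, for example:
#   fasterq-dump -3 -e "$threads" -O ./ "$id"
#   pigz -p "$threads" ~/test/raw/fq/*q
```

On a shared server it may be worth passing a smaller maximum than 12, since both tools will use every thread they are given.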

As before, first create the 01_sra2fq_qc1.sh script file

vim 01_sra2fq_qc1.sh  
########################################### 
# Move the .sra files out of their per-run subfolders, then delete the subfolders
date 
echo  -e "\n \n \n  111#  move files !!! \n \n \n  " 
cd ~/test/raw/sra/
cat ~/test/idname | while read id
do
  [ -d "$id" ] || continue   # skip blank lines and runs that were already moved
  mv "$id"/*  ./ 
  rm -rf "$id"/ 
done 
date 


echo  -e "\n \n \n  111#  sra>>>fq !!! \n \n \n  "
mkdir -p ~/test/raw/fq/
cd ~/test/raw/fq/
pwd
ls  ~/test/raw/sra/*.sra |while read id 
do
echo " PROCESS $(basename $id) "
# -3: split paired reads into _1/_2 files; -e: threads; -O: output directory
fasterq-dump -3 -e 12 -O ./ $id
# compress the FASTQ files just written (optional)
pigz -p 12   ~/test/raw/fq/*q
done
date


echo -e " \n \n \n  111# qc 1 !!! \n \n \n "       
mkdir -p ~/test/raw/qc1/
cd  ~/test/raw/qc1/
pwd
ls ~/test/raw/fq/* | xargs fastqc -t 12 -o  ./
multiqc ./

echo -e  " \n 111#  ALL  Work Done!!! \n "
date

Run the 01_sra2fq_qc1.sh script

nohup bash 01_sra2fq_qc1.sh >log_01 2>&1 &

Wait for the job to finish, then check the data under the raw folder

tree raw
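For paired-end data, a useful check after conversion is that the _1 and _2 files of a run contain the same number of reads. The sketch below assumes the usual four-lines-per-record FASTQ layout and uses hypothetical helper names; it is demonstrated on a toy pair rather than on the real files.

```shell
#!/bin/sh
# count_reads FILE: number of reads in a FASTQ file (4 lines per record); handles .gz
count_reads() {
  case $1 in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# same_read_count FQ1 FQ2: succeed only if both files hold the same number of reads
same_read_count() {
  [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}

# Demo: a toy pair with two reads in each file
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo/toy_1.fastq"
printf '@r1\nTGCA\n+\nIIII\n@r2\nCCAT\n+\nIIII\n' > "$demo/toy_2.fastq"
gzip "$demo/toy_2.fastq"                     # mix plain and gzipped on purpose
count_reads "$demo/toy_1.fastq"              # prints 2
if same_read_count "$demo/toy_1.fastq" "$demo/toy_2.fastq.gz"; then
  echo "pair OK"
fi
rm -rf "$demo"
```

A mismatch between mates usually points to an interrupted conversion or compression, in which case rerunning that run is safer than continuing.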


IV. Quality control and cleaning

1. Inspect the raw data quality

Open the multiqc_report.html QC summary page in the qc1 folder from the previous step. The main things to look at are sequencing quality and adapter content. Here the data quality is good: the mean quality score is above 30 and the adapter content is very low. For a detailed explanation of the report, see: 20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)

Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)

Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)
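The threshold of 30 has a concrete meaning. A Phred score Q encodes the estimated base-call error probability as P = 10^(-Q/10), so Q30 corresponds to one expected error in 1,000 bases, and in Phred+33 FASTQ files each quality character stores Q as its ASCII code minus 33. The sketch below works through both conversions; the function names are made up for illustration.

```shell
#!/bin/sh
# phred_to_prob Q: error probability for Phred score Q, P = 10^(-Q/10)
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# mean_qual STRING: mean Phred score of a Phred+33 quality string
mean_qual() {
  printf '%s' "$1" | od -An -tu1 |
    awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
         END { if (n) printf "%.1f\n", sum / n }'
}

phred_to_prob 20      # prints 0.01
phred_to_prob 30      # prints 0.001
mean_qual 'IIII'      # prints 40.0 ('I' is ASCII 73, and 73 - 33 = 40)
```

This is also why the cleaning step passes --phred33 to trim-galore: the tool needs to know which offset to subtract before comparing qualities against its cutoff.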

2. QC cleaning of the data

This step mainly uses trim-galore to remove low-quality bases and adapters. For detailed usage, see "lncRNA assembly workflow software introduction: trim-galore". Common parameters are as follows:

trim-galore Common parameters
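The options used in the script that follows are: -q, the Phred cutoff for trimming low-quality ends; --length, the minimum read length kept after trimming (with --paired, a pair is dropped if either read falls below it); --stringency, the minimum overlap with the adapter sequence required before trimming; --phred33, the quality encoding; --gzip, compressed output; -j, the number of cores; and -o, the output directory. The sketch below only prints the command for one pair so the settings can be reviewed before a long run. trim_cmd and the QUAL, MINLEN, STRINGENCY, and CORES variables are hypothetical names, and the file paths are placeholders.

```shell
#!/bin/sh
# trim_cmd FQ1 FQ2 OUTDIR: print (not run) the trim_galore command for one read pair.
# Defaults mirror the values used in this tutorial; override them via shell variables.
trim_cmd() {
  printf 'trim_galore -j %s -q %s --phred33 --length %s --stringency %s --paired --gzip -o %s %s %s\n' \
    "${CORES:-4}" "${QUAL:-25}" "${MINLEN:-35}" "${STRINGENCY:-3}" "$3" "$1" "$2"
}

trim_cmd raw/fq/SRR12207279_1.fastq.gz raw/fq/SRR12207279_2.fastq.gz clean

# Preview a stricter quality cutoff in a subshell so the override does not leak
( QUAL=30; trim_cmd raw/fq/SRR12207279_1.fastq.gz raw/fq/SRR12207279_2.fastq.gz clean )
```

Printing the command first makes it easy to compare settings side by side; the values 25, 35, and 3 are the tutorial's choices rather than universal recommendations.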

vim 2_cleanfq_qc2.sh  
##############################################
echo -e " \n \n \n 222# Clean ! trim_galore !!! \n \n \n"
date
mkdir -p ~/test/clean/
cd ~/test/clean/
pwd

##########single###########################################################################
#ls ~/test/raw/fq/*.f* | while read id 
#do 
#       trim_galore -q 25 -j 4 --phred33 --length 35 --stringency 3 \
#               --gzip  -o ~/test/clean/  $id 
#done
#
##########paired###########################################################################
#1) Store the paths of the _1 and _2 files in two separate lists, then merge them into two columns and save as config #########
  ls ~/test/raw/fq/*_1*  >1
  ls ~/test/raw/fq/*_2*  >2
  paste 1 2 >config
  cat config | while read id
  do
    arr=($id)
    fq1=${arr[0]}
    fq2=${arr[1]}
   trim_galore  -j 4  -q 25  --phred33 --length 35 --stringency 3 \
               --paired --gzip -o ~/test/clean/  $fq1 $fq2
  done
###########################################################################################                  


echo -e "\n \n \n 222# qc2: check the cleaning results !!! \n \n \n"
mkdir -p ~/test/clean/qc2
cd ~/test/clean/qc2
pwd
ls ~/test/clean/*f*.gz | xargs fastqc -t 12 -o   ~/test/clean/qc2
multiqc   ./

echo -e " \n 222# ALL  Work Done !!! \n "
date

Run the 2_cleanfq_qc2.sh script

nohup bash  2_cleanfq_qc2.sh >log_2 2>&1 &
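Pairing files with two ls listings and paste depends on both listings having the same length and order; if one _2 file is missing, every later row silently shifts. An alternative sketch derives the mate's name from the _1 file and checks that it exists. The function name mate_of and the _1.fastq / _2.fastq naming (as produced by fasterq-dump -3 in this tutorial) are assumptions.

```shell
#!/bin/sh
# mate_of FQ1: print the path of the _2 file that belongs to a given _1 file
mate_of() {
  dir=$(dirname "$1")
  base=$(basename "$1")
  case $base in
    *_1.fastq*)
      printf '%s/%s\n' "$dir" "$(printf '%s' "$base" | sed 's/_1\.fastq/_2.fastq/')" ;;
    *)
      echo "not a _1 file: $1" >&2
      return 1 ;;
  esac
}

# Demo on empty placeholder files; with real data the loop would run over ~/test/raw/fq/*_1.fastq*
demo=$(mktemp -d)
touch "$demo/SRR12207279_1.fastq.gz" "$demo/SRR12207279_2.fastq.gz" "$demo/SRR12207280_1.fastq.gz"
for fq1 in "$demo"/*_1.fastq*; do
  fq2=$(mate_of "$fq1")
  if [ -e "$fq2" ]; then
    echo "pair: $(basename "$fq1") $(basename "$fq2")"
  else
    echo "skip: $(basename "$fq1") has no mate"
  fi
done
rm -rf "$demo"
```

In the cleaning script, the body of the "pair" branch is where the trim_galore call with $fq1 and $fq2 would go, so an incomplete run is reported instead of being trimmed against the wrong mate.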

3. Check the data quality after cleaning

Open the multiqc_report.html QC summary page under ~/test/clean/qc2; the base quality has improved


At this point we have completed the download, format conversion, and QC cleaning of the raw RNA-seq data. The quality-controlled fastq files are stored in the clean folder, and these cleaned fastq files can then be used for the next steps of alignment and counting (hisat2 + featureCounts, or salmon) to finally obtain the counts file we want

Reference material

20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)

Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)

Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)

lncRNA assembly workflow software introduction: trim-galore

This hands-on tutorial is based on the following videos shared by Shengxin skill tree:

【Shengxin skill tree】Transcriptome sequencing data analysis (bilibili)

【Shengxin skill tree】GEO database mining (bilibili)
