RNA-seq introductory practice (I): upstream data download, format conversion, and quality-control cleaning
2022-06-28 00:09:00 【Shengxin skill tree】
Two successive recruitment calls ("Once I brought you 100,000 users, but now I wish you closure" and the Shengxin skill tree knowledge-organization intern recruitment) had the lucky result of introducing me to several excellent collaborators. They began analyzing a series of public datasets by following my NGS omics videos, and a few of them surprised me: with no need for communication or guidance, they quietly completed a full hands-on analysis!
His previous posts:
- Conversion among Counts, FPKM, RPKM, TPM, and CPM
- N ways to obtain the effective length of genes
Below are his detailed notes on our Bilibili transcriptome video course.
Overview of this section:
- 1. Find the GEO accession number in the article and get the SRR numbers of the data from NCBI
- 2. On Linux, use the prefetch command to download the SRA files by SRR number
- 3. Use the fasterq-dump/fastq-dump command to convert the SRA files to FASTQ format, then compress them with multithreaded pigz (optional)
- 4. Use fastqc and multiqc to check the quality of the sequencing data
- 5. Use trim-galore to remove low-quality bases and adapters
This continues from the previous section, RNA-seq introductory practice (0): preparation before the RNA-seq workflow, creating a new environment with Linux and R
I. Getting the SRR numbers of the data from NCBI
Article source of the data: Formative pluripotent stem cells show features of epiblast cells poised for gastrulation | Cell Research (nature.com). The GEO accession number is given under Data availability in the article: GSE154290
Go to the NCBI website, search for GSE154290, and open the matching result
Find the SRA Run Selector option under Supplementary file
Common Fields describes the basic information of the data; for example, PAIRED in the table means paired-end sequencing data. For this exercise, under Found 27 Items, tick two runs each of RNA_mESCs and RNA_EpiSCs, then choose the Selected option under Select, download the Accession List, and copy the SRR numbers of the data
II. Downloading the SRA data
1. Create and enter the test project folder, then paste the SRR numbers into the idname file
mkdir test ;cd test
cat > idname
SRR12207279
SRR12207280
SRR12207283
SRR12207284
^D
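The same accession list can also be written without an interactive cat session. The following is a minimal sketch, assuming the four SRR numbers above; the function names write_idname and check_idname are hypothetical helpers, and the demo writes to a temporary directory rather than to ~/test.

```shell
#!/usr/bin/env bash
# Write the four SRR numbers used in this tutorial to an accession list.
# write_idname and check_idname are hypothetical helper names.
write_idname() {
    local out="$1"
    cat > "$out" <<'EOF'
SRR12207279
SRR12207280
SRR12207283
SRR12207284
EOF
}

# Succeed only if every line is "SRR" followed by digits.
check_idname() {
    local file="$1"
    ! grep -Evq '^SRR[0-9]+$' "$file"
}

# Demo in a temporary directory so nothing under ~/test is overwritten.
demo_dir=$(mktemp -d)
write_idname "$demo_dir/idname"
if check_idname "$demo_dir/idname"; then
    echo "idname looks valid"
else
    echo "idname contains unexpected lines"
fi
```

A stray blank line or a pasted header would otherwise be passed to prefetch as if it were an accession, so a quick format check before the download is cheap insurance.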
2. Create the script file for downloading the SRA data
vim 00_prefetch.sh
It mainly uses the prefetch command from sra-tools to download the SRA data
# Script content ################################
echo -e "\n \n \n prefetch sra !!! \n \n \n "
date
mkdir -p ~/test/raw/sra/
cd ~/test/raw/sra/
pwd
# Read one SRR number per line and start each download in the background
cat ~/test/idname | while read -r id
do
    ( prefetch -O ./ "$id" & )
done
3. Run the script in the background with nohup, redirecting its output to the log_00 log file
nohup bash 00_prefetch.sh >log_00 2>&1 &
Check the running system tasks and the file structure under the test project
The task is running smoothly. Wait for the download to finish and go relax for a while ヽ( ̄▽ ̄)ノ When cat log_00 shows the words "downloaded successfully", the download is complete. Then check the downloaded data; once you have confirmed that it is complete, you can move on to the next step, file format conversion
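The completion check described above can be scripted. This is a rough sketch, assuming the log wording "downloaded successfully" quoted above and the folder layout that the next script relies on (one subfolder per SRR number containing a .sra file); count_success and missing_sra are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Count the "downloaded successfully" lines in a prefetch log.
count_success() {
    grep -c 'downloaded successfully' "$1"
}

# Print every accession from an ID file whose .sra file is missing or empty.
# Assumed layout: <sra_dir>/<ID>/<ID>.sra
missing_sra() {
    local idfile="$1" sra_dir="$2" id
    while read -r id; do
        [ -n "$id" ] || continue            # skip blank lines
        [ -s "$sra_dir/$id/$id.sra" ] || echo "$id"
    done < "$idfile"
}

# Usage with the tutorial's paths (runs only if the files exist).
if [ -f ~/test/log_00 ] && [ -f ~/test/idname ]; then
    echo "successful downloads reported: $(count_success ~/test/log_00)"
    missing_sra ~/test/idname ~/test/raw/sra
fi
```

If missing_sra prints nothing and the success count equals the number of lines in idname, the download step can be considered finished.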
III. Converting SRA files to FASTQ format
This step mainly uses the fasterq-dump command from sra-tools to convert the files to FASTQ, then compresses them into .gz files with multithreaded pigz to save space (optional), and finally runs fastqc and multiqc for quality control and a QC summary of the raw data
Common fasterq-dump/fastq-dump parameters
As before, first create the 01_sra2fq_qc1.sh script file
vim 01_sra2fq_qc1.sh
###########################################
# Move the .sra files out of their subfolders, then delete the subfolders
date
echo -e "\n \n \n 111# move files !!! \n \n \n "
cd ~/test/raw/sra/
cat ~/test/idname | while read id
do
    mv $id/* ./
    rm -rf $id/
done
date
echo -e "\n \n \n 111# sra>>>fq !!! \n \n \n "
mkdir -p ~/test/raw/fq/
cd ~/test/raw/fq/
pwd
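ls ~/test/raw/sra/*.sra |while read id
do
    echo " PROCESS $(basename $id) "
    # -3 (--split-3): write paired reads to _1/_2 files; -e: threads; -O: output directory
    fasterq-dump -3 -e 12 -O ./ $id
    # Compress the newly written FASTQ files with 12 threads
    pigz -p 12 ~/test/raw/fq/*q
done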
date
echo -e " \n \n \n 111# qc 1 !!! \n \n \n "
mkdir -p ~/test/raw/qc1/
cd ~/test/raw/qc1/
pwd
ls ~/test/raw/fq/* | xargs fastqc -t 12 -o ./
multiqc ./
echo -e " \n 111# ALL Work Done!!! \n "
date
Run the 01_sra2fq_qc1.sh script file
nohup bash 01_sra2fq_qc1.sh >log_01 2>&1 &
Wait for the task to finish, then check the data under the raw folder
tree raw
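Beyond looking at the file tree, it can help to confirm that both mates of a paired-end sample hold the same number of reads. The sketch below assumes the files are named <ID>_1.fastq.gz and <ID>_2.fastq.gz (the names expected after fasterq-dump and pigz) and that each read occupies four lines; count_reads and check_sample are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Count reads in a gzip-compressed FASTQ file (four lines per read).
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare the read counts of the two mates of one sample.
# $1 is the path prefix without the _1/_2 suffix (naming is an assumption).
check_sample() {
    local prefix="$1" n1 n2
    n1=$(count_reads "${prefix}_1.fastq.gz")
    n2=$(count_reads "${prefix}_2.fastq.gz")
    if [ "$n1" -eq "$n2" ]; then
        echo "$(basename "$prefix"): $n1 read pairs"
    else
        echo "$(basename "$prefix"): mate counts differ ($n1 vs $n2)"
        return 1
    fi
}

# Usage with the tutorial's paths (runs only if the file exists).
if [ -f ~/test/raw/fq/SRR12207279_1.fastq.gz ]; then
    check_sample ~/test/raw/fq/SRR12207279
fi
```

Decompressing a full-size run takes a while, so this is better run once per sample than inside a tight loop.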
IV. Quality-control cleaning
1. Viewing the raw data quality
Open the multiqc_report.html QC summary page under the qc1 folder from the previous step and focus mainly on sequencing quality and adapter content. The data quality turns out to be good: the mean quality score is above 30 and the adapter content is very low. For a detailed walkthrough of the report, see: 20160410 Sequencing analysis: using FastQC for quality control - Zhihu (zhihu.com)
Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)
Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)
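To put the "above 30" figure in context: a Phred quality score Q corresponds to a base-call error probability of 10^(-Q/10). The small sketch below does that arithmetic with awk; phred_to_error is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Convert a Phred quality score Q into the base-call error probability
# P = 10^(-Q/10), printed with six decimal places.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

for q in 20 25 30; do
    echo "Q$q -> $(phred_to_error "$q")"
done
```

A mean score of 30 therefore means roughly one wrong base call per thousand, which is why the raw data here is described as good.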
2. Cleaning the data
This step mainly uses trim-galore to remove low-quality bases and adapters. For detailed usage, see: Introduction to the software of the lncRNA assembly workflow: trim-galore. The common parameters are as follows:
Common trim-galore parameters
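Two of the options used in the script below, --phred33 and -q 25, both refer to how quality is stored in a FASTQ file: each quality character is the score plus 33, written as an ASCII character. The following sketch decodes a quality string by hand; phred33_score and decode_quality are hypothetical helper names and the example string is made up.

```shell
#!/usr/bin/env bash
# Decode one Phred+33 quality character into its numeric score.
# printf '%d' with a leading quote yields the character's ASCII code.
phred33_score() {
    local code
    code=$(printf '%d' "'$1")
    echo $(( code - 33 ))
}

# Print the score of every character in a quality string, one per line.
decode_quality() {
    local qual="$1" i
    for (( i = 0; i < ${#qual}; i++ )); do
        phred33_score "${qual:i:1}"
    done
}

# Made-up quality string for illustration.
decode_quality 'II5!'
```

With -q 25, bases at the read ends whose decoded score falls below 25 are candidates for trimming.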
vim 2_cleanfq_qc2.sh
##############################################
echo -e " \n \n \n 222# Clean ! trim_galore !!! \n \n \n"
date
mkdir -p ~/test/clean/
cd ~/test/clean/
pwd
##########single###########################################################################
#ls ~/test/raw/fq/*.f* | while read id
#do
# trim_galore -q 25 -j 4 --phred33 --length 35 --stringency 3 \
# --gzip -o ~/test/clean/ $id
#done
#
##########paired###########################################################################
#1) First store the paths and file names of the _1 and _2 files separately, then merge them into two columns and save them as config#########
ls ~/test/raw/fq/*_1* >1
ls ~/test/raw/fq/*_2* >2
paste 1 2 >config
cat config | while read id
do
arr=($id)
fq1=${arr[0]}
fq2=${arr[1]}
trim_galore -j 4 -q 25 --phred33 --length 35 --stringency 3 \
--paired --gzip -o ~/test/clean/ $fq1 $fq2
done
###########################################################################################
echo -e "\n \n \n 222# qc2: check the cleaned results !!! \n \n \n"
mkdir -p ~/test/clean/qc2
cd ~/test/clean/qc2
pwd
ls ~/test/clean/*f*.gz | xargs fastqc -t 12 -o ~/test/clean/qc2
multiqc ./
echo -e " \n 222# ALL Work Done !!! \n "
date
nohup bash 2_cleanfq_qc2.sh >log_2 2>&1 &
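The script builds its config file by pasting two separate ls listings side by side, which works as long as both listings have the same length and order. A possible alternative, sketched here under the assumption that the files are named <ID>_1.fastq.gz and <ID>_2.fastq.gz, derives each mate name from the _1 file; make_pair_config is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print "R1<TAB>R2" for every *_1.fastq.gz file in a directory, deriving the
# mate name from R1 so the two columns cannot drift out of step.
make_pair_config() {
    local fq_dir="$1" r1 r2
    for r1 in "$fq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        else
            echo "warning: no mate found for $r1" >&2
        fi
    done
}

# Usage with the tutorial's paths (runs only if the directory exists).
if [ -d ~/test/raw/fq ]; then
    make_pair_config ~/test/raw/fq
fi
```

Redirecting the output to config gives the same two-column layout that the while loop in the script reads, with a warning on stderr for any sample whose second file is missing.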
3. Checking the data quality after cleaning
Open the multiqc_report.html QC summary page under ~/test/clean/qc2; the base quality has improved
At this point we have finished downloading the RNA-seq raw data, converting its format, and cleaning it. The quality-controlled fastq files are stored under the clean folder and can be used for the next steps, alignment and quantification (hisat2 + featureCounts, or salmon), to finally obtain the counts file we want
Reference material
20160410 Sequencing analysis: using FastQC for quality control - Zhihu (zhihu.com)
Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)
Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)
Introduction to the software of the lncRNA assembly workflow: trim-galore
This practical tutorial is based on the following videos shared by Shengxin skill tree:
【Shengxin skill tree】Transcriptome sequencing data analysis (bilibili)
【Shengxin skill tree】GEO database mining (bilibili)