RNA-seq in practice for beginners (I): upstream data download, format conversion, and quality-control cleaning
2022-06-28 00:09:00 【Shengxin skill tree】
Two recruiting posts in a row ("Once I brought you 100000 users, but now I wish you closure" and the call for Shengxin skill tree knowledge-curation interns) were lucky enough to introduce me to several excellent people. Everyone started following my NGS omics videos to analyze a series of public datasets, and some of them surprised me: with no back-and-forth or guidance at all, they quietly finished a complete hands-on project.
His previous posts:
- Converting between counts, FPKM, RPKM, TPM, and CPM
- N ways to obtain the effective length of genes
Below are his detailed notes on our Bilibili transcriptome video course.
Overview of this section:
- 1. Find the GEO accession number in the article and get the SRR run numbers from NCBI
- 2. On Linux, use the prefetch command to download the SRA files by SRR number
- 3. Use fasterq-dump/fastq-dump to convert the SRA files to FASTQ format, with optional multithreaded compression by pigz
- 4. Use fastqc and multiqc to check the quality of the sequencing data
- 5. Use trim-galore to remove low-quality bases and adapters
This continues from the previous installment, RNA-seq in practice for beginners (0): preparation before the RNA-seq workflow, setting up new Linux and R environments
I. Getting the SRR run numbers from NCBI
Source article for the data: Formative pluripotent stem cells show features of epiblast cells poised for gastrulation | Cell Research (nature.com). The GEO accession number is given under Data availability in the article: GSE154290
Go to the NCBI website, search for GSE154290, and open the matching result.
Under Supplementary file, click SRA Run Selector.
Common Fields describes the basic information of the dataset; for example, PAIRED in the table means the data are paired-end. For this exercise, under Found 27 Items tick two runs each of RNA_mESCs and RNA_EpiSCs, then under Select choose the Selected option, download the Accession List, and copy the SRR numbers from it.
II. Downloading the SRA data
1. Create and enter the test project folder (in your home directory, since the scripts below refer to ~/test), then paste the SRR numbers into the idname file
mkdir test ;cd test
cat > idname
SRR12207279
SRR12207280
SRR12207283
SRR12207284
^C
2. Create the script file for downloading the SRA data
vim 00_prefetch.sh
The script mainly uses the prefetch command from sra-tools to download the SRA data.
# Script content ################################
echo -e "\n \n \n prefetch sra !!! \n \n \n "
date
mkdir -p ~/test/raw/sra/
cd ~/test/raw/sra/
pwd
cat ~/test/idname | while read id ; \
do
( prefetch -O ./ $id & )
done
3. Run the script in the background with nohup, writing its output to the log_00 log file
nohup bash 00_prefetch.sh >log_00 2>&1 &
Check the running tasks and the file structure under the test project
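The loop in 00_prefetch.sh puts every prefetch call in the background, so all downloads start at once. With four runs that is fine, but with a longer ID list it may be worth capping the number of simultaneous downloads. The sketch below is one possible way to do that with xargs -P (a GNU/BSD extension, not part of POSIX). The function name download_ids and the DOWNLOADER variable are made up for this illustration; the default command is the same prefetch -O ./ call used in the script.

```shell
#!/usr/bin/env bash
# Run one downloader process per accession, at most $jobs at a time.
# download_ids and DOWNLOADER are hypothetical names for this sketch;
# DOWNLOADER defaults to "prefetch -O ./" as in 00_prefetch.sh.
download_ids() {
    local idfile="$1" jobs="${2:-2}"
    # Skip blank lines, then hand one ID to each invocation (-n 1),
    # running up to $jobs invocations in parallel (-P).
    # DOWNLOADER is left unquoted on purpose so it splits into words.
    grep -v '^[[:space:]]*$' "$idfile" |
        xargs -n 1 -P "$jobs" ${DOWNLOADER:-prefetch -O ./}
}

# Usage, from inside ~/test/raw/sra/:
#   download_ids ~/test/idname 2
```

Unlike the backgrounded loop, this call returns only when every download has finished, so a script that continues after it does not have to guess whether the files are complete.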
The task is running smoothly. While the data downloads, go and relax for a while ヽ( ̄▽ ̄)ノ When cat log_00 shows the words "downloaded successfully", the download has finished. Check the downloaded files, and once you have confirmed that the download is complete, move on to the next step, file format conversion.
prefetch.log
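Besides reading the log, the downloaded files themselves can be checked against the ID list. The following is a minimal sketch, assuming the files end in .sra and sit either in a per-run subfolder (the layout that the next script moves files out of) or directly in the sra folder. The helper name missing_sra is hypothetical.

```shell
#!/usr/bin/env bash
# Print every accession from the ID list that has no non-empty .sra file.
# Both layouts are accepted: <dir>/<id>/<id>.sra and <dir>/<id>.sra.
# missing_sra is a hypothetical helper name for this sketch.
missing_sra() {
    local idfile="$1" sradir="$2" id missing=0
    while read -r id; do
        [ -z "$id" ] && continue                 # ignore blank lines
        if [ -s "$sradir/$id/$id.sra" ] || [ -s "$sradir/$id.sra" ]; then
            continue                             # file exists and is not empty
        fi
        echo "$id"
        missing=1
    done < "$idfile"
    return "$missing"                            # 0 only if nothing is missing
}

# Usage:
#   missing_sra ~/test/idname ~/test/raw/sra && echo "all runs downloaded"
```

A non-empty file is only a weak check; it shows that something was written for each run, not that the transfer was complete.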
III. Converting SRA files to FASTQ format
This step mainly uses the fasterq-dump command from sra-tools to convert the files to FASTQ, then pigz to compress them into .gz files with multiple threads to save space (this can be skipped), and finally fastqc and multiqc to run quality control on the raw data and summarize the results.
fasterq-dump/fastq-dump common parameters
As above, first create the 01_sra2fq_qc1.sh script file
vim 01_sra2fq_qc1.sh
###########################################
# Move the sra files out of their subfolders and delete the subfolders
date
echo -e "\n \n \n 111# move files !!! \n \n \n "
cd ~/test/raw/sra/
cat ~/test/idname | while read id
do
mv $id/* ./
rm -rf $id/
done
date
echo -e "\n \n \n 111# sra>>>fq !!! \n \n \n "
mkdir -p ~/test/raw/fq/
cd ~/test/raw/fq/
pwd
ls ~/test/raw/sra/*.sra |while read id
do
echo " PROCESS $(basename $id) "
fasterq-dump -3 -e 12 -O ./ $id
pigz -p 12 ~/test/raw/fq/*q
done
date
echo -e " \n \n \n 111# qc 1 !!! \n \n \n "
mkdir ~/test/raw/qc1/
cd ~/test/raw/qc1/
pwd
ls ~/test/raw/fq/* | xargs fastqc -t 12 -o ./
multiqc ./
echo -e " \n 111# ALL Work Done!!! \n "
date
Run the 01_sra2fq_qc1.sh script file
nohup bash 01_sra2fq_qc1.sh >log_01 2>&1 &
Wait for the task to finish, then check the data under the raw folder
tree raw
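Before looking at the QC reports, it can be useful to confirm that each converted file is structurally a FASTQ file: four lines per read, with each record starting with "@". The sketch below counts the reads in one file and fails if that structure is broken. The helper name fastq_reads is hypothetical, and the sketch assumes a gzip whose -cdf mode passes uncompressed input through unchanged (GNU gzip does), so it can be used before or after the pigz step.

```shell
#!/usr/bin/env bash
# Print the number of reads in a FASTQ file (plain or gzip-compressed).
# Fails if the line count is not a multiple of 4 or if the first line of
# any 4-line record does not start with "@".
# fastq_reads is a hypothetical helper name for this sketch.
fastq_reads() {
    local fq="$1" n
    n=$(gzip -cdf "$fq" | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        END { if (bad || NR % 4 != 0) exit 1; printf "%d\n", NR / 4 }') || {
        echo "malformed FASTQ: $fq" >&2
        return 1
    }
    echo "$n"
}

# Usage:
#   for f in ~/test/raw/fq/*.fastq.gz; do
#       echo "$f $(fastq_reads "$f")"
#   done
```

For paired-end runs the _1 and _2 files of the same run would be expected to report the same number of reads.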
IV. Quality control and cleaning
1. Viewing the raw data quality
Open the multiqc_report.html QC summary page in the qc1 folder from the previous step and focus on sequencing quality and adapter content. The data quality turns out to be good: the mean quality is above 30 and the adapter content is very low. For a detailed walkthrough of the report, see: 20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)
Xiao L's bioinformatics study diary 3: How to judge the quality of raw data? (Part 1) (qq.com)
Xiao L's bioinformatics study diary 4: How to judge the quality of raw data? (Part 2) - Zhihu (zhihu.com)
2. Cleaning the data
This step mainly uses trim-galore to remove low-quality bases and adapters. For detailed usage, see "Software introduction for the lncRNA assembly workflow: trim-galore". Common parameters are as follows:
trim-galore common parameters
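The paired-end branch of the script that follows builds its list of mate files by pasting two separate ls listings side by side, which silently misaligns the pairs if one mate file is missing. A possible alternative, sketched here, derives the _2 file name from each _1 file and stops when a mate is absent. The helper name list_pairs is hypothetical, and the _1.fastq.gz / _2.fastq.gz suffixes are assumed from the fasterq-dump plus pigz output of the previous section.

```shell
#!/usr/bin/env bash
# Print "R1<TAB>R2" for every *_1.fastq.gz file in a directory, deriving
# the mate name from R1 instead of pasting two independent ls listings.
# list_pairs is a hypothetical helper name; the suffixes are an assumption.
list_pairs() {
    local dir="$1" fq1 fq2
    for fq1 in "$dir"/*_1.fastq.gz; do
        [ -e "$fq1" ] || continue              # glob matched nothing
        fq2="${fq1%_1.fastq.gz}_2.fastq.gz"    # swap the _1 suffix for _2
        if [ ! -e "$fq2" ]; then
            echo "missing mate for $fq1" >&2
            return 1
        fi
        printf '%s\t%s\n' "$fq1" "$fq2"
    done
}

# Usage (same trim_galore options as in 2_cleanfq_qc2.sh):
#   list_pairs ~/test/raw/fq | while IFS=$'\t' read -r fq1 fq2; do
#       trim_galore -j 4 -q 25 --phred33 --length 35 --stringency 3 \
#           --paired --gzip -o ~/test/clean/ "$fq1" "$fq2"
#   done
```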
vim 2_cleanfq_qc2.sh
##############################################
echo -e " \n \n \n 222# Clean ! trim_galore !!! \n \n \n"
date
mkdir ~/test/clean/
cd ~/test/clean/
pwd
##########single###########################################################################
#ls ~/test/raw/fq/*.f* | while read id
#do
# trim_galore -q 25 -j 4 --phred33 --length 35 --stringency 3 \
# --gzip -o ~/test/clean/ $id
#done
#
##########paired###########################################################################
#1) First store the paths and file names of the _1 and _2 files separately, then merge them into two columns and save the result as config #########
ls ~/test/raw/fq/*_1* >1
ls ~/test/raw/fq/*_2* >2
paste 1 2 >config
cat config | while read id
do
arr=($id)
fq1=${arr[0]}
fq2=${arr[1]}
trim_galore -j 4 -q 25 --phred33 --length 35 --stringency 3 \
--paired --gzip -o ~/test/clean/ $fq1 $fq2
done
###########################################################################################
echo -e "\n \n \n 222# qc2: check the cleaning results !!! \n \n \n"
mkdir ~/test/clean/qc2
cd ~/test/clean/qc2
pwd
ls ~/test/clean/*f*.gz | xargs fastqc -t 12 -o ~/test/clean/qc2
multiqc ./
echo -e " \n 222# ALL Work Done !!! \n "
date
nohup bash 2_cleanfq_qc2.sh >log_2 2>&1 &
3. Check the data quality after cleaning
Open the multiqc_report.html QC summary page under ~/test/clean/qc2; the base quality has improved.
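Alongside the MultiQC report, a quick number that helps judge the cleaning step is how many reads survived trimming. The sketch below compares one raw file with its trimmed counterpart. The function names count_reads and retained_pct are hypothetical, and the file paths are passed in explicitly, so no output naming scheme is assumed by the code itself.

```shell
#!/usr/bin/env bash
# Report how many reads a trimmed FASTQ(.gz) file keeps relative to the raw one.
# count_reads and retained_pct are hypothetical helper names for this sketch.
count_reads() {
    # A FASTQ record is 4 lines; gzip -cdf also passes plain text through.
    gzip -cdf "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

retained_pct() {
    local raw clean
    raw=$(count_reads "$1")
    clean=$(count_reads "$2")
    awk -v r="$raw" -v c="$clean" 'BEGIN {
        if (r == 0) { print "raw=0 clean=" c " retained=NA"; exit }
        printf "raw=%d clean=%d retained=%.1f%%\n", r, c, 100 * c / r
    }'
}

# Usage (the second path assumes Trim Galore's usual paired-end output name):
#   retained_pct ~/test/raw/fq/SRR12207279_1.fastq.gz \
#                ~/test/clean/SRR12207279_1_val_1.fq.gz
```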
At this point we have completed the download, format conversion, and QC cleaning of the RNA-seq raw data. The quality-controlled fastq files are stored under the clean folder. These cleaned fastq files can then be used for the next steps, alignment and counting (hisat2 + featureCounts, or salmon), to finally obtain the counts file we want.
Reference material
20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)
Xiao L's bioinformatics study diary 3: How to judge the quality of raw data? (Part 1) (qq.com)
Xiao L's bioinformatics study diary 4: How to judge the quality of raw data? (Part 2) - Zhihu (zhihu.com)
Software introduction for the lncRNA assembly workflow: trim-galore
This hands-on tutorial is based on the following videos shared by Shengxin skill tree:
【Shengxin skill tree】Transcriptome sequencing data analysis_bilibili
【Shengxin skill tree】GEO database mining_bilibili