当前位置：网站首页>Analysis of orthofinder lineal homologous proteins and result processing

Analysis of orthofinder lineal homologous proteins and result processing

2022-06-27 08:24:00 【Neptuneyut】

install

(gtdbtk) [[email protected] Genome_integration]$ mamba create -n orthofinder -c bioconda orthofinder

Use

(orthofinder) [[email protected] 137MIMAG_High_quality_MAGs]$ ulimit -n 20000 #  Temporarily set the maximum file opening amount 
 nohup time orthofinder -f test -t 100 -a 100 -o test_orthofinder &>of.log&
# -t  yes blast all2all The thread of 
# -a  It's the thread that does the next step

test Is a directory containing protein sequences , One file per genome

Use a single copy OG Series connection

Select single copy genes , Carry out multiple sequence alignment respectively , Then the phylogenetic tree was constructed by serial comparison

(base) [[email protected] Results_Mar04]$ ls Single_Copy_Orthologue_Sequences
OG0000577.fa  OG0000583.fa  OG0000603.fa  OG0000613.fa  OG0000631.fa  OG0000643.fa  OG0000649.fa
OG0000580.fa  OG0000584.fa  OG0000609.fa  OG0000614.fa  OG0000639.fa  OG0000644.fa
OG0000581.fa  OG0000591.fa  OG0000610.fa  OG0000615.fa  OG0000641.fa  OG0000646.fa
OG0000582.fa  OG0000593.fa  OG0000611.fa  OG0000625.fa  OG0000642.fa  OG0000647.fa

#  Single OG mafft Comparison and trimal
$ for i in *fa;do  mafft --ep 0 --genafpair --maxiterate 1000 --thread 30 $i > ../single_copy_OG_msa/${i}_mafft.msa && trimal  -in  ../single_copy_OG_msa/${i}_mafft.msa -out ../single_copy_OG_msa/${i}_mafft_trimal.msa -gt 0.95 -cons 50 &>../single_copy_OG_msa/${i}_trimal.log  ;done &

#  Modify name , Easy to connect in series 
# Single_Copy_Orthologue_Sequences Under the OG The sequence order is the same , That is, the order of species is the same , So you can choose a sequence of id As a standard , Other sequences are uniformly named this , Then you can go to AMAS Connected together ( It's not done )
#  Directly from prokka Protein sequence search ,137.id  Order matters 
(base) [[email protected] single_copy_OG_msa]$  cat 137.id | while read id; do bioawk -c fastx -v id=$id '$name == id{n=split(FILENAME,a,"/");print id,a[n];exit}' /share/pasteur/yutao/Methanosarcinales_order_phylogene/Genome_integration/137MIMAG_High_quality_MAGs/137MAGs_faas/*faa;done|sed "s/_genomic.faa//g" > 137OG_ID_2_Genome.tsv &
#  The second column here is all OG New sequence name for

Will protein id Change to species name

gtdbtk) [[email protected] single_copy_OG_msa]$ cat rename_by_mapping_prtoid_with_taxid.sh
# (gtdbtk) [[email protected] single_copy_OG_msa]$ head 137OG_ID_2_Genome.tsv
#NADBEILJ_00247 GB_GCA_000306725.1
#CJPDHIMO_00953 GB_GCA_001602645.1
#CABJFEME_02449 GB_GCA_001714685.2
#DMFLKMLA_01936 GB_GCA_001896725.1

#(gtdbtk) [[email protected] single_copy_OG_msa]$ head OG0000609.fa_mafft_trimal.msa
#>NADBEILJ_01445
#GDSVIFVVKTTANQERSVANMITQVARKDHLDIRAIIAPDELKGYVLIEAPEPGVVEQAI
#QTVPHARTVVKGRSSIEEISHFLTPKPTVTGITEGAIIEITSGPFKGERARVKRVDEGHE
#EITVELFDAVVPIPITIRGDTVRVLRKEEE
#>CJPDHIMO_02580
#AETSIFVVKTTANQERSVANLIAQLARKEKYDITALLVPDVLKGYVLVEAKAPEIVEEAI
#QGVPHARTVIKGASSFGEVEHFLTPKPAVVGINEGAIVELISGPFKGEMARVKRVDIGKE

for msa in OG0000*_mafft_trimal.msa
do
        line_num=1
        cat 137OG_ID_2_Genome.tsv       |while read i genome
                do
                        bioawk -c fastx -v genome=$genome -v line_num=$line_num \
                                'FNR==line_num{print ">"genome"\n"$seq}' $msa
                        line_num=$(( $line_num + 1 ))
        done >rename_trimal_msa/$msa
done

(base) [[email protected] single_copy_OG_msa]$ AMAS.py concat -i rename_trimal_msa/*msa -f fasta -d aa  -t 137MAGs_25OGs.msa -p 137MAGs_25OGs.msa.stat

#  You can trim it again 
(base) [[email protected] single_copy_OG_msa]$ time  trimal  -in  137MAGs_25OGs.msa -out 137MAGs_25OGs.msa_trimal.msa -gt 0.95 -cons 50 &>trimal.log
base) [[email protected] single_copy_OG_msa]$ ll 137MAGs_25OGs.msa_trimal.msa
-rw-r--r-- 1 yutao husn 683K Mar  8 19:45 137MAGs_25OGs.msa_trimal.msa

#  paper mulberry 
(base) [[email protected] single_copy_OG_msa]$ nohup time raxmlHPC-PTHREADS -T 50 -x 888 -p 888  -f a -n boot -m PROTGAMMAAUTO -c 4 -e 0.001 -# 1000 -s 137MAGs_25OGs.msa_trimal.msa &>137MAGs_25OGs.msa_trimal.msa_raxml_bootstrap1000.log&

The order that comes with you
Species tree: Using 329 orthogroups with minimum of 91.2% of species having single-copy genes in any orthogroup

(orthofinder) [[email protected] 137MAGs_orthofinder]$ nohup time orthofinder -M msa -fg  Results_Mar04/  &>msa_fg_option.log&

Output results

Orthogroups.tsv

The line is each OG, The column is in a genome that OG The name of the protein , Multiple are separated by commas

(metawrap) [[email protected] Results_Mar04]$ awk -F"\t" '{print $1,$2}' Orthogroups/Orthogroups.tsv|less  -S

Orthogroup GB_GCA_000306725.1_genomic
OG0000000 NADBEILJ_00029, NADBEILJ_00050
OG0000009 NADBEILJ_02333

Get a single copy OG The tandem sequence of

ref

Use the option, "-M msa -fg PREVIOUS_RESULTS_DIRECTORY". It will start from your already inferred orthogroups from the previous run but it will get OrthoFinder to select orthogroups that are single-copy in most species and then for each of these orthogroups use only the genes from those species that are single-copy to construct a concatenated MSA. This is what most studies do to infer a phylogeny, but OrthoFinder will do it automatically. It will use this to infer a species tree:

Species_Tree/SpeciesTree_rooted.txt
It will also produce two files to tell you the data used:

Species_Tree/Orthogroups_for_concatenated_alignment.txt
MultipleSequenceAlignments/SpeciesTreeAlignment.fa

You can get a fasta file for each orthogroup by running:
orthofinder -fg RESULTS_DIR -M msa -os
where RESULTS_DIR is the directory containing your orthogroups results files (e.g. Orthogroups.csv). I'll release an update soon so that you get these files from a default OrthoFinder run.

In RESULTS_DIR there is a file, SingleCopyOrthogroups.txt, which tells you which of these have all species present with exactly one gene per species. I think that answers your final question.

If you did want to find all the orthogroups with all species present (but no restriction on copy-number) then you can open the Orthogroups.GeneCount.csv file in excel and use that to identify them. E.g. for the example dataset I used the excel commands =COUNTIF(B3,">0"), =SUM(G2:J2) and =K2=4 as shown here: Orthogroups.GeneCount.xlsx

common problem

System file restriction problem

(orthofinder) [[email protected] 137MIMAG_High_quality_MAGs]$  ulimit -n 20000
#  If you are prompted for permission , Just exit and log in

Insert picture description here

brief introduction

Insert picture description here
OrthoFinder: phylogenetic orthology inference for comparative genomics
Genome Biology, 2019

Abstract

Here it is , We proposed OrthoFinder A major advance in methodology . It expands OrthoFinder High precision direct homologous cluster inference , It provides phylogenetic inference of lineal homologous clusters 、 There is a gene tree 、 Gene replication events 、 Rooted tree species and comparative genomics statistics . Each output is benchmarked on an appropriate real or simulated data set , If there are comparable methods ,OrthoFinder Equal to or better than these methods . Besides ,OrthoFinder yes Quest for Orthologs The most accurate direct homologous gene inference method in benchmarking . Last ,OrthoFinder The speed and scalability of integrated phylogenetic analysis with the fastest 、 Score based heuristics are quite .OrthoFinder Can be found in https://github.com/davidemms/OrthoFinder.

Background introduction

Determining the phylogenetic relationship between gene sequences is the basis of comparative biology . It provides a framework for understanding the evolution and diversity of life on earth , And the biological knowledge between organisms can be inferred . Considering the central importance of this process to many areas of biological research , A number of different software tools have been developed , Try to determine these relationships after giving the set of gene sequences provided by the user [1,2,3]. Most of these software tools try to use heuristic analysis from all to all BLAST[4] The pairwise sequence similarity score obtained in the search （ Or expectations ）, or BLAST An accelerated alternative to , Such as DIAMOND[5] or MMseqs2[6], Infer phylogenetic relationships between gene sequences . Widely used methods include InParanoid[7]、OrthoMCL[8]、OMA[9] and OrthoFinder[10], They all adopt different methods to query the sequence similarity score , And they all produce different output results – Some identify orthologous clusters , Some identify direct and collateral systems , Some have both . Because they each adopt different methods to analyze the sequence similarity score , Therefore, each method shows different performance characteristics on the commonly used benchmark database [1, 11].

Heuristic analysis of sequence similarity scores has always been used to estimate phylogenetic relationships between genes , Because it is easy to calculate . The core premise of using these methods is , Sequence pairs with higher scores may have diverged more recently than sequence pairs with lower scores . therefore , Heuristic analysis of pairwise sequence similarity scores can be used to estimate phylogenetic relationships between genomes [7,8,9, 12,13]. However , The estimation of the phylogenetic relationship between genes based on scores is disturbed by many factors . for example , The variable sequence evolution rate between genes often leads to false positive and false negative errors [14,15]（ chart 1）. This error can be mitigated by analyzing the phylogenetic tree of genes [17], Because phylogenetic trees can change the rate of sequence evolution （ Branch length ） The order of differentiation from the sequence （ Tree topology ） Distinguish , So as to clarify the direct and collateral relationship （ chart 1）.
Insert picture description here

Proper body inference based on pairing similarity score may be misled by variable sequence evolution rate . a A phylogenetic tree of a typical gene family .b species A- gene 2 Mistaken for a species B and C The lineal relationship of the gene . left ： The branch proportion of the gene tree and the true lineal relationship , It can be determined from the gene tree . Right hand side . The best hit rate of Swap Based on the gene similarity score with monotonous branch length （RBH）, And the orthogonal relationships inferred from these scores using standard Heuristics （ Use RBH The inferred orthogonal relation is the closest to the common orthogonal relation determined from the hit rate within the species RBH Better [8,16]）.FP, False positive ;FN, false negative

Insert picture description here

OrthoFinder workflow . The method used in each step is indicated by an arrow . Published algorithms are shown in italics , There is an asterisk behind it . The blue dashed line connected to the solid arrow indicates additional data , Used for the conversion indicated by the filled arrow .MSA, Tree inference based on multi sequence arrangement ;DLC, Copy - The loss of - condensation .(a) Use the original OrthoFinder Algorithm for orthogonal group inference （ An orthogonal group is a collection of descendants of a gene from the last common ancestor of all species under consideration ）.(b) Gene tree inference . Species tree inference .(d) Species tree inference (e) Gene tree inference (f) Mix and overlap the rooted gene trees +DLC analysis , To infer orthonormal and gene duplication events .(g) Description of the orthogonal result table of the genes of each input species （ Four main frames ）. The horizontal division shows the orthogonality of each species pair .(h) Description of gene replication event table , Shows the location of gene replication events mapped to the species tree , In the gene tree , Percentage of duplicate genes retained in the sampled species , And genes that are descendants of gene replication events .(i) Comparative genomics statistics

Insert picture description here

Of a group of chordate species OrthoFinder Analysis summary .Ciona intestinalis, Danio rerio, Oryzias latipes, Xenopus tropicalis, Gallus gallus, Monodelphis domestica, Mus musculus, Rattus norvegicus, Pan troglodytes, and Homo sapiens. Bar and heat charts contain data for each species , And a Align the corresponding species in the tree . a from STAG Inferred species of trees , from STRIDE take root .e A heat map containing the number of orthogonal groups per species pair （ The upper right ） And orthogonal groups between each species （ The lower left ）. f Two species , namely C. intestinalis and H. sapiens, Orthogonal group multiples relative to all other species .g Number of gene replication events on each terminal branch of the species tree .h The number of species that replicate and remain on each branch of the species tree in all offspring .OG, Positive group ;sp., species ;spp., species （ The plural ）;dups., Gene replication events

Many tree based lineages have been developed ton Online database , Include PhylomeDB[18]、Ensembl-Compara[19]、EggNOG[20] and TreeFam[21]. These highly used resources provide users with the ability to use phylogenetic trees to explore the history of gene evolution , It provides a more complete picture than the simple pairwise orthogonal relation and collateral relation . Use standard benchmark methods to compare and analyze these methods , It is found that there is no significant difference in the accuracy of orthogonal detection between these online databases and score based software tools [1], It indicates that the advantages of the phylogenetic approach have not been fully realized . Besides , The pipelines and methods behind these online databases are generally not provided to users to run their own analysis . therefore , Need an automated software tool , Effective use of phylogenetic methods to improve accuracy , But it also has the usability of the score based heuristic method 、 Speed and scalability .

Although automatic software tools for phylogenetic orthogonal inference from gene sequences are an important goal , However, the implementation of this method has several technical challenges . These challenges include the following .(1) Within the time scale that competes with score based heuristics , Infer all the gene trees of a complete set of species ;(2) Automatically rooting these gene trees , So that it can be correctly explained [22], Users do not need to know the rooting trees in advance ;(3) Interpreting gene trees to identify gene replication events 、 Orthonormal and collateral , At the same time, gene replication 、 The loss of 、 Incomplete pedigree sequencing and inaccurate gene tree and other processes remain robust . If these challenges can be addressed in a resource and time efficient manner , Then such a phylogenetic approach will provide a step change for orthogonal inference , It can be changed from the estimation of phylogenetic relationship based on similarity score to the division of phylogenetic relationship among genes .

Some of the above challenges have been solved in isolation by a series of bioinformatics methods . for example , There are a series of methods to identify gene orthogonal groups from gene sequences provided by users [8,9,10,12,23], And various gene tree inference methods that can infer trees from these orthogonal groups [24,25,26,27,28]. Again , There are a number of ways to infer lineal relatives from the gene tree , These methods also differ in scalability and accuracy [29,30,31,32]. However , Other key challenges do not have existing solutions . for example , It will be a complex task to infer a complete set of rooted gene trees from the proteome of a group of species 、 A multi-step process , Generally, you need to know the species and trees in advance . Again , There is no way to infer orthogonal objects from the gene tree , These methods are robust to incomplete pedigree sequencing and gene tree inference errors , At the same time, it can also be extended to large-scale analysis required for genome-wide orthogonal inference of hundreds of species . therefore , A number of technical challenges need to be addressed , In order to realize the full automation of the phylogenetic relationship between genes 、 Accurate and efficient phylogenetic delineation .

Here it is , We are right. OrthoFinder A major update , Solved these challenges , And greatly expand the scope of the original method . Updated OrthoFinder And the original implementation method [10] equally , The orthogonal group is determined , But then we use these orthogonal groups to infer the gene tree of all orthogonal groups , And these gene trees were analyzed to determine the rooted species tree . And then , This method identifies all gene replication events in the complete gene tree , And analyze this information in the context of species tree , To provide gene tree and species tree level analysis of gene replication events . Last , This method analyzes all these phylogenetic information , To determine the complete orthogonal set of all species , And provide a wide range of comparative genomics statistics . complete OrthoFinder The method of phylogenetic orthogonal inference is accurate 、 fast 、 Extensible and customizable , It can be done with just one command , Just use the protein sequence as input .

Detailed output results

Example of detailed output results

A preliminary understanding of the results

By default ,OrthoFinder In the input proteome directory, a new one named "OrthoFinder " Results directory for , And put the results here . My results directory looks like this .
Insert picture description here

The quality control ：Percentage of genes in orthogroups

The first thing I like to check is how many genes are assigned to orthogroups. stay OrthoFinder In the output of ,Comparative_Genomics_Statistics/Statistics_Overall.tsv, It prints this text .

OrthoFinder assigned 121743 genes (92.9% of total) to 17981 orthogroups.

This is very good , Generally speaking , See you have at least 80% Of genes are assigned to orthogroups It's good . Less than that means you may have missed some of the remaining genes that actually exist orthogroups Relationship , Poor sampling of species is the most likely cause . Let's also check the percentage of each species . There is one named Comparative_Genomics_Statistics/Statistics_PerSpecies.tsv Label delimited file . And OrthoFinder Other ".tsv " file , This document is best placed in Excel or LibreOffice Calc Wait for the spreadsheet program to view . These files may be automatically processed correctly on your computer , Or you may need to tell it explicitly that they are tab delimited . Here's how you use LibreOffice or Excel To deal with .
Insert picture description here
Once you open this file , You will see that vertebrates have more than 90% Of genes are assigned to OG, The fruit fly has about 76% Of genes are assigned to OG. This may be due to species sampling . All four vertebrate species are relatively closely related , The species sampling of fruit fly and Caenorhabditis elegans is very poor . To improve the situation , We need to include some species , To break the long branches of the species tree that separate these species from all other species .

Plant trees

Now let's look at the species tree .Dendroscope Is a tree viewer that can be downloaded and run locally , If you want to view more than a few trees , It's the best choice . in addition , There are also options to run from your web browser , for example ETE Tree viewer for toolkit . Use one of them , open Species_Tree/SpeciesTree_rooted.txt file . Because this file has boot values ,Dendroscope You need to choose " Interpreted as edge labels " To view them correctly . The tree looks like this .
Insert picture description here

This tree is made of OrthoFinder Use STAG The algorithm infers , And use STRIDE The algorithm is rooted , Therefore, it can be interpreted at any time （ Usually , You have to root a tree first ）. You can see here that the branches of fruit flies are longer than other species , As mentioned above . If you know what a tree should look like , You should check whether the tree is in line with your expectations .OrthoFinder The tree inferred here is correct .

If the species tree is not correct , Then this will not affect OG To infer , But it may affect some gene trees with gene replication events OG infer . under these circumstances , You may want to run the tree again with the corrected species OrthoFinder The last point of the analysis （-ft and -s Options ）. This is usually very fast , Because all the calculation costs （ Inference of orthogonal groups and gene trees, etc ） It's all done . See OrthoFinder Best practices .
Insert picture description here
137MAGs Orthofinder Inferred species of trees

By the way . You will notice that in this tree , The support rate values are not all 100%, And you might expect them to be 100%. Under the default option , Tree species are inferred by STAG On going , It uses the proportion of species trees that support each two regions from the single focus gene tree as a measure of their support . This is a more rigorous measure of bootstrap support than the standard bootstrap support from multiple sequence permutations . If used "-M msa " Options , Then, when inferring the species tree, the tandem multi sequence arrangement will be used , The support rate of all dual zones will reach 100%. under these circumstances , The support values correspond to bootstrap replications extracted from the complete multigene permutation , It's totally different . This is the most common measure of support , For the same data , Higher support values are always reported .

Orthologues

function OrthoFinder One of the most common reasons is to look for genes that interest you orthologue. Let's take a look at the genes of fruit flies FBgn0005648 Of orthologue, It is involved in nuclear fission / Cleavage and polyadenylation steps of polyadenylation reaction （ see FlyBase）. This is an interesting gene , Because there were two gene duplication events in the lineage leading to human beings , We will see this . Let's take a look at its homologue in humans .

stay Orthologues Directory , Each species has a subdirectory . Open... In a spreadsheet program Orthologues/Orthologues_Drosophila_melanogaster/Drosophila_melanogaster__v__Homo_sapiens.tsv（ Specify tab separator if necessary ）. The document has three columns ,“Orthogroup”、 “Drosophila_melanogaster” and “Homo_sapiens”. Found in table “FBgn0005648”, You will see that the gene belongs to orthogroup OG0001189, It has three in the human race orthologues ENSG00000205022,ENSG00000100836,ENSG00000258643.

Gene trees

We have found FBgn0005648 There are three homologues in humans . Next we will look at the gene tree , See if we agree with that , And see how these three orthogonal things come into being . use Dendroscope Or a web-based viewer opens Gene_Trees/OG0001189_tree.txt.
Insert picture description here
If you are used to inferring and observing the gene tree , You'll notice OrthoFinder The tree root has been automatically created for you – The roots are in the Drosophila gene FBgn0005648 Upper . This makes it very convenient to quickly check the gene tree , For more complex 、 Gene trees that are more difficult to explain are particularly useful .
Besides , The default gene tree has no support value . After all ,OrthoFinder Have already put ~121,000 Genes were assigned to orthogonal groups , And around 15 Within minutes, nearly... Were deduced for these genes 18,000 A gene tree ! We will discuss how to obtain the support rate value in a later tutorial .

Look at the gene tree , We can see two gene replication events , One is common to vertebrates , The other is shared by humans and mice . This leads to a one to three lineal homology , That is, all three human genes are equally closely related to one Drosophila gene . Usually , Lineal homology is not one-to-one , It's important to understand that – You don't want to spend months doing " Lineal homology " The experiment of , But it was later discovered that there were actually three direct homologous genes ！

We can have a look FlyBase Go to the gene page ：http://flybase.org/reports/FBgn0005648.html. If you go down to " Immediate family " part , And then look “ Human lineal relatives ”, You will find that the only way to identify all three immediate family members is the tree based method Compara、 eggNOG、OrthoFinder and TreeFam.OrthoFinder Is the only way you can run on your own data . Score based approach , Such as Hieranoid、Inparanoid、OMA and OrthoMCL Only one was identified , Or do not recognize these orthogonal objects . Gene trees are particularly important for identifying and resolving these complex relationships .

Gene replication events

Having a gene tree means OrthoFinder Can identify all gene replication events that occur . Let's look at the events that occurred in the common ancestor of vertebrates .OrthoFinder stay Species_Tree/SpeciesTree_rooted_node_labels.txt The node of the species tree is marked in the document , use Dendropscope Open this file , Tell it “ Interpreted as node labels ”. We will look at the nodes “N1”, The common ancestor of vertebrates （ namely D. rerio、T. rubripes、X. tropicalis、H. sapiens and M. musculus）. There are two documents that provide details about gene replication events . We first in Dendroscope Open in Gene_Duplication_Events/SpeciesTree_Gene_Duplications_0.5_Support.txt.
Insert picture description here
This gives a summary of a gene replication event . Each node displays the node name , Followed by an underscore , Then there is the number of well supported gene replication events mapped to each node in the species tree . If at least 50% The progeny species retained two copies of the replication gene , Then the gene replication event is considered to be " Good support " Of . For the common ancestor of quadrupeds N1, Yes 2458 This is a well supported gene replication event . We can do it in Gene_Duplication_Events/Duplications.tsv A list of these gene replication events can be seen in the document . Here are a few lines in the file , Sort by the tree node of the species .
Insert picture description here
Each gene replication event is associated with the species tree node 、 What happened orthogroup/ The gene tree and the nodes in the gene tree are cross referenced . It also lists the offspring genes of each of the two copies produced by the gene replication event . We can do it for our FBgn0005648orthogroup Check this out . So , We need to check Resolved_Gene_Trees/OG0001189_tree.txt Gene tree in . This directory contains the gene tree marked with nodes . These files show OrthoFinder Interpretation of the gene tree in inferring lineal relationships and gene replication events . They may be slightly different from the original gene tree directly from the tree inference step （ Can be found in Gene_Trees/）. To get the solved gene tree ,OrthoFinder Made copies - The loss of - Cohesion analysis , To determine a more concise explanation of the tree .
Insert picture description here
From the watch , At the node n1 There is a gene duplication event , Both copies are present in all progeny species 100% Retain . Look at this tree , The second gene replication event is in n10 Node , If we go back to the table , We can see this event and Danio rerio The terminal gene replication events are also listed .
Insert picture description here

If you are interested in gene duplication , So this table contains a lot of data . Of these six species ,OrthoFinder Found out 34,065 A gene duplication event , All these events are cross referenced with the nodes on the species tree and gene tree where they occur stay Duplications_per_Orthogroup.tsv and Duplications_per_Species_Tree_Node.tsv In file , These events are also summarized by orthogonal groups and species tree nodes , Both of these files are in Comparative_Genomics_Statistics/ Under the table of contents .

Orthogroups

We are usually interested in colony species , That is, to compare between a species branch rather than between a pair of species . For multiple species orthogly The summary of is orthogroup. It's like orthologues Is the offspring of a gene in the last common ancestor of a pair of species ,orthogroup A collection of offspring of a gene in a group of species .OrthoFinder Every gene tree , For example, the one above , Are aimed at one orthogroup Of . If we want to orthogroup The gene tree includes all pairs orthogroup, that orthogroup The gene tree is the tree we need to see . Even if one orthogroup Some of these genes may be collateral to each other , If we try to take out any genes , Then we will put orthogroup Take it out .

therefore , If we want to talk about... In a group of species " Equivalent " Compare genes , We need to check one orthogroup Compare the genes in .orthogroup In the file Orthogroups/Orthogroups.tsv in . This table has one in each row orthogroup, Each column has one spcies, From the biggest orthogroup To the smallest orthogroup Sort . There is also a traditional OrthoMCL File format .Orthogroups/Orthogroups.txt.

orthogroup sequence

For each orthogroup, stay Orthogroup_Sequences/ One of them FASTA file , Contains the orthogroup The gene sequence of .
every last OG The contained protein sequences are recorded in Orthogroups/Orthogroups.txt In file , Each corresponding OG fasta The sequence is Orthogroup_Sequences/OGxxxxxx.fa

(base) [[email protected] Results_Mar04]$ grep OG0011840 Orthogroups/Orthogroups.txt
OG0011840: PJADIDJD_01233 PJADIDJD_01234
(base) [[email protected] Results_Mar04]$ cat  Orthogroup_Sequences/OG0011840.fa
>PJADIDJD_01233
MFGFFGGFRDFDRFKCGCGGWGGGCGGWGGRWNCDDWGGKWRNKWRCKRRDWC
>PJADIDJD_01234
MFGFFGGFRDFDRFKCGCGGWGGGCGGGWGGRWDFDCGGNWRNKWRCKRRDWC

OG Function notes

Every OG There are usually multiple proteins from multiple species , There are two ways to correct OG Make functional comments ： The first simple way is to choose OG The longest one in the list represents , As the final annotation result ; The other is right OG Each of them is annotated , Then select the most as the annotation result . The annotation for each protein here can be used eggnog Result .

1. Choose each OG The longest protein as a note

Protein length file press

(base) [[email protected] Results_Mar04]$ bioawk -c fastx '{print $name"\t"length($seq)}' ../../137MAGs_faas/*faa >137MAGs_protein_faa_length.tsv &

eggnog Annotate Document

(base) [[email protected] Results_Mar04]$ cat ../../137MAGs_eggnogs/annotations/*emapper.annotations >137MAGs_protein_eggnog_annotations.tsv

Eggnog_Function_annotation_for_each_OrthoGroup_by_Max_Length_Protein.py

# select protein of max length OG member function anotation of each OrthoGroup

import sys,os
from collections import Counter
def helpDoc():
    print('Usage: python3 $0 <Orthogroups.txt> <Each protein length> <Merged eggnog table> <out>')     
    print('\tOrthogroups.txt:')
    print('\tOG0011839: AJKKFJJO_01576 AJKKFJJO_01819')
    print('\tOG0011840: PJADIDJD_01233')
    print('\tEach protein length: protein_id\tlength')
    print('\tMerged eggnog table: Merged eggnog annotation table')


if len(sys.argv) != 5:
    helpDoc()
    quit('error input!')


f_og  = sys.argv[1]
f_pro_len = sys.argv[2]
f_ant = sys.argv[3]
f_out = sys.argv[4]


#  Read in protein id And length tables , Stored in the dictionary 
Prot_len_dict = {
    }
with open(f_pro_len,'r') as f:
    for line in f:
        line = line.strip()
        protid, protlen = line.split('\t')
        if protid not in  Prot_len_dict:
            Prot_len_dict[protid] = protlen

#print(Annt_dict)
# 2. Read in OG The file of , Include each OG id And the protein to which it belongs 
##  take OG id As key, The protein in the back id Save as value as dictionary 
OG_dict = {
    }
with open(f_og, 'r') as f:
    for line in f:
        line = line.strip()
        (ogid, prots) = (line.split(":")[0], line.split(":")[1].split())
        OG_dict[ogid] = prots

#print(OG_dict)

#  Get every OG The length of the corresponding protein , Take out the longest and save it 

#  Go through each one OG, Take out each OG Members of , Take the maximum length as OG The representative of the 
OG_rep_prot = {
    } #  Store every OG Representative proteins id

for ogid in OG_dict:
    prots = OG_dict[ogid] #  Take out OG Every proid
    prot_num = len(prots)
    if prot_num == 1: #  If there is only one, directly specify 
        OG_rep_prot[ogid] = ''.join(prots) #  The list becomes a string 
    else:
        #  Through each protid Get the corresponding annotation information 
        prot_lens = {
    } #  Will be the same OG Save the length of all proteins in a dictionary 
        for protid in prots:
            if protid in Prot_len_dict:
                prot_len = Prot_len_dict[protid]
                prot_lens[protid] = prot_len
            else:
                print('%s not in protein length table!')
        sorted_x = sorted(prot_lens.items(), key=lambda x: x[1], reverse=True)
        rep_prot = sorted_x[0][0] #  The first is the longest 
        OG_rep_prot[ogid] = rep_prot
#print(OG_rep_prot)

#  Put all the protein eggnog Save notes as a dictionary 
Annt_dict = {
    }
with open(f_ant, 'r') as f:
    for line in f:  #  Formally for each query The annotation result of 
        line = line.strip()  #  Remove the blank symbols at both ends 
        if not line.startswith('#'):
            (query,seed_ortholog,evalue,score,eggNOG_OGs,max_annot_lvl,
            COG_category,Description,Preferred_name,GOs,EC,KEGG_ko,KEGG_Pathway,
            KEGG_Module,KEGG_Reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG_Reaction,PFAMs) = line.split('\t')  #  Tabs separate into different columns 
            if query not in Annt_dict:
                Annt_dict[query] = line


#  To every one. OG The longest protein found notes 
if os.path.exists(f_out):
    os.remove(f_out)

with open(f_out, 'a') as f:
    header = 'OG\trep_query\tseed_ortholog\tevalue\tscore\teggNOG_OGs\tmax_annot_lvl\tCOG_category\tDescription\tPreferred_name\tGOs\tEC\tKEGG_ko\tKEGG_Pathway\tKEGG_Module\tKEGG_Reaction\tKEGG_rclass\tBRITE\tKEGG_TC\tCAZy\tBiGG_Reaction\tPFAMs\n'
    f.write(header)
    for ogid in OG_rep_prot:
        protid = OG_rep_prot[ogid]
        if protid in Annt_dict:
            annt = Annt_dict[protid]
        else:
            annt = 'eggnog_NA'
        #  Format output  
        line = ('%s\t%s\n' % (ogid, annt))
        f.write(line)