Graham Reference Dataset Repository
Since May 2021 we have been testing a Network File System (NFS) data mount that provides our users with commonly used datasets in bioinformatics and AI. This data mount is provided in an effort to better serve our users and to reduce the space these commonly used datasets would otherwise take up in their project accounts. The datasets are mounted on /datashare/. You can explore the top directories by listing the mount:

[jshleap@gra-login1 ~]$ ls -lL /datashare/
total 152
drwxrwxr-x 9 jshleap sn_staff        4096 Jul  6 11:14 1000genomes
drwxrwxr-x 2 jshleap sn_staff       94208 Jun  4 15:30 BLASTDB
drwxrwxr-x 2 jshleap sn_staff         107 Jun  4 15:30 BLAST_FASTA
drwxrwxr-x 5 jshleap sn_staff         229 Jun  4 18:49 CIFAR-10
drwxrwxr-x 5 jshleap sn_staff         221 Jun  4 18:49 CIFAR-100
drwxrwxr-x 6 jshleap sn_staff         115 Apr 27 10:00 COCO
drwxrwxr-x 2 jshleap sn_staff         135 Jun 10 18:23 DIAMONDDB_2.0.9
drwxrwxr-x 6 jshleap sn_staff         321 Feb  4 17:39 EggNog
drwxrwxr-x 3 jshleap sn_staff          46 Mar 23 14:23 hg38
drwxrws--- 9 jshleap imagenet-optin   244 Jun 16 09:22 ImageNet
drwxrwxr-x 8 jshleap sn_staff        4096 Jun  7 16:58 kraken2_dbs
drwxrwxr-x 2 jshleap sn_staff         191 Jun  4 18:49 MNIST
drwxrwxr-x 2 jshleap sn_staff          50 Jun  4 18:51 MPI_SINTEL
drwxrwxr-x 2 jshleap sn_staff        4096 Jun  9 17:09 NCBI_taxonomy
drwxrwxr-x 6 jshleap sn_staff         145 Feb  4 22:44 PANTHER
drwxrwxr-x 5 jshleap sn_staff        4096 Apr 19 17:24 PFAM
drwxrwxr-x 7 jshleap sn_staff        4096 Mar 29 09:52 SILVA
drwxrwxr-x 6 jshleap sn_staff         257 Feb  4 22:46 SVHN
drwxrwxr-x 4 jshleap sn_staff         189 Apr 19 17:59 UNIPROT
drwxrwx--- 5 jshleap voxceleb-optin    98 Apr 23 15:15 VoxCeleb


Below is a detailed description of each dataset and how to access it.

Bioinformatics

Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At SHARCNET (www.sharcnet.ca) we provide a set of these datasets for bioinformatics:

1000 Genomes

In human genetics, the 1000 Genomes Project (1KGP) was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their FTP site and check for updates twice a year (June and December).

Directory structure

1000 Genomes directory tree (up to level 2):

/datashare/1000genomes
├── CHANGELOG
├── data_collections
│   ├── 1000G_2504_high_coverage
│   ├── 1000G_2504_high_coverage_SV
│   ├── 1000_genomes_project
│   ├── gambian_genome_variation_project
│   ├── gambian_genome_variation_project_GRCh37
│   ├── geuvadis
│   ├── han_chinese_high_coverage
│   ├── HGDP
│   ├── HGSVC2
│   ├── hgsv_sv_discovery
│   ├── HLA_types
│   ├── illumina_platinum_pedigree
│   ├── index.html
│   ├── README_data_collections.md
│   └── simons_diversity_data
├── historical_data
│   ├── former_toplevel
│   ├── index.html
│   └── README_historical_data.md
├── index.html
├── phase1
│   ├── analysis_results
│   ├── data
│   ├── index.html
│   ├── phase1.alignment.index
│   ├── phase1.alignment.index.bas.gz
│   ├── phase1.exome.alignment.index
│   ├── phase1.exome.alignment.index.bas.gz
│   ├── phase1.exome.alignment.index.HsMetrics.gz
│   ├── phase1.exome.alignment.index.HsMetrics.stats
│   ├── phase1.exome.alignment.index_stats.csv
│   ├── README.phase1_alignment_data
│   └── technical
├── phase3
│   ├── 20130502.phase3.analysis.sequence.index
│   ├── 20130502.phase3.exome.alignment.index
│   ├── 20130502.phase3.low_coverage.alignment.index
│   ├── 20130502.phase3.sequence.index
│   ├── 20130725.phase3.cg_sra.index
│   ├── 20130820.phase3.cg_data_index
│   ├── 20131219.populations.tsv
│   ├── 20131219.superpopulations.tsv
│   ├── data
│   ├── index.html
│   ├── integrated_sv_map
│   ├── README_20150504_phase3_data
│   └── README_20160404_where_are_the_phase3_variants
├── pilot_data
│   ├── data
│   ├── index.html
│   ├── paper_data_sets
│   ├── pilot_data.alignment.index
│   ├── pilot_data.alignment.index.bas.gz
│   ├── pilot_data.sequence.index
│   ├── README.alignment.index
│   ├── README.bas
│   ├── README.sequence.index
│   ├── release
│   ├── SRP000031.sequence.index
│   ├── SRP000032.sequence.index
│   ├── SRP000033.sequence.index
│   └── technical
├── PRIVACY-NOTICE.txt
├── README_ebi_aspera_info.md
├── README_file_formats_and_descriptions.md
├── README_ftp_site_structure.md
├── README_missing_files.md
├── README_populations.md
├── README_using_1000genomes_cram.md
├── release
│   ├── 2008_12
│   ├── 2009_02
│   ├── 2009_04
│   ├── 2009_05
│   ├── 2009_08
│   ├── 20100804
│   ├── 2010_11
│   ├── 20101123
│   ├── 20110521
│   ├── 20130502
│   └── index.html
└── technical
    ├── browser
    ├── index.html
    ├── method_development
    ├── ncbi_varpipe_data
    ├── other_exome_alignments
    ├── other_exome_alignments.alignment_indices
    ├── phase3_EX_or_LC_only_alignment
    ├── pilot2_high_cov_GRCh37_bams
    ├── pilot3_exon_targetted_GRCh37_bams
    ├── qc
    ├── README.reference
    ├── reference
    ├── retired_reference
    ├── simulations
    ├── supporting
    └── working

As per their README, the directory structure is:

changelog_details

This directory contains a series of files detailing the changes made to the FTP site over time.

data_collections

The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the 1000 Genomes Project data.

For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.

historical_data

This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.

phase1

This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.

phase3

This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.

pilot_data

This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.

release

The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.

Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.

Examples of release subdirectories include /datashare/1000genomes/release/2008_12/.

In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.

For example, the directory /datashare/1000genomes/release/20100804/ contains the release versions of SNP and indel calls based on the /datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index file.

technical

The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.

An example of data stored under technical is /datashare/1000genomes/technical/simulations/.

WARNING: /datashare/1000genomes/technical/working/
 The working directory under technical contains data that has experimental (non-public release) status
 and is suitable for internal project use only. Please use with caution.

BLASTDB

BLAST uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the BLAST FTP site.

The pre-formatted databases offer the following advantages:

  • Pre-formatting removes the need to run makeblastdb
  • Species-level taxonomy ids are included for each database entry
  • Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility
IMPORTANT: The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found in NCBI's BLAST documentation.

All available pre-formatted databases are located in Graham's /datashare/BLASTDB and are updated every 3 months (Jan, Apr, Jul, Oct).
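
For example, you can regenerate a FASTA file from any of the pre-formatted databases with the blastdbcmd utility. A minimal sketch, using swissprot as the target (substitute the database you need):

   module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
   blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta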

Directory structure

/datashare/BLASTDB contains all the pre-formatted databases without any subfolders. We include the following:

Name Type Title
16S_ribosomal_RNA DNA 16S ribosomal RNA (Bacteria and Archaea type strains)
18S_fungal_sequences DNA 18S ribosomal RNA sequences (SSU) from Fungi type and reference material
28S_fungal_sequences DNA 28S ribosomal RNA sequences (LSU) from Fungi type and reference material
Betacoronavirus DNA Betacoronavirus
GCF_000001405.38_top_level DNA Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds
GCF_000001635.26_top_level DNA Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds
ITS_RefSeq_Fungi DNA Internal transcribed spacer region (ITS) from Fungi type and reference material
ITS_eukaryote_sequences DNA ITS eukaryote BLAST
env_nt DNA environmental samples
nt DNA Nucleotide collection (nt)
patnt DNA Nucleotide sequences derived from the Patent division of GenBank
pdbnt DNA PDB nucleotide database
ref_euk_rep_genomes DNA RefSeq Eukaryotic Representative Genome Database
ref_prok_rep_genomes DNA Refseq prokaryote representative genomes (contains refseq assembly)
ref_viroids_rep_genomes DNA Refseq viroids representative genomes
ref_viruses_rep_genomes DNA Refseq viruses representative genomes
refseq_rna DNA NCBI Transcript Reference Sequences
refseq_select_rna DNA RefSeq Select RNA sequences
env_nr Protein Proteins from WGS metagenomic projects (env_nr)
landmark Protein Landmark database for SmartBLAST
nr Protein All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
pdbaa Protein PDB protein database
pataa Protein Protein sequences derived from the Patent division of GenBank
refseq_protein Protein NCBI Protein Reference Sequences
refseq_select_prot Protein RefSeq Select proteins
swissprot Protein Non-redundant UniProtKB/SwissProt sequences
split-cdd Protein CDD split into 32 volumes
tsa_nr Protein Transcriptome Shotgun Assembly (TSA) sequences

Usage

The most efficient way to use these databases is to copy the specific database to $SLURM_TMPDIR at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:


   #!/bin/bash
   #SBATCH --time=02:00:00
   #SBATCH --mem=32G
   #SBATCH --cpus-per-task=8
   #SBATCH --account=def-someuser
   module load  StdEnv/2020  gcc/9.3.0 blast+/2.11.0 # load blast and dependencies
   (cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} && # copy the required database volumes (in this case nr.*) to $SLURM_TMPDIR without the leading datashare/BLASTDB path
   blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta


Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.

You can also use /datashare/BLASTDB/nr directly (skipping the copy step), but it might be slower than having the database on the local disk.
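
Since the databases are refreshed quarterly, you may want to confirm the build date and size of a database before starting a long run. A quick check with blastdbcmd:

   blastdbcmd -db /datashare/BLASTDB/nr -info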

Other Compute Canada Sources

BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.

BLAST_FASTA

This directory contains the raw sequences located in the blast/db/FASTA/ directory of the NCBI FTP repository, in gzip-compressed format:

  134M Apr 10 15:36 swissprot.gz
  96G  Apr 10 22:11 nr.gz
  108G Apr 12 07:55 nt.gz
  32M  Jun  4 15:30 pdbaa.gz

Similar to the pre-formatted databases (located in /datashare/BLASTDB), these FASTA files can be found at /datashare/BLAST_FASTA and are updated every 3 months (Jan, Apr, Jul, Oct).
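
Because the files are gzip-compressed, they can be streamed without decompressing them to disk. For example, a minimal sketch that counts the sequences in swissprot:

   zcat /datashare/BLAST_FASTA/swissprot.gz | grep -c '^>'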

DIAMONDDB_2.0.9

DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database level and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:

  diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp
  diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp
  diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp
  diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp

As can be seen, four databases are distributed in the /datashare/DIAMONDDB_2.0.9 directory, representing BLAST's nt, nr, pdbaa, and swissprot. All of them contain taxonomic information. Since the source of these databases is BLAST_FASTA, updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).

Considerations when using these databases

The Diamond program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sequences (both in length and number). Should the program fail due to running out of either one, you need to set a lower value for the block size parameter -b.
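
For example, if a run with the default block size fails, you could rerun it with a smaller block size along these lines (the value 1.0 is illustrative, not a recommendation; --tmpdir points the temporary files at the node-local disk):

   diamond blastp -d ${SLURM_TMPDIR}/nr -q myquery.fasta -o matches.tsv -b 1.0 --tmpdir ${SLURM_TMPDIR}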

Usage

The most efficient way to use these databases is to copy the specific database to $SLURM_TMPDIR at the beginning of your sbatch script, just like with BLASTDB. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your run will take longer than one hour. In this case, unlike with BLASTDB, only one file needs to be moved, which means that cp is more efficient than tar for moving the file. For example, your sbatch script can look something like this:


   #!/bin/bash
   #SBATCH --time=02:00:00
   #SBATCH --mem=32G
   #SBATCH --cpus-per-task=8
   #SBATCH --account=def-someuser
   module load  StdEnv/2020  diamond/2.0.9 # load diamond and dependencies
   cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR
   diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv # query against the local copy

Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.

You can also use /datashare/DIAMONDDB_2.0.9/nr directly (skipping the copy step), but it might be slower than having the database on the local disk.

EggNog

The EggNOG database is a database of biological information hosted by the EMBL. It is based on the original idea of COGs and expands that idea to non-supervised orthologous groups constructed from numerous organisms.

This data mount contains a copy of the latest EggNOG databases.

Directory structure

EggNOG directory tree (up to level 2):

/datashare/EggNog
├── e5.level_info.tar.gz
├── e5.og_annotations.tsv
├── e5.proteomes.faa
├── e5.sequence_aliases.tsv
├── e5.taxid_info.tsv
├── e5.viruses.faa
├── gbff
│   ├── eutils_wgs_calledGenes
│   └── eutils_wgs_calledGenes_2
├── id_mappings
│   └── uniprot
├── per_tax_level
│   ├── 1
│   ├── 10
│   ├── 1016
│   ├── 10239
│   ├── 1028384
│   ├── 10404
│   ├── 104264
│   ├── 10474
│   ├── 10477
│   ├── 1060
│   ├── 10656
│   ├── 10662
│   ├── 10699
│   ├── 10744
│   ├── 10841
│   ├── 10860
│   ├── 1090
│   ├── 1100069
│   ├── 110618
│   ├── 11157
│   ├── 1117
│   ├── 112252
│   ├── 1129
│   ├── 1142
│   ├── 1150
│   ├── 1161
│   ├── 11632
│   ├── 1164882
│   ├── 117743
│   ├── 117747
│   ├── 118882
│   ├── 118884
│   ├── 1189
│   ├── 118969
│   ├── 119043
│   ├── 119045
│   ├── 119060
│   ├── 119065
│   ├── 119066
│   ├── 119069
│   ├── 119089
│   ├── 119603
│   ├── 11989
│   ├── 121069
│   ├── 1212
│   ├── 122277
│   ├── 1224
│   ├── 1236
│   ├── 1239
│   ├── 1268
│   ├── 1283313
│   ├── 129337
│   ├── 1297
│   ├── 1303
│   ├── 1305
│   ├── 1307
│   ├── 1313
│   ├── 135613
│   ├── 135614
│   ├── 135618
│   ├── 135619
│   ├── 135623
│   ├── 135624
│   ├── 135625
│   ├── 1357
│   ├── 136841
│   ├── 136843
│   ├── 136845
│   ├── 136846
│   ├── 136849
│   ├── 1386
│   ├── 142182
│   ├── 145357
│   ├── 147541
│   ├── 147545
│   ├── 147548
│   ├── 147550
│   ├── 150247
│   ├── 1506553
│   ├── 1511857
│   ├── 155619
│   ├── 1570339
│   ├── 157897
│   ├── 1653
│   ├── 167375
│   ├── 171550
│   ├── 171551
│   ├── 1762
│   ├── 178469
│   ├── 182709
│   ├── 183925
│   ├── 183939
│   ├── 183963
│   ├── 183967
│   ├── 183968
│   ├── 183980
│   ├── 186801
│   ├── 186804
│   ├── 186806
│   ├── 186807
│   ├── 186813
│   ├── 186818
│   ├── 186820
│   ├── 186821
│   ├── 186822
│   ├── 186823
│   ├── 186824
│   ├── 186827
│   ├── 186828
│   ├── 186928
│   ├── 189330
│   ├── 189775
│   ├── 191028
│   ├── 191675
│   ├── 2
│   ├── 200643
│   ├── 200783
│   ├── 200795
│   ├── 200918
│   ├── 200930
│   ├── 200940
│   ├── 201174
│   ├── 203494
│   ├── 203682
│   ├── 203691
│   ├── 204037
│   ├── 204428
│   ├── 204432
│   ├── 204441
│   ├── 204457
│   ├── 204458
│   ├── 2063
│   ├── 206350
│   ├── 206351
│   ├── 206389
│   ├── 213113
│   ├── 213115
│   ├── 213118
│   ├── 213462
│   ├── 213481
│   ├── 2157
│   ├── 216572
│   ├── 224756
│   ├── 225057
│   ├── 228398
│   ├── 2323
│   ├── 237
│   ├── 2433
│   ├── 244698
│   ├── 245186
│   ├── 246874
│   ├── 252301
│   ├── 252356
│   ├── 255475
│   ├── 256005
│   ├── 265
│   ├── 265975
│   ├── 267888
│   ├── 267889
│   ├── 267890
│   ├── 267893
│   ├── 267894
│   ├── 2759
│   ├── 28037
│   ├── 28211
│   ├── 28216
│   ├── 28221
│   ├── 2836
│   ├── 283735
│   ├── 285107
│   ├── 28883
│   ├── 28889
│   ├── 28890
│   ├── 289201
│   ├── 29
│   ├── 29000
│   ├── 290174
│   ├── 29258
│   ├── 29547
│   ├── 301297
│   ├── 302485
│   ├── 3041
│   ├── 308865
│   ├── 311790
│   ├── 314146
│   ├── 314294
│   ├── 31979
│   ├── 31993
│   ├── 32003
│   ├── 32061
│   ├── 32066
│   ├── 32199
│   ├── 326319
│   ├── 326457
│   ├── 33090
│   ├── 33154
│   ├── 33183
│   ├── 33208
│   ├── 33213
│   ├── 33342
│   ├── 33554
│   ├── 335928
│   ├── 33867
│   ├── 33958
│   ├── 34008
│   ├── 34037
│   ├── 34383
│   ├── 34384
│   ├── 34397
│   ├── 35237
│   ├── 35268
│   ├── 35278
│   ├── 35301
│   ├── 35325
│   ├── 35493
│   ├── 355688
│   ├── 35718
│   ├── 358033
│   ├── 363408
│   ├── 3699
│   ├── 38820
│   ├── 39782
│   ├── 400634
│   ├── 40117
│   ├── 40674
│   ├── 41294
│   ├── 414999
│   ├── 422676
│   ├── 423358
│   ├── 439488
│   ├── 43988
│   ├── 4447
│   ├── 451866
│   ├── 451867
│   ├── 451870
│   ├── 452284
│   ├── 45401
│   ├── 45404
│   ├── 45667
│   ├── 46205
│   ├── 464095
│   ├── 468
│   ├── 4751
│   ├── 4776
│   ├── 4890
│   ├── 4891
│   ├── 4893
│   ├── 5042
│   ├── 50557
│   ├── 506
│   ├── 508458
│   ├── 5125
│   ├── 5129
│   ├── 5139
│   ├── 5148
│   ├── 5151
│   ├── 52018
│   ├── 5204
│   ├── 5234
│   ├── 52604
│   ├── 526524
│   ├── 52959
│   ├── 53335
│   ├── 5338
│   ├── 53433
│   ├── 538999
│   ├── 539002
│   ├── 541000
│   ├── 544
│   ├── 544448
│   ├── 547
│   ├── 548681
│   ├── 551
│   ├── 554915
│   ├── 558415
│   ├── 561
│   ├── 5653
│   ├── 572511
│   ├── 57723
│   ├── 5794
│   ├── 5796
│   ├── 5809
│   ├── 5819
│   ├── 583
│   ├── 586
│   ├── 5863
│   ├── 5878
│   ├── 58840
│   ├── 590
│   ├── 59732
│   ├── 60136
│   ├── 613
│   ├── 61432
│   ├── 622450
│   ├── 6231
│   ├── 6236
│   ├── 629
│   ├── 629295
│   ├── 639021
│   ├── 651137
│   ├── 6656
│   ├── 671232
│   ├── 675063
│   ├── 68295
│   ├── 68298
│   ├── 68525
│   ├── 68892
│   ├── 69277
│   ├── 69541
│   ├── 69657
│   ├── 7088
│   ├── 71274
│   ├── 713636
│   ├── 7147
│   ├── 7148
│   ├── 7214
│   ├── 72273
│   ├── 72275
│   ├── 7399
│   ├── 74030
│   ├── 74201
│   ├── 74385
│   ├── 75682
│   ├── 766
│   ├── 766764
│   ├── 76804
│   ├── 76831
│   ├── 768503
│   ├── 7711
│   ├── 772
│   ├── 7742
│   ├── 7898
│   ├── 80864
│   ├── 815
│   ├── 81850
│   ├── 81852
│   ├── 82115
│   ├── 82117
│   ├── 82986
│   ├── 830
│   ├── 83612
│   ├── 84406
│   ├── 8459
│   ├── 84992
│   ├── 84995
│   ├── 84998
│   ├── 85004
│   ├── 85005
│   ├── 85008
│   ├── 85009
│   ├── 85010
│   ├── 85012
│   ├── 85013
│   ├── 85014
│   ├── 85016
│   ├── 85017
│   ├── 85018
│   ├── 85019
│   ├── 85020
│   ├── 85021
│   ├── 85023
│   ├── 85025
│   ├── 85026
│   ├── 8782
│   ├── 90964
│   ├── 909932
│   ├── 91061
│   ├── 91561
│   ├── 91835
│   ├── 9263
│   ├── 92860
│   ├── 93682
│   ├── 9397
│   ├── 9443
│   ├── 9604
│   ├── 97050
│   ├── 976
│   ├── 995019
│   └── 9989
└── raw_data
    ├── e5.best_hit_homology_matrix.tsv.gz
    └── speciation_events.tsv.gz

386 directories, 8 files

The top-level directory includes the e5 release of the proteomes and its annotations. The gbff folder contains annotations in GenBank format. The id_mappings folder contains the taxonomic information and the mappings to EggNOG's taxids. The per_tax_level folder contains a series of folders labeled by taxonomic ID. In each one of them, you can find *_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar files with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the raw_data folder contains the homology/speciation events used in EggNOG's clustering.

hg38

kraken2_dbs

NCBI_taxonomy

This dataset contains a mirror of the NCBI taxonomy FTP. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as for direct searches of accession numbers, taxonomic IDs, and related information. It is updated together with the BLAST databases.
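
For example, the accession-to-taxid mappings can be queried directly with zgrep. A sketch that looks up the taxonomic ID of one protein accession (NP_000509 is just an illustrative RefSeq accession):

   zgrep -m 1 -w 'NP_000509' /datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz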

Directory structure

NCBI_taxonomy directory tree (up to level 2):

/datashare/NCBI_taxonomy
├── accession2taxid
│   ├── dead_nucl.accession2taxid.gz
│   ├── dead_nucl.accession2taxid.gz.md5
│   ├── dead_prot.accession2taxid.gz
│   ├── dead_prot.accession2taxid.gz.md5
│   ├── dead_wgs.accession2taxid.gz
│   ├── dead_wgs.accession2taxid.gz.md5
│   ├── index.html
│   ├── nucl_gb.accession2taxid.gz
│   ├── nucl_gb.accession2taxid.gz.md5
│   ├── nucl_wgs.accession2taxid.gz
│   ├── nucl_wgs.accession2taxid.gz.md5
│   ├── pdb.accession2taxid.gz
│   ├── pdb.accession2taxid.gz.md5
│   ├── prot.accession2taxid.FULL.gz
│   ├── prot.accession2taxid.FULL.gz.md5
│   ├── prot.accession2taxid.gz
│   ├── prot.accession2taxid.gz.md5
│   └── README
├── biocollections
│   ├── Collection_codes.txt
│   ├── index.html
│   ├── Institution_codes.txt
│   └── Unique_institution_codes.txt
├── categories.dmp
├── Ccode_dump.txt
├── citations.dmp
├── coll_dump.txt
├── Cowner_dump.txt
├── delnodes.dmp
├── division.dmp
├── gc.prt
├── gencode.dmp
├── Icode_dump.txt
├── index.html
├── merged.dmp
├── names.dmp
├── ncbi_taxonomy_genussp.txt
├── new_taxdump
│   ├── index.html
│   ├── new_taxdump.tar.gz
│   ├── new_taxdump.tar.gz.md5
│   └── taxdump_readme.txt
├── nodes.dmp
├── README
├── readme.txt
├── taxcat_readme.txt
├── taxcat.tar.gz
├── taxcat.tar.gz.md5
├── taxdump_archive
│   └── index.html
├── taxdump_readme.txt
├── taxdump.tar.gz
└── taxdump.tar.gz.md5

4 directories, 50 files

Usage with TaxonKit

In Compute Canada, we have a taxonomy manipulation program called TaxonKit. You can load it with module load StdEnv/2020 taxonkit. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply link the taxonomy dump files into the ~/.taxonkit folder:

mkdir -p ~/.taxonkit
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/

Then you can use taxonkit directly.
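
For example, to print the full lineage of a taxonomic ID (9606 is Homo sapiens):

   module load StdEnv/2020 taxonkit
   echo 9606 | taxonkit lineage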


PANTHER

The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.

In our data mount, we provide users with some of the relevant data found on the pantherdb FTP site, namely: hmm_classifications, panther_library, pathway, and sequence_classifications.

Directory structure

PANTHER directory tree (up to level 2):

/datashare/PANTHER/
├── hmm_classifications
│   ├── LICENSE
│   ├── PANTHER15.0_HMM_classifications
│   ├── PANTHER16.0_HMM_classifications
│   └── README
├── panther_library
│   ├── ascii
│   ├── hmmscoring
│   ├── PANTHER15.0_ascii.tgz
│   ├── PANTHER15.0_fasta
│   ├── PANTHER15.0_fasta.tgz
│   ├── PANTHER15.0_hmmscoring.tgz
│   ├── PANTHER16.0_ascii.tgz
│   ├── PANTHER16.0_binary.tgz
│   ├── PANTHER16.0_fasta
│   ├── PANTHER16.0_fasta.tgz
│   ├── README
│   ├── target4
│   └── wget_panther_panther_library.log
├── pathway
│   ├── BioPAX
│   ├── BioPAX.tar.gz
│   ├── sbml
│   ├── sbml.tar.gz
│   ├── SequenceAssociationPathway3.6.4.txt
│   └── SequenceAssociationPathway3.6.5.txt
└── sequence_classifications
    ├── LICENSE
    ├── PANTHER_Sequence_Classification_files
    ├── README
    └── species

12 directories, 19 files

hmm_classifications

This folder contains the classification files for versions 15.0 and 16.0. Each file contains the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.

The files are tab-delimited, with the following columns:

  1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6 (":SF" indicates the subfamily ID)
  2) Name: the annotation assigned by curators to the PANTHER family or subfamily
  3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies
  4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies
  5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies
  6) Protein class*: PANTHER protein class terms assigned to families and subfamilies
  7) Pathway***: PANTHER pathways assigned to families and subfamilies
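
Since the files are tab-delimited, standard command-line tools work on them. For example, a sketch that shows the ID and name columns of the version-16 file:

   cut -f 1,2 /datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications | head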

For more information check the README file at /datashare/PANTHER/hmm_classifications

panther_library

This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.

For more information check the README file at /datashare/PANTHER/panther_library

pathway

This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway. The pathways are provided in BioPAX and SBML formats.

sequence_classifications

The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.

A total of 142 classification files are provided here, one for each organism. For more information check the README file at /datashare/PANTHER/sequence_classifications

PFAM

Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.

SILVA

UNIPROT

AI

CIFAR-10

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

We provide the binary, Matlab, and Python versions of the CIFAR-10 training and test sets, along with the labels.

Directory structure

CIFAR-10 directory tree (up to level 2):

/datashare/CIFAR-10
├── cifar-10-batches-bin
│   ├── batches.meta.txt
│   ├── data_batch_1.bin
│   ├── data_batch_2.bin
│   ├── data_batch_3.bin
│   ├── data_batch_4.bin
│   ├── data_batch_5.bin
│   ├── readme.html
│   └── test_batch.bin
├── cifar-10-batches-mat
│   ├── batches.meta.mat
│   ├── data_batch_1.mat
│   ├── data_batch_2.mat
│   ├── data_batch_3.mat
│   ├── data_batch_4.mat
│   ├── data_batch_5.mat
│   ├── readme.html
│   └── test_batch.mat
├── cifar-10-batches-py
│   ├── batches.meta
│   ├── data_batch_1
│   ├── data_batch_2
│   ├── data_batch_3
│   ├── data_batch_4
│   ├── data_batch_5
│   ├── readme.html
│   └── test_batch
├── cifar-10-binary.tar.gz
├── cifar-10-matlab.tar.gz
└── cifar-10-python.tar.gz

3 directories, 27 files
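
As with the bioinformatics databases, jobs that read the dataset many times may benefit from copying one of the tarballs to node-local storage first. A minimal sketch for the Python version inside an sbatch script:

   cp /datashare/CIFAR-10/cifar-10-python.tar.gz ${SLURM_TMPDIR}
   tar xzf ${SLURM_TMPDIR}/cifar-10-python.tar.gz -C ${SLURM_TMPDIR}
   # the batches are now under ${SLURM_TMPDIR}/cifar-10-batches-py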

CIFAR-100

This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.

We provide the binary, Matlab, and Python versions of the CIFAR-100 training and test sets, along with the labels.

Directory structure

CIFAR-100 directory tree (up to level 2):

/datashare/CIFAR-100
├── cifar-100-binary
│   ├── coarse_label_names.txt
│   ├── fine_label_names.txt
│   ├── test.bin
│   └── train.bin
├── cifar-100-binary.tar.gz
├── cifar-100-matlab
│   ├── meta.mat
│   ├── test.mat
│   └── train.mat
├── cifar-100-matlab.tar.gz
├── cifar-100-python
│   ├── file.txt
│   ├── meta
│   ├── test
│   └── train
└── cifar-100-python.tar.gz

3 directories, 13 files


COCO

ImageNet

MNIST

MPI_SINTEL

SVHN

VoxCeleb