Graham Reference Dataset Repository
2022-03-04T16:22:49Z
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount that provides our users with commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI|AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets take up in their project allocations. The datasets are mounted under <code>/datashare/</code>. You can explore the top-level directories by listing the mount:
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.
<br />
== Bioinformatics ==<br />
Bioinformatics software often relies on reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide the following set of these datasets for bioinformatics:
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the project's own README, the directory structure is:
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.
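For example, the path to one sample's data directory in the 1000 Genomes Project collection follows the collection/data/population/sample layout described above. The population code <code>GBR</code> and sample ID <code>HG00096</code> below are illustrative assumptions; consult the collection's own index files for the actual names:

<syntaxhighlight lang="bash">
# build the path to a sample's data directory following the
# data_collections/<collection>/data/<population>/<sample> layout
collection=1000_genomes_project
population=GBR     # hypothetical population code
sample=HG00096     # hypothetical sample ID
sample_dir=/datashare/1000genomes/data_collections/${collection}/data/${population}/${sample}
echo "${sample_dir}"
</syntaxhighlight>

On Graham you can then inspect that directory with <code>ls "${sample_dir}"</code>.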
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more information at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use these datasets following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.
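As a hypothetical job-script fragment (the input FASTA and output directory are placeholders, and the exact <code>run_alphafold.py</code> flags depend on the AlphaFold version, so follow the Compute Canada page above for the authoritative recipe):

<syntaxhighlight lang="bash">
export DOWNLOAD_DIR=/datashare/alphafold   # use the shared datasets instead of a private copy
python run_alphafold.py \
    --data_dir=$DOWNLOAD_DIR \
    --fasta_paths=target.fasta \
    --output_dir=$SCRATCH/alphafold_out \
    --max_template_date=2021-11-01
</syntaxhighlight>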
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
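The last point can be sketched as follows. This is a hypothetical invocation: the accession <code>P01308</code> is a placeholder, and the module versions are the ones used elsewhere on this page:

<syntaxhighlight lang="bash">
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
# extract one entry from the pre-formatted swissprot database as FASTA
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta
</syntaxhighlight>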
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every three months (January, April, July, October).
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:
<br />
<br />
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load blast and dependencies
# copy the required database (in this case nr) to $SLURM_TMPDIR;
# the nr.* glob picks up all volumes of the pre-formatted database
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta
</syntaxhighlight>
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that the <code>nr</code> database is required.
<br />
You can also query the database in place with <code>-db /datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.
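The tar staging pattern used above can be sketched with stand-in files; the file names below only mimic pre-formatted database volumes, and on Graham you would use the real <code>/datashare/BLASTDB</code> path:

<syntaxhighlight lang="bash">
# create a fake "database" of several volumes in a scratch source dir
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/nr.00.phr" "$src/nr.00.pin" "$src/nr.00.psq" "$src/nr.pal"
# the tar pipe copies every nr.* file while keeping the flat layout
(cd "$src" && tar cf - nr.*) | (cd "$dest" && tar xf -)
ls "$dest"
</syntaxhighlight>

The subshells keep the archive paths relative, so the extracted files land directly in the destination directory rather than under a long absolute-path prefix.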
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:
<br />
<pre>
134M Apr 10 15:36 swissprot.gz
96G  Apr 10 22:11 nr.gz
108G Apr 12 07:55 nt.gz
32M  Jun  4 15:30 pdbaa.gz
</pre>
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
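Because these files are very large, it is usually better to stream them with <code>gunzip -c</code> than to decompress a copy on disk. A minimal sketch using a tiny stand-in file (on Graham, substitute e.g. <code>/datashare/BLAST_FASTA/swissprot.gz</code>):

<syntaxhighlight lang="bash">
# make a tiny gzip-compressed FASTA standing in for the real download
tmp=$(mktemp -d)
printf '>seq1 example protein\nMKVLAA\n' | gzip -c > "$tmp/swissprot.gz"
# stream the sequences without writing an uncompressed copy to disk
first_header=$(gunzip -c "$tmp/swissprot.gz" | head -n 1)
echo "$first_header"
</syntaxhighlight>

This avoids writing a full uncompressed copy of files like <code>nr.gz</code> to disk before using them.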
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a manner similar to BLAST, with optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).
<br />
==== Considerations when using these databases ====
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either resource, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b] (e.g. <code>-b 1.0</code>).
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:
<br />
<br />
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load StdEnv/2020 diamond/2.0.9  # load diamond and dependencies
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv
</syntaxhighlight>
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that the <code>nr</code> database is required.
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (without the copy step), but it might be slower than having the database on the local disk.
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of orthologous groups and functional annotations hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz</code>, <code>*_hmms.tar</code>, <code>*_hmms.tar.gz</code>, <code>*_members.tsv.gz</code>, <code>*_raw_algs.tar</code>, <code>*_stats.tsv</code>, <code>*_trees.tsv.gz</code>, and <code>*_trimmed_algs.tar</code> files with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.
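For example, to work with the files for one taxonomic level (taxid <code>33208</code>, which appears in the tree above), you could copy and decompress its annotation table. The <code>33208_annotations.tsv.gz</code> filename is inferred from the <code>*_annotations.tsv.gz</code> pattern, so verify it with <code>ls</code> first:

<syntaxhighlight lang="bash">
ls /datashare/EggNog/per_tax_level/33208/              # check the exact filenames
cp /datashare/EggNog/per_tax_level/33208/33208_annotations.tsv.gz .
gunzip 33208_annotations.tsv.gz                        # yields 33208_annotations.tsv
</syntaxhighlight>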
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and these k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). At SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, which means they use a compact hash table; with this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a minimizer that was never inserted. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to a database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, you would run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's inspect format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
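The inspect-style lines shown above can be parsed programmatically. A minimal sketch, assuming tab-separated columns (percent of minimizers in clade, clade count, direct count, rank code such as G for genus or S for species, NCBI taxid, indented name); the real output may be aligned slightly differently, so check a sample line first:<br />

```python
# Parse inspect-style report lines like the is_my_taxa_there output above.
lines = [
    "  0.03\t569792\t0\tG\t13396\t    Carcharodon",
    "  0.03\t569792\t569792\tS\t13397\t      Carcharodon carcharias",
]

records = []
for line in lines:
    pct, clade_count, direct_count, rank, taxid, name = line.split("\t")
    records.append((rank, taxid, name.strip(), float(pct)))

for rank, taxid, name, pct in records:
    print(f"{rank}\t{taxid}\t{name}\t{pct:.2f}")
```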
<br />
=== NCBI_taxonomy ===<br />
This dataset contains the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy ftp]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the blast databases.<br />
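The <code>.dmp</code> files in this dataset (e.g. <code>names.dmp</code>, <code>nodes.dmp</code>) use NCBI's taxdump layout: fields separated by <code>\t|\t</code> and lines terminated by <code>\t|</code>. A minimal parser sketch, using two fabricated sample lines in that format:<br />

```python
# Minimal parser for NCBI taxonomy .dmp files such as names.dmp.
# The sample lines below mimic the real field layout.
sample = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)

def parse_dmp(text):
    for line in text.splitlines():
        # drop the trailing "\t|" terminator, then split on the field separator
        yield [field.strip() for field in line.rstrip("\t|").split("\t|\t")]

for taxid, name, unique_name, name_class in parse_dmp(sample):
    print(taxid, name_class, "->", name)
```

See <code>/datashare/NCBI_taxonomy/taxdump_readme.txt</code> for the authoritative description of each column.<br />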
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
In Compute Canada, we have a taxonomic manipulation tool called TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16 of the PANTHER HMM library. They list the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily in the corresponding version.<br />
<br />
The files are tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
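Reading these files reduces to splitting on tabs and, optionally, separating the family from the subfamily ID. A sketch with a fabricated row (the family name and GO terms below are placeholders, not real PANTHER annotations):<br />

```python
# Sketch: reading one PANTHER HMM classification row (tab-delimited,
# columns as listed above). The row content is fabricated.
row = ("PTHR12213:SF6\tHYPOTHETICAL FAMILY NAME\tMF-term\tBP-term"
       "\tCC-term\tclass-term\tpathway-term")

fields = row.split("\t")
panther_id = fields[0]
family, _, subfamily = panther_id.partition(":")  # ":SF..." marks a subfamily
print(family, subfamily or "(family-level entry)", fields[1])
```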
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB &amp; FASTA files (SSU/LSU Ref). It is also fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
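Most SILVA downloads ship with a companion <code>.md5</code> file (in the <code>checksum  filename</code> convention of md5sum). A minimal sketch of verifying one; it fabricates a tiny file and checksum in a temporary directory, whereas on Graham you would point the paths at the real <code>/datashare/SILVA</code> files:<br />

```python
# Verify a file against its .md5 companion (sketch with fabricated data).
import hashlib
import os
import tempfile

def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large archives fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()

tmp = tempfile.mkdtemp()
data = os.path.join(tmp, "example.fasta.gz")
with open(data, "wb") as fh:
    fh.write(b"not a real SILVA file\n")
# .md5 companions hold "checksum  filename"
with open(data + ".md5", "w") as fh:
    fh.write(md5sum(data) + "  example.fasta.gz\n")

expected = open(data + ".md5").read().split()[0]
print("OK" if md5sum(data) == expected else "MISMATCH")
```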
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code> or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
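The <code>knowledgebase/idmapping</code> directory contains tab-delimited mapping files whose rows carry a UniProtKB accession, an ID type, and the mapped identifier. A sketch of collecting them into a lookup table, with fabricated sample rows:<br />

```python
# Sketch of the three-column idmapping layout (accession, ID type, value).
# The accession and IDs below are fabricated examples.
sample = "P12345\tGeneID\t999\nP12345\tRefSeq\tNP_000000.1\n"

mapping = {}
for acc, id_type, value in (line.split("\t") for line in sample.splitlines()):
    mapping.setdefault(acc, {})[id_type] = value

print(mapping["P12345"]["GeneID"])
```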
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
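The files in <code>cifar-10-batches-py</code> are pickled dictionaries whose keys are byte strings; each row of <code>data</code> holds the 3072 values of one 32x32x3 image. A self-contained sketch that fabricates a one-image batch instead of reading the real <code>data_batch_1</code>:<br />

```python
# Loading a CIFAR-10 python batch (sketch). Real batches are pickled
# dicts with byte-string keys; here a tiny fake one stands in for
# /datashare/CIFAR-10/cifar-10-batches-py/data_batch_1.
import io
import pickle

fake_batch = pickle.dumps({b"data": [[0] * 3072], b"labels": [7]})

def unpickle(fh):
    # encoding='bytes' keeps the original byte-string keys under Python 3
    return pickle.load(fh, encoding="bytes")

batch = unpickle(io.BytesIO(fake_batch))
print(len(batch[b"data"][0]), batch[b"labels"])  # 3072 values per image
```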
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
The <code>test2017</code>, <code>train2017</code>, and <code>val2017</code> directories contain the plain images in JPEG format. All related annotations can be found in the <code>annotations</code> folder.<br />
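The annotation files are JSON documents with <code>images</code>, <code>annotations</code>, and <code>categories</code> lists, where annotations reference images and categories by ID. A minimal sketch with a fabricated single-image record (bounding boxes are <code>[x, y, width, height]</code>):<br />

```python
# Minimal sketch of the COCO annotation layout; the record is fabricated.
import json

doc = json.loads("""{
  "images": [{"id": 1, "file_name": "000000000001.jpg", "height": 480, "width": 640}],
  "annotations": [{"id": 10, "image_id": 1, "category_id": 18, "bbox": [10, 20, 30, 40]}],
  "categories": [{"id": 18, "name": "dog"}]
}""")

# Resolve category IDs to names, then walk the annotations
names = {c["id"]: c["name"] for c in doc["categories"]}
for ann in doc["annotations"]:
    print(ann["image_id"], names[ann["category_id"]], ann["bbox"])
```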
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. <br />
<br />
In SHARCNET we offer a copy of this dataset located at <code>/datashare/MNIST</code>.<br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
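The <code>*-ubyte.gz</code> files use the IDX format documented on that page: a big-endian magic number (0x803 for image files with three dimensions), the dimension sizes as 32-bit integers, then raw unsigned bytes. A sketch that builds a tiny fake one-image file in memory rather than reading the real <code>train-images-idx3-ubyte.gz</code>:<br />

```python
# Parse the IDX format of the MNIST ubyte files (sketch, fabricated data).
import gzip
import io
import struct

# magic 0x00000803 = unsigned-byte data with 3 dims (count, rows, cols);
# here a single 2x2 "image" stands in for 60,000 28x28 digits.
fake = struct.pack(">IIII", 0x803, 1, 2, 2) + bytes([0, 128, 255, 64])
buf = io.BytesIO(gzip.compress(fake))

with gzip.open(buf) as fh:
    magic, count, rows, cols = struct.unpack(">IIII", fh.read(16))
    pixels = fh.read(count * rows * cols)

print(magic == 0x803, count, rows, cols, list(pixels))
```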
<br />
=== MPI_SINTEL ===<br />
The MPI Sintel Dataset addresses limitations of existing optical flow benchmarks. It provides naturalistic video sequences that are challenging for current methods. It is designed to encourage research on long-range motion, motion blur, multi-frame analysis, and non-rigid motion.<br />
<br />
The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty.<br />
<br />
Sintel is an open-source animated short film produced by Ton Roosendaal and the Blender Foundation. The dataset's authors modified the film in many ways to make it useful for optical flow evaluation.<br />
<br />
In SHARCNET we provide the complete version of this dataset.<br />
<br />
==== Directory Structure ====<br />
The MPI_SINTEL dataset on Graham follows the structure below:<br />
<br />
<pre><br />
/datashare/MPI_SINTEL<br />
├── bundler<br />
│ ├── linux-x64<br />
│ ├── osx<br />
│ ├── README_BUNDLER.txt<br />
│ └── win<br />
├── flow_code<br />
│ ├── C<br />
│ └── MATLAB<br />
├── MPI-Sintel-complete.zip<br />
├── README.txt<br />
├── test<br />
│ ├── clean<br />
│ └── final<br />
└── training<br />
├── albedo<br />
├── clean<br />
├── final<br />
├── flow<br />
├── flow_viz<br />
├── invalid<br />
└── occlusions<br />
<br />
18 directories, 3 files<br />
</pre><br />
<br />
For more information about the dataset you can check the Readme file at <code>/datashare/MPI_SINTEL/README.txt</code> or visit http://sintel.is.tue.mpg.de/<br />
<br />
=== SVHN ===<br />
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.<br />
In SHARCNET we provide the full SVHN dataset at <code>/datashare/SVHN</code> on Graham.<br />
<br />
==== Directory Structure ====<br />
The SVHN dataset folder on Graham contains:<br />
<pre><br />
/datashare/SVHN<br />
├── extra<br />
├── extra_32x32.mat<br />
├── extra.tar.gz<br />
├── test<br />
├── test_32x32.mat<br />
├── test.tar.gz<br />
├── train<br />
├── train_32x32.mat<br />
└── train.tar.gz<br />
<br />
3 directories, 6 files<br />
</pre><br />
<br />
The <code>extra</code> folder contains 163,728 PNG images, <code>train</code> 33,402, and <code>test</code> 13,068. For more information visit http://ufldl.stanford.edu/housenumbers/<br />
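<br />
A quick way to verify those per-split counts on the mount is to tally the PNG files under each split; this is a convenience sketch (the <code>svhn_counts</code> helper name is ours, and the split layout is taken from the tree above):<br />
<br />
```shell
# Count PNG images in each SVHN split under a given base directory,
# e.g. /datashare/SVHN on Graham.
svhn_counts() {
    local base="$1" split
    for split in extra train test; do
        printf '%s %s\n' "$split" \
            "$(find "$base/$split" -type f -name '*.png' 2>/dev/null | wc -l | tr -d ' ')"
    done
}
```
<br />
Usage: <code>svhn_counts /datashare/SVHN</code>.<br />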
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=497
Graham Reference Dataset Repository
2021-10-13T14:24:06Z
<p>Jshleap: /* Directory Structure */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount that provides our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This mount is provided to better serve our users and to reduce the project-account storage that commonly used datasets would otherwise consume. The datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database (in this case nr, all its volumes) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
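<br />
Since the databases live flat in <code>/datashare/BLASTDB</code>, a name prefix selects all volumes of one database. A small sketch (our own helper, not part of BLAST+) to check how much data you would be staging into <code>$SLURM_TMPDIR</code> before committing to the copy:<br />
<br />
```shell
# Report the combined size of all volumes of one pre-formatted database,
# e.g.  blastdb_size /datashare/BLASTDB/nr
blastdb_size() {
    du -ch "$1".* | tail -n 1
}
```
<br />
The reported total also tells you roughly how long the staging step will take relative to the local-disk bandwidth.<br />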
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
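<br />
Because these dumps are large, it is usually better to stream them than to decompress them to disk (the DIAMOND <code>makedb</code> commands in the next section do this with process substitution). For instance, counting the sequences in a dump without ever writing an uncompressed copy (a sketch; the helper name is ours):<br />
<br />
```shell
# Count FASTA records in a gzip-compressed dump without decompressing
# it to disk, e.g.  count_fasta_seqs /datashare/BLAST_FASTA/swissprot.gz
count_fasta_seqs() {
    zcat "$1" | grep -c '^>'
}
```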
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works similarly to BLAST, but with optimizations at both the database and the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The Diamond program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sequences (both in length and number). Should the program fail due to running out of either one, you need to set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, holding the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
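<br />
As a convenience sketch for pulling one of those per-taxid files, the following helper (our own; the <code>TAXID_annotations.tsv.gz</code> naming follows the file list above, and 33208, the taxid for Metazoa in the tree, is just an example) streams the annotations for a given taxonomic level:<br />
<br />
```shell
# Stream the orthologous-group annotations for one taxonomic level.
# Assumed file naming: per_tax_level/TAXID/TAXID_annotations.tsv.gz
eggnog_annotations() {
    local base="$1" taxid="$2"
    zcat "$base/per_tax_level/$taxid/${taxid}_annotations.tsv.gz"
}
```
<br />
Usage: <code>eggnog_annotations /datashare/EggNog 33208 | head</code>.<br />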
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, which means they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or an LCA for a minimizer that was not inserted. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
The Kraken 2 databases are provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to a database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's inspect format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
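The inspect-style output shown above is easy to post-process. Below is a minimal Python sketch (a hypothetical helper, not part of the datashare) that parses the six columns: percentage of minimizers, minimizers in the clade, minimizers assigned directly to the taxon, rank code, NCBI taxid, and (indented) scientific name:<br />

```python
# Minimal sketch: parse Kraken 2 inspect-style lines (such as what
# is_my_taxa_there prints) into records. Hypothetical helper, not part
# of the datashare.
from typing import NamedTuple

class InspectRecord(NamedTuple):
    percent: float            # % of minimizers in this clade
    clade_minimizers: int     # minimizers in the clade
    direct_minimizers: int    # minimizers assigned directly to this taxon
    rank: str                 # rank code, e.g. G (genus), S (species)
    taxid: int                # NCBI taxonomic ID
    name: str                 # scientific name (indentation stripped)

def parse_inspect(lines):
    records = []
    for line in lines:
        if not line.strip():
            continue
        # Real output is tab-delimited; fall back to whitespace splitting.
        fields = line.split("\t") if "\t" in line else line.split(None, 5)
        pct, clade, direct, rank, taxid, name = fields
        records.append(InspectRecord(float(pct), int(clade), int(direct),
                                     rank, int(taxid), name.strip()))
    return records

# The two lines from the Carcharodon example above:
output = [
    "0.03\t569792\t0\tG\t13396\tCarcharodon",
    "0.03\t569792\t569792\tS\t13397\t  Carcharodon carcharias",
]
for rec in parse_inspect(output):
    print(rec.rank, rec.taxid, rec.name)
```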
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It is updated alongside the BLAST databases.<br />
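As an example of such a direct lookup, the <code>accession2taxid</code> tables map sequence accessions to taxids. The sketch below (a hypothetical helper, assuming the standard tab-delimited layout <code>accession</code>, <code>accession.version</code>, <code>taxid</code>, <code>gi</code>) streams a gzipped table row by row rather than loading it into memory, since some of these files are tens of gigabytes uncompressed:<br />

```python
# Minimal sketch: stream a gzipped accession2taxid table (e.g.
# /datashare/NCBI_taxonomy/accession2taxid/nucl_gb.accession2taxid.gz)
# and return the taxid for one accession. Hypothetical helper; assumes
# the standard four-column header line.
import csv
import gzip

def lookup_taxid(path, accession):
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if accession in (row["accession"], row["accession.version"]):
                return int(row["taxid"])
    return None  # accession not present in this table
```

For repeated lookups over many accessions, a single pass collecting all wanted accessions into a set is far faster than calling this once per accession.<br />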
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomy manipulation toolkit called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply create symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
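If you only need a quick lookup and prefer not to go through TaxonKit, the <code>.dmp</code> files are plain text: fields are separated by <code>\t|\t</code> and each row is terminated by <code>\t|</code>. A minimal sketch (a hypothetical helper, not part of the datashare) that builds a taxid-to-scientific-name map from <code>names.dmp</code>:<br />

```python
# Minimal sketch: parse NCBI taxonomy .dmp rows (fields separated by
# "\t|\t", rows terminated by "\t|") and map taxids to scientific names
# from names.dmp. Hypothetical helper, not part of the datashare.
def parse_dmp_line(line):
    # Drop the newline and the trailing "\t|" terminator, then split fields.
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

def scientific_names(names_dmp_path):
    names = {}
    with open(names_dmp_path) as handle:
        for line in handle:
            taxid, name, _unique_name, name_class = parse_dmp_line(line)[:4]
            # names.dmp lists several name classes per taxid; keep only
            # the canonical "scientific name" entries.
            if name_class == "scientific name":
                names[int(taxid)] = name
    return names
```

The same <code>parse_dmp_line</code> helper works for <code>nodes.dmp</code>, whose first three fields are taxid, parent taxid, and rank.<br />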
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code> or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
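The Python batches under <code>cifar-10-batches-py</code> are pickled dictionaries; per the dataset page, each batch maps <code>b"data"</code> to a 10000x3072 uint8 array (row-major RGB) and <code>b"labels"</code> to a list of 10000 integers. A minimal loading sketch (the keys are bytes because the files were pickled under Python 2):<br />

```python
# Minimal sketch: load one CIFAR-10 Python batch (a pickled dict).
# encoding="bytes" is required because the batches were written by
# Python 2, so the dictionary keys come back as bytes (b"data", b"labels").
import pickle

def load_batch(path):
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding="bytes")
```

For example, <code>load_batch("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")</code> returns one training batch; each 3072-byte row is the 1024 red, then green, then blue values of a 32x32 image.<br />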
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
Within <code>test2017</code>, <code>train2017</code> and <code>val2017</code> you can find the plain images in JPEG format. All related annotations can be found in the <code>annotations</code> folder.<br />
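The annotation files are plain JSON with top-level <code>images</code>, <code>annotations</code>, and <code>categories</code> lists (the layout documented on the COCO website). Most users will reach for <code>pycocotools</code>, but a quick summary needs only the standard library; the sketch below counts annotations per category name:<br />

```python
# Minimal sketch: summarize a COCO annotation JSON using only the
# standard library. COCO JSON has top-level "images", "annotations",
# and "categories" lists; each annotation carries image_id and
# category_id fields.
import json
from collections import Counter

def annotations_per_category(path):
    with open(path) as handle:
        coco = json.load(handle)
    # Map numeric category ids to human-readable names.
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(ann["category_id"] for ann in coco["annotations"])
    return {names[cid]: n for cid, n in counts.items()}
```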
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. <br />
<br />
On SHARCNET, we offer a copy of this dataset at <code>/datashare/MNIST</code>.<br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
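The <code>ubyte</code> files use the IDX format described on the MNIST page: a big-endian header whose magic number encodes the element type (third byte) and the number of dimensions (fourth byte), followed by one unsigned 32-bit size per dimension and then the raw bytes. A minimal reading sketch that also handles the gzipped files directly:<br />

```python
# Minimal sketch: read an IDX-format file such as
# /datashare/MNIST/train-images-idx3-ubyte.gz. The big-endian header is
# a 4-byte magic number (last byte = number of dimensions), then one
# unsigned 32-bit size per dimension, then the raw unsigned bytes.
import gzip
import struct

def read_idx(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        magic = struct.unpack(">I", handle.read(4))[0]
        ndim = magic & 0xFF
        shape = struct.unpack(">" + "I" * ndim, handle.read(4 * ndim))
        data = handle.read()
    return shape, data
```

For the MNIST image files the returned shape is (images, 28, 28); pixel <code>(i, r, c)</code> of image <code>i</code> sits at offset <code>i*28*28 + r*28 + c</code> in <code>data</code>.<br />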
<br />
=== MPI_SINTEL ===<br />
The MPI Sintel Dataset addresses limitations of existing optical flow benchmarks. It provides naturalistic video sequences that are challenging for current methods. It is designed to encourage research on long-range motion, motion blur, multi-frame analysis, non-rigid motion.<br />
<br />
The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty.<br />
<br />
Sintel is an open-source animated short film produced by Ton Roosendaal and the Blender Foundation. The dataset authors modified the film in many ways to make it useful for optical flow evaluation.<br />
<br />
On SHARCNET, we provide the complete version of this dataset.<br />
<br />
==== Directory Structure ====<br />
The MPI_SINTEL dataset on graham follows the structure below:<br />
<br />
<pre><br />
/datashare/MPI_SINTEL<br />
├── bundler<br />
│ ├── linux-x64<br />
│ ├── osx<br />
│ ├── README_BUNDLER.txt<br />
│ └── win<br />
├── flow_code<br />
│ ├── C<br />
│ └── MATLAB<br />
├── MPI-Sintel-complete.zip<br />
├── README.txt<br />
├── test<br />
│ ├── clean<br />
│ └── final<br />
└── training<br />
├── albedo<br />
├── clean<br />
├── final<br />
├── flow<br />
├── flow_viz<br />
├── invalid<br />
└── occlusions<br />
<br />
18 directories, 3 files<br />
</pre><br />
<br />
For more information about the dataset, check the README file at <code>/datashare/MPI_SINTEL/README.txt</code> or visit http://sintel.is.tue.mpg.de/<br />
<br />
=== SVHN ===<br />
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.<br />
On SHARCNET, we provide the full SVHN dataset at <code>/datashare/SVHN</code> on Graham.<br />
<br />
==== Directory Structure ====<br />
The SVHN dataset folder on graham contains:<br />
<pre><br />
/datashare/SVHN<br />
├── extra<br />
├── extra_32x32.mat<br />
├── extra.tar.gz<br />
├── test<br />
├── test_32x32.mat<br />
├── test.tar.gz<br />
├── train<br />
├── train_32x32.mat<br />
└── train.tar.gz<br />
<br />
3 directories, 6 files<br />
</pre><br />
<br />
The <code>extra</code> folder contains 163,728 PNG images, <code>train</code> 33,402 images, and <code>test</code> 13,068 images. For more information visit http://ufldl.stanford.edu/housenumbers/<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=496
Graham Reference Dataset Repository
2021-10-13T14:21:15Z
<p>Jshleap: /* Directory Structure */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets take up in their project accounts. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [http://www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
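In a job script that amounts to a single line (a minimal sketch; <code>DOWNLOAD_DIR</code> is the variable name the Compute Canada AlphaFold instructions expect):<br />
<br />
```shell
# Point AlphaFold at the shared reference data instead of a private download.
export DOWNLOAD_DIR=/datashare/alphafold
echo "AlphaFold data dir: $DOWNLOAD_DIR"
```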
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the features newly enabled with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database files (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code> and skip the copy, but it might be slower than having the database on the local disk.<br />
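The tar-pipe staging pattern is generic and can be tried anywhere; below is a self-contained sketch using temporary directories in place of <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>, with made-up file names (<code>nr.00.phr</code>, <code>nr.00.pin</code>) standing in for a real pre-formatted database:<br />
<br />
```shell
# Stand-ins for the shared mount and the node-local scratch space.
src=$(mktemp -d)   # plays the role of /datashare/BLASTDB
dst=$(mktemp -d)   # plays the role of $SLURM_TMPDIR

# Fake a multi-file pre-formatted database.
printf 'seq-data' > "$src/nr.00.phr"
printf 'idx-data' > "$src/nr.00.pin"

# cd into the source first so the files land directly under $dst,
# without a datashare/BLASTDB/ prefix being recreated on extraction.
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)

ls "$dst"
```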
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST but includes optimizations at both the database and the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
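The <code><(gunzip -c ...)</code> pieces in the commands above are bash process substitutions: the compressed source is streamed straight into <code>diamond makedb</code> without a decompressed copy ever touching disk. A tiny self-contained illustration of the same construct, using a throwaway file rather than the real databases:<br />
<br />
```shell
# Make a small gzipped "sequence file".
tmp=$(mktemp -d)
printf '>seq1\nMKVL\n' | gzip > "$tmp/toy.faa.gz"

# <(...) presents the decompressed stream as a readable file argument,
# just like <(gunzip -c /datashare/BLAST_FASTA/nr.gz) in makedb above.
first_line=$(head -n 1 <(gunzip -c "$tmp/toy.faa.gz"))
echo "$first_line"
```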
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> and skip the copy, but it might be slower than having the database on the local disk.<br />
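Because each DIAMOND database is a single <code>.dmnd</code> file, staging really is just a <code>cp</code>, optionally followed by an integrity check before a long run. A self-contained sketch, with temporary paths standing in for the real ones:<br />
<br />
```shell
scratch=$(mktemp -d)              # plays the role of $SLURM_TMPDIR
db_src="$scratch/shared_nr.dmnd"  # plays the role of /datashare/DIAMONDDB_2.0.9/nr.dmnd
printf 'dmnd-bytes' > "$db_src"

cp "$db_src" "$scratch/nr.dmnd"
# cmp exits non-zero if the copy was truncated or corrupted.
cmp -s "$db_src" "$scratch/nr.dmnd" && echo "copy OK"
```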
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the alignments, annotations, hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, meaning they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to a database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a given database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script is in line with Kraken 2's inspect format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.), as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
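For example, a single accession can be mapped to its taxonomy ID directly from the compressed tables; the <code>accession2taxid</code> files are tab-delimited with columns accession, accession.version, taxid, and gi (the accession below is only a placeholder):<br />

```bash
# Look up the taxid (third column) for one nucleotide accession;
# replace YOUR_ACCESSION with a real accession (without version).
zcat /datashare/NCBI_taxonomy/accession2taxid/nucl_gb.accession2taxid.gz \
    | awk -F'\t' '$1 == "YOUR_ACCESSION" {print $3; exit}'
```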
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
In Compute Canada, we provide the taxonomic manipulation software TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
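As a quick check that the setup works (taxid 9606 is Homo sapiens; the exact output shape may vary by TaxonKit version):<br />

```bash
# Verify the symlinks are in place, then query a lineage.
ls ~/.taxonkit/names.dmp ~/.taxonkit/nodes.dmp
echo 9606 | taxonkit lineage
```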
<br />
=== PANTHER === <br />
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
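Since these classification files are plain tab-delimited text, a single family can be looked up with standard tools (PTHR11258 is the example ID from the format description above):<br />

```bash
# Print the curated name (column 2) for one PANTHER family ID.
awk -F'\t' '$1 == "PTHR11258" {print $2}' \
    /datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications
```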
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway. Some metabolic pathways are provided in BioPAX and SBML format.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for complete proteomes, such as those derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
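A minimal sketch of scoring proteins against this copy of Pfam, assuming the current release keeps the standard <code>Pfam-A.hmm.gz</code> name from the EBI FTP site and that an HMMER module is available (verify both before use):<br />

```bash
# Load HMMER (module name is an assumption; check `module spider hmmer`),
# decompress the profile library to job-local storage, index it, and scan.
module load StdEnv/2020 hmmer
gunzip -c /datashare/PFAM/current_release/Pfam-A.hmm.gz > "$SLURM_TMPDIR/Pfam-A.hmm"
hmmpress "$SLURM_TMPDIR/Pfam-A.hmm"
hmmscan "$SLURM_TMPDIR/Pfam-A.hmm" my_proteins.fasta
```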
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software, and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
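Because the FASTA exports keep the SILVA taxonomy string in each sequence header, quick queries need no special tooling; for example (the genus here is only an illustration, and the header layout should be confirmed against the files themselves):<br />

```bash
# Count SSU Ref NR99 sequences whose taxonomy string contains the
# genus Bacillus (headers look like ">ACC.start.stop Domain;...;Genus;Organism").
zcat /datashare/SILVA/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
    | grep -c '^>.*;Bacillus;'
```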
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
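For example, assuming the <code>idmapping</code> directory keeps UniProt's standard <code>idmapping.dat.gz</code> file (verify the file name on the mount first), cross-references for one accession can be pulled out with standard tools (P05067 is only an illustrative accession):<br />

```bash
# Print the cross-references recorded for one UniProt accession; the
# table format is accession<TAB>database<TAB>identifier. Note that this
# streams through the whole (very large) file.
zcat /datashare/UNIPROT/knowledgebase/idmapping/idmapping.dat.gz \
    | awk -F'\t' '$1 == "P05067"'
```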
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-10 training and test sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
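The batches are already extracted on the mount, so most jobs can read them in place; for heavy random access you may prefer a node-local copy inside a job (the archive name appears in the tree above; <code>$SLURM_TMPDIR</code> is the per-job local scratch):<br />

```bash
# Unpack the Python pickle batches to node-local storage inside a job.
tar -xzf /datashare/CIFAR-10/cifar-10-python.tar.gz -C "$SLURM_TMPDIR"
ls "$SLURM_TMPDIR/cifar-10-batches-py"
```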
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
Within <code>test2017</code>, <code>train2017</code> and <code>val2017</code>, the plain images in JPEG format can be found. All related annotations can be found in the <code>annotations</code> folder.<br />
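Many training scripts expect a local <code>coco</code> directory with exactly this layout; symlinking into scratch avoids copying the images (the target location below is only an example, and the expected layout depends on the script you use):<br />

```bash
# Expose the shared COCO 2017 data under a conventional local layout.
mkdir -p ~/scratch/coco
ln -s /datashare/COCO/annotations ~/scratch/coco/annotations
ln -s /datashare/COCO/train2017   ~/scratch/coco/train2017
ln -s /datashare/COCO/val2017     ~/scratch/coco/val2017
```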
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. <br />
<br />
In SHARCNET we offer a copy of this dataset at <code>/datashare/MNIST</code><br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
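The gzipped files use the simple IDX format described on that page: a big-endian magic number (2051 for image files, 2049 for label files) followed by the dimension sizes. With GNU <code>od</code> you can confirm this without unpacking the whole file:<br />

```bash
# Show the first four 32-bit big-endian integers of the training images:
# magic number, item count, rows, and columns.
gunzip -c /datashare/MNIST/train-images-idx3-ubyte.gz \
    | head -c 16 | od -A n -t d4 --endian=big
```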
<br />
=== MPI_SINTEL ===<br />
The MPI Sintel Dataset addresses limitations of existing optical flow benchmarks. It provides naturalistic video sequences that are challenging for current methods, and is designed to encourage research on long-range motion, motion blur, multi-frame analysis, and non-rigid motion.<br />
<br />
The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty.<br />
<br />
Sintel is an open source animated short film produced by Ton Roosendaal and the Blender Foundation. The film has been modified in many ways to make it useful for optical flow evaluation.<br />
<br />
In SHARCNET we provide the complete version of this dataset.<br />
<br />
==== Directory Structure ====<br />
<br />
=== SVHN ===<br />
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.<br />
In SHARCNET we provide the full SVHN dataset at <code>/datashare/SVHN</code> on Graham.<br />
<br />
==== Directory Structure ====<br />
The SVHN dataset folder on Graham contains:<br />
<pre><br />
/datashare/SVHN<br />
├── extra<br />
├── extra_32x32.mat<br />
├── extra.tar.gz<br />
├── test<br />
├── test_32x32.mat<br />
├── test.tar.gz<br />
├── train<br />
├── train_32x32.mat<br />
└── train.tar.gz<br />
<br />
3 directories, 6 files<br />
</pre><br />
<br />
The <code>extra</code> folder contains 163728 PNG images, <code>train</code> 33402 images, and <code>test</code> 13068 images. For more information, visit http://ufldl.stanford.edu/housenumbers/<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=495
Graham Reference Dataset Repository
2021-10-13T14:20:46Z
<p>Jshleap: /* SVHN */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from the project's [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
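As a quick illustration, the per-population metadata shipped with the releases is plain TSV and can be inspected with standard tools. The snippet below uses a tiny fabricated stand-in file (its column layout is illustrative, not verified against the real file); on Graham you would point at <code>/datashare/1000genomes/phase3/20131219.populations.tsv</code>.

```shell
# Sketch: pull population codes out of a 1000 Genomes metadata TSV.
# populations.tsv is a fabricated two-line stand-in; the real file lives at
# /datashare/1000genomes/phase3/20131219.populations.tsv (exact columns may differ).
printf 'Population Description\tPopulation Code\n' > populations.tsv
printf 'Han Chinese in Beijing, China\tCHB\n' >> populations.tsv
cut -f2 populations.tsv | tail -n +2   # -> CHB
```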
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the 1000 Genomes Project's own README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (see https://docs.computecanada.ca/wiki/AlphaFold). Details about each dataset are available at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this data following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
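In a job script that follows those instructions, this amounts to pointing the database location at the shared mount instead of downloading your own copy; a minimal sketch (how <code>DOWNLOAD_DIR</code> is consumed downstream depends on the Compute Canada run script, which is not reproduced here):

```shell
# Sketch: use the shared AlphaFold data instead of a private download.
# DOWNLOAD_DIR is the variable named in the Compute Canada AlphaFold
# instructions; the run script reading it is assumed, not shown.
DOWNLOAD_DIR=/datashare/alphafold
export DOWNLOAD_DIR
echo "AlphaFold databases: $DOWNLOAD_DIR"
```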
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features in the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases in a flat layout, without subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR,<br />
# keeping the relative file names so -db ${SLURM_TMPDIR}/nr resolves<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, skipping the copy, but this might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
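These gzipped FASTA files can be streamed without decompressing them to disk, in the same spirit as the <code>gunzip -c</code> process substitutions used for the DIAMOND databases below. A small sketch (using a tiny two-record stand-in for <code>swissprot.gz</code> so the commands run anywhere; on Graham, use <code>/datashare/BLAST_FASTA/swissprot.gz</code>):

```shell
# Sketch: count sequences in a gzipped FASTA without writing the
# decompressed file to disk. swissprot.gz here is a fabricated stand-in.
printf '>sp|P1|A\nMKV\n>sp|P2|B\nMGL\n' | gzip > swissprot.gz
zcat swissprot.gz | grep -c '^>'   # -> 2
```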
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software levels. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which makes <code>cp</code> more efficient than <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, skipping the copy, but this might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of orthology and functional annotation hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and extends that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, i.e. the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
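The per-taxon files are ordinary gzipped TSVs and can be inspected directly with shell tools. A hedged sketch (the stand-in file and its column layout below are illustrative only, not EggNOG's documented schema; on Graham the real files live under <code>/datashare/EggNog/per_tax_level/&lt;taxid&gt;/</code>):

```shell
# Sketch: peek at one per-taxon annotations table. The file created here is a
# fabricated stand-in; its columns are illustrative, not EggNOG's actual layout.
printf '33208\tKOG0001\tS\tubiquitin-like protein\n' | gzip > 33208_annotations.tsv.gz
zcat 33208_annotations.tsv.gz | cut -f2   # -> KOG0001
```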
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer; the k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only; that is, they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a minimizer that was never inserted. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to a database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's inspect format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
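If you need to post-process this output programmatically, the tab-separated inspect/report layout is easy to parse. Below is a minimal sketch in plain Python (no Kraken-specific libraries); the column layout is taken from the Kraken 2 manual, and the example record is the <code>Carcharodon</code> output shown above:<br />
<br />
```python
from typing import NamedTuple

class InspectRow(NamedTuple):
    percent: float       # % of minimizers/reads in the clade
    clade_count: int     # count in this clade (taxon plus descendants)
    direct_count: int    # count assigned directly to this taxon
    rank: str            # rank code, e.g. G (genus), S (species)
    taxid: int           # NCBI taxonomy ID
    name: str            # scientific name (indentation stripped)

def parse_inspect_line(line: str) -> InspectRow:
    """Parse one tab-separated line of Kraken 2 inspect/report output."""
    percent, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")[:6]
    return InspectRow(float(percent), int(clade), int(direct),
                      rank, int(taxid), name.lstrip())

row = parse_inspect_line("0.03\t569792\t569792\tS\t13397\t  Carcharodon carcharias")
print(row.taxid, row.name)  # 13397 Carcharodon carcharias
```
<br />
This is only a sketch for the standard six-column report; files produced with extra reporting options may carry additional columns.<br />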
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It is updated alongside the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
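The <code>.dmp</code> files follow the delimiter convention described in <code>taxdump_readme.txt</code>: fields are separated by tab-pipe-tab and each line ends with tab-pipe. As a minimal sketch (assuming the four-column layout of <code>names.dmp</code>: tax_id, name_txt, unique name, name class), you could pull the scientific names into a Python dict like this:<br />
<br />
```python
def parse_dmp_line(line: str) -> list:
    """Split one line of an NCBI taxonomy .dmp file.

    Fields are delimited by "<tab>|<tab>" and each line ends with
    "<tab>|", as described in taxdump_readme.txt.
    """
    return line.rstrip("\n").removesuffix("\t|").split("\t|\t")

def load_scientific_names(path: str) -> dict:
    """Map taxid -> scientific name from a names.dmp-style file."""
    names = {}
    with open(path) as fh:
        for line in fh:
            taxid, name_txt, _unique, name_class = parse_dmp_line(line)[:4]
            if name_class == "scientific name":
                names[int(taxid)] = name_txt
    return names
```
<br />
For example, <code>load_scientific_names("/datashare/NCBI_taxonomy/names.dmp")</code>; note the full file is large, so expect the resulting dict to hold millions of entries.<br />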
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, we provide the taxonomic manipulation software TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular function, biological process, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6; ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular component: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
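As a minimal sketch of reading one of these files into Python, the snippet below splits each tab-delimited row into a dict; the column names are paraphrased from the list above and the padding assumes trailing empty columns may be dropped:<br />
<br />
```python
PANTHER_COLUMNS = [
    "panther_id", "name", "molecular_function", "biological_process",
    "cellular_component", "protein_class", "pathway",
]

def read_panther_classifications(path: str):
    """Yield one dict per row of a PANTHER*_HMM_classifications file.

    The files are tab-delimited with the seven columns listed above;
    short rows are padded so every dict carries all keys.
    """
    with open(path) as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            row += [""] * (len(PANTHER_COLUMNS) - len(row))
            yield dict(zip(PANTHER_COLUMNS, row))
```
<br />
For example, <code>read_panther_classifications("/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications")</code> would iterate over all families in that release.<br />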
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or Paup, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
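The <code>*_tax_silva</code> FASTA exports use headers of the form accession.start.stop followed by a semicolon-separated taxonomic path (check <code>README.txt</code> in the dataset to confirm this for your release). A small sketch for splitting such a header, using a made-up accession for illustration:<br />
<br />
```python
def parse_silva_header(header: str):
    """Split a SILVA *_tax_silva FASTA header.

    Assumes the layout "accession.start.stop taxpath;...;organism":
    the identifier runs up to the first space, followed by a
    semicolon-separated taxonomy string.
    """
    ident, _, taxstring = header.lstrip(">").rstrip("\n").partition(" ")
    taxa = [t for t in taxstring.split(";") if t]
    return ident, taxa

# X00000.1.1500 is a hypothetical record, not a real SILVA entry
ident, taxa = parse_silva_header(
    ">X00000.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria")
```
<br />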
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 training and test sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
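The <code>cifar-10-batches-py</code> files are Python 2 pickles, so in Python 3 they must be loaded with <code>encoding="bytes"</code>, as described on the CIFAR page. A minimal sketch (each <code>b'data'</code> row is 3072 uint8 values: 1024 per channel in R, G, B order):<br />
<br />
```python
import pickle

def load_cifar_batch(path: str):
    """Load one CIFAR-10 python batch file (e.g. data_batch_1).

    Each batch unpickles to a dict with byte-string keys; b'data'
    holds 10000 rows of 3072 uint8 values and b'labels' the
    matching class indices.
    """
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")
    return batch[b"data"], batch[b"labels"]
```
<br />
For example, <code>data, labels = load_cifar_batch("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")</code>.<br />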
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
The plain JPEG images can be found within the test, train, and val folders. All related annotations can be found in the <code>annotations</code> folder.<br />
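The annotation files use the standard COCO JSON layout (top-level <code>images</code>, <code>annotations</code>, and <code>categories</code> lists, with each annotation referring to its image via <code>image_id</code>). A short standard-library sketch for grouping annotations by image:<br />
<br />
```python
import json
from collections import defaultdict

def index_coco_annotations(path: str):
    """Group COCO-format annotations by the image they belong to.

    Returns (images, by_image): images maps image id -> file name,
    by_image maps image id -> list of annotation dicts.
    """
    with open(path) as fh:
        coco = json.load(fh)
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return images, dict(by_image)
```
<br />
For instance, pointing it at one of the files under <code>/datashare/COCO/annotations</code> (typically named like <code>instances_val2017.json</code>) gives you a per-image annotation index.<br />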
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.<br />
<br />
At SHARCNET, we offer a copy of this dataset at <code>/datashare/MNIST</code>.<br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
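The <code>*-ubyte.gz</code> files use the IDX format described on the MNIST page: a 4-byte big-endian magic number (whose last byte is the number of dimensions), one 4-byte big-endian size per dimension, then the raw unsigned bytes. A dependency-free sketch of a reader:<br />
<br />
```python
import gzip
import struct

def read_idx(path: str):
    """Read an IDX file (optionally gzip-compressed), such as the
    MNIST *-ubyte(.gz) files; returns (dims, raw_bytes)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        magic, = struct.unpack(">I", fh.read(4))
        ndim = magic & 0xFF  # last byte of the magic = dimension count
        dims = struct.unpack(">" + "I" * ndim, fh.read(4 * ndim))
        data = fh.read()
    return dims, data
```
<br />
Reading <code>/datashare/MNIST/train-images-idx3-ubyte.gz</code> this way should yield dims of (60000, 28, 28), matching the training-set description above.<br />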
<br />
=== MPI_SINTEL ===<br />
The MPI Sintel Dataset addresses limitations of existing optical flow benchmarks. It provides naturalistic video sequences that are challenging for current methods. It is designed to encourage research on long-range motion, motion blur, multi-frame analysis, and non-rigid motion.<br />
<br />
The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty.<br />
<br />
Sintel is an open source animated short film produced by Ton Roosendaal and the Blender Foundation. The dataset's authors modified the film in many ways to make it useful for optical flow evaluation.<br />
<br />
At SHARCNET, we provide the complete version of this dataset.<br />
<br />
==== Directory Structure ====<br />
<br />
=== SVHN ===<br />
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.<br />
On Graham, we provide the full SVHN dataset at <code>/datashare/SVHN</code>.<br />
<br />
==== Directory Structure ====<br />
The SVHN dataset folder on Graham contains:<br />
<pre><br />
/datashare/SVHN<br />
├── extra<br />
├── extra_32x32.mat<br />
├── extra.tar.gz<br />
├── housenumbers<br />
├── test<br />
├── test_32x32.mat<br />
├── test.tar.gz<br />
├── train<br />
├── train_32x32.mat<br />
└── train.tar.gz<br />
<br />
4 directories, 6 files<br />
</pre><br />
<br />
The <code>extra</code> folder contains 163728 PNG images, <code>train</code> 33402 images, and <code>test</code> 13068 images. For more information, visit http://ufldl.stanford.edu/housenumbers/<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=494
Graham Reference Dataset Repository
2021-10-13T14:05:09Z
<p>Jshleap: /* MPI_SINTEL */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions in https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
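For example, in a job script this could look like the following minimal sketch (the variable name comes from the linked instructions; where the export goes in your script is up to you):<br />

```shell
# Point AlphaFold's data directory at the shared mount instead of a
# per-user download, per the Compute Canada AlphaFold instructions.
export DOWNLOAD_DIR=/datashare/alphafold
```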
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR;<br />
# the subshells keep the job's working directory unchanged<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that the nr database is the one required.<br />
<br />
You can also point <code>blastp</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
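To illustrate the tar-streaming copy used in the sbatch example, here is a minimal runnable sketch with throwaway directories standing in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code> (all paths and file names below are hypothetical stand-ins):<br />

```shell
src=$(mktemp -d)    # stand-in for /datashare/BLASTDB
dest=$(mktemp -d)   # stand-in for $SLURM_TMPDIR
# create dummy "nr" volume files, mimicking the flat BLASTDB layout
touch "$src"/nr.00.phr "$src"/nr.00.pin "$src"/nr.00.psq
# stream all matching files through a single tar pipe instead of copying
# them one by one; the subshells leave the current directory unchanged
(cd "$src" && tar cf - nr.*) | (cd "$dest" && tar xf -)
ls "$dest"
```

Streaming through tar is useful when a database consists of many volumes, since it performs one sequential pass instead of many per-file copies.<br />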
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
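These gzip files can be inspected without writing a decompressed copy to disk. A minimal sketch, using a small hypothetical locally created file in place of the multi-gigabyte files above:<br />

```shell
tmp=$(mktemp -d)
# hypothetical stand-in for e.g. /datashare/BLAST_FASTA/swissprot.gz
printf '>seq1 demo\nMKV\n>seq2 demo\nGGA\n' | gzip > "$tmp/swissprot.gz"
# peek at the first record without decompressing the whole file
zcat "$tmp/swissprot.gz" | head -n 2
# count the sequences (FASTA headers start with '>')
zcat "$tmp/swissprot.gz" | grep -c '^>'
```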
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database level and the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
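The <code>makedb</code> commands above rely on bash process substitution, <code><(...)</code>, which presents a decompressed stream as a file path so the .gz archives never need to be unpacked onto disk. A minimal runnable sketch of the same mechanism, with <code>grep</code> standing in for <code>diamond</code> and a hypothetical demo file:<br />

```shell
# requires bash: <(...) is a bash process substitution, not POSIX sh
tmp=$(mktemp -d)
printf '>a\nSEQ\n>b\nSEQ\n' | gzip > "$tmp/demo.faa.gz"
# grep reads the decompressed stream through a file-like path,
# exactly as diamond makedb reads <(gunzip -c nr.gz) above
grep -c '^>' <(gunzip -c "$tmp/demo.faa.gz")
```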
<br />
==== Considerations when using these databases ====<br />
The Diamond program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sequences (both in length and number). Should the program fail due to running out of either one, you need to set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for copying it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that the nr database is the one required.<br />
<br />
You can also point <code>diamond</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, which means they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's inspect format. See the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
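<br />
If you need to post-process this output programmatically, the six tab-separated columns can be split with a few lines of Python. This is a minimal sketch, assuming the inspect-style layout shown above (percentage, clade count, direct count, rank code, NCBI taxid, indented name); the function name is our own:

```python
def parse_inspect_line(line):
    """Parse one line of Kraken 2 inspect-style output into a dict.

    Assumed columns: percentage, count for the clade, count assigned
    directly, rank code (e.g. G for genus, S for species), NCBI
    taxonomy ID, and the (indented) scientific name.
    """
    pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
    return {
        "percent": float(pct),
        "clade_count": int(clade),
        "direct_count": int(direct),
        "rank": rank,
        "taxid": int(taxid),
        "name": name.strip(),  # names are indented to show the hierarchy
    }

rec = parse_inspect_line("0.03\t569792\t0\tG\t13396\t  Carcharodon")
print(rec["rank"], rec["taxid"], rec["name"])
```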
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as with direct lookups of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
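The <code>.dmp</code> files above (<code>names.dmp</code>, <code>nodes.dmp</code>, and so on) share one delimiter scheme: fields separated by a tab-pipe-tab sequence, with a trailing tab-pipe terminating each line. A minimal Python sketch for splitting such records (the function name and the sample record are our own illustration):

```python
def parse_dmp_line(line):
    """Split one line of an NCBI taxonomy .dmp file (names.dmp,
    nodes.dmp, ...). Fields are separated by "\t|\t" and each line
    ends with "\t|"."""
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

# A names.dmp-style record: taxid, name, unique name, name class.
fields = parse_dmp_line("9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n")
print(fields)
```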
==== Usage with TaxonKit ====<br />
On Compute Canada systems, we provide the taxonomy manipulation tool TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited with the following columns:<br />
# PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
# Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
# Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
# Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
# Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
# Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
# Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
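<br />
As a sketch of how these records can be consumed, the tab-delimited columns above map naturally onto a Python dict. The column keys and the sample record below are our own illustration, not taken from the PANTHER files:

```python
COLUMNS = ["panther_id", "name", "molecular_function",
           "biological_process", "cellular_component",
           "protein_class", "pathway"]

def parse_classification(line):
    """Parse one tab-delimited PANTHER HMM classification record
    into a dict keyed by the seven documented columns."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(COLUMNS) - len(values))  # pad missing trailing columns
    return dict(zip(COLUMNS, values))

# Hypothetical record: a subfamily with no GO or pathway annotations.
rec = parse_classification("PTHR12213:SF6\tEXAMPLE FAMILY\t\t\t\t\t")
print(rec["panther_id"], rec["name"])
```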
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for complete proteomes, such as those derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs, like Phylip or PAUP, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
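<br />
The files under <code>cifar-10-batches-py</code> are Python pickles; each batch holds a <code>data</code> array of 10000x3072 uint8 values (3 channels x 32 x 32) and a matching <code>labels</code> list. A minimal loading sketch, assuming that layout (the helper name is ours; requires NumPy):

```python
import pickle

import numpy as np

def load_batch(path):
    """Load one CIFAR-10 python batch: a pickled dict whose b'data'
    entry is an (N, 3072) uint8 array and whose b'labels' entry is a
    list of N class indices. Returns (N, 3, 32, 32) images."""
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")  # keys are bytes
    images = np.asarray(batch[b"data"], dtype=np.uint8).reshape(-1, 3, 32, 32)
    labels = np.asarray(batch[b"labels"])
    return images, labels

# e.g. load_batch("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")
```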
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
The <code>test2017</code>, <code>train2017</code>, and <code>val2017</code> directories contain the plain images in JPEG format; all related annotations can be found in the <code>annotations</code> folder.<br />
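<br />
A minimal Python sketch for reading a COCO-style annotation file from the <code>annotations</code> folder. The helper name and the example file name are our own, and the exact annotation file names on the mount may differ:

```python
import json

def category_names(annotation_file):
    """Map category id -> name from a COCO-style annotation file.

    A COCO annotation JSON holds (among others) 'images',
    'annotations' and 'categories' lists; each category entry has an
    'id', a 'name' and a 'supercategory'.
    """
    with open(annotation_file) as fh:
        coco = json.load(fh)
    return {c["id"]: c["name"] for c in coco["categories"]}

# e.g. category_names("/datashare/COCO/annotations/instances_val2017.json")
```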
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.<br />
<br />
At SHARCNET, we offer a copy of this dataset located at <code>/datashare/MNIST</code>.<br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
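<br />
The gzipped <code>*-ubyte.gz</code> files use the idx format described on that page: a big-endian header followed by the raw data bytes. A minimal sketch for the image files (the helper name is ours):

```python
import gzip
import struct

def read_idx_images(path):
    """Read a gzipped MNIST idx3-ubyte image file.

    The header is four big-endian 32-bit integers: a magic number
    (2051 for image files), the image count, rows and columns; the
    raw pixel bytes follow in row-major order.
    """
    with gzip.open(path, "rb") as fh:
        magic, count, rows, cols = struct.unpack(">IIII", fh.read(16))
        if magic != 2051:
            raise ValueError("not an idx3-ubyte image file")
        pixels = fh.read(count * rows * cols)
    return count, rows, cols, pixels

# e.g. read_idx_images("/datashare/MNIST/train-images-idx3-ubyte.gz")
```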
<br />
=== MPI_SINTEL ===<br />
The MPI Sintel Dataset addresses limitations of existing optical flow benchmarks. It provides naturalistic video sequences that are challenging for current methods. It is designed to encourage research on long-range motion, motion blur, multi-frame analysis, and non-rigid motion.<br />
<br />
The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty.<br />
<br />
Sintel is an open-source animated short film produced by Ton Roosendaal and the Blender Foundation. The dataset's authors modified the film in many ways to make it useful for optical flow evaluation.<br />
<br />
At SHARCNET, we provide the complete version of this dataset.<br />
<br />
==== Directory Structure ====<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=493
Graham Reference Dataset Repository
2021-10-13T14:01:18Z
<p>Jshleap: /* MNIST */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access them.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset, follow the instructions at https://docs.computecanada.ca/wiki/AlphaFold and set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
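For example, in your job script this can be set as follows (a minimal sketch; <code>DOWNLOAD_DIR</code> is the variable name used by the linked instructions):<br />

```shell
# point AlphaFold at the shared reference data instead of downloading your own copy
export DOWNLOAD_DIR=/datashare/alphafold
```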
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST+ and dependencies<br />
# copy the required database files (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point BLAST directly at <code>/datashare/BLASTDB/nr</code> without copying, but it might be slower than having the database on local disk.<br />
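If you want to see how the <code>tar</code> pipe in the script above behaves before spending job time, you can rehearse it locally with dummy files standing in for the <code>nr.*</code> database volumes (the file names below are made up for illustration):<br />

```shell
# set up throwaway source and destination directories
srcdir=$(mktemp -d) && dstdir=$(mktemp -d)
touch "$srcdir"/nr.00.phr "$srcdir"/nr.00.pin "$srcdir"/nr.00.psq
# stream the files from source to destination, preserving relative names
(cd "$srcdir" && tar cf - nr.*) | (cd "$dstdir" && tar xf -)
ls "$dstdir"
```

The two subshells keep your working directory unchanged, which is why relative paths such as <code>myquery.fasta</code> still resolve afterwards.<br />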
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository] in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and software levels. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
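The <code><(gunzip -c ...)</code> construct in the commands above is bash process substitution: it streams the decompressed FASTA into <code>diamond makedb</code> without writing a large intermediate file to disk. The mechanics can be tried with any command, for example:<br />

```shell
# compress a small file, then read it back through process substitution
tmp=$(mktemp)
printf 'hello\n' | gzip > "$tmp"
cat <(gzip -dc "$tmp")   # prints: hello
```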
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sequences (both in length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point DIAMOND directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> without copying, but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> files with the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and the k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, meaning they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence-scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database you are able to query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a given database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script is in line with the inspect format. You can check out the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, Kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomy toolkit called TaxonKit is available; you can load it with <code>module load StdEnv/2020 taxonkit</code>. It expects the NCBI taxonomy files in a particular location. To set it up with this datashare, simply add symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
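For example, after linking the <code>.dmp</code> files, resolving the full lineage of a taxid is a one-liner (9606 is the NCBI taxid for ''Homo sapiens''). The guard only lets the sketch degrade gracefully where the module is not loaded:<br />

```shell
# taxonkit reads taxids from stdin and looks in ~/.taxonkit by default.
if command -v taxonkit >/dev/null 2>&1; then
  echo 9606 | taxonkit lineage
  tk_note="ran taxonkit"
else
  tk_note="taxonkit not found; run: module load StdEnv/2020 taxonkit"
fi
echo "$tk_note"
```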
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
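Because the classification files are plain tab-delimited text, standard Unix tools work on them directly. A sketch on a mock one-row file in the format above (the family name is invented); with the real data, point <code>awk</code> at e.g. <code>/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications</code>:<br />

```shell
# Mock row in the 7-column classification format
# (ID, name, molecular function, biological process, cellular component,
#  protein class, pathway).
printf 'PTHR11258\tEXAMPLE FAMILY NAME\tmf\tbp\tcc\tpc\tpw\n' > mock_cls.tsv

# Extract the PANTHER ID and its curated name (columns 1 and 2).
pair=$(awk -F'\t' '{print $1 " - " $2}' mock_cls.tsv)
echo "$pair"   # -> PTHR11258 - EXAMPLE FAMILY NAME
rm -f mock_cls.tsv
```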
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and ''Drosophila melanogaster'' genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
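Pfam's profile HMMs are meant to be searched with the HMMER suite. A guarded sketch: <code>query.fa</code> is a placeholder, and the <code>Pfam-A.hmm</code> path assumes the file under <code>current_release</code> has been decompressed and indexed with <code>hmmpress</code> (both are assumptions, not a description of the mount's layout):<br />

```shell
# Scan a query protein against Pfam-A with HMMER's hmmscan.
# query.fa and the decompressed/pressed Pfam-A.hmm are placeholders.
if command -v hmmscan >/dev/null 2>&1; then
  hmmscan --tblout pfam_hits.tbl /datashare/PFAM/current_release/Pfam-A.hmm query.fa
  hs_note="ran hmmscan"
else
  hs_note="hmmscan not found; load an HMMER module first"
fi
echo "$hs_note"
```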
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned, up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
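The FASTA exports are gzipped, so quick sanity checks can stream through <code>gzip -dc</code> (or <code>zcat</code>). A sketch counting records in a mock two-sequence file; with the real data, substitute e.g. <code>/datashare/SILVA/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz</code>:<br />

```shell
# Build a tiny mock gzipped FASTA (headers invented) and count its records.
printf '>SEQ1 Bacteria;Proteobacteria\nACGU\n>SEQ2 Archaea;Crenarchaeota\nGGCU\n' \
  | gzip > mock_silva.fasta.gz

# Each record starts with '>', so counting those lines counts sequences.
nseq=$(gzip -dc mock_silva.fasta.gz | grep -c '^>')
echo "$nseq"   # -> 2
rm -f mock_silva.fasta.gz
```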
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
The <code>test2017</code>, <code>train2017</code>, and <code>val2017</code> directories contain the plain JPEG images; all related annotations can be found in the <code>annotations</code> folder.<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.<br />
<br />
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.<br />
<br />
On SHARCNET we offer a copy of this dataset, located at <code>/datashare/MNIST</code>.<br />
<br />
==== Directory Structure ====<br />
The directory contains the zip file with all training and testing images and labels, as well as the individual gzip files:<br />
<pre><br />
/datashare/MNIST<br />
├── mnist.zip<br />
├── t10k-images-idx3-ubyte.gz<br />
├── t10k-labels-idx1-ubyte.gz<br />
├── train-images-idx3-ubyte.gz<br />
└── train-labels-idx1-ubyte.gz<br />
<br />
0 directories, 5 files<br />
</pre><br />
<br />
For more information about this dataset, please visit http://yann.lecun.com/exdb/mnist/.<br />
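The individual files are in the IDX binary format: a 4-byte big-endian magic number (0x00000803 for idx3-ubyte image files), then dimension sizes, then raw pixel bytes. A sketch verifying the magic number on a mock 4-byte header; with the real data you would pipe <code>gzip -dc /datashare/MNIST/train-images-idx3-ubyte.gz | head -c 4</code> into the same <code>od</code>:<br />

```shell
# Write the idx3-ubyte magic number (octal escapes: 0x00 0x00 0x08 0x03)
# and read it back as hex.
printf '\000\000\010\003' > mock_idx_header
magic=$(od -An -tx1 -N4 mock_idx_header | tr -d ' \n')
echo "$magic"   # -> 00000803
rm -f mock_idx_header
```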
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=488
Graham Reference Dataset Repository
2021-10-07T16:01:36Z
<p>Jshleap: /* COCO */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. These datasets are mounted on <code>/datashare/</code>.<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often relies on reference datasets (commonly referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (see https://docs.computecanada.ca/wiki/AlphaFold for more information). You can find more details about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset, follow the instructions at https://docs.computecanada.ca/wiki/AlphaFold and set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
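Following that instruction, the relevant part of a run script might look like this (a minimal sketch; only <code>DOWNLOAD_DIR</code> is assumed by the AlphaFold wrapper, the other variable names are illustrative, and the subdirectory names come from the tree above):<br />

```shell
# Point AlphaFold's data directory at the shared mount instead of a per-user copy
export DOWNLOAD_DIR=/datashare/alphafold
# Individual databases are then found relative to DOWNLOAD_DIR, e.g.:
PARAMS_DIR="${DOWNLOAD_DIR}/params"   # model weights
BFD_DIR="${DOWNLOAD_DIR}/bfd"         # BFD sequence database
echo "${PARAMS_DIR}"                  # prints /datashare/alphafold/params
```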
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
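For instance, a single FASTA record can be extracted from a pre-formatted database with <code>blastdbcmd</code>. A sketch (the accession <code>P01308</code> is hypothetical, and the command needs the blast+ module loaded on Graham, so it is shown here as an assembled command string):<br />

```shell
# Assemble a blastdbcmd call that would dump one record from the
# pre-formatted swissprot database; run the printed command on Graham
DB=/datashare/BLASTDB/swissprot
echo "blastdbcmd -db ${DB} -entry P01308 -out P01308.fasta"
```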
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and its dependencies<br />
# copy the required database (in this case nr, all of its volume files) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the <code>-db</code> argument, but it might be slower than having the database on the node's local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
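Because the files are gzip-compressed, they can be streamed without writing an uncompressed copy to disk. A minimal sketch, counting sequences with a tiny temporary file standing in for the multi-gigabyte originals:<br />

```shell
# Count sequences in a gzipped FASTA by streaming it; on Graham you would
# replace "$tmp" with e.g. /datashare/BLAST_FASTA/swissprot.gz
tmp=$(mktemp)
printf '>seq1\nMKV\n>seq2\nGHT\n' | gzip > "$tmp"   # two made-up records
gunzip -c "$tmp" | grep -c '^>'                     # prints 2
rm -f "$tmp"
```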
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], they follow the same quarterly update schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we provide here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
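As a sketch, the failing run can be repeated with a smaller block size. <code>-b</code> takes the block size in billions of sequence letters, and 0.5 below is an illustrative lower value; since running DIAMOND requires the module and a database, the command is shown here as an assembled string:<br />

```shell
# Build a diamond call with a reduced block size (-b, billions of letters)
# to cut memory and temporary-disk use; 0.5 is an illustrative value.
# ${SLURM_TMPDIR:-/tmp} falls back to /tmp outside a Slurm job.
B=0.5
echo "diamond blastp -d ${SLURM_TMPDIR:-/tmp}/nr -q myquery.fasta -o out.tsv -b ${B}"
```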
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just as with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the <code>-d</code> argument, but it might be slower than having the database on the node's local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a biological database hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them you can find the files <code>*_annotations.tsv.gz, *_hmms.tar, *_hmms.tar.gz, *_members.tsv.gz, *_raw_algs.tar, *_stats.tsv, *_trees.tsv.gz, *_trimmed_algs.tar</code>, containing the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and these k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only; that is, they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database you are able to query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows the format of Kraken 2's inspect report. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct lookups of accession numbers, taxonomic IDs, and related information. It is updated along with the BLAST databases.<br />
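For instance, an accession number can be mapped to its taxonomic ID straight from the <code>accession2taxid</code> tables (columns: accession, accession.version, taxid, gi). A sketch with two made-up rows; on Graham you would stream the real table with <code>zcat /datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz</code> instead of the <code>printf</code>:<br />

```shell
# Look up the TaxID (column 3) for an accession in a tab-separated
# accession2taxid-style table; the two sample rows are made up
printf 'P12345\tP12345.1\t9986\t71579\nQ99999\tQ99999.2\t9606\t12345\n' |
awk -F'\t' '$1 == "Q99999" {print $3}'   # prints 9606
```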
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
In Compute Canada systems, a taxonomic manipulation tool called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files in a particular location; to set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
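The <code>*.dmp</code> files linked above use the standard NCBI taxdump layout described in <code>taxdump_readme.txt</code>: fields separated by <code>&lt;tab&gt;|&lt;tab&gt;</code>, rows terminated by <code>&lt;tab&gt;|</code>. As a minimal sketch, they can also be parsed directly in Python (the sample row below is illustrative):<br />

```python
def parse_dmp_line(line):
    """Split one NCBI taxdump row into its fields.

    Rows look like: tax_id<tab>|<tab>name<tab>|<tab>...<tab>|
    """
    # Strip the trailing field terminator, then split on the delimiter.
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

# Illustrative names.dmp-style row (tax_id, name_txt, unique name, name class)
row = "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|"
print(parse_dmp_line(row))  # ['9606', 'Homo sapiens', '', 'scientific name']
```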
<br />
=== PANTHER === <br />
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project], designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16 of the PANTHER HMM library. They contain the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily in each version.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
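As an illustration, one such tab-delimited row can be mapped onto the seven columns with plain Python (the sample values below are made up, standing in for a real line from the classification files):<br />

```python
COLUMNS = ["panther_id", "name", "molecular_function",
           "biological_process", "cellular_components",
           "protein_class", "pathway"]

def parse_classification(line):
    """Map one tab-delimited PANTHER classification row to named columns."""
    values = line.rstrip("\n").split("\t")
    # Pad in case trailing empty columns were dropped from the line.
    values += [""] * (len(COLUMNS) - len(values))
    return dict(zip(COLUMNS, values))

# Hypothetical row: subfamily ID, curated name, one GO slim term, rest empty
rec = parse_classification(
    "PTHR12213:SF6\tEXAMPLE FAMILY\tcatalytic activity#GO:0003824\t\t\t\t")
print(rec["panther_id"], "->", rec["name"])  # PTHR12213:SF6 -> EXAMPLE FAMILY
```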
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam PFAM ftp]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned, up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as PHYLIP or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code> or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 test and training sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
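The files under <code>cifar-10-batches-py</code> are pickled Python dictionaries with byte-string keys (<code>b'data'</code>: one flattened 32x32 RGB image per row; <code>b'labels'</code>: integers 0-9), as documented on the CIFAR website. A minimal loader sketch:<br />

```python
import pickle

def unpickle(path):
    """Load one CIFAR python batch: a pickled dict with byte-string keys."""
    with open(path, "rb") as fh:
        # encoding="bytes" keeps the Python 2 era keys as b'data', b'labels'
        return pickle.load(fh, encoding="bytes")

# Example usage on Graham:
#   batch = unpickle("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")
#   images = batch[b"data"]    # 10000 x 3072 uint8, channel-major RGB rows
#   labels = batch[b"labels"]  # list of 10000 integers in 0..9
```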
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 test and training sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
[https://cocodataset.org COCO] is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:<br />
<br />
* Object segmentation<br />
* Recognition in context<br />
* Superpixel stuff segmentation<br />
* 330K images (>200K labeled)<br />
* 1.5 million object instances<br />
* 80 object categories<br />
* 91 stuff categories<br />
* 5 captions per image<br />
* 250,000 people with keypoints<br />
<br />
SHARCNET provides the 2017 release of the COCO dataset.<br />
<br />
==== Directory Structure ====<br />
The COCO dataset is provided following the structure explained in https://cocodataset.org/#download:<br />
<br />
<pre><br />
/datashare/COCO<br />
├── annotations<br />
├── test2017<br />
├── train2017<br />
└── val2017<br />
<br />
4 directories, 0 files<br />
</pre><br />
<br />
Within the test2017, train2017 and val2017 directories the plain images in JPEG format can be found. All related annotations can be found in the <code>annotations</code> folder.<br />
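The annotation files are standard COCO JSON, with <code>images</code>, <code>annotations</code> and <code>categories</code> lists. A minimal sketch of grouping annotations by image (the tiny in-memory record below is illustrative, standing in for <code>json.load</code> on one of the annotation files):<br />

```python
from collections import defaultdict

def annotations_by_image(coco):
    """Group a COCO-format dict's annotations by their image id."""
    by_image = defaultdict(list)
    for ann in coco.get("annotations", []):
        by_image[ann["image_id"]].append(ann)
    return by_image

# Illustrative stand-in for json.load() of a file under /datashare/COCO/annotations
coco = {
    "images": [{"id": 1, "file_name": "example.jpg"}],
    "annotations": [{"image_id": 1, "category_id": 62,
                     "bbox": [236.9, 142.5, 24.7, 69.5]}],
    "categories": [{"id": 62, "name": "chair"}],
}
print(len(annotations_by_image(coco)[1]))  # 1
```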
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=487
Graham Reference Dataset Repository
2021-10-07T15:55:04Z
<p>Jshleap: /* PFAM */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access them.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis result sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
* /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use these datasets following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
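For example, in a job script this can be a sketch like the following (the directory names match the tree above; the actual AlphaFold invocation should follow the Compute Canada instructions):<br />

```shell
# Point AlphaFold at the shared copy of the databases instead of
# downloading the full reference data into your own project space.
export DOWNLOAD_DIR=/datashare/alphafold

# Sanity check: the subdirectories listed in the tree above should be visible
ls "${DOWNLOAD_DIR}/params" "${DOWNLOAD_DIR}/bfd"
```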
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
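For example, a sketch of the third point (the accession <code>P01308</code> is only illustrative):<br />

```shell
# Load BLAST+ (same modules as in the sbatch example below)
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0

# Extract one entry from the pre-formatted swissprot database as FASTA;
# P01308 (human insulin) is just an illustrative accession.
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fa

# Or dump the entire database to FASTA (very large for nr/nt!)
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fa
```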
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases in a single flat directory, without subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR;<br />
# the subshells keep the job's working directory unchanged<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the <code>-db</code> argument, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
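Since these are plain gzipped FASTA files, they can be streamed without decompressing to disk. For example, to count sequences (a self-contained sketch; on Graham the file would be e.g. /datashare/BLAST_FASTA/swissprot.gz):<br />

```shell
# A tiny two-sequence stand-in file, so the command runs anywhere;
# replace demo.fa.gz with the datashare file on the cluster.
printf '>seq1\nMKV\n>seq2\nMAL\n' | gzip > demo.fa.gz

# Count FASTA headers without writing the decompressed file to disk
gunzip -c demo.fa.gz | grep -c "^>"
```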
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
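A hedged sketch of reducing the memory footprint (the values below are illustrative starting points, not tuned recommendations; <code>-b</code> and <code>-c</code> are DIAMOND's <code>--block-size</code> and <code>--index-chunks</code> options):<br />

```shell
# Lowering --block-size (-b) and raising --index-chunks (-c) reduces
# memory use at the cost of speed; DIAMOND's defaults are -b 2.0 -c 4.
diamond blastp \
    -d /datashare/DIAMONDDB_2.0.9/nr \
    -q YOURREADS.fasta \
    -o AN_OUTPUT.tsv \
    -b 1.0 -c 8 \
    --tmpdir ${SLURM_TMPDIR}
```

Keeping <code>--tmpdir</code> on the node-local disk also avoids hammering the shared filesystem with DIAMOND's temporary files.<br />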
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -p ${SLURM_CPUS_PER_TASK} -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the <code>-d</code> argument, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID; in each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, i.e. the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
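For example, a sketch of peeking at the annotations for one taxonomic level (the file name pattern <code>TAXID_annotations.tsv.gz</code> is inferred from the file patterns above):<br />

```shell
# Taxid 2 (Bacteria) is one of the folders listed under per_tax_level
TAXID=2
gunzip -c "/datashare/EggNog/per_tax_level/${TAXID}/${TAXID}_annotations.tsv.gz" | head -n 5
```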
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, meaning they use a compact hash table; with this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
The Kraken 2 databases are provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database you are able to query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
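A fuller sketch that also uses the confidence scoring mentioned above (the 0.1 threshold is only an illustrative starting point):<br />

```shell
# --confidence: fraction of k-mers that must support a call (0.1 = 10%),
#               which compensates for the small chance of a wrong LCA
# --report: per-taxon summary; --use-names: human-readable taxa in output
kraken2 --db /datashare/kraken2_dbs/eukaryota \
        --threads ${SLURM_CPUS_PER_TASK} \
        --confidence 0.1 \
        --report test_report.txt \
        --use-names \
        test.fa
```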
<br />
For your convenience, we provide a simple script to query if your specific taxa is available in the database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>] [-d <database>] [-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script is in line with the inspect format. You can check out the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
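For instance, a quick sketch of a direct lookup in the dump files (names.dmp uses NCBI's standard pipe-delimited layout):<br />

```shell
# Find the taxonomic ID for a scientific name in names.dmp
# (fields are pipe-delimited: taxid | name | unique name | name class)
grep -w "Homo sapiens" /datashare/NCBI_taxonomy/names.dmp | head -n 3
```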
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada clusters, we have a taxonomic manipulation toolkit called TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
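For example (a sketch; <code>taxonkit lineage</code> reads taxonomic IDs from standard input):<br />

```shell
# Resolve the full lineage of taxid 9606 (Homo sapiens)
echo 9606 | taxonkit lineage

# Map a scientific name back to its taxid
echo "Homo sapiens" | taxonkit name2taxid
```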
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, in the following format:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways that have been assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/panther_library</code>.<br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information, check the README file at <code>/datashare/PANTHER/sequence_classifications</code>.<br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [https://ftp.ebi.ac.uk/pub/databases/Pfam Pfam FTP site]:<br />
<br />
<pre><br />
/datashare/PFAM<br />
├── AntiFam<br />
├── current_release<br />
├── database_files<br />
├── mappings<br />
├── papers<br />
├── proteomes<br />
├── releases<br />
├── Tools<br />
└── vm<br />
<br />
9 directories, 0 files<br />
</pre><br />
<br />
For more information about the structure of their FTP and this dataset, please visit https://pfam-docs.readthedocs.io/en/latest/ftp-site.html.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB and FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as PHYLIP or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
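Every SILVA export in this dataset ships with an <code>.md5</code> companion file. A quick way to verify a copy is shown below, using a self-contained dummy file rather than a real SILVA export:<br />
<br />
```shell
# Make a dummy file standing in for e.g. a SILVA_138.1_*.fasta.gz
# export, record its checksum, then verify it the same way you would
# after copying a file out of /datashare/SILVA.
printf 'example data\n' > SILVA_example.fasta.gz
md5sum SILVA_example.fasta.gz > SILVA_example.fasta.gz.md5
md5sum -c SILVA_example.fasta.gz.md5
```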
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code> or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-10 training and test sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
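A common pattern (a sketch, not site policy) is to extract one of the tarballs above into node-local storage inside a job rather than duplicating the dataset in your project space. The stand-in below builds a dummy tarball so the commands are self-contained:<br />
<br />
```shell
# Stand-in for /datashare/CIFAR-10/cifar-10-python.tar.gz: build a
# dummy tarball, then extract it into a scratch directory the way you
# would extract the real one into $SLURM_TMPDIR inside a job.
mkdir -p src/cifar-10-batches-py dest
touch src/cifar-10-batches-py/data_batch_1
tar -czf cifar-10-python.tar.gz -C src cifar-10-batches-py
tar -xzf cifar-10-python.tar.gz -C dest
ls dest/cifar-10-batches-py
```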
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=486
Graham Reference Dataset Repository
2021-10-07T15:46:37Z
<p>Jshleap: /* PFAM */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
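For example, in a job script (the variable name follows the Compute Canada AlphaFold instructions):<br />
<br />
```shell
# Point the AlphaFold run at the shared databases instead of a
# private download.
export DOWNLOAD_DIR=/datashare/alphafold
echo "AlphaFold databases: $DOWNLOAD_DIR"
```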
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
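For example, the third point above can be done directly against this mount (illustrative commands; the accession <code>P01308</code> is only an example and not guaranteed to be in the current database):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
# Dump a single entry from the pre-formatted swissprot database as FASTA<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta<br />
# Or dump every sequence in the database<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta<br />
</syntaxhighlight><br />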
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the features newly enabled with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) && # copy the required database volumes (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code> instead of copying, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository] in compressed (gzip) format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
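Because these files are gzip-compressed, they can be streamed directly without decompressing them to disk. For example (illustrative commands, not a required workflow):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Count the sequences in swissprot without decompressing to disk<br />
zcat /datashare/BLAST_FASTA/swissprot.gz | grep -c '^>'<br />
# Peek at the first few headers<br />
zcat /datashare/BLAST_FASTA/pdbaa.gz | grep '^>' | head -n 5<br />
</syntaxhighlight><br />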
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST but includes optimizations both at the database level and at the software level. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
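For instance, the block size can be reduced like this (the values <code>-b 2.0</code> and <code>-c 1</code> are illustrative starting points, not recommendations for every workload):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Lower the sequence block size (-b, in billions of letters) and the<br />
# number of index chunks (-c) to trade speed for a smaller memory footprint<br />
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q myquery.fasta -o hits.tsv -b 2.0 -c 1<br />
</syntaxhighlight><br />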
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -p ${SLURM_CPUS_PER_TASK} # run against the local copy<br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> instead of copying, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a biological database hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, holding the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
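As an illustration, the files for taxid <code>9443</code> (primates, one of the IDs listed above) can be inspected like this (a sketch; the assumption that each file is prefixed with its taxid should be verified with the <code>ls</code> step first):<br />
<br />
<syntaxhighlight lang="bash"><br />
# List what is available for taxid 9443 (primates)<br />
ls /datashare/EggNog/per_tax_level/9443/<br />
# Stream the first few orthologous-group annotations<br />
zcat /datashare/EggNog/per_tax_level/9443/9443_annotations.tsv.gz | head -n 5<br />
</syntaxhighlight><br />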
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, which means they use a compact hash table. With this structure, there is a less than 1% chance of returning the incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
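As with the BLAST and DIAMOND databases, copying the database to local disk first can speed up long runs. A sketch of such a job script follows (<code>kraken2/2.1.1</code> is a placeholder module version, so check <code>module spider kraken2</code>; <code>--threads</code>, <code>--report</code>, and <code>--output</code> are standard Kraken 2 options):<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=64G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 kraken2/2.1.1 # placeholder version<br />
cp -r /datashare/kraken2_dbs/eukaryota ${SLURM_TMPDIR} # copy the database to local disk<br />
kraken2 --db ${SLURM_TMPDIR}/eukaryota --threads ${SLURM_CPUS_PER_TASK} \<br />
        --report test.k2report --output test.k2out test.fa<br />
</syntaxhighlight><br />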
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's <code>inspect</code> format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated alongside the BLAST databases.<br />
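For a quick ad-hoc lookup, the mapping files can be searched directly (illustrative commands; <code>P01308</code> is an arbitrary example accession):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Find the taxid for a protein accession (column 3 of the accession2taxid files)<br />
zgrep -m 1 -w 'P01308' /datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz<br />
# Find the scientific name attached to a taxid in names.dmp<br />
grep -w '^9606' /datashare/NCBI_taxonomy/names.dmp | grep 'scientific name'<br />
</syntaxhighlight><br />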
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada clusters, we have a taxonomic manipulation tool called [https://github.com/shenwei356/taxonkit TaxonKit]. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
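For example, resolving the lineage of a taxid, or going from a name back to its taxid (<code>lineage</code> and <code>name2taxid</code> are TaxonKit subcommands):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Full lineage for Homo sapiens (taxid 9606)<br />
echo 9606 | taxonkit lineage<br />
# Reverse lookup: from a name to its taxid<br />
echo "Homo sapiens" | taxonkit name2taxid<br />
</syntaxhighlight><br />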
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways that have been assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
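Given this layout, a single family's annotation can be pulled out with standard tools (an illustrative one-liner; <code>PTHR11258</code> is the example ID from the column description above):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Print the curated name (column 2) for family PTHR11258<br />
awk -F'\t' '$1 == "PTHR11258" {print $2}' /datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications<br />
</syntaxhighlight><br />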
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions [https://en.wikipedia.org/wiki/Pfam 1].<br />
<br />
On SHARCNET we provide the latest version of the PFAM database.<br />
<br />
==== Directory Structure ====<br />
We follow the structure of the [ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release PFAM ftp].<br />
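The Pfam HMM library can be searched with HMMER once it is uncompressed and pressed. The following is a sketch only: the location of <code>Pfam-A.hmm.gz</code> inside the mount and the module name are assumptions, so check the directory contents and <code>module spider hmmer</code> first.<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 hmmer # HMMER provides hmmpress/hmmscan<br />
cp /datashare/PFAM/Pfam-A.hmm.gz ${SLURM_TMPDIR} # assumed location; verify with ls /datashare/PFAM<br />
cd ${SLURM_TMPDIR} && gunzip Pfam-A.hmm.gz<br />
hmmpress Pfam-A.hmm # build the binary indices hmmscan needs<br />
hmmscan --cpu ${SLURM_CPUS_PER_TASK:-4} --tblout hits.tbl Pfam-A.hmm myproteins.fasta<br />
</syntaxhighlight><br />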
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
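Every SILVA archive in the tree above ships with a companion <code>.md5</code> checksum file. A minimal sketch of verifying a file against its checksum before use (hypothetical helper; assumes GNU <code>md5sum</code> is available and that the <code>.md5</code> file lists the bare filename):<br />

```bash
# verify_md5: check a data file against its companion "<file>.md5".
# The .md5 files list bare filenames, hence the cd into the directory.
verify_md5() {
  local f="$1"
  ( cd "$(dirname "$f")" && md5sum -c "$(basename "$f").md5" )
}

# Example (path from the tree above):
# verify_md5 /datashare/SILVA/ARB_files/SILVA_138.1_SSURef_opt.arb.gz
```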
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 test and training sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
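Both the unpacked batch directories and the original tarballs are provided. If your job reads the batches repeatedly, it can be worth staging a tarball into node-local scratch first; a sketch (hypothetical helper; <code>$SLURM_TMPDIR</code> is only defined inside a Slurm job):<br />

```bash
# stage_dataset: unpack a .tar.gz into a destination directory,
# e.g. node-local scratch ($SLURM_TMPDIR) inside a job.
stage_dataset() {
  local tarball="$1" dest="$2"
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}

# Example:
# stage_dataset /datashare/CIFAR-10/cifar-10-python.tar.gz "$SLURM_TMPDIR"
```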
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 test and training sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=485
Graham Reference Dataset Repository
2021-10-07T15:07:44Z
<p>Jshleap: /* UNIPROT */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often requires reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset, follow the instructions at https://docs.computecanada.ca/wiki/AlphaFold and set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
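For example, in the job script (the variable name comes from the linked documentation):<br />

```bash
# Point AlphaFold's data directory at the shared mount instead of a private copy
export DOWNLOAD_DIR=/datashare/alphafold
```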
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
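For instance, the <code>blastdbcmd</code> route above can regenerate FASTA from a pre-formatted database. A sketch that writes such a step into a small script (module versions copied from the sbatch example later in this section; the output filename is made up):<br />

```bash
# Write a helper script that dumps the swissprot database back to FASTA.
cat <<'EOF' > dump_swissprot.sh
#!/bin/bash
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta
EOF
chmod +x dump_swissprot.sh
```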
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database (in this case nr, a set of nr.* files) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>blastp</code> directly at <code>/datashare/BLASTDB/nr</code>, but this might be slower than having the database on the node's local disk.<br />
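The tar pipe used in the script above is a general pattern for staging a set of files into another directory without carrying over the leading path. A minimal local sketch of the same pattern, using throwaway temporary directories and empty placeholder files in place of the real database volumes:<br />

```shell
#!/bin/sh
# SRC stands in for /datashare/BLASTDB, DST for $SLURM_TMPDIR
SRC=$(mktemp -d)
DST=$(mktemp -d)
# placeholder files mimicking a multi-file BLAST database volume
touch "$SRC/nr.00.phr" "$SRC/nr.00.pin" "$SRC/nr.00.psq"
# cd into the source so tar stores bare file names, then unpack in DST
(cd "$SRC" && tar cf - nr*) | (cd "$DST" && tar xf -)
ls "$DST"
```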
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database level and the software level. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FUL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], they follow the same quarterly update schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
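As a sketch (the query and output file names are placeholders, and this fragment assumes it runs inside a Slurm job), lowering the block size and keeping DIAMOND's temporary files on the node-local disk might look like:<br />

```shell
# -b sets the sequence block size in billions of letters (default 2.0);
# lower values reduce memory and temporary disk usage at some speed cost.
# --tmpdir points DIAMOND's temporary files at node-local scratch.
# YOURREADS.fasta and out.tsv are placeholder names.
diamond blastp \
  -d /datashare/DIAMONDDB_2.0.9/nr \
  -q YOURREADS.fasta \
  -o out.tsv \
  -b 0.5 \
  --tmpdir ${SLURM_TMPDIR}
```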
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database being copied), so use this approach only when you know that your DIAMOND run will take longer than one hour. Unlike with BLASTDB, only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>diamond</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but this might be slower than having the database on the node's local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of orthologous groups and functional annotations hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, i.e. the hidden Markov model profiles, alignments, annotations, and phylogenetic trees for that taxonomic level. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are for Kraken 2 only; that is, they use a compact hash table. With this structure, there is a &lt;1% chance of returning an incorrect LCA, or of returning an LCA for a minimizer that was never inserted. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to query if your specific taxa is available in the database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's <code>inspect</code> output format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
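The inspect-style output above is column-oriented (clade percentage, clade count, direct count, rank code, taxid, name), so it can be post-processed with standard tools. A minimal sketch (the helper <code>extract_taxid</code> is hypothetical, and the sample lines reuse the example above) that pulls out the taxid for an exact taxon name:<br />

```shell
# extract_taxid (hypothetical helper): print column 5 (taxid) of any
# line whose name (columns 6 to end) exactly matches the argument.
extract_taxid() {
  awk -v name="$1" '{
    n = $6
    for (i = 7; i <= NF; i++) n = n " " $i
    if (n == name) print $5
  }'
}

# Sample lines taken from the is_my_taxa_there example above
printf '0.03 569792 0 G 13396 Carcharodon\n0.03 569792 569792 S 13397 Carcharodon carcharias\n' \
  | extract_taxid "Carcharodon carcharias"   # prints 13397
```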
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct searches of accession numbers, taxonomic IDs, and related information. It is updated together with the blast databases.<br />
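For direct searches, the <code>.dmp</code> files are plain text with fields separated by <code>&lt;tab&gt;|&lt;tab&gt;</code>. A minimal local sketch of looking up a taxid by scientific name (using a tiny two-line stand-in built with <code>printf</code> instead of the real <code>/datashare/NCBI_taxonomy/names.dmp</code>; the entry 9606 = Homo sapiens is a well-known taxid):<br />

```shell
#!/bin/sh
# names.dmp fields: tax_id | name_txt | unique name | name class
f=$(mktemp)  # tiny stand-in for /datashare/NCBI_taxonomy/names.dmp
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > "$f"
printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n' >> "$f"
# split on the literal tab-pipe-tab delimiter and match column 2 exactly
awk -F'\t[|]\t' '$2 == "Homo sapiens" {print $1}' "$f"   # prints 9606
```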
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomic manipulation toolkit called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
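For example, a sketch of two common operations (taxid 9606 is Homo sapiens; this assumes the symbolic links above are in place, or that you pass the data directory explicitly with <code>--data-dir</code>):<br />

```shell
# print the full lineage for a taxid read from stdin
echo 9606 | taxonkit lineage

# or resolve a scientific name to its taxid first
echo "Homo sapiens" | taxonkit name2taxid

# the same, pointing taxonkit at the datashare copy directly
echo 9606 | taxonkit lineage --data-dir /datashare/NCBI_taxonomy
```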
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular function, biological process, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway. It includes some metabolic pathways in BioPAX and SBML format.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB &amp; FASTA files (SSU/LSU Ref). It is also fully compatible with the ARB software, and with many common programs like Phylip or Paup via direct FASTA export from ARB.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
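Most files in the SILVA tree ship with a companion <code>.md5</code> checksum. After copying a file elsewhere (for example to node-local storage), you can verify the copy with <code>md5sum -c</code>; a minimal sketch, where the helper name <code>verify_copy</code> is our own:<br />
<br />
```shell
# verify_copy: check a file against its companion .md5 checksum file.
# md5sum -c must run in the directory holding both files, because the
# .md5 file records the bare file name.
verify_copy () {
    local dir="$1" file="$2"
    (cd "$dir" && md5sum -c "${file}.md5")
}

# example (assumes the file and its .md5 were copied together):
# verify_copy "${SLURM_TMPDIR}" SILVA_138.1_SSURef_tax_silva.fasta.gz
```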
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
The structure of the UNIPROT dataset follows [ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/ UNIPROT's FTP]:<br />
<pre><br />
/datashare/UNIPROT<br />
├── changes.html<br />
├── decoy<br />
│ ├── LICENSE<br />
│ ├── README<br />
│ └── RELEASE.metalink<br />
├── knowledgebase<br />
│ ├── complete<br />
│ ├── genome_annotation_tracks<br />
│ ├── idmapping<br />
│ ├── pan_proteomes<br />
│ ├── proteomics_mapping<br />
│ ├── reference_proteomes<br />
│ └── taxonomic_divisions<br />
├── news.html<br />
├── README<br />
├── RELEASE.metalink<br />
└── relnotes.txt<br />
<br />
9 directories, 8 files<br />
</pre><br />
<br />
The explanation of each directory's content can be found at <code>/datashare/UNIPROT/README</code>, or you can check it online [https://ftp.uniprot.org/pub/databases/uniprot/current_release/README here].<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
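On a compute node it is usually fastest to unpack one of these archives to node-local storage before training. A sketch, assuming a Slurm job with <code>$SLURM_TMPDIR</code> set; the helper name <code>stage_dataset</code> is our own:<br />
<br />
```shell
# stage_dataset: copy a .tar.gz archive into a destination directory
# and unpack it there.
stage_dataset () {
    local archive="$1" dest="$2"
    mkdir -p "$dest"
    cp "$archive" "$dest"/
    tar xzf "$dest/$(basename "$archive")" -C "$dest"
}

# example:
# stage_dataset /datashare/CIFAR-10/cifar-10-python.tar.gz "${SLURM_TMPDIR}"
# ...then point your data loader at ${SLURM_TMPDIR}/cifar-10-batches-py
```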
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=484
Graham Reference Dataset Repository
2021-10-07T15:01:11Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
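Since each collection's <code>data</code> directory is organised by population and then by sample, one sample's files can be located with <code>find</code>; a sketch, where the helper name <code>sample_files</code> and the sample ID below are illustrative:<br />
<br />
```shell
# sample_files: list every file belonging to one sample under a collection,
# assuming the layout data/<POPULATION>/<SAMPLE>/...
sample_files () {
    local collection="$1" sample="$2"
    find "$collection/data" -type d -name "$sample" -exec find '{}' -type f ';'
}

# example:
# sample_files /datashare/1000genomes/data_collections/1000_genomes_project HG00096
```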
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions in https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
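For example, in the shell (or sbatch script) from which you run AlphaFold:<br />
<br />
```shell
# use the shared copy of the AlphaFold data instead of downloading your own
export DOWNLOAD_DIR=/datashare/alphafold
```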
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
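For example, regenerating FASTA from a pre-formatted database might look like the sketch below; the wrapper name <code>fetch_fasta</code> is our own, and the entry argument is whatever accessions you need (or <code>all</code> for the whole database):<br />
<br />
```shell
# module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # BLAST+ provides blastdbcmd

# fetch_fasta: dump entries from a pre-formatted BLAST database to FASTA.
fetch_fasta () {
    local db="$1" entry="$2" out="$3"
    blastdbcmd -db "$db" -entry "$entry" -out "$out"
}

# example: regenerate the whole swissprot database as FASTA
# fetch_fasta /datashare/BLASTDB/swissprot all swissprot.fasta
```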
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} && # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that the nr database is required.<br />
<br />
You can also run BLAST directly against <code>/datashare/BLASTDB/nr</code> (skipping the copy in the example), but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
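Because these files are gzip-compressed, they can be streamed without writing an uncompressed copy to disk. For instance, counting the sequences in one of them; the helper name <code>count_seqs</code> is our own:<br />
<br />
```shell
# count_seqs: count the sequences in a gzipped FASTA by streaming it
# through gunzip and counting header lines.
count_seqs () {
    gunzip -c "$1" | grep -c '^>'
}

# example:
# count_seqs /datashare/BLAST_FASTA/pdbaa.gz
```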
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software levels. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we provide here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with BLASTDB, only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that the nr database is required.<br />
<br />
You can also run DIAMOND directly against <code>/datashare/DIAMONDDB_2.0.9/nr</code> (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
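To peek at one of the gzipped TSV files described above without extracting anything, you can stream it directly. A minimal Python sketch (the example path and the helper name are illustrative, not part of the repository):<br />

```python
import csv
import gzip

def head_tsv_gz(path, limit=5):
    """Return the first `limit` rows of a gzipped tab-separated file."""
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            rows.append(row)
            if len(rows) == limit:
                break
    return rows

# e.g. head_tsv_gz("/datashare/EggNog/per_tax_level/2/2_annotations.tsv.gz")
```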
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, which means they use a compact hash table. With this structure, there is a &lt;1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
The Kraken 2 databases are provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to check whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, you would run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows the inspect format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
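If you need those columns programmatically, the six fields can be split off with standard string handling. A minimal Python sketch (the column names are assumptions based on the example output above, not an official schema):<br />

```python
def parse_inspect_line(line):
    """Split one inspect-style line into its six columns:
    percentage, clade count, direct count, rank code, taxid, name."""
    pct, clade, direct, rank, taxid, name = line.split(None, 5)
    return {
        "percent": float(pct),
        "clade_count": int(clade),
        "direct_count": int(direct),
        "rank": rank,
        "taxid": int(taxid),
        "name": name.strip(),
    }

# parse_inspect_line("  0.03  569792  0  G  13396  Carcharodon")
```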
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct lookups of accession numbers, taxonomic IDs, and related information. It will be updated along with the blast databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomic manipulation toolkit called [https://github.com/shenwei356/taxonkit TaxonKit] is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
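You can also read the taxonomy dumps directly: records in the NCBI <code>.dmp</code> files use a tab-pipe-tab field separator and end with a trailing tab-pipe. A minimal Python sketch for <code>names.dmp</code> (the helper name is illustrative):<br />

```python
def parse_names_dmp(lines):
    """Yield (taxid, name, name_class) tuples from names.dmp records,
    whose fields are separated by '\t|\t' and terminated by '\t|'."""
    for line in lines:
        fields = line.rstrip("\n").rstrip("\t|").split("\t|\t")
        taxid, name, _unique_name, name_class = fields[:4]
        yield int(taxid), name, name_class

# e.g. with open("/datashare/NCBI_taxonomy/names.dmp") as fh:
#     scientific = {t: n for t, n, c in parse_names_dmp(fh)
#                   if c == "scientific name"}
```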
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies<br />
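The seven columns above map naturally onto a small parser. A minimal Python sketch (the column keys are this page's names, not an official schema):<br />

```python
PANTHER_COLUMNS = (
    "panther_id", "name", "molecular_function", "biological_process",
    "cellular_components", "protein_class", "pathway",
)

def parse_classification_row(line):
    """Map one tab-delimited classification row onto the seven columns
    listed above; missing trailing columns become empty strings."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(PANTHER_COLUMNS) - len(values))
    return dict(zip(PANTHER_COLUMNS, values))
```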
<br />
For more information, check the README file at <code>/datashare/PANTHER/hmm_classifications</code>.<br />
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the fasta inputs.<br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/panther_library</code>.<br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information, check the README file at <code>/datashare/PANTHER/sequence_classifications</code>.<br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-10 training and test sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
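The binary batches above use a fixed record layout: 1 label byte followed by 32&times;32&times;3 image bytes, i.e. 3073 bytes per record and 10000 records per batch file. A quick sanity check of that arithmetic (the commented <code>stat</code> line is what you would run on Graham to compare against the real file):<br />
<br />
```shell
#!/bin/sh
# CIFAR-10 binary layout: <1 byte label><3072 bytes RGB image>, 10000 per batch
rec=$((1 + 32 * 32 * 3))        # 3073 bytes per record
expected=$((10000 * rec))       # expected size of each data_batch_N.bin
echo "$expected"
# On Graham, compare against the real file:
#   stat -c %s /datashare/CIFAR-10/cifar-10-batches-bin/data_batch_1.bin
```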
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
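The CIFAR-100 binary files use the same layout as CIFAR-10, except each record carries two label bytes (coarse and fine) before the 3072 image bytes, so <code>train.bin</code> should hold 50000 &times; 3074 bytes. A quick check of that arithmetic:<br />
<br />
```shell
#!/bin/sh
# CIFAR-100 binary layout: <coarse label><fine label><3072 bytes RGB image>
rec=$((2 + 32 * 32 * 3))         # 3074 bytes per record
train=$((50000 * rec))           # expected size of train.bin
test_sz=$((10000 * rec))         # expected size of test.bin
echo "$train $test_sz"
```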
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=483
Graham Reference Dataset Repository
2021-10-07T15:00:55Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory /datashare/1000genomes/release/20100804/ contains the release versions of SNP and indel calls based on the /datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index file.<br />
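The naming convention above can be applied mechanically. A small sketch that, given a date-named release directory, reconstructs the name of the sequence index it was based on (paths are illustrative):<br />
<br />
```shell
#!/bin/sh
# For date-named releases, the matching index is <YYYYMMDD>.sequence.index
dir=/datashare/1000genomes/release/20100804
date_part=$(basename "$dir")
index="${date_part}.sequence.index"
echo "$index"
```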
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use these data following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
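Before submitting an AlphaFold job, it can be worth checking that all seven data folders from the tree above are visible. A sketch of such a check, demonstrated on a mock directory so it can be tried anywhere; on Graham you would set <code>DOWNLOAD_DIR=/datashare/alphafold</code> and skip the <code>mkdir</code> lines:<br />
<br />
```shell
#!/bin/sh
set -e
# Mock mount so the loop can be tried anywhere; on Graham use the real path.
DOWNLOAD_DIR=$(mktemp -d)
mkdir -p "$DOWNLOAD_DIR"/bfd "$DOWNLOAD_DIR"/mgnify "$DOWNLOAD_DIR"/params \
         "$DOWNLOAD_DIR"/pdb70 "$DOWNLOAD_DIR"/pdb_mmcif \
         "$DOWNLOAD_DIR"/uniclust30 "$DOWNLOAD_DIR"/uniref90
missing=0
for d in bfd mgnify params pdb70 pdb_mmcif uniclust30 uniref90; do
  [ -d "$DOWNLOAD_DIR/$d" ] || { echo "missing: $d"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all AlphaFold data folders present"
```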
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) && # copy the required database volumes (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
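Because each pre-formatted database is a set of files sharing a prefix, the copy step streams the matching volumes through <code>tar</code>. The pattern can be tried on stand-in files (the file names below mimic BLAST database volumes; on Graham the source directory would be <code>/datashare/BLASTDB</code>):<br />
<br />
```shell
#!/bin/sh
set -e
# Stand-ins for the nr.* database volumes
src=$(mktemp -d); dst=$(mktemp -d)
touch "$src/nr.00.phr" "$src/nr.00.pin" "$src/nr.00.psq"
# Stream the matching files from src to dst without an intermediate archive
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)
ls "$dst"
```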
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
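Because these files are gzip-compressed, many tools can consume them as a stream without ever writing the huge uncompressed copy to disk; the [[#DIAMONDDB_2.0.9|DIAMOND databases]] below are built this way with <code>gunzip -c</code>. A sketch of the pattern on a tiny stand-in file (on Graham you would stream e.g. <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />
<br />
```shell
#!/bin/sh
set -e
# Tiny stand-in for a compressed FASTA file
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1 example\nMKVL\n' | gzip > swissprot.gz
# Stream-decompress and peek at the first record header
first=$(gunzip -c swissprot.gz | head -n 1)
echo "$first"
```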
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we provide here) and large query sets (both in sequence length and number). Should the program fail due to running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top level directory includes the e5 release of the proteomes and its annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNog's taxids. The <code>per_tax_level</code> folder contains a series of folders, labeled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNog's clustering.<br />
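As a sketch of how to browse this layout, the files for a single taxonomic ID can be listed and previewed directly. The <code>&lt;taxid&gt;_*</code> file-name pattern is inferred from the description above, so treat it as an assumption and check the actual file names first:<br />

```shell
# Illustrative only: list the EggNOG files shipped for one taxonomic ID and
# preview its annotation table. Paths refer to Graham's /datashare mount.
taxid=9604   # NCBI TaxID for Hominidae (great apes)
ls /datashare/EggNog/per_tax_level/${taxid}/
zcat /datashare/EggNog/per_tax_level/${taxid}/${taxid}_annotations.tsv.gz | head -n 3
```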
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and these k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). At SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 ONLY, meaning they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a minimizer that was never inserted. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
==== Directory structure ====<br />
The Kraken 2 databases are provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choice:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
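For anything beyond a quick test, this is usually submitted as a Slurm job. A minimal sketch follows; the module names, resource requests, and input/output file names here are assumptions to adapt to your data:<br />

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
# Module names below are assumptions; confirm with `module spider kraken2`.
module load StdEnv/2020 gcc kraken2
kraken2 --db /datashare/kraken2_dbs/eukaryota \
        --threads "$SLURM_CPUS_PER_TASK" \
        --confidence 0.1 \
        --report sample.k2report \
        --output sample.k2out \
        test.fa
```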
<br />
For your convenience, we provide a simple script to check whether your taxon of interest is available in a given database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows Kraken 2's inspect report format. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
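Because the inspect-style report is tab-delimited (percentage of minimizers, clade count, direct count, rank code, NCBI TaxID, name), it is easy to post-process with standard tools. For instance, keeping only the genus-level (rank code <code>G</code>) rows of the <code>Carcharodon</code> output above:<br />

```shell
# Filter an inspect-style report down to genus-level entries,
# printing the TaxID and name columns. Sample rows stand in for
# the script's real output.
printf '0.03\t569792\t0\tG\t13396\tCarcharodon\n0.03\t569792\t569792\tS\t13397\tCarcharodon carcharias\n' \
  | awk -F'\t' '$4 == "G" { print $5 "\t" $6 }'
```

On the cluster you would pipe the script's output into the same <code>awk</code> filter instead of the sample <code>printf</code>.<br />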
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
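The dump files can also be queried directly with standard tools. Assuming the usual taxdump layout, rows of <code>names.dmp</code> are delimited by tab–pipe–tab (<code>taxid | name | unique name | name class</code>); the sketch below parses a sample row, and on Graham you would read <code>/datashare/NCBI_taxonomy/names.dmp</code> instead of the <code>printf</code>:<br />

```shell
# Pull the scientific name for a TaxID out of a names.dmp-style row.
# Field separator is the literal sequence: tab, pipe, tab.
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' \
  | awk -F'\t\\|\t' '$1 == 9606 { print $2 }'
```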
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
Compute Canada provides TaxonKit, a toolkit for taxonomy manipulation. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location; to set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use <code>taxonkit</code> directly.<br />
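For example, a command sketch resolving a TaxID to its lineage (run after loading the module and creating the symlinks above):<br />

```shell
# Map an NCBI TaxID to its full lineage, then reformat it into ranked fields.
# Requires the *.dmp symlinks in ~/.taxonkit created above.
echo 9606 | taxonkit lineage
echo 9606 | taxonkit lineage | taxonkit reformat
```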
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6 (":SF" indicates the subfamily ID)<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
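Since the classification files are plain TSV, the columns above can be pulled out with standard tools. A sketch on a made-up sample row (on Graham, read e.g. <code>/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications</code> instead of the <code>printf</code>):<br />

```shell
# Print "PANTHER ID -> Name" from a classification-style TSV row.
# The row contents here are invented for illustration.
printf 'PTHR11258:SF6\tRIBOSOMAL PROTEIN S12\tstructural constituent of ribosome\ttranslation\t\t\t\n' \
  | awk -F'\t' '{ print $1 " -> " $2 }'
```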
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website provides access to pre-calculated HMM scoring results for complete proteomes, including those derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB and FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or Paup, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
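A quick sanity check on any of the gzipped FASTA exports is to count records by their <code>&gt;</code> headers, e.g. <code>zcat /datashare/SILVA/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | grep -c '^&gt;'</code>. The sketch below demonstrates the same pipeline on a two-record sample:<br />

```shell
# Count FASTA records by counting header lines.
printf '>seq1 Bacteria;Proteobacteria\nACGU\n>seq2 Archaea;Euryarchaeota\nGGCU\n' \
  | grep -c '^>'
```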
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequences and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins, derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=482
Graham Reference Dataset Repository
2021-10-07T15:00:19Z
<p>Jshleap: /* Usage */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access them.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
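A minimal sketch of using the mount from a job script (assuming you follow the Compute Canada AlphaFold instructions; the exact run command depends on the AlphaFold version you load):<br />

```shell
# point AlphaFold's data directory at the shared mount
export DOWNLOAD_DIR=/datashare/alphafold

# sanity-check that the expected subdirectories are visible
ls "$DOWNLOAD_DIR/params"   # model parameter files
ls "$DOWNLOAD_DIR/bfd"      # BFD database files
```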
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
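As a sketch of the last point, <code>blastdbcmd</code> can regenerate FASTA directly from a pre-formatted database (the accession below is only illustrative and may not be present in the current database):<br />

```shell
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0

# dump one entry from swissprot as FASTA; replace P01308 with your accession
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -outfmt "%f"

# or dump the whole database to FASTA
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all > swissprot.fasta
```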
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR;<br />
# the nr database is a set of files named nr.*, so copy them all<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR}<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the database path, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
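Since <code>/datashare</code> is not writable by regular users, decompress into your own space or the job's local disk if a tool needs plain FASTA; a minimal sketch:<br />

```shell
# decompress swissprot into the job's local disk;
# the output filename is our choice, not part of the repository
zcat /datashare/BLAST_FASTA/swissprot.gz > ${SLURM_TMPDIR}/swissprot.fasta
```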
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but has some optimizations both at the database level and at the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same schedule of every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
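For instance, lowering DIAMOND's default block size (2.0 billion letters) reduces its memory and temporary-disk footprint at some speed cost; the query and output names below are placeholders:<br />

```shell
# -b sets the sequence block size in billions of letters (default 2.0);
# lower values use less memory and temporary disk space
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -b 1.0 \
    -q YOURREADS.fasta -o AN_OUTPUT.tsv
```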
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the database path, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a curated resource of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
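As a sketch (assuming the per-taxid file naming above; taxid 2 is Bacteria), you could inspect the orthologous-group member table for one clade:<br />

```shell
# list what is available for taxid 2 (Bacteria)
ls /datashare/EggNog/per_tax_level/2/

# peek at the member table; the exact file name is assumed to
# follow the *_members.tsv.gz pattern described above
zcat /datashare/EggNog/per_tax_level/2/2_members.tsv.gz | head -3
```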
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). At SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are for Kraken 2 ONLY, meaning they use a compact hash table. With this structure, there is a &lt;1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database you are able to query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
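As with the BLAST and DIAMOND databases, a Kraken 2 database can be staged to local disk inside a job; a hedged sbatch sketch (the module name/version and the input file are placeholders):<br />

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load kraken2   # adjust to the module available on your cluster

# stage the database on local disk, then classify
cp -r /datashare/kraken2_dbs/eukaryota ${SLURM_TMPDIR}/
kraken2 --db ${SLURM_TMPDIR}/eukaryota --threads ${SLURM_CPUS_PER_TASK} \
    --report test.report --output test.kraken test.fa
```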
<br />
For your convenience, we provide a simple script to query whether your specific taxon is available in a database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database; then you would run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
The output of this script follows the format of Kraken 2's <code>inspect</code> output. You can check the [https://github.com/DerrickWood/kraken2/wiki Kraken2 Manual] for more information.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
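For example, a direct accession-to-taxid lookup can be done with <code>zgrep</code> (the accession below is only illustrative):<br />

```shell
# map a protein accession to its taxid; the columns are
# accession, accession.version, taxid, gi
zgrep -m 1 "^NP_000005" /datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz
```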
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems we provide a taxonomic manipulation tool called TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
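If you want to rehearse the setup first, the same symlink step can be tried in a throwaway directory; the <code>.dmp</code> files below are empty placeholders for the real taxonomy dumps:<br />
<br />
```shell
# Rehearse the ~/.taxonkit symlink setup in a sandbox instead of $HOME.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/NCBI_taxonomy" "$sandbox/.taxonkit"
touch "$sandbox/NCBI_taxonomy/names.dmp" "$sandbox/NCBI_taxonomy/nodes.dmp"
ln -s "$sandbox"/NCBI_taxonomy/*.dmp "$sandbox/.taxonkit/"
ls -l "$sandbox/.taxonkit"    # the .dmp symlinks should be listed here
rm -rf "$sandbox"
```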
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/hmm_classifications</code>.<br />
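Given that layout, individual columns are easy to slice out with <code>cut</code>. A sketch with one synthetic record (the ID and annotations are placeholders, not real PANTHER entries):<br />
<br />
```shell
# Print the PANTHER ID (field 1) and curated name (field 2) of each record.
# The printf line is a fake record with the seven tab-separated fields described above.
printf 'PTHR99999\tEXAMPLE FAMILY NAME\tmf-terms\tbp-terms\tcc-terms\tpc-terms\tpathway\n' |
  cut -f1,2
```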
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs.<br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/panther_library</code>.<br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information, check the README file at <code>/datashare/PANTHER/sequence_classifications</code>.<br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP, via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
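Each export ships with an <code>.md5</code> companion, so a copy you transfer elsewhere can be verified with <code>md5sum -c</code>. A sketch using stand-in files (on Graham you would run the check against a real pair, such as a <code>.fasta.gz</code> file and its <code>.md5</code>):<br />
<br />
```shell
# Verify a file against its .md5 companion; stand-in files are built for illustration.
workdir=$(mktemp -d)
cd "$workdir"
printf 'example contents\n' > sample.dat          # stand-in for a downloaded export
md5sum sample.dat > sample.dat.md5                # stand-in for the shipped .md5 file
md5sum -c sample.dat.md5                          # prints "sample.dat: OK" on success
cd / && rm -rf "$workdir"
```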
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 test and training sets, along with the labels.<br />
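When training on a compute node, it is usually faster to copy one tarball to node-local scratch and extract it there than to read many small files over NFS. A hedged sketch of that pattern: <code>$SLURM_TMPDIR</code> is the usual per-job scratch on our clusters, a temporary directory stands in when it is unset, and a tiny generated tarball substitutes for <code>/datashare/CIFAR-10/cifar-10-python.tar.gz</code>:<br />
<br />
```shell
# Copy-then-extract pattern for using a dataset tarball on a compute node.
scratch=${SLURM_TMPDIR:-$(mktemp -d)}      # per-job local scratch, or a temp-dir stand-in
srcdir=$(mktemp -d)                        # stand-in for /datashare/CIFAR-10
mkdir -p "$srcdir/payload" && touch "$srcdir/payload/data_batch_1"
tar -czf "$srcdir/dataset.tar.gz" -C "$srcdir" payload
cp "$srcdir/dataset.tar.gz" "$scratch/"    # one large copy over NFS
tar -xzf "$scratch/dataset.tar.gz" -C "$scratch"
ls "$scratch/payload"                      # extracted files, ready for local reads
rm -rf "$srcdir"
```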
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 test and training sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=481
Graham Reference Dataset Repository
2021-10-07T14:58:11Z
<p>Jshleap: /* Directory structure */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
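Note that two of the directories listed above (ImageNet and VoxCeleb) are readable only by opt-in groups, as their permissions show. You can check whether your account is already in those groups; the group names below are taken from the listing above:<br />
<br />
```shell
# Report whether the current account belongs to the dataset opt-in groups.
for g in imagenet-optin voxceleb-optin; do
  if id -nG | tr ' ' '\n' | grep -qx "$g"; then
    echo "$g: member"
  else
    echo "$g: not a member (request access before using the dataset)"
  fi
done
```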
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [http://www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this data, follow the instructions at https://docs.computecanada.ca/wiki/AlphaFold and set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
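For example (a minimal sketch; <code>DOWNLOAD_DIR</code> is the variable name used by the linked instructions):<br />

```shell
# Point AlphaFold at the shared dataset mount instead of a private download.
# No data is copied by this step; AlphaFold reads the databases in place.
export DOWNLOAD_DIR=/datashare/alphafold
echo "AlphaFold data expected under: ${DOWNLOAD_DIR}"
```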
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
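As a hedged sketch of the last point (the command is only assembled and printed here, since it requires the BLAST+ module and the mounted databases; <code>swissprot</code> is chosen only as a small example):<br />

```shell
# Assemble (but do not run) a blastdbcmd call that regenerates FASTA
# sequences from a pre-formatted database; -entry all dumps every sequence.
CMD='blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta'
echo "${CMD}"
```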
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that the nr database is the one required.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code> instead of copying, but it might be slower than having the database on local disk.<br />
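The tar-pipe staging pattern from the script above can be rehearsed on dummy files; in this illustration the temporary directories stand in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>:<br />

```shell
# Illustration of the tar-pipe staging pattern on dummy data.
SRC=$(mktemp -d)    # stand-in for /datashare/BLASTDB
DEST=$(mktemp -d)   # stand-in for $SLURM_TMPDIR
touch "${SRC}/nr.00.phr" "${SRC}/nr.00.pin" "${SRC}/nr.00.psq"
# Stream all nr.* volumes as one tar archive and unpack at the target;
# a single stream avoids per-file overhead for databases with many volumes.
(cd "${SRC}" && tar cf - nr.*) | (cd "${DEST}" && tar xf -)
ls "${DEST}"
```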
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST but includes optimizations at both the database and the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
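As a hedged sketch (the command is only assembled and printed here, not run; <code>-b</code> is DIAMOND's documented block-size option, defaulting to 2.0):<br />

```shell
# Assemble (but do not run) a lower-memory DIAMOND invocation:
# -b 0.5 shrinks the sequence block size (default 2.0), cutting memory
# and temporary disk use at the cost of some speed.
CMD='diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q myquery.fasta -o hits.tsv -b 0.5'
echo "${CMD}"
```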
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that the nr database is the one required.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> instead of copying, but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
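As an illustration of reading the per-taxon files (run here on dummy data; on Graham you would point at a real path such as <code>/datashare/EggNog/per_tax_level/2759</code>), the <code>*.tsv.gz</code> tables can be streamed with <code>zcat</code> without unpacking:<br />

```shell
# Demonstrate streaming a gzipped annotations table. The dummy file only
# mimics the <taxid>_annotations.tsv.gz naming; its columns are invented.
TAXDIR=$(mktemp -d)   # stand-in for /datashare/EggNog/per_tax_level/2759
printf '2759\tOG0001\texample annotation\n' | gzip > "${TAXDIR}/2759_annotations.tsv.gz"
zcat "${TAXDIR}/2759_annotations.tsv.gz" | head -n 5
```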
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). In SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are for Kraken 2 only, which means they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
<pre><br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
</pre><br />
<br />
==== Usage ====<br />
By providing the path to the database you are able to query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to query if your specific taxa is available in the database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, to check whether the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, you would run:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomy manipulation toolkit called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files in a particular location. To set it up with this datashare, simply create symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use <code>taxonkit</code> directly.<br />
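The setup can be rehearsed on dummy files first (the temporary directories stand in for <code>/datashare/NCBI_taxonomy</code> and your home directory):<br />

```shell
# Rehearse the TaxonKit symlink setup on dummy files.
FAKE_SHARE=$(mktemp -d)   # stand-in for /datashare/NCBI_taxonomy
FAKE_HOME=$(mktemp -d)    # stand-in for $HOME
touch "${FAKE_SHARE}/names.dmp" "${FAKE_SHARE}/nodes.dmp"
mkdir -p "${FAKE_HOME}/.taxonkit"
ln -s "${FAKE_SHARE}"/*.dmp "${FAKE_HOME}/.taxonkit/"
ls -l "${FAKE_HOME}/.taxonkit"
```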
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
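Since every SILVA file ships with an <code>.md5</code> companion, it is worth verifying your copy before using it. A minimal sketch of a staging helper (the function name <code>stage_verified</code> and the destination path are hypothetical; the file names come from the 138.1 listing above):<br />

```bash
# Hypothetical helper (not part of the repository): copy a file plus its
# .md5 companion and verify the checksum in the destination before use.
stage_verified () {
  local src=$1 file=$2 dest=$3
  mkdir -p "$dest"
  cp "$src/$file" "$src/$file.md5" "$dest"/
  ( cd "$dest" && md5sum -c "$file.md5" )
}

# On Graham this could look like (destination under $SCRATCH is assumed):
# stage_verified /datashare/SILVA/Exports SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz "$SCRATCH/silva"
```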
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
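If a job reads the many small batch files repeatedly, it is usually faster to extract a tarball to node-local storage first. A minimal sketch (the helper name is hypothetical):<br />

```bash
# Hypothetical helper: extract a dataset tarball into a target directory
# (typically node-local $SLURM_TMPDIR inside a job).
extract_dataset () {
  local tarball=$1 dest=$2
  mkdir -p "$dest"
  tar xzf "$tarball" -C "$dest"
}

# e.g., inside a job script on Graham:
# extract_dataset /datashare/CIFAR-10/cifar-10-python.tar.gz "$SLURM_TMPDIR"
# python train.py --data "$SLURM_TMPDIR/cifar-10-batches-py"
```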
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=480
Graham Reference Dataset Repository
2021-10-07T14:57:47Z
<p>Jshleap: /* kraken2_dbs */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, README and index files within the collection's directory provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info here https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this data following the instructions in https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
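For example, in your job script:<br />

```bash
# Point AlphaFold's data directory at the shared copy instead of
# downloading the reference data into your own project space.
export DOWNLOAD_DIR=/datashare/alphafold
```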
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All the pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) && # copy the required database (in this case nr) to $SLURM_TMPDIR without its leading path<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the <code>-db</code> argument, but it might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository] in compressed (gzip) format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
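These FASTA dumps are also useful when you need a custom BLAST database (for example, a subset of sequences). A minimal sketch with <code>makeblastdb</code>; the module versions are taken from the example above, and the output name is only an illustration:<br />

```shell
# build a custom protein BLAST database from the swissprot FASTA dump
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
gunzip -c /datashare/BLAST_FASTA/swissprot.gz > swissprot.fasta
makeblastdb -in swissprot.fasta -dbtype prot -title swissprot_custom -out swissprot_custom
```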
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software levels. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
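For instance, if a run with the default block size fails, it can be retried with a smaller <code>-b</code> (and a larger <code>-c</code> index-chunk count); the values below are illustrative, not recommendations:<br />

```shell
# -b 0.5 lowers the block size from the default 2.0, reducing memory and
# temporary disk usage at some cost in speed; -c 4 splits the seed index
# into more chunks, reducing memory usage further
diamond blastp -d ${SLURM_TMPDIR}/nr -q myquery.fasta -o hits.tsv -b 0.5 -c 4
```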
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the <code>-d</code> argument, but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer; the k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). On SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only, meaning they use a compact hash table. With this structure, there is a <1% chance of returning an incorrect LCA or returning an LCA for a non-inserted minimizer. Users can compensate for this possibility by using Kraken's confidence scoring thresholds.<br />
==== Directory structure ====<br />
Kraken 2 is provided in the following structure:<br />
/datashare/kraken2_dbs<br />
├── 16S_Greengenes_k2db<br />
├── 16S_RDP_k2db<br />
├── 16S_SILVA132_k2db<br />
├── 16S_SILVA138_k2db<br />
├── archaea<br />
├── bacteria<br />
├── dl_log<br />
├── eukaryota<br />
├── fungi<br />
├── human<br />
├── is_my_taxa_there<br />
├── krakendb_100G<br />
├── midikraken_100GB<br />
├── minikraken_8GB_20200312<br />
├── minikraken_8GB_20200312_genomes.txt<br />
├── minikraken_8GB_202003.tgz<br />
├── plant<br />
├── protozoa<br />
├── UniVec_Core<br />
└── viral<br />
<br />
==== Usage ====<br />
By providing the path to the database, you can query the specific database of your choosing:<br />
<br />
<code>kraken2 --db /datashare/kraken2_dbs/eukaryota test.fa</code><br />
<br />
For your convenience, we provide a simple script to query if your specific taxa is available in the database:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -h<br />
Usage: /datashare/kraken2_dbs/is_my_taxa_there [-t <taxa to look for>|[-d <database>|-h]<br />
-h print usage and exit<br />
-t desired taxa<br />
-d Database to check in (full path)<br />
<br />
NOTE: THE TAXA IS CASE SENSITIVE, for example, if you require arabidopsis genus in the plant database it returns nothing, but Arabidopsis will return the hits<br />
</pre><br />
<br />
For example, let's say that you want to check if the genus <code>Carcharodon</code> is included in the <code>eukaryota</code> database, then you do:<br />
<br />
<pre><br />
$ /datashare/kraken2_dbs/is_my_taxa_there -t Carcharodon -d /datashare/kraken2_dbs/eukaryota<br />
Checking if Carcharodon is present in /datashare/kraken2_dbs/eukaryota<br />
<br />
0.03 569792 0 G 13396 Carcharodon<br />
0.03 569792 569792 S 13397 Carcharodon carcharias<br />
</pre><br />
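As with the BLAST and DIAMOND databases, copying the chosen database to <code>$SLURM_TMPDIR</code> can speed up long classification jobs. A sketch of such an sbatch script; the resource values, module name, and <code>--confidence</code> threshold are illustrative assumptions:<br />

```shell
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load StdEnv/2020 kraken2                        # load kraken2 and dependencies
cp -r /datashare/kraken2_dbs/eukaryota ${SLURM_TMPDIR} # copy the chosen database to local disk
kraken2 --db ${SLURM_TMPDIR}/eukaryota --threads ${SLURM_CPUS_PER_TASK} \
        --confidence 0.1 --output classified.tsv --report report.txt test.fa
```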
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
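For example, a direct lookup of the taxonomic ID assigned to a nucleotide accession can be done against the compressed mapping files; the accession below is only an illustration:<br />

```shell
# columns in the mapping file are: accession, accession.version, taxid, gi
zgrep "^U00096" /datashare/NCBI_taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | head -1
```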
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, there is a taxonomy manipulation toolkit called [https://github.com/shenwei356/taxonkit TaxonKit]. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this data mount, simply add symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
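For instance, once the links are in place, retrieving the lineage for a taxonomic ID might look like this (the taxid is just an example; see the TaxonKit documentation for the full command set):<br />

```shell
# print the semicolon-separated lineage for taxid 9606 (Homo sapiens)
echo 9606 | taxonkit lineage
```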
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
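Given that layout, individual columns can be pulled out with standard tools; for example, listing the PANTHER ID and family name (the file name matches the directory listing above, and the column choice is just an example):<br />

```shell
# print columns 1 (PANTHER ID) and 2 (family/subfamily name)
awk -F'\t' '{print $1 "\t" $2}' \
    /datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications | head
```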
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
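Each data file in the SILVA tree above ships with a companion <code>.md5</code> checksum file. A minimal sketch for verifying a copied file against its checksum, assuming the usual <code>md5sum</code> layout (hex digest first, then the file name); the helper name is illustrative, not part of the repository:<br />

```python
import hashlib
from pathlib import Path

def verify_md5(data_path, md5_path):
    """Return True when data_path matches the digest stored in md5_path.

    Assumes md5_path follows the usual md5sum layout: hex digest first,
    then the file name.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    digest = hashlib.md5()
    with open(data_path, "rb") as fh:
        # Stream in 1 MiB chunks so multi-gigabyte files do not fill memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

On the cluster the same check can be done in the shell with <code>md5sum -c SILVA_138.1_SSURef.rast.gz.md5</code> run inside the directory holding the file.<br />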
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
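The Python batches under <code>cifar-10-batches-py</code> are pickled dictionaries: per the CIFAR site, each stores a 10000x3072 uint8 <code>data</code> array (1024 red, 1024 green, then 1024 blue values per row, in row-major order) plus a <code>labels</code> list. A minimal sketch for loading one batch into image-shaped arrays (the function name and example path are illustrative):<br />

```python
import pickle

import numpy as np

def load_cifar10_batch(path):
    """Load one CIFAR-10 python batch and return (images, labels).

    images has shape (N, 32, 32, 3), dtype uint8; labels has length N.
    """
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")  # dict keys come back as bytes
    data = batch[b"data"]  # (N, 3072): 1024 R, 1024 G, 1024 B values per row
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # -> (N, H, W, C)
    return images, np.asarray(batch[b"labels"])

# e.g. images, labels = load_cifar10_batch(
#     "/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")
```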
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=479
Graham Reference Dataset Repository
2021-10-07T14:48:52Z
<p>Jshleap: /* kraken2_dbs */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount that provides our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets take up in their project accounts. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often relies on reference datasets (usually referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (see https://docs.computecanada.ca/wiki/AlphaFold). Each dataset is described at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset, follow the instructions in https://docs.computecanada.ca/wiki/AlphaFold and set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes (depending on the database), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the database path, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
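Because these files are gzip-compressed, they can be streamed without unpacking them to disk first (e.g. <code>zcat /datashare/BLAST_FASTA/pdbaa.gz | head</code> in the shell). As a sketch, counting the sequences in one of these FASTA files by streaming it in Python (the function name is illustrative):<br />

```python
import gzip

def count_fasta_records(path):
    """Count sequences in a gzipped FASTA by counting '>' header lines."""
    n = 0
    with gzip.open(path, "rt") as fh:  # "rt" decompresses to text on the fly
        for line in fh:
            if line.startswith(">"):
                n += 1
    return n

# e.g. count_fasta_records("/datashare/BLAST_FASTA/pdbaa.gz")
```

Note that a full pass over the larger files (e.g. nt.gz) will take considerable time; for searches, prefer the pre-formatted databases.<br />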
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. The copy adds between 5 and 30 minutes (depending on the database), so use it only when you know that your run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the database path, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
[http://eggnog5.embl.de/#/app/home EggNOG] is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them you can find files named <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs ===<br />
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and the k-mer assignments inform the classification algorithm ([https://ccb.jhu.edu/software/kraken2/ kraken2]). At SHARCNET, we provide some extra databases with expanded taxonomy for our users. These databases are Kraken 2 only; that is, they use a compact hash table, which has a <1% chance of returning an incorrect LCA or returning an LCA for a minimizer that was never inserted. Users can compensate for this possibility by using Kraken 2's confidence scoring thresholds.<br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.), as well as for direct lookups of accession numbers, taxonomic IDs, and related information. It is updated along with the blast databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
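<br />
If you want to parse these files yourself, the <code>*.dmp</code> files (e.g. <code>names.dmp</code>, <code>nodes.dmp</code>) follow NCBI's taxdump conventions: fields are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe. Here is a minimal Python sketch; the sample lines below are illustrative rather than read from the datashare:<br />

```python
# Minimal parser for NCBI taxdump .dmp files (e.g. names.dmp, nodes.dmp).
# Fields are separated by "\t|\t" and each line is terminated by "\t|".

def parse_dmp_line(line):
    """Split one taxdump .dmp line into its raw string fields."""
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

def scientific_names(dmp_lines):
    """Map taxid -> scientific name from names.dmp-style lines."""
    names = {}
    for line in dmp_lines:
        taxid, name_txt, _unique, name_class = parse_dmp_line(line)
        if name_class == "scientific name":
            names[int(taxid)] = name_txt
    return names

# Illustrative sample lines, laid out like the real names.dmp:
sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]
print(scientific_names(sample))  # {9606: 'Homo sapiens'}
```

The same line parser applies to <code>nodes.dmp</code>, whose first two fields are a taxid and its parent taxid.<br />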
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, the taxonomic manipulation toolkit TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. TaxonKit requires the NCBI taxonomy files to be in a particular location; to set it up with this datashare, simply create symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly, for example <code>echo 9606 | taxonkit lineage</code> to retrieve the full lineage of a taxonomic ID.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6, where ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
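<br />
As a minimal, hypothetical sketch of reading this format (the field names and the sample line below are invented for illustration; check the README for the authoritative column descriptions):<br />

```python
# Hypothetical sketch: reading one line of a PANTHER HMM classification
# file. The real files are tab-delimited with the 7 columns listed
# above; the sample line here is made up.

FIELDS = ["panther_id", "name", "molecular_function",
          "biological_process", "cellular_component",
          "protein_class", "pathway"]

def parse_classification_line(line):
    """Return one classification row as a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

sample = "PTHR12213:SF6\tAN EXAMPLE FAMILY NAME\t\t\t\t\t"
row = parse_classification_line(sample)
print(row["panther_id"])            # PTHR12213:SF6
print(":SF" in row["panther_id"])   # True, i.e. a subfamily entry
```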
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway. Some of the pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, matlab, and python versions of the CIFAR-10 test and training sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
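<br />
The batches under <code>cifar-10-batches-py</code> are pickled dictionaries. A minimal loading sketch (under Python 3 the dictionary keys come back as bytes, so <code>encoding="bytes"</code> is needed):<br />

```python
import pickle

def load_cifar_batch(path):
    """Load one CIFAR-10 python batch file as a dict.

    Each batch is a pickled dict; loaded under Python 3, its keys are
    bytes, e.g. b'data' (a 10000x3072 uint8 array, one row per image)
    and b'labels' (a list of 10000 class indices in 0..9).
    """
    with open(path, "rb") as fh:
        return pickle.load(fh, encoding="bytes")

# Example usage on the datashare copy:
# batch = load_cifar_batch("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")
# images, labels = batch[b"data"], batch[b"labels"]
```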
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, matlab, and python versions of the CIFAR-100 test and training sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=478
Graham Reference Dataset Repository
2021-09-30T15:46:24Z
<p>Jshleap: /* Bioinformatics */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more information at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions in https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
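<br />
In a job script this amounts to something like the sketch below; note that the <code>run_alphafold.py</code> invocation and its <code>--data_dir</code> flag come from the upstream AlphaFold repository and may differ for your installed version:<br />
<br />
```bash
# Point AlphaFold's data directory at the shared mount instead of a private copy.
export DOWNLOAD_DIR=/datashare/alphafold
# The run command then consumes it, e.g. (check the flags of your installed version):
#   python run_alphafold.py --data_dir=$DOWNLOAD_DIR ...
```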
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
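<br />
For example, a single record can be dumped in FASTA format from one of the shared databases with <code>blastdbcmd</code> (a sketch; the accession <code>P01308</code> is only an illustrative placeholder):<br />
<br />
```bash
# load BLAST+ and extract one entry from the shared swissprot database as FASTA
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -outfmt %f > one_entry.fasta
```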
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
 #!/bin/bash<br />
 #SBATCH --time=02:00:00<br />
 #SBATCH --mem=32G<br />
 #SBATCH --cpus-per-task=8<br />
 #SBATCH --account=def-someuser<br />
 module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and its dependencies<br />
 (cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} && # copy the required database files (in this case nr) to $SLURM_TMPDIR<br />
 blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
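<br />
These gzipped FASTA files can be streamed without unpacking them. For example, counting sequences with <code>gunzip -c</code>, shown on a tiny stand-in file (on Graham you would point it at e.g. <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />
<br />
```bash
# count FASTA records in a gzipped file by streaming it with gunzip -c
printf '>seq1\nMKV\n>seq2\nGGA\n' | gzip > demo.fasta.gz  # tiny stand-in file
gunzip -c demo.fasta.gz | grep -c '^>'                    # prints 2
```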
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with some optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
 diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], they follow the same trimonthly update schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones hosted here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, you need to set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
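<br />
For example, a rerun of a failed search with a lower block size could look like this (a sketch; <code>-b</code> and <code>-c</code> are the memory/performance options documented in the DIAMOND wiki, and the values shown are illustrative only):<br />
<br />
```bash
# lower the block size (-b, in billions of sequence letters) to reduce memory use;
# -c sets the number of index chunks (more chunks also lower the memory footprint)
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1 -c 4
```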
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with BLASTDB, only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
 #!/bin/bash<br />
 #SBATCH --time=02:00:00<br />
 #SBATCH --mem=32G<br />
 #SBATCH --cpus-per-task=8<br />
 #SBATCH --account=def-someuser<br />
 module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
 cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
 diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the database, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
[http://eggnog5.embl.de/#/app/home EggNOG] is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the alignments, annotations, hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
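<br />
For example, to peek at the annotations for one taxonomic level (2 = Bacteria); this is a sketch, with the file name inferred from the <code>*_annotations.tsv.gz</code> pattern above:<br />
<br />
```bash
# preview the orthologous-group annotations for taxid 2 (Bacteria)
gunzip -c /datashare/EggNog/per_tax_level/2/2_annotations.tsv.gz | head
```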
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It is updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
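The <code>*.dmp</code> files above share one record format: fields are delimited by a tab-pipe-tab sequence and each line is terminated by a tab-pipe. A minimal Python sketch for splitting such records (the sample line follows the standard <code>names.dmp</code> layout):<br />

```python
def parse_dmp_line(line):
    """Split one NCBI taxonomy .dmp record into its fields.

    Records use "\t|\t" as the field delimiter and end with "\t|".
    This simple strip/split is a sketch; it assumes no field itself
    ends in tab or pipe characters.
    """
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")

# e.g. parse_dmp_line("9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n")
# -> ['9606', 'Homo sapiens', '', 'scientific name']
```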
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, we provide the taxonomic manipulation software TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly; for example, <code>echo 9606 | taxonkit lineage</code> prints the full lineage for taxid 9606.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
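Those seven columns can be pulled apart with a few lines of Python; a sketch using hypothetical field names and a made-up record (not taken from the real files):<br />

```python
# Hypothetical column names for the seven tab-delimited fields described above
COLUMNS = ["panther_id", "name", "molecular_function", "biological_process",
           "cellular_component", "protein_class", "pathway"]

def parse_classification(line):
    """Split one tab-delimited PANTHER HMM classification record into a dict."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(COLUMNS) - len(values))  # pad missing trailing columns
    return dict(zip(COLUMNS, values))

# Made-up record for illustration:
rec = parse_classification("PTHR12213:SF6\tEXAMPLE SUBFAMILY\t\t\t\t\t")
# rec["panther_id"] -> 'PTHR12213:SF6'
```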
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotations of the sequence associations with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
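The files under <code>cifar-10-batches-py</code> are pickled dictionaries with byte-string keys (in the real batches, <code>b"data"</code> holds the image array and <code>b"labels"</code> the class indices); a minimal loader sketch:<br />

```python
import pickle

def load_cifar_batch(path):
    """Load one CIFAR-10 python batch: a pickled dict keyed by byte strings."""
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")
    return batch[b"data"], batch[b"labels"]

# e.g. data, labels = load_cifar_batch("/datashare/CIFAR-10/cifar-10-batches-py/data_batch_1")
```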
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=477
Graham Reference Dataset Repository
2021-09-30T15:46:11Z
<p>Jshleap: /* Bioinformatics */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets would otherwise consume in their project accounts. These datasets are mounted on <code>/datashare/</code>. You can explore the top-level directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We provide their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
An example of a release subdirectory is:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of a YYYYMMDD.sequence.index file, the SNP calls, indel calls, etc. in these directories are based on alignments produced from the data listed in that YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more information at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this data while following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
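<br />
For example, the variable can be set in a job script before launching the pipeline described in those instructions (a minimal sketch):<br />
<br />
```bash
# Point AlphaFold's download directory at the shared datashare mount
# instead of a private copy of the databases.
export DOWNLOAD_DIR=/datashare/alphafold
echo "AlphaFold databases will be read from ${DOWNLOAD_DIR}"
```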
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
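<br />
As a sketch of the last point, FASTA sequences can be regenerated from a pre-formatted database with blastdbcmd (assuming the blast+ module is loaded; the output filename here is arbitrary):<br />
<br />
```bash
# Dump every sequence of the pre-formatted swissprot database back to FASTA.
# Guarded so the snippet is a no-op where blast+ is not loaded.
db=/datashare/BLASTDB/swissprot
if command -v blastdbcmd >/dev/null 2>&1; then
    blastdbcmd -db "$db" -entry all -out swissprot.fasta
else
    echo "load blast+ first: module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0"
fi
```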
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases, without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database files (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the database path, but it might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Like the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
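<br />
Because these files are gzip-compressed, they can be streamed without keeping a decompressed copy on disk. A sketch (the count reflects whatever the current release contains):<br />
<br />
```bash
# Count sequences in the compressed swissprot FASTA without decompressing to disk.
fasta=/datashare/BLAST_FASTA/swissprot.gz
if [ -f "$fasta" ]; then
    zcat "$fasta" | grep -c '^>'   # FASTA headers start with '>'
else
    echo "run this on Graham, where /datashare is mounted"
fi
```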
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and software levels. On SHARCNET systems we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
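<br />
For example, lowering <code>-b</code> below its default of 2.0 reduces memory and temporary disk usage roughly proportionally, at some cost in speed (a sketch; the value 1.0 and the file names are just illustrations):<br />
<br />
```bash
# Rerun a failing search with a smaller block size (-b); guarded so the
# snippet is a no-op where the DIAMOND module is not loaded.
bsize=1.0
if command -v diamond >/dev/null 2>&1; then
    diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b "$bsize"
else
    echo "load DIAMOND first: module load StdEnv/2020 diamond/2.0.9"
fi
```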
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly as the database path, but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a collection of biological information hosted by [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs], expanded to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy ftp]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated alongside the BLAST databases.<br />
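<br />
As an example of a direct search, the accession2taxid tables can be queried with standard tools (a sketch; the accession NP_000005 is just an illustration):<br />
<br />
```bash
# Look up the taxid for one protein accession in the compressed mapping table;
# guarded so the snippet is a no-op off the cluster.
map=/datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz
if [ -f "$map" ]; then
    zgrep -m 1 -w 'NP_000005' "$map" | cut -f3   # third column is the taxid
else
    echo "run this on Graham, where /datashare is mounted"
fi
```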
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomy manipulation tool called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the .taxonkit folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
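<br />
For example, once the symbolic links are in place, a taxid can be resolved to its full lineage (a sketch; 9606 is the Homo sapiens taxid):<br />
<br />
```bash
# Resolve a taxonomic ID to its lineage using the linked .dmp files;
# guarded so the snippet is a no-op where the taxonkit module is not loaded.
taxid=9606
if command -v taxonkit >/dev/null 2>&1; then
    echo "$taxid" | taxonkit lineage
else
    echo "load TaxonKit first: module load StdEnv/2020 taxonkit"
fi
```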
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0 of the PANTHER HMM library. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway. It includes metabolic pathways in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned, up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is also fully compatible with the ARB software and with many common programs like Phylip or PAUP, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
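Every archive in the tree is paired with a <code>.md5</code> checksum file. A quick way to verify a local copy after transfer (a sketch; assumes the usual two-column md5sum layout of digest followed by filename):<br />

```python
import hashlib
from pathlib import Path

def md5_matches(data_file, md5_file):
    """Compare a file's MD5 digest against its companion .md5 file.

    Assumes the .md5 file uses md5sum's "digest  filename" layout.
    """
    expected = Path(md5_file).read_text().split()[0]
    digest = hashlib.md5()
    with open(data_file, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected
```

For example, <code>md5_matches("SILVA_138.1_SSURef_opt.arb.gz", "SILVA_138.1_SSURef_opt.arb.gz.md5")</code> after copying both files from <code>/datashare/SILVA/ARB_files/</code>.<br />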
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
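The Python batches under <code>cifar-10-batches-py</code> are pickled dictionaries. A minimal loading sketch, assuming the CIFAR layout documented on the dataset page (3072-byte rows stored channel-major, red then green then blue):<br />

```python
import pickle

import numpy as np

def load_cifar10_batch(path):
    """Load one CIFAR-10 python batch, e.g.
    /datashare/CIFAR-10/cifar-10-batches-py/data_batch_1.

    Returns images as a (N, 32, 32, 3) uint8 array and labels as a list of ints.
    """
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")  # keys are bytes under Python 3
    data = np.asarray(batch[b"data"], dtype=np.uint8)
    # each 3072-byte row is 1024 red bytes, then 1024 green, then 1024 blue
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, batch[b"labels"]
```
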
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=476
Graham Reference Dataset Repository
2021-09-30T15:44:30Z
<p>Jshleap: </p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the project-space usage taken up by commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top-level directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[user@gra-login1 ~]$ ls -lL /datashare/<br />
total 848<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 9 jshleap sn_staff 141 Sep 28 01:58 alphafold<br />
drwxrwxr-x 36 jshleap sn_staff 98304 Sep 30 10:49 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 299 Jul 26 15:39 EggNog<br />
drwxr-xr-- 6 jshleap jshleap 143 Jul 28 15:45 GATK_resource_bundle<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
dr-xr-xr-x 20 jshleap sn_staff 4096 Sep 20 14:04 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxr-xr-x 4 jshleap jshleap 50 Aug 24 13:40 modulefiles<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 9546 jshleap sn_staff 581632 Aug 7 13:10 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 2021 PANTHER<br />
drwxrwxr-x 11 jshleap sn_staff 214 Aug 10 09:28 PFAM<br />
drwxrwxr-x 6 jshleap sn_staff 213 Aug 25 10:50 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 2021 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the 1000 Genomes README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in a release subdirectory's name was the date on which the release was made. Thereafter, release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== AlphaFold ===<br />
This space contains the data required by the AlphaFold software (more info at https://docs.computecanada.ca/wiki/AlphaFold). You can find more information about each dataset at https://github.com/deepmind/alphafold.<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
AlphaFold directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/alphafold<br />
├── bfd<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex<br />
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata<br />
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex<br />
├── mgnify<br />
│ └── mgy_clusters_2018_12.fa<br />
├── params<br />
│ ├── LICENSE<br />
│ ├── params_model_1.npz<br />
│ ├── params_model_1_ptm.npz<br />
│ ├── params_model_2.npz<br />
│ ├── params_model_2_ptm.npz<br />
│ ├── params_model_3.npz<br />
│ ├── params_model_3_ptm.npz<br />
│ ├── params_model_4.npz<br />
│ ├── params_model_4_ptm.npz<br />
│ ├── params_model_5.npz<br />
│ └── params_model_5_ptm.npz<br />
├── pdb70<br />
│ ├── md5sum<br />
│ ├── pdb70_a3m.ffdata<br />
│ ├── pdb70_a3m.ffindex<br />
│ ├── pdb70_clu.tsv<br />
│ ├── pdb70_cs219.ffdata<br />
│ ├── pdb70_cs219.ffindex<br />
│ ├── pdb70_hhm.ffdata<br />
│ ├── pdb70_hhm.ffindex<br />
│ └── pdb_filter.dat<br />
├── pdb_mmcif<br />
│ ├── mmcif_files<br />
│ └── obsolete.dat<br />
├── uniclust30<br />
│ └── uniclust30_2018_08<br />
└── uniref90<br />
└── uniref90.fasta<br />
<br />
9 directories, 29 files<br />
</pre><br />
</div><br />
</div><br />
<br />
To use this dataset following the instructions at https://docs.computecanada.ca/wiki/AlphaFold, set the <code>DOWNLOAD_DIR</code> variable to <code>/datashare/alphafold</code>.<br />
<br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in NCBI and are made available here as pre-formatted databases with the same structure as the <code>/db</code> directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
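For instance, sequences can be pulled back out of a pre-formatted database with <code>blastdbcmd</code>. A small wrapper sketch (the database path and accessions below are only placeholders, and the <code>blast+</code> module must be loaded so the binary is on <code>PATH</code>):<br />

```python
import subprocess

def blastdbcmd_args(db, accessions, out_fasta):
    """Build the argument list for NCBI's blastdbcmd utility.

    `db` is a database prefix such as /datashare/BLASTDB/nr;
    `accessions` is a list of sequence identifiers to extract.
    """
    return ["blastdbcmd", "-db", db,
            "-entry", ",".join(accessions),
            "-out", out_fasta]

def extract_fasta(db, accessions, out_fasta):
    # Runs blastdbcmd; raises if the extraction fails.
    subprocess.run(blastdbcmd_args(db, accessions, out_fasta), check=True)
```
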
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes (depending on the database), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR;<br />
# a pre-formatted database is split across several nr.* files<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are built from NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
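These gzipped FASTA files can be streamed without decompressing them to disk. A minimal sketch (the tiny sample file below stands in for a real file such as <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />

```shell
# Create a tiny sample gzipped FASTA; it stands in for the real
# /datashare/BLAST_FASTA files, which are far too large to copy casually
printf '>seq1 test protein\nMKV\n>seq2 another protein\nMTT\n' | gzip > sample.fasta.gz

# Count the sequences by streaming the compressed file (no temporary decompression)
zcat sample.fasta.gz | grep -c '^>'
```

The same pattern (<code>zcat … | …</code>) works for feeding these files into tools that read FASTA from standard input.<br />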
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. On SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail after running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. The copy adds between 5 and 30 minutes (depending on the database), so only do this when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for the transfer. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that the job was launched from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but this might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and its annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, holding the hidden Markov model profiles, alignments, annotations, and phylogenetic trees for that level. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
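As a sketch of how the per-taxon tables can be consumed, the gzipped TSV files can be streamed directly. The sample data below stands in for the real <code>/datashare/EggNog/per_tax_level</code> tree, and the column layout is purely illustrative (check the real files for the actual columns); 2759 is the NCBI taxid for Eukaryota:<br />

```shell
TAX=2759  # NCBI taxid for Eukaryota
# Sample data standing in for /datashare/EggNog/per_tax_level/${TAX}
# (hypothetical columns: taxid, orthologous group, COG category)
mkdir -p per_tax_level/${TAX}
printf '2759\tCOG0001\tO\n' | gzip > per_tax_level/${TAX}/${TAX}_annotations.tsv.gz

# Stream the gzipped annotation table and pull the orthologous-group column
zcat per_tax_level/${TAX}/${TAX}_annotations.tsv.gz | cut -f2
```

On Graham, the same one-liner pointed at the datashare avoids unpacking anything into your project space.<br />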
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
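Most files in this tree ship with an <code>.md5</code> companion, which can be used to verify integrity after copying a file out of the datashare. A self-contained sketch (the sample file stands in for a real download such as <code>taxdump.tar.gz</code>):<br />

```shell
# Sample file and checksum standing in for a real file/.md5 pair from the tree
echo 'sample data' > taxdump.tar.gz
md5sum taxdump.tar.gz > taxdump.tar.gz.md5

# Verify the file against its checksum (the real .md5 files use the same format)
md5sum -c taxdump.tar.gz.md5
```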
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomic manipulation tool called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
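The <code>.dmp</code> files can also be queried directly without TaxonKit: they are tables whose columns are separated by <code>tab | tab</code>. A minimal sketch, using a sample line in the same format as the real <code>names.dmp</code>:<br />

```shell
# Sample line in names.dmp format (fields separated by "tab | tab");
# stands in for /datashare/NCBI_taxonomy/names.dmp
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > sample_names.dmp

# Look up the scientific name for taxid 9606
awk -F '\t[|]\t' '$1 == "9606" && $4 ~ /scientific name/ {print $2}' sample_names.dmp
```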
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways that have been assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
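Given that format, a family's annotation can be pulled out with standard tools. A minimal sketch (the sample line stands in for a real <code>PANTHER16.0_HMM_classifications</code> entry; all field values below are illustrative, not actual PANTHER annotations):<br />

```shell
# Sample tab-delimited line in the classification format described above
# (columns: PANTHER ID, name, mol. function, biol. process, cell. component,
#  protein class, pathway -- the values here are placeholders)
printf 'PTHR11258\tRIBOSOMAL PROTEIN S12\tMF\tBP\tCC\tPC\tPW\n' > sample_classifications.txt

# Print the curated name (column 2) for a given PANTHER ID (column 1)
awk -F '\t' '$1 == "PTHR11258" {print $2}' sample_classifications.txt
```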
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the fasta inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software, and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
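Each compressed SILVA file above ships with a <code>.md5</code> companion, which lets you confirm that a copy of the file completed intact. Below is a minimal, self-contained sketch of that check using a stand-in file; on Graham you would run <code>md5sum -c</code> against the real <code>*.md5</code> files under the SILVA mount instead.<br />

```bash
# Minimal sketch: verify a file against its .md5 companion, as shipped by SILVA.
# A stand-in file is created here so the commands can be tried anywhere;
# on Graham, point md5sum -c at the real *.md5 files instead.
workdir=$(mktemp -d)
cd "$workdir"
printf 'example data\n' > tax_example.txt.gz          # stand-in for a dataset file
md5sum tax_example.txt.gz > tax_example.txt.gz.md5    # same "<hash>  <name>" format
md5sum -c tax_example.txt.gz.md5                      # reports OK on an intact copy
status=$?
```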
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
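Since the archives above bundle each variant of the dataset, jobs typically unpack one archive into node-local storage rather than reading many small files over NFS. The following is a self-contained sketch of that staging step using a tiny stand-in archive; on Graham the source would be <code>/datashare/CIFAR-10/cifar-10-python.tar.gz</code> and the destination <code>$SLURM_TMPDIR</code>.<br />

```bash
# Build a tiny stand-in archive mimicking the CIFAR-10 layout, then unpack it
# into a scratch directory -- the same tar command you would run in a job with
# /datashare/CIFAR-10/cifar-10-python.tar.gz and $SLURM_TMPDIR (assumed paths).
src=$(mktemp -d)
scratch=$(mktemp -d)                                   # stand-in for $SLURM_TMPDIR
mkdir -p "$src/cifar-10-batches-py"
touch "$src/cifar-10-batches-py/data_batch_1" "$src/cifar-10-batches-py/test_batch"
tar czf "$src/cifar-10-python.tar.gz" -C "$src" cifar-10-batches-py
tar xzf "$src/cifar-10-python.tar.gz" -C "$scratch"    # the staging step
nfiles=$(ls "$scratch/cifar-10-batches-py" | wc -l)
echo "$nfiles"
```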
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
An example of a release subdirectory is /datashare/1000genomes/release/2008_12/.<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR;<br />
# archiving with relative paths keeps the volumes from being unpacked under<br />
# ${SLURM_TMPDIR}/datashare/BLASTDB/<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR}<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>blastp</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
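The tar-pipe staging pattern can be tried anywhere with stand-in files. The sketch below copies a set of database volume files (hypothetical names) into a scratch directory in one stream, archiving with relative names so nothing lands under a <code>datashare/BLASTDB</code> prefix:<br />

```bash
# Self-contained sketch of the tar-pipe used to stage BLAST volume files.
# Stand-in volume files are created here; on Graham the source directory
# would be /datashare/BLASTDB and the target $SLURM_TMPDIR.
dbdir=$(mktemp -d)       # stand-in for /datashare/BLASTDB
scratch=$(mktemp -d)     # stand-in for $SLURM_TMPDIR
touch "$dbdir/nr.00.phr" "$dbdir/nr.00.pin" "$dbdir/nr.00.psq"
# archive with relative names, then unpack into scratch in one stream
(cd "$dbdir" && tar cf - nr.*) | tar xf - -C "$scratch"
ls "$scratch"
```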
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
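Because these FASTA files are gzip-compressed, most tools can consume them as a stream, so they rarely need to be decompressed to disk. A minimal sketch with stand-in data (on Graham you would stream, e.g., <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />

```bash
# Count sequences in a gzip-compressed FASTA without writing a decompressed
# copy to disk. Stand-in data is generated here; substitute a real file such
# as /datashare/BLAST_FASTA/swissprot.gz on Graham.
tmp=$(mktemp -d)
printf '>seq1\nMKT\n>seq2\nGGA\n' | gzip > "$tmp/swissprot.gz"
nseq=$(zcat "$tmp/swissprot.gz" | grep -c '^>')   # each '>' header starts a sequence
echo "$nseq"
```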
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones hosted here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>diamond</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNog's taxids. The <code>per_tax_level</code> folder contains a series of folders labeled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNog's clustering.<br />
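As a sketch of how to peek at one tax level (taxid 9443, Primates, is used purely as an example; any ID from the listing works), assuming the mount is available:<br />

```shell
# Real inspection on Graham would look like (illustrative paths):
#   ls /datashare/EggNog/per_tax_level/9443/
#   zcat /datashare/EggNog/per_tax_level/9443/9443_annotations.tsv.gz | head
# Self-contained demonstration that the gzip'd TSVs stream through zcat/cut:
printf 'OG0001\tsome annotation\n' | gzip -c | zcat | cut -f1
```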
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
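The <code>accession2taxid</code> files are plain gzip'd TSVs, so they can be queried without any special tooling. A minimal sketch (the column layout is accession, accession.version, taxid, gi; the sample line below is illustrative):<br />

```shell
# Against the real file one would run, e.g.:
#   zgrep -m1 $'^NP_000005\t' /datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz
# Self-contained demonstration of extracting the taxid column:
printf 'NP_000005\tNP_000005.3\t9606\t4557225\n' | awk -F'\t' '{print $1, "-> taxid", $3}'
```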
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, we provide the taxonomic manipulation software TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files in a particular location; to set it up with this datashare, simply add symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
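For example, a typical lineage query once the links are in place (the module name and taxid shown are for illustration):<br />

```shell
# On Graham (illustrative; requires the module and the ~/.taxonkit links above):
#   module load StdEnv/2020 taxonkit
#   echo 9606 | taxonkit lineage   # 9606 is the NCBI taxid for Homo sapiens
# The underlying names.dmp format is tab-pipe-tab delimited; a self-contained
# look at a sample record (content illustrative):
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' | awk -F'\t[|]\t' '{print $1, "=>", $2}'
```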
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
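Since the classification files are tab-delimited, standard tools can slice them directly. A minimal sketch (the sample record below is invented for illustration; the real files are at the path above):<br />

```shell
# Against the real file, e.g.:
#   cut -f1,2 /datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications | head
# Self-contained demonstration pulling the PANTHER ID and name columns:
printf 'PTHR12213:SF6\tEXAMPLE FAMILY NAME\tGO:0004842\n' | cut -f1,2
```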
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
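The FASTA exports carry the taxonomy in the header line, so simple streaming tools work directly on them. A sketch (the header format shown is illustrative):<br />

```shell
# A real inspection on Graham would be, e.g.:
#   zcat /datashare/SILVA/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | head -2
# Self-contained demonstration: counting sequences in a FASTA stream:
printf '>ACC001.1.1500 Bacteria;Proteobacteria\nACGU\n>ACC002.1.1400 Archaea;Crenarchaeota\nGGCU\n' | grep -c '^>'
```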
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
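Because <code>/datashare</code> is an NFS mount, a common pattern is to copy the tarball to node-local storage inside a job and unpack it there (a sketch; the <code>$SLURM_TMPDIR</code> steps assume a running Slurm job):<br />

```shell
# Inside a job script (illustrative):
#   cp /datashare/CIFAR-10/cifar-10-python.tar.gz "$SLURM_TMPDIR"
#   tar -xzf "$SLURM_TMPDIR/cifar-10-python.tar.gz" -C "$SLURM_TMPDIR"
# Self-contained demonstration of the same tar round-trip on dummy data:
tmp=$(mktemp -d)
mkdir "$tmp/cifar-10-batches-py"
echo demo > "$tmp/cifar-10-batches-py/data_batch_1"
tar -czf "$tmp/cifar.tar.gz" -C "$tmp" cifar-10-batches-py
tar -tzf "$tmp/cifar.tar.gz"
rm -rf "$tmp"
```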
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===<br />
See https://docs.computecanada.ca/wiki/VoxCeleb</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=474
Graham Reference Dataset Repository
2021-09-30T15:12:12Z
<p>Jshleap: /* ImageNet */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We provide their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
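The population metadata shipped with phase 3 is plain TSV and can be inspected straight from the mount. A sketch (the sample rows are illustrative; the real file is <code>/datashare/1000genomes/phase3/20131219.superpopulations.tsv</code>):<br />

```shell
# Against the real file, e.g.:
#   column -t /datashare/1000genomes/phase3/20131219.superpopulations.tsv
# Self-contained demonstration of reading a two-column TSV:
printf 'EUR\tEuropean\nAFR\tAfrican\n' | awk -F'\t' '{print $1, "=", $2}'
```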
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, README and index files within the collection's directory provide information on that collection. Under each collection directory there is a data directory, under which files are organised by population and then by sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories holding analysis result sets, plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in a release subdirectory's name was the date on which that release was made. Thereafter, release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
An example of a release subdirectory is:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
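The correspondence between a dated release directory and its sequence index file is purely mechanical. A minimal sketch of deriving one name from the other (the path is taken from the example above; the helper itself is illustrative, not part of the datashare):<br />

```shell
#!/bin/bash
# Derive the name of the sequence.index file matching a dated release
# directory (illustrative helper based on the naming convention above).
release_dir="/datashare/1000genomes/release/20100804"
date_part=$(basename "$release_dir")          # 20100804
index_file="${date_part}.sequence.index"      # 20100804.sequence.index
echo "$index_file"
```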
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features in the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases, without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR; archiving from<br />
# within the source directory keeps the paths relative<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
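The tar-based staging used above works for any multi-volume database. A generic sketch of the pattern, demonstrated with placeholder files in temporary directories (the directory and file names here are stand-ins, not the actual Graham paths):<br />

```shell
#!/bin/bash
# Generic sketch of staging a multi-volume database to fast local storage,
# demonstrated with placeholder files (not the real /datashare paths).
src=$(mktemp -d)    # stands in for /datashare/BLASTDB
dst=$(mktemp -d)    # stands in for $SLURM_TMPDIR
touch "$src"/nr.00.phr "$src"/nr.00.pin "$src"/nr.00.psq

# Archive from within the source directory so the tarball holds relative
# paths and the files land directly in the destination, not a nested tree.
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)

ls "$dst"
```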
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
As with the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
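Because the files are gzipped, they can be streamed with <code>gunzip -c</code> rather than decompressed to disk, e.g. to count sequences or to feed <code>makeblastdb</code> through process substitution. A small sketch of the streaming pattern on a placeholder file (the real files, such as nr.gz, are far too large to decompress in a home directory):<br />

```shell
#!/bin/bash
# Count sequences in a gzipped FASTA without writing the decompressed
# data to disk (demonstrated on a tiny placeholder file, not the real nr.gz).
fasta=$(mktemp --suffix=.gz)
printf '>seq1\nMKV\n>seq2\nMTT\n' | gzip -c > "$fasta"

nseq=$(gunzip -c "$fasta" | grep -c '^>')
echo "$nseq sequences"
```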
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy ftp]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It is updated along with the BLAST databases.<br />
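The accession2taxid maps are plain gzipped TSV files, so a taxid can be looked up for an accession directly with standard tools. A sketch against a placeholder file that mimics the real column layout (accession, accession.version, taxid, gi); the record itself is illustrative, not taken from the actual files:<br />

```shell
#!/bin/bash
# Look up the taxid for an accession in an accession2taxid-style file.
# Demonstrated on a placeholder file with the real column layout:
#   accession <tab> accession.version <tab> taxid <tab> gi
map=$(mktemp --suffix=.gz)
printf 'accession\taccession.version\ttaxid\tgi\nA00002\tA00002.1\t9913\t2\n' | gzip -c > "$map"

# Stream-decompress and match on the versioned accession (column 2).
taxid=$(gunzip -c "$map" | awk -F'\t' '$2 == "A00002.1" {print $3}')
echo "$taxid"
```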
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada clusters, a taxonomic manipulation tool called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files in a particular location. To set it up with this datashare, simply add symbolic links in the <code>~/.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use <code>taxonkit</code> directly.<br />
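The symlink setup above can be sketched and verified generically; here a temporary directory stands in for <code>/datashare/NCBI_taxonomy</code> and another for the home directory, so the pattern runs anywhere (the file names match the real <code>.dmp</code> files; everything else is a stand-in):<br />

```shell
#!/bin/bash
# Sketch of the ~/.taxonkit setup, using temporary stand-ins for
# /datashare/NCBI_taxonomy and $HOME so it can run anywhere.
src=$(mktemp -d)    # stands in for /datashare/NCBI_taxonomy
touch "$src"/names.dmp "$src"/nodes.dmp "$src"/merged.dmp "$src"/delnodes.dmp

kitdir=$(mktemp -d)/.taxonkit    # stands in for ~/.taxonkit
mkdir -p "$kitdir"
ln -s "$src"/*.dmp "$kitdir"/    # link rather than copy: the files are large

ls "$kitdir"
```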
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6 (":SF" indicates the subfamily ID)<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
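Given the tab-delimited layout above, individual columns can be pulled out with <code>cut</code>. A sketch on a single placeholder record (the family name and GO terms shown are illustrative, not real entries):<br />

```shell
#!/bin/bash
# Extract the PANTHER ID (column 1) and name (column 2) from a
# hmm_classifications-style tab-delimited record (record is illustrative).
line=$(printf 'PTHR12213:SF6\tEXAMPLE FAMILY\tGO:0003674\tGO:0008150\tGO:0005575')

id=$(printf '%s' "$line" | cut -f1)
name=$(printf '%s' "$line" | cut -f2)
echo "$id -> $name"
```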
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway. Some pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs like Phylip or PAUP via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
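Most archives under <code>/datashare/SILVA</code> ship with a companion <code>.md5</code> file, so after copying an archive you can verify it with <code>md5sum -c</code>. The sketch below demonstrates the pattern on a stand-in file; on Graham you would run the same check against one of the real SILVA checksum files listed above.<br />

```shell
# Demonstration of verifying a copied archive against its .md5 companion file.
# The file here is a stand-in; on Graham you would run `md5sum -c` against,
# e.g., SILVA_138.1_SSURef_opt.arb.gz.md5 after copying both files together.
workdir=$(mktemp -d)
cd "$workdir"
printf 'example data\n' > archive.gz       # stand-in for a real SILVA archive
md5sum archive.gz > archive.gz.md5         # SILVA ships these checksum files
md5sum -c archive.gz.md5                   # prints "archive.gz: OK" if intact
```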
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information; many entries are derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 training and test sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
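Since <code>/datashare</code> is an NFS mount, reading many small batch files over the network can be slow; a common pattern is to extract one of the provided tarballs straight into node-local storage. The sketch below uses stand-in paths: on Graham the source would be <code>/datashare/CIFAR-10/cifar-10-python.tar.gz</code> and the destination would be <code>$SLURM_TMPDIR</code> on the compute node.<br />

```shell
# Sketch of the staging pattern with stand-in paths; on Graham, src would be
# /datashare/CIFAR-10 and dest would be $SLURM_TMPDIR on the compute node.
src=$(mktemp -d)   # stand-in for /datashare/CIFAR-10
dest=$(mktemp -d)  # stand-in for $SLURM_TMPDIR
# build a miniature tarball mimicking cifar-10-python.tar.gz
mkdir -p "$src/cifar-10-batches-py"
touch "$src/cifar-10-batches-py/data_batch_1"
tar czf "$src/cifar-10-python.tar.gz" -C "$src" cifar-10-batches-py
# the actual staging step: extract the archive directly onto local disk
tar xzf "$src/cifar-10-python.tar.gz" -C "$dest"
ls "$dest/cifar-10-batches-py"   # data_batch_1
```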
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
See https://docs.computecanada.ca/wiki/ImageNet<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=472
Graham Reference Dataset Repository
2021-09-30T15:09:16Z
<p>Jshleap: Jshleap moved page Graham’s Reference Dataset Repository to Graham Reference Dataset Repository: Apostrophe is not a good idea</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check it for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the 1000 Genomes Project's '''own''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes of overhead (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database files (in this case nr) to $SLURM_TMPDIR, without the /datashare/BLASTDB prefix<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the database path, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
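Because these FASTA files are large, you rarely want to decompress them to disk; they can instead be streamed with <code>gunzip -c</code> (process substitution over such a stream is how the DIAMOND databases in this repository are built). A minimal sketch with a stand-in file:<br />

```shell
# Streaming a gzip-compressed FASTA without writing the decompressed copy to
# disk. The tiny file here is a stand-in for, e.g.,
# /datashare/BLAST_FASTA/swissprot.gz on Graham.
fasta="$(mktemp -d)/tiny.fasta.gz"
printf '>seq1\nMKVL\n>seq2\nMGLS\n' | gzip > "$fasta"
gunzip -c "$fasta" | grep -c '^>'   # count sequences in the stream: prints 2
```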
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software levels. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either resource, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
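For example, a sketch of a lower-memory invocation (the query and output names are placeholders; adjust the values to your data):<br />

```shell
# Run a protein search with a reduced block size (-b, default 2.0);
# smaller values lower memory and temporary disk usage at some cost
# in speed. YOURREADS.fasta and AN_OUTPUT.tsv are placeholders.
diamond blastp \
    -d /datashare/DIAMONDDB_2.0.9/nr \
    -q YOURREADS.fasta \
    -o AN_OUTPUT.tsv \
    -b 1.0
```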
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. The copy adds between 5 and 30 minutes (depending on the database), so do this only when you expect your DIAMOND run to take longer than one hour. Unlike [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that a plain <code>cp</code> is more efficient than <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
 module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
 diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> and skip the copy, but it will likely be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of orthologous groups and functional annotations hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, i.e. the hidden Markov model profiles, alignments, annotations, and phylogenetic trees for that taxonomic level. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
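For instance, a sketch of exploring one taxonomic level (the taxid and the exact archive name follow the naming pattern above but are assumptions, as is the target directory):<br />

```shell
# List the files available for taxid 2 (Bacteria)
ls /datashare/EggNog/per_tax_level/2/

# Unpack the per-group HMMs into a working directory
# (~/scratch/eggnog_hmms is a placeholder path)
mkdir -p ~/scratch/eggnog_hmms
tar -xf /datashare/EggNog/per_tax_level/2/2_hmms.tar -C ~/scratch/eggnog_hmms
```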
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple programs (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct look-up of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
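For example, a sketch of a direct accession look-up (the accession number is a placeholder):<br />

```shell
# Map a GenBank nucleotide accession to its taxid using the
# compressed mapping file (columns: accession, accession.version,
# taxid, gi). -m 1 stops after the first match.
zgrep -m 1 "^AF086833" \
    /datashare/NCBI_taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
```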
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, we provide the taxonomic manipulation software [https://github.com/shenwei356/taxonkit TaxonKit]. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It expects the NCBI taxonomy files in a particular location; to set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly.<br />
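For example, a sketch assuming the symbolic links above are in place:<br />

```shell
# Print the complete lineage for taxid 9606 (Homo sapiens)
echo 9606 | taxonkit lineage

# Convert a species name to its taxid
echo "Escherichia coli" | taxonkit name2taxid
```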
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0. Each file contains the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6, where ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory Structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 test and training sets, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
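As with the bioinformatics databases, jobs that read CIFAR-10 repeatedly can benefit from staging it to node-local storage first. A minimal sketch using the tarball from the listing above (the stand-in branch exists only so the commands also run outside Graham, where <code>/datashare</code> and <code>$SLURM_TMPDIR</code> are unavailable):<br />

```shell
# Stage the Python version of CIFAR-10 to node-local disk inside a Slurm job.
DEST=${SLURM_TMPDIR:-$(mktemp -d)}           # $SLURM_TMPDIR exists only inside jobs
TARBALL=/datashare/CIFAR-10/cifar-10-python.tar.gz
if [ ! -e "$TARBALL" ]; then                 # off-cluster stand-in with the same layout
    WORK=$(mktemp -d)
    mkdir -p "$WORK/cifar-10-batches-py"
    touch "$WORK/cifar-10-batches-py/data_batch_1"
    TARBALL=$WORK/cifar-10-python.tar.gz
    tar czf "$TARBALL" -C "$WORK" cifar-10-batches-py
fi
cp "$TARBALL" "$DEST/"
tar xzf "$DEST/cifar-10-python.tar.gz" -C "$DEST"
ls "$DEST/cifar-10-batches-py"
```

Inside a real job script only the <code>cp</code> and <code>tar xzf</code> lines are needed.<br />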
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-100 training and test sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
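The binary distribution follows the layout documented on the CIFAR site: each record in <code>train.bin</code> and <code>test.bin</code> is one coarse-label byte, one fine-label byte, then 3072 image bytes. A sketch of peeking at the label bytes with <code>od</code>, using a hand-built stand-in record (on Graham, point the same command at <code>/datashare/CIFAR-100/cifar-100-binary/train.bin</code>):<br />

```shell
# Build one stand-in record: <1 byte coarse label><1 byte fine label><3072 pixels>
printf '\013\004' > record.bin           # coarse label 11, fine label 4
head -c 3072 /dev/zero >> record.bin     # blank 32x32x3 image payload
od -An -tu1 -N2 record.bin               # shows the two label bytes (11 and 4)
```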
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=431
Graham Reference Dataset Repository
2021-08-27T15:25:44Z
<p>Jshleap: /* UNIPROT */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the <code>/db</code> directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features in the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and its dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR; running tar<br />
# from inside the source directory drops the files directly into $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
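Since these files are gzip-compressed, routine inspection can be done in a stream without writing an uncompressed copy to disk. A sketch on a stand-in file (on Graham, substitute e.g. <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />

```shell
# Count the sequences in a gzipped FASTA by streaming it through zcat.
printf '>seq1\nMKV\n>seq2\nGGA\n' | gzip > demo.fa.gz
zcat demo.fa.gz | grep -c '^>'    # prints 2
```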
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, you need to set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block-size parameter <code>-b</code>].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
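The <code>*_hmms.tar</code> and <code>*_algs.tar</code> bundles inside each taxon folder can be listed and unpacked per member rather than wholesale. A sketch on a stand-in bundle (the file names are illustrative; on Graham, run <code>tar tf</code> against a real bundle under e.g. <code>/datashare/EggNog/per_tax_level/2759/</code>):<br />

```shell
# Build a stand-in one-member bundle, then list and extract a single member.
printf '>g1\nMKV\n' > OG0001.faa
tar cf 2759_raw_algs.tar OG0001.faa
tar tf 2759_raw_algs.tar              # list members without extracting
tar xf 2759_raw_algs.tar OG0001.faa   # extract just one member
```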
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
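The <code>*.dmp</code> files are plain tab-delimited tables, so taxonomic IDs can be looked up directly. A minimal sketch, using a made-up line that imitates the <code>names.dmp</code> layout (on Graham you would read <code>/datashare/NCBI_taxonomy/names.dmp</code>):<br />

```python
# names.dmp rows hold tax_id | name | unique name | name class, with
# fields separated by "\t|\t" and a trailing "\t|". The sample line is
# illustrative; on Graham, read /datashare/NCBI_taxonomy/names.dmp.
sample = "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|"

def parse_names_line(line):
    # Drop the trailing "\t|" terminator, then split on the separator.
    fields = line.rstrip("\t|").split("\t|\t")
    return {"tax_id": fields[0], "name": fields[1], "name_class": fields[3]}

record = parse_names_line(sample)
print(record["tax_id"], record["name"])  # 9606 Homo sapiens
```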
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems we provide the taxonomic manipulation toolkit [https://bioinf.shenwei.me/taxonkit/ TaxonKit]. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use TaxonKit directly, e.g. <code>echo 9606 | taxonkit lineage</code> to retrieve the full lineage for a taxonomic ID.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
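As a sketch of working with this layout, the family ID and curated name (columns 1 and 2) can be pulled out of a record. The record below is fabricated for illustration; the real files live under <code>/datashare/PANTHER/hmm_classifications/</code>:<br />

```python
# A made-up record following the tab-delimited column layout above;
# real files live under /datashare/PANTHER/hmm_classifications/.
record = "PTHR12213:SF6\tHypothetical family\tmf terms\tbp terms\tcc terms\tclass\tpathway"

columns = record.split("\t")
panther_id, name = columns[0], columns[1]

# ":SF" marks a subfamily; strip it to recover the parent family ID.
family_id = panther_id.split(":")[0]
print(family_id, name)  # PTHR12213 Hypothetical family
```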
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequences associated with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for complete proteomes, including those derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned, up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP via direct FASTA export or the ARB program. <br />
<br />
On Graham, we provide a copy of the latest release, which will be updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== UNIPROT ===<br />
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.<br />
<br />
On Graham, we keep the latest release of UniProt at <code>/datashare/UNIPROT</code>.<br />
<br />
==== Directory structure ====<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
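The Python batches are pickled dictionaries. A minimal loading sketch (the batch written below is a tiny stand-in with the same key layout; the real files are in <code>/datashare/CIFAR-10/cifar-10-batches-py/</code>):<br />

```python
import pickle

# Tiny stand-in batch with the documented key layout: b"data" holds
# 3072-byte rows (32x32 pixels x 3 channels) and b"labels" the classes.
fake_batch = {b"data": [[0] * 3072], b"labels": [7]}
with open("data_batch_sample", "wb") as fh:
    pickle.dump(fake_batch, fh)

def load_batch(path):
    # The official batches were pickled under Python 2, so Python 3
    # needs encoding="bytes" to read them back.
    with open(path, "rb") as fh:
        return pickle.load(fh, encoding="bytes")

batch = load_batch("data_batch_sample")
print(len(batch[b"data"][0]), batch[b"labels"])  # 3072 [7]
```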
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=430
Graham Reference Dataset Repository
2021-08-25T15:08:14Z
<p>Jshleap: /* Bioinformatics */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
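As a hedged sketch of the last point (the accession used here is purely illustrative), sequences can be pulled out of a pre-formatted database with <code>blastdbcmd</code>:<br />
<br />
```shell
# Extract one entry from the pre-formatted swissprot database as FASTA.
# P01308 is only an illustrative accession; substitute your own.
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out insulin.fasta
```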
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. The following databases are included:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST+ and its dependencies<br />
# copy the required database volumes (in this case nr.*) to $SLURM_TMPDIR;<br />
# cd-ing into the source first strips the path prefix so the files land directly in $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but this might be slower than having the database on the node's local disk.<br />
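The tar pipe used in the script above stages many database volume files in one stream while stripping the source path prefix. Below is a minimal self-contained sketch of the pattern, in which temporary directories stand in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>:<br />
<br />
```shell
# Create stand-in source and destination directories.
src=$(mktemp -d)   # stands in for /datashare/BLASTDB
dst=$(mktemp -d)   # stands in for $SLURM_TMPDIR
touch "$src"/nr.00.phr "$src"/nr.00.pin "$src"/nr.00.psq "$src"/nt.00.nhr
# Copy only the nr.* volumes; cd-ing into the source first strips the
# path prefix, so the files land directly in $dst, not under a nested path.
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)
ls "$dst"
```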
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. On SHARCNET systems we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
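As a hedged sketch (the query and output file names are placeholders carried over from the example below), a rerun with a reduced block size could look like:<br />
<br />
```shell
# Lowering -b (block size, in billions of sequence letters; the blastp
# default is 2.0) reduces memory and temporary-disk use at some speed cost.
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1.0
```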
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with BLASTDB, only one file needs to be moved, which means that <code>cp</code> is more efficient than piping through <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but this might be slower than having the database on the node's local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> files with the alignments, annotations, hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It is updated along with the BLAST databases.<br />
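For direct lookups, note that the <code>.dmp</code> files separate fields with a tab-pipe-tab sequence. A self-contained sketch of pulling a scientific name by taxid, in which a two-line sample file stands in for <code>/datashare/NCBI_taxonomy/names.dmp</code>:<br />
<br />
```shell
# Tiny stand-in for /datashare/NCBI_taxonomy/names.dmp; real rows use the
# same "tab | tab" field layout.
names=$(mktemp)
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n'  >  "$names"
printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n' >> "$names"
# Splitting on plain tabs, the name is field 3 and the name class field 7.
awk -F'\t' '$1 == "9606" && $7 == "scientific name" {print $3}' "$names"
# -> Homo sapiens
```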
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada systems, a taxonomy manipulation tool called TaxonKit is available. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links in the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use <code>taxonkit</code> directly.<br />
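For instance (a minimal sketch, assuming the symbolic links above are in place):<br />
<br />
```shell
module load StdEnv/2020 taxonkit
# print the full lineage for one or more taxids read from stdin
echo 9606 | taxonkit lineage
```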
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
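Because the classification files are plain tab-delimited text, they can be queried with standard tools. A self-contained sketch (the sample row and its family name are fabricated; only the seven-column layout follows the description above):<br />
<br />
```shell
# One fabricated row using the 7-column layout described above.
f=$(mktemp)
printf 'PTHR11258\tSOME FAMILY NAME\tMF terms\tBP terms\tCC terms\tPC terms\tPathway\n' > "$f"
# Column 2 holds the curated name for a given PANTHER ID (column 1).
awk -F'\t' '$1 == "PTHR11258" {print $2}' "$f"
# -> SOME FAMILY NAME
```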
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
The SILVA databases are developed and maintained by the [http://www.microbial-genomics.de/ Microbial Genomics and Bioinformatics Research Group] in Bremen, Germany, in cooperation with the company [http://www.ribocon.com/ Ribocon GmbH].<br />
<br />
SILVA provides fully aligned and up-to-date small (16S/18S, SSU) and large (23S/28S, LSU) subunit ribosomal RNA "Parc" databases, as well as preconfigured subsets of only high-quality, full-length sequences as ARB & FASTA files (SSU/LSU Ref). It is fully compatible with the ARB software and with many common programs such as Phylip or PAUP, via direct FASTA export or the ARB program.<br />
<br />
On Graham, we provide a copy of the latest release, which is updated twice a year.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
Silva directory tree:<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/SILVA<br />
├── ARB_files<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_30_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz<br />
│ ├── SILVA_138.1_LSURef_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz.md5<br />
│ ├── SILVA_138.1_SSURef_opt.arb.gz<br />
│ └── SILVA_138.1_SSURef_opt.arb.gz.md5<br />
├── CITATION.txt<br />
├── current<br />
│ ├── sina-1.2.11_centos5_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_amd64.tgz<br />
│ ├── sina-1.2.11_ubuntu1004_i386.tgz<br />
│ ├── sina-1.2.11_ubuntu1204_amd64.tgz<br />
│ └── sina-1.2.11_ubuntu1204_i386.tgz<br />
├── Exports<br />
│ ├── accession<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_LSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz<br />
│ │ ├── SILVA_138.1_SSUParc.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz<br />
│ │ ├── SILVA_138.1_SSURef.acs.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.acs.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.acs.gz.md5<br />
│ ├── cluster<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.clstr.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.clstr.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.clstr.gz.md5<br />
│ ├── country_locality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz<br />
│ │ ├── SILVA_138.1_SSURef.country_locality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.country_locality.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.country_locality.gz.md5<br />
│ ├── full_metadata<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSUParc.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz<br />
│ │ ├── SILVA_138.1_SSURef.full_metadata.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.full_metadata.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.full_metadata.gz.md5<br />
│ ├── geographic_location<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSUParc.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz<br />
│ │ ├── SILVA_138.1_SSURef.geographic_location.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.geographic_location.gz<br />
│ │ └── SILVA_138.1_SSURef_Nr99.geographic_location.gz.md5<br />
│ ├── LICENSE.txt<br />
│ ├── quality<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_LSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz<br />
│ │ ├── SILVA_138.1_LSURef.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz<br />
│ │ ├── SILVA_138.1_SSUParc.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz<br />
│ │ ├── SILVA_138.1_SSURef_Nr99.quality.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.quality.gz<br />
│ │ └── SILVA_138.1_SSURef.quality.gz.md5<br />
│ ├── rast<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz<br />
│ │ ├── SILVA_138.1_LSURef.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rast.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rast.gz<br />
│ │ └── SILVA_138.1_SSURef.rast.gz.md5<br />
│ ├── README.txt<br />
│ ├── rnac<br />
│ │ ├── LICENSE.txt<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_LSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz<br />
│ │ ├── SILVA_138.1_LSURef.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz<br />
│ │ ├── SILVA_138.1_SSUParc.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz<br />
│ │ ├── SILVA_138.1_SSURef_NR99.rnac.gz.md5<br />
│ │ ├── SILVA_138.1_SSURef.rnac.gz<br />
│ │ └── SILVA_138.1_SSURef.rnac.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_LSURef_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSUParc_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_full_align_trunc.fasta.gz.md5<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz<br />
│ ├── SILVA_138.1_SSURef_tax_silva_trunc.fasta.gz.md5<br />
│ └── taxonomy<br />
│ ├── LICENSE.txt<br />
│ ├── ncbi<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_embl-ebi_ena_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── taxmap_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_lsu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz<br />
│ │ ├── tax_ncbi-species_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_parc_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz<br />
│ │ ├── tax_ncbi_ssu_ref_138.1.txt.gz.md5<br />
│ │ ├── tax_ncbi_ssu_ref_nr99_138.1.txt.gz<br />
│ │ └── tax_ncbi_ssu_ref_nr99_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_lsu_ref_nr_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_parc_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_138.1.txt.gz.md5<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz<br />
│ ├── taxmap_slv_ssu_ref_nr_138.1.txt.gz.md5<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_lsu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_lsu_138.1.diff.gz<br />
│ ├── tax_slv_lsu_138.1.diff.gz.md5<br />
│ ├── tax_slv_lsu_138.1.map.gz<br />
│ ├── tax_slv_lsu_138.1.map.gz.md5<br />
│ ├── tax_slv_lsu_138.1.tre.gz<br />
│ ├── tax_slv_lsu_138.1.tre.gz.md5<br />
│ ├── tax_slv_lsu_138.1.txt.gz<br />
│ ├── tax_slv_lsu_138.1.txt.gz.md5<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz<br />
│ ├── tax_slv_ssu_138.1.acc_taxid.gz.md5<br />
│ ├── tax_slv_ssu_138.1.diff.gz<br />
│ ├── tax_slv_ssu_138.1.diff.gz.md5<br />
│ ├── tax_slv_ssu_138.1.map.gz<br />
│ ├── tax_slv_ssu_138.1.map.gz.md5<br />
│ ├── tax_slv_ssu_138.1.tre.gz<br />
│ ├── tax_slv_ssu_138.1.tre.gz.md5<br />
│ ├── tax_slv_ssu_138.1.txt.gz<br />
│ └── tax_slv_ssu_138.1.txt.gz.md5<br />
├── Fields_description<br />
│ ├── LICENSE.txt<br />
│ ├── SILVA_description_of_fields_21_09_2016.htm<br />
│ └── SILVA_description_of_fields_21_09_2016.pdf<br />
├── LICENSE.txt<br />
├── README.txt<br />
└── VERSION.txt<br />
<br />
14 directories, 232 files<br />
</pre><br />
</div><br />
</div><br />
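Each SILVA file in the tree above ships with a matching <code>.md5</code> checksum, which <code>md5sum -c</code> can verify after you copy a file out of <code>/datashare</code>. A minimal sketch, using a scratch file in place of a real SILVA download (the file name is illustrative; on Graham the <code>.md5</code> file already exists next to each data file, so only the last command is needed):<br />

```shell
# Sketch of checksum verification after copying a SILVA file out of /datashare.
cd "$(mktemp -d)"
echo "dummy data" > SILVA_sample.fasta.gz                  # stand-in for a real SILVA_*.gz file
md5sum SILVA_sample.fasta.gz > SILVA_sample.fasta.gz.md5   # on Graham, this .md5 is already provided
md5sum -c SILVA_sample.fasta.gz.md5                        # prints "SILVA_sample.fasta.gz: OK" if intact
```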
<br />
=== UNIPROT ===<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-10 training and test sets, along with their labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
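For jobs, it is usually fastest to extract one of the tarballs above onto node-local storage rather than reading many small files over NFS. A minimal staging sketch (the guard makes it a harmless no-op on machines without <code>/datashare</code>; <code>SLURM_TMPDIR</code> is only set inside a Slurm job):<br />

```shell
SRC=/datashare/CIFAR-10
DEST=${SLURM_TMPDIR:-$(mktemp -d)}   # node-local scratch inside a job; a temp dir elsewhere
if [ -f "$SRC/cifar-10-python.tar.gz" ]; then
    # creates $DEST/cifar-10-batches-py with the batches shown in the tree above
    tar -xzf "$SRC/cifar-10-python.tar.gz" -C "$DEST"
fi
echo "CIFAR-10 staged under $DEST"
```

The same pattern applies to the MATLAB and binary tarballs, and to CIFAR-100 below.<br />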
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python versions of the CIFAR-100 training and test sets, along with their labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=419
Graham Reference Dataset Repository
2021-08-16T15:25:44Z
<p>Jshleap: /* = Usage with TaxonKit */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site]; it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the 1000 Genomes Project's own README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets, such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
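For example, the last point means a pre-formatted database can be inspected or dumped directly. A hedged sketch using <code>blastdbcmd -info</code> on the small <code>pdbaa</code> database (requires the <code>blast+</code> module from the Usage section below; the fallback branch keeps the sketch runnable elsewhere):<br />

```shell
if command -v blastdbcmd >/dev/null 2>&1; then
    # print the title, sequence count and build date of the pdbaa database
    blastdbcmd -db /datashare/BLASTDB/pdbaa -info
else
    echo "blastdbcmd not on PATH; load the blast+ module first"
fi
```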
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features in the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies<br />
# copy the required database (in this case nr, which is split across many files) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that <code>nr</code> is the required database.<br />
<br />
You can also run BLAST against <code>/datashare/BLASTDB/nr</code> directly (without copying), but it might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
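Since FASTA records start with <code>></code>, these compressed files can be streamed without unpacking them to disk, e.g. to count sequences. A sketch demonstrated on a tiny inline file; the same pipeline applies to, say, <code>/datashare/BLAST_FASTA/pdbaa.gz</code>:<br />

```shell
cd "$(mktemp -d)"
printf '>seq1\nACGT\n>seq2\nGGCC\n' | gzip > demo.fasta.gz   # stand-in for a BLAST_FASTA file
gunzip -c demo.fasta.gz | grep -c '^>'                       # prints 2: one line per record header
```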
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software levels. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail after running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
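A hedged sketch of such a retry with a reduced block size (<code>-b</code> defaults to 2.0; the input and output file names are illustrative, and the fallback branch keeps the sketch runnable where DIAMOND is not installed):<br />

```shell
if command -v diamond >/dev/null 2>&1; then
    # halving the block size roughly halves memory and temporary disk use, at some speed cost
    diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1.0
else
    echo "diamond not on PATH; load the diamond module first"
fi
```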
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that <code>nr</code> is the required database.<br />
<br />
You can also run DIAMOND against <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (without copying), but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of orthologous groups and functional annotations hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and extends it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and its annotations. The <code>gbff</code> folder contains annotations in GenBank format. The folder <code>id_mappings</code> contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of folders, labeled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the folder <code>raw_data</code> contains the homology/speciation events used in EggNOG's clustering.<br />
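As a minimal sketch (using a mock directory so the commands can run anywhere; on Graham, point them at <code>/datashare/EggNog/per_tax_level</code> instead), you can peek at a taxon's compressed annotations without extracting them:<br />

```shell
# Mock stand-in for /datashare/EggNog/per_tax_level; taxid 9604 (great apes)
# appears in the tree above, and file names follow the *_annotations.tsv.gz pattern.
mkdir -p mock/per_tax_level/9604
printf 'OG1\tsample annotation\n' | gzip > mock/per_tax_level/9604/9604_annotations.tsv.gz
# Stream the compressed table without writing an uncompressed copy to disk
gunzip -c mock/per_tax_level/9604/9604_annotations.tsv.gz | head -n 1
```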
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated along with the blast databases.<br />
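A common task is mapping an accession number to its taxid with the <code>accession2taxid</code> tables. A minimal sketch follows; the mock file mimics the real four-column format (accession, accession.version, taxid, and gi) with a made-up record so it can run anywhere. On Graham, run the lookup against <code>/datashare/NCBI_taxonomy/accession2taxid/nucl_gb.accession2taxid.gz</code> instead.<br />

```shell
# Mock accession2taxid table with the standard header and one made-up record
printf 'accession\taccession.version\ttaxid\tgi\nA00002\tA00002.1\t9913\t2\n' | gzip > mock.accession2taxid.gz
# Look up the taxid (column 3) for accession A00002
gunzip -c mock.accession2taxid.gz | grep -m1 '^A00002' | cut -f3
```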
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
In Compute Canada, we provide a taxonomy manipulation tool called TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location. To set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
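The column layout above can be exercised with standard text tools. A minimal sketch with a made-up record (not a real PANTHER entry):<br />

```shell
# One illustrative tab-delimited record in the format described above
printf 'PTHR12213:SF6\tHypothetical family\tGO:MF\tGO:BP\tGO:CC\tclass\tpathway\n' > sample_classification.tsv
# Column 1 is the PANTHER ID; a ":SF" suffix marks a subfamily
awk -F'\t' '$1 ~ /:SF/ {print $1}' sample_classification.tsv
```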
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway. It includes some metabolic pathways in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for complete proteomes, including those derived from the human, mouse, rat and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT ===<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
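As with the bioinformatics datasets, staging the archive on the job's local disk is usually fastest. A minimal sketch follows; the mock archive stands in for <code>/datashare/CIFAR-10/cifar-10-python.tar.gz</code> so it can run anywhere, and in a real job you would copy the real file and extract into <code>$SLURM_TMPDIR</code>.<br />

```shell
# Build a mock archive with the same layout as cifar-10-python.tar.gz
mkdir -p cifar-10-batches-py
touch cifar-10-batches-py/data_batch_1
tar -czf cifar-10-python.tar.gz cifar-10-batches-py
# Extract into a scratch directory (stands in for $SLURM_TMPDIR)
mkdir -p scratch
tar -xzf cifar-10-python.tar.gz -C scratch
ls scratch/cifar-10-batches-py
```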
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=418
Graham Reference Dataset Repository
2021-08-16T15:24:40Z
<p>Jshleap: /* Bioinformatics */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site]; it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/1000genomes<br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
Within the directory for each collection, README and index files provide information on that collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and its dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR;<br />
# cd into BLASTDB first so the archive holds relative paths and extracts flat<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also query <code>/datashare/BLASTDB/nr</code> directly (skipping the copy), but it might be slower than having the database on the node's local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter <code>-b</code>].<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This adds between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the node's local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/EggNog<br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple programs (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated with the BLAST databases.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/NCBI_taxonomy<br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
==== Usage with TaxonKit ====<br />
On Compute Canada clusters we provide the taxonomic manipulation toolkit TaxonKit. You can load it with <code>module load StdEnv/2020 taxonkit</code>. It requires the NCBI taxonomy files to be in a particular location; to set it up with this datashare, simply add symbolic links to the <code>.taxonkit</code> folder:<br />
<br />
<pre><br />
mkdir -p ~/.taxonkit<br />
ln -s /datashare/NCBI_taxonomy/*.dmp ~/.taxonkit/<br />
</pre><br />
<br />
Then you can use taxonkit directly.<br />
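If you prefer not to use TaxonKit, the <code>nodes.dmp</code> and <code>names.dmp</code> files in <code>/datashare/NCBI_taxonomy</code> can also be parsed directly. Below is a minimal Python sketch of walking a lineage from the dump format; the two-species excerpt is hypothetical illustration, not real file content:<br />

```python
# Minimal sketch: resolving a lineage from NCBI taxonomy dump files.
# Records in .dmp files use "<TAB>|<TAB>"-separated fields and end in "<TAB>|".
# The NODES/NAMES strings below are a tiny fake excerpt, not real data.

def parse_dmp(text):
    """Yield the stripped fields of each record in .dmp format."""
    for line in text.strip().splitlines():
        yield [f.strip() for f in line.rstrip("\t|").split("\t|\t")]

def lineage(taxid, parents, names):
    """Walk parent links from taxid up to the root, newest first."""
    chain = []
    while True:
        chain.append(names.get(taxid, taxid))
        parent = parents.get(taxid)
        if parent is None or parent == taxid:  # the root points to itself
            break
        taxid = parent
    return chain

NODES = "9606\t|\t9605\t|\tspecies\t|\n9605\t|\t1\t|\tgenus\t|\n1\t|\t1\t|\tno rank\t|"
NAMES = ("9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
         "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n"
         "1\t|\troot\t|\t\t|\tscientific name\t|")

parents = {f[0]: f[1] for f in parse_dmp(NODES)}           # taxid -> parent taxid
names = {f[0]: f[1] for f in parse_dmp(NAMES)
         if f[3] == "scientific name"}                      # taxid -> name
print(" > ".join(reversed(lineage("9606", parents, names))))  # root > Homo > Homo sapiens
```

For the real files, read <code>/datashare/NCBI_taxonomy/nodes.dmp</code> and <code>names.dmp</code> line by line instead of the embedded strings.<br />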
<br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
/datashare/PANTHER/<br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited, in the following format:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
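As a minimal sketch of consuming the column layout above (the sample record and the field names are illustrative, not taken from the real release):<br />

```python
# Parse a PANTHER HMM classification file (tab-delimited, columns as
# listed above). The column names and the sample record are hypothetical.
import csv
import io

COLUMNS = ["panther_id", "name", "molecular_function", "biological_process",
           "cellular_component", "protein_class", "pathway"]

def parse_classifications(handle):
    """Yield one dict per PANTHER family/subfamily record."""
    for row in csv.reader(handle, delimiter="\t"):
        yield dict(zip(COLUMNS, row))

# Fake one-record file for demonstration; ":SF" marks a subfamily ID.
sample = "PTHR12213:SF6\tHYPOTHETICAL KINASE\tkinase activity\t\t\t\t\n"
records = list(parse_classifications(io.StringIO(sample)))
print(records[0]["panther_id"])  # PTHR12213:SF6
```

With the real data, pass an open handle to one of the files under <code>/datashare/PANTHER/hmm_classifications</code> instead of the in-memory sample.<br />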
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT ===<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-10 test and training sets, along with the labels.<br />
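The Python-format batches can be unpickled directly; a minimal sketch, assuming the standard CIFAR-10 pickle layout (a dict keyed by <code>b"data"</code> and <code>b"labels"</code>). The round-trip below uses a fake two-image batch, not real CIFAR data:<br />

```python
# Minimal sketch of reading a CIFAR-10 python-format batch file.
import os
import pickle
import tempfile

def load_batch(path):
    """Return (data, labels) from a CIFAR-10 python batch file."""
    with open(path, "rb") as fh:
        batch = pickle.load(fh, encoding="bytes")  # keys are bytes in Python 3
    return batch[b"data"], batch[b"labels"]

# Illustrative round-trip with a fake two-image batch (not real CIFAR data);
# each image row is 3072 bytes: 1024 red, 1024 green, 1024 blue values.
fake = {b"data": [bytes(3072), bytes(3072)], b"labels": [3, 7]}
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pickle.dump(fake, tmp)
data, labels = load_batch(tmp.name)
os.unlink(tmp.name)
print(len(data), labels)  # 2 [3, 7]
```

For the real dataset, point <code>load_batch</code> at one of the files under <code>/datashare/CIFAR-10/cifar-10-batches-py/</code>.<br />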
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, Matlab, and Python versions of the CIFAR-100 test and training sets, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=417
Graham Reference Dataset Repository
2021-08-16T15:11:45Z
<p>Jshleap: /* AI */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
Within the directory for each collection of data, README and index files provide information on that collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis result sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
An example release subdirectory is /datashare/1000genomes/release/2008_12/.<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory /datashare/1000genomes/release/20100804/ contains the release versions of SNP and indel calls based on the /datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database volumes (in this case nr.*) to $SLURM_TMPDIR,<br />
# so that they land at ${SLURM_TMPDIR}/nr.* rather than under a full path prefix<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly (as in the example), but it might be slower than having the databases on the node's local disk.<br />
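The tar-pipe copy used in the script above can be tried out anywhere; a minimal sketch, with throwaway temporary directories standing in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code> (all paths and file names here are illustrative stand-ins):<br />

```shell
#!/bin/bash
set -e
# stand-ins for /datashare/BLASTDB and $SLURM_TMPDIR (illustration only)
src=$(mktemp -d)
dest=$(mktemp -d)
# fake a couple of database volumes named like BLAST's nr.* files
printf 'x' > "$src/nr.00.phr"
printf 'x' > "$src/nr.00.pin"
# the tar-pipe: stream the nr.* volumes and unpack them at the destination
(cd "$src" && tar cf - nr.*) | (cd "$dest" && tar xf -)
ls "$dest"
```

On Graham the same pipe streams the real database volumes, after which <code>blastp</code> can be pointed at <code>${SLURM_TMPDIR}/nr</code>.<br />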
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
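Because these files are gzip-compressed, they can be streamed without writing the decompressed FASTA to disk; a sketch, with a throwaway two-record file standing in for <code>/datashare/BLAST_FASTA/swissprot.gz</code>:<br />

```shell
#!/bin/bash
set -e
tmp=$(mktemp -d)
# dummy two-record FASTA standing in for swissprot.gz (illustration only)
printf '>seq1\nMKV\n>seq2\nGGA\n' | gzip > "$tmp/swissprot.gz"
# count records straight from the compressed stream
nseqs=$(gunzip -c "$tmp/swissprot.gz" | grep -c '^>')
echo "$nseqs sequences"
```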
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], the updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
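The <code><(gunzip -c …)</code> construct in the commands above is bash process substitution: the archive is decompressed on the fly and presented to <code>diamond</code> as a readable path, so no decompressed copy ever touches the disk. A minimal demonstration with a dummy file (paths are stand-ins):<br />

```shell
#!/bin/bash
set -e
tmp=$(mktemp -d)
# dummy one-record FASTA standing in for nr.gz (illustration only)
printf '>p1\nMKV\n' | gzip > "$tmp/nr.gz"
# <(...) expands to a path (e.g. /dev/fd/63) that reads the decompressed stream
lines=$(wc -l < <(gunzip -c "$tmp/nr.gz"))
echo "$lines lines"
```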
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
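Whether a given database will fit on the local disk can be checked before copying; a sketch, with a dummy file standing in for <code>/datashare/DIAMONDDB_2.0.9/nr.dmnd</code> and a temporary directory for <code>$SLURM_TMPDIR</code> (both are stand-ins):<br />

```shell
#!/bin/bash
set -e
db=$(mktemp)            # stand-in for /datashare/DIAMONDDB_2.0.9/nr.dmnd
scratch=$(mktemp -d)    # stand-in for $SLURM_TMPDIR
head -c 4096 /dev/zero > "$db"
need=$(wc -c < "$db")                                              # bytes required
avail=$(( $(df -kP "$scratch" | awk 'NR==2 {print $4}') * 1024 ))  # bytes free
if [ "$avail" -gt "$need" ]; then verdict=fits; else verdict="too big"; fi
echo "$verdict"
```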
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv # use the local copy, not the /datashare one<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (as in the example), but it might be slower than having the database on the node's local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of folders labeled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
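The per-taxid bundles are ordinary tar/gzip files; for example, the HMMs for taxid 2759 (Eukaryota) could be unpacked with something like <code>tar xf /datashare/EggNog/per_tax_level/2759/2759_hmms.tar -C somewhere</code>. A sketch of the same operation with a throwaway archive built on the spot:<br />

```shell
#!/bin/bash
set -e
tmp=$(mktemp -d)
# build a throwaway archive shaped like 2759_hmms.tar (illustration only)
mkdir "$tmp/2759"
echo "dummy model" > "$tmp/2759/2759.hmm"
tar cf "$tmp/2759_hmms.tar" -C "$tmp" 2759
# unpack it into a scratch directory, as one would on the cluster
out=$(mktemp -d)
tar xf "$tmp/2759_hmms.tar" -C "$out"
ls "$out/2759"
```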
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated with the BLAST databases.<br />
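The <code>*.dmp</code> files in this dataset use fields delimited by <code>&lt;tab&gt;|&lt;tab&gt;</code>, so a scientific name can be looked up by taxid with <code>awk</code>. A sketch using a one-line stand-in for <code>/datashare/NCBI_taxonomy/names.dmp</code>:<br />

```shell
#!/bin/bash
set -e
tmp=$(mktemp)
# one record in names.dmp layout: tax_id | name_txt | unique name | name class |
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > "$tmp"
# split on the tab-pipe-tab delimiter and print the name for taxid 9606
name=$(awk -F '\t\\|\t' '$1 == 9606 {print $2}' "$tmp")
echo "$name"
```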
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16 of the PANTHER HMM library. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version.<br />
<br />
Each file is tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways that have been assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
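Since the classification files are tab-delimited, subfamily entries (those carrying <code>:SF</code> in the ID) can be filtered with standard tools; a sketch using two made-up rows in place of a real classification file:<br />

```shell
#!/bin/bash
set -e
tmp=$(mktemp)
# two illustrative rows: one family, one subfamily (IDs from the text above)
printf 'PTHR11258\tFAMILY NAME\tMF\tBP\tCC\tPC\tPW\n'  > "$tmp"
printf 'PTHR12213:SF6\tSUBFAMILY NAME\tMF\tBP\tCC\tPC\tPW\n' >> "$tmp"
# keep only subfamilies and print the ID plus the curated name
subfams=$(awk -F '\t' '$1 ~ /:SF/ {print $1 "\t" $2}' "$tmp")
echo "$subfams"
```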
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
<br />
==== Directory structure ====<br />
<br />
CIFAR-10 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-10<br />
├── cifar-10-batches-bin<br />
│ ├── batches.meta.txt<br />
│ ├── data_batch_1.bin<br />
│ ├── data_batch_2.bin<br />
│ ├── data_batch_3.bin<br />
│ ├── data_batch_4.bin<br />
│ ├── data_batch_5.bin<br />
│ ├── readme.html<br />
│ └── test_batch.bin<br />
├── cifar-10-batches-mat<br />
│ ├── batches.meta.mat<br />
│ ├── data_batch_1.mat<br />
│ ├── data_batch_2.mat<br />
│ ├── data_batch_3.mat<br />
│ ├── data_batch_4.mat<br />
│ ├── data_batch_5.mat<br />
│ ├── readme.html<br />
│ └── test_batch.mat<br />
├── cifar-10-batches-py<br />
│ ├── batches.meta<br />
│ ├── data_batch_1<br />
│ ├── data_batch_2<br />
│ ├── data_batch_3<br />
│ ├── data_batch_4<br />
│ ├── data_batch_5<br />
│ ├── readme.html<br />
│ └── test_batch<br />
├── cifar-10-binary.tar.gz<br />
├── cifar-10-matlab.tar.gz<br />
└── cifar-10-python.tar.gz<br />
<br />
3 directories, 27 files<br />
<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
We provide the binary, MATLAB, and Python files with the test and training sets of CIFAR-100, along with the labels.<br />
==== Directory structure ====<br />
<br />
CIFAR-100 directory tree (up to level 2):<br />
<br />
<pre><br />
/datashare/CIFAR-100<br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=416
Graham Reference Dataset Repository
2021-08-16T15:09:30Z
<p>Jshleap: /* CIFAR-100 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the project-space usage taken up by commonly used datasets. These datasets are mounted at <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) && # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also run BLAST directly against <code>/datashare/BLASTDB/nr</code> (skipping the copy), but it might be slower than having the database on the local disk.<br />
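The tar-pipe used in the script above copies a multi-volume database in one pass. A self-contained sketch of the same pattern, with throwaway directories and dummy files standing in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>:

```shell
# Demonstrates the tar-pipe copy of a multi-volume database (nr.*).
# Throwaway directories stand in for /datashare/BLASTDB and $SLURM_TMPDIR.
srcdir=$(mktemp -d)
tmpdir=$(mktemp -d)
printf 'dummy' > "$srcdir/nr.00.phr"   # fake database volume files
printf 'dummy' > "$srcdir/nr.00.pin"
printf 'dummy' > "$srcdir/nr.pal"      # fake alias file
# cd in a subshell so the glob expands inside the source directory
(cd "$srcdir" && tar cf - nr.*) | (cd "$tmpdir" && tar xf -)
ls "$tmpdir"
```

The subshell <code>cd</code> matters: it makes <code>tar</code> store and extract bare <code>nr.*</code> names rather than full <code>/datashare/...</code> paths.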
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
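Because these are gzipped FASTA files, they can be streamed without ever writing an uncompressed copy to disk. A sketch of counting sequences this way (a tiny inline FASTA stands in for, e.g., <code>/datashare/BLAST_FASTA/swissprot.gz</code>):

```shell
# Count sequences in a gzipped FASTA by streaming it through gzip -dc.
# A two-sequence inline file stands in for e.g. swissprot.gz.
fasta=$(mktemp)
printf '>seq1\nMKV\n>seq2\nGHH\n' | gzip > "$fasta"
count=$(gzip -dc "$fasta" | grep -c '^>')
echo "$count"   # 2
```

The same streaming idea (<code>gunzip -c</code> into a process substitution) is what the DIAMOND <code>makedb</code> commands below use to avoid decompressing to disk.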
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following:<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones we have here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, you need to set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
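For illustration, a memory-constrained rerun might lower the block size with <code>-b</code> and raise the number of index chunks with <code>-c</code> (both documented DIAMOND memory/performance options; the values below are hypothetical, and the command is only assembled here, not run, since it needs a real database and query):

```shell
# Sketch of a memory-constrained DIAMOND invocation: smaller block size
# (-b) and more index chunks (-c). Assembled only, not executed.
cmd="diamond blastp -d ${SLURM_TMPDIR:-/tmp}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1.0 -c 4"
echo "$cmd"
```

Smaller <code>-b</code> values reduce memory and scratch usage at the cost of speed.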
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of folders labelled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as for direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
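The <code>*.dmp</code> files use the taxdump format, with fields separated by <code>&lt;tab&gt;|&lt;tab&gt;</code> and lines terminated by <code>&lt;tab&gt;|</code>, so they can be queried with standard tools. A sketch of looking up a scientific name (a two-line sample stands in for the real <code>/datashare/NCBI_taxonomy/names.dmp</code>):

```shell
# Look up the scientific name for a taxid in names.dmp, whose fields are
# separated by "<tab>|<tab>". A two-line sample stands in for the real
# /datashare/NCBI_taxonomy/names.dmp.
names=$(mktemp)
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n'  >  "$names"
printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'     >> "$names"
name=$(awk -F'\t\\|\t' '$1 == "9606" && $4 ~ /^scientific name/ {print $2}' "$names")
echo "$name"   # Homo sapiens
```

Against the real file, simply point the <code>awk</code> command at <code>/datashare/NCBI_taxonomy/names.dmp</code>.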
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
&nbsp;&nbsp;&nbsp;&nbsp;├── LICENSE<br />
&nbsp;&nbsp;&nbsp;&nbsp;├── PANTHER_Sequence_Classification_files<br />
&nbsp;&nbsp;&nbsp;&nbsp;├── README<br />
&nbsp;&nbsp;&nbsp;&nbsp;└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0. Each file lists the name, molecular function, biological process, and pathway for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6, where ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathway terms assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
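Since the classification files are plain tab-separated text, standard tools can slice them. A small sketch (the two rows below are made-up examples in the documented format, not real PANTHER entries):<br />

```shell
# Build a tiny stand-in file with the documented layout: PANTHER ID, then name
printf 'PTHR11258\tEXAMPLE FAMILY NAME\nPTHR12213:SF6\tEXAMPLE SUBFAMILY NAME\n' > sample_classifications.tsv
# Print the PANTHER ID (column 1) and curated name (column 2) of every row
awk -F'\t' '{print $1 " -> " $2}' sample_classifications.tsv
```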
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway. The pathways are provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and ''Drosophila melanogaster'' genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the MATLAB and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
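The same staging pattern recommended for the bioinformatics databases applies here: copy the archive to node-local storage before unpacking. A self-contained sketch of the pattern with a stand-in archive (on Graham the real file would live under <code>/datashare/CIFAR-10/</code>; the archive name used below is an assumption, check with <code>ls</code>):<br />

```shell
# Create a stand-in archive; on Graham you would instead copy e.g.
# /datashare/CIFAR-10/cifar-10-python.tar.gz (name assumed)
mkdir -p src && printf 'demo' > src/data_batch_1
tar czf cifar-demo.tar.gz -C src .
# Stage and unpack into node-local scratch ($SLURM_TMPDIR inside a job)
DEST="${SLURM_TMPDIR:-./scratch_demo}"
mkdir -p "$DEST"
tar xzf cifar-demo.tar.gz -C "$DEST"
ls "$DEST"
```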
==== Directory structure ====<br />
<br />
CIFAR directory tree (up to level 2):<br />
<br />
<pre><br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For more information, see https://www.cs.toronto.edu/~kriz/cifar.html.<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=415
Graham Reference Dataset Repository
2021-08-16T15:06:32Z
<p>Jshleap: /* CIFAR-10 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We mirror their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies<br />
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that the nr database is required.<br />
<br />
You can also point BLAST directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
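The <code>tar … | tar …</code> staging pipeline used in the script above can be tried outside a job with stand-in files; this is just an illustrative sketch of the shell pattern, not the real database:<br />

```shell
# Stand-ins for the database volumes; on Graham the source directory would be
# /datashare/BLASTDB and the destination ${SLURM_TMPDIR}
mkdir -p blastdb_src staging
printf 'x' > blastdb_src/nr.00.phr
printf 'x' > blastdb_src/nr.pal
# Pack only the nr.* volumes on the fly and unpack them at the destination,
# so they land as staging/nr.* (no leading path components)
(cd blastdb_src && tar cf - nr.*) | tar xf - -C staging
ls staging
```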
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
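Because these files are large, it is often preferable to stream-decompress them rather than write the uncompressed FASTA to disk; this is the same pattern the DIAMOND <code>makedb</code> commands use via process substitution. A sketch with a tiny stand-in file (the record below is invented for illustration):<br />

```shell
# Stand-in for e.g. /datashare/BLAST_FASTA/swissprot.gz
printf '>demo_protein\nMALWMRLLPLLALLALWGPD\n' | gzip > swissprot_demo.gz
# gunzip -c decompresses to stdout, leaving the .gz untouched on disk
gunzip -c swissprot_demo.gz | head -2
```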
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST but includes optimizations at both the database and software levels. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either resource, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving the file. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that the nr database is required.<br />
<br />
You can also point DIAMOND directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and its annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNog's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNog's clustering.<br />
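Since the per-taxon tables are plain gzipped TSVs, they can be streamed with standard tools. A minimal sketch, using a fabricated two-row table instead of a real one (on Graham you would point <code>zcat</code> at e.g. <code>/datashare/EggNog/per_tax_level/2/2_annotations.tsv.gz</code>):<br />

```shell
# Fabricated sample standing in for a per_tax_level annotations table;
# the real files live under /datashare/EggNog/per_tax_level/<taxid>/
tmp=$(mktemp -d)
printf '2QVP3\tC\n2QVP4\tE\n' | gzip > "$tmp/2_annotations.tsv.gz"
# Stream the first column (orthologous-group IDs) without decompressing to disk
ogs=$(zcat "$tmp/2_annotations.tsv.gz" | cut -f1)
echo "$ogs"
rm -rf "$tmp"
```

The same <code>zcat | cut</code> pattern works for the <code>*_members.tsv.gz</code> and <code>*_trees.tsv.gz</code> files as well.<br />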
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
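For direct lookups, <code>names.dmp</code> is a flat file whose fields are separated by tab-pipe-tab. A minimal sketch with two fabricated rows in that layout (on Graham, replace the sample file with <code>/datashare/NCBI_taxonomy/names.dmp</code>):<br />

```shell
# Two fabricated rows following the names.dmp dump format:
# taxid <TAB>|<TAB> name <TAB>|<TAB> unique name <TAB>|<TAB> name class <TAB>|
tmp=$(mktemp)
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > "$tmp"
printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n' >> "$tmp"
# Look up the scientific name for taxid 9606
name=$(awk -F'\t[|]\t' '$1 == "9606" && $4 ~ /scientific name/ {print $2}' "$tmp")
echo "$name"
rm -f "$tmp"
```

The same field-separator trick applies to <code>nodes.dmp</code> and the other <code>*.dmp</code> files in the dump.<br />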
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0 of the PANTHER HMM library. They list the name, molecular functions, biological processes, and pathway for every PANTHER protein family and subfamily.<br />
<br />
The files are tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
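The column layout above makes these files easy to process with standard Unix tools. A minimal sketch using a fabricated row rather than the real classification file (on Graham, substitute e.g. <code>/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications</code>):<br />

```shell
# Fabricated row in the 7-column tab-delimited layout described above
tmp=$(mktemp)
printf 'PTHR00001:SF6\tExample subfamily\tGO:0003674\tGO:0008150\t\t\t\n' > "$tmp"
# Column 1 is the PANTHER ID; ":SF" marks a subfamily
id=$(cut -f1 "$tmp")
family=${id%%:*}   # strip the subfamily suffix to get the family ID
echo "$family"
rm -f "$tmp"
```
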
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the FASTA inputs.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
We provide the MATLAB and Python files with the test and training sets of CIFAR-10, along with the labels.<br />
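The batch layout above can be sanity-checked arithmetically. Per the upstream CIFAR distribution notes (not shown on this page), each record in the binary version of CIFAR-10 is one label byte followed by 32×32×3 = 3072 pixel bytes, so a 10000-image batch file should occupy exactly 30,730,000 bytes:<br />

```shell
# CIFAR-10 binary layout (per the upstream CIFAR distribution notes):
# each record = 1 label byte + 32*32*3 pixel bytes
record=$((1 + 32 * 32 * 3))   # bytes per image record
batch=$((10000 * record))     # bytes per 10000-image batch file
echo "record=${record} batch=${batch}"
```

Per the same notes, the CIFAR-100 binaries use two label bytes (coarse and fine) per record, so this check applies to the CIFAR-10 binaries only.<br />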
==== Directory structure ====<br />
<br />
CIFAR directory tree (up to level 2):<br />
<br />
<pre><br />
├── cifar-100-binary<br />
│ ├── coarse_label_names.txt<br />
│ ├── fine_label_names.txt<br />
│ ├── test.bin<br />
│ └── train.bin<br />
├── cifar-100-binary.tar.gz<br />
├── cifar-100-matlab<br />
│ ├── meta.mat<br />
│ ├── test.mat<br />
│ └── train.mat<br />
├── cifar-100-matlab.tar.gz<br />
├── cifar-100-python<br />
│ ├── file.txt<br />
│ ├── meta<br />
│ ├── test<br />
│ └── train<br />
└── cifar-100-python.tar.gz<br />
<br />
3 directories, 13 files<br />
</pre><br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=414
Graham Reference Dataset Repository
2021-08-16T15:03:22Z
<p>Jshleap: /* CIFAR-10 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets take up in their project accounts. These datasets are mounted at <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the project's own README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the features newly enabled with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR; tar runs from<br />
# inside /datashare/BLASTDB so the database files land directly in $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -)<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</pre><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
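The tar pipeline in the script above is a general pattern for copying a set of files between directories while preserving their names. A self-contained demonstration with throwaway directories and fabricated stand-in files (no access to /datashare is needed to try it):<br />

```shell
# Fabricated stand-ins for two database volume files
src=$(mktemp -d); dst=$(mktemp -d)
printf '>seq1\nMKV\n' > "$src/nr.00.fake"
printf '>seq2\nMPL\n' > "$src/nr.01.fake"
# Archive everything matching the database prefix and unpack it at the
# destination; running tar from inside the source directory keeps the
# archive free of absolute paths
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)
copied=$(ls "$dst" | wc -l)
rm -rf "$src" "$dst"
echo "$copied"
```
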
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
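Because these are plain gzipped FASTA files, they can be streamed without decompressing to disk. A minimal sketch counting sequences in a fabricated two-record FASTA (on Graham, substitute e.g. <code>/datashare/BLAST_FASTA/swissprot.gz</code>):<br />

```shell
# Fabricated two-record gzipped FASTA standing in for a /datashare file
tmp=$(mktemp)
printf '>sp1\nMKVL\n>sp2\nMPLA\n' | gzip > "$tmp"
# Count sequences by counting FASTA header lines in the stream
nseq=$(zcat "$tmp" | grep -c '^>')
echo "$nseq"
rm -f "$tmp"
```
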
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<pre><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</pre><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. Unlike with BLASTDB, only one file needs to be moved, which makes <code>cp</code> more efficient than <code>tar</code>. For example, your sbatch script can look something like this:<br />
<br />
<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</pre><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a mirror of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookups of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
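Most of the <code>.dmp</code> files share a simple layout: fields separated by a tab, a pipe, and a tab (in <code>names.dmp</code>: taxid, name, unique name, and name class). A minimal sketch of looking up a scientific name by taxid, run here against a two-line stand-in file (on Graham you would point <code>awk</code> at <code>/datashare/NCBI_taxonomy/names.dmp</code> instead):<br />

```shell
# names.dmp fields are separated by "<TAB>|<TAB>" and rows end with "<TAB>|":
# taxid | name | unique name | name class. Two sample rows in the same layout:
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' >  sample_names.dmp
printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'   >> sample_names.dmp
# Look up the scientific name for taxid 9606; on Graham, replace the sample
# file with /datashare/NCBI_taxonomy/names.dmp
awk -F '\t[|]\t' '$1 == "9606" && $4 ~ /scientific name/ {print $2}' sample_names.dmp
```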
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16 of the PANTHER HMM library. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily.<br />
<br />
The files are tab-delimited, with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/hmm_classifications</code>.<br />
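Because the classification files are plain TSV in the column order above, standard tools can slice them. A minimal sketch against a one-row stand-in file (hypothetical values; on Graham, use e.g. <code>/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications</code>):<br />

```shell
# The HMM classification files are plain TSV in the column order listed above.
# One hypothetical row (empty GO/class/pathway columns kept as placeholders):
printf 'PTHR11258\tRIBOSOMAL PROTEIN S12\t\t\t\t\t\n' > sample_classification.tsv
# Print the PANTHER ID and curated name for each family/subfamily
cut -f1,2 sample_classification.tsv
```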
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs.<br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/panther_library</code>.<br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information, check the README file at <code>/datashare/PANTHER/sequence_classifications</code>.<br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 ===<br />
The [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10] dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.<br />
<br />
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.<br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=413
Graham Reference Dataset Repository
2021-08-06T19:56:04Z
<p>Jshleap: /* PFAM */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
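As an example of the last point, the job-script sketch below regenerates FASTA sequences from one of the pre-formatted databases with <code>blastdbcmd</code>. It is only a sketch: the module versions are taken from the Usage example below, and <code>swissprot</code> stands in for whichever database you need.<br />

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --account=def-someuser
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # same modules as the Usage example
# Dump all entries of a pre-formatted database back to FASTA
# (swissprot is one of the smaller protein databases listed above)
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta
```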
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database (here nr) to local disk, keeping relative paths<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that the job was launched from the directory containing myquery.fasta, that myquery.fasta holds protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly, but it might be slower than having the database on local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but with optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
Four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], the updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
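As a sketch, the flag can simply be appended to the <code>diamond</code> call in your job script (the value below is illustrative, not a tuned recommendation, and <code>${SLURM_TMPDIR}/nr</code> assumes the database was copied there first):<br />

```shell
# Job-script fragment (illustrative): a smaller block size (-b, in billions
# of letters) reduces DIAMOND's memory and temporary-disk footprint, at some
# cost in speed; lower it further if the run still exhausts memory or disk
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 0.5
```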
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -p ${SLURM_CPUS_PER_TASK}<br />
<br />
Note that the example above assumes that you launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point DIAMOND directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> without the copy step, but it might be slower than having the database on local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a resource of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs], extending it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labeled by taxonomic ID; in each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
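For example, taxonomic names can be looked up by ID directly from <code>names.dmp</code>, whose columns are delimited by <code>&lt;TAB&gt;|&lt;TAB&gt;</code>. The sketch below runs on a fabricated one-line sample so it is self-contained; on Graham, point it at <code>/datashare/NCBI_taxonomy/names.dmp</code> instead:<br />

```shell
# Fabricated names.dmp-style line (fields separated by <TAB>|<TAB>);
# on Graham, use /datashare/NCBI_taxonomy/names.dmp instead
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > names.demo.dmp

# Print the scientific name recorded for taxid 9606
awk -F'\t\\|\t' '$1 == "9606" && $4 ~ /scientific name/ {print $2}' names.demo.dmp
```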
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0 of the PANTHER HMM library. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the library.<br />
<br />
The files are tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6; ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
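A minimal sketch of reading this tab-delimited format (using a fabricated one-line file so it is self-contained; the real files live under <code>/datashare/PANTHER/hmm_classifications</code>):<br />

```shell
# Fabricated line following the 7-column tab-delimited layout described above;
# the real files live under /datashare/PANTHER/hmm_classifications
printf 'PTHR12213:SF6\tDemo family name\tMF\tBP\tCC\tclass\tpathway\n' > panther.demo.tsv

# Print the curated name (column 2) for a given PANTHER ID
awk -F'\t' '$1 == "PTHR12213:SF6" {print $2}' panther.demo.tsv
```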
<br />
For more information, check the README file at <code>/datashare/PANTHER/hmm_classifications</code>.<br />
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM library files along with the FASTA inputs.<br />
<br />
For more information, check the README file at <code>/datashare/PANTHER/panther_library</code>.<br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence associations with each pathway, in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes.<br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information, check the README file at <code>/datashare/PANTHER/sequence_classifications</code>.<br />
<br />
=== PFAM ===<br />
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=412
Graham Reference Dataset Repository
2021-08-06T19:47:04Z
<p>Jshleap: /* Directory structure */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
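For instance, FASTA records can be regenerated from a pre-formatted database with <code>blastdbcmd</code>. A job-script sketch follows (module versions as used elsewhere on this page; the UniProt accession P01308 is only an example entry):<br />

```shell
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=4G
#SBATCH --account=def-someuser
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST+ and dependencies
# Extract one record (here, UniProt accession P01308) as FASTA from swissprot
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta
```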
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) && # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also query <code>/datashare/BLASTDB/nr</code> directly (skipping the copy), but it might be slower than having the database on the node's local disk.<br />
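Before writing a job script you may want to check which pre-formatted databases are present and what they contain. A minimal sketch, assuming <code>blastdbcmd</code> is on your PATH (i.e. the blast+ module is loaded, as in the sbatch example above):<br />

```shell
# List the pre-formatted databases under /datashare/BLASTDB with their titles.
# %f is the database file name and %t its title (blastdbcmd format specifiers);
# if blastdbcmd is not available the command degrades to a hint instead.
command -v blastdbcmd >/dev/null \
  && blastdbcmd -list /datashare/BLASTDB -list_outfmt '%f %t' \
  || echo "blastdbcmd not found: load the blast+ module first"
```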
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository] in gzip-compressed format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
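These FASTA dumps are useful for building custom databases. Below is a minimal sbatch sketch (account, resources, and the output name are illustrative) that streams one of the compressed files into <code>makeblastdb</code>; <code>makeblastdb</code> reads FASTA from stdin when <code>-in</code> is <code>-</code>:<br />

```shell
# Write an illustrative job script and check its syntax without submitting it.
cat > make_swissprot_db.sh <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --account=def-someuser
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
# Stream the compressed FASTA straight into makeblastdb ('-' means stdin)
gunzip -c /datashare/BLAST_FASTA/swissprot.gz | \
  makeblastdb -in - -dbtype prot -title swissprot -out ${SLURM_TMPDIR}/swissprot
EOF
bash -n make_swissprot_db.sh && echo "syntax OK"
```

Submit it with <code>sbatch make_swissprot_db.sh</code> once you have adjusted the account and resources.<br />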
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST but includes optimizations both at the database level and at the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
<br />
Four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
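A minimal sketch of a lower-footprint job, with illustrative values: <code>-b</code> lowers the sequence block size (reducing memory use), <code>-c</code> raises the number of index chunks, and <code>--tmpdir</code> points DIAMOND's temporary files at the node-local scratch:<br />

```shell
# Write an illustrative low-memory DIAMOND job script and syntax-check it.
cat > diamond_lowmem.sh <<'EOF'
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --mem=16G
#SBATCH --account=def-someuser
module load StdEnv/2020 diamond/2.0.9
# -b 1.0 (block size in GB of sequence) and -c 4 (index chunks) trade speed
# for a smaller memory/disk footprint; tune them to your node's limits.
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q myquery.fasta \
  -o hits.tsv -b 1.0 -c 4 --tmpdir ${SLURM_TMPDIR}
EOF
bash -n diamond_lowmem.sh && echo "syntax OK"
```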
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also query <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the node's local disk.<br />
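Because these databases were built with <code>--taxonmap</code>/<code>--taxonnodes</code>, DIAMOND can report the subject's taxonomy IDs directly. A hedged sketch (query file, output name, and resources are illustrative):<br />

```shell
# Write an illustrative job script that requests taxonomy-aware tabular
# output (the staxids field), then syntax-check it without submitting.
cat > diamond_taxa.sh <<'EOF'
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load StdEnv/2020 diamond/2.0.9
cp /datashare/DIAMONDDB_2.0.9/swissprot.dmnd ${SLURM_TMPDIR}
# staxids prints subject taxonomy IDs, available because the databases
# were built with taxonomic mapping (see the makedb commands above)
diamond blastp -d ${SLURM_TMPDIR}/swissprot -q YOURREADS.fasta \
  --outfmt 6 qseqid sseqid pident evalue staxids -o hits_with_taxa.tsv
EOF
bash -n diamond_taxa.sh && echo "syntax OK"
```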
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, holding the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
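To get a feel for one taxonomic level, you can peek at its annotation table. A sketch, assuming the per-level files follow the <code>&lt;taxid&gt;_annotations.tsv.gz</code> naming pattern described above (2759 is the Eukaryota level from the directory listing):<br />

```shell
# Show the first few orthologous-group annotations for one taxonomic level.
# The exact file name is an assumption based on the *_annotations.tsv.gz
# pattern; the command degrades to a hint when /datashare is not mounted.
f=/datashare/EggNog/per_tax_level/2759/2759_annotations.tsv.gz
[ -r "$f" ] && zcat "$f" | head -5 || echo "run this on Graham, where /datashare is mounted"
```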
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset mirrors the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It will be updated together with the BLAST databases.<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
│ ├── dead_nucl.accession2taxid.gz<br />
│ ├── dead_nucl.accession2taxid.gz.md5<br />
│ ├── dead_prot.accession2taxid.gz<br />
│ ├── dead_prot.accession2taxid.gz.md5<br />
│ ├── dead_wgs.accession2taxid.gz<br />
│ ├── dead_wgs.accession2taxid.gz.md5<br />
│ ├── index.html<br />
│ ├── nucl_gb.accession2taxid.gz<br />
│ ├── nucl_gb.accession2taxid.gz.md5<br />
│ ├── nucl_wgs.accession2taxid.gz<br />
│ ├── nucl_wgs.accession2taxid.gz.md5<br />
│ ├── pdb.accession2taxid.gz<br />
│ ├── pdb.accession2taxid.gz.md5<br />
│ ├── prot.accession2taxid.FULL.gz<br />
│ ├── prot.accession2taxid.FULL.gz.md5<br />
│ ├── prot.accession2taxid.gz<br />
│ ├── prot.accession2taxid.gz.md5<br />
│ └── README<br />
├── biocollections<br />
│ ├── Collection_codes.txt<br />
│ ├── index.html<br />
│ ├── Institution_codes.txt<br />
│ └── Unique_institution_codes.txt<br />
├── categories.dmp<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Cowner_dump.txt<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── Icode_dump.txt<br />
├── index.html<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump<br />
│ ├── index.html<br />
│ ├── new_taxdump.tar.gz<br />
│ ├── new_taxdump.tar.gz.md5<br />
│ └── taxdump_readme.txt<br />
├── nodes.dmp<br />
├── README<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxdump_archive<br />
│ └── index.html<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
└── taxdump.tar.gz.md5<br />
<br />
4 directories, 50 files<br />
<br />
</pre><br />
</div><br />
</div><br />
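For direct lookups, the <code>accession2taxid</code> tables are plain tab-delimited files (columns: accession, accession.version, taxid, gi). A minimal sketch of resolving a protein accession to its taxonomy ID; the accession used is only an example:<br />

```shell
# Map an accession (column 1) to its taxonomy ID (column 3) in one of the
# gzipped accession2taxid tables; stops at the first match.
lookup_taxid () {
  # $1 = accession2taxid .gz file, $2 = accession
  zcat "$1" | awk -F'\t' -v acc="$2" '$1 == acc { print $3; exit }'
}
f=/datashare/NCBI_taxonomy/accession2taxid/prot.accession2taxid.gz
[ -r "$f" ] && lookup_taxid "$f" P0DTC2 || echo "run this on Graham, where /datashare is mounted"
```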
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── <br />
<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
The files are tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6, where ":SF" indicates the subfamily ID<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways assigned to families and subfamilies<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
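Since the classification files are plain tab-delimited text, a single <code>awk</code> call is enough to pull one column for a given family ID. A sketch, using the example ID from the column list above:<br />

```shell
# Look up the curated name (column 2) for one PANTHER family/subfamily ID
# (column 1) in a classification file.
panther_name () {
  # $1 = classification file, $2 = PANTHER ID
  awk -F'\t' -v id="$2" '$1 == id { print $2; exit }' "$1"
}
f=/datashare/PANTHER/hmm_classifications/PANTHER16.0_HMM_classifications
[ -r "$f" ] && panther_name "$f" PTHR11258 || echo "run this on Graham, where /datashare is mounted"
```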
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of sequence associations with each pathway, with some pathways provided in BioPAX and SBML formats.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=411
Graham Reference Dataset Repository
2021-08-06T19:25:15Z
<p>Jshleap: /* Directory structure */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [http://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
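As a sketch of that last point, <code>blastdbcmd</code> can report on and extract sequences from a pre-formatted database (this assumes the blast+ module is loaded; the accession used is only an illustrative example):<br />

```shell
# show metadata (title, number of sequences, date) for a pre-formatted database
blastdbcmd -db /datashare/BLASTDB/swissprot -info

# extract one sequence by accession into a FASTA file (P01308 is an example accession)
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta

# dump the whole database back to FASTA (can be very large)
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta
```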
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load blast and dependencies<br />
# copy the required database (in this case nr) to $SLURM_TMPDIR;<br />
# a pre-formatted database is a set of nr.* volume files, so copy them all<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also search directly against <code>/datashare/BLASTDB/nr</code> (as in the example), but it might be slower than having the database on the local disk.<br />
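When searching against the shared mount directly, the standard <code>BLASTDB</code> environment variable lets BLAST resolve database names without full paths. A minimal sketch (assumes the blast+ module is loaded and that <code>myquery.fasta</code> exists in the working directory):<br />

```shell
# point BLAST at the shared database directory
export BLASTDB=/datashare/BLASTDB

# BLAST now resolves the bare database name "nt" inside $BLASTDB
blastn -db nt -query myquery.fasta -outfmt 6 -out hits.tsv
```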
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
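Since these files are gzip-compressed, they can be streamed without decompressing to disk first. A self-contained sketch (a tiny stand-in file is created here; on Graham you would stream, e.g., <code>/datashare/BLAST_FASTA/swissprot.gz</code> instead):<br />

```shell
# create a two-sequence stand-in for a gzipped FASTA file
printf '>seq1\nMKV\n>seq2\nGGA\n' | gzip > demo.fasta.gz

# count the sequences by streaming the compressed file (counts header lines)
zcat demo.fasta.gz | grep -c '^>'   # prints 2
```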
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but it has some optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], the updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
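For example, lowering the block size reduces DIAMOND's memory footprint at the cost of speed. A sketch with illustrative values (the defaults are <code>-b 2.0</code> and <code>-c 4</code>; the input and output file names are placeholders):<br />

```shell
# roughly halve memory use relative to the default block size of 2.0
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv \
    --block-size 1.0 --index-chunks 4
```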
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database being copied), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and its annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, BLAST, DIAMOND, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It will be updated with the BLAST databases.<br />
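For direct searches, note that the <code>*.dmp</code> files use <code>&lt;TAB&gt;|&lt;TAB&gt;</code> as their field separator. A self-contained sketch of looking up a taxid by scientific name (a one-record sample in the same format is created here; on Graham you would query <code>/datashare/NCBI_taxonomy/names.dmp</code> instead):<br />

```shell
# one record in names.dmp format: taxid | name | unique name | name class
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > sample_names.dmp

# print the taxid whose name field matches
awk -F'\t\\|\t' '$2 == "Homo sapiens" {print $1}' sample_names.dmp   # prints 9606
```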
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Collection_codes.txt<br />
├── Cowner_dump.txt<br />
├── dead_nucl.accession2taxid.gz<br />
├── dead_nucl.accession2taxid.gz.md5<br />
├── dead_prot.accession2taxid.gz<br />
├── dead_prot.accession2taxid.gz.md5<br />
├── dead_wgs.accession2taxid.gz<br />
├── dead_wgs.accession2taxid.gz.md5<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── gi_taxid_nucl.dmp.gz<br />
├── gi_taxid_nucl.dmp.gz.md5<br />
├── gi_taxid_prot.dmp.gz<br />
├── gi_taxid_prot.dmp.gz.md5<br />
├── gi_taxid.readme<br />
├── Icode_dump.txt<br />
├── Institution_codes.txt<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump.tar.gz<br />
├── new_taxdump.tar.gz.md5<br />
├── new_taxdump.tar.Z<br />
├── new_taxdump.tar.Z.md5<br />
├── new_taxdump.zip.md5<br />
├── nodes.dmp<br />
├── nucl_gb.accession2taxid.gz<br />
├── nucl_gb.accession2taxid.gz.md5<br />
├── nucl_wgs.accession2taxid.gz<br />
├── nucl_wgs.accession2taxid.gz.md5<br />
├── pdb.accession2taxid.gz<br />
├── pdb.accession2taxid.gz.md5<br />
├── prot.accession2taxid.FULL.gz<br />
├── prot.accession2taxid.FULL.gz.md5<br />
├── prot.accession2taxid.gz<br />
├── prot.accession2taxid.gz.md5<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxcat.tar.Z<br />
├── taxcat.tar.Z.md5<br />
├── taxcat.zip.md5<br />
├── taxdmp.zip.md5<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
├── taxdump.tar.gz.md5<br />
├── taxdump.tar.Z<br />
├── taxdump.tar.Z.md5<br />
└── Unique_institution_codes.txt<br />
<br />
0 directories, 56 files<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15 and 16. They contain the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version of the PANTHER HMM library.<br />
<br />
Each file is tab-delimited, with the following fields:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6. ":SF" indicates the subfamily ID<br />
2) Name: The annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function*: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process*: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components*: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class*: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway***: PANTHER pathways have been assigned to families and subfamilies. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
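The family/subfamily convention in the first column is easy to split programmatically. A self-contained sketch (the ID below is the one used in the description above; the annotation text is a stand-in):<br />

```shell
# one stand-in record: a PANTHER subfamily ID plus an annotation column
printf 'PTHR12213:SF6\tHypothetical annotation\n' > sample_classification.tsv

# split the ID on ":" to separate the family from the subfamily
awk -F'\t' '{n = split($1, id, ":"); printf "family=%s", id[1]; if (n > 1) printf " subfamily=%s", id[2]; print ""}' sample_classification.tsv
```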
<br />
===== panther_library =====<br />
This is the main folder, containing the PANTHER HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
===== pathway =====<br />
This folder contains the metabolic pathways and the annotation of the sequence association with each pathway, with some pathways provided in BioPAX and SBML format.<br />
<br />
===== sequence_classifications =====<br />
The PANTHER website allows access to pre-calculated HMM scoring results for the complete proteomes derived from the human, mouse, rat, and Drosophila melanogaster genomes. <br />
<br />
A total of 142 classification files are provided here, one for each organism.<br />
For more information check the README file at <code>/datashare/PANTHER/sequence_classifications</code><br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=410
Graham Reference Dataset Repository
2021-08-06T19:20:55Z
<p>Jshleap: /* PANTHER */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features in the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases in a single flat directory (no subfolders). The following databases are included:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This copy adds between 5 and 30 minutes (depending on the database), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST+ and its dependencies<br />
# copy all volumes of the required database (in this case nr) to $SLURM_TMPDIR;<br />
# the subshells keep the paths relative so the files land directly in $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also run BLAST directly against <code>/datashare/BLASTDB/nr</code> (as per the example), but it might be slower than having the database on the local disk.<br />
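The tar-pipe staging pattern can be exercised on any multi-file directory. The sketch below is self-contained: it uses temporary directories as stand-ins for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>, and the <code>nr.*</code> file names are illustrative mocks, not the real database volumes.<br />

```bash
#!/bin/bash
# Stage a multi-file database into a scratch directory with a tar pipe
# (no intermediate archive is written to disk).
set -euo pipefail

src=$(mktemp -d)   # stand-in for /datashare/BLASTDB
dst=$(mktemp -d)   # stand-in for $SLURM_TMPDIR

# Mock database volumes; a real pre-formatted BLAST db is split like this.
touch "$src"/nr.00.phr "$src"/nr.00.pin "$src"/nr.00.psq

# Relative paths inside the subshells keep the layout flat in $dst,
# so the database could then be opened as ${dst}/nr.
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)

ls "$dst"
```

Because the extraction happens in a subshell, the job script's own working directory (where the query file lives) is left untouched.<br />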
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
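These compressed FASTA files can be streamed without decompressing them to disk, e.g. to count sequences. A minimal sketch, using a small mock file in place of <code>/datashare/BLAST_FASTA/swissprot.gz</code>:<br />

```bash
#!/bin/bash
# Count sequences in a gzipped FASTA by streaming it through gunzip -c,
# avoiding a large decompressed copy on disk.
set -euo pipefail

dir=$(mktemp -d)
# Mock stand-in for e.g. /datashare/BLAST_FASTA/swissprot.gz
printf '>seq1\nMKVL\n>seq2\nGGAT\n>seq3\nTTAA\n' | gzip > "$dir/swissprot.gz"

# Every FASTA record starts with '>', so grep -c counts the sequences.
n=$(gunzip -c "$dir/swissprot.gz" | grep -c '^>')
echo "$n sequences"
```
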
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations both at the database level and at the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, corresponding to BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], the updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This copy adds between 5 and 30 minutes (depending on the database), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than a <code>tar</code> pipe. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv  # search against the local copy<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also run DIAMOND directly against <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as for direct lookup of accession numbers, taxonomic IDs, and related information. It is updated together with the BLAST databases.<br />
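For direct lookups, <code>names.dmp</code> is a pipe-delimited table whose fields are separated by <code>tab|tab</code>. The sketch below parses a two-line mock excerpt with the same field layout as the real file at <code>/datashare/NCBI_taxonomy/names.dmp</code> (the mock values are illustrative):<br />

```bash
#!/bin/bash
# Map a scientific name to its NCBI taxonomic ID by parsing names.dmp,
# whose fields are: taxid | name | unique name | name class.
set -euo pipefail

dir=$(mktemp -d)
# Mock excerpt; the real file uses the same tab-pipe-tab field separator.
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n'  > "$dir/names.dmp"
printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n' >> "$dir/names.dmp"

# Split on the literal tab-|-tab separator and match on the name column.
taxid=$(awk -F'\t[|]\t' '$2 == "Homo sapiens" {print $1}' "$dir/names.dmp")
echo "$taxid"
```
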
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Collection_codes.txt<br />
├── Cowner_dump.txt<br />
├── dead_nucl.accession2taxid.gz<br />
├── dead_nucl.accession2taxid.gz.md5<br />
├── dead_prot.accession2taxid.gz<br />
├── dead_prot.accession2taxid.gz.md5<br />
├── dead_wgs.accession2taxid.gz<br />
├── dead_wgs.accession2taxid.gz.md5<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── gi_taxid_nucl.dmp.gz<br />
├── gi_taxid_nucl.dmp.gz.md5<br />
├── gi_taxid_prot.dmp.gz<br />
├── gi_taxid_prot.dmp.gz.md5<br />
├── gi_taxid.readme<br />
├── Icode_dump.txt<br />
├── Institution_codes.txt<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump.tar.gz<br />
├── new_taxdump.tar.gz.md5<br />
├── new_taxdump.tar.Z<br />
├── new_taxdump.tar.Z.md5<br />
├── new_taxdump.zip.md5<br />
├── nodes.dmp<br />
├── nucl_gb.accession2taxid.gz<br />
├── nucl_gb.accession2taxid.gz.md5<br />
├── nucl_wgs.accession2taxid.gz<br />
├── nucl_wgs.accession2taxid.gz.md5<br />
├── pdb.accession2taxid.gz<br />
├── pdb.accession2taxid.gz.md5<br />
├── prot.accession2taxid.FULL.gz<br />
├── prot.accession2taxid.FULL.gz.md5<br />
├── prot.accession2taxid.gz<br />
├── prot.accession2taxid.gz.md5<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxcat.tar.Z<br />
├── taxcat.tar.Z.md5<br />
├── taxcat.zip.md5<br />
├── taxdmp.zip.md5<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
├── taxdump.tar.gz.md5<br />
├── taxdump.tar.Z<br />
├── taxdump.tar.Z.md5<br />
└── Unique_institution_codes.txt<br />
<br />
0 directories, 56 files<br />
</pre><br />
</div><br />
</div><br />
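Most compressed files in this directory ship with an <code>.md5</code> companion, which (assuming the usual <code>md5sum</code>-compatible format) can be used to verify integrity after a copy. A self-contained sketch, with a mock file standing in for <code>taxdump.tar.gz</code>:<br />

```bash
#!/bin/bash
# Verify a file against its .md5 companion, as shipped alongside the
# compressed files (e.g. taxdump.tar.gz.md5). Both files here are mocks.
set -euo pipefail

dir=$(mktemp -d)
cd "$dir"

printf 'mock taxonomy dump\n' | gzip > taxdump.tar.gz  # mock archive
md5sum taxdump.tar.gz > taxdump.tar.gz.md5             # mock companion checksum

# md5sum -c re-hashes the file and compares against the recorded digest
md5sum -c taxdump.tar.gz.md5
```
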
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
    ├── LICENSE<br />
    ├── PANTHER_Sequence_Classification_files<br />
    ├── README<br />
    └── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
===== hmm_classifications =====<br />
This folder contains the classification files for versions 15.0 and 16.0 of the PANTHER HMM library. They list the name, molecular functions, biological processes, and pathways for every PANTHER protein family and subfamily in the corresponding version.<br />
<br />
Each file is tab-delimited with the following columns:<br />
1) PANTHER ID: for example, PTHR11258 or PTHR12213:SF6 (":SF" indicates the subfamily ID)<br />
2) Name: the annotation assigned by curators to the PANTHER family or subfamily<br />
3) Molecular function: PANTHER GO slim molecular function terms assigned to families and subfamilies<br />
4) Biological process: PANTHER GO slim biological process terms assigned to families and subfamilies<br />
5) Cellular components: PANTHER GO slim cellular component terms assigned to families and subfamilies<br />
6) Protein class: PANTHER protein class terms assigned to families and subfamilies<br />
7) Pathway: PANTHER pathways that have been assigned to families and subfamilies.<br />
<br />
For more information check the README file at <code>/datashare/PANTHER/hmm_classifications</code><br />
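Being plain tab-delimited text, the classification files can be sliced with standard tools. The sketch below extracts the ID and name columns from a mock two-line excerpt (file name and values are illustrative; the real files are <code>PANTHER15.0_HMM_classifications</code> and <code>PANTHER16.0_HMM_classifications</code>):<br />

```bash
#!/bin/bash
# Print the PANTHER ID and curated name (columns 1 and 2) from a
# classification file; the remaining columns hold GO slim terms etc.
set -euo pipefail

dir=$(mktemp -d)
# Mock excerpt with the column layout described above (values illustrative).
printf 'PTHR11258\tRIBOSOMAL PROTEIN S12\tstructural constituent of ribosome\ttranslation\tribosome\n' > "$dir/classifications.tsv"
printf 'PTHR12213:SF6\tEXAMPLE SUBFAMILY\t\t\t\n' >> "$dir/classifications.tsv"

cut -f1,2 "$dir/classifications.tsv"
```
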
<br />
===== panther_library =====<br />
This is the main folder, containing the panther HMM files along with the fasta inputs. <br />
<br />
For more information check the README file at <code>/datashare/PANTHER/panther_library</code><br />
<br />
<br />
===== pathway =====<br />
===== sequence_classifications =====<br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=409
Graham Reference Dataset Repository
2021-08-06T15:59:39Z
<p>Jshleap: </p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the project's own README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis result sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
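For example, a single record can be pulled out of a pre-formatted database with <code>blastdbcmd</code>. This is a sketch only: the accession P01308 is an arbitrary illustration, not a value taken from this repository.<br />

```shell
# assumes BLAST+ is loaded, e.g.: module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
# extract one entry from the pre-formatted swissprot database as FASTA (%f)
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -outfmt %f
```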
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and its dependencies<br />
# copy the volumes of the required database (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also query the database directly from <code>/datashare/BLASTDB</code> (i.e. <code>-db /datashare/BLASTDB/nr</code>), but this might be slower than having the database on the local disk.<br />
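The <code>tar</code> pipe used for the copy can be tried anywhere. Below is a self-contained sketch of the same pattern, using temporary directories and empty files in place of <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>:<br />

```shell
#!/usr/bin/env bash
set -euo pipefail

# stand-in directories; on Graham these would be /datashare/BLASTDB and $SLURM_TMPDIR
src=$(mktemp -d)
dst=$(mktemp -d)

# fake database volumes standing in for the real nr.* files
touch "$src/nr.00.phr" "$src/nr.00.pin" "$src/nr.00.psq"

# the tar pipe: archive the matching volumes on the fly and unpack them
# at the destination, keeping only the file names (no leading directories)
(cd "$src" && tar cf - nr.*) | (cd "$dst" && tar xf -)

ls "$dst"
```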
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
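These compressed FASTA files can be streamed without unpacking them to disk. A self-contained sketch follows, in which a toy two-record file stands in for the real <code>swissprot.gz</code>:<br />

```shell
#!/usr/bin/env bash
set -euo pipefail

# toy two-record FASTA standing in for a file such as /datashare/BLAST_FASTA/swissprot.gz
tmp=$(mktemp -d)
printf '>sp1 example protein\nMKT\n>sp2 example protein\nMVL\n' | gzip > "$tmp/swissprot.gz"

# stream the compressed file and count the sequences without decompressing it to disk
nseqs=$(gunzip -c "$tmp/swissprot.gz" | grep -c '^>')
echo "$nseqs"
```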
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but it has optimizations both at the database level and at the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9 built using the following:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates will follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
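As an illustration only (the flag values below are arbitrary starting points, not tuned recommendations), a lower block size <code>-b</code> and a higher number of index chunks <code>-c</code> both reduce the memory footprint at some cost in speed:<br />

```shell
# -b: sequence block size in billions of letters (default 2.0); lower uses less RAM
# -c: number of index chunks (default 4); higher also lowers memory use
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1.0 -c 8
```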
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than a <code>tar</code> pipe. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and its dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also query the database directly from <code>/datashare/DIAMONDDB_2.0.9</code> (i.e. <code>-d /datashare/DIAMONDDB_2.0.9/nr</code>), but this might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of orthology relationships and functional annotation hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, i.e. the alignments, annotations, hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
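The per-taxon files are gzipped TSV tables and can be streamed with standard tools. A self-contained sketch follows; the toy record is invented to illustrate the idea, and the real column layout is documented by EggNOG:<br />

```shell
#!/usr/bin/env bash
set -euo pipefail

# toy stand-in for e.g. /datashare/EggNog/per_tax_level/2/2_members.tsv.gz
tmp=$(mktemp -d)
printf '2\tCOG0001\t3\t2\tP1,P2,P3\n' | gzip > "$tmp/2_members.tsv.gz"

# stream the compressed table: count groups and pull the group id of the first record
zcat "$tmp/2_members.tsv.gz" | wc -l
zcat "$tmp/2_members.tsv.gz" | head -n 1 | cut -f2
```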
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple programs (seqkit, kraken, blast, diamond, etc.) as well as for direct searches of accession numbers, taxonomic IDs, and related information. It will be updated along with the BLAST databases.<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Collection_codes.txt<br />
├── Cowner_dump.txt<br />
├── dead_nucl.accession2taxid.gz<br />
├── dead_nucl.accession2taxid.gz.md5<br />
├── dead_prot.accession2taxid.gz<br />
├── dead_prot.accession2taxid.gz.md5<br />
├── dead_wgs.accession2taxid.gz<br />
├── dead_wgs.accession2taxid.gz.md5<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── gi_taxid_nucl.dmp.gz<br />
├── gi_taxid_nucl.dmp.gz.md5<br />
├── gi_taxid_prot.dmp.gz<br />
├── gi_taxid_prot.dmp.gz.md5<br />
├── gi_taxid.readme<br />
├── Icode_dump.txt<br />
├── Institution_codes.txt<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump.tar.gz<br />
├── new_taxdump.tar.gz.md5<br />
├── new_taxdump.tar.Z<br />
├── new_taxdump.tar.Z.md5<br />
├── new_taxdump.zip.md5<br />
├── nodes.dmp<br />
├── nucl_gb.accession2taxid.gz<br />
├── nucl_gb.accession2taxid.gz.md5<br />
├── nucl_wgs.accession2taxid.gz<br />
├── nucl_wgs.accession2taxid.gz.md5<br />
├── pdb.accession2taxid.gz<br />
├── pdb.accession2taxid.gz.md5<br />
├── prot.accession2taxid.FULL.gz<br />
├── prot.accession2taxid.FULL.gz.md5<br />
├── prot.accession2taxid.gz<br />
├── prot.accession2taxid.gz.md5<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxcat.tar.Z<br />
├── taxcat.tar.Z.md5<br />
├── taxcat.zip.md5<br />
├── taxdmp.zip.md5<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
├── taxdump.tar.gz.md5<br />
├── taxdump.tar.Z<br />
├── taxdump.tar.Z.md5<br />
└── Unique_institution_codes.txt<br />
<br />
0 directories, 56 files<br />
</pre><br />
</div><br />
</div><br />
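The <code>.dmp</code> dumps are plain text with fields separated by a tab-pipe-tab sequence, so they can be queried with standard tools. A self-contained sketch follows, in which a two-line toy file, written in the same layout, stands in for the real <code>/datashare/NCBI_taxonomy/names.dmp</code>:<br />

```shell
#!/usr/bin/env bash
set -euo pipefail

# toy excerpt in the names.dmp layout: taxid | name | unique name | name class |
tmp=$(mktemp -d)
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' >  "$tmp/names.dmp"
printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'    >> "$tmp/names.dmp"

# look up the scientific name for taxid 9606, as one would on the real file
awk -F'\t[|]\t' '$1 == "9606" && $4 ~ /scientific name/ {print $2}' "$tmp/names.dmp"
```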
<br />
=== PANTHER === <br />
The PANTHER (protein analysis through evolutionary relationships) classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. It is part of the [https://en.wikipedia.org/wiki/PANTHER#cite_note-GOproject-2 Gene Ontology Reference Genome Project] designed to classify proteins and their genes for high-throughput analysis. <br />
<br />
In our data mount, we provide users with some of the relevant data found in the [ftp://ftp.pantherdb.org pantherdb ftp], namely: hmm_classifications, panther_library, pathway, and sequence_classifications.<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
PANTHER directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── hmm_classifications<br />
│ ├── LICENSE<br />
│ ├── PANTHER15.0_HMM_classifications<br />
│ ├── PANTHER16.0_HMM_classifications<br />
│ └── README<br />
├── panther_library<br />
│ ├── ascii<br />
│ ├── hmmscoring<br />
│ ├── PANTHER15.0_ascii.tgz<br />
│ ├── PANTHER15.0_fasta<br />
│ ├── PANTHER15.0_fasta.tgz<br />
│ ├── PANTHER15.0_hmmscoring.tgz<br />
│ ├── PANTHER16.0_ascii.tgz<br />
│ ├── PANTHER16.0_binary.tgz<br />
│ ├── PANTHER16.0_fasta<br />
│ ├── PANTHER16.0_fasta.tgz<br />
│ ├── README<br />
│ ├── target4<br />
│ └── wget_panther_panther_library.log<br />
├── pathway<br />
│ ├── BioPAX<br />
│ ├── BioPAX.tar.gz<br />
│ ├── sbml<br />
│ ├── sbml.tar.gz<br />
│ ├── SequenceAssociationPathway3.6.4.txt<br />
│ └── SequenceAssociationPathway3.6.5.txt<br />
└── sequence_classifications<br />
├── LICENSE<br />
├── PANTHER_Sequence_Classification_files<br />
├── README<br />
└── species<br />
<br />
12 directories, 19 files<br />
</pre><br />
</div><br />
</div><br />
<br />
<br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== SVHN ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=408
Graham Reference Dataset Repository
2021-08-06T13:41:41Z
<p>Jshleap: </p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from the project's [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In the future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
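The <code>blastdbcmd</code> point above can be sketched as a small wrapper. This is a hypothetical convenience function, not part of the repository; it assumes the <code>blast+</code> module is loaded and uses only documented <code>blastdbcmd</code> options (<code>-db</code>, <code>-entry all</code>, <code>-out</code>).<br />

```shell
# Hypothetical wrapper (not part of the repository): regenerate FASTA from a
# pre-formatted BLAST database. Assumes blast+ is on PATH, e.g. via
#   module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
extract_blastdb_fasta () {
  local db=$1 out=$2
  # -entry all dumps every sequence in the database to FASTA
  blastdbcmd -db "$db" -entry all -out "$out"
}
# Usage on Graham, e.g.:
# extract_blastdb_fasta /datashare/BLASTDB/swissprot swissprot.fasta
```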
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and dependencies<br />
# copy the required database files (in this case nr.*) to the node-local $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
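Because these files are gzip-compressed, they can be streamed without writing a decompressed copy to disk. As a minimal sketch (the helper name is ours, not part of the repository):<br />

```shell
# Hypothetical helper: count the sequences in a gzip-compressed FASTA file by
# streaming it through gzip -cd instead of decompressing it to disk first.
count_fasta_seqs () {
  gzip -cd "$1" | grep -c '^>'
}
# Usage on Graham, e.g.:
# count_fasta_seqs /datashare/BLAST_FASTA/swissprot.gz
```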
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], they are updated on the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
DIAMOND uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
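As a hedged sketch of that advice (flag names are from the DIAMOND manual; the wrapper name and the particular values are illustrative, so tune them to your data):<br />

```shell
# Hypothetical low-memory DIAMOND invocation: -b (--block-size) lowers the
# sequence block size in billions of letters, and -c (--index-chunks) raises
# the number of index chunks; both trade speed for a smaller memory footprint.
run_diamond_lowmem () {
  local db=$1 query=$2 out=$3
  diamond blastp -d "$db" -q "$query" -o "$out" \
    -b 1.0 -c 8 --tmpdir "${SLURM_TMPDIR:-/tmp}"
}
```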
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are copying), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-d</code> directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
[http://eggnog5.embl.de/#/app/home EggNOG] is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of folders labeled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, containing the hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
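The <code>per_tax_level/&lt;taxid&gt;/</code> layout can be navigated with a one-line helper (hypothetical, not part of the repository; <code>EGGNOG_ROOT</code> is our own variable defaulting to the Graham mount):<br />

```shell
# Hypothetical helper: build the path to the per-taxon EggNOG files for a given
# taxonomic ID, following the per_tax_level/<taxid>/ layout described above.
eggnog_tax_dir () {
  local taxid=$1
  echo "${EGGNOG_ROOT:-/datashare/EggNog}/per_tax_level/${taxid}"
}
# e.g. list the annotation/HMM/tree bundles for Bacteria (TaxID 2):
# ls "$(eggnog_tax_dir 2)"
```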
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP]. It is intended to work with multiple software packages (seqkit, kraken, blast, diamond, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information. It is updated alongside the BLAST databases.<br />
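For example, a direct search of a taxonomic ID against <code>names.dmp</code> (whose fields are separated by <code>&lt;tab&gt;|&lt;tab&gt;</code>) can be sketched as follows; the helper name is ours, not part of the dataset:<br />

```shell
# Hypothetical helper: print the scientific name for a TaxID from an NCBI
# names.dmp file, whose fields are separated by "<tab>|<tab>".
taxid_to_name () {
  local taxid=$1 names=$2
  awk -F'\t\\|\t' -v id="$taxid" \
    '$1 == id && $4 ~ /scientific name/ {print $2}' "$names"
}
# Usage on Graham:
# taxid_to_name 9606 /datashare/NCBI_taxonomy/names.dmp
```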
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
NCBI_taxonomy directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── accession2taxid<br />
├── Ccode_dump.txt<br />
├── citations.dmp<br />
├── coll_dump.txt<br />
├── Collection_codes.txt<br />
├── Cowner_dump.txt<br />
├── dead_nucl.accession2taxid.gz<br />
├── dead_nucl.accession2taxid.gz.md5<br />
├── dead_prot.accession2taxid.gz<br />
├── dead_prot.accession2taxid.gz.md5<br />
├── dead_wgs.accession2taxid.gz<br />
├── dead_wgs.accession2taxid.gz.md5<br />
├── delnodes.dmp<br />
├── division.dmp<br />
├── gc.prt<br />
├── gencode.dmp<br />
├── gi_taxid_nucl.dmp.gz<br />
├── gi_taxid_nucl.dmp.gz.md5<br />
├── gi_taxid_prot.dmp.gz<br />
├── gi_taxid_prot.dmp.gz.md5<br />
├── gi_taxid.readme<br />
├── Icode_dump.txt<br />
├── Institution_codes.txt<br />
├── merged.dmp<br />
├── names.dmp<br />
├── ncbi_taxonomy_genussp.txt<br />
├── new_taxdump.tar.gz<br />
├── new_taxdump.tar.gz.md5<br />
├── new_taxdump.tar.Z<br />
├── new_taxdump.tar.Z.md5<br />
├── new_taxdump.zip.md5<br />
├── nodes.dmp<br />
├── nucl_gb.accession2taxid.gz<br />
├── nucl_gb.accession2taxid.gz.md5<br />
├── nucl_wgs.accession2taxid.gz<br />
├── nucl_wgs.accession2taxid.gz.md5<br />
├── pdb.accession2taxid.gz<br />
├── pdb.accession2taxid.gz.md5<br />
├── prot.accession2taxid.FULL.gz<br />
├── prot.accession2taxid.FULL.gz.md5<br />
├── prot.accession2taxid.gz<br />
├── prot.accession2taxid.gz.md5<br />
├── readme.txt<br />
├── taxcat_readme.txt<br />
├── taxcat.tar.gz<br />
├── taxcat.tar.gz.md5<br />
├── taxcat.tar.Z<br />
├── taxcat.tar.Z.md5<br />
├── taxcat.zip.md5<br />
├── taxdmp.zip.md5<br />
├── taxdump_readme.txt<br />
├── taxdump.tar.gz<br />
├── taxdump.tar.gz.md5<br />
├── taxdump.tar.Z<br />
├── taxdump.tar.Z.md5<br />
└── Unique_institution_codes.txt<br />
<br />
0 directories, 56 files<br />
<br />
</pre><br />
</div><br />
</div><br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
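As a quick sketch of navigating a collection down to the population/sample level (the collection name below appears in the tree above; the exact layout under <code>data</code> should be confirmed against the collection's own README):<br />

```shell
# List the available data collections
ls /datashare/1000genomes/data_collections/

# Drill into the 1000 Genomes Project collection; its data directory is
# organised by population and then by sample
ls /datashare/1000genomes/data_collections/1000_genomes_project/data/
```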
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
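As a minimal sketch of the last point (the module versions and the accession below are illustrative assumptions; check <code>module spider blast+</code> for available versions), a FASTA sequence can be pulled out of a pre-formatted database like this:<br />

```shell
# Load BLAST+ and its dependencies (versions shown are an example)
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0

# Extract one entry from the pre-formatted swissprot database as FASTA;
# P01308 (human insulin) is an illustrative accession
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out insulin.fasta
```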
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database files (in this case nr) to $SLURM_TMPDIR;<br />
# changing directory first keeps the archived paths relative, so the files<br />
# land directly in $SLURM_TMPDIR rather than under datashare/BLASTDB/<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point directly at <code>/datashare/BLASTDB/nr</code> (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
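As a quick sketch of working with these compressed FASTA files without unpacking them to disk (counting the sequences in swissprot; any downstream tool that reads FASTA from stdin could be substituted for the count):<br />

```shell
# Count sequences in the compressed swissprot FASTA by streaming it;
# each FASTA record starts with a '>' header line
zcat /datashare/BLAST_FASTA/swissprot.gz | grep -c '^>'
```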
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but it has some optimizations at both the database level and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
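As a sketch of how that looks in practice (the values below are illustrative; a lower <code>-b</code> means lower memory use at the cost of speed, and <code>-c 1</code> likewise reduces memory used for index processing):<br />

```shell
# Re-run the search with a smaller block size (-b, in billions of letters)
# and fewer index chunks (-c); both flags trade speed for lower memory use
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 2.0 -c 1
```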
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv # use the local copy<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is the required database.<br />
<br />
You can also point directly at <code>/datashare/DIAMONDDB_2.0.9/nr</code> (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a repository of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs] and expands it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders, labelled by taxonomic ID. In each one of them, you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
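For instance, to peek at the orthologous-group annotations at a given taxonomic level (taxid 2, Bacteria, appears in the tree above; the exact file name is an assumption following the <code>*_annotations.tsv.gz</code> pattern described above):<br />

```shell
# Inspect the first few orthologous-group annotations for taxid 2 (Bacteria)
zcat /datashare/EggNog/per_tax_level/2/2_annotations.tsv.gz | head
```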
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy ===<br />
This dataset contains a copy of the [https://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site]. It is intended to work with multiple software packages (seqkit, kraken, blast, etc.) as well as with direct searches of accession numbers, taxonomic IDs, and related information.<br />
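As a quick sketch of a direct lookup (assuming the standard NCBI taxonomy dump file <code>names.dmp</code> is present at the top level, as <code>nodes.dmp</code> is; fields in these dump files are separated by <code>\t|\t</code>):<br />

```shell
# Look up the name records for taxid 9606 (Homo sapiens) in the
# NCBI taxonomy names.dmp dump; $'...' makes the tab literal in bash
grep $'^9606\t' /datashare/NCBI_taxonomy/names.dmp
```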
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=406
Graham Reference Dataset Repository
2021-07-26T20:07:40Z
<p>Jshleap: /* EggNog */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below a detailed description of each dataset and how to access them.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point to many studies. We provide their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
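<br />
For example, assuming the <code>blast+</code> module is loaded, <code>blastdbcmd</code> can recover FASTA sequences from the pre-formatted databases or report their metadata (the accession below is only a placeholder):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
export BLASTDB=/datashare/BLASTDB   # let BLAST find the pre-formatted databases<br />
blastdbcmd -db swissprot -entry P01308 -out P01308.fasta  # dump one entry back to FASTA (placeholder accession)<br />
blastdbcmd -db swissprot -info                            # show the title, size, and date of the database<br />
</syntaxhighlight><br />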
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script could look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is required as the database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
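<br />
Should you need a database built with custom options instead of the pre-formatted ones, these FASTA dumps can be fed to [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]. The sketch below (the output path is only an example) streams the compressed file so it never needs to be fully unpacked on disk:<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
mkdir -p ~/scratch/mydb<br />
# makeblastdb reads from stdin by default, so the gzip stream can be piped in<br />
gunzip -c /datashare/BLAST_FASTA/swissprot.gz | \<br />
    makeblastdb -dbtype prot -title swissprot_custom -out ~/scratch/mydb/swissprot_custom<br />
</syntaxhighlight><br />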
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, corresponding to BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail due to running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block-size parameter <code>-b</code>].<br />
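<br />
As a sketch (the parameter value below is illustrative, not prescriptive), lowering <code>-b</code> below its default of 2.0 reduces memory use at some cost in speed:<br />
<br />
<syntaxhighlight lang="bash"><br />
# rerun a failing search with a smaller block size to cut RAM and temporary-disk usage<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv \<br />
        -b 1.0   # default is 2.0; smaller values need less memory but run slower<br />
</syntaxhighlight><br />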
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, so <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script could look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load DIAMOND and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the same directory where YOURREADS.fasta is located, that YOURREADS.fasta is a set of protein sequences, and that nr is required as the database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The [http://eggnog5.embl.de/#/app/home EggNOG] database is a database of biological information hosted by the [https://www.embl.org/sites/heidelberg/ EMBL]. It is based on the original idea of [http://www.pdg.cnb.uam.es/cursos/Leon2002/pages/software/DatabasesListNAR2002/summary/7.html COGs], expanding it to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of subfolders labelled by taxonomic ID. In each of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code> files with the Hidden Markov model profiles, alignments, annotations, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
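<br />
For example, assuming the taxonomic ID 2759 (Eukaryota) from the listing above and the usual EggNOG file-naming scheme, the contents of one taxonomic level can be previewed directly from the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
ls /datashare/EggNog/per_tax_level/2759/                                   # archives available at this level<br />
zcat /datashare/EggNog/per_tax_level/2759/2759_annotations.tsv.gz | head   # peek at the orthologous-group annotations<br />
</syntaxhighlight><br />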
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=405
Graham Reference Dataset Repository
2021-07-26T19:54:14Z
<p>Jshleap: /* EggNog */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
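As a sketch of the last point (not an official recipe), a small helper script can call <code>blastdbcmd</code> to print database metadata and dump one entry in FASTA format; the accession P01308 is only an illustrative example:<br />

```shell
# Write a helper script that queries the pre-formatted swissprot database.
# The module versions and the accession are illustrative assumptions.
cat > dump_seq.sh <<'EOF'
#!/bin/bash
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0
# Show database metadata (title, number of sequences, build date)
blastdbcmd -db /datashare/BLASTDB/swissprot -info
# Extract one entry in FASTA format
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta
EOF
chmod +x dump_seq.sh
```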
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
# copy the required database volumes (in this case nr.*) to $SLURM_TMPDIR,<br />
# preserving their relative file names so -db ${SLURM_TMPDIR}/nr works<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
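The tar staging pattern can be tried with throwaway files; the two directories below merely stand in for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>:<br />

```shell
# Demonstrate the staging pattern with mock files; on Graham the source
# files would be the real database volumes (nr.00.phr, nr.00.pin, ...).
mkdir -p blastdb_mock tmpdir_mock
touch blastdb_mock/nr.00.phr blastdb_mock/nr.00.pin blastdb_mock/nr.00.psq

# Stream the files through tar so their relative names are preserved,
# leaving tmpdir_mock/nr.* ready to be used as "-db tmpdir_mock/nr"
(cd blastdb_mock && tar cf - nr.*) | (cd tmpdir_mock && tar xf -)

ls tmpdir_mock
```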
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
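Because the files are gzip-compressed, they can be streamed with <code>gunzip -c</code> instead of being decompressed on disk; a tiny mock FASTA stands in for the real files here:<br />

```shell
# Count sequences in a gzipped FASTA without writing an uncompressed copy;
# on Graham the argument would be e.g. /datashare/BLAST_FASTA/swissprot.gz
printf '>seq1\nMKV\n>seq2\nGGA\n' | gzip > mock.fasta.gz
gunzip -c mock.fasta.gz | grep -c '^>'   # prints 2
```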
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates will follow the same quarterly schedule (Jan, Apr, Jul, Oct).<br />
<br />
==== Considerations when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail after running out of either, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].<br />
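As a sketch (the flag values are illustrative, not tuned), a lower block size (<code>-b</code>, in billions of sequence letters, default 2.0) can be combined with the index-chunks option (<code>-c</code>) to trade speed for memory; the snippet below only writes the job file:<br />

```shell
# Write a lower-memory variant of the DIAMOND job script; -b and -c values
# here are assumptions for illustration, not recommended settings.
cat > diamond_lowmem.sh <<'EOF'
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8
#SBATCH --account=def-someuser
module load StdEnv/2020 diamond/2.0.9
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv -b 1.0 -c 4
EOF
chmod +x diamond_lowmem.sh
```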
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script, just like with [[#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so do it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with [[#BLASTDB|BLASTDB]], only one file needs to be moved, which means that <code>cp</code> is more efficient than <code>tar</code> for moving it. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 diamond/2.0.9 # load diamond and dependencies<br />
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR} # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
diamond blastp -d ${SLURM_TMPDIR}/nr -q YOURREADS.fasta -o AN_OUTPUT.tsv<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/DIAMONDDB_2.0.9/nr</code> directly (skipping the copy), but it might be slower than having the database on the local disk.<br />
<br />
=== EggNog ===<br />
The eggNOG database is a database of biological information hosted by the EMBL. It is based on the original idea of COGs and expands that idea to non-supervised orthologous groups constructed from numerous organisms.<br />
<br />
This data mount contains a copy of the [http://eggnog5.embl.de/download/latest/ latest EggNOG databases].<br />
==== Directory structure ====<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
EggNOG directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── e5.level_info.tar.gz<br />
├── e5.og_annotations.tsv<br />
├── e5.proteomes.faa<br />
├── e5.sequence_aliases.tsv<br />
├── e5.taxid_info.tsv<br />
├── e5.viruses.faa<br />
├── gbff<br />
│ ├── eutils_wgs_calledGenes<br />
│ └── eutils_wgs_calledGenes_2<br />
├── id_mappings<br />
│ └── uniprot<br />
├── per_tax_level<br />
│ ├── 1<br />
│ ├── 10<br />
│ ├── 1016<br />
│ ├── 10239<br />
│ ├── 1028384<br />
│ ├── 10404<br />
│ ├── 104264<br />
│ ├── 10474<br />
│ ├── 10477<br />
│ ├── 1060<br />
│ ├── 10656<br />
│ ├── 10662<br />
│ ├── 10699<br />
│ ├── 10744<br />
│ ├── 10841<br />
│ ├── 10860<br />
│ ├── 1090<br />
│ ├── 1100069<br />
│ ├── 110618<br />
│ ├── 11157<br />
│ ├── 1117<br />
│ ├── 112252<br />
│ ├── 1129<br />
│ ├── 1142<br />
│ ├── 1150<br />
│ ├── 1161<br />
│ ├── 11632<br />
│ ├── 1164882<br />
│ ├── 117743<br />
│ ├── 117747<br />
│ ├── 118882<br />
│ ├── 118884<br />
│ ├── 1189<br />
│ ├── 118969<br />
│ ├── 119043<br />
│ ├── 119045<br />
│ ├── 119060<br />
│ ├── 119065<br />
│ ├── 119066<br />
│ ├── 119069<br />
│ ├── 119089<br />
│ ├── 119603<br />
│ ├── 11989<br />
│ ├── 121069<br />
│ ├── 1212<br />
│ ├── 122277<br />
│ ├── 1224<br />
│ ├── 1236<br />
│ ├── 1239<br />
│ ├── 1268<br />
│ ├── 1283313<br />
│ ├── 129337<br />
│ ├── 1297<br />
│ ├── 1303<br />
│ ├── 1305<br />
│ ├── 1307<br />
│ ├── 1313<br />
│ ├── 135613<br />
│ ├── 135614<br />
│ ├── 135618<br />
│ ├── 135619<br />
│ ├── 135623<br />
│ ├── 135624<br />
│ ├── 135625<br />
│ ├── 1357<br />
│ ├── 136841<br />
│ ├── 136843<br />
│ ├── 136845<br />
│ ├── 136846<br />
│ ├── 136849<br />
│ ├── 1386<br />
│ ├── 142182<br />
│ ├── 145357<br />
│ ├── 147541<br />
│ ├── 147545<br />
│ ├── 147548<br />
│ ├── 147550<br />
│ ├── 150247<br />
│ ├── 1506553<br />
│ ├── 1511857<br />
│ ├── 155619<br />
│ ├── 1570339<br />
│ ├── 157897<br />
│ ├── 1653<br />
│ ├── 167375<br />
│ ├── 171550<br />
│ ├── 171551<br />
│ ├── 1762<br />
│ ├── 178469<br />
│ ├── 182709<br />
│ ├── 183925<br />
│ ├── 183939<br />
│ ├── 183963<br />
│ ├── 183967<br />
│ ├── 183968<br />
│ ├── 183980<br />
│ ├── 186801<br />
│ ├── 186804<br />
│ ├── 186806<br />
│ ├── 186807<br />
│ ├── 186813<br />
│ ├── 186818<br />
│ ├── 186820<br />
│ ├── 186821<br />
│ ├── 186822<br />
│ ├── 186823<br />
│ ├── 186824<br />
│ ├── 186827<br />
│ ├── 186828<br />
│ ├── 186928<br />
│ ├── 189330<br />
│ ├── 189775<br />
│ ├── 191028<br />
│ ├── 191675<br />
│ ├── 2<br />
│ ├── 200643<br />
│ ├── 200783<br />
│ ├── 200795<br />
│ ├── 200918<br />
│ ├── 200930<br />
│ ├── 200940<br />
│ ├── 201174<br />
│ ├── 203494<br />
│ ├── 203682<br />
│ ├── 203691<br />
│ ├── 204037<br />
│ ├── 204428<br />
│ ├── 204432<br />
│ ├── 204441<br />
│ ├── 204457<br />
│ ├── 204458<br />
│ ├── 2063<br />
│ ├── 206350<br />
│ ├── 206351<br />
│ ├── 206389<br />
│ ├── 213113<br />
│ ├── 213115<br />
│ ├── 213118<br />
│ ├── 213462<br />
│ ├── 213481<br />
│ ├── 2157<br />
│ ├── 216572<br />
│ ├── 224756<br />
│ ├── 225057<br />
│ ├── 228398<br />
│ ├── 2323<br />
│ ├── 237<br />
│ ├── 2433<br />
│ ├── 244698<br />
│ ├── 245186<br />
│ ├── 246874<br />
│ ├── 252301<br />
│ ├── 252356<br />
│ ├── 255475<br />
│ ├── 256005<br />
│ ├── 265<br />
│ ├── 265975<br />
│ ├── 267888<br />
│ ├── 267889<br />
│ ├── 267890<br />
│ ├── 267893<br />
│ ├── 267894<br />
│ ├── 2759<br />
│ ├── 28037<br />
│ ├── 28211<br />
│ ├── 28216<br />
│ ├── 28221<br />
│ ├── 2836<br />
│ ├── 283735<br />
│ ├── 285107<br />
│ ├── 28883<br />
│ ├── 28889<br />
│ ├── 28890<br />
│ ├── 289201<br />
│ ├── 29<br />
│ ├── 29000<br />
│ ├── 290174<br />
│ ├── 29258<br />
│ ├── 29547<br />
│ ├── 301297<br />
│ ├── 302485<br />
│ ├── 3041<br />
│ ├── 308865<br />
│ ├── 311790<br />
│ ├── 314146<br />
│ ├── 314294<br />
│ ├── 31979<br />
│ ├── 31993<br />
│ ├── 32003<br />
│ ├── 32061<br />
│ ├── 32066<br />
│ ├── 32199<br />
│ ├── 326319<br />
│ ├── 326457<br />
│ ├── 33090<br />
│ ├── 33154<br />
│ ├── 33183<br />
│ ├── 33208<br />
│ ├── 33213<br />
│ ├── 33342<br />
│ ├── 33554<br />
│ ├── 335928<br />
│ ├── 33867<br />
│ ├── 33958<br />
│ ├── 34008<br />
│ ├── 34037<br />
│ ├── 34383<br />
│ ├── 34384<br />
│ ├── 34397<br />
│ ├── 35237<br />
│ ├── 35268<br />
│ ├── 35278<br />
│ ├── 35301<br />
│ ├── 35325<br />
│ ├── 35493<br />
│ ├── 355688<br />
│ ├── 35718<br />
│ ├── 358033<br />
│ ├── 363408<br />
│ ├── 3699<br />
│ ├── 38820<br />
│ ├── 39782<br />
│ ├── 400634<br />
│ ├── 40117<br />
│ ├── 40674<br />
│ ├── 41294<br />
│ ├── 414999<br />
│ ├── 422676<br />
│ ├── 423358<br />
│ ├── 439488<br />
│ ├── 43988<br />
│ ├── 4447<br />
│ ├── 451866<br />
│ ├── 451867<br />
│ ├── 451870<br />
│ ├── 452284<br />
│ ├── 45401<br />
│ ├── 45404<br />
│ ├── 45667<br />
│ ├── 46205<br />
│ ├── 464095<br />
│ ├── 468<br />
│ ├── 4751<br />
│ ├── 4776<br />
│ ├── 4890<br />
│ ├── 4891<br />
│ ├── 4893<br />
│ ├── 5042<br />
│ ├── 50557<br />
│ ├── 506<br />
│ ├── 508458<br />
│ ├── 5125<br />
│ ├── 5129<br />
│ ├── 5139<br />
│ ├── 5148<br />
│ ├── 5151<br />
│ ├── 52018<br />
│ ├── 5204<br />
│ ├── 5234<br />
│ ├── 52604<br />
│ ├── 526524<br />
│ ├── 52959<br />
│ ├── 53335<br />
│ ├── 5338<br />
│ ├── 53433<br />
│ ├── 538999<br />
│ ├── 539002<br />
│ ├── 541000<br />
│ ├── 544<br />
│ ├── 544448<br />
│ ├── 547<br />
│ ├── 548681<br />
│ ├── 551<br />
│ ├── 554915<br />
│ ├── 558415<br />
│ ├── 561<br />
│ ├── 5653<br />
│ ├── 572511<br />
│ ├── 57723<br />
│ ├── 5794<br />
│ ├── 5796<br />
│ ├── 5809<br />
│ ├── 5819<br />
│ ├── 583<br />
│ ├── 586<br />
│ ├── 5863<br />
│ ├── 5878<br />
│ ├── 58840<br />
│ ├── 590<br />
│ ├── 59732<br />
│ ├── 60136<br />
│ ├── 613<br />
│ ├── 61432<br />
│ ├── 622450<br />
│ ├── 6231<br />
│ ├── 6236<br />
│ ├── 629<br />
│ ├── 629295<br />
│ ├── 639021<br />
│ ├── 651137<br />
│ ├── 6656<br />
│ ├── 671232<br />
│ ├── 675063<br />
│ ├── 68295<br />
│ ├── 68298<br />
│ ├── 68525<br />
│ ├── 68892<br />
│ ├── 69277<br />
│ ├── 69541<br />
│ ├── 69657<br />
│ ├── 7088<br />
│ ├── 71274<br />
│ ├── 713636<br />
│ ├── 7147<br />
│ ├── 7148<br />
│ ├── 7214<br />
│ ├── 72273<br />
│ ├── 72275<br />
│ ├── 7399<br />
│ ├── 74030<br />
│ ├── 74201<br />
│ ├── 74385<br />
│ ├── 75682<br />
│ ├── 766<br />
│ ├── 766764<br />
│ ├── 76804<br />
│ ├── 76831<br />
│ ├── 768503<br />
│ ├── 7711<br />
│ ├── 772<br />
│ ├── 7742<br />
│ ├── 7898<br />
│ ├── 80864<br />
│ ├── 815<br />
│ ├── 81850<br />
│ ├── 81852<br />
│ ├── 82115<br />
│ ├── 82117<br />
│ ├── 82986<br />
│ ├── 830<br />
│ ├── 83612<br />
│ ├── 84406<br />
│ ├── 8459<br />
│ ├── 84992<br />
│ ├── 84995<br />
│ ├── 84998<br />
│ ├── 85004<br />
│ ├── 85005<br />
│ ├── 85008<br />
│ ├── 85009<br />
│ ├── 85010<br />
│ ├── 85012<br />
│ ├── 85013<br />
│ ├── 85014<br />
│ ├── 85016<br />
│ ├── 85017<br />
│ ├── 85018<br />
│ ├── 85019<br />
│ ├── 85020<br />
│ ├── 85021<br />
│ ├── 85023<br />
│ ├── 85025<br />
│ ├── 85026<br />
│ ├── 8782<br />
│ ├── 90964<br />
│ ├── 909932<br />
│ ├── 91061<br />
│ ├── 91561<br />
│ ├── 91835<br />
│ ├── 9263<br />
│ ├── 92860<br />
│ ├── 93682<br />
│ ├── 9397<br />
│ ├── 9443<br />
│ ├── 9604<br />
│ ├── 97050<br />
│ ├── 976<br />
│ ├── 995019<br />
│ └── 9989<br />
└── raw_data<br />
├── e5.best_hit_homology_matrix.tsv.gz<br />
└── speciation_events.tsv.gz<br />
<br />
386 directories, 8 files<br />
</pre><br />
</div><br />
</div><br />
<br />
The top-level directory includes the e5 release of the proteomes and their annotations. The <code>gbff</code> folder contains annotations in GenBank format. The <code>id_mappings</code> folder contains the taxonomic information and the mappings to EggNOG's taxids. The <code>per_tax_level</code> folder contains a series of folders labeled by taxonomic ID. In each one of them you can find <code>*_annotations.tsv.gz *_hmms.tar *_hmms.tar.gz *_members.tsv.gz *_raw_algs.tar *_stats.tsv *_trees.tsv.gz *_trimmed_algs.tar</code>, with the alignments, annotations, Hidden Markov model profiles, and phylogenetic trees. Finally, the <code>raw_data</code> folder contains the homology/speciation events used in EggNOG's clustering.<br />
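The gzipped TSVs under <code>per_tax_level</code> can be inspected directly with <code>gunzip -c</code>; the mock file below stands in for e.g. <code>/datashare/EggNog/per_tax_level/2/2_annotations.tsv.gz</code> (taxid 2, Bacteria), and its single row is only illustrative:<br />

```shell
# Create a one-line mock of a per-taxon annotations table and read a
# column back; real files follow the <taxid>_annotations.tsv.gz pattern.
mkdir -p per_tax_level/2
printf '2\tCOG0001\tH\tglutamate-1-semialdehyde aminotransferase\n' \
  | gzip > per_tax_level/2/2_annotations.tsv.gz
gunzip -c per_tax_level/2/2_annotations.tsv.gz | cut -f2   # prints COG0001
```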
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=404
Graham Reference Dataset Repository
2021-07-26T18:58:36Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in/datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].&lt;br /&gt;
</div><br />
<br />
All available pre-formatted databases are located in Graham's &lt;code&gt;/datashare/BLASTDB&lt;/code&gt; and will be updated every 3 months (Jan, Apr, Jul, Oct).&lt;br /&gt;
<br />
==== Directory structure ====<br />
&lt;code&gt;/datashare/BLASTDB&lt;/code&gt; contains all the pre-formatted databases without any subfolders. We include the following:&lt;br /&gt;
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to &lt;code&gt;$SLURM_TMPDIR&lt;/code&gt; at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:&lt;br /&gt;
<br />
<br />
&lt;syntaxhighlight lang="bash"&gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --cpus-per-task=8&lt;br /&gt;
#SBATCH --account=def-someuser&lt;br /&gt;
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST+ and its dependencies&lt;br /&gt;
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR&lt;br /&gt;
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&&lt;br /&gt;
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta&lt;br /&gt;
&lt;/syntaxhighlight&gt;&lt;br /&gt;
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.&lt;br /&gt;
<br />
You can also run BLAST directly against &lt;code&gt;/datashare/BLASTDB/nr&lt;/code&gt;, but it might be slower than having the database on the local disk.&lt;br /&gt;
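The &lt;code&gt;tar&lt;/code&gt;-pipe staging pattern used in the script above can be tried anywhere. Below is a minimal, self-contained sketch in which throwaway temporary directories stand in for &lt;code&gt;/datashare/BLASTDB&lt;/code&gt; and &lt;code&gt;$SLURM_TMPDIR&lt;/code&gt;, and empty dummy files stand in for the real database volumes:&lt;br /&gt;

```bash
# Self-contained sketch of the tar-pipe staging pattern; SRC and DEST are
# stand-ins for /datashare/BLASTDB and $SLURM_TMPDIR (dummy files, not real data).
SRC=$(mktemp -d)
DEST=$(mktemp -d)
touch "$SRC/nr.00.phr" "$SRC/nr.00.pin" "$SRC/nr.00.psq"
# Stream the volumes through a pipe so no intermediate archive touches
# shared storage; the glob expands inside the first subshell, after the cd.
(cd "$SRC" && tar cf - nr.*) | (cd "$DEST" && tar xf -)
ls "$DEST"
```

The same pattern scales to many files without creating an intermediate archive on shared storage, which is why it is preferred over &lt;code&gt;cp&lt;/code&gt; for multi-volume databases.&lt;br /&gt;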
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.&lt;br /&gt;
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the &lt;code&gt;blast/db/FASTA/&lt;/code&gt; directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:&lt;br /&gt;
<br />
&lt;pre&gt;&lt;br /&gt;
134M Apr 10 15:36 swissprot.gz&lt;br /&gt;
 96G Apr 10 22:11 nr.gz&lt;br /&gt;
108G Apr 12 07:55 nt.gz&lt;br /&gt;
 32M Jun  4 15:30 pdbaa.gz&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:&lt;br /&gt;
<br />
&lt;syntaxhighlight lang="bash"&gt;&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
&lt;/syntaxhighlight&gt;&lt;br /&gt;
<br />
As can be seen, four databases are distributed in the &lt;code&gt;/datashare/DIAMONDDB_2.0.9&lt;/code&gt; directory, corresponding to BLAST's &lt;code&gt;nt&lt;/code&gt;, &lt;code&gt;nr&lt;/code&gt;, &lt;code&gt;pdbaa&lt;/code&gt;, and &lt;code&gt;swissprot&lt;/code&gt;. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).&lt;br /&gt;
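Note that the makedb commands above never write a decompressed copy of the FASTA archives to disk: process substitution (&lt;code&gt;&lt;(gunzip -c ...)&lt;/code&gt;) presents the decompressed stream to DIAMOND as a file path. A minimal self-contained sketch of that pattern, using a toy gzipped FASTA file in a temporary directory rather than the real archives:&lt;br /&gt;

```bash
# Toy demonstration of the gunzip -c / process-substitution pattern;
# the file and its contents are dummies, not the real /datashare archives.
TMP=$(mktemp -d)
printf '>seq1\nMKVL\n' | gzip > "$TMP/toy.fasta.gz"
# The consumer reads plain FASTA; the gzip layer is handled in the stream.
# Here we simply count FASTA headers instead of running diamond makedb.
grep -c '^>' <(gunzip -c "$TMP/toy.fasta.gz")
```

The same streaming approach keeps large databases from doubling their disk footprint during formatting.&lt;br /&gt;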
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].&lt;br /&gt;
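As a sketch (not a tuned recommendation), a job script that lowers the block size might look like the following. &lt;code&gt;-b&lt;/code&gt; (block size, in billions of letters) and &lt;code&gt;-c&lt;/code&gt; (index chunks) are DIAMOND performance options, but the values shown here are illustrative and should be adjusted to your data and the memory you request:&lt;br /&gt;

```bash
#!/bin/bash
#SBATCH --time=06:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=16
#SBATCH --account=def-someuser
module load StdEnv/2020 diamond/2.0.9
# Lowering -b (default 2.0) reduces memory and temporary disk use at some
# cost in speed; -c controls index chunking (default 4).
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q myquery.fasta \
    -p ${SLURM_CPUS_PER_TASK} -b 1.0 -c 4 -o results.tsv
```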
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to &lt;code&gt;$SLURM_TMPDIR&lt;/code&gt; at the beginning of your sbatch script, just like with [[Graham’s_Reference_Dataset_Repository#BLASTDB|BLASTDB]]. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your DIAMOND run will take longer than one hour. In this case, unlike with BLASTDB, only one file needs to be moved, which means that &lt;code&gt;cp&lt;/code&gt; is more efficient than &lt;code&gt;tar&lt;/code&gt; for moving it. For example, your sbatch script can look something like this:&lt;br /&gt;
<br />
<br />
&lt;syntaxhighlight lang="bash"&gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --cpus-per-task=8&lt;br /&gt;
#SBATCH --account=def-someuser&lt;br /&gt;
module load StdEnv/2020 diamond/2.0.9  # load DIAMOND and its dependencies&lt;br /&gt;
cp /datashare/DIAMONDDB_2.0.9/nr.dmnd ${SLURM_TMPDIR}  # copy the required database (in this case nr) to $SLURM_TMPDIR&lt;br /&gt;
diamond blastp -d ${SLURM_TMPDIR}/nr -p ${SLURM_CPUS_PER_TASK} -q YOURREADS.fasta -o AN_OUTPUT.tsv&lt;br /&gt;
&lt;/syntaxhighlight&gt;&lt;br /&gt;
<br />
Note that the example above assumes that you launched the job from the directory where YOURREADS.fasta is located, that YOURREADS.fasta contains protein sequences, and that nr is the required database.&lt;br /&gt;
<br />
You can also run DIAMOND directly against &lt;code&gt;/datashare/DIAMONDDB_2.0.9/nr&lt;/code&gt;, but it might be slower than having the database on the local disk.&lt;br /&gt;
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=403
Graham Reference Dataset Repository
2021-07-08T15:50:30Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.&lt;br /&gt;
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:&lt;br /&gt;
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).&lt;br /&gt;
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.&lt;br /&gt;
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for&lt;br /&gt;
method development, interim data sets, reference genomes, etc.&lt;br /&gt;
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.&lt;br /&gt;
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].&lt;br /&gt;
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].&lt;br /&gt;
</div><br />
<br />
All available pre-formatted databases are located in Graham's &lt;code&gt;/datashare/BLASTDB&lt;/code&gt; and will be updated every 3 months (Jan, Apr, Jul, Oct).&lt;br /&gt;
<br />
==== Directory structure ====<br />
&lt;code&gt;/datashare/BLASTDB&lt;/code&gt; contains all the pre-formatted databases without any subfolders. We include the following:&lt;br /&gt;
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to &lt;code&gt;$SLURM_TMPDIR&lt;/code&gt; at the beginning of your sbatch script. This will add between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:&lt;br /&gt;
<br />
<br />
&lt;syntaxhighlight lang="bash"&gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --time=02:00:00&lt;br /&gt;
#SBATCH --mem=32G&lt;br /&gt;
#SBATCH --cpus-per-task=8&lt;br /&gt;
#SBATCH --account=def-someuser&lt;br /&gt;
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST+ and its dependencies&lt;br /&gt;
# copy the required database volumes (in this case nr) to $SLURM_TMPDIR&lt;br /&gt;
(cd /datashare/BLASTDB && tar cf - nr.*) | (cd ${SLURM_TMPDIR} && tar xf -) &&&lt;br /&gt;
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta&lt;br /&gt;
&lt;/syntaxhighlight&gt;&lt;br /&gt;
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.&lt;br /&gt;
<br />
You can also run BLAST directly against &lt;code&gt;/datashare/BLASTDB/nr&lt;/code&gt;, but it might be slower than having the database on the local disk.&lt;br /&gt;
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.&lt;br /&gt;
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the &lt;code&gt;blast/db/FASTA/&lt;/code&gt; directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:&lt;br /&gt;
<br />
&lt;pre&gt;&lt;br /&gt;
134M Apr 10 15:36 swissprot.gz&lt;br /&gt;
 96G Apr 10 22:11 nr.gz&lt;br /&gt;
108G Apr 12 07:55 nt.gz&lt;br /&gt;
 32M Jun  4 15:30 pdbaa.gz&lt;br /&gt;
&lt;/pre&gt;&lt;br /&gt;
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:&lt;br /&gt;
<br />
&lt;syntaxhighlight lang="bash"&gt;&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
diamond makedb --in &lt;(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap &lt;(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp&lt;br /&gt;
&lt;/syntaxhighlight&gt;&lt;br /&gt;
<br />
As can be seen, four databases are distributed in the &lt;code&gt;/datashare/DIAMONDDB_2.0.9&lt;/code&gt; directory, corresponding to BLAST's &lt;code&gt;nt&lt;/code&gt;, &lt;code&gt;nr&lt;/code&gt;, &lt;code&gt;pdbaa&lt;/code&gt;, and &lt;code&gt;swissprot&lt;/code&gt;. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], their updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).&lt;br /&gt;
<br />
==== Consideration when using these databases ====<br />
The DIAMOND program uses a lot of memory and temporary disk space, especially when dealing with big databases (like the ones provided here) and large query sets (both in sequence length and number). Should the program fail because it runs out of either one, set a lower value for the [https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#memory--performance-options block size parameter -b].&lt;br /&gt;
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=402
Graham Reference Dataset Repository
2021-07-07T19:51:42Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.&lt;br /&gt;
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:&lt;br /&gt;
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it will be checked for updates twice a year (June and December).&lt;br /&gt;
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
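<br />
For example, the collections can be browsed directly on the mount (the collection name below follows the tree above; the deeper population and sample directory names vary):<br />
<br />
<syntaxhighlight lang="bash"><br />
# collections are organised as <collection>/data/<population>/<sample><br />
ls /datashare/1000genomes/data_collections/1000_genomes_project/data/<br />
</syntaxhighlight><br />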
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
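<br />
For example, a FASTA copy of one of the pre-formatted databases can be generated with <code>blastdbcmd</code> (a sketch; the module versions and output file name are illustrative):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and dependencies<br />
# show metadata (title, number of sequences, build date) for the swissprot database<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -info<br />
# dump all entries in FASTA format<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -entry all -out swissprot.fasta<br />
</syntaxhighlight><br />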
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every three months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases in a flat layout, without subfolders. The following databases are included:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load BLAST and dependencies<br />
cp /datashare/BLASTDB/nr.* ${SLURM_TMPDIR}/ && # copy the required database files (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> directly as the database path, but this might be slower than staging the database on the node's local disk.<br />
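<br />
For instance, a quick interactive run against the shared copy could look like this (a sketch; <code>myquery.fasta</code> and the output name are placeholders):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
# search the shared nr database over NFS, writing BLAST tabular output<br />
blastp -db /datashare/BLASTDB/nr -query myquery.fasta -outfmt 6 -out myquery_vs_nr.tsv<br />
</syntaxhighlight><br />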
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are built from NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found in <code>/datashare/BLAST_FASTA</code> and are updated every three months (Jan, Apr, Jul, Oct).<br />
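<br />
These files can also be used to build custom databases, for example with a different <code>makeblastdb</code> version than the one used for the pre-formatted set (a sketch; module versions and the output path are illustrative):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
# stream the compressed FASTA into makeblastdb without keeping an uncompressed copy<br />
gunzip -c /datashare/BLAST_FASTA/swissprot.gz | \<br />
    makeblastdb -in - -dbtype prot -title swissprot -out ${SLURM_TMPDIR}/swissprot<br />
</syntaxhighlight><br />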
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information. Since the source of these databases is [[#BLAST_FASTA|BLAST_FASTA]], updates follow the same trimonthly schedule (Jan, Apr, Jul, Oct).<br />
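<br />
A minimal search against one of these databases could look like the following (a sketch; the query file and output name are placeholders, and the module name is an assumption — check <code>module spider diamond</code> for available versions):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 diamond/2.0.9<br />
# protein query against the pre-built nr database, tabular output including taxonomy ids<br />
diamond blastp -d /datashare/DIAMONDDB_2.0.9/nr -q myquery.fasta -o hits.tsv \<br />
    --outfmt 6 qseqid sseqid pident evalue staxids -p ${SLURM_CPUS_PER_TASK}<br />
</syntaxhighlight><br />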
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=401
Graham Reference Dataset Repository
2021-07-07T19:51:07Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below a detailed description of each dataset and how to access them.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. In [www.sharcnet.ca SHARCNET] we are providing a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 genomes project (1KGP)] was an effort to catalogue human genetic variation and has become a reference and a comparison point to many studies. We provide their data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and will be checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in/datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc..<br />
<br />
An example of data stored under technical is /datashare/1000genomes/datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be find [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All Pre-formatted databases available are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted without any subfolder. We include the Following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the begining of your sbatch script. This will add between 5 to 30 minutes (depending on the database you are moving), so use it only when you know that your blast run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
tar cf - /datashare/BLASTDB/nr | (cd ${SLURM_TMPDIR}; tar xvf -) && # copy the required database (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
<br />
<br />
Note that the example above assumes that you have launched the job from the same directory where myquery.fasta is located, that myquery.fasta is a set of protein sequences, and that nr is required as database.<br />
<br />
You can also use <code>/datashare/BLASTDB/nr</code> (as per example), but it might be slower than having the databases in the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
Blast databases can also be found in all cluster through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data) unfortunately, these databases are based on the cloud ftp from NCBI which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> of their directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository] in compressed (by gzip) format:<br />
<br />
134M Apr 10 15:36 swissprot.gz<br />
96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
32M Jun 4 15:30 pdbaa.gz<br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. It works in a similar manner than blast but it has some optimizations done both at the database level and at the software level. In SHARCNET we provide pre-formatted databases for DIAMOND v.2.0.9 built using the following:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
Four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>; all of them contain taxonomic information. Since these databases are built from the [[BLAST_FASTA]] files, they follow the same quarterly update schedule (Jan, Apr, Jul, Oct).<br />
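<br />
A minimal job-script sketch using these databases with <code>diamond blastp</code> (the module invocation and the output columns are assumptions, not tested commands; check <code>module spider diamond</code> on Graham first):<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=01:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load diamond  # assumed module name; verify with `module spider diamond`<br />
# Protein query against the pre-formatted nr database, reporting taxonomy IDs<br />
diamond blastp --db /datashare/DIAMONDDB_2.0.9/nr \<br />
    --query myquery.fasta --threads ${SLURM_CPUS_PER_TASK} \<br />
    --outfmt 6 qseqid sseqid pident evalue staxids --out results.tsv<br />
</syntaxhighlight><br />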
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=400
Graham Reference Dataset Repository
2021-07-07T19:50:15Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets consume in their project accounts. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often relies on reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
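<br />
For example, two of the mirrored files shown in the directory tree below can be read or copied directly (adjust the paths to the files you actually need):<br />
<br />
<syntaxhighlight lang="bash"><br />
# Browse the top-level description of the FTP site layout<br />
less /datashare/1000genomes/README_ftp_site_structure.md<br />
# Copy the phase 3 population table into the current directory<br />
cp /datashare/1000genomes/phase3/20131219.populations.tsv .<br />
</syntaxhighlight><br />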
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
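<br />
For example, a single record can be dumped in FASTA format with <code>blastdbcmd</code> (the accession <code>P01308</code> is only an illustration; any identifier present in the database works):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
# Show a summary of a pre-formatted database<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -info<br />
# Extract one entry by accession into a FASTA file<br />
blastdbcmd -db /datashare/BLASTDB/swissprot -entry P01308 -out P01308.fasta<br />
</syntaxhighlight><br />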
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database you are copying), so do it only when you expect your BLAST run to take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0 # load blast and dependencies<br />
cp /datashare/BLASTDB/nr.* ${SLURM_TMPDIR}/ && # copy the required database volumes (in this case nr) to $SLURM_TMPDIR<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the database you need.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, skipping the copy step, but this may be slower than having the database on the node's local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these fasta files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
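<br />
If you need database options that the pre-formatted databases do not provide, you can build your own from these FASTA files. A sketch for swissprot (note that nr.gz and nt.gz decompress to several hundred gigabytes, so make sure <code>$SLURM_TMPDIR</code> can hold the result):<br />
<br />
<syntaxhighlight lang="bash"><br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0<br />
# Decompress a working copy into the job's node-local scratch space<br />
gunzip -c /datashare/BLAST_FASTA/swissprot.gz > ${SLURM_TMPDIR}/swissprot.fasta<br />
# Build a protein database locally (use -dbtype nucl for nucleotide FASTA such as nt)<br />
makeblastdb -in ${SLURM_TMPDIR}/swissprot.fasta -dbtype prot -out ${SLURM_TMPDIR}/my_swissprot<br />
</syntaxhighlight><br />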
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
Four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, representing BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>; all of them contain taxonomic information. Since these databases are built from the BLAST_FASTA files, they follow the same quarterly update schedule (Jan, Apr, Jul, Oct).<br />
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=399
Graham Reference Dataset Repository
2021-07-07T15:42:51Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to reduce the space that commonly used datasets consume in their project accounts. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often relies on reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for<br />
method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database being copied), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and dependencies<br />
# Copy the required database (in this case nr) to $SLURM_TMPDIR; tar is run<br />
# from inside BLASTDB so that the files land directly under $SLURM_TMPDIR:<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
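<br />
The tar pipe used for staging can be tried safely outside the cluster; in this sketch <code>SRC</code> and <code>TMP</code> are throwaway stand-ins for <code>/datashare/BLASTDB</code> and <code>$SLURM_TMPDIR</code>, not real paths:<br />

```shell
#!/bin/sh
# Stand-ins for /datashare/BLASTDB and $SLURM_TMPDIR (illustrative only).
SRC=$(mktemp -d)
TMP=$(mktemp -d)
# Fake database volumes sharing the "nr" basename prefix.
printf 'x' > "$SRC/nr.00.phr"
printf 'x' > "$SRC/nr.00.pin"
# Run tar from inside the source directory so the files land directly in TMP,
# i.e. the database can then be addressed as "$TMP/nr".
(cd "$SRC" && tar cf - nr.*) | tar xf - -C "$TMP"
ls "$TMP"
```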
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
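<br />
Because the files are gzip-compressed, they can be consumed as a stream without ever writing a decompressed copy to disk. A minimal sketch on a tiny made-up FASTA file (the records below are placeholders, not real sequences):<br />

```shell
#!/bin/sh
# Create and compress a two-record FASTA file, mimicking e.g. swissprot.gz.
TMPD=$(mktemp -d)
printf '>seq1\nMKV\n>seq2\nGHT\n' > "$TMPD/mini.fasta"
gzip "$TMPD/mini.fasta"
# Stream the compressed file and count FASTA headers on the fly.
NSEQ=$(gunzip -c "$TMPD/mini.fasta.gz" | grep -c '^>')
echo "$NSEQ"
```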
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, corresponding to BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information.<br />
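<br />
The <code><(gunzip -c ...)</code> construct in the makedb commands is bash process substitution: the program sees the decompressed stream as a file-like path, so no decompressed copy is ever written to disk. A small self-contained illustration (file names here are throwaway examples):<br />

```shell
#!/bin/bash
# Process substitution needs bash; it is not available in plain POSIX sh.
TMPD=$(mktemp -d)
printf 'alpha\nbeta\n' | gzip > "$TMPD/words.gz"
# wc reads from a pseudo-file that streams the decompressed data.
LINES=$(wc -l < <(gunzip -c "$TMPD/words.gz"))
echo "$LINES"
```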
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=398
Graham Reference Dataset Repository
2021-07-07T15:40:29Z
<p>Jshleap: /* DIAMONDDB_2.0.9 */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check it for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited in the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. We include the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. This adds between 5 and 30 minutes (depending on the database being copied), so do it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and dependencies<br />
# Copy the required database (in this case nr) to $SLURM_TMPDIR; tar is run<br />
# from inside BLASTDB so that the files land directly under $SLURM_TMPDIR:<br />
(cd /datashare/BLASTDB && tar cf - nr.*) | tar xf - -C ${SLURM_TMPDIR} &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but it might be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on the NCBI cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences from the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], compressed with gzip:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
=== DIAMONDDB_2.0.9 ===<br />
[https://github.com/bbuchfink/diamond/wiki DIAMOND] is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. It works in a similar manner to BLAST, but includes optimizations at both the database and the software level. At SHARCNET we provide pre-formatted databases for DIAMOND v2.0.9, built using the following commands:<br />
<br />
<syntaxhighlight lang="bash"><br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nr.gz) -d nr --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/nt.gz) -d nt --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/nucl_gb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/pdbaa.gz) -d pdbaa --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/pdb.accession2taxid.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
diamond makedb --in <(gunzip -c /datashare/BLAST_FASTA/swissprot.gz) -d swissprot --taxonmap <(gunzip -c /datashare/NCBI_taxonomy/prot.accession2taxid.FULL.gz) --taxonnodes /datashare/NCBI_taxonomy/nodes.dmp<br />
</syntaxhighlight><br />
<br />
As can be seen, four databases are distributed in the <code>/datashare/DIAMONDDB_2.0.9</code> directory, corresponding to BLAST's <code>nt</code>, <code>nr</code>, <code>pdbaa</code>, and <code>swissprot</code>. All of them contain taxonomic information.<br />
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=397
Graham Reference Dataset Repository
2021-07-07T14:19:28Z
<p>Jshleap: /* BLAST_FASTA */</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We mirror the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site] and check it for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, within the directory for that collection, README and index files provide information on the collection. Under each collection directory, there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. It includes the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies<br />
# Copy the required database (in this case nr) to $SLURM_TMPDIR; the -C flag<br />
# makes tar store relative paths, so the files land directly under $SLURM_TMPDIR<br />
tar cf - -C /datashare/BLASTDB nr | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point BLAST directly at <code>/datashare/BLASTDB/nr</code> (as in the example above), but this may be slower than having the database on the local disk.<br />
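The tar-pipe staging pattern used in the script above can be tried anywhere with a self-contained mock; the paths below (a temp dir standing in for <code>/datashare/BLASTDB</code> and another for <code>$SLURM_TMPDIR</code>) are stand-ins, not real cluster paths:<br />

```shell
# Minimal sketch of staging a database directory into scratch via a tar pipe.
set -e
src=$(mktemp -d)          # stands in for /datashare/BLASTDB
scratch=$(mktemp -d)      # stands in for $SLURM_TMPDIR
mkdir -p "$src/nr"
echo ">seq1" > "$src/nr/nr.00.fake"   # fake database volume
# -C makes tar record relative member names, so extraction under $scratch
# yields $scratch/nr/... rather than a replicated absolute path:
tar cf - -C "$src" nr | (cd "$scratch" && tar xf -)
ls "$scratch/nr"
```

A plain <code>cp -r</code> would also work; the tar pipe is traditional on shared filesystems because it streams many small files in one sequential pass.<br />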
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, these databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA ===<br />
This directory contains the raw sequences located in the <code>blast/db/FASTA/</code> directory of the [https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ NCBI FTP repository], in gzip-compressed format:<br />
<br />
<pre><br />
134M Apr 10 15:36 swissprot.gz<br />
 96G Apr 10 22:11 nr.gz<br />
108G Apr 12 07:55 nt.gz<br />
 32M Jun  4 15:30 pdbaa.gz<br />
</pre><br />
<br />
Similar to the pre-formatted databases (located in <code>/datashare/BLASTDB</code>), these FASTA files can be found at <code>/datashare/BLAST_FASTA</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
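Since these files are gzip-compressed, a typical workflow decompresses them into scratch space before use. The sketch below is self-contained: the temp dirs and the tiny mock <code>swissprot</code> file stand in for <code>/datashare/BLAST_FASTA</code> and <code>$SLURM_TMPDIR</code>:<br />

```shell
# Minimal sketch: decompress a .gz FASTA into a scratch directory before use.
set -e
share=$(mktemp -d)        # stands in for /datashare/BLAST_FASTA
scratch=$(mktemp -d)      # stands in for $SLURM_TMPDIR
printf '>P1\nMKV\n' > "$share/swissprot"      # mock FASTA content
gzip "$share/swissprot"                       # now $share/swissprot.gz
# gunzip -c streams to stdout, leaving the shared .gz file untouched:
gunzip -c "$share/swissprot.gz" > "$scratch/swissprot.fasta"
head -n 1 "$scratch/swissprot.fasta"
```

Using <code>gunzip -c</code> (or <code>zcat</code>) avoids writing anything back to the read-only share.<br />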
<br />
=== DIAMONDDB_2.0.9 === <br />
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=396
Graham Reference Dataset Repository
2021-07-07T13:59:35Z
<p>Jshleap: </p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per '''their''' README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
Within the directory for each collection, README and index files provide information on that collection. Under each collection directory there is a data directory, under which files are organised by population and then sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the toplevel of this site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
Examples of release subdirectories are:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST ftp site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on the newly enabled features of the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every 3 months (Jan, Apr, Jul, Oct).<br />
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases without any subfolders. It includes the following:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes (depending on the database you are moving), so use it only when you know that your BLAST run will take longer than one hour. For example, your sbatch script can look something like this:<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies<br />
# Copy the required database (in this case nr) to $SLURM_TMPDIR; the -C flag<br />
# makes tar store relative paths, so the files land directly under $SLURM_TMPDIR<br />
tar cf - -C /datashare/BLASTDB nr | (cd ${SLURM_TMPDIR} && tar xf -) &&<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
<br />
Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also point BLAST directly at <code>/datashare/BLASTDB/nr</code> (as in the example above), but this may be slower than having the database on the local disk.<br />
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data); unfortunately, these databases are based on NCBI's cloud FTP, which is out of date.<br />
<br />
=== BLAST_FASTA === <br />
<br />
=== DIAMONDDB_2.0.9 === <br />
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap
https://helpwiki.sharcnet.ca/wiki/index.php?title=Graham_Reference_Dataset_Repository&diff=395
Graham Reference Dataset Repository
2021-07-07T13:58:35Z
<p>Jshleap: Created page with "Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used dat..."</p>
<hr />
<div>Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount to provide our users with some commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI | AI]]. This data mount is provided in an effort to better serve our users and to lower the usage on their project accounts with commonly used datasets. These datasets are mounted on <code>/datashare/</code>. You can explore the top directories by listing the mount:<br />
<br />
<syntaxhighlight lang="bash"><br />
[jshleap@gra-login1 ~]$ ls -lL /datashare/<br />
total 152<br />
drwxrwxr-x 9 jshleap sn_staff 4096 Jul 6 11:14 1000genomes<br />
drwxrwxr-x 2 jshleap sn_staff 94208 Jun 4 15:30 BLASTDB<br />
drwxrwxr-x 2 jshleap sn_staff 107 Jun 4 15:30 BLAST_FASTA<br />
drwxrwxr-x 5 jshleap sn_staff 229 Jun 4 18:49 CIFAR-10<br />
drwxrwxr-x 5 jshleap sn_staff 221 Jun 4 18:49 CIFAR-100<br />
drwxrwxr-x 6 jshleap sn_staff 115 Apr 27 10:00 COCO<br />
drwxrwxr-x 2 jshleap sn_staff 135 Jun 10 18:23 DIAMONDDB_2.0.9<br />
drwxrwxr-x 6 jshleap sn_staff 321 Feb 4 17:39 EggNog<br />
drwxrwxr-x 2 jshleap sn_staff 6 Mar 16 16:42 github_mirror<br />
drwxrwxr-x 3 jshleap sn_staff 46 Mar 23 14:23 hg38<br />
drwxrws--- 9 jshleap imagenet-optin 244 Jun 16 09:22 ImageNet<br />
drwxrwxr-x 8 jshleap sn_staff 4096 Jun 7 16:58 kraken2_dbs<br />
drwxrwxr-x 2 jshleap sn_staff 191 Jun 4 18:49 MNIST<br />
drwxrwxr-x 2 jshleap sn_staff 50 Jun 4 18:51 MPI_SINTEL<br />
drwxrwxr-x 2 jshleap sn_staff 4096 Jun 9 17:09 NCBI_taxonomy<br />
drwxrwxr-x 6 jshleap sn_staff 145 Feb 4 22:44 PANTHER<br />
drwxrwxr-x 5 jshleap sn_staff 4096 Apr 19 17:24 PFAM<br />
drwxrwxr-x 7 jshleap sn_staff 4096 Mar 29 09:52 SILVA<br />
drwxrwxr-x 6 jshleap sn_staff 257 Feb 4 22:46 SVHN<br />
drwxrwxr-x 4 jshleap sn_staff 189 Apr 19 17:59 UNIPROT<br />
drwxrwx--- 5 jshleap voxceleb-optin 98 Apr 23 15:15 VoxCeleb<br />
</syntaxhighlight><br />
<br />
<br />
Below is a detailed description of each dataset and how to access it.<br />
<br />
== Bioinformatics ==<br />
Bioinformatics software often uses reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide a set of these datasets for bioinformatics:<br />
<br />
=== 1000 Genomes ===<br />
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide the data from their [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], and it is checked for updates twice a year (June and December).<br />
<br />
==== Directory structure ====<br />
<br />
<div class="toccolours mw-collapsible mw-collapsed"><br />
1000 Genomes directory tree (up to level 2):<br />
<div class="mw-collapsible-content"><br />
<pre><br />
├── CHANGELOG<br />
├── data_collections<br />
│ ├── 1000G_2504_high_coverage<br />
│ ├── 1000G_2504_high_coverage_SV<br />
│ ├── 1000_genomes_project<br />
│ ├── gambian_genome_variation_project<br />
│ ├── gambian_genome_variation_project_GRCh37<br />
│ ├── geuvadis<br />
│ ├── han_chinese_high_coverage<br />
│ ├── HGDP<br />
│ ├── HGSVC2<br />
│ ├── hgsv_sv_discovery<br />
│ ├── HLA_types<br />
│ ├── illumina_platinum_pedigree<br />
│ ├── index.html<br />
│ ├── README_data_collections.md<br />
│ └── simons_diversity_data<br />
├── historical_data<br />
│ ├── former_toplevel<br />
│ ├── index.html<br />
│ └── README_historical_data.md<br />
├── index.html<br />
├── phase1<br />
│ ├── analysis_results<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── phase1.alignment.index<br />
│ ├── phase1.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index<br />
│ ├── phase1.exome.alignment.index.bas.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.gz<br />
│ ├── phase1.exome.alignment.index.HsMetrics.stats<br />
│ ├── phase1.exome.alignment.index_stats.csv<br />
│ ├── README.phase1_alignment_data<br />
│ └── technical<br />
├── phase3<br />
│ ├── 20130502.phase3.analysis.sequence.index<br />
│ ├── 20130502.phase3.exome.alignment.index<br />
│ ├── 20130502.phase3.low_coverage.alignment.index<br />
│ ├── 20130502.phase3.sequence.index<br />
│ ├── 20130725.phase3.cg_sra.index<br />
│ ├── 20130820.phase3.cg_data_index<br />
│ ├── 20131219.populations.tsv<br />
│ ├── 20131219.superpopulations.tsv<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── integrated_sv_map<br />
│ ├── README_20150504_phase3_data<br />
│ └── README_20160404_where_are_the_phase3_variants<br />
├── pilot_data<br />
│ ├── data<br />
│ ├── index.html<br />
│ ├── paper_data_sets<br />
│ ├── pilot_data.alignment.index<br />
│ ├── pilot_data.alignment.index.bas.gz<br />
│ ├── pilot_data.sequence.index<br />
│ ├── README.alignment.index<br />
│ ├── README.bas<br />
│ ├── README.sequence.index<br />
│ ├── release<br />
│ ├── SRP000031.sequence.index<br />
│ ├── SRP000032.sequence.index<br />
│ ├── SRP000033.sequence.index<br />
│ └── technical<br />
├── PRIVACY-NOTICE.txt<br />
├── README_ebi_aspera_info.md<br />
├── README_file_formats_and_descriptions.md<br />
├── README_ftp_site_structure.md<br />
├── README_missing_files.md<br />
├── README_populations.md<br />
├── README_using_1000genomes_cram.md<br />
├── release<br />
│ ├── 2008_12<br />
│ ├── 2009_02<br />
│ ├── 2009_04<br />
│ ├── 2009_05<br />
│ ├── 2009_08<br />
│ ├── 20100804<br />
│ ├── 2010_11<br />
│ ├── 20101123<br />
│ ├── 20110521<br />
│ ├── 20130502<br />
│ └── index.html<br />
└── technical<br />
├── browser<br />
├── index.html<br />
├── method_development<br />
├── ncbi_varpipe_data<br />
├── other_exome_alignments<br />
├── other_exome_alignments.alignment_indices<br />
├── phase3_EX_or_LC_only_alignment<br />
├── pilot2_high_cov_GRCh37_bams<br />
├── pilot3_exon_targetted_GRCh37_bams<br />
├── qc<br />
├── README.reference<br />
├── reference<br />
├── retired_reference<br />
├── simulations<br />
├── supporting<br />
└── working<br />
</pre><br />
</div><br />
</div><br />
<br />
As per the 1000 Genomes Project's own README, the directory structure is:<br />
<br />
<span style="font-size:110%">'''changelog_details'''</span><br><br />
<br />
This directory contains a series of files detailing the changes made to the FTP site over time.<br />
<br />
<span style="font-size:110%">'''data_collections'''</span><br><br />
<br />
The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.<br />
<br />
For each collection of data, the collection's directory contains README and index files with information on that collection. Under each collection directory there is a data directory, under which files are organised by population and then by sample. Further information can be found in /datashare/1000genomes/data_collections/README_data_collections.md.<br />
<br />
<span style="font-size:110%">'''historical_data'''</span><br><br />
<br />
This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of the site, including dedicated index directories. Further information is available in /datashare/1000genomes/historical_data/README_historical_data.md.<br />
<br />
<span style="font-size:110%">'''phase1'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''phase3'''</span><br><br />
<br />
This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''pilot_data'''</span><br><br />
<br />
This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.<br />
<br />
<span style="font-size:110%">'''release'''</span><br><br />
<br />
The release directory contains dated directories which contain analysis results sets plus README files explaining how those data sets were produced.<br />
<br />
Originally, the date in release subdirectory names was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.<br />
<br />
An example of a release subdirectory is:<br />
- /datashare/1000genomes/release/2008_12/<br />
<br />
In cases where release directories are named based on the date of the YYYYMMDD.sequence.index, the SNP calls, indel calls, etc. in these directories are based on alignments produced from data listed in the YYYYMMDD.sequence.index file.<br />
<br />
For example, the directory<br />
/datashare/1000genomes/release/20100804/<br />
contains the release versions of SNP and indel calls based on the<br />
/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index<br />
file.<br />
<br />
<span style="font-size:110%">'''technical'''</span><br><br />
<br />
The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.<br />
<br />
An example of data stored under technical is /datashare/1000genomes/technical/simulations/.<br />
<br />
<div class="warning"><br />
'''WARNING: /datashare/1000genomes/technical/working/'''<br />
The working directory under technical contains data that has experimental (non-public release) status<br />
and is suitable for internal project use only. Please use with '''caution'''.<br />
</div><br />
<br />
=== BLASTDB ===<br />
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at the NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].<br />
<br />
The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]<br />
* Species-level taxonomy ids are included for each database entry<br />
* Sequences in FASTA format can be generated from the pre-formatted databases by using the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]<br />
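For instance, regenerating a FASTA file from one of the pre-formatted databases could look like the sketch below. The database choice (<code>swissprot</code>) and the output file name are illustrative, and <code>blastdbcmd</code> must be on your PATH (e.g. after <code>module load blast+</code>):

<syntaxhighlight lang="bash">
# Sketch: dump every sequence of the pre-formatted swissprot database
# back to FASTA with blastdbcmd (database choice is illustrative).
DB=/datashare/BLASTDB/swissprot
CMD="blastdbcmd -db $DB -entry all -out swissprot.fasta"
# Only run when BLAST+ is actually available:
if command -v blastdbcmd >/dev/null 2>&1; then
    $CMD
else
    echo "blastdbcmd not found; after loading BLAST+, run: $CMD"
fi
</syntaxhighlight>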
<br />
<div class="warning"><br />
'''IMPORTANT:''' The BLAST databases found in this folder are version 5 (v5). Information on newly enabled features with the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].<br />
</div><br />
<br />
All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and are updated every three months (January, April, July, and October).<br />
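Because all databases sit in one directory, you can let BLAST resolve bare database names by setting the standard <code>BLASTDB</code> environment variable. A minimal sketch (the <code>blastn</code> call in the comment is illustrative):

<syntaxhighlight lang="bash">
# Point BLAST+ at the shared directory so -db accepts bare database names.
export BLASTDB=/datashare/BLASTDB
# With $BLASTDB set, a search no longer needs full paths, e.g.:
#   blastn -db nt -query seqs.fasta
echo "BLASTDB is set to: $BLASTDB"
</syntaxhighlight>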
<br />
==== Directory structure ====<br />
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases directly, without any subfolders. The following databases are included:<br />
<br />
{| class="wikitable"<br />
|-<br />
!|Name<br />
!|Type<br />
!|Title<br />
|-<br />
|16S_ribosomal_RNA<br />
|DNA<br />
|16S ribosomal RNA (Bacteria and Archaea type strains)<br />
|-<br />
|18S_fungal_sequences<br />
|DNA<br />
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material<br />
|-<br />
|28S_fungal_sequences<br />
|DNA<br />
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material<br />
|-<br />
|Betacoronavirus<br />
|DNA<br />
|Betacoronavirus<br />
|-<br />
|GCF_000001405.38_top_level<br />
|DNA<br />
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|GCF_000001635.26_top_level<br />
|DNA<br />
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds<br />
|-<br />
|ITS_RefSeq_Fungi<br />
|DNA<br />
|Internal transcribed spacer region (ITS) from Fungi type and reference material<br />
|-<br />
|ITS_eukaryote_sequences<br />
|DNA<br />
|ITS eukaryote BLAST<br />
|-<br />
|env_nt<br />
|DNA<br />
|environmental samples<br />
|-<br />
|nt<br />
|DNA<br />
|Nucleotide collection (nt)<br />
|-<br />
|patnt<br />
|DNA<br />
|Nucleotide sequences derived from the Patent division of GenBank<br />
|-<br />
|pdbnt<br />
|DNA<br />
|PDB nucleotide database<br />
|-<br />
|ref_euk_rep_genomes<br />
|DNA<br />
|RefSeq Eukaryotic Representative Genome Database<br />
|-<br />
|ref_prok_rep_genomes<br />
|DNA<br />
|Refseq prokaryote representative genomes (contains refseq assembly)<br />
|-<br />
|ref_viroids_rep_genomes<br />
|DNA<br />
|Refseq viroids representative genomes<br />
|-<br />
|ref_viruses_rep_genomes<br />
|DNA<br />
|Refseq viruses representative genomes<br />
|-<br />
|refseq_rna<br />
|DNA<br />
|NCBI Transcript Reference Sequences<br />
|-<br />
|refseq_select_rna<br />
|DNA<br />
|RefSeq Select RNA sequences<br />
|-<br />
|env_nr<br />
|Protein<br />
|Proteins from WGS metagenomic projects (env_nr)<br />
|-<br />
|landmark<br />
|Protein<br />
|Landmark database for SmartBLAST<br />
|-<br />
|nr<br />
|Protein<br />
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects<br />
|-<br />
|pdbaa<br />
|Protein<br />
|PDB protein database<br />
|-<br />
|pataa<br />
|Protein<br />
|Protein sequences derived from the Patent division of GenBank<br />
|-<br />
|refseq_protein<br />
|Protein<br />
|NCBI Protein Reference Sequences<br />
|-<br />
|refseq_select_prot<br />
|Protein<br />
|RefSeq Select proteins<br />
|-<br />
|swissprot<br />
|Protein<br />
|Non-redundant UniProtKB/SwissProt sequences<br />
|-<br />
|split-cdd<br />
|Protein<br />
|CDD split into 32 volumes<br />
|-<br />
|tsa_nr<br />
|Protein<br />
|Transcriptome Shotgun Assembly (TSA) sequences<br />
|}<br />
<br />
==== Usage ====<br />
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes (depending on the database), so only do this when you expect your BLAST run to take longer than an hour. For example, your sbatch script could look like this:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
#SBATCH --time=02:00:00<br />
#SBATCH --mem=32G<br />
#SBATCH --cpus-per-task=8<br />
#SBATCH --account=def-someuser<br />
module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST+ and its dependencies<br />
cp /datashare/BLASTDB/nr.* ${SLURM_TMPDIR}/  # copy all volumes of the required database (here nr) to local disk<br />
blastp -db ${SLURM_TMPDIR}/nr -num_threads ${SLURM_CPUS_PER_TASK} -query myquery.fasta<br />
</syntaxhighlight><br />
<br />
Note that the example above assumes that you have launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.<br />
<br />
You can also run BLAST directly against <code>/datashare/BLASTDB/nr</code> (as in the example), but this is usually slower than having the database on the node's local disk.<br />
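To decide whether staging is worthwhile, you can check how large a database's volumes are before submitting. A sketch (the database name <code>pdbaa</code> is an arbitrary example, and the path only exists on Graham):

<syntaxhighlight lang="bash">
# Pre-formatted databases are sets of files named <db>.* in /datashare/BLASTDB.
DBDIR=/datashare/BLASTDB
DBNAME=pdbaa   # illustrative choice; substitute the database you need
# Total on-disk size of all volumes (prints nothing off-cluster):
du -ch "${DBDIR}/${DBNAME}".* 2>/dev/null | tail -n 1
</syntaxhighlight>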
<br />
==== Other Compute Canada Sources ====<br />
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, those databases are built from the NCBI cloud FTP mirror, which is out of date.<br />
<br />
=== BLAST_FASTA === <br />
<br />
=== DIAMONDDB_2.0.9 === <br />
<br />
=== EggNog ===<br />
<br />
=== hg38 === <br />
<br />
=== kraken2_dbs === <br />
<br />
=== NCBI_taxonomy === <br />
<br />
=== PANTHER === <br />
<br />
=== PFAM ===<br />
<br />
=== SILVA ===<br />
<br />
=== SVHN ===<br />
<br />
=== UNIPROT === <br />
<br />
<br />
<br />
== AI ==<br />
<br />
=== CIFAR-10 === <br />
<br />
=== CIFAR-100 ===<br />
<br />
=== COCO ===<br />
<br />
=== ImageNet ===<br />
<br />
=== MNIST ===<br />
<br />
=== MPI_SINTEL ===<br />
<br />
=== VoxCeleb ===</div>
Jshleap