Sequence database setup: UniProt proteomes

Overview

A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.

UniProtKB is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource.

First, find the Proteome ID for your proteome of interest by searching https://www.uniprot.org/proteomes/. This example uses rice (taxonomy Oryza sativa subsp. japonica), which has Proteome ID UP000059680.

In Database Manager, create a new custom definition, as follows:

  1. Fasta or New database; Create New
  2. Use pre-defined template; UniProt_proteome_template
  3. Next, Create
  4. Download from remote URL; Next
  5. Set up download URL
  6. Paste the following into the FASTA file URL field, replacing UP000059680 with the ID of your proteome of interest:
    https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000059680&format=fasta&compressed=false&includeIsoform=true
  7. Save; Start downloading
  8. Activate
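
The download URL in step 6 differs between proteomes only in the Proteome ID, so it can be generated rather than edited by hand. The following is a minimal Python sketch, not part of Mascot; the function name proteome_fasta_url and its include_isoforms argument are hypothetical, and the query parameters are copied from the URL shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in step 6.
STREAM_ENDPOINT = "https://rest.uniprot.org/uniprotkb/stream"


def proteome_fasta_url(proteome_id: str, include_isoforms: bool = True) -> str:
    """Build a FASTA download URL for one proteome (hypothetical helper)."""
    params = {
        "query": f"proteome:{proteome_id}",
        "format": "fasta",
        "compressed": "false",
        "includeIsoform": "true" if include_isoforms else "false",
    }
    # safe=":" leaves the colon in "proteome:UP..." unescaped,
    # so the result matches the URL shown in step 6.
    return f"{STREAM_ENDPOINT}?{urlencode(params, safe=':')}"


rice_url = proteome_fasta_url("UP000059680")
print(rice_url)
```

The printed string can be pasted into the FASTA file URL field.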

(Note that HTTPS support in Database Manager requires Mascot Server 2.6.2 or later.) The complete configuration for the rice proteome in Database Manager would look similar to this:

[Screenshot: Mascot Database Manager]

Once the database is configured, you can enable automatic updating by clicking on the database name, then choosing Edit schedule.

Manual download

  • Locate the proteome for your organism of interest by searching by name or by taxonomy ID at
    https://www.uniprot.org/proteomes/
  • Click on the Proteome ID link
  • Click on the Download button and choose All protein entries, Fasta (Canonical and isoform), compressed

Taxonomy

Taxonomy is not required for a single organism database.

Parse Rules

When a single entry is expanded into entries for multiple isoforms, they all share the same ID, so the accession (AC) must be used as the unique identifier. For example:

>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4

AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
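
To see what these two rules extract, they can be tried against the example title outside Mascot. The sketch below translates them into Python regular expression syntax, which is an assumption about equivalence: in the Mascot rules \( and \) delimit the captured group and | is a literal character, whereas Python uses ( and ) for groups and needs \| for a literal pipe. The helper apply_rule is hypothetical.

```python
import re

# Python translations of the two Mascot parse rules above.
AC_RULE = re.compile(r">..\|([^|]*)")      # AC from Fasta title
DESC_RULE = re.compile(r">[^ ]* (.*)")     # Description from Fasta title

title = (
    ">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 "
    "OS=Oryza sativa subsp. japonica GN=4CL4"
)


def apply_rule(rule, fasta_title):
    """Return the first captured group, or None if the rule does not match."""
    match = rule.match(fasta_title)
    return match.group(1) if match else None


accession = apply_rule(AC_RULE, title)
description = apply_rule(DESC_RULE, title)
print(accession)
print(description)
```

The accession rule captures Q67W82-2, which includes the isoform suffix and so remains unique, whereas the ID 4CL4_ORYSJ would be shared by every isoform of the entry.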

Configuration (Mascot 2.3 and earlier)

A Fasta file containing canonical and isoform sequences for the rice proteome was downloaded to /usr/local/mascot/sequence/rice_proteome/current and renamed to rice_proteome_20120414.fasta.
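
The renamed file in this example combines the database name with a date stamp in YYYYMMDD form. A small Python sketch of generating such a name is given below; it assumes the stamp is simply the download date, and the helper stamped_fasta_name is hypothetical rather than part of Mascot.

```python
from datetime import date


def stamped_fasta_name(db_name: str, day: date) -> str:
    """Return '<db_name>_<YYYYMMDD>.fasta' (hypothetical helper)."""
    return f"{db_name}_{day:%Y%m%d}.fasta"


# Reproduces the file name used in the example above.
example_name = stamped_fasta_name("rice_proteome", date(2012, 4, 14))
print(example_name)
```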

[Screenshot: Mascot database maintenance utility]

Full text for individual entries can be retrieved across the web from UniProt. Note that port 80, as shown in the screenshot, no longer works; use port 443 instead.

Host: www.uniprot.org
Port: 443
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
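
The host, port and path settings combine into one request URL per entry. The Python sketch below illustrates this under the assumption that #ACCESSION# is replaced literally by the entry's accession and that port 443 implies HTTPS; the function full_text_url is hypothetical and only builds the string, it does not contact UniProt.

```python
def full_text_url(accession: str,
                  host: str = "www.uniprot.org",
                  port: int = 443,
                  path: str = "/uniprot/#ACCESSION#.txt") -> str:
    """Assemble the full-text URL from the settings above (hypothetical helper)."""
    scheme = "https" if port == 443 else "http"
    # Omit the port when it is the default for the scheme.
    default_port = 443 if scheme == "https" else 80
    netloc = host if port == default_port else f"{host}:{port}"
    return f"{scheme}://{netloc}{path.replace('#ACCESSION#', accession)}"


entry_url = full_text_url("Q67W82-2")
print(entry_url)
```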

Always test a new definition before applying the changes to mascot.dat.