Peptide Mass Fingerprint search
If you have no time to read this short tutorial, these are the most important do’s and don’ts:
- You cannot search raw data; it must be converted into a peak list.
- Search parameters are critical and should be determined by running a standard, such as a BSA digest.
- If you are not sure which database to search, start with Swiss-Prot.
- If you use a taxonomy filter, or search a single organism database, include a contaminants database in the search.
- Never specify more than two variable modifications.
- Always choose a specific enzyme (usually trypsin).
- A protein hit is only significant (reliable) if it has an expect value below 0.05 (a 5% chance of being false).
Tutorial
The first requirement for a Peptide Mass Fingerprint (PMF) search is a peak list; you cannot upload a raw data file. Raw data is converted into a peak list by a process called peak picking or peak detection. Often, the instrument data system takes care of this, and you can submit a Mascot search directly from the data system or save a peak list to a disk file for submission using the web browser search form. If not, or if you have a raw data file and no access to the data system, you’ll need to find a utility to convert it into a peak list. Peak lists are text files and come in various formats. You can also copy and paste a list of values into the query area of the search form, or even type them in. Each m/z value goes on a separate line. If you also have an intensity value for the peak, this follows the m/z value, separated by a space or a tab.
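The line layout described above (one m/z value per line, optionally followed by an intensity after a space or tab) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_peak_list and the example values are made up for illustration, and real peak list formats may carry extra header lines that this does not handle.

```python
def parse_peak_list(text):
    """Parse pasted peak-list text into (m/z, intensity) tuples.

    Intensity is None when a line holds only an m/z value.
    """
    peaks = []
    for line in text.splitlines():
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue  # skip blank lines
        mz = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mz, intensity))
    return peaks


# Hypothetical pasted values: space-separated, tab-separated, and m/z only.
example = "1045.564 1200\n1479.795\t850\n1567.743\n"
peaks = parse_peak_list(example)
```

A check like this before pasting values into the query area can catch stray text or malformed lines early.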
Mass values for very short peptides contribute little to the score. It is the long peptides, which are unlikely to occur in multiple proteins, that provide the greatest specificity, so aim to get as many peptide masses as possible in the range 1000 to 3500 Da. High mass accuracy is good, but sequence coverage is equally important. You will get a better score from 20 mass values at modest accuracy than from 5 mass values at very high accuracy.
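To see how many of your mass values fall in the 1000 to 3500 Da range, a quick count is enough. The sketch below uses made-up mass values and a hypothetical helper name; it is a sanity check on a peak list, not something the search itself requires.

```python
def in_informative_range(masses, low=1000.0, high=3500.0):
    """Return the masses that lie within [low, high] Da."""
    return [m for m in masses if low <= m <= high]


# Hypothetical peptide masses in Da.
masses = [650.3, 927.49, 1045.56, 1479.80, 2045.03, 3620.9]
kept = in_informative_range(masses)
```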
A peak list, by itself, is not sufficient. There are also a number of search parameters that must be set appropriately. It helps to open the search form in a new browser tab while you read. The label for each control on the search form is also a link to a help topic. Note that you can set your own defaults for the web browser search form by following the link at the bottom of the Access Mascot Server page.
The form looks much the same whether you have your own Mascot server, in-house, or whether you are connected to the free, public Mascot Server. If you are using the free, public Mascot Server, there are some restrictions, one of which is that you have to provide a name and email address so that we can email a link to your search results if the connection is broken. Whether you enter a search title is your choice. It is displayed at the top of the result report, and can be a useful way of identifying the search at a later date.
If at all possible, run a standard sample and use this to set all the search parameters. By standard sample, we mean something like a BSA digest, which will give a strong match and where you know what the answer is supposed to be. Trying to set search parameters on an unknown is much more difficult, and can lead to false positives.
The first choice you have to make is which database to search. The free public web site has just a few of the more popular public databases, but an in-house server may have a hundred or more. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.
If your target organism is well characterised, such as human, mouse, yeast, or Arabidopsis, Swiss-Prot is the recommended choice. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small, which makes it easier to get a statistically significant match. If you think you know what is in the sample, you can restrict the search to an organism or family by means of the taxonomy filter, but remember that you can never rule out contaminants. When searching entries for a single organism, always include a database of common contaminants. Otherwise, you might fail to get a match, or you could end up reporting that your sample is human serum albumin when it is really BSA. In the web browser form, to select two databases, first click on your target database, then hold down the control key and click on a contaminants database. If the search includes a taxonomy filter, that’s not a problem because taxonomy is not configured for the contaminants databases, so all the entries will always be searched.
If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBIprot and UniRef100. These are very large databases, and you will always want to select a limited taxonomy. However, never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 26,139 entries for rodentia, of which all but 1,602 are for mouse and rat. So, even if your target organism is hamster, it isn’t a good idea to choose ‘other rodentia’. It is better to search rodentia and hope to get a match to a homologous protein from mouse or rat.
You must always choose an enzyme for a PMF. The number of allowed missed cleavages should be set empirically, by running a standard and trying different values to see which gives the best score.
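To see what the missed cleavages setting means in practice, here is a minimal in-silico digest. It assumes the commonly quoted simplified trypsin rule (cut after K or R, but not before P); the enzyme definitions on a real Mascot server may differ, and both the function name and the toy sequence are invented for illustration.

```python
import re


def tryptic_peptides(sequence, missed_cleavages=0):
    """List peptides from a simplified tryptic digest.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    """
    # Zero-width split points; requires Python 3.7 or later.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for start in range(len(fragments)):
        # Join up to `missed_cleavages` extra neighbouring fragments.
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(fragments[start:end + 1]))
    return peptides


toy = "AAKGGRPLLKCC"  # made-up sequence, not a real protein
no_missed = tryptic_peptides(toy, missed_cleavages=0)
one_missed = tryptic_peptides(toy, missed_cleavages=1)
```

With this toy sequence, allowing one missed cleavage raises the number of candidate peptides from 3 to 5. The growth in candidates is why the value is best set empirically rather than generously.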
Modifications in database searching are handled in two ways. First, there are the fixed or quantitative modifications. The most common example is the alkylation of cysteine. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity. The most widely used alkylation agents are iodoacetamide (select modification carbamidomethyl), iodoacetic acid (carboxymethyl), and MMTS (methylthio).
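The idea that a fixed modification is just a change in the mass of cysteine can be shown with simple arithmetic. The sketch below uses commonly quoted monoisotopic values, rounded to five decimal places; treat them as assumptions and check them against the modification definitions on your own server before relying on them.

```python
# Monoisotopic residue mass of cysteine in Da (commonly quoted value).
CYS_RESIDUE = 103.00919

# Monoisotopic mass deltas in Da for the three alkylation products
# (commonly quoted values; verify against your modification definitions).
FIXED_MOD_DELTAS = {
    "Carbamidomethyl": 57.02146,  # iodoacetamide
    "Carboxymethyl": 58.00548,    # iodoacetic acid
    "Methylthio": 45.98772,       # MMTS
}


def modified_cys_mass(modification):
    """Effective cysteine residue mass when every cysteine is modified."""
    return CYS_RESIDUE + FIXED_MOD_DELTAS[modification]


cam_cys = modified_cys_mass("Carbamidomethyl")
```

Because every cysteine takes the same new mass, no extra candidate peptides are generated, which is why a fixed modification carries no penalty.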
In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to permute out all the possible arrangements of modified and unmodified residues that fit to the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
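A toy count makes the growth concrete. The sketch below assumes each candidate site is independently either modified or unmodified, and that no two modifications compete for the same residue; it illustrates the arithmetic only and is not a description of how Mascot enumerates candidates. The peptide and the function name are invented.

```python
def count_arrangements(peptide, variable_mods):
    """Count modified/unmodified arrangements for one peptide.

    variable_mods maps a modification name to the residues it can occupy.
    Assumes the modifications target disjoint sets of residues.
    """
    total = 1
    for residues in variable_mods.values():
        sites = sum(peptide.count(r) for r in residues)
        total *= 2 ** sites  # each site is either modified or not
    return total


peptide = "MSTSMTSK"  # made-up peptide: 2 x M, 3 x S, 2 x T
one_mod = count_arrangements(peptide, {"Oxidation": "M"})
two_mods = count_arrangements(peptide, {"Oxidation": "M", "Phospho": "ST"})
```

For this peptide, oxidation alone gives 4 arrangements, while adding phosphorylation of serine and threonine gives 128.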
It is not possible to identify post-translational modifications by PMF; this requires MS/MS, so the best advice is to use a minimum of variable modifications, or none at all. In most cases, the only variable modification you need to consider is oxidation of methionine. Try searching the data from your standard with and without this modification to see which gives the highest score.
Protein mass is applied as a sliding window. That is, for each database entry, Mascot looks for the highest scoring set of peptide mass matches within a contiguous stretch of sequence less than or equal to the specified protein mass. Usually, this adds little to the score, and the general advice is to leave this field blank.
Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot Protein View report includes graphs of mass errors. Just run a standard and look at the error graphs for the correct match. Ignore outliers, which are chance mass matches, add on a safety margin and this is your error estimate. You can also use these graphs to decide whether Da or ppm is the best choice for the tolerance unit.
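When comparing Da and ppm, it can help to convert between the two for the masses you actually observe. The following is a minimal sketch with hypothetical function names and example numbers; it is not taken from the Mascot report itself.

```python
def mass_error_ppm(observed, calculated):
    """Relative mass error in parts per million."""
    return (observed - calculated) / calculated * 1e6


def tolerance_in_da(mass, ppm):
    """Absolute width in Da of a ppm tolerance at a given mass."""
    return mass * ppm / 1e6


# Hypothetical values: an observed mass 0.05 Da above a calculated 1000 Da.
error = mass_error_ppm(1000.05, 1000.0)
# A 50 ppm tolerance widens with mass: 0.05 Da at 1000 Da, 0.15 Da at 3000 Da.
width_low = tolerance_in_da(1000.0, 50)
width_high = tolerance_in_da(3000.0, 50)
```

If the errors on your standard grow roughly in proportion to mass, ppm is likely the better unit; if they are roughly constant across the mass range, Da may fit better.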
In most cases, PMF data comes from a MALDI experiment, and the mass values are MH+. Your peak list will only contain Mr values (relative molecular mass) if the peak picking software has ‘de-charged’ the measured m/z values, possibly because the data contained a mixture of charge states.
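The relationship between a measured m/z value and Mr is simple arithmetic: multiply by the charge and subtract the mass of the added protons. The sketch below assumes protonated positive ions and a proton mass of 1.007276 Da; the function name and example values are illustrative only.

```python
PROTON_MASS = 1.007276  # Da, assumed value for the mass of a proton


def mz_to_mr(mz, charge=1):
    """Convert an m/z value for a protonated ion to relative molecular mass."""
    return mz * charge - charge * PROTON_MASS


# A singly charged MH+ value and a doubly charged value for the same peptide.
mr_from_mh = mz_to_mr(1001.007276)
mr_from_2plus = mz_to_mr(501.007276, charge=2)
```

Both example values correspond to an Mr of 1000 Da, which is the kind of conversion a 'de-charging' step performs.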
Most modern instruments produce monoisotopic mass values. You will only have average masses if the entire isotope distribution has been centroided into a single peak, which usually implies very low resolution. (If you get this setting wrong, the mass errors will be very large and show a strong trend, because the difference between an average and a monoisotopic mass for peptides and proteins is approximately 0.06%.)
If decoy is checked, Mascot repeats the search against a database in which each protein sequence has been randomised. If you have a score close to the significance threshold and are wondering whether the match is reliable, it can help to see the best score from the randomised, decoy database. If this is similar to that from the target, or higher, this can be a useful caution.
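One simple way to picture a randomised sequence is a shuffle that keeps the length and amino acid composition of the original entry. This is an assumption made for illustration; the text above does not say exactly how Mascot randomises each protein, and the function name and toy sequence are invented.

```python
import random


def randomised_decoy(sequence, seed=None):
    """Return a shuffled copy of a protein sequence.

    Length and amino acid composition are preserved; residue order is not.
    """
    rng = random.Random(seed)  # seed makes the shuffle repeatable
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


target = "MKWVTFISLLLLFSSAYSR"  # toy sequence for illustration
decoy = randomised_decoy(target, seed=1)
```

A decoy built this way has the same size and composition as the target, so any match to it can only be a chance match.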
Report determines the maximum number of hits displayed in a search results report. Always choose AUTO to display only the protein hits with significant scores (plus one more, in case there are no significant hits).