PASTAQ

PASTAQ: Pipelines And Systems for Threshold Avoiding Quantification

PASTAQ provides a set of tools for high-performance pre-processing of LC-MS/MS data. Avoiding the use of arbitrary intensity thresholds in early stages of the pipeline can aid in the detection of important biological low intensity signals common in metabolomics and proteomics experiments.

The core of PASTAQ is implemented in C++ and currently comes in three flavours:

A C++ library to implement new high-performance tools.
Python bindings for general data processing, a default Data Dependent Acquisition (DDA) pipeline, and access to the tools to build custom analysis pipelines.
A graphical user interface (PASTAQ-GUI) for easy parametrization of the DDA pipeline (beta).

PASTAQ is suitable to build personalized pipelines using its algorithms for isotope quantification, retention time alignment, deisotoping/feature detection, can be combined with identifications obtained with any engine that outputs the mzIdentML format (and engines whose output can be converted to the mzIdentML format).

The default DDA pipeline can be used for proteomics and metabolomics analyses, and outputs a variety of quantitative tables as described below. It also produces quality control plots that can be used to assess the accuracy of the data and pre-processing algorithms.

Currently annotation of peaks and features is implemented for proteomics using mzIdentML standard, while similar annotation for metabolomics using peaks/features accurate mass and MS/MS spectra can be implemented in straightforward way.

Installation

The simplest way of getting started with PASTAQ is to install it using pip:

pip install pastaq

Alternatively, PASTAQ can be downloaded and compiled from source by downloading or cloning the repository and following the instructions in the README.md. Building from source requires CMake and a C++ compliant compiler. The default installation will build the PASTAQ C++ library and its corresponding Python bindings.

We also provide standalone PASTAQ-GUI installers for Windows in the release page.

Basic Usage

The following instructions apply to the usage of PASTAQ’s built-in Python bindings. You can also find a brief tutorial for PASTAQ-GUI here.

Further Developments

We continue to develop PASTAQ, and are always open for collaborations to add new algorithms and improve the quality of the existing tools. If you want to get involved, feel free to get in touch with us to discuss prospective projects:

Statistics and post-hoc analysis

PASTAQ focuses on pre-processing algorithms for quantification and assessment of quality of LC-MS/MS data. Furthermore, it doesn’t perform any statistics, but readily available solutions already exists, such as Perseus, Excel, or via R or Python scripts. You can find a brief tutorial of how to use PASTAQ quantitative tables in Perseus here.

DDA Pipeline Parameters

The following is an explanation of the parameters that can be configured for the DDA pipeline. In general, the default parameters can be used directly with the exception of the instrument related configuration. The following prepares the parameters for data acquired with an Orbitrap instrument with 70000 resolution at reference m/z of 200 and expected peak width in retention time of 20 seconds (FWHM).

import pastaq

params = pastaq.default_parameters('orbitrap', 20)
params['resolution_ms1'] = 70000
params['reference_mz'] = 200

Input files

Input files can be passed to the pipeline in the following way:

input_files = [
    {'reference': False, 'raw_path': '1_1.mzXML', 'group': 'a', 'ident_path': '1_1.mzid'},
    {'reference': False, 'raw_path': '1_2.mzXML', 'group': 'a', 'ident_path': '1_2.mzid'},
    {'reference': False, 'raw_path': '1_3.mzXML', 'group': 'a', 'ident_path': '1_3.mzid'},
    {'reference': False, 'raw_path': '2_1.mzXML', 'group': 'b', 'ident_path': '2_1.mzid'},
    {'reference': False, 'raw_path': '2_2.mzXML', 'group': 'a', 'ident_path': '2_2.mzid'},
    {'reference': False, 'raw_path': '2_3.mzXML', 'group': 'a', 'ident_path': '2_3.mzid'},
]

If reference is enabled for a single file, it will be used for retention time alignment. If it is set for multiple files, a similarity search will be performed to find the most optimal reference for alignment. If the reference field is not present in any files or set to false in all of them, an exhaustive similarity search will be performed to identify the optimal reference for retention alignment of the whole dataset.

raw_path specifies the path to the .mzXML or .mzML file, and ident_path the corresponding identification file in .mzIdentML version 1.1+ format.

The group is used for the selection of isotope peaks or features that are present in at least a given percentage of samples in any given group as described below.

Instrument configuration

These parameters are used mostly for configuring the extent of smoothing that is applied when resampling the data and the initial estimation of the region-of-interest (ROI) for the quantification of isotope peaks via peak fitting.

`instrument_type`

This parameter defines the type of mass spectrometer used for data acquisition. Currently, PASTAQ supports Orbitrap (orbitrap), time-of-flight (tof), (triple) quad (quad) and Fourier-transform ion cyclotron resonance (fticr).

`resolution_ms1`, `resolution_msn`, `reference_mz`

The selected resolution at the MS1 (resolution_ms1) and MS2 (resolution_msn) levels with respect to the reference m/z (reference_mz).

`avg_fwhm_rt`

Estimated peak width in seconds of chromatographic peaks given as full-width half-maximum. A preliminary exploration of some selected isotope peaks from the extracted-ion chromatogram (XIC) or base-peak chromatogram (BIC) can be used to measure a rough estimate of this parameter. Peak widths can vary widely throughout the chromatographic range, so it is not necessary to provide an accurate measurement. The current implementation expects no change in peak width during the chromatographic elution, however, the Gaussian peak fitting algorithm is tolerant to variations peak width in chromatographic peaks.

Raw data

If users want to process only a subrange in the m/z and/or retention time dimensions, they can do so by adjusting min_mz, max_mz, min_rt and max_rt parameters. Additionally, if the data contains scans in both positive and negative polarities, the polarity setting must be set to either 'pos' or 'neg'. Failure to do so will impact further pre-processing steps such as resampling and peak detection. Positive and negative polarities should thus be separated in different output directories.

Resampling

These parameters can be used for controlling the extent of smoothing applied to the data when resampling and to adjust the size of the output matrix. The resampling procedure will project the raw data into a regular 2D map in which the expected widths of peaks in both the retention time and m/z dimensions remain the same throughout their respective ranges. This ensures that the number of sampling points for all isotope peaks will be similar in the entire map.

By default, num_samples_mz and num_samples_rt establish that there will be at least 5 sampling points within the FWHM of a peak in each dimension. This should be sufficient for the estimation of initial local-maxima, but if not enough memory is available in the system, it may be reduced at the user’s discretion.

The amount of smoothing can be controlled with smoothing_coefficient_mz and smoothing_coefficient_rt, which act as a smoothing multiplier over the Gaussian kernel set at the estimated width (in sigma units) that is automatically calculated from the given instrumental settings and adjusted for local peak resolution in m/z dimension.

Peak detection

Detection of isotopic peaks is performed automatically based on previous settings, but the user may want to configure a maximum number of peaks to keep with the max_peaks parameter. It is advised to keep this parameter at high value, as noise discrimination is performed in the later peak matching stages of the pipeline without

Retention time alignment

The retention time alignment algorithm (Warp2D) divides the retention time range of chromatograms in warp2d_num_points equidistant sampling points. Windows of warp2d_window_size number of points are used to establish the nodes at which the warping will occur. The warp2d_slack indicates how many points the warping nodes can be moved to shorten or extend a segment. When comparing the similarity of each of the segment during the warping, a maximum number of the most abundant isotope peaks (warp2d_peaks_per_window) can be adjusted, allowing to control the calculation time. Note that the retention time range is expanded by warp2d_rt_expand_factor at the beginning and end of the retention time range to avoid edge effects.

As a practical example, if the retention time range goes from 0 to 1000 seconds, with warp2d_num_points = 1000, warp2d_window_size = 100 and warp2d_slack = 20, the range will be divided in 1 second segments, with each alignment window being 100 seconds long, and we can have a maximum warping deviation of +/- 20 seconds.

The default parameters will work fine for common retention time run lengths, but we are working to simplify the parametrization of Warp2D to use a selected deviation in number of sigmas, similar to other parameters in the pipeline.

Deisotoping/Feature detection

Users generally do not need to manually adjust the parameters of the deisotoping/feature detection procedure since those parameters are derived automatically. However, users may want to adjust the feature_detection_charge_states parameter to select the set of possible charge states that will be explored.

Peak/Feature matching

When matching peaks/features across multiple files, a tolerance in m/z and retention time in number of sigmas can be adjusted with metamatch_n_sig_mz and metamatch_n_sig_rt respectively. Additionally, there is a filtering procedure that ensures that only clusters of matched peaks that contain non-zero values in a given percentage of samples are kept. This process works on individual groups, so if metamatch_fraction = 0.7, with 10 samples from group a and 20 of group b, a cluster will be retained if there are at least 7 samples from group a or 14 samples from group b with non-zero values.

Annotation linking and identifications

As with other PASTAQ algorithms, the tolerance for linking MS/MS annotations and identifications is expressed in number of sigmas in m/z and retention time using the parameters link_n_sig_mz and link_n_sig_rt, respectively. In general, for linked peptide-spectrum-matches (PSMs), we may want to keep all the identifications, the user may want to keep all the identifications, not only the ones that has the highest scores, the parameter ident_max_rank_only should be set to False. Similarly, to keep annotations that don’t meet the FDR threshold or are marked as decoy, the parameters ident_require_threshold and ident_ignore_decoy should be set to False, respectively.

Quality control plots

A number of parameters are available to configure the appearance of quality control plots. A color palette can be set with qc_plot_palette parameter, which will accept any color palette included in the seaborn library (husl, crest, Spectral, Flare, etc.). By default image format is set to png. Other formats can be set using qc_plot_extension parameter using common image formats, such as pdf, eps, tiff or svg. The following configuration parameters, can be used to adjust the general style of the output images: qc_plot_dpi (Image DPI), qc_plot_font_family (Font family, e.g. 'serif', 'sans-serif'), qc_plot_font_size, qc_plot_fig_size_x, qc_plot_fig_size_y. Plot legends are not enabled by default, to show the legend, set qc_plot_fig_legend to True. However, this may make certain plots with large number of chromatograms difficult to read.

When plots contain multiple samples, opacity (alpha) can be decreased to blend the different colors. The alpha channel can be controlled for each type of plot by changing qc_plot_fill_alpha, qc_plot_line_alpha, qc_plot_scatter_alpha to a number between 0.0 and 1.0. Alpha can be dynamically selected by setting these parameters to 'dynamic' instead. In that case the alpha would be calculated as alpha = 1 / n_samples, and will never go below the value in qc_plot_min_dynamic_alpha.

If qc_plot_per_file is used, individual images will be generated for each QC plot, instead of having all plots stacked in one figure. Other adjustable QC plot parameters include qc_plot_line_style, which will show certain plots (e.g. XIC, density) as filled plots or line plots by selecting 'fill' or 'line', respectively. In some cases, the amount of points in the mz vs sigma_mz scatterplot may be too large for visual assessment, therefore, qc_plot_mz_vs_sigma_mz_max_peaks will select a maximum number of points to use, and qc_plot_scatter_size parameter can be used to set the point size.

Similarity plots use the number of most intensive peaks set in similarity_num_peaks.

Quantitative table generation

A number of quantitative metrics are calculated in different stages of the pipeline. All this information is stored for the generation of the output csv tables. To avoid generating unnecessary files, a quantitative metric must be selected for isotopes (quant_isotopes) and features (quant_features). Isotope quantification can be set to peak height ('height') or peak volume ('volume'). Likewise, features quantitative information can be based either on the monoisotopic peak ('monoisotopic_height', 'monoisotopic_volume'), the peak with the highest intensity ('max_height', 'max_volume') or the sum of all peaks in the isotopic distribution ('total_height', 'total_volume').

When identification information is provided and quant_features_charge_state_filter is set to True, annotations for features in which the MS1 charge state differs from the one in the identification file will be discarded in the feature quantitative tables.

The type of annotation linkage in quant_ident_linkage can be set to theoretical_mz or msms_event, which set the annotations type in the annotation tables either based on theoretical value of the highest scoring PSM or based on precursor ion m/z stored in the raw data.

When identifications are present in more than one sample, a consensus identification is generated for each peak or feature cluster as long as quant_consensus is enabled. This consensus will select the identifications that appear in the maximum number of files. Additionally will ignore identifications that are not present in at least quant_consensus_min_ident number of samples. By default, this is 2, but more stringent criteria could be used to reduce false positives.

By default all annotations are saved with their corresponding associative tables in addition to the aggregated versions. This can lead to very large files, and thus quant_save_all_annotations can be disabled to save some time and disk space.

Peptide tables are generated by aggregating clusters with the same consensus sequence by summing up multiple charge states of the same peptide. For protein group quantification it is necessary to use protein inference to select which peptides to aggregate for which protein. A minimum number of peptides present for a protein can be selected with quant_proteins_min_peptides. If quant_proteins_remove_subset_proteins is enabled, peptides are only considered for quantification of proteins (razor protein) with the highest number of mapped peptides following parsimonious law. In case a peptide can’t be assigned to a unique protein as ‘razor’ the user can choose to use them for quantification of all mapping proteins where the peptide could be considered as razor (quant_proteins_ignore_ambiguous_peptides is set to False) or to ignore them (quant_proteins_ignore_ambiguous_peptides is set to True). For protein inference, the following quantifications can be selected with the quant_proteins_quant_type parameter:

‘unique’: Only unique peptides are considered for quantification.
‘razor’: Unique and peptides assigned as most likely according to the Occam’s razor constrain.
‘all’: All peptides will be used for quantification for all protein groups. Thus shared peptides will be used more than once.

How to cite this work

The main manuscript has been published in as Open Access Analytical Chemistry with the following details: Alejandro Sánchez Brotons, Jonatan O. Eriksson, Marcel Kwiatkowski, Justina C. Wolters, Ido P. Kema, Andrei Barcaru, Folkert Kuipers, Stephan J. L. Bakker, Rainer Bischoff, Frank Suits, and Péter Horvatovich, Pipelines and Systems for Threshold-Avoiding Quantification of LC–MS/MS Data, Analytical Chemistry, 2021, 93, 32, 11215–11224.