MassNet-Converter is a tool for converting commonly used mass spectrometry data formats into the Mass Spectrometry DDA Tensor (MSDT) format—an efficient, standardized, and AI-friendly representation designed for large-scale proteomics analysis.
- Supported Input Formats:
  - mzML (standard open format)
  - MGF (Mascot Generic Format)
  - Bruker’s native .d directory format (timsTOF)
- Output Format:
  - Standardized MSDT files stored in Apache Parquet, enabling fast I/O, high compression, and compatibility with distributed GPU/TPU training pipelines.
- Optimized for AI Workflows:
  - Converts raw and search result data into a structured tensor format for seamless integration with machine learning models such as XuanjiNovo, DeepLC, and DDA-BERT.
- Dockerized Deployment:
  - A ready-to-use Docker image is available on Docker Hub.
  - Run the converter in a reproducible environment without installing dependencies manually.
We provide test data from Thermo, SCIEX, and Bruker platforms (in mzML, .d, and MGF formats), configuration files for the FragPipe and Sage search engines, and the converted MSDT result files.
All test data and configuration files are available for download via the Google Drive link below:
- 🔗 Download Link: test data and configs
Below are command line examples for running the data conversion using the guomics2017/msdt-converter:v1.1 Docker image for different instrument data.
Note: Please replace the local path D:\Work\MSDT_Converter in the commands with your actual data storage path.
Uses the config_mzml.json configuration file.
docker run --rm -v "D:\Work\MSDT_Converter":/home guomics2017/msdt-converter:v1.1 -config=/home/config_mzml.json
Uses the config_tims.json configuration file.
docker run --rm -v "D:\Work\MSDT_Converter":/home guomics2017/msdt-converter:v1.1 -config=/home/config_tims.json
Uses the config_wiff.json configuration file.
docker run --rm -v "D:\Work\MSDT_Converter":/home guomics2017/msdt-converter:v1.1 -config=/home/config_wiff.json
The running times for our test data by step are roughly as follows:
| Step | mzml | tims | wiff |
|---|---|---|---|
| generate_rawspectrum | 20 s | 3 min | 1 min |
| generate_sage_search_result | 30 s | 1 min | 1 min |
| generate_fragpipe_search_result | 7 min | 7 min | 5 min |
| generate_msdt(sage) | 5 s | 30 s | 1 min |
| msdt_2_mgf | 5 s | 1 min | 1 min |
| convert_2_msdt | 5 s | 5 s | 5 s |
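For a rough end-to-end estimate, the per-step timings in the table above can be summed per input type. The sketch below simply hard-codes the table values in seconds; the dictionary and function names are illustrative and are not part of the converter.

```python
# Approximate per-step running times copied from the table above, in seconds.
STEP_SECONDS = {
    "generate_rawspectrum":            {"mzml": 20,  "tims": 180, "wiff": 60},
    "generate_sage_search_result":     {"mzml": 30,  "tims": 60,  "wiff": 60},
    "generate_fragpipe_search_result": {"mzml": 420, "tims": 420, "wiff": 300},
    "generate_msdt(sage)":             {"mzml": 5,   "tims": 30,  "wiff": 60},
    "msdt_2_mgf":                      {"mzml": 5,   "tims": 60,  "wiff": 60},
    "convert_2_msdt":                  {"mzml": 5,   "tims": 5,   "wiff": 5},
}


def total_seconds(data_type, skip=()):
    """Sum the listed step times for one input type, optionally skipping steps."""
    return sum(
        times[data_type]
        for step, times in STEP_SECONDS.items()
        if step not in skip
    )


for data_type in ("mzml", "tims", "wiff"):
    full = total_seconds(data_type)
    no_fragpipe = total_seconds(data_type, skip=("generate_fragpipe_search_result",))
    print(f"{data_type}: ~{full / 60:.1f} min total, "
          f"~{no_fragpipe / 60:.1f} min without FragPipe")
```

With these figures, the FragPipe search accounts for more than half of the total time on every platform, so disabling that step is the quickest way to shorten a test run.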
Setup is available via both Docker and Conda. Choose between Option A (Docker) and Option B (Conda) below.
This repository provides a self-contained Docker image that encapsulates all necessary environments and dependencies for the MassNet-DDA conversion utility. By using this image, users can quickly launch the tool without complex setup.
- Docker Desktop (for Windows/Mac) or Docker Engine (for Linux) must be installed and running.
On Windows, the process involves pulling the image from Docker Hub and then running a container, mapping your local data directory to the container's working directory.
- Open Docker Desktop. Ensure the Docker engine is running.
- Pull the Docker Image from the registry using your command line (e.g., PowerShell or Command Prompt):
docker pull guomics2017/msdt-converter:v1.1
- Run the Container by mounting your local working directory (D:\Work\MassNet-DDA in this example) to the container's internal data path (/home/test_data) and specifying the path to your configuration file (config.json):
docker run --rm -v "D:\Work\MassNet-DDA":/home/test_data guomics2017/msdt-converter:v1.1 -config=/home/test_data/config.json
- Note: The -v flag maps your local directory to the container. The paths must be adjusted according to your actual setup.
On Linux, the process is the same: pull the image from Docker Hub, then run a container with your local data directory mapped to the container's working directory.
- Ensure the Docker service is running.
- Pull the Docker Image from the registry in your terminal:
docker pull guomics2017/msdt-converter:v1.1
- Run the Container (Example using a typical Linux absolute path):
docker run --rm -v /home/user/MassNet-DDA:/home/test_data guomics2017/msdt-converter:v1.1 -config=/home/test_data/config.json
On macOS, the process is the same: pull the image from Docker Hub, then run a container with your local data directory mapped to the container's working directory.
- Open Docker Desktop. Ensure the Docker engine is running.
- Pull the Docker Image from the registry in your terminal:
docker pull guomics2017/msdt-converter:v1.1
- Run the Container (Example using a typical macOS path):
docker run --rm -v /Users/yourname/Documents/MassNet-DDA:/home/test_data guomics2017/msdt-converter:v1.1 -config=/home/test_data/config.json
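The docker run examples above differ only in the host directory, so they can be generated from one small helper. This is a minimal sketch, not part of the project: the function name build_docker_command is hypothetical, and the image tag and container path are taken from the commands above. Passing the command as an argument list avoids shell-quoting problems with Windows paths.

```python
import subprocess

IMAGE = "guomics2017/msdt-converter:v1.1"  # image tag used in the commands above
MOUNT_POINT = "/home/test_data"            # container-side path from the examples


def build_docker_command(host_dir, config_name="config.json"):
    """Return the argv list equivalent to the `docker run` examples above."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{MOUNT_POINT}",
        IMAGE,
        f"-config={MOUNT_POINT}/{config_name}",
    ]


if __name__ == "__main__":
    # Example host directory from the Linux command; replace with your own.
    cmd = build_docker_command("/home/user/MassNet-DDA")
    print(" ".join(cmd))
    # Uncomment to actually launch the container (Docker must be running):
    # subprocess.run(cmd, check=True)
```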
You can install MassNet-Converter in a Conda environment. This option is recommended if you prefer a Python-native setup or wish to modify the source code.
⏱️ Estimated setup time: ~2–5 minutes
Download JDK 11 from here, unzip it, and move it to the project root directory.
⚠️ Note: This project integrates FragPipe v21.1, which includes core components such as Philosopher, diaTracer, and IonQuant. Users working with a different version of FragPipe should download the corresponding components for that version and ensure the FragPipe runtime environment is properly configured.
⚠️ Note: FragPipe versions may differ in their output fields, directory structures, and execution commands. When working with results generated by a different version, users are advised to consult the official output specifications for that release. Accordingly, configuration files (to be specified) or the parsing script (MSDT-Converter/scripts/search_engine.py) may need to be adjusted to ensure correct interpretation and processing of the data.
To learn more about FragPipe’s usage and configuration, please visit: https://github.com/Nesvilab/FragPipe.
- Create a new conda environment first:
conda create --name msdt-converter python=3.13
This creates a Conda environment named msdt-converter.
- Activate this environment by running:
conda activate msdt-converter
- Install dependencies:
pip install -r ./requirements.txt
- Set up file permissions:
After cloning the repository and completing the installation, run the following command in the project’s root directory to ensure that all files and subdirectories have the appropriate access and execution permissions:
chmod -R 775 .
⚠️ Note: Before running the script, please download test data and configs to the root directory.
python convert.py -config=/home/test_data/config.json
The container requires a single JSON configuration file to define which steps to execute and to specify all necessary input, output, and processing parameters.
The configuration is structured by the main processing steps. Each primary object controls a specific function.
| Parameter Name | Description |
|---|---|
| generate_rawspectrum | Parameters for extracting raw spectral data into a .tsv file. |
| generate_sage_search_result | Parameters for running the Sage search engine. |
| generate_fragpipe_search_result | Parameters for running the FragPipe search pipeline. |
| generate_msdt | Parameters for converting search results and raw data into the MSDT format. |
| convert_2_msdt | Parameters for converting other formats (like MGF) directly to MSDT. |
| msdt_2_mgf | Parameters for converting MSDT back to the MGF format. |
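A starting configuration can be generated programmatically. The sketch below assumes that each step is a top-level JSON object keyed by its step name and carrying its own need flag, which is how the parameter tables read; check this against the provided test configs before relying on it. generate_msdt is left out because it uses per-data-type flags instead of a single need. The name skeleton_config is hypothetical.

```python
import json

# Steps documented with a plain `need` flag.
# generate_msdt is omitted: it uses per-data-type flags instead.
STEPS_WITH_NEED = [
    "generate_rawspectrum",
    "generate_sage_search_result",
    "generate_fragpipe_search_result",
    "convert_2_msdt",
    "msdt_2_mgf",
]


def skeleton_config(enabled):
    """Build a config dict with `need` set to True only for the enabled steps."""
    unknown = set(enabled) - set(STEPS_WITH_NEED)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    return {step: {"need": step in enabled} for step in STEPS_WITH_NEED}


# Enable only raw spectrum extraction; all other steps are switched off.
config = skeleton_config({"generate_rawspectrum"})
print(json.dumps(config, indent=2))
```

The remaining parameters of each enabled step (paths, data type, and so on) still have to be filled in by hand.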
generate_rawspectrum parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need | boolean | true | Set to true to execute this step (extract raw spectra). |
| data_type | string | "mzml" | The type of input data: mzml, tims, or wiff2mzml (for mzML converted from WIFF). |
| data_path | string | /home/test_data/.../DDA_ingel_3D.mzML | Input. Absolute path to the raw data file as seen inside the container (that is, under the Docker-mounted volume). |
| output | string | /home/test_data/.../3D_rawspectrum.tsv | Output. Path for the generated raw spectrum TSV file. |
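Mistakes in this block are cheap to catch before a run starts. The following sketch checks a generate_rawspectrum block against the table above. It is not the converter's own validation: the function name is hypothetical, and treating all four keys as required and both paths as absolute container paths are assumptions.

```python
ALLOWED_DATA_TYPES = {"mzml", "tims", "wiff2mzml"}
REQUIRED_KEYS = ("need", "data_type", "data_path", "output")


def check_rawspectrum_block(block):
    """Return a list of problems found in a generate_rawspectrum block."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in block]
    if "need" in block and not isinstance(block["need"], bool):
        problems.append("need must be a boolean")
    if "data_type" in block and block["data_type"] not in ALLOWED_DATA_TYPES:
        problems.append(f"unsupported data_type: {block['data_type']!r}")
    for key in ("data_path", "output"):
        # Assumption: paths are resolved inside the container, so they are absolute.
        if key in block and not str(block[key]).startswith("/"):
            problems.append(f"{key} should be an absolute container path")
    return problems


# Hypothetical file names; the real test-data paths are elided in the table above.
example = {
    "need": True,
    "data_type": "mzml",
    "data_path": "/home/test_data/sample.mzML",
    "output": "/home/test_data/sample_rawspectrum.tsv",
}
print(check_rawspectrum_block(example))
```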
generate_sage_search_result parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need | boolean | true | Set to true to execute this step (run Sage search). |
| workdir | string | /home/test_data/2_generate_sage_search_result | Input/Output. Working directory where Sage will generate its result files. |
| fasta | string | /home/test_data/.../Homo_sapiens_reviewed.fasta | Input. Path to the FASTA protein sequence database file. |
| data_path | string | /home/test_data/.../DDA_ingel_3D.mzML | Input. Path to the mzML file used for searching. |
| config_path | string | /home/test_data/.../sage_config.json | Input. Path to the specific configuration file for the Sage search engine. |
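Because the paths in this block are container paths, a missing file on the host only shows up once the container is running. The sketch below maps each Sage input path back to the mounted host directory and reports the ones that cannot be found. The function name and the default container root (/home/test_data, as in the Docker examples) are assumptions, and the file names are hypothetical.

```python
from pathlib import Path

SAGE_INPUT_KEYS = ("fasta", "data_path", "config_path")


def missing_sage_inputs(block, host_root, container_root="/home/test_data"):
    """Return the Sage input keys whose files cannot be found on the host."""
    missing = []
    prefix = container_root.rstrip("/") + "/"
    for key in SAGE_INPUT_KEYS:
        container_path = block.get(key, "")
        if not container_path.startswith(prefix):
            missing.append(key)  # not under the mounted directory
            continue
        # Translate the container path back to the mounted host directory.
        host_path = Path(host_root) / container_path[len(prefix):]
        if not host_path.exists():
            missing.append(key)
    return missing


# Hypothetical file names; the real test-data paths are elided in the table above.
block = {
    "fasta": "/home/test_data/db.fasta",
    "data_path": "/home/test_data/run.mzML",
    "config_path": "/home/test_data/sage_config.json",
}
print(missing_sage_inputs(block, host_root="."))
```

An empty list means every input was found under the host directory that will be mounted with -v.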
generate_fragpipe_search_result parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need | boolean | true | Set to true to execute this step (run FragPipe search). |
| workdir | string | /home/test_data/3_generate_fragpipe_search_result | Input/Output. Working directory where FragPipe will generate results. |
| data_path | string | /home/test_data/.../DDA_ingel_3D.mzML | Input. Path to the mzML file used for searching. |
| workflow_path | string | /home/test_data/.../LFQ_DDA_human_noNQ.workflow | Input. Path to the FragPipe workflow configuration file. The FASTA path must be set inside the workflow file. |
| manifest_path | string | /home/test_data/.../fragpipe-files.fp-manifest | Output. Path for the FragPipe temporary manifest output file. |
| thread_num | integer | 10 | The number of CPU threads to use for the FragPipe search process. |
The generate_msdt section contains nested configurations based on data type (tims, mzml, wiff).
tims parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need_tims | boolean | false | Set to true to generate MSDT from tims data (not currently configured in the example). |
| rawspectrum_path | string | "" | Input. Path to the raw spectrum file. |
| sage_search_result_path | string | "" | Input. Path to the Sage search result file. |
| unify_residue | boolean | true | If true, the residue format will be converted to the unified MSDT format. |
| output | string | "" | Output. Path for the generated Sage MSDT file. |
mzml parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need_mzml | boolean | true | Set to true to generate MSDT from mzML-related data. |
| need_sage | boolean | true | Set to true to generate MSDT from Sage search results. |
| need_fragpipe | boolean | true | Set to true to generate MSDT from FragPipe search results. |
| rawspectrum_path | string | /home/test_data/.../3D_rawspectrum.tsv | Input. Path to the raw spectrum file. |
| sage_search_result_path | string | /home/test_data/.../D_search_result.tsv | Input. Path to the Sage search result file. |
| fp_pin_path | string | /home/test_data/.../A18..._edited.pin | Input. Path to the FragPipe .pin file. |
| sage_unify_residue | boolean | true | If true, the Sage residue format is converted to the unified MSDT format. |
| fp_unify_residue | boolean | true | If true, the FragPipe residue format is converted to the unified MSDT format. |
| sage_output | string | /home/test_data/.../sage_msdt.parquet | Output. Path for the generated Sage MSDT .parquet file. |
| fp_output | string | /home/test_data/.../fp_msdt.parquet | Output. Path for the generated FragPipe MSDT .parquet file. |
wiff parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need_wiff | boolean | false | Set to true to generate MSDT from WIFF-related data (not currently configured in the example). |
| wiff_mzml_path | string | "" | Input. Path to the mzML file converted from WIFF. |
| rawspectrum_path | string | "" | Input. Path to the raw spectrum file. |
| sage_search_result_path | string | "" | Input. Path to the Sage search result file. |
| unify_residue | boolean | true | If true, the residue format will be converted to the unified MSDT format. |
| output | string | "" | Output. Path for the generated Sage MSDT file. |
The convert_2_msdt section handles direct conversion from other data formats to MSDT.
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need | boolean | true | Set to true to execute this MGF conversion step. |
| mgf_path | string | /home/test_data/.../180624_G12.MGF | Input. Path to the MGF file to be converted. |
| output_path | string | /home/test_data/.../180624_G12.parquet | Output. Path for the generated MSDT .parquet file. |
| field_type_dict | object | {...} | A dictionary defining the fields present in the MGF file and their corresponding data types. |
field_type_dict details:
| Key | Data Type | Description |
|---|---|---|
| TITLE | "string" | The title of the spectrum (required). |
| PEPMASS | "float" | The precursor mass (required). |
| CHARGE | "int" | The precursor charge (e.g., "2+"). Must be convertible to integer. |
| RTINSECONDS | "float" | The retention time in seconds. |
| INSTRUMENT | "string" | The instrument name. |
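To illustrate what a field_type_dict implies, the sketch below casts the raw header strings of one MGF spectrum to the declared types, including the "2+" style charge mentioned in the table. It is an illustration only, not the converter's parsing code; cast_mgf_fields and CASTERS are hypothetical names, and skipping absent fields is an assumption.

```python
CASTERS = {"string": str, "float": float, "int": int}


def cast_mgf_fields(raw_fields, field_type_dict):
    """Cast raw MGF header strings to the types named in field_type_dict."""
    typed = {}
    for key, type_name in field_type_dict.items():
        if key not in raw_fields:
            continue  # assumption: fields absent from a spectrum are skipped
        value = raw_fields[key].strip()
        if key == "CHARGE":
            # MGF writes charge as e.g. "2+"; strip the sign before converting.
            sign = -1 if value.endswith("-") else 1
            typed[key] = sign * int(value.rstrip("+-"))
        elif key == "PEPMASS":
            # PEPMASS may carry an optional intensity after the first value.
            typed[key] = float(value.split()[0])
        else:
            typed[key] = CASTERS[type_name](value)
    return typed


field_type_dict = {
    "TITLE": "string",
    "PEPMASS": "float",
    "CHARGE": "int",
    "RTINSECONDS": "float",
    "INSTRUMENT": "string",
}
# Hypothetical header values for a single spectrum.
raw = {"TITLE": "scan=1", "PEPMASS": "445.12 1000.0", "CHARGE": "2+", "RTINSECONDS": "12.5"}
print(cast_mgf_fields(raw, field_type_dict))
```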
msdt_2_mgf parameters:
| Parameter | Data Type | Example Value | Description |
|---|---|---|---|
| need | boolean | true | Set to true to execute this step (convert MSDT back to MGF). |
| msdt_path | string | /home/test_data/.../sage_msdt.parquet | Input. Path to the MSDT .parquet file to be converted. |
| output_path | string | /home/test_data/.../sage.mgf | Output. Path for the generated MGF file. |
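For orientation, this is roughly what one spectrum looks like once written as MGF. The MSDT column names are not documented here, so the sketch takes plain arguments rather than reading a .parquet file; format_mgf_entry is a hypothetical helper, not the converter's implementation, and the values are made up.

```python
def format_mgf_entry(title, pepmass, charge, mz_values, intensities, rt_seconds=None):
    """Format one spectrum as an MGF `BEGIN IONS` ... `END IONS` block."""
    if len(mz_values) != len(intensities):
        raise ValueError("mz_values and intensities must have the same length")
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={pepmass}"]
    # MGF writes the charge with a trailing sign, e.g. "2+".
    lines.append(f"CHARGE={abs(charge)}{'+' if charge >= 0 else '-'}")
    if rt_seconds is not None:
        lines.append(f"RTINSECONDS={rt_seconds}")
    # One "m/z intensity" pair per line.
    lines += [f"{mz} {inten}" for mz, inten in zip(mz_values, intensities)]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


entry = format_mgf_entry("scan=1", 445.12, 2, [100.5, 200.25], [10.0, 20.0], rt_seconds=12.5)
print(entry)
```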
If you use MassNet-Converter in your work, please cite the following publication:
Jun, A., Zhang, X., Zhang, X., Wei, J., Zhang, T., Deng, Y., ... & Guo, T. (2025). MassNet: billion-scale AI-friendly mass spectral corpus enables robust de novo peptide sequencing. bioRxiv, 2025-06.