---
title: "UniMol_Tools v0.15: Open-Source Lightweight Pre-Training Framework for One-Click Reproduction of Original Uni-Mol Accuracy!"
date: 2025-09-29
categories:
- Uni-Mol
---

The official release of UniMol_Tools v0.15 introduces lightweight pre-training together with a Hydra-based command-line tool covering the full workflow. Developers can complete the entire pipeline of preprocessing → pre-training → fine-tuning → property prediction with just a few lines of code, and the reproduced results are nearly identical to those of the original Uni-Mol. The new version aims to provide an efficient and reproducible computing platform for research in materials science, medicinal chemistry, and molecular design.

<!-- more -->

## Core Highlights

This release is the first research tool to simultaneously cover molecular representation, property prediction, and custom pre-training in a single package.

1. Lightweight Pre-Training
The complete pipeline supports masking strategies, multi-task loss functions, metric aggregation, and distributed training, and is compatible with custom pre-trained models and dictionary paths.
2. One-Command Execution
Hydra configuration management enables one-click runs of the training, representation, and prediction workflows, making experiments easier to reproduce.
3. Research-Friendly Optimizations
Dynamic loss scaling, mixed-precision training, distributed support, and checkpoint resumption adapt the toolkit to large-scale molecular data.
4. End-to-End Modeling
A one-stop solution for data preprocessing, model training, molecular representation generation, and property prediction (see the fine-tuning sketch after this list).
5. Extensibility & Configurability
Abundant configuration files and examples support quick onboarding and customization of personalized tasks.
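
As a taste of the end-to-end workflow in item 4, here is a minimal fine-tuning and prediction sketch. It assumes the `MolTrain` and `MolPredict` interfaces described in the unimol_tools documentation; the file names and settings are illustrative.

```python
# Minimal end-to-end sketch, assuming the MolTrain / MolPredict interfaces
# from the unimol_tools documentation; paths and settings are illustrative.
from unimol_tools import MolTrain, MolPredict

# Fine-tune on a CSV containing a SMILES column and a target column.
clf = MolTrain(
    task='classification',  # or 'regression', among others
    data_type='molecule',
    epochs=10,
    batch_size=16,
    save_path='./exp',
)
clf.fit(data='train.csv')

# Reload the saved model and predict on new molecules.
predictor = MolPredict(load_model='./exp')
preds = predictor.predict(data='test.csv')
```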

### Comparison Between UniMol_Tools v0.15 and the Original Uni-Mol

| Capability | This Release | Original Uni-Mol |
|--------------|--------------|--------------|
| Pre-training code lines | Newly written, over 2,000 lines | Over 6,000 lines |
| Distributed training | Native DDP and mixed precision | Requires manual configuration |
| Data formats | csv / sdf / smi / txt / lmdb | lmdb only |
| Downstream fine-tuning | No weight conversion needed; use unimol_tools.train / predict directly | Requires manual format conversion |

### One-Command Pre-Training

The new version delivers an "out-of-the-box" training experience. Research users can complete the entire pre-training workflow, from data preprocessing to model training, with a single command, significantly lowering the barrier to experimentation.

```bash
# Distributed pre-training via torchrun (DDP); the MLP_* variables are
# provided by the cluster scheduler.
torchrun \
    --nnodes=$MLP_WORKER_NUM \
    --nproc_per_node=$MLP_WORKER_GPU \
    --node_rank=$MLP_ROLE_INDEX \
    --master_addr=$MLP_WORKER_0_HOST \
    --master_port=$MLP_WORKER_0_PORT \
    -m unimol_tools.cli.run_pretrain \
    dataset.train_path=train.csv \
    dataset.valid_path=valid.csv \
    dataset.data_type=csv \
    dataset.smiles_column=smiles \
    training.total_steps=1000000 \
    training.batch_size=16 \
    training.update_freq=1
# dataset.data_type options: csv, sdf, smi, txt, list
```
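
Before scaling out, it can help to sanity-check the configuration in a single process. The sketch below assumes the Hydra entry point also runs without torchrun (an assumption for local debugging, not an official recipe) and simply reuses the override keys from the command above with smaller settings.

```bash
# Single-process smoke test: same Hydra key=value overrides, no launcher.
python -m unimol_tools.cli.run_pretrain \
    dataset.train_path=train.csv \
    dataset.valid_path=valid.csv \
    dataset.data_type=csv \
    dataset.smiles_column=smiles \
    training.total_steps=1000 \
    training.batch_size=16
```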

## Technical Details

1. Multi-Target Masking Loss (Masked Token + 3D Coord + Dist Map)
The pre-training curves overlap with those of the original Uni-Mol by over 99%, ensuring stable performance.
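
Conceptually, the objective is a weighted sum of three terms. The following PyTorch sketch is illustrative rather than the library's internal implementation, and the loss weights are hypothetical placeholders.

```python
# Illustrative sketch of a multi-target masking loss (not the library's
# internal code); loss weights are hypothetical placeholders.
import torch
import torch.nn.functional as F

def multi_target_loss(token_logits, token_target,  # (N, V), (N,)
                      coord_pred, coord_target,    # (N, 3)
                      dist_pred, dist_target,      # (N, N)
                      mask,                        # (N,) bool, masked atoms
                      w_tok=1.0, w_coord=5.0, w_dist=10.0):
    # 1) Cross entropy on atom-type prediction, masked positions only.
    loss_tok = F.cross_entropy(token_logits[mask], token_target[mask])
    # 2) Smooth-L1 on reconstructed 3D coordinates of masked atoms.
    loss_coord = F.smooth_l1_loss(coord_pred[mask], coord_target[mask])
    # 3) Smooth-L1 on the pairwise distance map.
    loss_dist = F.smooth_l1_loss(dist_pred, dist_target)
    return w_tok * loss_tok + w_coord * loss_coord + w_dist * loss_dist
```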

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic01.png" width="70%" height="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic02.png" width="70%" height="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic03.png" width="70%" height="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic04.png" width="70%" height="70%" /></center>

2. Modular Design
The complete workflow can be reproduced with just four files:
```text
unimol_tools/pretrain/
├── dataset.py   # Masking + data pipeline
├── loss.py      # Multi-target loss
├── trainer.py   # Distributed training loop
└── unimol.py    # Model architecture
```
This lowers the barrier to secondary development: modifying a single line of configuration is enough to run a custom task.

3. Backward Compatibility
- Existing APIs such as unimol_tools.train / predict / repr remain unchanged.
- Custom pretrained_model_path and dict_path can be passed in, so old scripts need only two additional parameters to load new weights (see the sketch below).

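A minimal sketch of that upgrade path, assuming the two parameters above are accepted as keyword arguments by the existing training API; the paths are illustrative.

```python
# Sketch: plug freshly pre-trained weights into the existing fine-tuning
# API via the two extra parameters named above; paths are illustrative.
from unimol_tools import MolTrain

clf = MolTrain(
    task='regression',
    data_type='molecule',
    pretrained_model_path='./pretrain_exp/checkpoint_best.pt',  # new weights
    dict_path='./pretrain_exp/dict.txt',                        # matching dictionary
)
clf.fit(data='train.csv')
```
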
## Overview of Updates

- Lightweight pre-training module: the complete pipeline supports masking strategies, multi-target losses over 3D coordinates and distance matrices, metric aggregation, and distributed training.
- Hydra full-workflow CLI: one command runs training, representation, or prediction, with parameters adjustable on the fly (see the representation sketch after this list).
- Enhanced data processing: supports csv / sdf / smi / txt / lmdb, flexibly matching the formats research users commonly rely on.
- Optimized distributed training: native DDP plus mixed precision, with checkpoint resumption.
- Modular design: the complete workflow is reproduced by only four core files, easing secondary development.
- Compatibility with old APIs: new pre-trained weights load without code changes; custom models and dictionary paths are supported.
- Performance and reproducibility: the pre-training curves are highly consistent with those of the original Uni-Mol.

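As referenced in the list above, here is a minimal representation sketch, assuming the `UniMolRepr` interface from the unimol_tools documentation; the SMILES strings are illustrative.

```python
# Minimal representation sketch, assuming the UniMolRepr interface from
# the unimol_tools documentation.
from unimol_tools import UniMolRepr

repr_model = UniMolRepr(data_type='molecule')
smiles_list = ['CCO', 'c1ccccc1']
reprs = repr_model.get_repr(smiles_list, return_atomic_reprs=True)
print(len(reprs['cls_repr']))      # one molecule-level embedding per SMILES
print(len(reprs['atomic_reprs']))  # per-atom embeddings
```
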
## Open-Source Community

UniMol_Tools is one of the open-source projects in the DeepModeling community, and developers interested in the project are welcome to get involved long-term:
- GitHub repo: https://github.com/deepmodeling/unimol_tools
- Documentation: https://unimol-tools.readthedocs.io/
- The issue tracker welcomes bug reports, suggestions, and feature requests.
- New users can refer to the README and documentation for quick onboarding.

If you encounter any problems during use, please open an issue on GitHub or contact us via email.

## About Uni-Mol

Uni-Mol is a widely recognized molecular pre-training model of recent years, dedicated to building a universal 3D molecular modeling framework. As its companion toolkit, UniMol_Tools aims to lower the barrier to applying the model and to improve development efficiency.