---
title: "UniMol_Tools v0.15: Open-Source Lightweight Pre-Training Framework for One-Click Reproduction of Original Uni-Mol Accuracy!"
date: 2025-09-29
categories:
  - Uni-Mol
---

The official release of UniMol_Tools v0.15 introduces lightweight pre-training together with a Hydra-based command-line tool covering the full workflow. Developers can go from preprocessing → pre-training → fine-tuning → property prediction with just a few lines of code, and the reproduced results are nearly identical to those of the original Uni-Mol. The new version aims to provide an efficient and reproducible computing platform for research in materials science, medicinal chemistry, and molecular design.
<!-- more -->
## Core Highlights
This release is, to our knowledge, the first research tool to cover molecular representation, property prediction, and custom pre-training in a single package.

1. **Lightweight Pre-Training**
   The complete pipeline supports masking strategies, multi-task loss functions, metric aggregation, and distributed training, and it is compatible with custom pre-trained models and dictionary paths.
2. **One-Command Execution**
   Hydra configuration management enables one-click execution of training, representation, and prediction workflows, making experiment reproduction more efficient.
3. **Research-Friendly Optimizations**
   Dynamic loss scaling, mixed-precision training, distributed support, and checkpoint resumption adapt the pipeline to large-scale molecular data.
4. **End-to-End Modeling**
   A one-stop solution covering data preprocessing, model training, molecular representation generation, and property prediction.
5. **Extensibility & Configurability**
   Abundant configuration files and examples make it quick to get started and to customize individual tasks.
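
To make the masking strategies in item 1 concrete, here is a minimal, self-contained sketch of a BERT-style atom-masking step. It is illustrative only, not the library's actual implementation; the `[MASK]` token name and the `mask_prob` default are assumptions.

```python
import random

def mask_atoms(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Corrupt an atom-token sequence for masked pre-training.

    Each token is replaced by mask_token with probability mask_prob;
    the indices of masked positions are returned as reconstruction
    targets. Illustrative sketch, not Uni-Mol's exact strategy.
    """
    rng = random.Random(seed)
    corrupted, target_idx = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            target_idx.append(i)
        else:
            corrupted.append(tok)
    return corrupted, target_idx

corrupted, target_idx = mask_atoms(["C", "C", "O", "N", "C", "C"], mask_prob=0.5)
```

During pre-training, the model would then be asked to reconstruct the original tokens at the returned indices.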
### Comparison Between UniMol_Tools v0.15 and the Original Uni-Mol
| Capability | This Release | Original Uni-Mol |
|--------------|--------------|--------------|
| Pre-training code size | Newly written, over 2,000 lines | Over 6,000 lines |
| Distributed training | Native DDP & mixed precision | Requires manual configuration |
| Data formats | csv / sdf / smi / txt / lmdb | lmdb only |
| Downstream fine-tuning | Weights usable with zero conversion; use `unimol_tools.train` / `predict` directly | Requires manual format conversion |
### One-Command Pre-Training
The new version delivers an "out-of-the-box" training experience. Research users can complete the entire pre-training workflow from data preprocessing to model training with a single command, significantly lowering the barrier to experimentation.
```bash
# Launch distributed pre-training with torchrun (DDP).
# dataset.data_type options: csv, sdf, smi, txt, list
torchrun \
  --nnodes=$MLP_WORKER_NUM \
  --nproc_per_node=$MLP_WORKER_GPU \
  --node_rank=$MLP_ROLE_INDEX \
  --master_addr=$MLP_WORKER_0_HOST \
  --master_port=$MLP_WORKER_0_PORT \
  -m unimol_tools.cli.run_pretrain \
  dataset.train_path=train.csv \
  dataset.valid_path=valid.csv \
  dataset.data_type=csv \
  dataset.smiles_column=smiles \
  training.total_steps=1000000 \
  training.batch_size=16 \
  training.update_freq=1
```
## Technical Details
1. **Multi-Target Masking Loss (Masked Token + 3D Coordinates + Distance Map)**
   The pre-training curve overlaps with that of the original Uni-Mol by over 99%, ensuring stable performance.
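
The combined objective can be sketched in a few lines. The following NumPy version is illustrative only; the weights, shapes, and exact per-term definitions are assumptions, not Uni-Mol's code.

```python
import numpy as np

def multi_target_loss(logits, token_targets, pred_coords, true_coords, mask,
                      w_token=1.0, w_coord=1.0, w_dist=1.0):
    """Combine masked-token cross-entropy, 3D-coordinate MSE, and a
    pairwise distance-map MSE into one scalar. Illustrative sketch."""
    # Softmax cross-entropy, averaged over the masked positions only.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    ce = -np.log(probs[np.arange(len(token_targets)), token_targets])
    token_loss = ce[mask].mean()
    # Coordinate regression (MSE) on the masked atoms.
    coord_loss = ((pred_coords - true_coords) ** 2)[mask].mean()
    # Distance-map regression over all atom pairs.
    def dist_map(x):
        diff = x[:, None, :] - x[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    dist_loss = ((dist_map(pred_coords) - dist_map(true_coords)) ** 2).mean()
    return w_token * token_loss + w_coord * coord_loss + w_dist * dist_loss
```

Note that the distance-map term is translation-invariant while the coordinate term is not, which is one reason for keeping both.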

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic01.png" width="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic02.png" width="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic03.png" width="70%" /></center>

<center><img src="https://dp-public.oss-cn-beijing.aliyuncs.com/community/Blog%20Files/UniMol_Tools_v0.15_29_09_2025/pic04.png" width="70%" /></center>
2. **Modular Design**
   The complete workflow can be reproduced with just four files:

   ```text
   unimol_tools/pretrain/
   ├── dataset.py   # Masking + data pipeline
   ├── loss.py      # Multi-target loss
   ├── trainer.py   # Distributed training loop
   └── unimol.py    # Model architecture
   ```
This lowers the barrier to secondary development: modify a single line of configuration to run a custom task.
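
As an illustration of that configurability, a Hydra-style config for the pipeline might look like the following. The key names mirror the `dataset.*` / `training.*` CLI overrides shown in this post; the file layout itself is an assumption, so check it against the examples shipped with the repository.

```yaml
# Illustrative only: key names follow the CLI overrides in this post.
dataset:
  train_path: train.csv
  valid_path: valid.csv
  data_type: csv        # csv, sdf, smi, txt, list
  smiles_column: smiles
training:
  total_steps: 1000000
  batch_size: 16
  update_freq: 1
```

Switching to a new dataset then means editing a single line, such as `train_path`.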
3. **Backward Compatibility**
   - Existing APIs such as `unimol_tools.train` / `predict` / `repr` remain unchanged.
   - Custom `pretrained_model_path` and `dict_path` can be passed in, so old scripts need only two additional parameters to load new weights.
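
As a sketch of those two extra parameters (the surrounding training call is left commented out because its exact signature should be checked against the unimol_tools documentation; all paths here are placeholders):

```python
# Hypothetical illustration: pointing the existing fine-tuning API at
# custom pre-trained weights. Only the two marked keys are new;
# everything else matches a pre-v0.15 script.
params = {
    "task": "regression",                               # unchanged
    "pretrained_model_path": "ckpts/my_pretrained.pt",  # new parameter
    "dict_path": "ckpts/my_dict.txt",                   # new parameter
}
# from unimol_tools import MolTrain  # call signature unverified here
# clf = MolTrain(**params)
# clf.fit(data="train.csv")
```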
## Overview of Updates
- Lightweight pre-training module: the complete pipeline supports masking strategies, multi-target losses for 3D coordinates and distance maps, metric aggregation, and distributed training.
- Hydra full-process CLI: one command runs training, representation, or prediction, and parameters can be adjusted quickly.
- Enhanced data processing: supports csv / sdf / smi / txt / lmdb, flexibly adapting to the formats research users commonly rely on.
- Optimized distributed training: native DDP plus mixed precision, with checkpoint resumption.
- Modular design: the complete workflow is reproduced with only four core files, facilitating secondary development.
- Compatibility with old-version APIs: new pre-trained weights load without code changes, and custom models and dictionary paths are supported.
- Performance and reproducibility: the pre-training curve is highly consistent with the original Uni-Mol's.
## Open-Source Community
UniMol_Tools is one of the open-source projects in the DeepModeling community, and developers interested in the project are welcome to get involved over the long term:
- GitHub repo: https://github.com/deepmodeling/unimol_tools
- Documentation: https://unimol-tools.readthedocs.io/
- The Issues section welcomes bug reports, suggestions, and feature requests.
- New users can refer to the README and the documentation for quick onboarding.

If you run into any problems, please open an Issue on GitHub or contact us by email.
## About Uni-Mol
Uni-Mol is a widely recognized molecular pre-training model dedicated to building a universal 3D molecular modeling framework. As its companion toolkit, UniMol_Tools aims to lower the barrier to applying the model and to improve development efficiency.
