Commit 5f1dec8

Authored by Copilot, njzjz, and pre-commit-ci[bot]
feat: add support for multiple LAMMPS atom styles with automatic detection (#867)
This PR adds comprehensive support for different LAMMPS atom styles beyond the previously supported "atomic" style. The implementation now supports 8 common LAMMPS atom styles with **automatic detection** and charge extraction while maintaining full backward compatibility.

## Supported Atom Styles

- **atomic**: atom-ID atom-type x y z (default fallback)
- **full**: atom-ID molecule-ID atom-type q x y z (includes charges and molecule IDs)
- **charge**: atom-ID atom-type q x y z (includes charges)
- **bond**: atom-ID molecule-ID atom-type x y z (includes molecule IDs)
- **angle**: atom-ID molecule-ID atom-type x y z
- **molecular**: atom-ID molecule-ID atom-type x y z
- **dipole**: atom-ID atom-type q x y z mux muy muz (includes charges)
- **sphere**: atom-ID atom-type diameter density x y z

## Key Features

- **Automatic atom style detection**: Parses LAMMPS data file headers and comments (e.g., `Atoms # full`) with intelligent fallback based on column analysis
- **Automatic charge extraction and registration**: For atom styles that include charges (full, charge, dipole), charges are automatically extracted, stored, and properly registered as a DataType
- **Smart defaults**: `atom_style="auto"` is now the default, eliminating the need for manual specification in most cases
- **Backward compatibility**: Existing code continues to work without any changes
- **Robust error handling**: Clear error messages for unsupported atom styles with graceful fallbacks

## Usage

```python
# Automatic detection (new default behavior)
system = dpdata.System("data.lmp", type_map=["O", "H"])  # Detects style automatically

# Full style with charges and molecule IDs
system = dpdata.System("data.lmp", type_map=["O", "H"])  # Auto-detects "full" style
print(system["charges"])  # Access extracted charges

# Explicit styles still supported for edge cases
system = dpdata.System("data.lmp", type_map=["O", "H"], atom_style="charge")
```

## Implementation Details

The solution adds intelligent atom style detection that:

1. Parses header comments after "Atoms" sections for explicit style declarations
2. Uses heuristic analysis of column count and content patterns as fallback
3. Maintains the existing configurable atom style parameter for explicit control
4. Automatically registers charge DataType when charge data is present

All parsing functions (`get_atype`, `get_posi`, `get_charges`) were updated to handle different column arrangements with full type hints. Comprehensive tests cover both comment-based and heuristic detection scenarios.

Fixes #853.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent f727ada commit 5f1dec8
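
The comment-based step of the detection described in the commit message can be sketched on its own. The helper below is hypothetical: the names `style_from_header` and `KNOWN_STYLES` do not appear in the commit. It compares the first token after `#` exactly, whereas the committed `detect_atom_style` tests whether a style name occurs anywhere in the comment, so treat it as an illustration of the idea rather than the project's implementation.

```python
from __future__ import annotations

# Hypothetical list for illustration; it mirrors the eight styles named in the
# commit message but is not imported from dpdata.
KNOWN_STYLES = (
    "atomic", "angle", "bond", "charge",
    "full", "molecular", "dipole", "sphere",
)


def style_from_header(line: str) -> str | None:
    """Return the style named in an 'Atoms # <style>' header, or None."""
    head, sep, comment = line.partition("#")
    # Only a section header that is exactly "Atoms" followed by a comment qualifies.
    if head.strip() != "Atoms" or not sep:
        return None
    tokens = comment.lower().split()
    # Exact match on the first comment token (an assumption of this sketch).
    if tokens and tokens[0] in KNOWN_STYLES:
        return tokens[0]
    return None


if __name__ == "__main__":
    for header in ("Atoms # full", "Atoms", "Atoms # unknown"):
        print(repr(header), "->", style_from_header(header))
```

When this returns `None`, a caller would still need a fallback such as the column-count heuristic or the "atomic" default described above.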

3 files changed, +624 -18 lines changed
dpdata/lammps/lmp.py

Lines changed: 277 additions & 16 deletions
@@ -7,6 +7,82 @@
 ptr_int_fmt = "%6d"
 ptr_key_fmt = "%15s"
 
+# Mapping of LAMMPS atom styles to their column layouts
+# Format: (atom_id_col, atom_type_col, x_col, y_col, z_col, has_molecule_id, has_charge, charge_col)
+ATOM_STYLE_COLUMNS = {
+    "atomic": (0, 1, 2, 3, 4, False, False, None),
+    "angle": (0, 2, 3, 4, 5, True, False, None),
+    "bond": (0, 2, 3, 4, 5, True, False, None),
+    "charge": (0, 1, 3, 4, 5, False, True, 2),
+    "full": (0, 2, 4, 5, 6, True, True, 3),
+    "molecular": (0, 2, 3, 4, 5, True, False, None),
+    "dipole": (0, 1, 3, 4, 5, False, True, 2),
+    "sphere": (0, 1, 4, 5, 6, False, False, None),
+}
+
+
+def detect_atom_style(lines: list[str]) -> str | None:
+    """Detect LAMMPS atom style from data file content.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+
+    Returns
+    -------
+    str or None
+        Detected atom style, or None if not detected
+    """
+    # Look for atom style in comments after "Atoms" section header
+    atom_lines = get_atoms(lines)
+    if not atom_lines:
+        return None
+
+    # Find the "Atoms" line
+    for idx, line in enumerate(lines):
+        if "Atoms" in line:
+            # Check if there's a comment with atom style after "Atoms"
+            if "#" in line:
+                comment_part = line.split("#")[1].strip().lower()
+                for style in ATOM_STYLE_COLUMNS:
+                    if style in comment_part:
+                        return style
+            break
+
+    # If no explicit style found, try to infer from first data line
+    if atom_lines:
+        first_line = atom_lines[0].split()
+        num_cols = len(first_line)
+
+        # Try to match based on number of columns and content patterns
+        # This is a heuristic approach
+        if num_cols == 5:
+            # Could be atomic style: atom-ID atom-type x y z
+            return "atomic"
+        elif num_cols == 6:
+            # Could be charge or bond/molecular style
+            # Try to determine if column 2 (index 2) looks like a charge (float) or type (int)
+            try:
+                val = float(first_line[2])
+                # If it's a small float, likely a charge
+                if abs(val) < 10 and val != int(val):
+                    return "charge"
+                else:
+                    # Likely molecule ID (integer), so bond/molecular style
+                    return "bond"
+            except ValueError:
+                return "atomic"  # fallback
+        elif num_cols == 7:
+            # Could be full style: atom-ID molecule-ID atom-type charge x y z
+            return "full"
+        elif num_cols >= 8:
+            # Could be dipole or sphere style
+            # For now, default to dipole if we have enough columns
+            return "dipole"
+
+    return None  # Unable to detect
+
 
 def _get_block(lines, keys):
     for idx in range(len(lines)):
@@ -95,8 +171,67 @@ def _atom_info_atom(line):
     return int(vec[0]), int(vec[1]), float(vec[2]), float(vec[3]), float(vec[4])
 
 
-def get_natoms_vec(lines):
-    atype = get_atype(lines)
+def _atom_info_style(line: str, atom_style: str = "atomic") -> dict[str, int | float]:
+    """Parse atom information based on the specified atom style.
+
+    Parameters
+    ----------
+    line : str
+        The atom line from LAMMPS data file
+    atom_style : str
+        The LAMMPS atom style (atomic, full, charge, etc.)
+
+    Returns
+    -------
+    dict
+        Dictionary containing parsed atom information with keys:
+        'atom_id', 'atom_type', 'x', 'y', 'z', 'molecule_id' (if present), 'charge' (if present)
+    """
+    if atom_style not in ATOM_STYLE_COLUMNS:
+        raise ValueError(
+            f"Unsupported atom style: {atom_style}. Supported styles: {list(ATOM_STYLE_COLUMNS.keys())}"
+        )
+
+    vec = line.split()
+    columns = ATOM_STYLE_COLUMNS[atom_style]
+
+    result = {
+        "atom_id": int(vec[columns[0]]),
+        "atom_type": int(vec[columns[1]]),
+        "x": float(vec[columns[2]]),
+        "y": float(vec[columns[3]]),
+        "z": float(vec[columns[4]]),
+    }
+
+    # Add molecule ID if present
+    if columns[5]:  # has_molecule_id
+        result["molecule_id"] = int(
+            vec[1]
+        )  # molecule ID is always in column 1 when present
+
+    # Add charge if present
+    if columns[6]:  # has_charge
+        result["charge"] = float(vec[columns[7]])  # charge_col
+
+    return result
+
+
+def get_natoms_vec(lines: list[str], atom_style: str = "atomic") -> list[int]:
+    """Get number of atoms for each atom type.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    atom_style : str
+        The LAMMPS atom style
+
+    Returns
+    -------
+    list
+        Number of atoms for each atom type
+    """
+    atype = get_atype(lines, atom_style=atom_style)
     natoms_vec = []
     natomtypes = get_natomtypes(lines)
     for ii in range(natomtypes):
@@ -105,29 +240,91 @@ def get_natoms_vec(lines):
     return natoms_vec
 
 
-def get_atype(lines, type_idx_zero=False):
+def get_atype(
+    lines: list[str], type_idx_zero: bool = False, atom_style: str = "atomic"
+) -> np.ndarray:
+    """Get atom types from LAMMPS data file.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    type_idx_zero : bool
+        Whether to use zero-based indexing for atom types
+    atom_style : str
+        The LAMMPS atom style
+
+    Returns
+    -------
+    np.ndarray
+        Array of atom types
+    """
     alines = get_atoms(lines)
     atype = []
     for ii in alines:
-        # idx, mt, at, q, x, y, z = _atom_info_mol(ii)
-        idx, at, x, y, z = _atom_info_atom(ii)
+        atom_info = _atom_info_style(ii, atom_style)
+        at = atom_info["atom_type"]
         if type_idx_zero:
             atype.append(at - 1)
         else:
             atype.append(at)
     return np.array(atype, dtype=int)
 
 
-def get_posi(lines):
+def get_posi(lines: list[str], atom_style: str = "atomic") -> np.ndarray:
+    """Get atomic positions from LAMMPS data file.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    atom_style : str
+        The LAMMPS atom style
+
+    Returns
+    -------
+    np.ndarray
+        Array of atomic positions
+    """
     atom_lines = get_atoms(lines)
     posis = []
     for ii in atom_lines:
-        # posis.append([float(jj) for jj in ii.split()[4:7]])
-        posis.append([float(jj) for jj in ii.split()[2:5]])
+        atom_info = _atom_info_style(ii, atom_style)
+        posis.append([atom_info["x"], atom_info["y"], atom_info["z"]])
     return np.array(posis)
 
 
-def get_spins(lines):
+def get_charges(lines: list[str], atom_style: str = "atomic") -> np.ndarray | None:
+    """Get atomic charges from LAMMPS data file if the atom style supports charges.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    atom_style : str
+        The LAMMPS atom style
+
+    Returns
+    -------
+    np.ndarray or None
+        Array of atomic charges if atom style has charges, None otherwise
+    """
+    if atom_style not in ATOM_STYLE_COLUMNS:
+        raise ValueError(f"Unsupported atom style: {atom_style}")
+
+    # Check if this atom style has charges
+    if not ATOM_STYLE_COLUMNS[atom_style][6]:  # has_charge
+        return None
+
+    atom_lines = get_atoms(lines)
+    charges = []
+    for ii in atom_lines:
+        atom_info = _atom_info_style(ii, atom_style)
+        charges.append(atom_info["charge"])
+    return np.array(charges)
+
+
+def get_spins(lines: list[str], atom_style: str = "atomic") -> np.ndarray | None:
     atom_lines = get_atoms(lines)
     if len(atom_lines[0].split()) < 8:
         return None
@@ -161,9 +358,32 @@ def get_lmpbox(lines):
     return box_info, tilt
 
 
-def system_data(lines, type_map=None, type_idx_zero=True):
+def system_data(
+    lines: list[str],
+    type_map: list[str] | None = None,
+    type_idx_zero: bool = True,
+    atom_style: str = "atomic",
+) -> dict:
+    """Parse LAMMPS data file to system data format.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    type_map : list, optional
+        Mapping from atom types to element names
+    type_idx_zero : bool
+        Whether to use zero-based indexing for atom types
+    atom_style : str
+        The LAMMPS atom style (atomic, full, charge, etc.)
+
+    Returns
+    -------
+    dict
+        System data dictionary
+    """
     system = {}
-    system["atom_numbs"] = get_natoms_vec(lines)
+    system["atom_numbs"] = get_natoms_vec(lines, atom_style=atom_style)
     system["atom_names"] = []
     if type_map is None:
         for ii in range(len(system["atom_numbs"])):
@@ -177,20 +397,61 @@ def system_data(lines, type_map=None, type_idx_zero=True):
     system["orig"] = np.array(orig)
     system["cells"] = [np.array(cell)]
     natoms = sum(system["atom_numbs"])
-    system["atom_types"] = get_atype(lines, type_idx_zero=type_idx_zero)
-    system["coords"] = [get_posi(lines)]
+    system["atom_types"] = get_atype(
+        lines, type_idx_zero=type_idx_zero, atom_style=atom_style
+    )
+    system["coords"] = [get_posi(lines, atom_style=atom_style)]
     system["cells"] = np.array(system["cells"])
     system["coords"] = np.array(system["coords"])
 
-    spins = get_spins(lines)
+    # Add charges if the atom style supports them
+    charges = get_charges(lines, atom_style=atom_style)
+    if charges is not None:
+        system["charges"] = np.array([charges])
+
+    spins = get_spins(lines, atom_style=atom_style)
     if spins is not None:
         system["spins"] = np.array([spins])
 
     return system
 
 
-def to_system_data(lines, type_map=None, type_idx_zero=True):
-    return system_data(lines, type_map=type_map, type_idx_zero=type_idx_zero)
+def to_system_data(
+    lines: list[str],
+    type_map: list[str] | None = None,
+    type_idx_zero: bool = True,
+    atom_style: str = "atomic",
+) -> dict:
+    """Parse LAMMPS data file to system data format.
+
+    Parameters
+    ----------
+    lines : list
+        Lines from LAMMPS data file
+    type_map : list, optional
+        Mapping from atom types to element names
+    type_idx_zero : bool
+        Whether to use zero-based indexing for atom types
+    atom_style : str
+        The LAMMPS atom style. If "auto", attempts to detect automatically
+        from file. Default is "atomic".
+
+    Returns
+    -------
+    dict
+        System data dictionary
+    """
+    # Attempt automatic detection if requested
+    if atom_style == "auto":
+        detected_style = detect_atom_style(lines)
+        if detected_style:
+            atom_style = detected_style
+        else:
+            atom_style = "atomic"  # fallback to default
+
+    return system_data(
+        lines, type_map=type_map, type_idx_zero=type_idx_zero, atom_style=atom_style
+    )
 
 
 def rotate_to_lower_triangle(
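
The table-driven parsing introduced in this diff can be illustrated with a reduced, self-contained sketch. The names `Layout`, `LAYOUTS`, and `parse_atom_line` are hypothetical and cover only three styles; the commit itself stores positional tuples in `ATOM_STYLE_COLUMNS` and returns separate `x`, `y`, `z` keys. Named fields are used here only to make the column roles easier to read.

```python
from typing import NamedTuple, Optional


class Layout(NamedTuple):
    """Column indices for one atom style (hypothetical structure)."""

    type_col: int               # index of the atom-type column
    x_col: int                  # index of x; y and z follow immediately
    charge_col: Optional[int]   # index of the charge column, if any


# Reduced table for illustration; indices follow the layouts listed in the
# commit message (e.g. full: atom-ID molecule-ID atom-type q x y z).
LAYOUTS = {
    "atomic": Layout(type_col=1, x_col=2, charge_col=None),
    "charge": Layout(type_col=1, x_col=3, charge_col=2),
    "full": Layout(type_col=2, x_col=4, charge_col=3),
}


def parse_atom_line(line: str, style: str) -> dict:
    """Parse one line of an Atoms section according to the given style."""
    if style not in LAYOUTS:
        raise ValueError(f"Unsupported atom style: {style}")
    layout = LAYOUTS[style]
    vec = line.split()
    record = {
        "atom_id": int(vec[0]),
        "atom_type": int(vec[layout.type_col]),
        "xyz": tuple(float(v) for v in vec[layout.x_col : layout.x_col + 3]),
    }
    if layout.charge_col is not None:
        record["charge"] = float(vec[layout.charge_col])
    return record


if __name__ == "__main__":
    print(parse_atom_line("1 1 2 -0.8476 0.0 1.0 2.0", "full"))
    print(parse_atom_line("7 2 0.5 0.5 0.5", "atomic"))
```

Adding a style in this scheme means adding one table entry rather than another parsing function, which is the same trade-off the diff makes with `ATOM_STYLE_COLUMNS`.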
