Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solvent mask partial occ 202410 #345

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

CV-GPhL
Copy link
Contributor

@CV-GPhL CV-GPhL commented Dec 10, 2024

New features to create non-binary masks when dealing with partial
occupancies and alternate conformations.

The new command-line argument (to "gemmi mask") --set-occupancy triggers the switch from a binary (0,1) mask computation to one that accommodates partially occupied atoms. In that case we need to loop over all atoms and find (for each grid point) the maximum occupancy of any atom hitting it.

To accommodate the current method of running a flood-fill algorithm (for island remvavl), we need to create an intermediate binary grid ... at least that is the way I found it to work.

There might easily be more elegant ways of doing some of this, but the bottom line is: it seems to work and do exactly what is wanted and therefore improves the computation of the bulk solvent contribution during refinement ... at least for BUSTER (https://www.globalphasing.com/buster/)

Copy link
Member

@wojdyr wojdyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Considering altlocs is a clear improvement.

I added several comments, but haven't reviewed the main changes yet.
I'll go through it this week.

If you want to change anything in the meantime, force-push it into your branch.

@@ -301,7 +301,9 @@ struct Ccp4 : public Ccp4Base {
read_ccp4_file(input.path());
}

void write_ccp4_map(const std::string& path) const;
/// @param path filename for output CCP4/MRC map
/// @param mode_out (optional) map mode (default of -1 leaves it at the current value); typical values are 0 (binary/integer mask) or 2 (float map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are doxygen @param comments wrappable? Currently max. line length in gemmi is 100.

assert(ccp4_header.size() >= 256);
fileptr_t f = file_open(path.c_str(), "wb");
std::fwrite(ccp4_header.data(), 4, ccp4_header.size(), f.get());
int mode = header_i32(4);
if (mode == 0)
if (mode_out < 0 ) mode_out = mode;
if (mode_out == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if mode_out is different than mode, the file will be incorrect, because ccp4_header.data() written above has mode not matching the data.
I think the new arg is not needed, because you set mode before calling write_ccp4_map anyway.

use_points_around<true>(fctr, radius, [&](T& ref, double) { ref += value; }, false);
else
use_points_around<false>(fctr, radius, [&](T& ref, double) { ref += value; }, false);
}
Copy link
Member

@wojdyr wojdyr Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(comment for all the set_points_around_to_* functions)
if all these functions were to be added, we'd need a general function to avoid duplication. A function that would take Position, radius, function and bool use_pbc. But here it'd be simpler to move one function which is used to solmask.hpp. It reduces to two lines:

Fractional fctr = unit_cell.fractionalize(ctr);
use_points_around<true>(fctr, radius, [&](T& ref, double) { if(value<ref) {ref = value;} }, false);

}
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also rather have it in solmask.hpp, to not increase the size of grid.hpp.
In C++17 it will be just two lines if inlined:

for (auto& d : data)
  d = std::clamp(d, min_value, max_value);

@@ -70,7 +70,7 @@ const option::Descriptor Usage[] = {
};

void print_cell_parameters(const char* prefix, const gemmi::UnitCell& cell) {
printf("%s %g %7g %7g %6g %6g %6g\n", prefix,
printf("%s %7g %7g %7g %6g %6g %6g\n", prefix,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous PR got included here

@@ -32,6 +33,8 @@ const option::Descriptor Usage[] = {
CommonUsage[Help],
CommonUsage[Version],
CommonUsage[Verbose],
{ Verbosity, 0, "", "verbosity", Arg::Int,
" --verbosity=I \tVerbosity setting." },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having both --verbose and --verbosity is confusing. The former is sufficient, because it can be used multiple times , e.g. -vvv. You can change how many times it was used with

options[Verbose].count()

@@ -468,19 +470,22 @@ void Ccp4<T>::set_extent(const Box<Fractional>& box) {
grid.axis_order = AxisOrder::Unknown;
}

/// @param path filename for output CCP4/MRC map
/// @param mode_out (optional) map mode (default of -1 leaves it at the current value); typical values are 0 (binary/integer mask) or 2 (float map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exact copy of the previous comment. Does doxygen require it in both places?

/// @param value value to use (unless use_atom_occupancy is set to true)
/// @param ignore_hydrogen should we skip hydrogen atoms
/// @param ignore_zero_occupancy_atoms should we skip atoms with occupancy of zero
/// @param use_atom_occupancy should we use the occupancy of each atom (to build up a non-binary mask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A higher level interface that wraps this function SolventMasker::mask_points().
If this function has documentation in Doxygen, users will expect that the API is stable and then they'll be upset if it changes. The functions that are documented also change from time to time, but I try to minimize it.

mask.set_points_around(atom.pos, radius, value);
bool ignore_zero_occupancy_atoms,
bool use_atom_occupancy,
int verbosity = 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to think how to simplify it. The code gets repeated between this function and mask_points_in_varied_radius. And both functions became quite long.

I'll need to check how these functions are used in the current code. Perhaps arg ignore_zero_occupancy_atoms can be removed. Masks with real numbers have been used before, although only for "soft masks" that change smoothly as a function of radius. That idea wasn't fully tested. I don't know if "soft masks" can improve refinement.

bool use_atom_occupancy,
int verbosity = 0) {

if (verbosity>0) printf("\n#### constant-radius masker\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's an output for debugging it's fine.
For output that's meant for end users, it's a bit harder. Library functions can be called from a GUI app, or from web app, so the output should to be possible to redirect. In most of the functions I just avoid printing anything. If it's necessary to print something, recently I started using struct Logger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants