Skip to content

Conversation

ye11owSub
Copy link
Contributor

@ye11owSub ye11owSub commented Feb 15, 2025

@ye11owSub ye11owSub force-pushed the chempy_refactoring branch 2 times, most recently from 491cd50 to 9233afc Compare February 15, 2025 21:27
@ye11owSub ye11owSub force-pushed the chempy_refactoring branch 4 times, most recently from d8efebe to 5614259 Compare February 16, 2025 12:53
@ye11owSub ye11owSub changed the title WIP: chempy.cpv refactoring WIP: Switching from Python lists to NumPy arrays for linear algebra operations Feb 16, 2025
@ye11owSub ye11owSub changed the title WIP: Switching from Python lists to NumPy arrays for linear algebra operations Switching the computation in cpv.py from python lists to numpy arrays for linear algebra operations Feb 16, 2025
@ye11owSub ye11owSub requested a review from speleo3 February 16, 2025 22:05
@ye11owSub ye11owSub changed the title Switching the computation in cpv.py from python lists to numpy arrays for linear algebra operations Switching the computation in cpv.py from python lists to numpy arrays Feb 16, 2025
@JarrettSJohnson
Copy link
Member

Won't comment on details at the moment, but currently I do have a couple of high-level comments:

  1. I'm usually all in favor of refactoring code, but doing so should come with some sort of purpose so that there's an overall net positive. Usually there is a cost of doing so ( it is not always free: a708606 98c85b8 Shortcut now missing has_key method #425 ) .
  2. Similar to above I don't think the cpv module posed any sort of developer obstacle that would warrant a change. In fact, instead of investing time into making cpv nicer to use, I think it would also be worth to consider if cpv is even needed in the first place (No code to maintain is better than maintaining the nicest code). Most of this module was written over two decades ago and there now exists libraries (especially numpy) that have much higher usage from a higher number of experts that have thought about linear algebra more than we have. IMO, one step forward would be to consider if we can replace cpv altogether (perhaps keeping the functions not present in numpy/numpy.linalg). This would of course mean that scripts that use cpv would need to be changed to use numpy (which IMO should rely on numpy and not PyMOL to do linear algebra).
  3. I'd rather keep the verbose setting on for CI so that I can see each step of the C++ compilation process.

@ye11owSub
Copy link
Contributor Author

hey @JarrettSJohnson !

I'm usually all in favor of refactoring code, but doing so should come with some sort of purpose so that there's an overall net positive. Usually there is a cost of doing so ( it is not always free: a708606 98c85b8 #425 ) .

This is the cost of poor code quality and lack of test coverage. Refactoring is a method to find and fix these issues.

Similar to above I don't think the cpv module posed any sort of developer obstacle that would warrant a change. In fact, instead of investing time into making cpv nicer to use, I think it would also be worth to consider if cpv is even needed in the first place (No code to maintain is better than maintaining the nicest code). Most of this module was written over two decades ago and there now exists libraries (especially numpy) that have much higher usage from a higher number of experts that have thought about linear algebra more than we have. IMO, one step forward would be to consider if we can replace cpv altogether (perhaps keeping the functions not present in numpy/numpy.linalg). This would of course mean that scripts that use cpv would need to be changed to use numpy (which IMO should rely on numpy and not PyMOL to do linear algebra).

I'm not sure I understand what you mean when you say "there now exists libraries (especially numpy) that have much higher usage". The entire cpv.py module was completely rewritten using numpy in this pull request. This new implementation is compatible with scripts in the pymol-scripts repo and other pymol modules.
In any case, I completely agree with your idea of replacing cpv.py with numpy. However, I assumed that changing the established API to use numpy would be unacceptable. If you believe that replacing all calls of cpv.py in the pymol and pymol-scripts repositories with numpy is a good idea, then I would be happy to do it.

I'd rather keep the verbose setting on for CI so that I can see each step of the C++ compilation process.

done

@JarrettSJohnson
Copy link
Member

This is the cost of poor code quality and lack of test coverage. Refactoring is a method to find and fix these issues.

Even if the original code quality was poor and without sufficient test coverage, these specific issues were manifested from the refactoring process due to a couple of properties missed from the original code (which were also easily identifiable by using common functionality in PyMOL--and that's on me too for not testing the PR before merging it). I think we should emphasize code coverage a little bit more than class-level refactoring.

If you believe that replacing all calls of cpv.py in the pymol and pymol-scripts repositories with numpy is a good idea, then I would be happy to do it.

Might make sense to first open up an issue there to get insights/opinions from other developers/maintainers, but I'm generally in favor of removing the basic linear algebra functions from cpv.

@ye11owSub
Copy link
Contributor Author

ye11owSub commented Feb 17, 2025

Even if the original code quality was poor and without sufficient test coverage, these specific issues were manifested from the refactoring process due to a couple of properties missed from the original code (which were also easily identifiable by using common functionality in PyMOL--and that's on me too for not testing the PR before merging it).

I'm truly sorry that my previous PR caused issues that you had to fix. However, in my opinion, this is common part of the software development process.
Due to my lack of experience using pymol, it is difficult for me to test any scenarios since. I have never had to work with pymol as a user, so I rely heavily on tests and grep. Actually these small PRs help me understand the project and contribute something useful in the process

I think we should emphasize code coverage a little bit more than class-level refactoring.

Adding commas to docstrings is worthless stuff, but I am trying to make the code more readable. Therefore, I don't see a problem with refactoring at the class level.
You are right that some of the scripts in the project are more than 20 years old and their readability is poor. Taking small steps to improve them is better than doing nothing at all.

@speleo3
Copy link
Contributor

speleo3 commented Feb 17, 2025

I support Jarrett's assessment here.

Might make sense to first open up an issue there to get insights/opinions from other developers/maintainers

Fully agree.

I like the added tests and type hints from the first commit, but the numpyfy refactoring is too much IMHO.

In my own scripts I always used either only numpy -- taking advantage of all its features and keeping data in numpy arrays -- or chempy.cpv for its simplicity and no numpy dependency. Making chempy.cpv a numpy wrapper feels like combining the disadvantages from both worlds.

@ye11owSub
Copy link
Contributor Author

ye11owSub commented Feb 17, 2025

Might make sense to first open up an issue there to get insights/opinions from other developers/maintainers

No one argued with that

@speleo3 as you wish, there are now only type annotations and tests

@ye11owSub ye11owSub force-pushed the chempy_refactoring branch 2 times, most recently from 308bae1 to 1373979 Compare February 17, 2025 16:24
@ye11owSub ye11owSub changed the title Switching the computation in cpv.py from python lists to numpy arrays tests for cpv.py Feb 17, 2025
@TstewDev
Copy link
Collaborator

Hello @ye11owSub,

I'm Thomas Stewart, a PyMOL developer and the current Product Manager for PyMOL at Schrödinger. I just wanted to add my thoughts to a few of your comments:

I'm truly sorry that my previous PR caused issues that you had to fix. However, in my opinion, this is common part of the software development process. Due to my lack of experience using pymol, it is difficult for me to test any scenarios since. I have never had to work with pymol as a user, so I rely heavily on tests and grep. Actually these small PRs help me understand the project and contribute something useful in the process

There's no need to apologize but I do hope it helps explain our general reluctance. I agree that fixing issues introduced by PR's is definitely part of the software development process and I don't want reject PR's simply on that basis. However, I would point out that finding and fixing these issues does take developer time and resources that could be spent on more productive tasks. This means that these PR's do come with a cost (reviewing, testing, maintaining, etc.), regardless of how simple they may appear. They need to be impactful enough to justify merging them into the codebase.

You also mention your lack of experience using PyMOL as a user. Not to say that only heavy PyMOL users can contribute to the project, but I'm curious what your motivations are if you're not trying to address an issue with how the app currently functions. I certainly understand the benefits of clean and well-documented code, but I don't view this as being a beneficial use of your time and effort if these files should be replaced completely.

If you (or anyone else reading this) are really interested in making a significant contribution to the project, I would encourage you to play around with PyMOL and try to identify some functionality/features that would benefit from your effort.

Taking small steps to improve them is better than doing nothing at all.

I certainly understand what you're saying here, but I think it's an oversimplification for the reasons I stated above. In addition to the review/maintenance costs, small changes impact git blame, git history, and consistency across files. Refactoring to make the code more readable can be a noble goal when done with a clear objective, however it can also come with a real cost when just making changes for the sake of making changes.

All that being said, I do believe there is real value being added in this PR and the tests for PyMOL certainly should be improved. I just want to explain my thought process when evaluating PR's in general if you plan on submitting more in the future.

@ye11owSub
Copy link
Contributor Author

Hey @TstewDev!
This PR has attracted more attention than it deserves.

There's no need to apologize but I do hope it helps explain our general reluctance. I agree that fixing issues introduced by PR's is definitely part of the software development process and I don't want reject PR's simply on that basis. However, I would point out that finding and fixing these issues does take developer time and resources that could be spent on more productive tasks. This means that these PR's do come with a cost (reviewing, testing, maintaining, etc.), regardless of how simple they may appear. They need to be impactful enough to justify merging them into the codebase.

I understand, each PR has a cost (so let's reduce this cost through testing).

I'm curious what your motivations are if you're not trying to address an issue with how the app currently functions. I certainly understand the benefits of clean and well-documented code, but I don't view this as being a beneficial use of your time and effort if these files should be replaced completely.

The shortest and at the same time the most complete answer is because I can. It's sad to see that, in 6 years, the project has had 80 PRs closed from the open-source community. Pymol is a popular tool for a specific group of people, I'm not one of them, but i have a CS degree and some free time
I didn't find any specific plans for the future development of the project, so I decided to focus on something that was clearly in need of an update.
You say that these files will be completely replaced, but this is only true for the end of the process. There are a lot of things that need to be done before and testing of old code is one of these things. I think it will take a significant amount of time to replace the cpv.py, and even then, it will be replaced with code from these tests.

Refactoring to make the code more readable can be a noble goal when done with a clear objective, however it can also come with a real cost when just making changes for the sake of making changes.

In general I agree, but in this case, I don't think that's the case. If you have a different opinion, that's fine. Let's fix/add/delete what you think is necessary or close this PR and move on. That's OK for me.

I am also currently refactoring the chempy {models.py, __init__.py, io.py}. I wanted to split this into separate PRs, but if you prefer to have more changes per PR, we can set this one on pause.

m1[1][0]*m2[0][2] + m1[1][1]*m2[1][2] + m1[1][2]*m2[2][2]],
[m1[2][0]*m2[0][0] + m1[2][1]*m2[1][0] + m1[2][2]*m2[2][0],
m1[2][0]*m2[0][1] + m1[2][1]*m2[1][1] + m1[2][2]*m2[2][1],
m1[2][0]*m2[0][2] + m1[2][1]*m2[1][2] + m1[2][2]*m2[2][2]]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matrix multiplication was not implemented correctly. This and code duplication has also been fixed in this PR

@ye11owSub ye11owSub changed the title tests for cpv.py WIP: chempy refactoring Feb 20, 2025
@TstewDev
Copy link
Collaborator

Hello @ye11owSub!

The shortest and at the same time the most complete answer is because I can. It's sad to see that, in 6 years, the project has had 80 PRs closed from the open-source community.

Say no more, welcome the project! The effort you have already put into these PR's is really appreciated and it sounds like you really are serious about making a contribution.

Please forgive my original tone of skepticism, I just know that open-source projects like this can fall victim to developers creating PR's when they have little intention of actually seeing these changes through. It definitely doesn't sound like that's the case here and we welcome all the help we can get.

I understand, each PR has a cost (so let's reduce this cost through testing).

I'm a big fan of adding tests like this and I think it's one of the obvious areas for improvement.

In general I agree, but in this case, I don't think that's the case. If you have a different opinion, that's fine. Let's fix/add/delete what you think is necessary or close this PR and move on. That's OK for me.

I don't actually think I have any issue with this review now that it has this refined scope. I will take another closer look and add any additional comments if necessary.

I am also currently refactoring the chempy {models.py, __init__.py, io.py}. I wanted to split this into separate PRs, but if you prefer to have more changes per PR, we can set this one on pause.

Happy to hear it! I'm normally in favor of splitting these into multiple smaller review but it sounds like these might be quite intertwined? I'll leave it up your judgement but if you feel like there's relevant context that these other changes would provide, feel free to combine them.

@ye11owSub
Copy link
Contributor Author

Hi @TstewDev !
Happy to hear that. Thank you!
For this PR, it is important to demonstrate that the new tests pass before and after the changes.
Therefore, I was focused on fixing the issues in the CI pipeline. I hope someone could also review this PR

@ye11owSub ye11owSub force-pushed the chempy_refactoring branch from d45d258 to 7932692 Compare April 12, 2025 11:29
return sm

#------------------------------------------------------------------------------
def get_nuclear_charges(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this and a bunch of other methods removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JarrettSJohnson!
I wanted to discuss this after I finish, but it's probably better to do it as early as possible.
I couldn't find the use of these functions anywhere, so maybe it's a good idea to remove this code?
If you think this code might still be useful, then I'll restore it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a public API, so if any users out there are using them, their scripts would break. If this is intended, we should have least have a deprecation period and some alternative path to obtain a similar result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy that.
I'll restore everything that's working properly

if b.index[0] == c:
indexed.bond.append(b)
c = c + 1
self.reset()
Copy link
Contributor Author

@ye11owSub ye11owSub Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JarrettSJohnson also some functions that are used in other places seems to me that they are not working correctly. For example:

for a in self.bond:
   for b in a:
      ...

b in a: - but bond is not iterable class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have examples of this not working properly? Connected::bond is a nested list, which is a different form than Model::bond.

What doesn't seem to be working though, for a different reason however, is Connected::insert_atom

Traceback (most recent call last):
  File "D:\PyMOL\pymol\bond_test.py", line 6, in <module>
    c.insert_atom(0, 0)
  File "D:\mambaforge\envs\devenv\Lib\site-packages\chempy\models.py", line 612, in insert_atom
    for a in self.bonds:
             ^^^^^^^^^^
AttributeError: 'Connected' object has no attribute 'bonds'. Did you mean: 'bond'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the part mentioned above from the convert_to_indexed function.

def convert_to_indexed(self):
        if chempy.feedback['verbose']:
            print(" "+str(self.__class__)+": converting to indexed model...")
        indexed = Indexed()
        indexed.atom = self.atom
        indexed.molecule = self.molecule
        c = 0
        for a in self.bond:
            **for b in a:**
                if b.index[0] == c:
                    indexed.bond.append(b)
            c = c + 1
        self.reset()

Copy link
Member

@JarrettSJohnson JarrettSJohnson Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show me the traceback?

cmd.fetch('1obyA')
model = cmd.get_model()
c = model.convert_to_connected()
i = c.convert_to_indexed()

doesn't give me a traceback

This looks correct for the reasons I mentioned. Connected::bond is not a list of Bond. It's a list of list of Bonds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model = Connected()
model.molecule = self.molecule
model.atom = self.atom
model.bond = []
model.index = None
for a in model.atom:
model.bond.append([])
for b in self.bond:
model.bond[b.index[0]].append(b) # note two refs to same object
model.bond[b.index[1]].append(b) # note two refs to same object

line 348/349

Copy link
Contributor Author

@ye11owSub ye11owSub Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still, it doesn't work correctly. The Connected class is inherited from the Base class, which is store bond as a list[Bond], this means that some of the methods of the base class will not work for the Connected class

Copy link
Contributor Author

@ye11owSub ye11owSub Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two more

self = <chempy.models.Indexed object at 0x105cb8590>, index = 0

    def remove_bond(self,index):
        if chempy.feedback['bonds']:
            print(" "+str(self.__class__)+": removing bond %d." % index)
>       nBond=len(self.Bond)
E       AttributeError: 'Indexed' object has no attribute 'Bond'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self = <chempy.models.Connected object at 0x1063bff90>, index = 2

        def delete_atom(self,index):
            if chempy.feedback['atoms']:
                print(" "+str(self.__class__)+": deleting atom %d." % index)
    
            nAtom=self.nAtom
    
    # update index if it exists
            if self.index:
                idx = self.index
                for k in list(idx.keys()):
                    if idx[k] > index:
                        idx[k] = idx[k] - 1
                del idx[id(self.atom[index])]
    
    # delete atom
            del self.atom[index]
    
    # delete bonds associated with this atom
    
            nBond = len(self.bond)
    
            for a in self.bond:
                i = 0
                templist = []
                for b in a:
                    if index in b.index:
                        templist.append(i)
                    i = i + 1
                for i in range(len(templist)):
                    j = templist[i] - i
                    del a[j]
    
    # re-index bond table
            for b in self.bond:
>               if b.index[0] > index:
E               TypeError: 'builtin_function_or_method' object is not subscriptable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, perhaps just rename the Connected::bond to another name to avoid clashing.

nBond=len(self.Bond) line can probably just be removed since the results are unused.

You'll likely run into several issues here. This module was written (and a lot untouched) from decades ago and rarely used by folks (which is why I think effort is probably better spent elsewhere outside this module).

IMO, just fix what you can, and anything that's not fixable, rather than deleting, you can leave them be and we'll have a proper deprecation period for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roger that

@ye11owSub ye11owSub force-pushed the chempy_refactoring branch 3 times, most recently from 54c88e7 to 6762ce7 Compare April 17, 2025 22:24
@JarrettSJohnson
Copy link
Member

I probably won't be able to deeply get into this for a while, but in any case, lets keep the classes in their original files for now to keep the diff more succinct and so that it's easier to compare the pertinent changes.

@ye11owSub ye11owSub force-pushed the chempy_refactoring branch 7 times, most recently from 869f19d to 4f0a7dd Compare April 20, 2025 20:45
@ye11owSub ye11owSub force-pushed the chempy_refactoring branch from 4f0a7dd to 6ec0804 Compare April 20, 2025 20:47
Copy link
Member

@JarrettSJohnson JarrettSJohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. Review is a bit hard to follow when there are massive relocations/formatting changes mixed in with logic changes. I would propose in the future to keep those separate in future reviews so that it's easier to separate what's reorganization versus what I need to focus on for correctness. Otherwise, thanks for the time you spent on this!

Also whenever this is ready for review & merge, please remove the "WIP" from the title.

Comment on lines +138 to +139
# HAVEN'T YET VERIFIED THAT THIS CONFORMS TO STANDARD DEFT
# upd: no, it's not(fixed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed.

if chempy.feedback['atoms']:
print(" "+str(self.__class__)+": deleting atom %d." % index)
def _handle_new_atom(self) -> None:
"""In case of Connected class we need to add an empty list to slef.bond"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants