Run Unit Tests for Different Parameters #182

Merged
merged 17 commits into from
Oct 28, 2024

Conversation

michaelmckinsey1
Collaborator

@michaelmckinsey1 michaelmckinsey1 commented Jun 24, 2024

This PR expands the unit tests so that they also run against (1) Thickets built with intersection trees (the default is union) and (2) Thickets whose performance data is not filled (the default is to fill it). Testing every combination therefore runs all of the unit tests against four types of Thickets:

  1. Union+Filled
  2. Union+NonFilled
  3. Intersection+Filled
  4. Intersection+NonFilled

Example of using parametrized fixtures from pytest docs
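
As a minimal sketch of how parametrized fixtures can produce these four configurations (the fixture names `intersection` and `fill_perfdata`, the constants, and the placeholder test body are assumptions for illustration, not necessarily what conftest.py in this PR defines):

```python
import pytest

# Assumed parameter values; True/False stand in for the two options of each setting.
INTERSECTION_PARAMS = [True, False]
FILL_PERFDATA_PARAMS = [True, False]


@pytest.fixture(params=INTERSECTION_PARAMS, ids=["Intersection", "Union"])
def intersection(request):
    """Whether the Thicket is built with an intersection tree (default is union)."""
    return request.param


@pytest.fixture(params=FILL_PERFDATA_PARAMS, ids=["FillPerfdata", "NoFillPerfdata"])
def fill_perfdata(request):
    """Whether the performance data is filled (default is filled)."""
    return request.param


def test_indices(intersection, fill_perfdata):
    # Placeholder body: a real test would build a Thicket with these options.
    # pytest collects this test once per combination of the two fixtures.
    assert isinstance(intersection, bool)
    assert isinstance(fill_perfdata, bool)
```

pytest collects one test per combination of fixture parameters, and the `ids` strings appear in the bracketed suffix of each test ID.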

When tests fail, we can see which configuration of parameters failed:

FAILED thicket/tests/test_tree.py::test_indices[Intersection-FillPerfdata] - Key...
FAILED thicket/tests/test_tree.py::test_indices[Union-NoFillPerfdata] - KeyError: ...

We can run a single test for a single set of parameters by passing its full test ID to pytest:

$ python -m pytest thicket/tests/test_tree.py::test_indices[Union-NoFillPerfdata]

============================= test session starts =============================
platform win32 -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0
rootdir: C:\Users\Micha\Documents\Github\thicket
configfile: pytest.ini
collected 1 item

thicket\tests\test_tree.py F                                             [100%]

@michaelmckinsey1
Collaborator Author

michaelmckinsey1 commented Jul 9, 2024

Depends on #193

@michaelmckinsey1
Collaborator Author

michaelmckinsey1 commented Jul 9, 2024

depends on #181

@michaelmckinsey1 michaelmckinsey1 force-pushed the run-allparams branch 3 times, most recently from 7da35c1 to 90b2907 Compare July 9, 2024 22:33
@michaelmckinsey1 michaelmckinsey1 self-assigned this Jul 9, 2024
@michaelmckinsey1 michaelmckinsey1 added area-ci Issues and PRs related to Thicket's continuous integration area-tests Issues and PRs involving Thicket's automated tests priority-normal Normal priority issues and PRs status-work-in-progress PR is currently being worked on type-feature Requests for new features or PRs which implement new features labels Jul 9, 2024
@michaelmckinsey1 michaelmckinsey1 marked this pull request as ready for review July 11, 2024 19:08
@michaelmckinsey1 michaelmckinsey1 added status-ready-for-review This PR is ready to be reviewed by assigned reviewers and removed status-work-in-progress PR is currently being worked on labels Jul 11, 2024
Collaborator

@ilumsden ilumsden left a comment

This looks good, but there are a couple of things I'd like you to clarify @michaelmckinsey1.

thicket/tests/test_concat_thickets.py (resolved)
thicket/tests/test_concat_thickets.py (resolved)
thicket/tests/test_intersection.py (resolved)
@michaelmckinsey1
Collaborator Author

@ilumsden Tests that use the thicket_axis_columns fixture are already parametrized, since thicket_axis_columns is parametrized itself. Unlike the other fixtures, which are lists of files, thicket_axis_columns and stats_thicket_axis_columns create the Thickets in the fixtures.

So running python -m pytest thicket/tests/test_concat_thickets.py::test_filter_stats_concat_thickets_columns results in:

0.44s call     thicket/tests/test_concat_thickets.py::test_filter_stats_concat_thickets_columns[Intersection-FillPerfdata]
0.41s call     thicket/tests/test_concat_thickets.py::test_filter_stats_concat_thickets_columns[Intersection-NoFillPerfdata]
0.33s call     thicket/tests/test_concat_thickets.py::test_filter_stats_concat_thickets_columns[Union-FillPerfdata]
0.33s call     thicket/tests/test_concat_thickets.py::test_filter_stats_concat_thickets_columns[Union-NoFillPerfdata]

Comment on lines 1151 to 1210
if isinstance(self.dataframe.columns, pd.MultiIndex):
    rows = []
    nodes = self.dataframe.index.get_level_values("node").unique()
    extend_len = len(self.dataframe) // len(nodes)
    for node in nodes:
        df = self.dataframe.loc[node]
        keep = all(
            [
                df[header]
                .notna()  # We are checking for NaNs
                .all()  # Across all rows of each metric column
                .all()  # Across all metric columns under the header
                for header in self.dataframe.columns.get_level_values(0)
            ]  # For each column header, df[header] is a slice of the columns
        )
        rows.extend([keep] * extend_len)  # Extend by extend_len for MultiIndex
    tkc = self.deepcopy()
    tkc.dataframe = tkc.dataframe[rows]
    tkc = tkc.squash()
    return tkc
Collaborator Author

@ilumsden Do you think this can be a query? This code works, but I could not get an equivalent query to work.

This performs the intersection when the columns are a MultiIndex. A node to drop can be identified because all of its metric values under a given header are NA, like the three rows seen below under the l header.

[Image: dataframe with three rows that are all NA under the l header]

I don't recall if there is support for MultiIndex columns in the query language.
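
For reference, the NaN check in the snippet above can also be written with plain pandas operations and no query. The following is a minimal sketch on a toy dataframe; the node names, headers, and metric values are made up, and the graph update that squash() performs in Thicket is not shown:

```python
import numpy as np
import pandas as pd

# Toy dataframe: rows are (node, profile), columns are (header, metric).
index = pd.MultiIndex.from_product(
    [["main", "foo", "bar"], [0, 1]], names=["node", "profile"]
)
columns = pd.MultiIndex.from_product([["l", "r"], ["time", "bytes"]])
data = np.array(
    [
        [1.0, 10.0, 2.0, 20.0],
        [1.1, 11.0, 2.1, 21.0],
        [0.5, 5.0, 0.6, 6.0],
        [0.4, 4.0, 0.7, 7.0],
        [np.nan, np.nan, 0.3, 3.0],  # "bar" has no data under header "l"
        [np.nan, np.nan, 0.2, 2.0],
    ]
)
df = pd.DataFrame(data, index=index, columns=columns)

# A row is complete when no metric under any header is NaN.
row_complete = df.notna().all(axis=1)
# A node is kept only if every one of its rows is complete.
node_complete = row_complete.groupby(level="node").transform("all")
intersected = df[node_complete]
```

Here only the rows for "main" and "foo" remain, because "bar" has no values under the l header.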

Collaborator

There is support for MultiIndex columns in the QL nowadays, but I would not recommend doing this with the QL unless it's really needed. If we care about performance, the QL shouldn't be used for everything. At the end of the day, the problem that the QL is solving (a special variant of subgraph isomorphism) is an NP-Hard problem. Although I can make the QL faster, it will never be super fast due to that fact.

The QL should be used when you need to select or filter a Thicket while considering the node relationships encoded in the graph. If you don't need to take those relationships into account, then the QL could introduce unnecessary bottlenecks or slowdowns into your code. A good example of when you should use the QL is what you do in NCUReader @michaelmckinsey1. A good example of when you should not use the QL is when you just want to apply a filter to each node independently.

Besides all that stuff related to the QL, I'd also recommend you don't do the mentioned change in this PR. That's a different change to the code, so it would belong in a different PR if you were to do it.

Collaborator

Ignore my point about not changing this in this PR. I hadn't gone over your changes before I said that. My comment about not using the QL willy-nilly stands though.

Collaborator

@ilumsden ilumsden left a comment

Got a few more changes that I'd like you to make. The most notable is that I'd encourage you to remove the use of the QL in Thicket.intersection since you are already editing it.

thicket/tests/conftest.py (outdated, resolved)
thicket/tests/conftest.py (outdated, resolved)
    )
else:
    # If perfdata not padded
    query = Query().match(".", lambda df: len(df) == len(self.profile))
Collaborator

I didn't see this before. I get why you were asking about using the QL with MultiIndex now. I'd still recommend not using the QL here. At the end of the day, it's almost always better to use P code than NP Hard code.

Collaborator Author

@michaelmckinsey1 michaelmckinsey1 Jul 22, 2024

I suppose there could be a case where the parent is NaN, e.g.:

1.781 RAJAPerf
├─ nan Algorithm
│  ├─ 0.002 Algorithm_MEMCPY
│  ├─ 0.002 Algorithm_MEMSET
│  ├─ 0.003 Algorithm_REDUCE_SUM

becomes

1.781 RAJAPerf
├─ 0.002 Algorithm_MEMCPY
├─ 0.002 Algorithm_MEMSET
├─ 0.003 Algorithm_REDUCE_SUM

when it should be

1.781 RAJAPerf

I can't think of a situation where this would happen, since in reality if the parent node was filled with NaNs because it did not exist, then the children shouldn't have existed either. The children can't exist without the parent, so they would also have NaNs. However, the QL would properly remove the children in this example.
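
To make the difference concrete, here is a toy model of the two behaviours described above. The Node class and both helper functions are hypothetical and only mimic the idea; they are not Hatchet's squash or the QL implementation:

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    time: float
    children: list = field(default_factory=list)


def squash_filter(node):
    """Per-node filter plus squash: drop NaN nodes and re-attach their
    surviving descendants to the nearest kept ancestor."""
    kept = [c for child in node.children for c in squash_filter(child)]
    if math.isnan(node.time):
        return kept
    return [Node(node.name, node.time, kept)]


def prune_subtrees(node):
    """Structure-aware removal: a NaN node takes its whole subtree with it."""
    if math.isnan(node.time):
        return []
    kept = [c for child in node.children for c in prune_subtrees(child)]
    return [Node(node.name, node.time, kept)]


def names(node):
    """Pre-order list of node names, used to compare the resulting trees."""
    return [node.name] + [n for c in node.children for n in names(c)]


root = Node("RAJAPerf", 1.781, [
    Node("Algorithm", float("nan"), [
        Node("Algorithm_MEMCPY", 0.002),
        Node("Algorithm_MEMSET", 0.002),
        Node("Algorithm_REDUCE_SUM", 0.003),
    ]),
])

squashed = squash_filter(root)[0]  # children end up directly under RAJAPerf
pruned = prune_subtrees(root)[0]   # only RAJAPerf remains
```

The first helper corresponds to the "becomes" tree above and the second to the "when it should be" tree.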

Collaborator Author

@michaelmckinsey1 michaelmckinsey1 Jul 22, 2024

I prefer using the QL where I can, because it is better tested and more readable than writing new code like in this PR. So I take it you think we would benefit more from a performance data filter function than from the QL here? And would that performance data filter replace the code in intersection, including the existing queries?

Collaborator

Going back and trying to knock out some of these reviews now.

A performance data filter function (similar to Hatchet's filter function) would be a great thing to have. At the end of the day, if you are doing filtering that doesn't care about the structure of the graph, you really should be using Pandas operations over the QL, and that's what this type of function would provide. I have some thoughts on how to improve the QL's performance, but it should still never be faster than Pandas on single node filtering (unless Pandas is much worse in performance than I think).

That being said I do get your point about using the QL in this case. I think a good thing to do here would be to continue using the QL in intersection (for now at least, we may revisit later) and then create a performance data filter function.
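
A rough sketch of what such a performance data filter could look like, assuming a hypothetical filter_perfdata helper that applies a row-wise predicate with pandas (loosely modelled on the idea of Hatchet's filter, whose actual signature is not reproduced here; updating the graph afterwards is left out):

```python
import pandas as pd


def filter_perfdata(dataframe, predicate):
    """Hypothetical helper: keep the rows for which predicate(row) is truthy.

    Each row is evaluated independently, so no graph structure is consulted.
    """
    mask = dataframe.apply(predicate, axis=1).astype(bool)
    return dataframe[mask]


# Toy performance data indexed by (node, profile); the values are made up.
index = pd.MultiIndex.from_tuples(
    [("main", 0), ("main", 1), ("foo", 0), ("foo", 1)],
    names=["node", "profile"],
)
perfdata = pd.DataFrame({"time": [1.0, 1.2, 0.01, 0.02]}, index=index)

slow = filter_perfdata(perfdata, lambda row: row["time"] > 0.5)
```

In this toy data only the two "main" rows pass the predicate. A real implementation would also need to bring the graph back in sync with the filtered dataframe.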

Collaborator

@ilumsden ilumsden left a comment

LGTM, but see my last comment regarding performance data filtering and the QL.

@pearce8 this PR is ready for your review.

@ilumsden ilumsden requested a review from pearce8 August 12, 2024 14:05
@ilumsden ilumsden added status-approved No more revisions are required on this PR and it is ready for merge and removed status-ready-for-review This PR is ready to be reviewed by assigned reviewers labels Aug 12, 2024
@michaelmckinsey1 michaelmckinsey1 added this to the 2024.2.0 milestone Sep 4, 2024
@slabasan
Collaborator

@michaelmckinsey1 Can you please resolve conflicts on this PR?

@slabasan slabasan merged commit c1cd36e into LLNL:develop Oct 28, 2024
4 checks passed
Labels
area-ci Issues and PRs related to Thicket's continuous integration area-tests Issues and PRs involving Thicket's automated tests priority-normal Normal priority issues and PRs status-approved No more revisions are required on this PR and it is ready for merge type-feature Requests for new features or PRs which implement new features

3 participants