-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run Unit Tests for Different Parameters #182
Conversation
Depends on |
f87ded1
to
35508f8
Compare
depends on |
7da35c1
to
90b2907
Compare
90b2907
to
db2fd67
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but there are a couple of things I'd like you to clarify @michaelmckinsey1.
@ilumsden Tests that use the So running
|
thicket/thicket.py
Outdated
if isinstance(self.dataframe.columns, pd.MultiIndex): | ||
rows = [] | ||
nodes = self.dataframe.index.get_level_values("node").unique() | ||
extend_len = len(self.dataframe)//len(nodes) | ||
for node in nodes: | ||
df = self.dataframe.loc[node] | ||
keep = all([df[header].notna() # We are checking for NaNs | ||
.all() # For all values in a row | ||
.all() # For all rows in the slice | ||
for header in self.dataframe.columns.get_level_values(0)] # For each column header df[header] == slice | ||
) | ||
rows.extend([keep]*extend_len) # Extend by extend_len for MultiIndex | ||
tkc = self.deepcopy() | ||
tkc.dataframe = tkc.dataframe[rows] | ||
tkc = tkc.squash() | ||
return tkc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilumsden Do you think this can be a query? This code works but I was not successful at making a query.
This is performing an intersection when the columns are MultiIndex, which can be identified when all of the metric values for a given header are NA like the three rows seen below under the l
header.
I don't recall if there is support for MultiIndex columns in the query language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is support for MultiIndex columns in the QL nowadays, but I would not recommend doing this with the QL unless it's really needed. If we care about performance, the QL shouldn't be used for everything. At the end of the day, the problem that the QL is solving (a special variant of subgraph isomorphism) is an NP-Hard problem. Although I can make the QL faster, it will never be super fast due to that fact.
The QL should be used when you need to select or filter a Thicket while considering the node relationships encoded in the graph. If you don't need to take those relationships into account, then the QL could introduce unnecessary bottlenecks or slowdowns into your code. A good example of when you should use the QL is what you do in NCUReader
@michaelmckinsey1. A good example of when you should not use the QL is when you just want to apply a filter to each node independently.
Besides all that stuff related to the QL, I'd also recommend you don't do the mentioned change in this PR. That's a different change to the code, so it would belong in a different PR if you were to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore my point about not changing this in this PR. I hadn't gone over your changes before I said that. My comment about not using the QL willy-nilly stands though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got a few more changes that I'd like you to make. The most notable is that I'd encourage you to remove the use of the QL in Thicket.intersection
since you are already editing it.
) | ||
else: | ||
# If perfdata not padded | ||
query = Query().match(".", lambda df: len(df) == len(self.profile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see this before. I get why you were asking about using the QL with MultiIndex now. I'd still recommend not using the QL here. At the end of the day, it's almost always better to use P code than NP Hard code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose if there were a case where the parent was NaN, e.g.:
1.781 RAJAPerf
├─ nan Algorithm
│ ├─ 0.002 Algorithm_MEMCPY
│ ├─ 0.002 Algorithm_MEMSET
│ ├─ 0.003 Algorithm_REDUCE_SUM
becomes
1.781 RAJAPerf
├─ 0.002 Algorithm_MEMCPY
├─ 0.002 Algorithm_MEMSET
├─ 0.003 Algorithm_REDUCE_SUM
when it should be
1.781 RAJAPerf
I can't think of a situation where this would happen, since in reality if the parent node was filled with NaNs because it did not exist, then the children shouldn't have existed either. The children can't exist without the parent, so they would also have NaNs. However, the QL would properly remove the children in this example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer using the QL where I can, because it is better tested and more readable than writing new code like in this PR. I suppose then you would think that we would benefit from having a performance data filter function, and not from the QL? And a performance data filter would replace the code in intersection
, including the existing queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going back and trying to knock out some of these reviews now.
A performance data filter function (similar to Hatchet's filter
function) would be a great thing to have. At the end of the day, if you are doing filtering that doesn't care about the structure of the graph, you really should be using Pandas operations over the QL, and that's what this type of function would provide. I have some thoughts on how to improve the QL's performance, but it should still never be faster than Pandas on single node filtering (unless Pandas is much worse in performance than I think).
That being said I do get your point about using the QL in this case. I think a good thing to do here would be to continue using the QL in intersection
(for now at least, we may revisit later) and then create a performance data filter function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but see my last comment regarding performance data filtering and the QL.
@pearce8 this PR is ready for your review.
@michaelmckinsey1 Can you please resolve conflicts on this PR? |
6df4876
to
8905b43
Compare
This PR proposes to expand running the unit tests for (1) Thickets with intersection trees (default is union) and (2) Thickets without filling the performance data (default is filling the performance data). Therefore, testing for each combination would run all the unit tests for 4 types of Thickets:
Example of using parametrized fixtures from pytest docs
When tests fail, we can see which configuration of parameters failed:
We can run single tests for a single set of parameters by specifying them to pytest