Transform: N column, exclude NA values #67
Comments
That N value in the
Which gives the following
The recommended method for the hmisc transform to see the column summary is to use the option
If one wants to change the default summary, this is a Categorical X Categorical variable, dispatched via the "hmisc" transform, lines 446-448 at the end of
Is the one you'd want to change. To override this transform, create your own transform in the current environment by copying the existing one. Then define whatever you'd like.
Then the new behavior would be used for every call. However, the original 'N' values describe the data in a consistent manner that handles multiple rows. The dplyr approach is probably simpler, because you are then describing data that has already been processed to drop observations with missing column variables. |
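The pre-filtering workaround mentioned above might look roughly like the following base R sketch. The data frame and its column names (`endpoint`, `group`) are invented for illustration, and nothing here calls tangram itself; with dplyr the equivalent filter would be `dplyr::filter(df, !is.na(group))`.

```r
# Invented example data; the column names are placeholders.
df <- data.frame(
  endpoint = c("yes", "no", NA, "yes", "no", "yes"),
  group    = c("Diabetic", NA, "Non-Diabetic", "Diabetic", "Non-Diabetic", NA),
  stringsAsFactors = TRUE
)

# Keep only observations where the column (grouping) variable is defined.
df_complete <- df[!is.na(df$group), ]

# Row N before and after filtering: the second count ignores
# observations that have an endpoint but no group.
n_before <- sum(!is.na(df$endpoint))
n_after  <- sum(!is.na(df_complete$endpoint))
```

Passing the filtered data frame to the table call keeps the package's simple counting rule intact, at the cost of describing a reduced data set.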
Oh, the row entries are generated using
|
I think I understand what you are saying, but this is not what I was trying to convey. In my case I have a dataframe with 133 observations (patients). 122 of those have an Endpoint at closeout (sum(!is.na(datar))); however, since I am grouping by Group_DM2_Non, I only have 108 observations over both columns of my table. You seem to expect that the grouping value (e.g. drug) has no NA values, so you only exclude NA values from row$data and don't check something like In my opinion the row N should never be larger than the sum of the columns. I think an overall column doesn't help, since that would still show Overall(N=118) but some rows will have N>118! Example:
|
I understand what you are saying quite well. The problem is that there has to be a consistent application of a principle. The issue is not as easy as it first appears. Right now the N value of the row is "Number of defined row values". This has been the standard report from summaryM and this package for quite some time. Very straightforward. What is requested is that the N value of the row only count observations for which there are defined values of the columns. I can add new behavior by defining an interface that adds an option of some form. The question is design. If I add a pass through of the Maybe allow it and add something like |
I don't see the problem in excluding column NA values from the row N. In my mind it is the expected behaviour simply because of the fact that in categorical x categorical and collapse=FALSE the number of observations in each column for each factor would add up to the row N. I fixed it for myself by filtering with dplyr, but I would suggest that having this as an option would be nice! Thanks for your help so far! |
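To make the two counting rules in this discussion concrete, here is a small base R sketch on made-up data. The vectors `endpoint` and `group` are placeholders, and the code only illustrates the arithmetic; it does not reproduce how tangram computes N internally.

```r
# Toy data: 10 patients, some missing the endpoint, some missing the group.
endpoint <- factor(c("yes", "no", "yes", NA, "no", "yes", "no", "yes", NA, "no"))
group    <- factor(c("A", "B", NA, "A", "B", "A", NA, "B", "A", "B"))

# Rule currently described in the thread: count every defined row value.
n_row_defined <- sum(!is.na(endpoint))

# Requested rule: count only observations where row AND column are defined.
n_both_defined <- sum(!is.na(endpoint) & !is.na(group))

# table() drops NA in either variable by default, so the cell counts
# (and therefore the per-column totals) add up to n_both_defined.
cells     <- table(endpoint, group)
col_total <- colSums(cells)
```

Under the first rule the row N can exceed `sum(col_total)`, which is the mismatch reported in this issue.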
If not done carefully it can lead to variable N values for the columns per row. That is the danger. This cannot be allowed via the interface; it must be consistent through all cases, no matter how obscure. Right now it is consistent because it runs a very simple rule. Relaxing and opening this interface up for different options is okay, but I can't consider just your use case. I have to consider all cases, and I should write tests for them as well. I think I've got something that will work, have two arguments
This would exclude all row NA values, NaN, any numeric 3, any factor "absent" and another comb from something external that was not provided--and all missing column values. Similar logic can be applied to the column. Since row and column are evaluated independently, no inconsistency can arise. The interface is incredibly versatile, and covers some other requests I've had in a robust manner. So the question is does |
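Purely as an illustration of the kind of exclusion rule described above, and not the maintainer's implementation, a hypothetical helper could combine "drop missing row and column values" with a user-supplied list of row values to exclude. The function name `n_defined` and its `row_exclude` argument are assumptions made up for this sketch.

```r
# Hypothetical helper for illustration only; this is not a tangram function.
n_defined <- function(row, col, row_exclude = NULL) {
  # is.na() is TRUE for both NA and NaN, so both are dropped here.
  keep <- !is.na(row) & !is.na(col)
  if (length(row_exclude) > 0) {
    # Compare as character so one rule covers numerics and factors.
    keep <- keep & !(as.character(row) %in% as.character(row_exclude))
  }
  sum(keep)
}

score  <- c(1, 3, NaN, 2, NA, 3, 1)
status <- factor(c("present", "absent", "present", NA,
                   "present", "present", "absent"))
arm    <- c("A", "B", "A", "B", NA, "A", "B")

n_plain  <- n_defined(score, arm)                           # missing values only
n_score  <- n_defined(score, arm, row_exclude = 3)          # also drop numeric 3
n_status <- n_defined(status, arm, row_exclude = "absent")  # also drop "absent"
```

Because the rule is applied per observation before anything is counted, the same mask could be reused for every cell in the row, which is one way to keep the counts consistent.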
I've tried a variety of solutions to this, but they've all led to inconsistencies. I was ready to throw in the towel and give up on it when I hit upon a method that just might work, so I've not given up yet. |
Your commitment to this is remarkable, thanks! I had to rotate to an outpatient clinic for the last few weeks, so I am just now catching up on things. In the past I found it hard to find documentation on optional function arguments for tangram; if you include something like row.exclude, it has to be documented thoroughly. I would still argue that this should not be optional, but rather the other way around: an option to fall back to the current behaviour. In my mind the row N for non-NA values should never be larger than the combined column N values. I know my colleagues were confused at first as well. My proposed feature would allow people to recognize that some data sets are incomplete. In the end it is up to you; for now I have fixed it by filtering with dplyr beforehand. While you are at it, is there a simple interface to access table cell values by column/row names in order to reference them in the full text? |
You're the third or fourth person who's asked me for it. So I figure it's a broader user need. For documentation in the current release try I have some code written, I'm testing. It can lead to surprising behavior, so I'm working on allowing targeted exclusion criteria.
Right now table cell values are just numbered. It's a list of lists. The question becomes what would column/row names be. A single variable can make multiple rows/columns, so the variable name is insufficient. At the inception of the package there was a strict naming convention that was intended for result tracing and tying a number to the originating data set. This rarely got used, so it's fallen into disrepair, but it would be possible to do that. |
This is still broken for data coming from the environment. I don't know how I'm going to do that, it's more involved, but will have to be supported for this enhancement to be complete. |
So I encountered another problem/style question while using tangram. Currently the N column only removes NA values from datar, however it would be nice to also exclude entries which don't exist in the column.
For example I have the following table:
While it is correct that 122 patients had an endpoint at the end of the study, only 108 patients had an endpoint AND were in either the Non-Diabetic or the Diabetic group:
While I could filter for those patients using dplyr, it would probably make sense to include this in the transform itself. However I can't wrap my head around parts of the API, any help would be appreciated!