Column names in data almost always contain spaces/special characters, are sometimes not unique #78
Comments
@naupaka @gavinsimpson took over maintaining this pkg, I'll hand it over to them
Hi @japhir -- thanks for the note. I would be a bit worried about changing the column names in any sort of automated fashion, since it could break a lot of old code from users that might rely on the old names. I think if a user wants to run An alternative would be some sort of warning if there are column name issues and perhaps a suggestion/message to the user to consider using
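For illustration, the warning idea could look roughly like the sketch below. `check_colnames()` is a hypothetical helper, not an existing pangaear function, and the example data frame is made up; the sketch only assumes base R. It reports duplicated and non-syntactic column names and leaves the data untouched.

```r
# Hypothetical helper (not part of pangaear): report column-name problems
# without modifying the data frame.
check_colnames <- function(df) {
  nms <- names(df)
  # names that occur more than once
  dupes <- unique(nms[duplicated(nms)])
  # make.names() leaves a name unchanged only if it is syntactically valid
  nonsyntactic <- nms[make.names(nms) != nms]
  if (length(dupes) > 0) {
    message("Duplicated column names: ", paste(dupes, collapse = ", "))
  }
  if (length(nonsyntactic) > 0) {
    message(length(nonsyntactic),
            " column name(s) contain spaces or special characters; ",
            "consider cleaning them before further analysis.")
  }
  invisible(list(duplicated = dupes, nonsyntactic = nonsyntactic))
}

# Made-up example with one repeated, non-syntactic name
df <- setNames(data.frame(1:2, 3:4, 5:6),
               c("Depth [m]", "Age [ka BP]", "Depth [m]"))
res <- check_colnames(df)
```

Because the helper only emits messages, existing user code that relies on the original names would keep working.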
Yeah I agree that doing this always would probably be too drastic. Ideally we would make sure that future data packages have something like an elaborate description of what each column means and what units it has in the metadata, while the column names contain only letters, periods/underscores and numbers (not as the first character) and are guaranteed to be unique. Best would be if they would adhere to some kind of ontology where for example age is always called age and always has the same units (e.g. Ma or ka) and d18O and d13C would always be called that, with the metadata indicating whether they have been adjusted for species-specific vital effects etc. Perhaps there could be a link between the full column name and the tidied-up column name, so that if you ever want to get back to what was written originally you can still do so? I agree that the easiest implementation would be to just write a suggestion message. In and of itself, having such column names is not too much of a problem because (at least when using the tidyverse) you can wrap them in backticks (`). However, this is almost always very annoying to type and doesn't autocomplete ;-). And it does not work if any of the full column names are duplicates.
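One possible way to keep a link between the full and the tidied-up names, sketched with base R only: `tidy_colnames()` and the `"original_names"` attribute are hypothetical names chosen for this example, not anything pangaear provides. `make.names(unique = TRUE)` replaces invalid characters with periods and appends a numeric suffix to repeated names.

```r
# Hypothetical helper: tidy the column names and remember the originals.
tidy_colnames <- function(df) {
  original <- names(df)
  # syntactically valid and unique, e.g. "Depth [m]" -> "Depth..m."
  tidy <- make.names(original, unique = TRUE)
  names(df) <- tidy
  # lookup table: tidy name -> original name
  attr(df, "original_names") <- setNames(original, tidy)
  df
}

# Made-up example with one repeated, non-syntactic name
df <- setNames(data.frame(1:2, 3:4, 5:6),
               c("Depth [m]", "Age [ka BP]", "Depth [m]"))
clean <- tidy_colnames(df)

names(clean)
# look up what the second "Depth" column was originally called
attr(clean, "original_names")[["Depth..m..1"]]
```

Custom attributes are not guaranteed to survive every data-frame operation, so a real implementation might prefer to store the mapping in the dataset's metadata instead.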
Maybe pangaear could also reexport
Personally I don't think that's a good idea as it forces a dependency on users who might not otherwise want to do their wrangling using {janitor}. The pages should be as agnostic in that regard as possible.
A way to address the issue of multiple repeat column names for longer variable names, without requiring external dependencies, is to change the behaviour of the internal function
But it would be a better default behaviour to use:
To ensure backward compatibility the package maintainer could add the argument Currently because
Most datasets that I could find have column names that contain special characters and spaces, and for a recent dataset the column names turned out not to be unique (resulting in errors when doing further analysis using the tidyverse).
It would be nice if it could apply e.g.
janitor::clean_names()