feat: Add handling of NaN values to Scaler TableTransformers #873
Closes #650

### Summary of Changes

Adds a RobustScaler class that works like the StandardScaler but uses the median instead of the mean and the interquartile range instead of the standard deviation. If the interquartile range is 0, it only subtracts the median from all rows. For now, it cannot handle columns containing NaN values. See issue #873.

---------

Co-authored-by: srose <118634249+wastedareas@users.noreply.github.com>
Co-authored-by: Simon <simon@schwubbel.dip0.t-ipconnect.de>
Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>
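To illustrate the scaling rule described in this summary, here is a minimal sketch in plain Python. It is not the library's RobustScaler: the function name `robust_scale` is hypothetical, and the quantile method (`"inclusive"`) is an assumption, since other methods yield slightly different quartiles.

```python
import statistics


def robust_scale(values: list[float]) -> list[float]:
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # Assumption: "inclusive" quantiles; the library may compute quartiles differently.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread between the quartiles: only subtract the median.
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 5.0])
constant = robust_scale([5.0, 5.0, 5.0])
```

Like the scaler described above, this sketch does not handle NaN values: a NaN in the input makes the result of sorting-based statistics unreliable.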
Currently, the chosen route is to replace NaN values with None. This is implemented and works for StandardScaler, but it needs more test coverage.
We discovered that the implementation may not be working as intended. We forgot that with_columns returns a new data frame (while in the background it only modifies). However, we then get an error in the robust scaler that we cannot explain; see the TODO in the test_fit method.
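The NaN-to-None route and the pitfall mentioned in these comments can be sketched without any data frame library. The helper `nan_to_none` is hypothetical and only mirrors the idea on a plain list. Like a method that returns a new object, it leaves its input untouched, so the result has to be assigned.

```python
import math
from typing import Optional


def nan_to_none(column: list[Optional[float]]) -> list[Optional[float]]:
    """Return a new list in which every NaN is replaced by None."""
    return [
        None if isinstance(value, float) and math.isnan(value) else value
        for value in column
    ]


original = [1.0, float("nan"), 3.0, None]

# Calling the helper without keeping the result changes nothing.
nan_to_none(original)

# The replacement only takes effect in the returned list.
cleaned = nan_to_none(original)
```

After this runs, `original` still contains the NaN, while `cleaned` holds None in its place.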
## [0.27.0](v0.26.0...v0.27.0) (2024-07-19)

### Features

* join ([#870](#870)) ([5764441](5764441)), closes [#745](#745)
* activation function for forward layer ([#891](#891)) ([5b5bb3f](5b5bb3f)), closes [#889](#889)
* add `ImageDataset.split` ([#846](#846)) ([3878751](3878751)), closes [#831](#831)
* add FunctionalTableTransformer ([#901](#901)) ([37905be](37905be)), closes [#858](#858)
* add InvalidFitDataError ([#824](#824)) ([487854c](487854c)), closes [#655](#655)
* add KNearestNeighborsImputer ([#864](#864)) ([fcdfecf](fcdfecf)), closes [#743](#743)
* add moving average plot ([#836](#836)) ([abcf68a](abcf68a))
* add RobustScaler ([#874](#874)) ([62320a3](62320a3)), closes [#650](#650) [#873](#873)
* add SequentialTableTransformer ([#893](#893)) ([e93299f](e93299f)), closes [#802](#802)
* add temporal operations ([#832](#832)) ([06eab77](06eab77))
* added 'histogram_2d' in TablePlotter ([#903](#903)) ([4e65ba9](4e65ba9)), closes [#869](#869) [#798](#798)
* added from_str_to_temporal and continues prediction ([#767](#767)) ([35f468a](35f468a)), closes [#806](#806) [#765](#765) [#740](#740) [#773](#773)
* added GRU layer ([#845](#845)) ([d33cb5d](d33cb5d))
* Adds Dropout Layer ([#868](#868)) ([a76f0a1](a76f0a1)), closes [#848](#848)
* dark mode for plots ([#911](#911)) ([5447551](5447551)), closes [#798](#798)
* easily create a baseline model ([#811](#811)) ([8e1b995](8e1b995)), closes [#710](#710)
* get first cell with value other than `None` ([#904](#904)) ([5a0cdb3](5a0cdb3)), closes [#799](#799)
* hyperparameter optimization for fnn models ([#897](#897)) ([c1f66e5](c1f66e5)), closes [#861](#861)
* implement violin plots ([#900](#900)) ([9f5992a](9f5992a)), closes [#867](#867)
* plot decision tree ([#876](#876)) ([d3f81dc](d3f81dc)), closes [#856](#856)
* prediction no longer takes a time series dataset only table ([#838](#838)) ([762e5c2](762e5c2)), closes [#837](#837)
* raise if `remove_colums` is called with unknown column by default ([#852](#852)) ([8f78163](8f78163)), closes [#807](#807)
* regularization strength for logistic classifier ([#866](#866)) ([9f74e92](9f74e92)), closes [#750](#750)
* reorders parameters of RangeScaler and makes them keyword-only ([#847](#847)) ([2b82db7](2b82db7)), closes [#809](#809)
* replace seaborn with matplotlib for box_plot ([#863](#863)) ([4ef078e](4ef078e)), closes [#805](#805) [#849](#849)
* replaced seaborn with matplotlib for correlation_heatmap ([#850](#850)) ([d4680d4](d4680d4)), closes [#800](#800) [#849](#849)

### Bug Fixes

* **deps:** bump urllib3 from 2.2.1 to 2.2.2 ([#842](#842)) ([b81bcd6](b81bcd6)), closes [#3122](https://github.com/Safe-DS/Library/issues/3122) [#3363](https://github.com/Safe-DS/Library/issues/3363) [#3122](https://github.com/Safe-DS/Library/issues/3122) [#3363](https://github.com/Safe-DS/Library/issues/3363) [#3406](https://github.com/Safe-DS/Library/issues/3406) [#3398](https://github.com/Safe-DS/Library/issues/3398) [#3399](https://github.com/Safe-DS/Library/issues/3399) [#3396](https://github.com/Safe-DS/Library/issues/3396) [#3394](https://github.com/Safe-DS/Library/issues/3394) [#3391](https://github.com/Safe-DS/Library/issues/3391) [#3316](https://github.com/Safe-DS/Library/issues/3316) [#3387](https://github.com/Safe-DS/Library/issues/3387) [#3386](https://github.com/Safe-DS/Library/issues/3386)
* labels of correlation heatmap ([#894](#894)) ([a88a609](a88a609)), closes [#871](#871)
* make multi-processing in baseline models more consistent ([#909](#909)) ([fa24560](fa24560)), closes [#907](#907)

### Performance Improvements

* improved performance in various methods in `Image` and `ImageList` ([#879](#879)) ([134e7d8](134e7d8))
Is your feature request related to a problem?
Currently, RobustScaler, StandardScaler, and possibly RangeScaler do not work on columns containing NaN values.
Desired solution
The fit and transform methods of the scalers should ignore NaN values, just as they ignore None.
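A rough sketch of the desired behavior, using only the standard library, might look as follows. The class `SimpleStandardScaler` and the helper `_is_missing` are hypothetical stand-ins, not the library's StandardScaler, and the use of the population standard deviation is an assumption.

```python
import math
import statistics
from typing import Optional


def _is_missing(value: Optional[float]) -> bool:
    """Treat both None and NaN as missing."""
    return value is None or math.isnan(value)


class SimpleStandardScaler:
    """Hypothetical stand-in for a scaler that skips missing values."""

    def fit(self, column: list[Optional[float]]) -> "SimpleStandardScaler":
        # Compute the statistics from the present values only.
        present = [v for v in column if not _is_missing(v)]
        self.mean = statistics.fmean(present)
        # Assumption: population standard deviation.
        self.std = statistics.pstdev(present)
        return self

    def transform(self, column: list[Optional[float]]) -> list[Optional[float]]:
        # Fall back to a divisor of 1 if all present values were identical.
        divisor = self.std if self.std != 0 else 1.0
        # Missing values pass through unchanged.
        return [
            v if _is_missing(v) else (v - self.mean) / divisor
            for v in column
        ]


column = [1.0, float("nan"), 3.0, None]
result = SimpleStandardScaler().fit(column).transform(column)
```

Here the mean and standard deviation are computed from 1.0 and 3.0 only, and the NaN and None entries stay where they were in the output. A column without any present value would make `fit` raise, which a real implementation would have to handle.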
Possible alternatives (optional)
Implementing and raising a ContainsNaNError for the end user might be preferable, as the choice of how to handle rows and columns containing NaN might be relevant to the data science problem.
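This alternative could be sketched as below. The error name ContainsNaNError comes from the proposal above, but its base class, constructor, and the helper `check_no_nan` are assumptions made for illustration, not existing library code.

```python
import math
from typing import Optional, Sequence


class ContainsNaNError(ValueError):
    """Hypothetical error raised when a column contains NaN values."""

    def __init__(self, column_name: str) -> None:
        super().__init__(f"Column '{column_name}' contains NaN values.")
        self.column_name = column_name


def check_no_nan(column_name: str, values: Sequence[Optional[float]]) -> None:
    """Raise ContainsNaNError if any value is NaN; None is allowed."""
    if any(isinstance(v, float) and math.isnan(v) for v in values):
        raise ContainsNaNError(column_name)


# A column with None but without NaN passes the check.
check_no_nan("b", [1.0, None, 2.0])

# A column with NaN is rejected, leaving the decision to the caller.
try:
    check_no_nan("a", [1.0, float("nan")])
except ContainsNaNError as error:
    caught = error
else:
    caught = None
```

A scaler could call such a check at the start of fit and transform, so that the user decides explicitly whether to drop, impute, or replace the NaN values.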
Screenshots (optional)
No response
Additional Context (optional)
Currently there is no test coverage for NaN and None values in tables for the scalers.