stabilize normalization results by modifying sampling process #15
base: dev
Conversation
As in the existing approach, the dataframe's sample() function will already do the random sampling well. There is no need to repeat the sampling and take an intersection of the results.
Got it! I'll revise the code with this in mind and submit the PR again.
featuretools/mkfeat/normalize.py
Outdated
for _ in range(5):
    df = df.sample(n=1000)
    es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
    entities.update(es.entities[1:])
The current code does not seem to be the intended approach. The result should be the intersection of the entities and relationships that were found. But rather than simply using sets to take unions or intersections of the entities themselves, shouldn't it be the intersection of each entity's columns? It's hard to explain. ㅠ
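A minimal sketch of what the reviewer seems to be describing: per discovered entity, keep only the columns that appear in every sampling round, i.e. intersect column sets instead of collecting whole entities in a set. It assumes `an.auto_entityset` and `es.entities` behave as in the diff above and that each entity exposes its dataframe as `entity.df`; the helper name and structure are only illustrative, not the PR's actual code.

```python
import autonormalize as an

def stable_entity_columns(df, key_colname, rounds=5, sample_size=1000):
    """Intersect, per discovered entity, the columns found across repeated samples."""
    common_cols = {}  # entity id -> columns seen in every round so far
    for _ in range(rounds):
        sample = df.sample(n=sample_size)  # always sample from the full dataframe
        es = an.auto_entityset(sample, index=key_colname, accuracy=0.98)
        for entity in es.entities[1:]:     # skip the root entity, as in the diff
            cols = set(entity.df.columns)
            if entity.id in common_cols:
                common_cols[entity.id] &= cols   # keep only columns seen in every round
            else:
                common_cols[entity.id] = cols
    return common_cols
```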
A column spec is now accepted even if it has no key column. In that case, a temporary key column is generated, named 'id' or 'id_xxx'.
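A rough sketch of that temporary-key behaviour, assuming a plain pandas DataFrame; the exact fallback naming ('id_xxx') is not spelled out in the PR, so the counter-based suffix below is only an illustration.

```python
import itertools

def add_temporary_key(df):
    """Insert a surrogate key column named 'id', or an 'id_<n>' variant if 'id' is taken."""
    name = 'id'
    if name in df.columns:
        for i in itertools.count():
            candidate = 'id_{}'.format(i)   # placeholder for the 'id_xxx' fallback scheme
            if candidate not in df.columns:
                name = candidate
                break
    df.insert(0, name, range(len(df)))      # simple sequential values as the key
    return name
```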
A label file for importance must have exactly one label column. The old implementation assumed that a label file has only one numeric column. Validation is also applied for FeatureImportance.
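A sketch of that check, assuming the label file's column spec is a list of dicts with a 'type' field marking the label column; the spec format and error handling here are assumptions, not the repo's actual FeatureImportance API.

```python
def validate_label_spec(columns):
    """Require exactly one column marked as the label (field name is assumed)."""
    label_cols = [col for col in columns if col.get('type') == 'label']
    if len(label_cols) != 1:
        raise ValueError(
            'label file must have exactly one label column, found {}'.format(len(label_cols)))
    return label_cols[0]
```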
normalize.py
- Repeat the step of running autonormalize to identify the targets to normalize (the number of repetitions defaults to 5)
- Perform the normalize step (see the sketch below)
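A minimal sketch of the second bullet, assuming the split is ultimately applied with featuretools' pre-1.0 `EntitySet.normalize_entity`; the entity and column names are placeholders rather than the repo's real ones.

```python
import featuretools as ft

def apply_normalization(df, key_colname, new_index, moved_columns):
    """Split `moved_columns` out of the main dataframe into a separate entity."""
    es = ft.EntitySet(id='data')
    es.entity_from_dataframe(entity_id='main', dataframe=df, index=key_colname)
    es.normalize_entity(
        base_entity_id='main',
        new_entity_id='normalized',
        index=new_index,                     # key column of the new entity
        additional_variables=moved_columns,  # columns moved out of 'main'
    )
    return es
```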