stabilize normalization results by modifying sampling process #15

leahincom · 2021-08-23T12:46:11Z

normalize.py

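Fixes the problem of inconsistent results caused by random sampling.
Repeatedly runs autonormalize on sampled data to identify what should be normalized (the number of repetitions defaults to 5).
Uses those results to perform the final normalize on the original data.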

cezanne · 2021-08-23T14:30:46Z

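As in the existing approach, the dataframe's sample() function should already sample randomly well enough. There is no real need to re-sample repeatedly just to find an intersection.
To put it another way: running autonormalize on a sample produces a result, and the idea is to repeat this several times and run the final normalize based on the columns common to all of those results.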
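The repeat-and-keep-common-columns idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it uses plain lists of dicts instead of a DataFrame to stay dependency-free, and `stable_columns`, `detect`, and `repeated_value_columns` are hypothetical names, with `detect` standing in for the autonormalize step.

```python
import random


def stable_columns(rows, detect, rounds=5, sample_size=1000, seed=0):
    """Return the columns that `detect` reports in every sampling round.

    `detect` is a hypothetical stand-in for the autonormalize step: it takes
    a list of row dicts and returns an iterable of column names.
    """
    rng = random.Random(seed)
    common = None
    for _ in range(rounds):
        # Always draw from the full data, never from a previous sample.
        sample = rng.sample(rows, min(sample_size, len(rows)))
        found = set(detect(sample))
        common = found if common is None else common & found
    return common or set()


def repeated_value_columns(sample):
    """Toy detector (assumption): columns whose values repeat in the sample."""
    return {c for c in sample[0] if len({row[c] for row in sample}) < len(sample)}


cities = ["Seoul", "Busan", "Daegu"]
rows = [{"id": i, "city": cities[i % 3], "note": f"n{i}"} for i in range(200)]
result = stable_columns(rows, repeated_value_columns, rounds=5, sample_size=50)
```

Each round samples from the full row list, so the rounds are independent of one another; only columns reported in all rounds survive.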

leahincom · 2021-08-24T11:34:58Z

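> As in the existing approach, the dataframe's sample() function should already sample randomly well enough. There is no real need to re-sample repeatedly just to find an intersection.
> To put it another way: running autonormalize on a sample produces a result, and the idea is to repeat this several times and run the final normalize based on the columns common to all of those results.

Got it! I'll revise with this in mind and push the PR again.
Thank you :)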

cezanne · 2021-08-27T04:59:57Z

featuretools/mkfeat/normalize.py

+        for _ in range(5):  
+            df = df.sample(n=1000)
+            es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+            entities.update(es.entities[1:])


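The current code doesn't seem to do what we want. I think the result should be the intersection of the entities and relationships that were found. But rather than simply using set to get a union or intersection, shouldn't it be the intersection of each entity's columns? It's hard to explain. :(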
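One possible reading of "intersection of each entity's columns" is sketched below. It assumes each sampling round has already been reduced to a dict mapping an entity's index column to the set of columns placed in that entity; that shape, and the name `intersect_entity_columns`, are assumptions for illustration and not autonormalize's actual output or API.

```python
from functools import reduce


def intersect_entity_columns(runs):
    """Keep entities found in every run, with only the columns all runs agree on.

    `runs` is a list of dicts: entity index column -> set of column names.
    This representation is an assumption made for the sketch.
    """
    if not runs:
        return {}
    # Entities (identified by their index column) present in every run.
    common_keys = reduce(set.intersection, (set(run) for run in runs))
    return {
        key: reduce(set.intersection, (set(run[key]) for run in runs))
        for key in common_keys
    }


runs = [
    {"customer_id": {"customer_id", "name", "city"},
     "product_id": {"product_id", "price"}},
    {"customer_id": {"customer_id", "name"},
     "product_id": {"product_id", "price"},
     "city": {"city", "zip"}},
    {"customer_id": {"customer_id", "name", "city"},
     "product_id": {"product_id", "price"}},
]
stable = intersect_entity_columns(runs)
```

Here the `city` entity is dropped because only one run found it, and `customer_id` keeps only the columns that every run assigned to it. Intersecting relationships would need a similar per-pair treatment, which this sketch leaves out.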

leahincom requested a review from cezanne August 23, 2021 12:46

leahincom self-assigned this Aug 23, 2021

leahincom force-pushed the dev branch from 1a69150 to bd0d4fb Compare August 27, 2021 03:55

stabilize normalization results by modifying sampling process

a9c3825

leahincom force-pushed the dev branch from bd0d4fb to a9c3825 Compare August 27, 2021 04:01

cezanne reviewed Aug 27, 2021


cezanne and others added 16 commits August 29, 2021 21:29

Fix FeatureImportance for separate label file case

5260fa6

add argument conditions to transform operator: GreaterThanPrevious

3ecddd6

add argument conditions to transform operator: LessThanPrevious

9aa2e41

add argument conditions to aggregation operator: CountInsideRange

d335970

add argument conditions to aggregation operator: CountOutsideRange

c9ef85c

add argument conditions and refactor aggregation operator: MaxConsecutiveFalse

37db785

add argument conditions and refactor aggregation operator: MaxConsecutiveNegatives

3c67f1c

add argument conditions and refactor aggregation operator: MaxConsecutivePositives

7648337

add argument conditions and refactor aggregation operator: MaxConsecutiveTrue

bcfd64b

add argument conditions and refactor aggregation operator: MaxConsecutiveZeros

a5d5653

Handle the case a key column has a non-unique

23970ba

Add support for auto key

06100cf

Now, column spec will be OK if it has no key column. In that case, a temporary key is generated, whose column name is 'id' or 'id_xxx'.

Manage the bug of autonormalize for multi key

aa84c71

Fix error for non-numeric columns of label file

b534fac

A label file for importance should have exactly one label column. Old implementation assumed that a label file has only 1 numeric column. Also, apply validation for FeatureImportance.

Add ERR_DATA_LABEL_COUNT_MISMATCH

89e42bb

Merge remote-tracking branch 'fork/dev' into dev

62d5976
