stabilize normalization results by modifying sampling process #15

Open
wants to merge 17 commits into
base: dev
Commits
a9c3825
stabilize normalization results by modifying sampling process
leahincom Aug 21, 2021
5260fa6
Fix FeatureImportance for separate label file case
cezanne Aug 29, 2021
3ecddd6
add argument conditions to transform operator: GreaterThanPrevious
leahincom Aug 30, 2021
9aa2e41
add argument conditions to transform operator: LessThanPrevious
leahincom Aug 30, 2021
d335970
add argument conditions to aggregation operator: CountInsideRange
leahincom Aug 30, 2021
c9ef85c
add argument conditions to aggregation operator: CountOutsideRange
leahincom Aug 30, 2021
37db785
add argument conditions and refactor aggregation operator: MaxConsecu…
leahincom Aug 30, 2021
3c67f1c
add argument conditions and refactor aggregation operator: MaxConsecu…
leahincom Aug 30, 2021
7648337
add argument conditions and refactor aggregation operator: MaxConsecu…
leahincom Aug 30, 2021
bcfd64b
add argument conditions and refactor aggregation operator: MaxConsecu…
leahincom Aug 30, 2021
a5d5653
add argument conditions and refactor aggregation operator: MaxConsecu…
leahincom Aug 30, 2021
23970ba
Handle the case a key column has a non-unique
cezanne Aug 30, 2021
06100cf
Add support for auto key
cezanne Aug 30, 2021
aa84c71
Manage the bug of autonormalize for multi key
cezanne Aug 30, 2021
b534fac
Fix error for non-numeric columns of label file
cezanne Aug 31, 2021
89e42bb
Add ERR_DATA_LABEL_COUNT_MISMATCH
cezanne Aug 31, 2021
62d5976
Merge remote-tracking branch 'fork/dev' into dev
leahincom Sep 6, 2021
28 changes: 18 additions & 10 deletions featuretools/mkfeat/normalize.py
@@ -1,33 +1,41 @@
 from pandas import DataFrame
 import autonormalize as an


 def normalize(df: DataFrame, key_colname):
+    es = None
+    entities = set()
+    relationships = set()
+
     if len(df) > 1000:
-        df = df.sample(n=1000)
-    es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+        for _ in range(5):
+            df = df.sample(n=1000)
+            es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+            entities.update(es.entities[1:])
Member review comment:

I don't think the current code does what we want. The result should be the intersection of the entities and relationships found across the runs. But rather than simply using set to take a union or an intersection, shouldn't it be the intersection of each entity's columns? It's hard to explain. :(
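
One possible reading of this suggestion, as a minimal sketch rather than the project's implementation: treat each sampling run as a list of column-name lists shaped like the `norminfos` built in `normalize()`, match entities across runs by their first column, and keep only the columns that every run agrees on. The function name `intersect_norminfos` and the sample data are hypothetical, and the sketch assumes the first column of each list identifies the entity.

```python
from typing import Dict, List


def intersect_norminfos(runs: List[List[List[str]]]) -> List[List[str]]:
    """Keep only entities, and columns within them, that every run agrees on.

    Each run is a list of column-name lists; the first name in each list is
    treated as that entity's key column (an assumption of this sketch).
    """
    if not runs:
        return []

    # Index every run by entity key so entities can be matched across runs.
    indexed: List[Dict[str, List[str]]] = [
        {cols[0]: cols for cols in run if cols} for run in runs
    ]

    result = []
    for key, first_cols in indexed[0].items():
        # An entity survives only if every run discovered it.
        if not all(key in run for run in indexed[1:]):
            continue
        # A column survives only if every run placed it in this entity.
        common = [
            col for col in first_cols
            if all(col in run[key] for run in indexed[1:])
        ]
        result.append(common)
    return result


# Hypothetical results of three sampling runs.
runs = [
    [["city", "state", "zip"], ["product", "price"]],
    [["city", "state"], ["product", "price", "color"]],
    [["city", "state", "zip"]],
]
stable = intersect_norminfos(runs)
print(stable)
```

Unlike `set.update`, which accumulates everything any single run happened to find, this keeps an entity only when all runs found it, so a dependency that shows up in one unlucky sample is dropped.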

+            relationships.update(es.relationships)
+    else:
+        es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+        entities.update(es.entities[1:])
+        relationships.update(es.relationships)

     norminfos = []
-    # For every entity except the first; the first entity is assumed to be the main one
-    entities = es.entities[1:]
     for et in entities:
         norminfo = []
         for var in et.variables:
             norminfo.append(var.name)
         norminfos.append(norminfo)
     for norminfo in norminfos:
-        parent_ids = _get_parent_entity_ids(es, norminfo[0])
+        parent_ids = _get_parent_entity_ids(relationships, norminfo[0])
         for parent_id in parent_ids:
             vars = es[parent_id].variables
             for var in vars[1:]:
                 norminfo.append(var.name)
     return norminfos


-def _get_parent_entity_ids(es, child_id):
+def _get_parent_entity_ids(rels, child_id):
     parent_ids = []
-    for rel in es.relationships:
+    for rel in rels:
         if child_id == rel.child_entity.id:
             parent_ids.append(rel.parent_entity.id)
-            parent_ids += _get_parent_entity_ids(es, rel.parent_entity.id)
-    return parent_ids
+            parent_ids += _get_parent_entity_ids(rels, rel.parent_entity.id)
+    return parent_ids
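
To illustrate what the refactored helper does once it receives a plain collection of relationships instead of an entity set, here is a small self-contained sketch. `Entity` and `Relationship` are hypothetical stand-ins that model only the attributes the helper reads (`id`, `parent_entity`, `child_entity`); they are not the featuretools classes, and the example data is made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for featuretools entities and relationships;
# only the attributes the helper reads are modelled.
Entity = namedtuple("Entity", ["id"])
Relationship = namedtuple("Relationship", ["parent_entity", "child_entity"])


def get_parent_entity_ids(rels, child_id):
    """Return the ids of every ancestor of child_id, in depth-first order."""
    parent_ids = []
    for rel in rels:
        if child_id == rel.child_entity.id:
            parent_ids.append(rel.parent_entity.id)
            # Recurse so grandparents and further ancestors are included.
            parent_ids += get_parent_entity_ids(rels, rel.parent_entity.id)
    return parent_ids


# city -> state -> country, expressed as child/parent links.
rels = [
    Relationship(parent_entity=Entity("state"), child_entity=Entity("city")),
    Relationship(parent_entity=Entity("country"), child_entity=Entity("state")),
]
ancestors = get_parent_entity_ids(rels, "city")
print(ancestors)
```

The recursion assumes the relationships form no cycle; a cyclic input would recurse without terminating. A list is used here so the traversal order is deterministic, whereas the `relationships` set in the diff gives no ordering guarantee.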