forked from alteryx/featuretools
stabilize normalization results by modifying sampling process #15
Open
leahincom wants to merge 17 commits into oslab-ewha:dev from leahincom:dev
Changes from 1 commit
17 commits
a9c3825 stabilize normalization results by modifying sampling process (leahincom)
5260fa6 Fix FeatureImportance for separate label file case (cezanne)
3ecddd6 add argument conditions to transform operator: GreaterThanPrevious (leahincom)
9aa2e41 add argument conditions to transform operator: LessThanPrevious (leahincom)
d335970 add argument conditions to aggregation operator: CountInsideRange (leahincom)
c9ef85c add argument conditions to aggregation operator: CountOutsideRange (leahincom)
37db785 add argument conditions and refactor aggregation operator: MaxConsecu… (leahincom)
3c67f1c add argument conditions and refactor aggregation operator: MaxConsecu… (leahincom)
7648337 add argument conditions and refactor aggregation operator: MaxConsecu… (leahincom)
bcfd64b add argument conditions and refactor aggregation operator: MaxConsecu… (leahincom)
a5d5653 add argument conditions and refactor aggregation operator: MaxConsecu… (leahincom)
23970ba Handle the case a key column has a non-unique (cezanne)
06100cf Add support for auto key (cezanne)
aa84c71 Manage the bug of autonormalize for multi key (cezanne)
b534fac Fix error for non-numeric columns of label file (cezanne)
89e42bb Add ERR_DATA_LABEL_COUNT_MISMATCH (cezanne)
62d5976 Merge remote-tracking branch 'fork/dev' into dev (leahincom)
@@ -1,33 +1,41 @@
 from pandas import DataFrame
 import autonormalize as an
 
 
 def normalize(df: DataFrame, key_colname):
+    es = None
+    entities = set()
+    relationships = set()
+
     if len(df) > 1000:
-        df = df.sample(n=1000)
-    es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+        for _ in range(5):
+            df = df.sample(n=1000)
+            es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+            entities.update(es.entities[1:])
+            relationships.update(es.relationships)
+    else:
+        es = an.auto_entityset(df, index=key_colname, accuracy=0.98)
+        entities.update(es.entities[1:])
+        relationships.update(es.relationships)
 
     norminfos = []
     # For every entity other than the first; assumes the first entity is the main one
-    entities = es.entities[1:]
     for et in entities:
         norminfo = []
         for var in et.variables:
             norminfo.append(var.name)
         norminfos.append(norminfo)
     for norminfo in norminfos:
-        parent_ids = _get_parent_entity_ids(es, norminfo[0])
+        parent_ids = _get_parent_entity_ids(relationships, norminfo[0])
         for parent_id in parent_ids:
             vars = es[parent_id].variables
             for var in vars[1:]:
                 norminfo.append(var.name)
     return norminfos
 
 
-def _get_parent_entity_ids(es, child_id):
+def _get_parent_entity_ids(rels, child_id):
     parent_ids = []
-    for rel in es.relationships:
+    for rel in rels:
         if child_id == rel.child_entity.id:
             parent_ids.append(rel.parent_entity.id)
-            parent_ids += _get_parent_entity_ids(es, rel.parent_entity.id)
-    return parent_ids
+            parent_ids += _get_parent_entity_ids(rels, rel.parent_entity.id)
+    return parent_ids
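One detail in the new loop: `df` is reassigned on each pass, so from the second pass on `df.sample(n=1000)` draws 1000 rows from a frame that already has exactly 1000 rows and only reshuffles them. Below is a minimal sketch of sampling from the original frame on every round, assuming that is the intent. `draw_samples`, `rounds`, and `seed` are hypothetical names that are not part of this PR; `random_state` is the standard pandas argument for reproducible sampling.

```python
import pandas as pd


def draw_samples(df: pd.DataFrame, n: int = 1000, rounds: int = 5, seed: int = 0):
    """Yield samples of `n` rows, each drawn from the original frame.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if len(df) <= n:
        # Small frames are used whole, as in the else branch of the diff.
        yield df
        return
    for i in range(rounds):
        # A distinct seed per round gives different but repeatable samples.
        yield df.sample(n=n, random_state=seed + i)
```

Each yielded sample could then be passed to `an.auto_entityset` in place of the reassigned `df`.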
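I don't think the current code does quite what we want. The result should be the intersection of the entities and relationships that were found. But rather than simply using a set to take a union or intersection, shouldn't it be the intersection of the entities' columns? It's hard to explain. :(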
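One possible reading of this suggestion, sketched in plain Python: summarize each sampling round as a mapping from an entity's key column to the columns found for it, then keep only the entities seen in every round and, within each, only the columns every round agrees on. The function name `intersect_runs` and the dict-of-lists input shape are assumptions made for illustration; they are not part of the PR or of autonormalize.

```python
def intersect_runs(runs):
    """Keep entities found in every run, with only the columns all runs share.

    `runs` is a list of dicts; each dict maps an entity's key column name
    to the list of column names found for that entity in one round.
    (Hypothetical data shape, chosen for illustration.)
    """
    if not runs:
        return {}
    # Entities (identified by key column) that appear in every run.
    common_keys = set(runs[0]).intersection(*runs[1:])
    result = {}
    for key in runs[0]:  # keep the first run's ordering
        if key not in common_keys:
            continue
        # Columns of this entity that every other run also reports.
        result[key] = [
            col for col in runs[0][key]
            if all(col in run[key] for run in runs[1:])
        ]
    return result
```

For example, if one round places `zip` under the `city` entity and another does not, `zip` is dropped from `city`, and an entity that shows up in only one round is dropped entirely.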