-
Notifications
You must be signed in to change notification settings - Fork 12
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Keyword search in text from reference dictionary (#46)
* init file and pull sample data * add helper functions and reference dict * clean up text, tokenize, add new labels from regex function * move example into its own file, seperate out functions to use * clean functions and make generic names * clean up example * move reference dict to example_data * remove pdb * move dict out of functions file * missed import * add metrics to measure accuracy * rename other category to unknown for fair comparison * move keyword search to preprocess step, util * move add label fucntion to classify * fix imports, add comment for viz * remove temp file * change arg to dict * add keyword util * update formating * whitespace * update args to match * remove whitespace * function rename * clean up dict * add sample data * update docstring and change col dict name * clean up and use sample data in examples * add plot function * add in visualization function for confusion matrix * add more detail to dox string * clean up function to expect predicted_col * clean up example * remove whitespace * change col name from new to predicted * clean up sample data * add notebook example * add functionality to replace text in preprocessing * add text replacement step * linter * remove example .py file in favor of jupyter notebook * remove reference dict * add new csv mapping rfiles * add csv * update jupyternotebook example * use df instead of dict * remove dict * remove unused imports and functions * update jupyter notebook and remove module * remove plot * update docstrings * linter * remove whitespace
- Loading branch information
1 parent
2526d49
commit a4de9a4
Showing
7 changed files
with
420 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
in,out_ | ||
combiner,combiner | ||
comb,combiner | ||
cb,combiner | ||
battery,battery | ||
bess,battery | ||
inverter,inverter | ||
invert,inverter | ||
inv,inverter | ||
met,met | ||
meter,meter | ||
module,module | ||
mod,module | ||
recloser,recloser | ||
reclose,recloser | ||
relay,relay | ||
substation,substation | ||
switchgear,switchgear | ||
switch,switchgear | ||
tracker,tracker | ||
transformer,transformer | ||
xfmr,transformer | ||
wiring,wiring | ||
wire,wiring | ||
wires,wiring |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
in,out_ | ||
comm,communication | ||
energy,energy | ||
kwh,energy | ||
mwh,energy | ||
grid,grid | ||
curtailment,grid | ||
curtail,grid | ||
poi,grid | ||
offline,outage | ||
solar,solar | ||
pv,solar | ||
photovoltaic,solar | ||
system,system | ||
site,system | ||
farm,system | ||
project,system | ||
sma,make_model | ||
cm,corrective_maintence | ||
pm,preventative_maintence |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Adding keyword labels to O&M data\n", | ||
"This notebook demonstrates the use of the `pvops.classify.get_attributes_from_keywords` module for adding asset labels based off O&M notes." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"from sklearn.metrics import accuracy_score\n", | ||
"\n", | ||
"from pvops.text import utils, preprocess\n", | ||
"from pvops.text.classify import get_attributes_from_keywords\n", | ||
"from pvops.text.visualize import visualize_classification_confusion_matrix" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 0: Get sample data, remap assets" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# pull in sample data and remap assets for ease of comparison\n", | ||
"\n", | ||
"om_df = pd.read_csv('example_data/example_ML_ticket_data.csv')\n", | ||
"col_dict = {\n", | ||
" \"data\" : \"CompletionDesc\",\n", | ||
" \"eventstart\" : \"Date_EventStart\",\n", | ||
" \"save_data_column\" : \"processed_data\",\n", | ||
" \"save_date_column\" : \"processed_date\",\n", | ||
" \"attribute_col\" : \"Asset\",\n", | ||
" \"predicted_col\" : \"Keyword_Asset\",\n", | ||
" \"remapping_col_from\": \"in\",\n", | ||
" \"remapping_col_to\": \"out_\"\n", | ||
"}\n", | ||
"\n", | ||
"# remap assets\n", | ||
"remapping_df = pd.read_csv('example_data/remappings_asset.csv')\n", | ||
"remapping_df['out_'] = remapping_df['out_'].replace({'met station': 'met',\n", | ||
" 'energy storage': 'battery',\n", | ||
" 'energy meter': 'meter'})\n", | ||
"om_df = utils.remap_attributes(om_df, remapping_df, col_dict, allow_missing_mappings=True)\n", | ||
"om_df.head()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 1: Text preprocessing" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# preprocessing steps\n", | ||
"om_df[col_dict['attribute_col']] = om_df.apply(lambda row: row[col_dict['attribute_col']].lower(), axis=1)\n", | ||
"om_df = preprocess.preprocessor(om_df, lst_stopwords=[], col_dict=col_dict, print_info=False, extract_dates_only=False)\n", | ||
"\n", | ||
"DATA_COL = col_dict['data']\n", | ||
"om_df[DATA_COL] = om_df['processed_data']\n", | ||
"\n", | ||
"# replace terms\n", | ||
"equipment_df = pd.read_csv('~/pvOps/examples/example_data/mappings_equipment.csv')\n", | ||
"pv_terms_df = pd.read_csv('~/pvOps/examples/example_data/mappings_pv_terms.csv')\n", | ||
"pv_reference_df = pd.concat([equipment_df, pv_terms_df])\n", | ||
"om_df = utils.remap_words_in_text(om_df=om_df, remapping_df=pv_reference_df, remapping_col_dict=col_dict)\n", | ||
"\n", | ||
"om_df.head()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 2: Search for keywords to use as labels" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# add asset labels from keyword reference dict\n", | ||
"om_df = get_attributes_from_keywords(om_df=om_df,\n", | ||
" col_dict=col_dict,\n", | ||
" reference_df=equipment_df)\n", | ||
"om_df.head()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 3: Metrics" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# get accuracy measures and count metrics\n", | ||
"PREDICT_COL = col_dict['predicted_col']\n", | ||
"LABEL_COL = col_dict['attribute_col']\n", | ||
"\n", | ||
"# entries with some keyword over interest, over all entries\n", | ||
"label_count = om_df[PREDICT_COL].count() / len(om_df)\n", | ||
"\n", | ||
"# replace 'Other' values with 'Unknown'\n", | ||
"om_df[LABEL_COL] = om_df[LABEL_COL].replace('other', 'unknown')\n", | ||
"# replace NaN values to use accuracy score\n", | ||
"om_df[[LABEL_COL, PREDICT_COL]] = om_df[[LABEL_COL, PREDICT_COL]].fillna('unknown')\n", | ||
"acc_score = accuracy_score(y_true=om_df[LABEL_COL], y_pred=om_df[PREDICT_COL])\n", | ||
"\n", | ||
"msg = f'{label_count:.2%} of entries had a keyword of interest, with {acc_score:.2%} accuracy.'\n", | ||
"print(msg)" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 4: Visualization" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# plot confusion matrix\n", | ||
"title = 'Confusion Matrix of Actual and Predicted Asset Labels'\n", | ||
"visualize_classification_confusion_matrix(om_df, col_dict, title)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.5" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.