Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FPGrowth/FPMax and Association Rules with the existence of missing values (#1004) #1106

Merged
merged 6 commits into from
Oct 23, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 9 additions & 2 deletions docs/sources/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,18 @@ The CHANGELOG for the current development version is available at

##### New Features and Enhancements

- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))
- Implement the FP-Growth and FP-Max algorithms with the possibility of missing values in the input dataset. Added a new metric Representativity for the association rules generated ([#1004](https://github.com/rasbt/mlxtend/issues/1004) via [zazass8](https://github.com/zazass8)).
Files updated:
- ['mlxtend.frequent_patterns.fpcommon']
- ['mlxtend.frequent_patterns.fpgrowth'](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/)
- ['mlxtend.frequent_patterns.fpmax'](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/)
- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via[it176131](https://github.com/it176131))

##### Changes

- [`mlxtend.frequent_patterns.fpcommon`] Added the null_values parameter in valid_input_check signature to check in case the input also includes null values. Changes the returns statements and function signatures for setup_fptree and generated_itemsets respectively to return the disabled array created and to include it as a parameter. Added code in [`mlxtend.frequent_patterns.fpcommon`] and [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) to implement the algorithms in case null values exist when null_values is True.
- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Added optional parameter 'return_metrics' to only return a given list of metrics, rather than every possible metric.

- Add `n_classes_` attribute to stacking classifiers for compatibility with scikit-learn 1.3 ([#1091](https://github.com/rasbt/mlxtend/issues/1091))
Expand Down
360 changes: 352 additions & 8 deletions docs/sources/user_guide/frequent_patterns/association_rules.ipynb

Large diffs are not rendered by default.

265 changes: 262 additions & 3 deletions docs/sources/user_guide/frequent_patterns/fpgrowth.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,9 @@
"\n",
"In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.\n",
"\n",
"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets."
"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.\n",
"\n",
"A new feature is implemented in this algorithm, which is the sub-case when the input contains missing information [3]. The same structure and logic of the algorithm is kept, while \"ignoring\" the missing values in the data. That gives a more realistic indication of the frequency of existence in the items/itemsets that are generated from the algorithm. The support is computed differently where for a single item, the cardinality of null values is deducted from the cardinality of all transactions in the database. For the case of an itemset, of more than one elements, the cardinality of null values in at least one item in them itemset is deducted from the cardinality of all transactions in the database. "
]
},
{
Expand All @@ -49,6 +51,8 @@
"\n",
"[2] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.\n",
"\n",
"[3] Ragel, A. and Crémilleux, B., 1998. \"[Treatment of missing values for association rules](https://link.springer.com/chapter/10.1007/3-540-64383-4_22)\". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.\n",
"\n",
"## Related\n",
"\n",
"- [FP-Max](./fpmax.md)\n",
Expand Down Expand Up @@ -479,6 +483,261 @@
"fpgrowth(df, min_support=0.6, use_colnames=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The example below implements the algorithm when there is missing information from the data, by arbitrarily removing datapoints from the original dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
" df.iloc[idx[i], col[i]] = np.nan\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Apple</th>\n",
" <th>Corn</th>\n",
" <th>Dill</th>\n",
" <th>Eggs</th>\n",
" <th>Ice cream</th>\n",
" <th>Kidney Beans</th>\n",
" <th>Milk</th>\n",
" <th>Nutmeg</th>\n",
" <th>Onion</th>\n",
" <th>Unicorn</th>\n",
" <th>Yogurt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n",
"0 False False False True False True True True True \n",
"1 False NaN True True False True False True True \n",
"2 True False False True False True True False False \n",
"3 False True False False NaN NaN True NaN False \n",
"4 False True False True NaN True False False NaN \n",
"\n",
" Unicorn Yogurt \n",
"0 NaN NaN \n",
"1 False True \n",
"2 False False \n",
"3 NaN True \n",
"4 False False "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"from mlxtend.frequent_patterns import fpgrowth\n",
"\n",
"rows, columns = df.shape\n",
"idx = np.random.randint(0, rows, 10)\n",
"col = np.random.randint(0, columns, 10)\n",
"\n",
"for i in range(10):\n",
" df.iloc[idx[i], col[i]] = np.nan\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same function as above is applied by setting `null_values=True` with at least 60% support:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>support</th>\n",
" <th>itemsets</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>(Kidney Beans)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.8</td>\n",
" <td>(Eggs)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.6</td>\n",
" <td>(Milk)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>(Eggs, Kidney Beans)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" support itemsets\n",
"0 1.0 (Kidney Beans)\n",
"1 0.8 (Eggs)\n",
"2 0.6 (Milk)\n",
"3 1.0 (Eggs, Kidney Beans)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fpgrowth(df, min_support=0.6, null_values = True, use_colnames=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand Down Expand Up @@ -677,7 +936,7 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
Expand All @@ -691,7 +950,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.12.7"
},
"toc": {
"nav_menu": {},
Expand Down
Loading
Loading