This repository contains the code for the paper "On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach", published at the International Joint Conference on Artificial Intelligence (IJCAI) 2018. The paper is available at http://www.ijcai.org/proceedings/2018/0486.pdf. Folds for the Pima dataset have been uploaded as a test data set for this code. For any queries related to the code, drop an email to sxd170431@utdallas.edu.
Run instructions:
Run active_feature_acq.py as follows:

```
active_feature_acq.py "D:\git_projects\Active-Feature-Elicitation\Data" Pima Pima_clean GB KL 5 5 14 5 10 5 6 3 0
```
Details of the command-line arguments, in order:
- path: path to the Data folder (here, "D:\git_projects\Active-Feature-Elicitation\Data")
- Name of the folder ("Pima") where all files are kept
- Data set name ("Pima_clean")
- Classifier used: GB (Gradient Boosting), LR (Logistic Regression), SVM (Support Vector Machine), DT (Decision Tree). You can add your own classifier in Classifier.py; see the first sketch after this list.
- Divergence metric used: KL (KL-divergence), HD (Hellinger distance), TV (Total Variation), CD (Chi-squared distance); see the second sketch after this list.
- Number of observed features. This is 1 + the actual number of observed features: the first column of the Pima_clean data set contains unique identifiers, which are included in the count for convenience but are not used while learning the models.
- Number of data points to be acquired in each iteration of active learning
- Number of iterations of data point acquisition in each run
- Number of runs
- Initial size of the observed set
- test-split: 1/test_split is the fraction of the data held out as the test set. For example, test-split=5 gives test_size=0.2, i.e., an 80-20 train-test split.
- Depth of the Gradient Boosting classifier on all the features
- Depth of the Gradient Boosting classifier on the observed features
- pos_to_neg ratio: this parameter is currently unused by the code; set it to 0.
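
As a reference for extending Classifier.py, here is a minimal sketch of how the classifier codes above could map to scikit-learn models. `make_classifier` and its defaults are illustrative assumptions, not the repository's actual interface:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def make_classifier(code, depth=3):
    """Return an unfitted model for a classifier code such as 'GB' or 'DT'.

    Illustrative sketch only; the actual contents of Classifier.py may differ.
    """
    classifiers = {
        "GB": GradientBoostingClassifier(max_depth=depth),
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(probability=True),  # probability estimates are needed for divergence scores
        "DT": DecisionTreeClassifier(max_depth=depth),
    }
    return classifiers[code]
```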
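Likewise, a sketch of the four divergence options, assuming each compares two discrete distributions (e.g. class-probability vectors). The function names and the chi-squared convention are assumptions, not the repository's code:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between discrete distributions; eps avoids log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def hellinger_distance(p, q):
    """Hellinger distance: (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return float(0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum())

def chi_squared_distance(p, q, eps=1e-12):
    """Chi-squared distance under one common convention: sum((p - q)^2 / (p + q))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2 / (p + q + eps)))
```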
Important instructions:
- The folder structure should match the one in the Data folder.
- All experiments in the IJCAI paper use the Gradient Boosting classifier with KL-divergence as the distance metric, so these are the folders currently present.
- Every classifier and distance metric has its own folder where results are stored. To try a new classifier, create a new folder with the desired name.
- The code runs both the proposed algorithm and the Random baseline described in the paper.
- In the All_runs folder, Pima_clean_stat.csv gives the performance metric for all runs. In the same folder, *_sugg.csv contains the individual values of all iterations across all runs, and *_rnd.csv contains the corresponding details for the Random baseline; see the loading sketch at the end of this list.
- The Graphs folder will contain the generated graphs.
- The 5 folds for the experiments have already been created and stored in the Fold* folders.
- The parameter setting shown above is the one reported in the paper; the parameters are also listed in the settings.txt file.
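
To inspect the results, something along these lines should work, assuming Pima_clean_stat.csv is a plain CSV and the folder layout described above; the path is illustrative and depends on your classifier/metric choice:

```python
import pandas as pd

# Illustrative path: adjust the classifier/metric folders to match your run.
stats = pd.read_csv("Data/Pima/GB/KL/All_runs/Pima_clean_stat.csv")
print(stats.head())      # per-run performance metrics
print(stats.describe())  # summary statistics across runs
```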