Update NAB to Python 3 (numenta#353)

* Convert NAB and most detectors to Python 3
* Refactor remaining Python 2 dependant detectors.
  - Move twitterADVec documentation and script from wiki+gist to repository.
* Update main README.md to reflect changes following NAB port
* Disambiguate -d, --detect, and --detectors by readding --detect argument (See: numenta#346)
* Remove duplicate files from htmjava detector dir
* add plot and build, dist to gitignore
* NAB: py2 numenta detector: update install instructions
* Add executable bit to gradlew
* Reduce problematic wait time value in runner.py scripts
* Added RCF runtime to list of offline detectors.
* Fixed versions, added CHANGELOG
* Specify python 3.7
* Added dockerfile for python 2 detectors
* Added Conda stuff
* Conda install nab dev mode
ndnchapathi69 · Oct 21, 2019 · a9853ca
1 parent 4346b67

Showing 70 changed files with 4,228 additions and 391 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,9 +1,11 @@
-
+*.bak
 .DS_Store
 *.pyc
 
 data_test/
 deprecated/
+build/
+dist/
 nab.egg-info/
 .idea/
 .project
@@ -13,5 +15,5 @@ nab/detectors/htmjava/.pydevproject
 scripts/.ipynb_checkpoints/
 
 # Generated files
-
+plot_*
 *resultsSummary*
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,14 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+## [v1.1] - 2019-09-12
+### Updated runtime to Python 3
+- Moved python 2 runtimes into independent detectors.
+- Updated documentation and examples.
+
+## [v1.0] - 2017-04-26
+### Initial release
+- Established proper python program setup.
+
+## [v0.8] - 2015-09-04
+### Initial tag for scoreboard
diff --git a/Dockerfile.py27 b/Dockerfile.py27
@@ -0,0 +1,26 @@
+FROM numenta/nupic
+
+# Plus Java so we can run HTM.Java as well
+RUN wget https://d3pxv6yz143wms.cloudfront.net/8.212.04.2/java-1.8.0-amazon-corretto-jdk_8.212.04-2_amd64.deb && \
+    apt-get update &&  apt-get install java-common && apt-get install -y --no-install-recommends apt-utils && \
+    dpkg --install java-1.8.0-amazon-corretto-jdk_8.212.04-2_amd64.deb
+
+ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-amazon-corretto
+ENV PATH $JAVA_HOME/bin:$PATH
+
+ENV NAB /usr/local/src/nab
+
+ADD . $NAB
+
+# Run Numenta detectors
+RUN echo "Running numenta detectors in Python 2.7..."
+WORKDIR $NAB/nab/detectors/numenta
+RUN python -m pip install -r requirements.txt
+RUN python run.py --skipConfirmation
+
+# Run HTM.Java detector
+RUN echo "Running HTM.Java detector in Java 8 / Python 2.7..."
+WORKDIR $NAB/nab/detectors/htmjava/nab/detectors/htmjava
+RUN ./gradlew clean build
+WORKDIR $NAB/nab/detectors/htmjava
+RUN python run.py --skipConfirmation
diff --git a/README.md b/README.md
@@ -1,28 +1,31 @@
-The Numenta Anomaly Benchmark [![Build Status](https://travis-ci.org/numenta/NAB.svg?branch=master)](https://travis-ci.org/numenta/NAB)
+The Numenta Anomaly Benchmark (NAB) [![Build Status](https://travis-ci.org/numenta/NAB.svg?branch=master)](https://travis-ci.org/numenta/NAB)
 -----------------------------
 
-Welcome. This repository contains the data and scripts comprising the Numenta
-Anomaly Benchmark (NAB). NAB is a novel benchmark for evaluating
+Welcome. This repository contains the data and scripts which comprise the
+Numenta Anomaly Benchmark (NAB) v1.1. NAB is a novel benchmark for evaluating
 algorithms for anomaly detection in streaming, real-time applications. It is
-comprised of over 50 labeled real-world and artificial timeseries data files plus a
-novel scoring mechanism designed for real-time applications.
-
-Included are the tools to allow you to easily run NAB on your
-own anomaly detection algorithms; see the [NAB entry points
-info](https://github.com/numenta/NAB/wiki#nab-entry-points). Competitive results
-tied to open source code will be posted in the wiki on the
-[Scoreboard](https://github.com/numenta/NAB/wiki/NAB%20Scoreboard). Let us know
-about your work by emailing us at [nab@numenta.org](mailto:nab@numenta.org) or
+composed of over 50 labeled real-world and artificial timeseries data files
+plus a novel scoring mechanism designed for real-time applications.
+
+Included are the tools to allow you to run NAB on your own anomaly detection
+algorithms; see the [NAB entry points
+info](https://github.com/numenta/NAB/wiki/NAB-Entry-Points). Competitive
+results tied to open source code will be posted on the
+[Scoreboard](https://github.com/numenta/NAB#scoreboard). Let us know about
+your work by emailing us at [nab@numenta.org](mailto:nab@numenta.org) or
 submitting a pull request.
 
-This readme is a brief overview and contains details for setting up NAB. Please
-refer to the following for more details about NAB scoring, data, and motivation:
+This readme is a brief overview and contains details for setting up NAB.
+Please refer to the following for more details about NAB scoring, data, and
+motivation:
 
 - [Unsupervised real-time anomaly detection for streaming data](http://www.sciencedirect.com/science/article/pii/S0925231217309864) - The main paper, covering NAB and Numenta's HTM-based anomaly detection algorithm
 - [NAB Whitepaper](https://github.com/numenta/NAB/wiki#nab-whitepaper)
 - [Evaluating Real-time Anomaly Detection Algorithms](http://arxiv.org/abs/1510.03336) - Original publication of NAB
 
-We encourage you to publish your results on running NAB, and share them with us at [nab@numenta.org](nab@numenta.org). Please cite the following publication when referring to NAB:
+We encourage you to publish your results on running NAB, and share them with
+us at [nab@numenta.org](nab@numenta.org). Please cite the following
+publication when referring to NAB:
 
 Ahmad, S., Lavin, A., Purdy, S., & Agha, Z. (2017). Unsupervised real-time
 anomaly detection for streaming data. Neurocomputing, Available online 2 June
@@ -59,27 +62,29 @@ The NAB scores are normalized such that the maximum possible is 100.0 (i.e. the
 
 \**** We have included the results for RCF using an [AWS proprietary implementation](https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html); even though the algorithm code is not open source, the [algorithm description](http://proceedings.mlr.press/v48/guha16.pdf) is public and the code we used to run [NAB on RCF](nab/detectors/random_cut_forest) is open source.
 
-
 &dagger; Algorithm was an entry to the [2016 NAB Competition](http://numenta.com/blog/2016/08/10/numenta-anomaly-benchmark-nab-competition-2016-winners/).
 
-Please see [the wiki section on contributing algorithms](https://github.com/numenta/NAB/wiki/NAB-Contributions-Criteria#anomaly-detection-algorithms) for discussion on posting algorithms to the scoreboard.
+Please see [the wiki section on contributing
+algorithms](https://github.com/numenta/NAB/wiki/NAB-Contributions-Criteria#anomaly-detection-algorithms)
+for discussion on posting algorithms to the scoreboard.
 
 #### Corpus
 
-The NAB corpus of 58 timeseries data files is designed to provide data for research
-in streaming anomaly detection. It is comprised of both
-real-world and artifical timeseries data containing labeled anomalous periods of behavior.
+The NAB corpus of 58 timeseries data files is designed to provide data for
+research in streaming anomaly detection. It is comprised of both real-world
+and artifical timeseries data containing labeled anomalous periods of
+behavior.
 
 The majority of the data is real-world from a variety of sources such as AWS
 server metrics, Twitter volume, advertisement clicking metrics, traffic data,
-and more. All data is included in the repository, with more details in the [data
-readme](https://github.com/numenta/NAB/tree/master/data). We are in the process
-of adding more data, and actively searching for more data. Please contact us at
-[nab@numenta.org](mailto:nab@numenta.org) if you have similar data (ideally with
-known anomalies) that you would like to see incorporated into NAB.
+and more. All data is included in the repository, with more details in the
+[data readme](https://github.com/numenta/NAB/tree/master/data). Please
+contact us at [nab@numenta.org](mailto:nab@numenta.org) if you have similar
+data (ideally with known anomalies) that you would like to see incorporated
+into NAB.
 
-The NAB version will be updated whenever new data (and corresponding labels) is
-added to the corpus; NAB is currently in v1.0.
+The NAB version will be updated whenever new data (and corresponding labels)
+is added to the corpus or other significant changes are made.
 
 #### Additional Scores
 
@@ -96,8 +101,6 @@ run without likelihood, set the variable `self.useLikelihood` in
 to `False`.
 
 
-
-
 | Detector      |Standard Profile | Reward Low FP | Reward Low FN |
 |---------------|---------|------------------|---------------|
 | Numenta HTMusing NuPIC v0.5.6*   | 70.1             | 63.1       | 74.3          |
@@ -110,66 +113,57 @@ to `False`.
 
 &dagger; Algorithm was an entry to the [2016 NAB Competition](http://numenta.com/blog/2016/08/10/numenta-anomaly-benchmark-nab-competition-2016-winners/).
 
-Installing NAB 1.0
-------------------
+Installing NAB
+--------------
 
 ### Supported Platforms
 
 - OSX 10.9 and higher
 - Amazon Linux (via AMI)
 
-Other platforms may work but have not been tested.
-
+Other platforms may work. NAB has been tested on Windows 10 but is not
+officially supported.
 
 ### Initial requirements
 
 You need to manually install the following:
 
-- [Python 2.7](https://www.python.org/download/)
+- [Python 3.6](https://www.python.org/download/)
 - [pip](https://pip.pypa.io/en/latest/installing.html)
 - [NumPy](http://www.numpy.org/)
-- [NuPIC](http://www.github.com/numenta/nupic) (only required if running the Numenta detector)
 
-##### Download this repository
+#### Download this repository
 
 Use the Github links provided in the right sidebar.
 
-##### Install the Python requirements
-
-    cd NAB
-    (sudo) pip install -r requirements.txt
-
-This will install the required modules.
-
-##### Install NAB
-
-Recommended:
+#### Install NAB
 
-	pip install . --user
+##### Pip:
 
+From inside the checkout directory:
 
-> Note: If NuPIC is not already installed, the version specified in
-`NAB/requirements.txt` will be installed. If NuPIC is already installed, it
- will not be re-installed.
-
+    pip install -r requirements.txt
+	  pip install . --user
 
 If you want to manage dependency versions yourself, you can skip dependencies
 with:
 
     pip install . --user --no-deps
 
-
 If you are actively working on the code and are familiar with manual
 PYTHONPATH setup:
 
-	pip install -e . --install-option="--prefix=/some/other/path/"
+	  pip install -e . --install-option="--prefix=/some/other/path/"
 
+##### Anaconda:
+
+    conda env create
 
 ### Usage
 
 There are several different use cases for NAB:
 
-1. If you just want to look at all the results we reported in the paper, there
+1. If you want to look at all the results we reported in the paper, there
 is no need to run anything. All the data files are in the data subdirectory and
 all individual detections for reported algorithms are checked in to the results
 subdirectory. Please see the README files in those locations.
@@ -178,31 +172,28 @@ subdirectory. Please see the README files in those locations.
 `scripts` directory for `scripts/plot.py`
 
 1. If you have your own algorithm and want to run the NAB benchmark, please see
-the [NAB Entry Points](https://github.com/numenta/NAB/wiki#nab-entry-diagram)
+the [NAB Entry Points](https://github.com/numenta/NAB/wiki/NAB-Entry-Points)
 section in the wiki. (The easiest option is often to simply run your algorithm
 on the data and output results in the CSV format we specify. Then run the NAB
 scoring algorithm to compute the final scores. This is how we scored the Twitter
 algorithm, which is written in R.)
 
-1. If you are a NuPIC user and just want to run the Numenta HTM detector follow
+1. If you are a NuPIC user and want to run the Numenta HTM detector follow
 the directions below to "Run HTM with NAB".
 
 1. If you want to run everything including the bundled Skyline detector follow
 the directions below to "Run full NAB". Note that this will take hours as the
 Skyline code is quite slow.
 
-1. If you just want to run NAB on one or more data files (e.g. for debugging)
+1. If you want to run NAB on one or more data files (e.g. for debugging)
 follow the directions below to "Run a subset of NAB".
 
-
-##### Run HTM with NAB
-
-First make sure NuPIC is installed and working properly. Then:
+##### Run a detector on NAB
 
     cd /path/to/nab
-    python run.py -d numenta --detect --optimize --score --normalize
+    python run.py -d expose --detect --optimize --score --normalize
 
-This will run the Numenta detector only and produce normalized scores. Note that
+This will run the EXPoSE detector only and produce normalized scores. Note that
 by default it tries to use all the cores on your machine. The above command
 should take about 20-30 minutes on a current powerful laptop with 4-8 cores.
 For debugging you can run subsets of the data files by modifying and specifying
@@ -212,27 +203,27 @@ specific label files (see section below). Please type:
 
 to see all the options.
 
-Note that to replicate results exactly as in the paper you may need to checkout
-the specific version of NuPIC (and associated nupic.core) that is noted in the
-[Scoreboard](https://github.com/numenta/NAB/wiki/NAB%20Scoreboard):
+##### Running non-Python 3 detectors
+
+NAB is a Python 3 framework, and can only integrate Python 3 detectors. The following detectors must be run outside the NAB runtime and integrated for scoring in a later step. These detectors include:
 
-    cd /path/to/nupic/
-    git checkout -b nab {TAG NAME}
-    cd /path/to/nupic.core/
-    git checkout -b nab {TAG NAME}
+    numenta (Python 2)
+    numentaTM (Python 2)
+    htmjava (Python 2 / Java)
+    twitterADVec (R)
+    random_cut_forest (AWS Kinesis Analytics)
+
+Instructions on how to run the each detector in their native environment can be found in the `nab/detectors/${name}` directory.
 
 ##### Run full NAB
 
     cd /path/to/nab
     python run.py
 
-This will run everything and produce results files for all anomaly detection
-methods. Several algorithms are included in the repo, such as the Numenta
-HTM anomaly detection method, as well as methods from the [Etsy
-Skyline](https://github.com/etsy/skyline) anomaly detection library, a sliding
-window detector, Bayes Changepoint, and so on. This will also pass those results
-files to the scoring script to generate final NAB scores. **Note**: this option
-will take many many hours to run.
+This will run all detectors available in this repository and produce results
+files. To run non-Python3 detectors see "Running non-Python3 detectors" above.
+
+**Note**: this option may take many many hours to run.
 
 ##### Run subset of NAB data files
 
@@ -248,7 +239,7 @@ are interested in.
 NAB on a subset of labels:
 
     cd /path/to/nab
-    python run.py -d numenta --detect --windowsFile labels/combined_windows_tiny.json
+    python run.py -d expose --detect --windowsFile labels/combined_windows_tiny.json
 
 This will run the `detect` phase of NAB on the data files specified in the above
 JSON file. Note that scoring and normalization are not supported with this

diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,20 @@
+name: NAB
+channels:
+  - defaults
+  - conda-forge
+
+dependencies:
+  - python=3.6
+  - pip
+
+  # See requirements.txt
+  - pandas==0.20.3
+  - simplejson==3.11.1
+  - boto3==1.9.134
+  - scikit-learn==0.21.1
+
+  - pip:
+    - boto3
+    - botocore
+    # Install NAB in development mode
+    - -e .
diff --git a/nab/corpus.py b/nab/corpus.py
@@ -157,7 +157,7 @@ def addColumn(self, columnName, data, write=False):
                                   modificiations or not.
     """
 
-    for relativePath in self.dataFiles.keys():
+    for relativePath in list(self.dataFiles.keys()):
       self.dataFiles[relativePath].modifyData(
         columnName, data[relativePath], write=write)
 
@@ -172,7 +172,7 @@ def removeColumn(self, columnName, write=False):
     @param write        (boolean) Flag to decide whether to write corpus
                                   modificiations or not.
     """
-    for relativePath in self.dataFiles.keys():
+    for relativePath in list(self.dataFiles.keys()):
       self.dataFiles[relativePath].modifyData(columnName, write=write)
 
   def copy(self, newRoot=None):
@@ -184,13 +184,13 @@ def copy(self, newRoot=None):
     if newRoot[-1] != os.path.sep:
       newRoot += os.path.sep
     if os.path.isdir(newRoot):
-      print "directory already exists"
+      print("directory already exists")
       return None
     else:
       createPath(newRoot)
 
     newCorpus = Corpus(newRoot)
-    for relativePath in self.dataFiles.keys():
+    for relativePath in list(self.dataFiles.keys()):
       newCorpus.addDataSet(relativePath, self.dataFiles[relativePath])
     return newCorpus
 
@@ -224,7 +224,7 @@ def getDataSubset(self, query):
                                       datafile.
     """
     ans = {}
-    for relativePath in self.dataFiles.keys():
+    for relativePath in list(self.dataFiles.keys()):
       if query in relativePath:
         ans[relativePath] = self.dataFiles[relativePath]
     return ans
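
Note on the `list(self.dataFiles.keys())` changes in `nab/corpus.py`: in Python 3, `dict.keys()` returns a live view rather than a list, so adding or removing entries while iterating over it raises `RuntimeError`. Wrapping the call in `list(...)` takes a snapshot of the keys first. The loops shown in this diff do not appear to add or remove keys in `self.dataFiles`, so the wrapper is most likely a conservative, behavior-preserving rewrite of the kind that tools such as `2to3` emit; the commit does not state whether such a tool was used. The sketch below is illustrative only: `drop_matching` and `drop_matching_unsafe` are hypothetical helpers, not part of NAB.

```python
def drop_matching(data_files, query):
    """Delete every entry whose key contains `query`; return the removed keys.

    Hypothetical helper, not part of NAB.
    """
    removed = []
    # list(...) snapshots the keys, so deleting inside the loop is safe.
    for relative_path in list(data_files.keys()):
        if query in relative_path:
            del data_files[relative_path]
            removed.append(relative_path)
    return removed


def drop_matching_unsafe(data_files, query):
    """Same loop without the snapshot (hypothetical, shown for contrast).

    On Python 3 this raises RuntimeError once an entry has been deleted
    and the iterator is advanced again.
    """
    for relative_path in data_files.keys():
        if query in relative_path:
            del data_files[relative_path]


if __name__ == "__main__":
    files = {"a/x.csv": 1, "b/y.csv": 2, "a/z.csv": 3}
    print(drop_matching(files, "a/"))
    print(files)

    try:
        drop_matching_unsafe({"a/x.csv": 1, "b/y.csv": 2}, "a/")
    except RuntimeError as err:
        print("RuntimeError:", err)
```

When a loop only reads the dictionary, iterating over it directly (`for relativePath in self.dataFiles:`) behaves the same and avoids the extra copy; keeping `list(...)` costs little and stays safe if the loop body later starts mutating the dictionary.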