This is obviously part of something bigger, but it is worth breaking out for reuse.
require 'stopwords'
# List all stop words
Stopwords::STOP_WORDS
# Test whether a token is a stop word
Stopwords.is?('and')
# => true
# Check that a token is both a 'word' and not a stop word
Stopwords.valid?('vector')
# => true
$ rake specs
Sanitizing is not part of the library, but you should probably sanitize tokens before using them (if your tokenizer doesn’t already):
SANITIZE_REGEXP = /('|\"|‘|’|\/|\\)/
text.downcase.gsub(SANITIZE_REGEXP, '')
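To show how sanitizing and stop word filtering might fit together, here is a minimal sketch. It does not load the gem: `SAMPLE_STOP_WORDS` and `content_tokens` are hypothetical names invented for this example, and the short word list is only a stand-in for the library's own list. Splitting on whitespace is also an assumption; your tokenizer may work differently.

```ruby
# Same pattern as the sanitizing regexp shown in this README.
SANITIZE_REGEXP = /('|\"|‘|’|\/|\\)/

# Hypothetical stand-in list so the sketch runs without the gem installed.
SAMPLE_STOP_WORDS = %w[a an and is of the to].freeze

# Lowercase the text, strip quotes and slashes, split on whitespace,
# then drop every token found in the stand-in stop word list.
def content_tokens(text)
  text.downcase
      .gsub(SANITIZE_REGEXP, '')
      .split
      .reject { |token| SAMPLE_STOP_WORDS.include?(token) }
end

content_tokens("The ‘vector’ and the \"matrix\"")
```

With the gem loaded, the `reject` block could presumably be swapped for a `select` on `Stopwords.valid?`, which also checks that the token is a 'word'.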
Software Services shop (primarily Ruby) in Brooklyn, NY.