elegans-io/clustering

Calculate LDA with Spark MLlib from a dataset stored on Elasticsearch

Build

Note: works with Java 7 and 8 (not with JDK 9) and Elasticsearch 5.2.2

sbt package

Generate a fat JAR:

export JAVA_OPTS="-Xms256m -Xmx4g"
sbt assembly

Classes

io.elegans.clustering.CalcLDA

Calculates LDA with data from an Elasticsearch index.
Usage: LDA with ES data [options]

  --help                   prints this usage text
  --inputfile <value>      the file with sentences (one per line), when specified elasticsearch is not used  default: None
  --hostname <value>       the hostname of the elasticsearch instance  default: localhost
  --port <value>           the port of the elasticsearch instance  default: 9200
  --group_by_field <value>
                           group the search results by field e.g. conversation, None => no grouping  default: None
  --search_path <value>    the search path on elasticsearch e.g. <index name>/<type name>  default: jenny-en-0/question
  --query <value>          a json string with the query  default: { "fields":["question", "answer", "conversation", "index_in_conversation", "_id" ] }
  --min_k <value>          min number of topics. default: 8
  --max_k <value>          max number of topics. default: 10
  --maxTermsPerTopic <value>
                           the max number of terms per topic. default: 10
  --maxIterations <value>  number of iterations of learning. default: 100
  --stopwordsFile <value>  filepath for a list of stopwords. Note: This must fit on a single machine.  default: Some(stopwords/en_stopwords.txt)
  --used_fields <value>    list of fields to use for LDA; if more than one, they will be merged  default: List(question, answer)
  --outputDir <value>      where to store the output files (topics and documents per topic)  default: /tmp
  --max_topics_per_doc <value>
                           write the first n topic classifications per document  default: 10
  --strInBase64            specify if the text is encoded in base64 (only supported when loading from file)  default: false
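
For orientation, here is a minimal sketch of the MLlib LDA step these options drive, assuming a plain-text input (as with --inputfile) instead of Elasticsearch; the paths, tokenization and object name are illustrative only, not the program's actual code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object LdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-sketch").setMaster("local[*]"))

    // One sentence per line, naively tokenized (hypothetical input path).
    val docs: RDD[Seq[String]] = sc.textFile("/tmp/sentences.txt")
      .map(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq).cache()

    // Vocabulary: term -> column index.
    val vocab: Map[String, Int] = docs.flatMap(identity).distinct.collect.zipWithIndex.toMap

    // Term-count vector per document, keyed by a document id.
    val corpus: RDD[(Long, Vector)] = docs.zipWithIndex.map { case (terms, id) =>
      val counts = terms.groupBy(identity).map { case (t, occ) => (vocab(t), occ.size.toDouble) }
      (id, Vectors.sparse(vocab.size, counts.toSeq))
    }.cache()

    val k = 10 // the real program loops over k in [min_k, max_k]
    val model = new LDA().setK(k).setMaxIterations(100).run(corpus)
      .asInstanceOf[DistributedLDAModel] // the default EM optimizer returns this type

    // The quantities reported in the Output section below.
    println(s"K($k) LOGLIKELIHOOD(${model.logLikelihood}) LOG_PRIOR(${model.logPrior})")
    val topics = model.describeTopics(maxTermsPerTopic = 10) // (term indices, weights) per topic
  }
}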

Output

For each topic number (from min_k to max_k) the program will produce:

  • a list of topics with terms and weights -> (num_of_topics, topic_id, term_id, term, weight) e.g. (10,0,1862,help,0.12943777743269205)
  • the topic classification for all documents -> (num_of_topics, doc_id, topic_id, weight) e.g. (10,841,1,0.21606734702770736)

For instance, if min_k=5 and max_k=10 the program will produce the following directories:

TOPICS_K.10_DN.19699_VS.21974  TOPICS_K.7_DN.19699_VS.21974  TOPICSxDOC_K.10_DN.19699_VS.21974  TOPICSxDOC_K.7_DN.19699_VS.21974
TOPICS_K.5_DN.19699_VS.21974   TOPICS_K.8_DN.19699_VS.21974  TOPICSxDOC_K.5_DN.19699_VS.21974   TOPICSxDOC_K.8_DN.19699_VS.21974
TOPICS_K.6_DN.19699_VS.21974   TOPICS_K.9_DN.19699_VS.21974  TOPICSxDOC_K.6_DN.19699_VS.21974   TOPICSxDOC_K.9_DN.19699_VS.21974

where:

  • TOPICS_K.10_DN.19699_VS.21974 will contain the 10 topics calculated from 19699 documents with a vocabulary of 21974 entries
  • TOPICSxDOC_K.10_DN.19699_VS.21974 will contain the topic classification for all 19699 documents with the same vocabulary of 21974 entries

The program prints the log likelihood and the log prior on standard output, e.g.:

K(5) LOGLIKELIHOOD(-4844389.232238058) LOG_PRIOR(-184.60585976284716) AVGLOGLIKELIHOOD(-245.9205661321924) AVGLOG_PRIOR(-0.009371331527633238) NumDocs(19699) VOCABULAR_SIZE(21974)
K(6) LOGLIKELIHOOD(-4833607.107354659) LOG_PRIOR(-207.6231790028576) AVGLOGLIKELIHOOD(-245.3732223643159) AVGLOG_PRIOR(-0.010539782679468887) NumDocs(19699) VOCABULAR_SIZE(21974)
K(7) LOGLIKELIHOOD(-4808667.540709441) LOG_PRIOR(-227.87874643463218) AVGLOGLIKELIHOOD(-244.1071902487152) AVGLOG_PRIOR(-0.011568036267558363) NumDocs(19699) VOCABULAR_SIZE(21974)
K(8) LOGLIKELIHOOD(-4778600.550597267) LOG_PRIOR(-244.18339819000468) AVGLOGLIKELIHOOD(-242.58086961760836) AVGLOG_PRIOR(-0.012395725579471276) NumDocs(19699) VOCABULAR_SIZE(21974)
K(9) LOGLIKELIHOOD(-4765956.969051744) LOG_PRIOR(-260.1458429346249) AVGLOGLIKELIHOOD(-241.9390308671376) AVGLOG_PRIOR(-0.013206043095315747) NumDocs(19699) VOCABULAR_SIZE(21974)
K(10) LOGLIKELIHOOD(-4758092.727191508) LOG_PRIOR(-275.11088412794464) AVGLOGLIKELIHOOD(-241.53981050771654) AVGLOG_PRIOR(-0.013965728419104758) NumDocs(19699) VOCABULAR_SIZE(21974)
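
AVGLOGLIKELIHOOD and AVGLOG_PRIOR are the corresponding totals divided by the number of documents, e.g. for K(5): -4844389.23 / 19699 ≈ -245.92.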

Run the Spark job

sbt "sparkSubmit --class io.elegans.clustering.CalcLDA -- --hostname <es hostname> --group_by_field <field for grouping> --search_path <index_name/type_name> --min_k <min_topics> --max_k <max_topics> --stopwordFile </path/of/stopword_list.txt> --outputDir </path/of/existing/empty/directory>"

e.g.

sbt "sparkSubmit --class io.elegans.clustering.CalcLDA -- --hostname elastic-0.getjenny.com --group_by_field conversation --search_path english/question --min_k 10 --max_k 30 --stopwordFile /tmp/english_stopwords.txt --outputDir /tmp/lda_results"

Run the program using the fat JAR:

spark-submit --class io.elegans.clustering.CalcLDA ./target/scala-2.11/clustering-assembly-master.jar --help

You might want to pass an option like --driver-memory 8g before --class ...

io.elegans.clustering.KMeansW2VClustering

Calculates clusters with data from an Elasticsearch index using a word2vec (w2v) representation of phrases.
Usage: Clustering with ES data [options]

  --help                   prints this usage text
  --inputfile <value>      the file with sentences (one per line), when specified elasticsearch is not used  default: None
  --hostname <value>       the hostname of the elasticsearch instance  default: localhost
  --port <value>           the port of the elasticsearch instance  default: 9200
  --group_by_field <value>
                           group the search results by field e.g. conversation, None => no grouping  default: None
  --search_path <value>    the search path on elasticsearch e.g. <index name>/<type name>  default: jenny-en-0/question
  --query <value>          a json string with the query  default: { "fields":["question", "answer", "conversation", "index_in_conversation", "_id" ] }
  --stopwordsFile <value>  filepath for a list of stopwords. Note: This must fit on a single machine.  default: None
  --used_fields <value>    list of fields to use; if more than one, they will be merged  default: List(question, answer)
  --outputDir <value>      where to store the output files  default: /tmp
  --min_k <value>          min number of topics. default: 8
  --max_k <value>          max number of topics. default: 10
  --maxIterations <value>  number of iterations of learning. default: 1000
  --inputW2VModel <value>  the input word2vec model
  --avg                    this flag enables averaging of the word vectors
  --tfidf                  this flag enables tf-idf term weighting
  --scale                  this flag enables scaling of the vectors to unit norm
  --strInBase64            specify if the text is encoded in base64 (only supported when loading from file)  default: false
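
A minimal sketch of the approach these options describe, assuming a word2vec model in Spark format and a plain-text input; the paths, tokenization and averaging are illustrative, not the program's actual code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

object KMeansW2VSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-w2v-sketch").setMaster("local[*]"))

    // Word2vec model in Spark format (e.g. produced by TrainW2V or W2VModelToSparkFormat).
    val vectors = Word2VecModel.load(sc, "/tmp/w2v_model_spark").getVectors // Map[String, Array[Float]]
    val dim = vectors.head._2.length

    // Represent each sentence as the average of its word vectors (the --avg idea).
    val points = sc.textFile("/tmp/sentences.txt")
      .map(_.toLowerCase.split("\\W+").filter(vectors.contains))
      .filter(_.nonEmpty)
      .map { terms =>
        val sum = new Array[Double](dim)
        terms.foreach { t =>
          val v = vectors(t)
          for (i <- 0 until dim) sum(i) += v(i)
        }
        Vectors.dense(sum.map(_ / terms.length))
      }.cache()

    // 10 clusters, up to 1000 iterations (cf. --max_k, --maxIterations).
    val model = KMeans.train(points, 10, 1000)
    println(s"WSSSE: ${model.computeCost(points)}")
  }
}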

io.elegans.clustering.TrainW2V

Trains a W2V model taking input data from ES.
Usage: Train a W2V model [options]

  --help                   prints this usage text
  --hostname <value>       the hostname of the elasticsearch instance  default: localhost
  --port <value>           the port of the elasticsearch instance  default: 9200
  --group_by_field <value>
                           group the search results by field e.g. conversation, None => no grouping  default: None
  --search_path <value>    the search path on elasticsearch e.g. <index name>/<type name>  default: jenny-en-0/question
  --query <value>          a json string with the query  default: { "fields":["question", "answer", "conversation", "index_in_conversation", "_id" ] }
  --stopwordFile <value>   filepath for a list of stopwords. Note: This must fit on a single machine.  default: Some(stopwords/en_stopwords.txt)
  --used_fields <value>    list of fields to use; if more than one, they will be merged  default: List(question, answer)
  --outputDir <value>      where to store the output files  default: /tmp
  --vector_size <value>    the vector size  default: 300
  --word_window_size <value>
                           the word window size  default: 5
  --learningRate <value>   the learningRate  default: 0.025
  --strInBase64            specify if the text is encoded in base64 (only supported when loading from file)  default: false
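
A minimal sketch of the MLlib training call behind these options, assuming a plain-text input; the paths and object name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object TrainW2VSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("train-w2v-sketch").setMaster("local[*]"))

    // Tokenized sentences; the real program builds these from the ES query results.
    val sentences = sc.textFile("/tmp/sentences.txt")
      .map(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

    val model = new Word2Vec()
      .setVectorSize(300)     // --vector_size
      .setWindowSize(5)       // --word_window_size
      .setLearningRate(0.025) // --learningRate
      .fit(sentences)

    model.save(sc, "/tmp/w2v_model_spark")
  }
}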

io.elegans.clustering.W2VModelToSparkFormat

Loads a word2vec model in textual format separated by spaces (<term> <v0> .. <vn>) and saves it in Spark format.
Usage: W2VModelToSparkFormat [options]

  --help               prints this usage text
  --inputfile <value>  the file with the model
  --outputdir <value>  the output directory where the model is saved in Spark format
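
A minimal sketch of such a conversion, assuming each line holds a term followed by its vector components separated by single spaces; the paths are illustrative:

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2VecModel

object W2VToSparkFormatSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("w2v-convert-sketch").setMaster("local[*]"))

    // Parse lines of the form: <term> <v0> ... <vn>
    val entries: Map[String, Array[Float]] =
      Source.fromFile("/tmp/w2v_textual.txt").getLines().map { line =>
        val fields = line.split(" ")
        (fields.head, fields.tail.map(_.toFloat))
      }.toMap

    new Word2VecModel(entries).save(sc, "/tmp/w2v_model_spark")
  }
}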

io.elegans.clustering.ReduceW2VModel

Generates a reduced W2V model by selecting only the vectors of the words used in the dataset.
Usage: Generate a reduced W2V model by selecting only the vectors of the words used in the dataset [options]

  --help                   prints this usage text
  --inputfile <value>      the file with sentences (one per line), when specified elasticsearch is not used  default: None
  --hostname <value>       the hostname of the elasticsearch instance  default: localhost
  --port <value>           the port of the elasticsearch instance  default: 9200
  --group_by_field <value>
                           group the search results by field e.g. conversation, None => no grouping  default: None
  --search_path <value>    the search path on elasticsearch e.g. <index name>/<type name>  default: jenny-en-0/question
  --query <value>          a json string with the query  default: { "fields":["question", "answer", "conversation", "index_in_conversation", "_id" ] }
  --stopwordsFile <value>  filepath for a list of stopwords. Note: This must fit on a single machine.  default: None
  --used_fields <value>    list of fields to use; if more than one, they will be merged  default: List(question, answer)
  --inputmodel <value>     the file with the model
  --outputfile <value>     the output file  default: /tmp/w2v_model.txt
  --strInBase64            specify if the text is encoded in base64 (only supported when loading from file)  default: false
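
A minimal sketch of the reduction, assuming a plain-text dataset (as with --inputfile) and a textual word2vec model; the paths and tokenization are illustrative:

import java.io.PrintWriter
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object ReduceW2VSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-w2v-sketch").setMaster("local[*]"))

    // Vocabulary actually used in the dataset.
    val used: Set[String] = sc.textFile("/tmp/sentences.txt")
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .distinct.collect.toSet

    // Keep only the lines of the textual model whose term occurs in the dataset.
    val out = new PrintWriter("/tmp/w2v_model.txt")
    Source.fromFile("/tmp/w2v_textual.txt").getLines()
      .filter(line => used.contains(line.takeWhile(_ != ' ')))
      .foreach(line => out.println(line))
    out.close()
  }
}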

io.elegans.clustering.GetW2VSimilarSentences

Performs a similarity search using cosine similarity as the distance function.
Usage: Search similar sentences [options]

  --help                   prints this usage text
  --inputfile <value>      the file with sentences (one per line), when specified elasticsearch is not used  default: None
  --hostname <value>       the hostname of the elasticsearch instance  default: localhost
  --port <value>           the port of the elasticsearch instance  default: 9200
  --group_by_field <value>
                           group the search results by field e.g. conversation, None => no grouping  default: None
  --search_path <value>    the search path on elasticsearch e.g. <index name>/<type name>  default: jenny-en-0/question
  --query <value>          a json string with the query  default: { "fields":["question", "answer", "conversation", "index_in_conversation", "_id" ] }
  --stopwordsFile <value>  filepath for a list of stopwords. Note: This must fit on a single machine.  default: None
  --used_fields <value>    list of fields to use; if more than one, they will be merged  default: List(question, answer)
  --outputDir <value>      where to store the output files  default: /tmp
  --inputW2VModel <value>  the input word2vec model
  --input_sentences <value>
                           the list of input sentences  default: ${defaultParams.input_sentences}
  --similarity_threshold <value>
                           cutoff threshold  default: ${defaultParams.similarity_threshold}
  --avg                    this flag enables averaging of the word vectors
  --tfidf                  this flag enables tf-idf term weighting
  --strInBase64            specify if the text is encoded in base64 (only supported when loading from file)  default: false
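
A minimal sketch of the similarity search with averaged vectors (--avg), assuming a word2vec model in Spark format; the paths, query sentence and threshold are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2VecModel

object SimilarSentencesSketch {
  // Average the word vectors of a tokenized sentence.
  def sentenceVector(terms: Seq[String], vectors: Map[String, Array[Float]], dim: Int): Array[Double] = {
    val known = terms.filter(vectors.contains)
    val sum = new Array[Double](dim)
    known.foreach { t =>
      val v = vectors(t)
      for (i <- 0 until dim) sum(i) += v(i)
    }
    if (known.isEmpty) sum else sum.map(_ / known.length)
  }

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("similar-sentences-sketch").setMaster("local[*]"))
    val vectors = Word2VecModel.load(sc, "/tmp/w2v_model_spark").getVectors
    val dim = vectors.head._2.length

    def tokenize(s: String): Seq[String] = s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
    val query = sentenceVector(tokenize("how can I reset my password"), vectors, dim)

    val threshold = 0.7 // --similarity_threshold
    sc.textFile("/tmp/sentences.txt")
      .map(s => (s, cosine(query, sentenceVector(tokenize(s), vectors, dim))))
      .filter { case (_, score) => score >= threshold }
      .collect().foreach { case (s, score) => println(f"$score%.3f $s") }
  }
}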

io.elegans.clustering.CalcTermFreq

Calculates term frequency.
Usage: Tokenize terms and count term frequency from a set of files [options]

  --help                   prints this usage text
  --inputfile <value>      the path e.g. tmp/dir  default: None
  --appname <value>        application name  default: CalcTerm freq
  --outputDir <value>      where to store the output files  default: /tmp/calc_freq_output
  --levels <value>         the directory depth at which to search for files, e.g. levels=2 => tmp/dir/*/*  default: 3
  --numpartitions <value>  the number of partitions reading files  default: 1000
  --minfreq <value>        remove words with a frequency less than minfreq  default: 0
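
A minimal sketch of the counting itself, assuming levels=3 under a hypothetical input path:

import org.apache.spark.{SparkConf, SparkContext}

object TermFreqSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("calc-term-freq-sketch").setMaster("local[*]"))

    val minFreq = 2
    // levels=3 expands to three wildcard levels under the input path (--levels).
    sc.textFile("/tmp/dir/*/*/*", 1000) // second argument ~ --numpartitions
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(term => (term, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, freq) => freq >= minFreq } // --minfreq
      .saveAsTextFile("/tmp/calc_freq_output")
  }
}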
