A collection of document-analysis tools for Dart.
In your Dart (or Flutter) project's pubspec.yaml, add the dependency:

```yaml
dependencies:
  ...
  document_analysis: ^[latest version]
```
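Then import the library; the path below is the conventional one for a pub package with this name (an assumption, check the package's API docs):

```dart
import 'package:document_analysis/document_analysis.dart';
```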
Because this package is aimed at document analysis, distance measurements must be normalized to the range 0-1 (unlike raw Euclidean distance). Currently available distance measurements:
Call: `jaccardDistance(vector1, vector2)` or `cosineDistance(vector1, vector2)`

- Input vectors are `List<double>`
Usage:

```dart
List<double> vector1 = [0, 1, 1.5, 3, 2, 0.5];
List<double> vector2 = [1, 3, 3.5, 4, 0.5, 0];
print("Jaccard: ${jaccardDistance(vector1, vector2)}"); // 0.333...
print("Cosine: ${cosineDistance(vector1, vector2)}"); // 0.156...
```
Document-similarity functions build on the distance measurements above. Currently available:
Call: `wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance)`

- `doc1`, `doc2`: input documents (`String`)
- `distanceFunction`: vector distance measurement, `(vector1, vector2) => double`
Usage:

```dart
String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";
print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: jaccardDistance)}"); // 0.769...
print("${wordFrequencySimilarity(doc1, doc2, distanceFunction: cosineDistance)}"); // 0.870...
```
Word-vector matrices from a collection of documents. Currently available:
- Word Frequency
- TF-IDF
- Hybrid TF-IDF
Call: `wordFrequencyMatrix([doc1, doc2])`

- `[...]`: all documents, `List<String>`
Usage:

```dart
String doc1 = "Report: Xiaomi topples Fitbit and Apple as world's largest wearables vendor";
String doc2 = "Xiaomi topples Fitbit and Apple as world's largest wearables vendor: Strategy Analytics";
print(wordFrequencyMatrix([doc1, doc2]));
// [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
```
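Each row is one document's vector over the 13 distinct words found in the collection, so rows can be fed directly into the distance measurements above. Assuming similarity = 1 - distance as sketched earlier, the expected values follow from the similarity examples:

```dart
final matrix = wordFrequencyMatrix([doc1, doc2]);
print(jaccardDistance(matrix[0], matrix[1])); // 0.230..., i.e. 1 - 0.769...
print(cosineDistance(matrix[0], matrix[1])); // 0.129..., i.e. 1 - 0.870...
```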
Tokenize documents (`String`) into multiple metrics:
Call: `documentTokenizer(List<String> documentList, {minLen = 1, String Function(String) stemmer, List<String> stopwords})`

- `documentList`: all documents in a `List`
- `minLen`: minimum word occurrence to be considered in tokenization
- `stemmer`: stemming function
- `stopwords`: collection of common words that should be ignored in document analysis
Usage:

```dart
documentTokenizer([doc1, doc2, doc3]);
```

Returns a `TokenizationOutput`:
```dart
class TokenizationOutput {
  /// Count for each word across all documents
  Map<String, double> bagOfWords = {};

  /// How often a certain word occurs across all documents (unique word occurrence - max 1 per document)
  Map<String, double> wordInDocumentOccurrence = {};

  /// List of 'Bag of Words' for each document
  List<Map<String, double>> documentBOW = [];

  /// Total number of words in each document
  List<int> documentTotalWord = [];

  /// Total distinct words in all documents
  int numberOfDistintWords = 0;

  /// Total number of words in all documents
  int totalNumberOfWords = 0;
}
```
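These fields map naturally onto the Hybrid TF-IDF weighting from the paper cited below, where term frequency is computed over the whole collection rather than per document. A hypothetical sketch of that weight for one word (the helper is not part of the package, and the formula is a summary of the paper's definition):

```dart
import 'dart:math';

// Hypothetical helper: Hybrid TF-IDF weight of a word, following
// Sharifi et al. (2010): collection-wide term frequency times log2
// inverse document frequency.
double hybridTfIdfWeight(String word, TokenizationOutput t, int documentCount) {
  final tf = t.bagOfWords[word]! / t.totalNumberOfWords;
  final idf = log(documentCount / t.wordInDocumentOccurrence[word]!) / ln2;
  return tf * idf;
}
```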
Stemmer & Stopwords passable from document_similarity --> matrix_creator --> tokenizer.
- Hybrid TF-IDF is based on:
Sharifi, B., Hutton, M.-A., & Kalita, J. K. (2010). Experiments in microblog summarization. In 2010 IEEE Second International Conference on Social Computing (SocialCom), pp. 49-56.