Releases will be numbered with the following semantic versioning format:

`<major>.<minor>.<patch>`

And constructed with the following guidelines:

- Breaking backward compatibility bumps the major (and resets the minor and patch)
- New additions without breaking backward compatibility bump the minor (and reset the patch)
- Bug fixes and misc changes bump the patch
BUG FIXES

- `ngram_collocations` did not properly merge the quanteda outputs, resulting in the `length` column being replicated multiple times. Additionally, `length` was integer whereas the other ngram measures are numeric, resulting in a data.table warning in `melt` (a minimal illustration of this warning appears after this list). Both of these issues have been addressed.
- `colo` did not copy a single term to the clipboard with quotes. See issue #50.
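For readers unfamiliar with the warning mentioned above, the following standalone sketch (plain data.table, not termco code; the column names are made up) reproduces the kind of type-coercion warning that mixing an integer `length` column with numeric measures produces in `melt`:

```r
## Illustrative only: melting measure columns of mixed types (integer + numeric)
## triggers data.table's coercion warning, which is what the fix avoids.
library(data.table)

dat <- data.table(
  ngram  = c("a b", "b c"),
  length = c(2L, 2L),   # integer, as `length` was prior to the fix
  pmi    = c(1.2, 0.8)  # numeric, like the other ngram measures
)

## Warns that the measure columns are not all of the same type and will be coerced
melted <- melt(dat, id.vars = "ngram", measure.vars = c("length", "pmi"))
```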
NEW FEATURES

- `plot_upset` added to enable exploration of overlapping intersections between `term_count` categories: http://caleydo.org/tools/upset.
- `get_text` added to extract the original text associated with particular tags.
- `frequent_terms_co_occurrence` added to view the co-occurrence between frequent terms. A combination of `frequent_terms` and `tag_co_occurrence`.
- `term_before`, `term_after`, & `term_first` added to get frequencies of terms relative to other terms or specific locations.
- `token_count` added to count the occurrence of tokens within a vector of strings. This function differs from `term_count` in that `term_count` is regex based, allowing for fuzzy matching. `token_count` only searches for lower-cased tokens (words, number sequences, or punctuation), providing a well-defined counting function that is faster than `term_count` but less flexible. A sketch contrasting the two appears after this list.
- `as_term_list` added. This is a convenience function to convert a vector of terms or a quanteda `dictionary` into a named list.
- `combine_counts` added to enable combining `term_count` and `token_count` objects.
- `match_word` added to match words to regular expressions. Roughly equivalent to qdap's `term_match`.
- `read_term_list`/`write_term_list` added to aid in reading in, writing out, and formatting term list files.
- `classification_template` added to manually add a classification script template. This template has a suggested termco based workflow that may be useful for classification projects.
- `test_regex` added to test an atomic vector, list, or term list of regexes for validity.
- `mutate_counts` added to apply a normalizing function to all the term columns of a `term_count`/`token_count` object without stripping the attributes and class.
- `drop_terms` added to allow the user to explore/iterate on a term list and drop terms prior to `term_count` use without manually editing an external term list file.
- `tidy_counts` added to convert a wide matrix of counts to tidy form (tags are stretched long-wise with corresponding counts of tags).
- `set_meta_tags` added for setting the `metatags` attribute on a `term_count`/`token_count` object. This can also be controlled by separators in the term/token list passed to `term_count`/`token_count`.
- `select_counts` added for safely selecting `term_count`/`token_count` object columns without stripping attributes. Works like `?dplyr::select`.
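As a rough sketch of the `term_count`/`token_count` distinction noted above (argument names and defaults are assumed here rather than verified; consult `?term_count` and `?token_count`):

```r
## Sketch only: argument names are assumed, not checked against the docs.
library(termco)

x <- c("I like cats and catsup.", "Dogs bark; cats nap.")

## Regex based: fuzzy matching, so a loose pattern such as "cat" may also hit "catsup"
term_count(x, grouping.var = TRUE, term.list = list(cat = "cat", dog = "dogs?"))

## Token based: only whole, lower-cased tokens are counted, so "catsup" is not a hit
token_count(x, grouping.var = TRUE, token.list = list(cat = "cats", dog = "dogs"))
```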
MINOR FEATURES

- `important_terms` picks up a plot method corresponding to the `frequent_terms` plot method.
- `term_count` checks for duplicate categories within tiers for hierarchical term lists.
- `read_term_list` checks for valid regexes.
IMPROVEMENTS

- `validate_model` now uses `classify` before validating to assign tags.
- `tag_co_occurrence` used a grid + base plotting approach that required restarting the graphics device between plots. This dependency has been replaced with a dependency on ggraph for plotting networks as grid objects.
- `plot.validate_model` now shows tag counts in the sample to provide a relative importance of the accuracy in making decisions.
- Open, unescaped regexes (e.g., `|)`, an unescaped pipe followed by a closing group character) are now caught and warned about by `read_term_list` and thus `term_count`. A short illustration of why such patterns are dangerous appears after this list.
- `metatags` is an official attribute that can be used to group common tags together. This is common in qualitative coding where one tags text and then groups these subtags together into coherent metatags. This is used by `tidy_counts` and can be used by other future features.
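The danger of an open pipe can be seen with plain base R (this only illustrates the failure mode, it is not the package's validation code):

```r
## An empty alternative matches every string, so a tag defined with an open pipe
## would fire on everything.
x <- c("no animals here", "totally unrelated text")

grepl("(dog|)", x)     # TRUE TRUE   -- the empty alternative inside the group matches anything
grepl("(dog|cat)", x)  # FALSE FALSE -- the intended, closed expression
```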
CHANGES

- The stopwords package replaces the tm package for providing default stopword lists. The stopwords package is more comprehensive and lighter weight. This change allows the removal of the tm package as a dependency. Suggested by Ken Benoit; issue #69.
- `important_terms` now uses `quanteda::dfm_tfidf` rather than `tm::weightTfIdf`. This means the tf-idf weighting is done with a base 10 log rather than base 2 as done with the tm package (see the arithmetic sketch after this list). Suggested by Ken Benoit; issue #69.
- `as_dtm` & `as_tdm` moved to the gofastr package where they can be used by other packages and their classed objects. termco re-exports the two functions.
- `summary.validate_model` used to return `n`, which was the number of tags from the `termco` object. It now gives `n.tags` and `n.classified` to be more explicit about counts of potential tags and tags actually assigned by `classify`.
- `colo` no longer uses non-standard evaluation; terms must be quoted.
- `ngram_collocations` has been renamed to `frequent_ngrams` for better clarity in what the function does and as a counterpart to `frequent_terms`.
- `update_names` renamed to `rename_tags` to be consistent with naming conventions.
- `term_cols` renamed to `tag_cols` to be consistent with naming conventions.
- `token_count` no longer has a print method of its own. The `print` method for `term_count` was made more generic and works for both since `token_count` inherits from `term_count`. This is easier to maintain.
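To make the practical effect of the log base change concrete, here is a back-of-the-envelope comparison (plain arithmetic, not package code):

```r
## idf for a term appearing in 5 of 100 documents under the two log bases
N  <- 100
df <- 5

idf_base10 <- log10(N / df)  # ~1.30, the base 10 weighting described above
idf_base2  <- log2(N / df)   # ~4.32, the base 2 weighting used by the tm package

## The two differ by the constant factor log2(10) ~= 3.32, so relative rankings
## of terms are unchanged; only the scale of the scores shifts.
idf_base2 / idf_base10
```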
NEW FEATURES

- `term_cols` & `group_cols` added to quickly grab just term or grouping variable columns.
- `as_dtm` & `as_tdm` added to convert a `term_count` object into a `tm::DocumentTermMatrix` or `tm::TermDocumentMatrix` object.
- `update_names` added to allow for safe renaming of a `term_count` object's columns while also updating its attributes.
- `term_list_template` added for generating and writing term list templates.
IMPROVEMENTS

- `classify` picks up a new default `ties.method` type of `"probabilities"`. This uses the probability distribution from all tags assigned to randomly break ties based on that distribution.
- `term_count` gets an auto-collapse feature for hierarchical `term.list`s with duplicate names. A message is printed telling the user this is happening. To get the hierarchical coverage use `attributes(x2)[['pre_collapse_coverage']]`.
- `accuracy` now uses standard model evaluation measures of macro/micro averaged accuracy, precision, and recall as outlined by Dan Jurafsky & Chris Manning (see the worked sketch after this list). See https://www.youtube.com/watch?v=OwwdYHWRB5E&index=31&list=PL6397E4B26D00A269 for details on the methods.
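For readers unfamiliar with the macro/micro distinction, this small worked example (plain R with made-up counts, not the package's implementation) shows how the two averages can differ:

```r
## Per-tag true positives, false positives, and false negatives (hypothetical counts)
tp <- c(tagA = 40, tagB = 5)
fp <- c(tagA = 10, tagB = 5)
fn <- c(tagA = 10, tagB = 15)

## Macro averaging: average the per-tag scores, so small tags count as much as large ones
macro_precision <- mean(tp / (tp + fp))           # mean(c(0.80, 0.50)) = 0.65
macro_recall    <- mean(tp / (tp + fn))           # mean(c(0.80, 0.25)) = 0.525

## Micro averaging: pool the counts first, so frequent tags dominate
micro_precision <- sum(tp) / (sum(tp) + sum(fp))  # 45 / 60 = 0.75
micro_recall    <- sum(tp) / (sum(tp) + sum(fn))  # 45 / 70 ~= 0.643
```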
CHANGES

- `plot.tag_co_occurrence` uses a bubble dot plot for the right-hand graph rather than the older bar plot. This displays tag size in addition to the average number of other tags, helping to determine whether a tag's co-occurrence involves a meaningful number of tags worth additional attention. Use `tag = TRUE` for the old behavior.
- `accuracy` was renamed to `evaluate` to be more informative as well as a verb.
BUG FIXES

- `colo` returned a list rather than a string if a single term was passed. Spotted by Steve Simpson. See issue #12.
- `term_count` did not handle hierarchical `term.list`s correctly due to a reordering done by data.table (when `group.vars` was not `= TRUE`). This has been corrected.
- Column ordering was not respected by `print.term_count`.
- `colo` did not copy to the clipboard when `copy2clip` was `TRUE` and a single expression was passed to `...`.
NEW FEATURES

- `important_terms` added to complement `frequent_terms`, allowing tf-idf weighted terms to rise to the top.
- `collapse_tags` added to combine tags/columns from a `term_count` object without stripping the `term_count` class and attributes.
MINOR FEATURES

- `plot_counts` picks up a `drop` argument to enable terms not found (if `x` is an `as_terms` object created from a `term_count` object) to be retained in the bar plot. Suggested by Steve Simpson. See issue #18.
IMPROVEMENTS

- `colo` automatically adds group parentheses around `...` regexes to protect the grouping explicitly. This is useful when a regex uses or'd pipes (`|`), which could otherwise create an unintended expression that was overly aggressive (see #20). A short illustration follows.
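The precedence issue being guarded against can be shown with plain base R (illustration only; this is not `colo`'s actual output):

```r
## The pipe has the lowest precedence, so an ungrouped alternative "leaks" into
## whatever pattern text it is combined with.
x <- c("the cat napped", "the dog sat", "the cat sat")

grepl("cat|dog.*sat", x)    # TRUE TRUE TRUE  -- reads as "cat" OR "dog.*sat"
grepl("(cat|dog).*sat", x)  # FALSE TRUE TRUE -- the grouped, intended expression
```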
NEW FEATURES

- `validate_model` and `assign_validation_task` added to allow for human assessment of how accurately a model is functioning.
CHANGES

- `probe_colo_list`, `probe_colo_plot_list`, & `probe_colo_plot` all use `search_term_collocations` under the hood rather than `search_term` + `frequent_terms`.
BUG FIXES

- `plot.term_count` did not properly handle weighting. This has been fixed and allows for `"count"` as a choice.
- `search_term_which` (also `search_term`) did not treat the `and` argument correctly; `and` was treated identically to the `not` argument.
NEW FEATURES

- `split_data` added for easy creation of training and testing data.
- `classification_project` added to make a classification modeling project template.
- `plot_cum_percent` added for cumulative percent plots of frequent terms.
- `probe_` family of functions added to easily make lists of function calls for exploration of the frequent terms in the context of the data. Functions include: `probe_list`, `probe_colo_list`, `probe_colo_plot_list`, & `probe_colo_plot`.
- `hierarchical_coverage` added to allow exploration of the unique coverage of a text vector by a term after partitioning out the elements matched by previous terms.
- `tag_co_occurrence` added to explore tag co-occurrences.
- `search_term_collocations` added as a convenience wrapper for `search_term` + `frequent_terms`. (Thanks to Steve Simpson)
MINOR FEATURES

- `plot_freq` picks up a `size` argument.
IMPROVEMENTS

- `term_count` can now be used in a hierarchical fashion. A list of regexes can be passed and counted, and then a second (or more) pass can be taken with a new set of regexes on only those rows/text elements that were left untagged (count `rowSums` is zero). This is accomplished by passing a `list` of `list`s of regexes. Thanks to Steve Simpson for suggesting this feature. A sketch of the call shape appears below.
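A rough sketch of what such a nested term list might look like (the exact argument names and list structure are assumed from the description above, not verified against the documentation):

```r
## Sketch only: a two-tier term list; the second tier is applied only to text
## elements the first tier left untagged.
library(termco)

term_list <- list(
  list(positive = c("\\bgood\\b", "\\bgreat\\b"),    # first pass
       negative = c("\\bbad\\b")),
  list(hedged   = c("\\bmaybe\\b", "\\bperhaps\\b")) # second pass on untagged rows
)

x <- c("a good day", "perhaps tomorrow", "nothing to tag here")
term_count(x, grouping.var = TRUE, term.list = term_list)
```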
This package is a small suite of functions used to count terms and substrings in strings.