diff --git a/DESCRIPTION b/DESCRIPTION
index 9ef8c69..3ab20ae 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,5 +1,5 @@
 Package: quanteda.textstats
-Version: 0.96.6
+Version: 0.97
 Title: Textual Statistics for the Quantitative Analysis of Textual Data
 Description: Textual statistics functions formerly in the 'quanteda' package.
     Textual statistics for characterizing and comparing textual data. Includes
diff --git a/README.md b/README.md
index 7cf2658..e812b3c 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 [![CRAN Version](https://www.r-pkg.org/badges/version/quanteda.textstats)](https://CRAN.R-project.org/package=quanteda.textstats)
-[![](https://img.shields.io/badge/devel%20version-0.96.5-royalblue.svg)](https://github.com/quanteda/quanteda.textstats)
+[![](https://img.shields.io/badge/devel%20version-0.97-royalblue.svg)](https://github.com/quanteda/quanteda.textstats)
 [![Downloads](https://cranlogs.r-pkg.org/badges/quanteda.textstats)](https://CRAN.R-project.org/package=quanteda.textstats)
 [![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/quanteda.textstats?color=orange)](https://CRAN.R-project.org/package=quanteda.textstats)
diff --git a/cran-comments.md b/cran-comments.md
index da6bdd8..96a515e 100644
--- a/cran-comments.md
+++ b/cran-comments.md
@@ -2,11 +2,12 @@
 Purpose:
 
-* To update the C++ code to better call the tbb library for parallel computing.
+* To fix issues related to the quanteda v4.0 release and its move to a version of TBB different from the one provided by RcppParallel.
 
 ## Test environments
 
-* local macOS 13.6, R 4.3.1
+* local macOS 14.4.1, R 4.3.3
+* macOS release via devtools::check_mac_release()
 * Windows release via devtools::check_win_release()
 * Windows devel via devtools::check_win_devel()
 * Windows old-release via devtools::check_win_oldrelease()
diff --git a/docs/404.html b/docs/404.html
new file mode 100644
index 0000000..8355fd5
--- /dev/null
+++ b/docs/404.html
@@ -0,0 +1,103 @@
Contributor Code of Conduct
+We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, or choice of text editor.
+Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
+This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/
+GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/> + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. 
+ + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. 
However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. 
+ + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. 
+ + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. + + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. 
Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. 
+ + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. 
For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. 
You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. 
+ + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <https://www.gnu.org/licenses/>. + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + <program> Copyright (C) <year> <name of author> + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. 
Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +<https://www.gnu.org/licenses/>. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +<https://www.gnu.org/licenses/why-not-lgpl.html>. ++ +
Contains the textstat functions formerly in quanteda. For more details, see https://quanteda.io.
+The normal way from CRAN, using your R GUI or
+
+install.packages("quanteda.textstats")
Or for the latest development version:
+
+# remotes package required to install quanteda.textstats from GitHub
+remotes::install_github("quanteda/quanteda.textstats")
Because this compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers.
+If you are using a Windows platform, this means you will need also to install the Rtools software available from CRAN.
+If you are using macOS, you should install the macOS tools, namely the Clang 6.x compiler and the GNU Fortran compiler (as quanteda.textstats requires gfortran to build). If you are still getting errors related to gfortran, follow the fixes here.
+Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.
+For a BibTeX entry, use the output from citation(package = "quanteda.textstats").
NEWS.md
* `[` now works for textstat outputs, to fix #50.
* Updated `textstat_simil()` for new proxyC version v0.2.2, which affects how similarities are returned for `NA` values. See #45.
* Updated `textstat_simil()` for new proxyC version v0.2.0.
* Returns `NA`, without failure, for ICU versions older than 9 (#35 and #24).
* Updated `groups` in `textstat_frequency()` to operate as in quanteda v3.
* Removed uses of `stringsAsFactors` in `data.frame()`.
* Added `textstat_summary()` and associated functions and tests.
and
+textstat_dist()
.
# S3 method for textstat_proxy
+as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)
+
+# S3 method for textstat_proxy
+as.data.frame(
+ x,
+ row.names = NULL,
+ optional = FALSE,
+ diag = FALSE,
+ upper = FALSE,
+ ...
+)
x: any R object.

sorted: sort results in descending order if TRUE

n: the top n highest-ranking items will be returned. If n is NULL, return all items.

diag: logical; if FALSE, exclude the item's comparison with itself

...: additional arguments to be passed to or from methods.

row.names: NULL or a character vector giving the row names for the data frame. Missing values are not allowed.

optional: logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. Note that all of R's base package as.data.frame() methods use optional only for column names treatment, basically with the meaning of data.frame(*, check.names = !optional). See also the make.names argument of the matrix method.

upper: logical; if TRUE, return pairs as both (A, B) and (B, A)
as.list for a textstat_simil or textstat_dist object returns a list equal in length to the columns of the simil or dist object, with the rows and their values as named elements. By default this list excludes same-pair comparisons (when diag = FALSE) and sorts the values in descending order (when sorted = TRUE).

as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and their similarity or distance values.
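A short sketch of both coercions (the dfm below is invented for illustration):

library("quanteda")
library("quanteda.textstats")
dfmat <- dfm(tokens(c(d1 = "a b c d", d2 = "a b c e", d3 = "x y z")))
tstat <- textstat_simil(dfmat, method = "cosine")
as.list(tstat, n = 2)               # top 2 most similar documents, per document
as.data.frame(tstat, diag = FALSE)  # long format, same-pair rows excluded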
R/textstat_simil.R
+ as.matrix.textstat_simil_sparse.Rd
as.matrix method for textstat_simil_sparse
x: an object returned by textstat_simil when min_simil > 0

omitted: value that will replace the omitted cells

...: unused

a matrix object
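For instance, a minimal sketch (assuming, per the parameter description above, that the replacement value is supplied as the argument named omitted):

library("quanteda")
library("quanteda.textstats")
dfmat <- dfm(tokens(c(d1 = "a b c d", d2 = "a b c e", d3 = "w x y z")))
tstat <- textstat_simil(dfmat, method = "cosine", min_simil = 0.5)
as.matrix(tstat, omitted = NA)  # cells below min_simil are returned as NA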
+Check arguments passed to other functions via ...
+check_dots(..., method = NULL)
...: dots to check

method: the names of the functions that ... is passed to
R/textstat_lexdiv.R
+ compute_lexdiv_stats.Rd
Internal functions used in textstat_lexdiv()
, for computing
+lexical diversity measures on dfms or tokens objects
compute_lexdiv_dfm_stats(x, measure = NULL, log.base = 10)
+
+compute_lexdiv_tokens_stats(
+ x,
+ measure = c("MATTR", "MSTTR"),
+ MATTR_window,
+ MSTTR_segment
+)
x: a dfm object

measure: a list of lexical diversity measures.

log.base: a numeric value defining the base of the logarithm (for measures using logs)

MATTR_window: a numeric value defining the size of the moving window for computation of the Moving-Average Type-Token Ratio (Covington & McFall, 2010)

MSTTR_segment: a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944)
a data.frame
with a document
column containing the
+input document name, followed by columns with the lexical diversity
+statistic, in the order in which they were supplied as the measure
argument.
compute_lexdiv_dfm_stats is an internal function that computes the lexical diversity measures from a dfm input.

compute_lexdiv_tokens_stats is an internal function that computes the lexical diversity measures from a tokens input.
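These helpers are internal; the exported entry point is textstat_lexdiv(). A brief sketch of the public interface (the text is invented for illustration):

library("quanteda")
library("quanteda.textstats")
toks <- tokens("one two three two one four five one six two seven")
textstat_lexdiv(toks, measure = c("MATTR", "MSTTR"),
                MATTR_window = 5, MSTTR_segment = 5)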
R/textstat_lexdiv.R
+ compute_mattr.Rd
From a tokens object, computes the Moving-Average Type-Token Ratio (MATTR)
+from Covington & McFall (2010), averaging all of the sequential moving
+windows of tokens of size MATTR_window
across the text, returning the
+average as the MATTR.
compute_mattr(x, MATTR_window = 100L)
x: a tokens object

MATTR_window: integer; the size of the moving window for computation of TTR, between 1 and the number of tokens of the document
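The computation reduces to averaging the TTRs of all successive windows; a base-R sketch of the idea (the helper name is hypothetical, not the package's internal implementation):

mattr_sketch <- function(toks, window = 100L) {
  window <- min(window, length(toks))
  ttrs <- vapply(seq_len(length(toks) - window + 1), function(i) {
    w <- toks[i:(i + window - 1)]
    length(unique(w)) / window          # TTR of this window
  }, numeric(1))
  mean(ttrs)                            # MATTR = mean of the window TTRs
}
mattr_sketch(c("a", "b", "a", "c", "b", "a"), window = 3)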
R/textstat_lexdiv.R
+ compute_msttr.Rd
Compute the Mean Segmental Type-Token Ratio (Johnson 1944) for a tokens input.
+compute_msttr(x, MSTTR_segment)
x: input tokens

MSTTR_segment: a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944)
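Analogously, a base-R sketch of the segmental computation (helper name hypothetical; assumes only complete segments are scored, following Johnson 1944):

msttr_sketch <- function(toks, segment = 100L) {
  n_seg <- length(toks) %/% segment
  ttrs <- vapply(seq_len(n_seg), function(i) {
    w <- toks[((i - 1) * segment + 1):(i * segment)]
    length(unique(w)) / segment         # TTR of this segment
  }, numeric(1))
  mean(ttrs)                            # MSTTR = mean of the segment TTRs
}
msttr_sketch(c("a", "b", "a", "c", "b", "a", "d", "e", "a", "b"), segment = 5)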
R/textstat_readability.R
+ data_char_wordlists.Rd
data_char_wordlists
provides word lists used in some readability indexes;
+it is a named list of character vectors where each list element
+corresponds to a different readability index.
data_char_wordlists
A list of length two:
DaleChall
The long Dale-Chall list of 3,000 familiar (English) +words needed to compute the Dale-Chall Readability Formula.
Spache
The revised Spache word list (see Klare 1975, 73; Spache +1974) needed to compute the Spache Revised Formula of readability (Spache +1953).
Chall, J.S., & Dale, E. (1995). Readability Revisited: The New +Dale-Chall Readability Formula. Brookline Books.
+Dale, E. & Chall, J.S. (1948). A Formula for Predicting +Readability. Educational Research Bulletin, 27(1): 11--20.
+Dale, E. & Chall, J.S. (1948). A Formula for Predicting Readability: +Instructions. Educational Research Bulletin, 27(2): 37--54.
+Klare, G.R. (1975). Assessing Readability. Reading Research Quarterly +10(1), 62--102.
+Spache, G. (1953). A New Readability Formula for Primary-Grade Reading +Materials. The Elementary School Journal, 53, 410--413.
+Spache, G. (1974). Good reading for poor readers. (Rvd. 9th Ed.) +Champaign, Illinois: Garrard, 1974.
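To inspect the lists directly:

library("quanteda.textstats")
names(data_char_wordlists)
head(data_char_wordlists$DaleChall)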
+R/textstat_lexdiv.R
+ dfm_split_hyphenated_features.Rd
Takes a dfm that contains features with hyphenated words, such as "split-second", and turns them into features that split the elements in the same way as tokens(x, remove_hyphens = TRUE) would have done.
dfm_split_hyphenated_features(x)
x: input dfm
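For comparison, the tokens-level behaviour it mirrors (note that in current quanteda releases the tokens() argument is split_hyphens rather than the older remove_hyphens):

library("quanteda")
tokens("a split-second decision", split_hyphens = TRUE)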
R/textstat_simil.R
+ diag2na.Rd
Converts the diagonal, or the same-pair equivalent in an object +where the columns have been selected, to NA.
+diag2na(x)
x: the return from textstat_simil() or textstat_dist()

sparse Matrix format with same-pair values replaced with NA
R/textstat_simil.R
+ head.textstat_proxy.Rd
For a similarity or distance object computed via textstat_simil or
+textstat_dist, returns the first or last n
rows.
x: a textstat_simil/textstat_dist object

n: a single integer. If positive, the size of the resulting object: the number of first/last documents for the dfm. If negative, all but the n last/first number of documents of x.

...: unused
A matrix corresponding to the subset defined by n.
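A brief sketch of its use:

library("quanteda")
library("quanteda.textstats")
tstat <- textstat_simil(dfm(tokens(data_corpus_inaugural[1:5])))
head(tstat, 2)  # similarity rows for the first two documents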
All functions

data_char_wordlists : Word lists for readability statistics
textstat_collocations() : Identify and score multi-word expressions
textstat_entropy() : Compute entropies of documents or features
textstat_frequency() : Tabulate feature frequencies
textstat_keyness() : Calculate keyness statistics
textstat_lexdiv() : Calculate lexical diversity
textstat_readability() : Calculate readability
textstat_simil() textstat_dist() : Similarity and distance computation between documents or features
textstat_summary() : Summarize documents as syntactic and lexical feature counts
Tally the Scrabble letter values of text given a user-supplied function, such +as the sum (default) or mean of the character values.
+nscrabble(x, FUN = sum)
x: a character vector

FUN: function to be applied to the character values in the text; default is sum, but could also be mean or a user-supplied function. Missing values are automatically removed.
a (named) integer vector of Scrabble letter values, computed using
+FUN
, corresponding to the input text(s)
Character values are only defined for non-accented Latin a-z, A-Z +letters. Lower-casing is unnecessary.
+We would be happy to add more languages to this extremely useful +function if you send us the values for your language!
+nscrabble(c("muzjiks", "excellency"))
+#> [1] 29 24
+nscrabble(quanteda::data_corpus_inaugural[1:5], mean)
+#> 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson
+#> 1.706789 1.721875 1.624590 1.678183 1.663654
+
Extends nsyllable()
methods for tokens objects.
# S3 method for tokens
+nsyllable(
+ x,
+ language = "en",
+ syllable_dictionary = nsyllable::data_syllables_en,
+ use.names = FALSE
+)
x: character vector whose syllables will be counted. This will count all syllables in a character vector without regard to separating tokens, so it is recommended that x be individual terms.

language: specify the language for syllable counts by ISO 639-1 code. The default is English, using the data object data_syllables_en, an English pronunciation dictionary from CMU.

syllable_dictionary: optional named integer vector of syllable counts where the names are lower case tokens. When NULL (the default), the language setting is used; if a syllable dictionary is supplied, it overrides the language argument.

use.names: logical; if TRUE, assign the tokens as the names of the syllable count vector
+library("nsyllable")
+library("nsyllable")
+txt <- c(one = "super freakily yes",
+ two = "merrily all go aerodynamic")
+toks <- quanteda::tokens(txt)
+nsyllable(toks)
+#> $one
+#> [1] 2 3 1
+#>
+#> $two
+#> [1] 3 1 1 5
+#>
+
R/quanteda.textstats-package.R
+ quanteda.textstats-package.Rd
Textual statistics functions formerly in the 'quanteda' package. Textual statistics for characterizing and comparing textual data. Includes functions for measuring term and document frequency, the co-occurrence of words, similarity and distance between features and documents, feature entropy, keyword occurrence, readability, and lexical diversity. These functions extend the 'quanteda' package and are specially designed for sparse textual data.
+Useful links:
Report bugs at https://github.com/quanteda/quanteda.textstats/issues
R/textstat_collocations.R
+ textstat_collocations.Rd
Identify and score multi-word expressions, or adjacent fixed-length +collocations, from text.
+textstat_collocations(
+ x,
+ method = "lambda",
+ size = 2,
+ min_count = 2,
+ smoothing = 0.5,
+ tolower = TRUE,
+ ...
+)
x: a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with padding = TRUE. While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized.

method: association measure for detecting collocations. Currently this is limited to "lambda". See Details.

size: integer; the length of the collocations to be scored

min_count: numeric; minimum frequency of collocations that will be scored

smoothing: numeric; a smoothing parameter added to the observed counts (default is 0.5)

tolower: logical; if TRUE, form collocations as lower-cased combinations

...: additional arguments passed to tokens()
textstat_collocations
returns a data.frame of collocations and
+their scores and statistics. This consists of the collocations, their
+counts, length, and \(\lambda\) and \(z\) statistics. When size
is a
+vector, then count_nested
counts the lower-order collocations that occur
+within a higher-order collocation (but this does not affect the
+statistics).
Documents are grouped for the purposes of scoring, but collocations will not
+span sentences. If x
is a tokens object and some tokens have been
+removed, this should be done using tokens_remove(x, pattern, padding = TRUE)
so that counts will still be accurate, but the pads will prevent those
+collocations from being scored.
The lambda
computed for a size = \(K\)-word target multi-word expression
+the coefficient for the \(K\)-way interaction parameter in the saturated
+log-linear model fitted to the counts of the terms forming the set of
+eligible multi-word expressions. This is the same as the "lambda" computed in
+Blaheta and Johnson's (2001), where all multi-word expressions are considered
+(rather than just verbs, as in that paper). The z
is the Wald
+\(z\)-statistic computed as the quotient of lambda
and the Wald statistic
+for lambda
as described below.
In detail:
+Consider a \(K\)-word target expression \(x\), and let \(z\) be any
+\(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1},
+\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
+\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
+otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
+2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
+\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
+\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
+denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the
+smoothing constant smoothing
. The \(n_{i}\) are the counts in a
+\(2^{K}\) contingency table whose dimensions are defined by the
+\(c_{i}\).
\(\lambda\): The \(K\)-way interaction parameter in the saturated +loglinear model fitted to the \(n_{i}\). It can be calculated as
+$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$
+where \(b_{i}\) is the number of the elements of \(c_{i}\) which are +equal to 1.
+Wald test \(z\)-statistic \(z\) is calculated as:
+$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
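As a concrete check, this base-R sketch reproduces, for \(K = 2\) with smoothing 0.5 added to each cell, the lambda and z reported for the bigram "a b" in the final example below:

toks <- c("a", "b", "c", "a", "b", "d", "e", "b", "d", "a", "b")
bigr <- cbind(head(toks, -1), tail(toks, -1))           # all adjacent bigrams
n11 <- sum(bigr[, 1] == "a" & bigr[, 2] == "b") + 0.5   # both positions match
n10 <- sum(bigr[, 1] == "a" & bigr[, 2] != "b") + 0.5   # first position only
n01 <- sum(bigr[, 1] != "a" & bigr[, 2] == "b") + 0.5   # second position only
n00 <- sum(bigr[, 1] != "a" & bigr[, 2] != "b") + 0.5   # neither position
lambda <- log(n11) - log(n10) - log(n01) + log(n00)
z <- lambda / sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
round(c(lambda = lambda, z = z), 6)
#>   lambda        z
#> 3.412247 1.936083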
Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL-EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
+library("quanteda")
+#> Package version: 4.0.1
+#> Unicode version: 14.0
+#> ICU version: 71.1
+#> Parallel computing: disabled
+#> See https://quanteda.io for tutorials and examples.
+corp <- data_corpus_inaugural[1:2]
+head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
+#> collocation count count_nested length lambda z
+#> 1 have been 5 0 2 5.704259 7.354588
+#> 2 has been 3 0 2 5.565217 6.409333
+#> 3 of the 24 0 2 1.673501 6.382475
+#> 4 i have 5 0 2 3.743580 6.268303
+#> 5 which i 6 0 2 3.172217 6.135144
+#> 6 will be 4 0 2 3.868500 5.930143
+#> 7 less than 2 0 2 6.279494 5.529680
+#> 8 public good 2 0 2 6.279494 5.529680
+#> 9 you will 2 0 2 4.917893 5.431752
+#> 10 may be 3 0 2 4.190711 5.328038
+head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
+#> collocation count count_nested length lambda z
+#> 1 of which the 2 0 3 6.1259648 2.8317522
+#> 2 in which i 3 0 3 2.1689288 1.1741918
+#> 3 i have in 2 0 3 2.3809129 1.0618774
+#> 4 and of the 2 0 3 0.8847383 0.7498730
+#> 5 me by the 2 0 3 1.4726869 0.6560780
+#> 6 to the great 2 0 3 1.2891870 0.5660311
+#> 7 voice of my 2 0 3 1.2270130 0.5298220
+#> 8 which ought to 2 0 3 1.4083232 0.5278314
+#> 9 of the confidence 2 0 3 1.1220858 0.4948962
+#> 10 the united states 2 0 3 1.2597834 0.4272349
+
+# extracting multi-part proper nouns (capitalized terms)
+toks1 <- tokens(data_corpus_inaugural)
+toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
+toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
+ case_insensitive = FALSE, padding = TRUE)
+tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
+head(tstat, 10)
+#> collocation count count_nested length lambda z
+#> 1 United States Congress 2 0 3 -2.174793 -1.025182
+#> 2 Arlington National Cemetery 2 0 3 -6.301876 -2.086677
+#> 3 Chief Justice Roberts 2 0 3 -7.818352 -3.033147
+#> 4 Vice President Bush 2 0 3 -11.741818 -4.537337
+
+# vectorized size
+txt <- c(". . . . a b c . . a b c . . . c d e",
+ "a b . . a b . . a b . . a b . a b",
+ "b c d . . b c . b c . . . b c")
+textstat_collocations(txt, size = 2:3)
+#> collocation count count_nested length lambda z
+#> 1 a b 7 2 2 5.652489e+00 2.745546e+00
+#> 2 b c 6 3 2 5.609472e+00 2.721287e+00
+#> 3 c d 2 2 2 4.976734e+00 2.354187e+00
+#> 4 a b c 2 0 3 -1.110223e-16 -3.103168e-17
+
+# compounding tokens from collocations
+toks <- tokens("This is the European Union.")
+colls <- tokens("The new European Union is not the old European Union.") %>%
+ textstat_collocations(size = 2, min_count = 1, tolower = FALSE)
+colls
+#> collocation count count_nested length lambda z
+#> 1 European Union 2 0 2 4.317488 2.027787
+#> 2 The new 1 0 2 3.931826 1.797564
+#> 3 Union is 1 0 2 3.931826 1.797564
+#> 4 is not 1 0 2 3.931826 1.797564
+#> 5 not the 1 0 2 3.931826 1.797564
+#> 6 the old 1 0 2 3.931826 1.797564
+#> 7 new European 1 0 2 2.708050 1.454456
+#> 8 old European 1 0 2 2.708050 1.454456
+tokens_compound(toks, colls, case_insensitive = FALSE)
+#> Tokens consisting of 1 document.
+#> text1 :
+#> [1] "This" "is" "the" "European_Union"
+#> [5] "."
+#>
+
+# from a collocations object
+(coll <- textstat_collocations(tokens("a b c a b d e b d a b")))
+#> collocation count count_nested length lambda z
+#> 1 a b 3 0 2 3.412247 1.936083
+#> 2 b d 2 0 2 3.218876 1.799406
+phrase(coll)
+#> [[1]]
+#> [1] "a" "b"
+#>
+#> [[2]]
+#> [1] "b" "d"
+#>
+
Compute entropies of documents or features
+textstat_entropy(x, margin = c("documents", "features"), base = 2)
x: a dfm

margin: character indicating for which margin to compute entropy

base: base for logarithm function
a data.frame of entropies for the given document or feature
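A hand check of the definition (a sketch, assuming the entropy is taken over each document's relative term frequencies in the given log base):

p <- c(2, 1, 1) / 4         # relative frequencies of a 4-token document
-sum(p * log(p, base = 2))  # 1.5 bits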
+library("quanteda")
+textstat_entropy(data_dfm_lbgexample)
+#> document entropy
+#> 1 R1 3.386943
+#> 2 R2 3.386943
+#> 3 R3 3.386943
+#> 4 R4 3.386943
+#> 5 R5 3.386943
+#> 6 V1 3.386943
+textstat_entropy(data_dfm_lbgexample, "features")
+#> feature entropy
+#> 1 A 0.0000000
+#> 2 B 0.0000000
+#> 3 C 0.0000000
+#> 4 D 0.0000000
+#> 5 E 0.0000000
+#> 6 F 0.1686609
+#> 7 G 0.1708952
+#> 8 H 0.4371120
+#> 9 I 0.6476138
+#> 10 J 1.0338027
+#> 11 K 1.4131631
+#> 12 L 1.5669101
+#> 13 M 1.5996467
+#> 14 N 1.5656144
+#> 15 O 1.5806321
+#> 16 P 1.6267307
+#> 17 Q 1.6414915
+#> 18 R 1.6034693
+#> 19 S 1.5561626
+#> 20 T 1.5311306
+#> 21 U 1.4979274
+#> 22 V 1.3664642
+#> 23 W 1.1291805
+#> 24 X 1.0439334
+#> 25 Y 1.0338027
+#> 26 Z 1.0726302
+#> 27 ZA 1.0458291
+#> 28 ZB 0.7876499
+#> 29 ZC 0.5357150
+#> 30 ZD 0.3435197
+#> 31 ZE 0.1708952
+#> 32 ZF 0.1686609
+#> 33 ZG 0.0000000
+#> 34 ZH 0.0000000
+#> 35 ZI 0.0000000
+#> 36 ZJ 0.0000000
+#> 37 ZK 0.0000000
+
Produces counts and document frequency summaries of the features in a dfm, optionally grouped by a docvars variable or other supplied grouping variable.
+textstat_frequency(
+ x,
+ n = NULL,
+ groups = NULL,
+ ties_method = c("min", "average", "first", "random", "max", "dense"),
+ ...
+)
x: a dfm object
n: (optional) integer specifying the top n features to be returned, within group if groups is specified
groups: grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for groups. See news(Version >= "3.0", package = "quanteda") for details.
ties_method: character string specifying how ties are treated. See base::rank() for details. Unlike that function, however, the default is "min", so that frequencies of 10, 10, 11 would be ranked 1, 1, 3.
...: additional arguments passed to dfm_group(). This can be useful in passing force = TRUE, for instance, if you are grouping a dfm that has been weighted.
a data.frame containing the following variables:

feature: (character) the feature
frequency: count of the feature
rank: rank of the feature, where 1 indicates the greatest frequency
docfreq: document frequency of the feature, as a count (the number of documents in which this feature occurred at least once)
group: (only if groups is specified) the label of the group. If the features have been grouped, then all counts, ranks, and document frequencies are within group. If groups is not specified, the group column is omitted from the returned data.frame.

textstat_frequency returns a data.frame of features and their term and document frequencies within groups.
library("quanteda")
+set.seed(20)
+dfmat1 <- dfm(tokens(c("a a b b c d", "a d d d", "a a a")))
+
+textstat_frequency(dfmat1)
+#> feature frequency rank docfreq group
+#> 1 a 6 1 3 all
+#> 2 d 4 2 2 all
+#> 3 b 2 3 1 all
+#> 4 c 1 4 1 all
+textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "first")
+#> feature frequency rank docfreq group
+#> 1 a 5 1 2 one
+#> 2 b 2 2 1 one
+#> 3 c 1 3 1 one
+#> 4 d 1 4 1 one
+#> 5 d 3 1 1 two
+#> 6 a 1 2 1 two
+textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "average")
+#> feature frequency rank docfreq group
+#> 1 a 5 1.0 2 one
+#> 2 b 2 2.0 1 one
+#> 3 c 1 3.5 1 one
+#> 4 d 1 3.5 1 one
+#> 5 d 3 1.0 1 two
+#> 6 a 1 2.0 1 two
+
+dfmat2 <- corpus_subset(data_corpus_inaugural, President == "Obama") %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("en")) %>%
+ dfm()
+tstat1 <- textstat_frequency(dfmat2)
+head(tstat1, 10)
+#> feature frequency rank docfreq group
+#> 1 us 44 1 2 all
+#> 2 must 25 2 2 all
+#> 3 can 20 3 2 all
+#> 4 nation 18 4 2 all
+#> 5 people 18 4 2 all
+#> 6 new 17 6 2 all
+#> 7 time 16 7 2 all
+#> 8 every 15 8 2 all
+#> 9 america 14 9 2 all
+#> 10 now 11 10 2 all
+
+dfmat3 <- head(data_corpus_inaugural) %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("en")) %>%
+ dfm()
+textstat_frequency(dfmat3, n = 2, groups = President)
+#> feature frequency rank docfreq group
+#> 1 people 20 1 1 Adams
+#> 2 government 16 2 1 Adams
+#> 3 public 18 1 2 Jefferson
+#> 4 may 18 1 2 Jefferson
+#> 5 public 6 1 1 Madison
+#> 6 nations 6 1 1 Madison
+#> 7 can 9 1 1 Washington
+#> 8 every 9 1 1 Washington
+
+
+if (FALSE) {
+# plot 20 most frequent words
+library("ggplot2")
+ggplot(tstat1[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
+ geom_point() +
+ coord_flip() +
+ labs(x = NULL, y = "Frequency")
+
+# plot relative frequencies by group
+dfmat3 <- data_corpus_inaugural %>%
+ corpus_subset(Year > 2000) %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("en")) %>%
+ dfm() %>%
+ dfm_group(groups = President) %>%
+ dfm_weight(scheme = "prop")
+
+# calculate relative frequency by president
+tstat2 <- textstat_frequency(dfmat3, n = 10, groups = President)
+
+# plot frequencies
+ggplot(data = tstat2, aes(x = factor(nrow(tstat2):1), y = frequency)) +
+ geom_point() +
+ facet_wrap(~ group, scales = "free") +
+ coord_flip() +
+ scale_x_discrete(breaks = nrow(tstat2):1,
+ labels = tstat2$feature) +
+ labs(x = NULL, y = "Relative frequency")
+}
+
Calculate "keyness", a score for features that occur differentially across +different categories. Here, the categories are defined by reference to a +"target" document index in the dfm, with the reference group +consisting of all other documents.
x: a dfm containing the features to be examined for keyness
target: the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference
measure: (signed) association measure to be used for computing keyness. Currently available: "chi2"; "exact" (Fisher's exact test); "lr" for the likelihood ratio; "pmi" for pointwise mutual information. Note that the "exact" test is very computationally intensive and therefore much slower than the other methods.
sort: logical; if TRUE, sort features scored in descending order of the measure, otherwise leave in original feature order
correction: if "default", the Yates correction is applied to "chi2"; the Williams correction is applied to "lr"; and no correction is applied for the "exact" and "pmi" measures. Specifying a value other than the default can be used to override the defaults, for instance to apply the Williams correction to the chi2 measure (see the sketch after this list). Specifying a correction for the "exact" and "pmi" measures has no effect and produces a warning.
...: not used
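As referenced above, a hedged sketch of overriding the default correction for the chi2 measure (the correction value "williams" is assumed from the options described in the correction entry):

+library("quanteda")
+dfmat <- dfm(tokens(data_corpus_inaugural))
+# Williams rather than the default Yates correction for chi2
+head(textstat_keyness(dfmat, target = "2017-Trump", measure = "chi2",
+                      correction = "williams"))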
a data.frame of computed statistics and associated p-values, where each row is named by the feature scored, along with the number of occurrences in both the target and reference groups. For measure = "chi2" this is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" this is the estimate of the odds ratio; for measure = "lr" this is the likelihood ratio \(G^2\) statistic; for "pmi" this is the pointwise mutual information statistic.
textstat_keyness
returns a data.frame of features and
+their keyness scores and frequency counts.
Bondi, M. & Scott, M. (eds) (2010). Keyness in +Texts. Amsterdam, Philadelphia: John Benjamins.
+Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in +Texts, Bondi, M. & Scott, M. (eds): 1--42. Amsterdam, Philadelphia: +John Benjamins.
+Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus +Analysis in Language Education. Amsterdam: Benjamins: 55.
+Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational +Linguistics, 19(1): 61--74.
+library("quanteda")
+
+# compare pre- v. post-war terms using grouping
+period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
+dfmat1 <- tokens(data_corpus_inaugural) %>%
+ dfm() %>%
+ dfm_group(groups = period)
+head(dfmat1) # make sure 'post-war' is in the first row
+#> Document-feature matrix of: 2 documents, 9,437 features (34.79% sparse) and 0 docvars.
+#> features
+#> docs fellow-citizens of the senate and house representatives :
+#> post-war 0 1514 2089 2 1552 3 0 115
+#> pre-war 39 5666 8094 13 3854 8 19 29
+#> features
+#> docs among vicissitudes
+#> post-war 22 0
+#> pre-war 86 5
+#> [ reached max_nfeat ... 9,427 more features ]
+head(tstat1 <- textstat_keyness(dfmat1), 10)
+#> feature chi2 p n_target n_reference
+#> 1 we 764.4484 0 1048 779
+#> 2 . 300.1887 0 2014 3141
+#> 3 us 207.5181 0 289 216
+#> 4 - 205.0408 0 242 157
+#> 5 america 200.5544 0 148 54
+#> 6 : 187.9821 0 115 29
+#> 7 our 183.5362 0 917 1307
+#> 8 world 171.9660 0 196 123
+#> 9 americans 163.1532 0 76 7
+#> 10 today 137.7753 0 84 21
+tail(tstat1, 10)
+#> feature chi2 p n_target n_reference
+#> 9428 upon -58.39810 2.142730e-14 39 332
+#> 9429 public -58.87042 1.687539e-14 12 213
+#> 9430 constitution -59.66134 1.121325e-14 9 200
+#> 9431 it -60.68425 6.661338e-15 266 1132
+#> 9432 states -63.87803 1.332268e-15 29 305
+#> 9433 be -72.68161 0.000000e+00 278 1224
+#> 9434 should -88.14817 0.000000e+00 16 309
+#> 9435 which -177.10087 0.000000e+00 96 911
+#> 9436 of -197.08830 0.000000e+00 1514 5666
+#> 9437 the -331.99069 0.000000e+00 2089 8094
+
+# compare pre- v. post-war terms using logical vector
+dfmat2 <- dfm(tokens(data_corpus_inaugural))
+head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
+#> feature chi2 p n_target n_reference
+#> 1 we 764.4484 0 1048 779
+#> 2 . 300.1887 0 2014 3141
+#> 3 us 207.5181 0 289 216
+#> 4 - 205.0408 0 242 157
+#> 5 america 200.5544 0 148 54
+#> 6 : 187.9821 0 115 29
+#> 7 our 183.5362 0 917 1307
+#> 8 world 171.9660 0 196 123
+#> 9 americans 163.1532 0 76 7
+#> 10 today 137.7753 0 84 21
+
+# compare Trump 2017 to other post-war presidents
+dfmat3 <- dfm(tokens(corpus_subset(data_corpus_inaugural, period == "post-war")))
+head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)
+#> feature chi2 p n_target n_reference
+#> 1 protected 81.83024 0.000000e+00 5 1
+#> 2 while 51.79484 6.161738e-13 6 7
+#> 3 obama 51.05861 8.965051e-13 3 0
+#> 4 we've 51.05861 8.965051e-13 3 0
+#> 5 will 48.03251 4.192091e-12 40 332
+#> 6 everyone 29.76164 4.885651e-08 4 5
+#> 7 your 28.60175 8.890179e-08 11 51
+#> 8 america 27.57968 1.507539e-07 18 130
+#> 9 breath 27.27421 1.765507e-07 2 0
+#> 10 exists 27.27421 1.765507e-07 2 0
+
+# using the likelihood ratio method
+head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)
+#> feature G2 p n_target n_reference
+#> 1 will 22.609878 1.984616e-06 41 351
+#> 2 america 12.306921 4.512817e-04 19 149
+#> 3 your 10.868622 9.780727e-04 12 70
+#> 4 while 9.707425 1.835249e-03 7 26
+#> 5 again 9.345219 2.235679e-03 10 56
+#> 6 protected 8.909125 2.837491e-03 6 20
+#> 7 american 7.996610 4.686501e-03 12 86
+#> 8 back 7.113978 7.648521e-03 7 35
+#> 9 dreams 5.908744 1.506591e-02 6 30
+#> 10 country 5.725811 1.671732e-02 10 77
+
Calculate the lexical diversity of text(s).
+textstat_lexdiv(
+ x,
+ measure = c("TTR", "C", "R", "CTTR", "U", "S", "K", "I", "D", "Vm", "Maas", "MATTR",
+ "MSTTR", "all"),
+ remove_numbers = TRUE,
+ remove_punct = TRUE,
+ remove_symbols = TRUE,
+ remove_hyphens = FALSE,
+ log.base = 10,
+ MATTR_window = 100L,
+ MSTTR_segment = 100L,
+ ...
+)
x: a dfm or tokens input object for whose documents lexical diversity will be computed
measure: a character vector defining the measure to compute
remove_numbers: logical; if TRUE, remove features or tokens that consist only of numerals (the Unicode "Number" [N] class)
remove_punct: logical; if TRUE, remove all features or tokens that consist only of the Unicode "Punctuation" [P] class
remove_symbols: logical; if TRUE, remove all features or tokens that consist only of the Unicode "Symbol" [S] class
remove_hyphens: logical; if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes the two features or tokens "self" and "storage". Default is FALSE, to preserve such words as is, with the hyphens.
log.base: a numeric value defining the base of the logarithm (for measures using logarithms)
MATTR_window: a numeric value defining the size of the moving window for computation of the Moving-Average Type-Token Ratio (Covington & McFall, 2010)
MSTTR_segment: a numeric value defining the size of each segment for the computation of the Mean Segmental Type-Token Ratio (Johnson, 1944)
...: not used directly
A data.frame of documents and their lexical diversity scores.
+textstat_lexdiv
calculates the lexical diversity of documents
+using a variety of indices.
In the following formulas, \(N\) refers to the total number of +tokens, \(V\) to the number of types, and \(f_v(i, N)\) to the numbers +of types occurring \(i\) times in a sample of length \(N\).
"TTR"
:The ordinary Type-Token Ratio: $$TTR = + \frac{V}{N}$$
"C"
:Herdan's C (Herdan, 1960, as cited in Tweedie & +Baayen, 1998; sometimes referred to as LogTTR): $$C = + \frac{\log{V}}{\log{N}}$$
"R"
:Guiraud's Root TTR (Guiraud, 1954, as cited in +Tweedie & Baayen, 1998): $$R = \frac{V}{\sqrt{N}}$$
"CTTR"
:Carroll's Corrected TTR: $$CTTR = + \frac{V}{\sqrt{2N}}$$
"U"
:Dugast's Uber Index (Dugast, 1978, as cited in +Tweedie & Baayen, 1998): $$U = \frac{(\log{N})^2}{\log{N} - \log{V}}$$
"S"
:Summer's index: $$S = + \frac{\log{\log{V}}}{\log{\log{N}}}$$
"K"
:Yule's K (Yule, 1944, as presented in Tweedie & +Baayen, 1998, Eq. 16) is calculated by: $$K = 10^4 \times + \left[ -\frac{1}{N} + \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 \right] $$
"I"
:Yule's I (Yule, 1944) is calculated by: $$I = \frac{V^2}{M_2 - V}$$ +$$M_2 = \sum_{i=1}^{V} i^2 * f_v(i, N)$$
"D"
:Simpson's D (Simpson 1949, as presented in +Tweedie & Baayen, 1998, Eq. 17) is calculated by: +$$D = \sum_{i=1}^{V} f_v(i, N) \frac{i}{N} \frac{i-1}{N-1}$$
"Vm"
:Herdan's \(V_m\) (Herdan 1955, as presented in Tweedie & Baayen, 1998, Eq. 18) is calculated by: $$V_m = \sqrt{ \sum_{i=1}^{V} f_v(i, N) (i/N)^2 - \frac{1}{V} }$$
"Maas"
:Maas' indices (\(a\), \(\log{V_0}\) & \(\log{}_{e}{V_0}\)): $$a^2 = \frac{\log{N} - \log{V}}{(\log{N})^2}$$ $$\log{V_0} = \frac{\log{V}}{\sqrt{1 - \left(\frac{\log{V}}{\log{N}}\right)^2}}$$ The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \(\log{}_{e}{V_0}\) is equivalent to \(\log{V_0}\), only with \(e\) as the base for the logarithms. Also calculated are \(a\), \(\log{V_0}\) (both not the same as before) and \(V'\) as measures of relative vocabulary growth while the text progresses. To calculate these measures, the first half of the text and the full text will be examined (see Maas, 1972, p. 67 ff. for details). Note: for the current method (for a dfm) there is no computation on separate halves of the text.
"MATTR"
:The Moving-Average Type-Token Ratio (Covington & +McFall, 2010) calculates TTRs for a moving window of tokens from the first +to the last token, computing a TTR for each window. The MATTR is the mean +of the TTRs of each window.
"MSTTR"
:Mean Segmental Type-Token Ratio (sometimes referred +to as Split TTR) splits the tokens into segments of the given size, +TTR for each segment is calculated and the mean of these values returned. +When this value is < 1.0, it splits the tokens into equal, non-overlapping +sections of that size. When this value is > 1, it defines the segments as +windows of that size. Tokens at the end which do not make a full segment +are ignored.
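A minimal hedged sketch of the segment-size argument (the toy text and segment size are illustrative assumptions, not package defaults):

+library("quanteda")
+# MSTTR with non-overlapping 10-token segments; the trailing partial
+# segment is ignored, per the description above
+tokens("one two three four two one five six two seven eight nine") %>%
+    textstat_lexdiv(measure = "MSTTR", MSTTR_segment = 10)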
Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94--100. doi:10.1080/09296171003643098
+Herdan, G. (1955). A New Derivation and Interpretation of Yule's 'Characteristic' K. Zeitschrift +für angewandte Mathematik und Physik, 6(4): 332--334.
+Maas, H.D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und +Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, +2(8), 73--96.
+McCarthy, P.M. & Jarvis, S. (2007). vocd: A Theoretical and Empirical +Evaluation. Language Testing, 24(4), 459--488. +doi:10.1177/0265532207080767
+McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment. +Behaviour Research Methods, 42(2), 381--392.
+Michalke, M. (2014). koRpus: An R Package for Text Analysis (Version +0.05-4). Available from https://reaktanz.de/?c=hacking&s=koRpus.
+Simpson, E.H. (1949). Measurement of Diversity. Nature, 163: 688. +doi:10.1038/163688a0
+Tweedie. F.J. and Baayen, R.H. (1998). How Variable May a Constant Be? +Measures of Lexical Richness in Perspective. Computers and the +Humanities, 32(5), 323--352. doi:10.1023/A:1001749303137
+Yule, G. U. (1944) The Statistical Study of Literary Vocabulary. +Cambridge: Cambridge University Press.
+library("quanteda")
+
+txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can
+ barbecue it, boil it, broil it, bake it, saute it.",
+ "There's shrimp-kabobs,
+ shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
+ pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
+ shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
+ sandwich.")
+tokens(txt) %>%
+ textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
+#> document TTR CTTR K
+#> 1 text1 0.7916667 2.742414 381.9444
+#> 2 text2 0.6060606 2.461830 1248.8522
+dfm(tokens(txt)) %>%
+ textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
+#> document TTR CTTR K
+#> 1 text1 0.7916667 2.742414 381.9444
+#> 2 text2 0.6060606 2.461830 1248.8522
+
+toks <- tokens(corpus_subset(data_corpus_inaugural, Year > 2000))
+textstat_lexdiv(toks, c("CTTR", "TTR", "MATTR"), MATTR_window = 100)
+#> document CTTR TTR MATTR
+#> 1 2001-Bush 10.37904 0.3689198 0.6885984
+#> 2 2005-Bush 11.26505 0.3500724 0.6781998
+#> 3 2009-Obama 12.91628 0.3736402 0.7070275
+#> 4 2013-Obama 11.99681 0.3709369 0.7029654
+#> 5 2017-Trump 10.01461 0.3728344 0.6670238
+#> 6 2021-Biden 10.59754 0.3081150 0.6816012
+
Sparse classes for similarity and distance matrices created by textstat_simil() and textstat_dist().
Print/show method for objects created by textstat_simil
and
+textstat_dist
.
validate_min_simil(object)
+
+# S4 method for textstat_proxy
+show(object)
object: the textstat_proxy object to be printed

Slots:

.Data: a sparse Matrix object, symmetric if selection is NULL
method: the method used for computing similarity or distance
min_simil: numeric; a threshold for the similarity values, below which similarity values are not computed
margin: identifies the margin of the dfm on which similarity or difference was computed: "documents" for documents or "features" for word/term features
type: either "textstat_simil" or "textstat_dist"
selection: target units, if any
This is an underlying function for textstat_dist and textstat_simil, but it returns a TsparseMatrix.
textstat_proxy(
+ x,
+ y = NULL,
+ margin = c("documents", "features"),
+ method = c("cosine", "correlation", "jaccard", "ejaccard", "dice", "edice", "hamann",
+ "simple matching", "euclidean", "chisquared", "hamming", "kullback", "manhattan",
+ "maximum", "canberra", "minkowski"),
+ p = 2,
+ min_proxy = NULL,
+ rank = NULL,
+ use_na = FALSE
+)
x, y: if a dfm object is provided, proximity between documents or features in x and y is computed
margin: identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features
method: character; the method identifying the similarity or distance measure to be used; see Details
p: the power of the Minkowski distance
min_proxy: the minimum proximity value to be recorded
rank: an integer value specifying the top-n highest proximity values to be recorded
use_na: if TRUE, return NA for proximity to empty vectors. Note that use of NA makes the proximity matrices denser.
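No examples accompany this page, so here is a minimal hedged sketch (toy documents; as.matrix() is for display only):

+library("quanteda")
+dfmat <- dfm(tokens(c(d1 = "a b c d", d2 = "a a b e", d3 = "b e f")))
+# pairwise document cosine proximities, returned as a TsparseMatrix
+prox <- textstat_proxy(dfmat, margin = "documents", method = "cosine")
+as.matrix(prox)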
Calculate the readability of text(s) using one of a variety of computed +indexes.
+textstat_readability(
+ x,
+ measure = "Flesch",
+ remove_hyphens = TRUE,
+ min_sentence_length = 1,
+ max_sentence_length = 10000,
+ intermediate = FALSE,
+ ...
+)
x: a character or corpus object containing the texts
measure: character vector defining the readability measure to calculate. Matches are case-insensitive. See other valid measures under Details.
remove_hyphens: if TRUE, treat constituent words in hyphenated words as separate terms, for purposes of computing word lengths, e.g. "decision-making" as two terms of lengths 8 and 6 characters respectively, rather than as a single word of 15 characters
min_sentence_length, max_sentence_length: set the minimum and maximum sentence lengths (in tokens, excluding punctuation) to include in the computation of readability. This makes it easy to exclude "sentences" that may not really be sentences, such as section titles, table elements, and other cruft that might be in the texts following conversion (see the sketch after this list). For finer-grained control, consider filtering sentences first, including through pattern-matching, using corpus_trim().
intermediate: if TRUE, include intermediate quantities in the output
...: not used
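As referenced in the sentence-length entry above, a hedged sketch of excluding short pseudo-sentences such as headings (the threshold of 3 tokens and the example text are arbitrary assumptions):

+library("quanteda")
+txt <- "SECTION ONE. This sentence is long enough to count toward readability."
+# drop "sentences" of fewer than 3 tokens, e.g. the heading "SECTION ONE."
+textstat_readability(txt, measure = "Flesch", min_sentence_length = 3)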
textstat_readability
returns a data.frame of documents and
+their readability scores.
The following readability formulas have been implemented, where
Nw = \(n_{w}\) = number of words
Nc = \(n_{c}\) = number of characters
Nst = \(n_{st}\) = number of sentences
Nsy = \(n_{sy}\) = number of syllables
Nwf = \(n_{wf}\) = number of words matching the Dale-Chall List +of 3000 "familiar words"
ASL = Average Sentence Length: number of words / number of sentences
AWL = Average Word Length: number of characters / number of words
AFW = Average Familiar Words: count of words matching the Dale-Chall +list of 3000 "familiar words" / number of all words
Nwd = \(n_{wd}\) = number of "difficult" words not matching the +Dale-Chall list of "familiar" words
"ARI"
:Automated Readability Index (Senter and Smith 1967) +$$0.5 ASL + 4.71 AWL - 21.34$$
"ARI.Simple"
:A simplified version of Senter and Smith's (1967) Automated Readability Index. +$$ASL + 9 AWL$$
"Bormuth.MC"
:Bormuth's (1969) Mean Cloze Formula. +$$0.886593 - 0.03640 \times AWL + 0.161911 \times AFW - 0.21401 \times + ASL - 0.000577 \times ASL^2 - 0.000005 \times ASL^3$$
"Bormuth.GP"
:Bormuth's (1969) Grade Placement score.
+$$4.275 + 12.881M - 34.934M^2 + 20.388 M^3 + 26.194 CCS -
+ 2.046 CCS^2 - 11.767 CCS^3 - 42.285(M \times CCS) + 97.620(M \times CCS)^2 -
+ 59.538(M \times CCS)^3$$
+where \(M\) is the Bormuth Mean Cloze Formula as in "Bormuth.MC" above, and \(CCS\) is the Cloze Criterion Score (Bormuth, 1968).
"Coleman"
:Coleman's (1971) Readability Formula 1. +$$1.29 \times \frac{100 \times n_{wsy=1}}{n_{w}} - 38.45$$
+where \(n_{wsy=1}\) = Nwsy1 = the number of one-syllable words. The +scaling by 100 in this and the other Coleman-derived measures arises +because the Coleman measures are calculated on a per 100 words basis.
"Coleman.C2"
:Coleman's (1971) Readability Formula 2. $$1.16 \times \frac{100 \times n_{wsy=1}}{n_{w}} + 1.48 \times \frac{100 \times n_{st}}{n_{w}} - 37.95$$
"Coleman.Liau.ECP"
:Coleman-Liau Estimated Cloze Percent +(ECP) (Coleman and Liau 1975). +$$141.8401 - 0.214590 \times 100 + \times AWL + 1.079812 \times \frac{n_{st} \times 100}{n_{w}}$$
"Coleman.Liau.grade"
:Coleman-Liau Grade Level (Coleman and Liau 1975). $$-27.4004 \times \mathtt{Coleman.Liau.ECP} / 100 + 23.06395$$
"Coleman.Liau.short"
:Coleman-Liau Index (Coleman and Liau 1975). +$$5.88 \times AWL + 29.6 \times \frac{n_{st}}{n_{w}} - 15.8$$
"Dale.Chall"
:The New Dale-Chall Readability formula (Chall +and Dale 1995). +$$64 - (0.95 \times 100 \times \frac{n_{wd}}{n_{w}}) - (0.69 \times ASL)$$
"Dale.Chall.Old"
:The original Dale-Chall Readability formula +(Dale and Chall (1948). +$$0.1579 \times 100 \times \frac{n_{wd}}{n_{w}} + 0.0496 \times ASL [+ 3.6365]$$
+The additional constant 3.6365 is only added if (Nwd / Nw) > 0.05.
"Dale.Chall.PSK"
:The Powers-Sumner-Kearl Variation of the Dale and Chall Readability formula (Powers, Sumner and Kearl, 1958). $$(0.1155 \times 100 \times \frac{n_{wd}}{n_{w}}) + (0.0596 \times ASL) + 3.2672$$
"Danielson.Bryan"
:Danielson-Bryan's (1963) Readability Measure 1. $$ + (1.0364 \times \frac{n_{c}}{n_{blank}}) + + (0.0194 \times \frac{n_{c}}{n_{st}}) - + 0.6059$$
+where \(n_{blank}\) = Nblank = the number of blanks.
"Danielson.Bryan2"
:Danielson-Bryan's (1963) Readability Measure 2. $$131.059 - (10.364 \times \frac{n_{c}}{n_{blank}}) + (0.0194 \times \frac{n_{c}}{n_{st}})$$
+where \(n_{blank}\) = Nblank = the number of blanks.
"Dickes.Steiwer"
:Dickes-Steiwer Index (Dicks and Steiwer 1977). $$ + 235.95993 - (7.3021 \times AWL) - (12.56438 \times ASL) - + (50.03293 \times TTR)$$
+where TTR is the Type-Token Ratio (see textstat_lexdiv()
)
"DRP"
:Degrees of Reading Power. $$(1 - Bormuth.MC) * + 100$$
+where Bormuth.MC refers to Bormuth's (1969) Mean Cloze Formula (documented above)
"ELF"
:Easy Listening Formula (Fang 1966): $$\frac{n_{wsy>=2}}{n_{st}}$$
+where \(n_{wsy>=2}\) = Nwmin2sy = the number of words with 2 syllables or more.
"Farr.Jenkins.Paterson"
:Farr-Jenkins-Paterson's +Simplification of Flesch's Reading Ease Score (Farr, Jenkins and Paterson 1951). $$ + -31.517 - (1.015 \times ASL) + (1.599 \times + \frac{n_{wsy=1}}{n_{w}})$$
+where \(n_{wsy=1}\) = Nwsy1 = the number of one-syllable words.
"Flesch"
:Flesch's Reading Ease Score (Flesch 1948). +$$206.835 - (1.015 \times ASL) - (84.6 \times \frac{n_{sy}}{n_{w}})$$
"Flesch.PSK"
:The Powers-Sumner-Kearl's Variation of Flesch Reading Ease Score +(Powers, Sumner and Kearl, 1958). $$ (0.0778 \times + ASL) + (4.55 \times \frac{n_{sy}}{n_{w}}) - + 2.2029$$
"Flesch.Kincaid"
:Flesch-Kincaid Readability Score (Flesch and Kincaid 1975). $$ + 0.39 \times ASL + 11.8 \times \frac{n_{sy}}{n_{w}} - + 15.59$$
"FOG"
:Gunning's Fog Index (Gunning 1952). $$0.4 + \times (ASL + 100 \times \frac{n_{wsy>=3}}{n_{w}})$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more.
+The scaling by 100 arises because the original FOG index is based on
+just a sample of 100 words.
"FOG.PSK"
:The Powers-Sumner-Kearl Variation of Gunning's +Fog Index (Powers, Sumner and Kearl, 1958). $$3.0680 \times + (0.0877 \times ASL) +(0.0984 \times 100 \times \frac{n_{wsy>=3}}{n_{w}})$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more.
+The scaling by 100 arises because the original FOG index is based on
+just a sample of 100 words.
"FOG.NRI"
:The Navy's Adaptation of Gunning's Fog Index (Kincaid, Fishburne, Rogers and Chissom 1975). $$\left( \frac{n_{wsy<3} + 3 \times n_{wsy=3}}{100 \times \frac{n_{st}}{n_{w}}} - 3 \right) / 2$$
+where \(n_{wsy<3}\) = Nwless3sy = the number of words with less than 3 syllables, and \(n_{wsy=3}\) = Nw3sy = the number of 3-syllable words. The scaling by 100 arises because the original FOG index is based on just a sample of 100 words.
"FORCAST"
:FORCAST (Simplified Version of FORCAST.RGL) (Caylor and Sticht 1973). $$20 - \frac{n_{wsy=1} \times 150}{n_{w} \times 10}$$
+where \(n_{wsy=1}\) = Nwsy1 = the number of one-syllable words. The scaling by 150 +arises because the original FORCAST index is based on just a sample of 150 words.
"FORCAST.RGL"
:FORCAST.RGL (Caylor and Sticht 1973). $$20.43 - 0.11 \times \frac{n_{wsy=1} \times 150}{n_{w} \times 10}$$
+where \(n_{wsy=1}\) = Nwsy1 = the number of one-syllable words. The scaling by 150 arises +because the original FORCAST index is based on just a sample of 150 words.
"Fucks"
:Fucks' (1955) Stilcharakteristik (Style Characteristic). $$AWL \times ASL$$
"Linsear.Write"
:Linsear Write (Klare 1975). +$$\frac{[(100 - (\frac{100 \times n_{wsy<3}}{n_{w}})) + + (3 \times \frac{100 \times n_{wsy>=3}}{n_{w}})]}{(100 \times + \frac{n_{st}}{n_{w}})}$$
+where \(n_{wsy<3}\) = Nwless3sy = the number of words with less than 3 syllables, and
+\(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more. The scaling
+by 100 arises because the original Linsear.Write measure is based on just a sample of 100 words.
"LIW"
:Björnsson's (1968) Läsbarhetsindex (For Swedish +Texts). $$ASL + \frac{100 \times n_{wsy>=7}}{n_{w}}$$
+where \(n_{wsy>=7}\) = Nwmin7sy = the number of words with 7 syllables or more. The scaling
+by 100 arises because the Läsbarhetsindex is based on just a sample of 100 words.
"nWS"
:Neue Wiener Sachtextformeln 1 (Bamberger and Vanecek 1984). $$19.35 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1672 \times ASL + 12.97 \times \frac{n_{wchar>=6}}{n_{w}} - 3.27 \times \frac{n_{wsy=1}}{n_{w}} - 0.875$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more, +\(n_{wchar>=6}\) = Nwmin6char = the number of words with 6 characters or more, and +\(n_{wsy=1}\) = Nwsy1 = the number of one-syllable words.
"nWS.2"
:Neue Wiener Sachtextformeln 2 (Bamberger and +Vanecek 1984). $$20.07 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1682 \times ASL + + 13.73 \times \frac{n_{wchar>=6}}{n_{w}} - 2.779$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more, and +\(n_{wchar>=6}\) = Nwmin6char = the number of words with 6 characters or more.
"nWS.3"
:Neue Wiener Sachtextformeln 3 (Bamberger and +Vanecek 1984). $$29.63 \times \frac{n_{wsy>=3}}{n_{w}} + 0.1905 \times + ASL - 1.1144$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more.
"nWS.4"
:Neue Wiener Sachtextformeln 4 (Bamberger and +Vanecek 1984). $$27.44 \times \frac{n_{wsy>=3}}{n_{w}} + 0.2656 \times + ASL - 1.693$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more.
"RIX"
:Anderson's (1983) Readability Index. $$ + \frac{n_{wsy>=7}}{n_{st}}$$
+where \(n_{wsy>=7}\) = Nwmin7sy = the number of words with 7-syllables or more.
"Scrabble"
:Scrabble Measure. $$Mean + Scrabble Letter Values of All Words$$. +Scrabble values are for English. There is no reference for this, as we +created it experimentally. It's not part of any accepted readability +index!
"SMOG"
:Simple Measure of Gobbledygook (SMOG) (McLaughlin 1969). $$1.043 \times \sqrt{n_{wsy>=3} \times \frac{30}{n_{st}}} + 3.1291$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more. +This measure is regression equation D in McLaughlin's original paper.
"SMOG.C"
:SMOG (Regression Equation C) (McLaughlin 1969). $$0.9986 \times \sqrt{n_{wsy>=3} \times \frac{30}{n_{st}} + 5} + 2.8795$$
+where \(n_{wsy>=3}\) = Nwmin3sy = the number of words with 3 syllables or more. +This measure is regression equation C in McLaughlin's original paper.
"SMOG.simple"
:Simplified Version of McLaughlin's (1969) SMOG Measure. $$\sqrt{n_{wsy>=3} \times \frac{30}{n_{st}}} + 3$$
"SMOG.de"
:Adaptation of McLaughlin's (1969) SMOG Measure for German Texts. $$\sqrt{n_{wsy>=3} \times \frac{30}{n_{st}} - 2}$$
"Spache"
:Spache's (1952) Readability Measure. $$ 0.121 \times + ASL + 0.082 \times \frac{n_{wnotinspache}}{n_{w}} + + 0.659$$
+where \(n_{wnotinspache}\) = Nwnotinspache = number of unique words not in the Spache word list.
"Spache.old"
:Spache's (1952) Readability Measure (Old). $$0.141 + \times ASL + 0.086 \times \frac{n_{wnotinspache}}{n_{w}} + + 0.839$$
+where \(n_{wnotinspache}\) = Nwnotinspache = number of unique words not in the Spache word list.
"Strain"
:Strain Index (Solomon 2006). $$\frac{n_{sy}}{n_{st} / 3} / 10$$
+The scaling by 3 arises because the original Strain index is based on just the first 3 sentences.
"Traenkle.Bailer"
:Tränkle & Bailer's (1984) Readability Measure 1. $$224.6814 - (79.8304 \times AWL) - (12.24032 \times ASL) - (1.292857 \times 100 \times \frac{n_{prep}}{n_{w}})$$
+where \(n_{prep}\) = Nprep = the number of prepositions. The scaling by 100 arises because the original +Tränkle & Bailer index is based on just a sample of 100 words.
"Traenkle.Bailer2"
:Tränkle & Bailer's (1984) Readability Measure 2. $$234.1063 - (96.11069 \times AWL) - (2.05444 \times 100 \times \frac{n_{prep}}{n_{w}}) - (1.02805 \times 100 \times \frac{n_{conj}}{n_{w}})$$
+where \(n_{prep}\) = Nprep = the number of prepositions, and \(n_{conj}\) = Nconj = the number of conjunctions. The scaling by 100 arises because the original Tränkle & Bailer index is based on just a sample of 100 words.
"Wheeler.Smith"
:Wheeler & Smith's (1954) Readability Measure. $$ASL \times 10 \times \frac{n_{wsy>=2}}{n_{w}}$$
+where \(n_{wsy>=2}\) = Nwmin2sy = the number of words with 2 syllables or more.
"meanSentenceLength"
:Average Sentence Length (ASL). +$$\frac{n_{w}}{n_{st}}$$
"meanWordSyllables"
:Average Word Syllables (AWL). +$$\frac{n_{sy}}{n_{w}}$$
Anderson, J. (1983). Lix and rix: Variations on a little-known readability
+index. Journal of Reading, 26(6),
+490--496. https://www.jstor.org/stable/40031755
Bamberger, R. & Vanecek, E. (1984). Lesen-Verstehen-Lernen-Schreiben. +Wien: Jugend und Volk.
+Björnsson, C. H. (1968). Läsbarhet. Stockholm: Liber.
+Bormuth, J.R. (1969). Development of Readability Analysis.
+Bormuth, J.R. (1968). Cloze test readability: Criterion reference
+scores. Journal of educational
+measurement, 5(3), 189--196. https://www.jstor.org/stable/1433978
Caylor, J.S. (1973). Methodologies for Determining Reading Requirements of
+Military Occupational Specialities. https://eric.ed.gov/?id=ED074343
Caylor, J.S. & Sticht, T.G. (1973). Development of a Simple Readability
+Index for Job Reading Material
+https://archive.org/details/ERIC_ED076707
Coleman, E.B. (1971). Developing a technology of written instruction: Some +determiners of the complexity of prose. Verbal learning research and the +technology of written instruction, 155--204.
+Coleman, M. & Liau, T.L. (1975). A Computer Readability Formula Designed +for Machine Scoring. Journal of Applied Psychology, 60(2), 283. +doi:10.1037/h0076540
+Dale, E. and Chall, J.S. (1948). A Formula for Predicting Readability:
+Instructions. Educational Research
+Bulletin, 37-54. https://www.jstor.org/stable/1473169
Chall, J.S. and Dale, E. (1995). Readability Revisited: The New Dale-Chall +Readability Formula. Brookline Books.
+Dickes, P. & Steiwer, L. (1977). Ausarbeitung von Lesbarkeitsformeln für +die Deutsche Sprache. Zeitschrift für Entwicklungspsychologie und +Pädagogische Psychologie 9(1), 20--28.
+Danielson, W.A., & Bryan, S.D. (1963). Computer Automation of Two +Readability +Formulas. +Journalism Quarterly, 40(2), 201--206. doi:10.1177/107769906304000207
+DuBay, W.H. (2004). The Principles of Readability.
+Fang, I. E. (1966). The "Easy listening formula". Journal of Broadcasting +& Electronic Media, 11(1), 63--68. doi:10.1080/08838156609363529
+Farr, J. N., Jenkins, J.J., & Paterson, D.G. (1951). Simplification of +Flesch Reading Ease Formula. Journal of Applied Psychology, 35(5): 333. +doi:10.1037/h0057532
+Flesch, R. (1948). A New Readability Yardstick. Journal of Applied +Psychology, 32(3), 221. doi:10.1037/h0057532
+Fucks, W. (1955). Der Unterschied des Prosastils von Dichtern und anderen +Schriftstellern. Sprachforum, 1, 233-244.
+Gunning, R. (1952). The Technique of Clear Writing. New York: +McGraw-Hill.
+Klare, G.R. (1975). Assessing Readability. Reading Research Quarterly, +10(1), 62-102. doi:10.2307/747086
+Kincaid, J. P., Fishburne Jr, R.P., Rogers, R.L., & Chissom, B.S. (1975). +Derivation of New Readability Formulas (Automated Readability Index, FOG count and Flesch Reading Ease Formula) for Navy Enlisted Personnel.
+McLaughlin, G.H. (1969). SMOG Grading: A New Readability Formula. +Journal of Reading, 12(8), 639-646.
+Michalke, M. (2014). koRpus: An R Package for Text Analysis (Version 0.05-4). +Available from https://reaktanz.de/?c=hacking&s=koRpus.
+Powers, R.D., Sumner, W.A., and Kearl, B.E. (1958). A Recalculation of +Four Adult Readability Formulas. Journal of Educational Psychology, +49(2), 99. doi:10.1037/h0043254
+Senter, R. J., & Smith, E. A. (1967). Automated readability index. +Wright-Patterson Air Force Base. Report No. AMRL-TR-6620.
+*Solomon, N. W. (2006). Qualitative Analysis of Media Language. India.
+Spache, G. (1953). "A new readability formula for primary-grade reading
+materials." The Elementary School Journal, 53, 410--413.
+https://www.jstor.org/stable/998915
Tränkle, U. & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von +Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für +Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231--244.
+Wheeler, L.R. & Smith, E.H. (1954). A Practical Readability Formula for the
+Classroom Teacher in the Primary Grades. Elementary English, 31,
+397--399. https://www.jstor.org/stable/41384251
*Nimaldasan is the pen name of N. Watson Solomon, Assistant Professor of +Journalism, School of Media Studies, SRM University, India.
+txt <- c(doc1 = "Readability zero one. Ten, Eleven.",
+ doc2 = "The cat in a dilapidated tophat.")
+textstat_readability(txt, measure = "Flesch")
+#> document Flesch
+#> 1 doc1 1.2575
+#> 2 doc2 45.6450
+textstat_readability(txt, measure = c("FOG", "FOG.PSK", "FOG.NRI"))
+#> document FOG FOG.PSK FOG.NRI
+#> 1 doc1 17.000000 4.608659 -1.3875
+#> 2 doc2 9.066667 3.254382 -1.2600
+
+textstat_readability(quanteda::data_corpus_inaugural[48:58],
+ measure = c("Flesch.Kincaid", "Dale.Chall.old"))
+#> document Flesch.Kincaid Dale.Chall.old
+#> 1 1977-Carter 11.670742 8.218925
+#> 2 1981-Reagan 9.755604 7.580752
+#> 3 1985-Reagan 10.420294 7.430830
+#> 4 1989-Bush 7.147029 6.584037
+#> 5 1993-Clinton 10.381579 7.340028
+#> 6 1997-Clinton 9.828863 7.388557
+#> 7 2001-Bush 8.933091 7.216451
+#> 8 2005-Bush 11.041969 7.622865
+#> 9 2009-Obama 10.234345 7.456305
+#> 10 2013-Obama 11.734767 7.845061
+#> 11 2017-Trump 9.171244 6.777431
+
Users can subset output object of textstat_collocations
,
+textstat_keyness
or textstat_frequency
based on
+"glob"
, "regex"
or "fixed"
patterns using this method.
x: a textstat object
selection: whether to "keep" or "remove" the rows that match the pattern
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
library("quanteda")
+
+period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
+dfmat <- tokens(data_corpus_inaugural) %>%
+ dfm() %>%
+ dfm_group(groups = period)
+tstat <- textstat_keyness(dfmat)
+textstat_select(tstat, 'america*')
+#> feature chi2 p n_target n_reference
+#> 5 america 200.5543560 0.000000e+00 148 54
+#> 9 americans 163.1532091 0.000000e+00 76 7
+#> 17 america's 93.4124870 0.000000e+00 37 0
+#> 86 american 24.4057333 7.803611e-07 78 94
+#> 1127 americas 0.6901897 4.060998e-01 2 1
+#> 1698 american's 0.2300602 6.314792e-01 1 0
+#> 5393 americanism -0.3961920 5.290624e-01 0 1
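A further hedged sketch using regular-expression matching rather than the glob default (the pattern is an illustrative assumption):

+# keep only the rows whose feature ends in "s"
+textstat_select(tstat, "america.*s$", valuetype = "regex")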
+
+
These functions compute matrices of distances and similarities between
+documents or features from a dfm() and return a matrix of
+similarities or distances in a sparse format. These methods are fast
+and robust because they operate directly on the sparse dfm objects.
+The output can easily be coerced to an ordinary matrix, a data.frame of
+pairwise comparisons, or a dist object.
textstat_simil(
+ x,
+ y = NULL,
+ selection = NULL,
+ margin = c("documents", "features"),
+ method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamann",
+ "simple matching"),
+ min_simil = NULL,
+ ...
+)
+
+textstat_dist(
+ x,
+ y = NULL,
+ selection = NULL,
+ margin = c("documents", "features"),
+ method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
+ p = 2,
+ ...
+)
x, y: dfm objects; y is an optional target matrix matching x in the margin on which the similarity or distance will be computed
selection: (deprecated - use y instead)
margin: identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features
method: character; the method identifying the similarity or distance measure to be used; see Details
min_simil: numeric; a threshold for the similarity values, below which similarity values will not be returned
...: unused
p: the power of the Minkowski distance
A sparse matrix from the Matrix package that will be symmetric
+unless y
is specified.
textstat_simil
options are: "correlation"
(default),
+"cosine"
, "jaccard"
, "ejaccard"
, "dice"
,
+"edice"
, "simple matching"
, and "hamann"
.
textstat_dist
options are: "euclidean"
(default),
+"manhattan"
, "maximum"
, "canberra"
,
+and "minkowski"
.
If you want to compute similarity on a "normalized" dfm object
+(controlling for variable document lengths, for methods such as correlation
+for which different document lengths matter), then wrap the input dfm in
+dfm_weight(x, "prop").
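A minimal sketch of that advice, assuming the inaugural corpus used throughout this page:

+library("quanteda")
+dfmat_norm <- tokens(data_corpus_inaugural) %>%
+    dfm() %>%
+    dfm_weight(scheme = "prop")
+# correlation on the length-normalized dfm
+textstat_simil(dfmat_norm, method = "correlation", margin = "documents")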
The output objects from textstat_simil()
and textstat_dist()
can be
+transformed easily into a list format using
+as.list()
, which returns a list for each unique
+element of the second of the pairs, a data.frame using
+as.data.frame()
, which returns pairwise
+scores, as.dist()
for a dist object,
+or as.matrix()
to convert it into an ordinary matrix.
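The p argument applies only to the Minkowski method; a hedged sketch on a toy dfm (the documents are illustrative assumptions):

+library("quanteda")
+dfmat_toy <- dfm(tokens(c(d1 = "a b c c", d2 = "a a b d")))
+# Minkowski distance with p = 3 (p = 2 is equivalent to Euclidean)
+textstat_dist(dfmat_toy, method = "minkowski", p = 3)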
# similarities for documents
+library("quanteda")
+dfmat <- corpus_subset(data_corpus_inaugural, Year > 2000) %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("english")) %>%
+ dfm()
+(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
+#> textstat_simil object; method = "cosine"
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 1.000 0.520 0.541 0.556 0.452 0.562
+#> 2005-Bush 0.520 1.000 0.458 0.516 0.435 0.480
+#> 2009-Obama 0.541 0.458 1.000 0.637 0.448 0.616
+#> 2013-Obama 0.556 0.516 0.637 1.000 0.455 0.606
+#> 2017-Trump 0.452 0.435 0.448 0.455 1.000 0.513
+#> 2021-Biden 0.562 0.480 0.616 0.606 0.513 1.000
+as.matrix(tstat1)
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 1.0000000 0.5204355 0.5411649 0.5561972 0.4518935 0.5619136
+#> 2005-Bush 0.5204355 1.0000000 0.4575297 0.5163644 0.4349030 0.4797651
+#> 2009-Obama 0.5411649 0.4575297 1.0000000 0.6373318 0.4481950 0.6158540
+#> 2013-Obama 0.5561972 0.5163644 0.6373318 1.0000000 0.4546945 0.6061256
+#> 2017-Trump 0.4518935 0.4349030 0.4481950 0.4546945 1.0000000 0.5133378
+#> 2021-Biden 0.5619136 0.4797651 0.6158540 0.6061256 0.5133378 1.0000000
+as.list(tstat1)
+#> $`2001-Bush`
+#> 2021-Biden 2013-Obama 2009-Obama 2005-Bush 2017-Trump
+#> 0.5619136 0.5561972 0.5411649 0.5204355 0.4518935
+#>
+#> $`2005-Bush`
+#> 2001-Bush 2013-Obama 2021-Biden 2009-Obama 2017-Trump
+#> 0.5204355 0.5163644 0.4797651 0.4575297 0.4349030
+#>
+#> $`2009-Obama`
+#> 2013-Obama 2021-Biden 2001-Bush 2005-Bush 2017-Trump
+#> 0.6373318 0.6158540 0.5411649 0.4575297 0.4481950
+#>
+#> $`2013-Obama`
+#> 2009-Obama 2021-Biden 2001-Bush 2005-Bush 2017-Trump
+#> 0.6373318 0.6061256 0.5561972 0.5163644 0.4546945
+#>
+#> $`2017-Trump`
+#> 2021-Biden 2013-Obama 2001-Bush 2009-Obama 2005-Bush
+#> 0.5133378 0.4546945 0.4518935 0.4481950 0.4349030
+#>
+#> $`2021-Biden`
+#> 2009-Obama 2013-Obama 2001-Bush 2017-Trump 2005-Bush
+#> 0.6158540 0.6061256 0.5619136 0.5133378 0.4797651
+#>
+as.list(tstat1, diag = TRUE)
+#> $`2001-Bush`
+#> 2001-Bush 2021-Biden 2013-Obama 2009-Obama 2005-Bush 2017-Trump
+#> 1.0000000 0.5619136 0.5561972 0.5411649 0.5204355 0.4518935
+#>
+#> $`2005-Bush`
+#> 2005-Bush 2001-Bush 2013-Obama 2021-Biden 2009-Obama 2017-Trump
+#> 1.0000000 0.5204355 0.5163644 0.4797651 0.4575297 0.4349030
+#>
+#> $`2009-Obama`
+#> 2009-Obama 2013-Obama 2021-Biden 2001-Bush 2005-Bush 2017-Trump
+#> 1.0000000 0.6373318 0.6158540 0.5411649 0.4575297 0.4481950
+#>
+#> $`2013-Obama`
+#> 2013-Obama 2009-Obama 2021-Biden 2001-Bush 2005-Bush 2017-Trump
+#> 1.0000000 0.6373318 0.6061256 0.5561972 0.5163644 0.4546945
+#>
+#> $`2017-Trump`
+#> 2017-Trump 2021-Biden 2013-Obama 2001-Bush 2009-Obama 2005-Bush
+#> 1.0000000 0.5133378 0.4546945 0.4518935 0.4481950 0.4349030
+#>
+#> $`2021-Biden`
+#> 2021-Biden 2009-Obama 2013-Obama 2001-Bush 2017-Trump 2005-Bush
+#> 1.0000000 0.6158540 0.6061256 0.5619136 0.5133378 0.4797651
+#>
+
+# min_simil
+(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents", min_simil = 0.6))
+#> textstat_simil object; method = "cosine"
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 1 . . . . .
+#> 2005-Bush . 1 . . . .
+#> 2009-Obama . . 1.000 0.637 . 0.616
+#> 2013-Obama . . 0.637 1.000 . 0.606
+#> 2017-Trump . . . . 1 .
+#> 2021-Biden . . 0.616 0.606 . 1.000
+as.matrix(tstat2)
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 1 NA NA NA NA NA
+#> 2005-Bush NA 1 NA NA NA NA
+#> 2009-Obama NA NA 1.0000000 0.6373318 NA 0.6158540
+#> 2013-Obama NA NA 0.6373318 1.0000000 NA 0.6061256
+#> 2017-Trump NA NA NA NA 1 NA
+#> 2021-Biden NA NA 0.6158540 0.6061256 NA 1.0000000
+
+# similarities for specific documents
+textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
+#> textstat_simil object; method = "correlation"
+#> 2017-Trump
+#> 2001-Bush 0.375
+#> 2005-Bush 0.355
+#> 2009-Obama 0.356
+#> 2013-Obama 0.373
+#> 2017-Trump 1.000
+#> 2021-Biden 0.449
+textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
+#> textstat_simil object; method = "cosine"
+#> 2017-Trump
+#> 2001-Bush 0.452
+#> 2005-Bush 0.435
+#> 2009-Obama 0.448
+#> 2013-Obama 0.455
+#> 2017-Trump 1.000
+#> 2021-Biden 0.513
+textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")
+#> textstat_simil object; method = "correlation"
+#> 2009-Obama 2013-Obama
+#> 2001-Bush 0.452 0.479
+#> 2005-Bush 0.352 0.432
+#> 2009-Obama 1.000 0.561
+#> 2013-Obama 0.561 1.000
+#> 2017-Trump 0.356 0.373
+#> 2021-Biden 0.548 0.543
+
+# compute some term similarities
+tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")], method = "cosine",
+ margin = "features")
+head(as.matrix(tstat3), 10)
+#> fair health terror
+#> president 0.3396831 0.6240377 0.09805807
+#> clinton 0.4714045 0.4330127 0.00000000
+#> distinguished 0.5773503 0.7071068 0.00000000
+#> guests 0.5773503 0.7071068 0.00000000
+#> fellow 0.4256283 0.7298004 0.14744196
+#> citizens 0.7064173 0.6488857 0.07647191
+#> peaceful 0.5163978 0.6324555 0.00000000
+#> transfer 0.3333333 0.4082483 0.00000000
+#> authority 0.8164966 0.5000000 0.00000000
+#> rare 0.4082483 0.5000000 0.00000000
+as.list(tstat3, n = 6)
+#> $fair
+#> continue chance differences dangers choose charity
+#> 1 1 1 1 1 1
+#>
+#> $health
+#> generations without work common fathers nation
+#> 0.9733285 0.9594032 0.9527861 0.9486833 0.9486833 0.9282422
+#>
+#> $terror
+#> bestowed sacrifices ancestors generosity cooperation forty-four
+#> 1 1 1 1 1 1
+#>
+
+
+# distances for documents
+(tstat4 <- textstat_dist(dfmat, margin = "documents"))
+#> textstat_dist object; method = "euclidean"
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 0 52.8 49.9 48.3 47.6 57.2
+#> 2005-Bush 52.8 0 60.8 56.9 57.4 66.0
+#> 2009-Obama 49.9 60.8 0 48.0 54.9 56.1
+#> 2013-Obama 48.3 56.9 48.0 0 53.7 56.5
+#> 2017-Trump 47.6 57.4 54.9 53.7 0 60.0
+#> 2021-Biden 57.2 66.0 56.1 56.5 60.0 0
+as.matrix(tstat4)
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump 2021-Biden
+#> 2001-Bush 0.00000 52.84884 49.94997 48.31149 47.61302 57.22762
+#> 2005-Bush 52.84884 0.00000 60.84406 56.85948 57.41080 66.00000
+#> 2009-Obama 49.94997 60.84406 0.00000 47.98958 54.91812 56.12486
+#> 2013-Obama 48.31149 56.85948 47.98958 0.00000 53.73081 56.45352
+#> 2017-Trump 47.61302 57.41080 54.91812 53.73081 0.00000 59.98333
+#> 2021-Biden 57.22762 66.00000 56.12486 56.45352 59.98333 0.00000
+as.list(tstat4)
+#> $`2001-Bush`
+#> 2021-Biden 2005-Bush 2009-Obama 2013-Obama 2017-Trump
+#> 57.22762 52.84884 49.94997 48.31149 47.61302
+#>
+#> $`2005-Bush`
+#> 2021-Biden 2009-Obama 2017-Trump 2013-Obama 2001-Bush
+#> 66.00000 60.84406 57.41080 56.85948 52.84884
+#>
+#> $`2009-Obama`
+#> 2005-Bush 2021-Biden 2017-Trump 2001-Bush 2013-Obama
+#> 60.84406 56.12486 54.91812 49.94997 47.98958
+#>
+#> $`2013-Obama`
+#> 2005-Bush 2021-Biden 2017-Trump 2001-Bush 2009-Obama
+#> 56.85948 56.45352 53.73081 48.31149 47.98958
+#>
+#> $`2017-Trump`
+#> 2021-Biden 2005-Bush 2009-Obama 2013-Obama 2001-Bush
+#> 59.98333 57.41080 54.91812 53.73081 47.61302
+#>
+#> $`2021-Biden`
+#> 2005-Bush 2017-Trump 2001-Bush 2013-Obama 2009-Obama
+#> 66.00000 59.98333 57.22762 56.45352 56.12486
+#>
+as.dist(tstat4)
+#> 2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
+#> 2005-Bush 52.84884
+#> 2009-Obama 49.94997 60.84406
+#> 2013-Obama 48.31149 56.85948 47.98958
+#> 2017-Trump 47.61302 57.41080 54.91812 53.73081
+#> 2021-Biden 57.22762 66.00000 56.12486 56.45352 59.98333
+
+# distances for specific documents
+textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
+#> textstat_dist object; method = "euclidean"
+#> 2017-Trump
+#> 2001-Bush 47.6
+#> 2005-Bush 57.4
+#> 2009-Obama 54.9
+#> 2013-Obama 53.7
+#> 2017-Trump 0
+#> 2021-Biden 60.0
+(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama" , "2013-Obama"), ], margin = "documents"))
+#> textstat_dist object; method = "euclidean"
+#> 2009-Obama 2013-Obama
+#> 2001-Bush 49.9 48.3
+#> 2005-Bush 60.8 56.9
+#> 2009-Obama 0 48.0
+#> 2013-Obama 48.0 0
+#> 2017-Trump 54.9 53.7
+#> 2021-Biden 56.1 56.5
+as.matrix(tstat5)
+#> 2009-Obama 2013-Obama
+#> 2001-Bush 49.94997 48.31149
+#> 2005-Bush 60.84406 56.85948
+#> 2009-Obama 0.00000 47.98958
+#> 2013-Obama 47.98958 0.00000
+#> 2017-Trump 54.91812 53.73081
+#> 2021-Biden 56.12486 56.45352
+as.list(tstat5)
+#> $`2009-Obama`
+#> 2005-Bush 2021-Biden 2017-Trump 2001-Bush 2013-Obama
+#> 60.84406 56.12486 54.91812 49.94997 47.98958
+#>
+#> $`2013-Obama`
+#> 2005-Bush 2021-Biden 2017-Trump 2001-Bush 2009-Obama
+#> 56.85948 56.45352 53.73081 48.31149 47.98958
+#>
+
+if (FALSE) {
+# plot a dendrogram after converting the object into distances
+plot(hclust(as.dist(tstat4)))
+}
+
Count syntactic and lexical features of documents such as tokens, types, +sentences, and character categories.
+textstat_summary(x, ...)
x: corpus to be summarized
...: additional arguments passed through to dfm()
Count the total number of characters, tokens and sentences as well as special +tokens such as numbers, punctuation marks, symbols, tags and emojis.
chars = number of characters; equal to nchar()
sents = number of sentences; equal to ntoken(tokens(x), what = "sentence")
tokens = number of tokens; equal to ntoken()
types = number of unique tokens; equal to ntype()
puncts = number of punctuation marks (^\p{P}+$)
numbers = number of numeric tokens (^\p{Sc}{0,1}\p{N}+([.,]*\p{N})*\p{Sc}{0,1}$)
symbols = number of symbols (^\p{S}$)
tags = number of tags; sum of pattern_username and pattern_hashtag in quanteda::quanteda_options()
emojis = number of emojis (^\p{Emoji_Presentation}+$)
if (Sys.info()["sysname"] != "SunOS") {
+library("quanteda")
+corp <- data_corpus_inaugural[1:5]
+textstat_summary(corp)
+toks <- tokens(corp)
+textstat_summary(toks)
+dfmat <- dfm(toks)
+textstat_summary(dfmat)
+}
+#> document chars sents tokens types puncts numbers symbols urls tags
+#> 1 1789-Washington NA NA 1537 603 107 0 0 0 0
+#> 2 1793-Washington NA NA 147 95 12 0 0 0 0
+#> 3 1797-Adams NA NA 2577 801 259 0 0 0 0
+#> 4 1801-Jefferson NA NA 1923 687 197 0 0 0 0
+#> 5 1805-Jefferson NA NA 2380 781 214 0 0 0 0
+#> emojis
+#> 1 0
+#> 2 0
+#> 3 0
+#> 4 0
+#> 5 0
+