From 263cbc4dc87d591f0f33c1307466e853432d2769 Mon Sep 17 00:00:00 2001 From: davidycliao Date: Sat, 28 Dec 2024 22:42:11 +0000 Subject: [PATCH 1/2] UPDATE 0.0.7 tutorial --- _pkgdown.yml | 10 +++--- vignettes/tutorial.Rmd | 72 ++++++++++++++++++++++++++++++------------ 2 files changed, 57 insertions(+), 25 deletions(-) diff --git a/_pkgdown.yml b/_pkgdown.yml index d3923e99..eec0482c 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -277,12 +277,12 @@ navbar: - icon: fa-rocket text: Quick Start menu: + - text: "flaiR Installation" + href: articles/quickstart.html#flair-installation - text: "NLP Tasks" href: articles/quickstart.html#nlp-tasks - - text: "Class and Ojbect" - href: articles/quickstart.html#class-and-ojbect - - text: "More Details about Installation" - href: articles/quickstart.html#more-details-about-installation + - text: "Training and Finetuning" + href: articles/quickstart.html#training-and-fine-tuning - icon: fa-project-diagram @@ -348,6 +348,8 @@ navbar: - icon: fa-newspaper-o text: News menu: + - text: "0.0.7" + href: news/index.html#flair-007-2024-12-26 - text: "0.0.6" href: news/index.html#flair-006-2023-10-29 - text: "0.0.5" diff --git a/vignettes/tutorial.Rmd b/vignettes/tutorial.Rmd index 6ae6f6c5..f7ef0591 100644 --- a/vignettes/tutorial.Rmd +++ b/vignettes/tutorial.Rmd @@ -29,17 +29,20 @@ library(reticulate) reticulate::py_install("flair") ``` -# The Overview + +# Flair NLP and flaiR for Social Science
-**Flair NLP** is an open-source library for Natural Language Processing (NLP) developed by [Zalando Research](https://github.com/zalandoresearch/). Known for its state-of-the-art solutions, such as contextual string embeddings for NLP tasks like Named Entity Recognition (NER), Part-of-Speech tagging (POS), and more, it has garnered the attention of the NLP community for its ease of use and powerful functionalities. 
+Flair NLP is an open-source Natural Language Processing (NLP) library developed by [Zalando Research](https://github.com/zalandoresearch/). Known for its state-of-the-art solutions, it excels in contextual string embeddings, Named Entity Recognition (NER), and Part-of-Speech (POS) tagging. Flair offers robust text analysis tools through multiple embedding approaches, including Flair contextual string embeddings, transformer-based embeddings from Hugging Face, and traditional models such as GloVe and fastText. Additionally, it provides pre-trained models for various languages and seamless integration with fine-tuned transformers hosted on Hugging Face. 
 
-In addition, Flair NLP offers pre-trained models for various languages and tasks, and is compatible with fine-tuned transformers hosted on Hugging Face. 
+flaiR bridges these powerful NLP features from Python to R, making advanced text analysis accessible to social science researchers by combining Flair's ease of use with R's familiar interface and enabling integration with popular R packages such as [quanteda](https://quanteda.io). 
 
</br>
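Before diving in, a quick smoke test helps confirm that the R-to-Python bridge is working. This is a minimal sketch and assumes the Python `flair` package has been installed (for example via the `reticulate::py_install("flair")` call above):

```{r}
# Load flaiR and pull the Sentence class through the bridge
library(flaiR)
Sentence <- flair_data()$Sentence

# If this prints a tokenized Sentence object, the bridge is working
print(Sentence("Hello, flaiR!"))
```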
-- [**Sentence and Token Object**](#tutorial.html#sentence-and-token) +# The Overview + +- [**Sentence and Token Object in FlaiR**](#tutorial.html#sentence-and-token) - [**Sequence Taggings**](#tutorial.html#sequence-taggings) @@ -55,13 +58,16 @@ In addition, Flair NLP offers pre-trained models for various languages and tasks - [**Extending conText's Embedding Regression**](#tutorial.html#extending-contexts-embedding-regression) ------------------------------------------------------------------------- +  + +----- + # Sentence and Token Sentence and Token are fundamental classes. -## **Sentence** +## Sentence
@@ -85,7 +91,7 @@ print(sentence) ``` -## **Token** +## Token
@@ -183,7 +189,7 @@ print(sentence) [^1]: Flair is built on PyTorch, which is a library in Python. -## **Corpus** +## Corpus The Corpus object in Flair is a fundamental data structure that represents a dataset containing text samples, usually comprising of a training set, a development set (or validation set), and a test set. It's designed to work smoothly with Flair's models for tasks like named entity recognition, text classification, and more. @@ -288,7 +294,10 @@ In the later sections, there will be more similar processing using the `Corpus`.
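As a rough sketch of how such a train/test split can be assembled by hand, the snippet below builds a tiny Corpus from Sentence objects. Passing plain R lists of Sentences to the `Corpus()` constructor is an assumption based on the usage shown in this tutorial; in practice you would usually load a prepared dataset instead:

```{r}
library(flaiR)
Sentence <- flair_data()$Sentence
Corpus <- flair_data()$Corpus

# Hypothetical examples for a toy sentiment task
train <- list(Sentence("I love this movie."), Sentence("What a waste of time."))
test <- list(Sentence("Surprisingly good."))

# When no dev set is supplied, Flair splits one off from the training data
corp <- Corpus(train = train, test = test)
print(corp)
```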
------------------------------------------------------------------------- +  + +----- + # Sequence Taggings @@ -551,7 +560,9 @@ head(results, n = 10) ``` ------------------------------------------------------------------------- +  + +----- # Embedding @@ -1022,15 +1033,15 @@ figure ``` ------------------------------------------------------------------------- +  +----- -# Training a Binary Classifier +# Training a Binary Classifier In this section, we'll train a sentiment analysis model that can categorize text as either positive or negative. This case study is adapted from pages 116 to 130 of Tadej Magajna's book, '[Natural Language Processing with Flair](https://www.packtpub.com/product/natural-language-processing-with-flair/9781801072311)'. The process for training text classifiers in Flair mirrors the process followed for sequence labeling models. Specifically, the steps to train text classifiers are: - - Load a tagged corpus and compute the label dictionary map. - Prepare the document embeddings. - Initialize the `TextClassifier` class. @@ -1044,7 +1055,6 @@ Training text classification models requires a set of text documents (typically,
- ```{r} library(flaiR) # load IMDB from flair_datasets module @@ -1082,7 +1092,6 @@ flaiR covers all the different types of document embeddings that we can use. Her - ```{r} DocumentPoolEmbeddings <- flair_embeddings()$DocumentPoolEmbeddings WordEmbeddings <- flair_embeddings()$WordEmbeddings @@ -1195,7 +1204,10 @@ print(sentence$labels) ``` ------------------------------------------------------------------------- +  + +----- + # Training RNNs @@ -1312,7 +1324,6 @@ library(flaiR) ``` - ## Fine-tuning a Transformers Model **Step 1** Load Necessary Modules from Flair @@ -1499,7 +1510,10 @@ More R tutorial and documentation see [here](https://github.com/davidycliao/flai ------------------------------------------------------------------------- +  + +----- + # Extending conText's Embedding Regression @@ -1972,16 +1986,17 @@ bt_model <- conText(formula = immigration ~ party + gender, While this tutorial doesn't determine a definitive best approach, it's important to understand the key distinctions between word embedding methods. BERT, FastText, Flair Stacked Embeddings, and GloVe can be categorized into two groups: dynamic and static embeddings. -Dynamic embeddings, such as BERT and Flair, adapt their word representations based on context using high-dimensional vector spaces (BERT uses 768 dimensions in its base model). BERT employs self-attention mechanisms and subword tokenization, while Flair uses character-level modeling. Both effectively handle out-of-vocabulary words through these mechanisms. +Dynamic embeddings, particularly BERT and Flair, adapt their word representations based on context using high-dimensional vector spaces (BERT uses 768 dimensions in its base model). BERT employs self-attention mechanisms and subword tokenization, while Flair uses character-level modeling. Both effectively handle out-of-vocabulary words through these mechanisms. 
-However, it's worth noting that in their case study, where they provide selected words, in our case study we directly extract individual word vectors from BERT and Flair (forward/backward) embeddings using those selected words. This approach doesn't truly achieve the intended contextual modeling. A more meaningful approach would be to extract embeddings at the quasi-sentence or paragraph level. Alternatively, pooling the entire document before extracting embeddings could be more valuable. 
+However, there is a notable difference between their case study and ours. While they provide selected words, we directly extract individual word vectors from BERT and Flair (forward/backward) embeddings using the same set of words. This does not truly exploit BERT's and Flair's capacity for modeling context. A more meaningful approach would be to extract embeddings at the quasi-sentence or paragraph level, or alternatively, to pool the entire document before extracting embeddings. 
 
-These context-based approaches differ significantly from GloVe's methodology, which relies on pre-computed global word-word co-occurrence statistics to generate static word vectors. 
+These context-based approaches stand in stark contrast to GloVe's methodology, which relies on pre-computed global word-word co-occurrence statistics to generate static word vectors. 
 
</br>
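To make the static-versus-contextual distinction concrete, the sketch below pulls a single GloVe vector through flaiR. The `"glove"` model id is Flair's standard 100-dimensional GloVe embedding; treat the exact tensor-access chain as an assumption about the reticulate bridge:

```{r}
library(flaiR)
WordEmbeddings <- flair_embeddings()$WordEmbeddings
Sentence <- flair_data()$Sentence

glove <- WordEmbeddings("glove")
s <- Sentence("immigration")
glove$embed(s)

# A static vector: the same regardless of the surrounding context
emb <- s$tokens[[1]]$embedding$numpy()
length(emb)
```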
```{r echo=FALSE, message = TRUE, warning = TRUE, out.width="95%"}
+
st <- as.data.frame(st_model@normed_coefficients)
st["model"] <- "Flair Stacked Embeddings"
ft <- as.data.frame(ft_model@normed_coefficients)
@@ -2010,5 +2025,20 @@ ggplot(merged_df, aes(x = coefficient, y = normed.estimate)) +
   facet_wrap(~model, nrow = 2, ncol = 2, scales = "free_y")
```

+ 
+
+-----
+
+# Cite
+
+```
+@Manual{,
+  title = {Flair NLP and flaiR for Social Science},
+  author = {Yen-Chieh Liao and Sohini Timbadia and Stefan Müller},
+  year = {2024},
+  url = {https://davidycliao.github.io/flaiR/articles/tutorial.html}
+}
+```
+

From fd90cf963b64f73085ec675b83feaee5a151e285 Mon Sep 17 00:00:00 2001
From: davidycliao
Date: Sun, 29 Dec 2024 03:25:36 +0000
Subject: [PATCH 2/2] UPDATE 0.0.7 clean some typos.

---
 .github/workflows/r.yml         | 165 ------------------------
 .github/workflows/r_macos.yml   | 174 --------------------------
 .github/workflows/r_ubuntu.yaml | 165 ------------------------
 R/predict_label.R               | 114 +++++++++++----------
 man/predict_label.Rd            |  44 ++++----
 vignettes/tutorial.Rmd          |  45 ++++++++-
 6 files changed, 122 insertions(+), 585 deletions(-)

diff --git a/.github/workflows/r.yml b/.github/workflows/r.yml
index 1636412e..a3309209 100644
--- a/.github/workflows/r.yml
+++ b/.github/workflows/r.yml
@@ -5,171 +5,6 @@
 #
 # See https://github.com/r-lib/actions/tree/master/examples#readme for
 # additional example workflows available for the R community.
-# on: -# push: -# branches: [main, master] -# pull_request: -# branches: [main, master] -# -# name: R-CMD-check -# -# jobs: -# R-CMD-check: -# runs-on: ${{ matrix.config.os }} -# -# name: ${{ matrix.config.os }} (${{ matrix.config.r }}) -# -# strategy: -# fail-fast: false -# matrix: -# config: -# - {os: macos-latest, r: 'release'} -# - {os: windows-latest, r: 'release'} -# - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} -# - {os: ubuntu-latest, r: 'release'} -# env: -# GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} -# R_KEEP_PKG_SOURCE: yes -# -# steps: -# - uses: actions/checkout@v2 -# -# - uses: r-lib/actions/setup-pandoc@v2 -# -# -# - name: Setup Python (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# uses: actions/setup-python@v2 -# with: -# python-version: '3.x' -# -# - name: Install Python venv and dependencies (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# run: | -# sudo apt-get update -# sudo apt-get install -y python3-venv -# python -m venv ~/.venv -# echo "RETICULATE_PYTHON=~/.venv/bin/python" >> $GITHUB_ENV -# source ~/.venv/bin/activate -# - uses: r-lib/actions/setup-r@v2 -# with: -# r-version: ${{ matrix.config.r }} -# http-user-agent: ${{ matrix.config.http-user-agent }} -# use-public-rspm: true -# -# - name: Install reticulate (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# run: | -# Rscript -e "install.packages('reticulate', repos = 'https://cloud.r-project.org/')" -# -# - uses: r-lib/actions/setup-r-dependencies@v2 -# with: -# extra-packages: any::rcmdcheck -# needs: check -# -# - uses: r-lib/actions/check-r-package@v2 -# with: -# upload-snapshots: true -# -# -# on: -# push: -# branches: [main, master] -# pull_request: -# branches: [main, master] -# -# name: R-CMD-check -# -# jobs: -# R-CMD-check: -# runs-on: ${{ matrix.config.os }} -# -# name: ${{ matrix.config.os }} (${{ matrix.config.r }}) -# -# strategy: -# fail-fast: false -# matrix: -# config: -# - {os: macos-latest, r: 
'release'} -# - {os: windows-latest, r: 'release'} -# - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} -# - {os: ubuntu-latest, r: 'release'} -# env: -# GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} -# R_KEEP_PKG_SOURCE: yes -# -# steps: -# - uses: actions/checkout@v2 -# -# - uses: r-lib/actions/setup-pandoc@v2 -# -# - name: Setup Python -# uses: actions/setup-python@v2 -# with: -# python-version: '3.9' # Ensure Python 3.x is being used -# -# - name: Check Python Version -# run: | -# python --version -# -# - name: Install Python dependencies -# run: | -# python -m pip install --upgrade pip -# pip install flair -# -# - name: Setup Python (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# uses: actions/setup-python@v2 -# with: -# python-version: '3.10.13' -# -# - name: Install Python venv and dependencies (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# run: | -# sudo apt-get update -# sudo apt-get install -y python3-venv -# python -m venv ~/.venv -# echo "RETICULATE_PYTHON=~/.venv/bin/python" >> $GITHUB_ENV -# source ~/.venv/bin/activate -# -# - uses: r-lib/actions/setup-r@v2 -# with: -# r-version: ${{ matrix.config.r }} -# http-user-agent: ${{ matrix.config.http-user-agent }} -# use-public-rspm: true -# -# - name: Install reticulate (Only on ubuntu-latest) -# if: matrix.config.os == 'ubuntu-latest' -# run: | -# Rscript -e "install.packages('reticulate', repos = 'https://cloud.r-project.org/')" -# -# - name: Install Pandoc (Only on Windows) -# if: matrix.config.os == 'windows-latest' -# run: | -# choco install pandoc -# -# - name: Install Python dependencies (Only on Windows) -# if: matrix.config.os == 'windows-latest' -# run: | -# python -m pip install --upgrade pip -# pip install scipy==1.12.0 # test -# pip install flair -# -# - name: Install Python dependencies (Only on macOS) -# if: matrix.config.os == 'macos-latest' -# run: | -# python -m pip install --upgrade pip -# pip install scipy==1.12.0 # test -# pip install 
flair -# -# - uses: r-lib/actions/setup-r-dependencies@v2 -# with: -# extra-packages: rcmdcheck -# -# # - uses: r-lib/actions/check-r-package@v2 -# # with: -# # upload-snapshots: true - name: R-CMD-check diff --git a/.github/workflows/r_macos.yml b/.github/workflows/r_macos.yml index 4c34a6c9..da8705d5 100644 --- a/.github/workflows/r_macos.yml +++ b/.github/workflows/r_macos.yml @@ -5,180 +5,6 @@ # # See https://github.com/r-lib/actions/tree/master/examples#readme for # additional example workflows available for the R community. -# name: R-MacOS -# -# on: -# push: -# branches: [ "main" ] -# pull_request: -# branches: [ "main" ] -# -# permissions: -# contents: read -# -# jobs: -# build: -# runs-on: macos-latest -# -# strategy: -# matrix: -# r-version: ['4.4.0', '4.3.2'] -# -# steps: -# - uses: actions/checkout@v3 -# -# - name: Update Homebrew -# run: | -# brew update -# -# - name: Install pandoc -# run: | -# for i in {1..3}; do -# brew install pandoc && break || sleep 15 -# done -# -# - name: Install gfortran and configure Makevars -# run: | -# brew install gcc -# mkdir -p ~/.R -# touch ~/.R/Makevars -# echo "FC=$(brew --prefix)/bin/gfortran" >> ~/.R/Makevars -# echo "F77=$(brew --prefix)/bin/gfortran" >> ~/.R/Makevars -# echo "FLIBS=-L$(brew --prefix)/lib/gcc/current -lgfortran -lquadmath -lm" >> ~/.R/Makevars -# echo "LDFLAGS=-L$(brew --prefix)/lib/gcc/current" >> ~/.R/Makevars -# -# - name: Set up R ${{ matrix.r-version }} -# uses: r-lib/actions/setup-r@v2 -# with: -# r-version: ${{ matrix.r-version }} -# -# - name: Install R dependencies -# run: | -# Rscript -e "install.packages(c('remotes', 'rcmdcheck', 'reticulate', 'renv', 'knitr', 'rmarkdown', 'lsa', 'purrr', 'testthat', 'htmltools'), repos='https://cran.r-project.org')" -# Rscript -e "if (getRversion() >= '4.4.0') remotes::install_version('Matrix', version = '1.5.3') else install.packages('Matrix', type = 'binary')" -# Rscript -e "remotes::install_version('htmltools', version = '0.5.8')" -# Rscript -e 
"renv::restore()" -# -# - name: Set up Python -# uses: actions/setup-python@v2 -# with: -# python-version: '3.10.x' -# -# - name: Install Python virtualenv -# run: pip install virtualenv -# -# - name: Create Python virtual environment -# run: virtualenv flair_env -# -# - name: Install Python dependencies in virtual environment -# run: | -# source flair_env/bin/activate -# pip install --upgrade pip -# pip install scipy==1.12.0 -# pip install flair -# -# - name: Remove Python cache files -# run: find . -name '*.pyc' -delete -# -# - name: Check (with virtual environment) -# run: | -# source flair_env/bin/activate -# R CMD build --no-build-vignettes . -# shell: bash -# name: R-MacOS -# -# on: -# push: -# branches: [ "main" ] -# pull_request: -# branches: [ "main" ] -# -# permissions: -# contents: read -# -# jobs: -# build: -# runs-on: macos-latest -# strategy: -# matrix: -# r-version: ['4.4.0', '4.3.2'] -# fail-fast: false -# -# steps: -# - uses: actions/checkout@v3 -# -# - name: Update Homebrew -# run: brew update -# -# - name: Install pandoc -# run: | -# for i in {1..3} -# do -# brew install pandoc && break || sleep 15 -# done -# -# - name: Install gfortran and configure Makevars -# run: | -# brew install gcc -# mkdir -p ~/.R -# touch ~/.R/Makevars -# echo "FC=$(brew --prefix)/bin/gfortran" >> ~/.R/Makevars -# echo "F77=$(brew --prefix)/bin/gfortran" >> ~/.R/Makevars -# echo "FLIBS=-L$(brew --prefix)/lib/gcc/current -lgfortran -lquadmath -lm" >> ~/.R/Makevars -# echo "LDFLAGS=-L$(brew --prefix)/lib/gcc/current" >> ~/.R/Makevars -# -# - name: Set up R ${{ matrix.r-version }} -# uses: r-lib/actions/setup-r@v2 -# with: -# r-version: ${{ matrix.r-version }} -# -# - name: Install R dependencies -# run: | -# # 基礎包安裝 -# Rscript -e 'install.packages(c("remotes", "rcmdcheck", "reticulate", "renv", "knitr", "rmarkdown", "lsa", "purrr", "testthat"), repos="https://cran.r-project.org")' -# -# # 根據 R 版本有條件地安裝 Matrix -# Rscript -e ' -# if (getRversion() >= "4.4.0") { -# 
install.packages("Matrix") -# } else { -# remotes::install_version("Matrix", version = "1.5.1", repos = "https://cran.r-project.org") -# } -# ' -# -# # 安裝指定版本的 htmltools -# Rscript -e 'remotes::install_version("htmltools", version = "0.5.8")' -# -# # 最後執行 renv::restore() -# Rscript -e 'renv::restore()' -# -# - name: Set up Python -# uses: actions/setup-python@v2 -# with: -# python-version: '3.10.x' -# -# - name: Install Python virtualenv -# run: pip install virtualenv -# -# - name: Create Python virtual environment -# run: virtualenv flair_env -# -# - name: Install Python dependencies in virtual environment -# run: | -# source flair_env/bin/activate -# pip install --upgrade pip -# pip install scipy==1.12.0 -# pip install flair -# -# - name: Remove Python cache files -# run: find . -name '*.pyc' -delete -# -# - name: Check (with virtual environment) -# run: | -# source flair_env/bin/activate -# R CMD build --no-build-vignettes . -# shell: bash - name: R-MacOS diff --git a/.github/workflows/r_ubuntu.yaml b/.github/workflows/r_ubuntu.yaml index 6f1fdecf..4133b3b3 100644 --- a/.github/workflows/r_ubuntu.yaml +++ b/.github/workflows/r_ubuntu.yaml @@ -1,168 +1,3 @@ -# name: R-ubuntu -# -# on: -# push: -# branches: -# - main -# pull_request: -# branches: -# - main -# -# jobs: -# R-CMD-check: -# runs-on: ubuntu-20.04 -# -# strategy: -# matrix: -# r-version: ['4.3.2', '4.2.0', '4.2.1'] -# -# steps: -# - uses: actions/checkout@v3 -# -# - name: Cache R dependencies -# uses: actions/cache@v2 -# with: -# path: ~/R/x86_64-pc-linux-gnu-library/ -# key: ${{ runner.os }}-r-${{ hashFiles('**/renv.lock') }} -# restore-keys: ${{ runner.os }}-r- -# -# - name: Setup R -# uses: r-lib/actions/setup-r@v2 -# with: -# use-public-rspm: true -# -# - name: Restore R environment -# run: | -# Rscript -e "if (!requireNamespace('renv', quietly = TRUE)) install.packages('renv')" -# Rscript -e "renv::restore()" -# -# - name: Install additional R packages -# run: Rscript -e 
'install.packages(c("knitr", "rmarkdown", "lsa", "purrr", "ggplot2"))' -# shell: bash -# -# - name: Set up Python -# uses: actions/setup-python@v2 -# with: -# python-version: '3.10.x' -# -# - name: Install Python virtualenv -# run: pip install virtualenv -# -# - name: Create Python virtual environment -# run: virtualenv flair_env -# -# - name: Install Python dependencies in virtual environment -# run: | -# source flair_env/bin/activate -# pip install --upgrade pip -# pip install scipy==1.12.0 # test -# pip install flair -# -# - name: Remove Python cache files -# run: find . -name '*.pyc' -delete -# -# - name: Check R environment status -# run: Rscript -e "renv::status()" -# -# - name: Synchronize R environment -# run: Rscript -e "renv::sync()" -# -# - name: Check R package (with virtual environment) -# run: | -# source flair_env/bin/activate -# R CMD build . --no-build-vignettes -# R CMD check *tar.gz --no-build-vignettes --no-manual --no-examples -# shell: bash -# -# # -# name: R-ubuntu -# -# on: -# push: -# branches: -# - main -# pull_request: -# branches: -# - main -# -# jobs: -# R-CMD-check: -# runs-on: ubuntu-20.04 -# strategy: -# matrix: -# r-version: ['4.3.2', '4.2.0', '4.2.1'] -# -# env: -# R_LIBS_USER: /home/runner/work/_temp/Library -# TZ: UTC -# _R_CHECK_SYSTEM_CLOCK_: FALSE -# NOT_CRAN: true -# RSPM: https://packagemanager.posit.co/cran/__linux__/focal/latest -# RENV_CONFIG_REPOS_OVERRIDE: https://packagemanager.posit.co/cran/__linux__/focal/latest -# -# steps: -# - uses: actions/checkout@v3 -# -# - name: Cache R dependencies -# uses: actions/cache@v2 -# with: -# path: ~/R/x86_64-pc-linux-gnu-library/ -# key: ${{ runner.os }}-r-${{ matrix.r-version }}-${{ hashFiles('**/renv.lock') }} -# restore-keys: ${{ runner.os }}-r-${{ matrix.r-version }}- -# -# - name: Setup R -# uses: r-lib/actions/setup-r@v2 -# with: -# use-public-rspm: true -# r-version: ${{ matrix.r-version }} -# -# - name: Restore R environment -# run: | -# if (!requireNamespace('renv', 
quietly = TRUE)) install.packages('renv') -# renv::restore() -# shell: Rscript {0} -# -# - name: Install additional R packages -# env: -# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # Use the default GitHub token for authentication -# run: | -# install.packages(c("knitr", "rmarkdown", "lsa", "purrr", "ggplot2")) -# install.packages('remotes') -# remotes::install_github("davidycliao/flaiR", auth_token = Sys.getenv("GITHUB_TOKEN"), force = TRUE) -# shell: Rscript {0} -# -# - name: Set up Python -# uses: actions/setup-python@v2 -# with: -# python-version: '3.10.x' -# -# - name: Install Python virtualenv -# run: pip install virtualenv -# -# - name: Create Python virtual environment -# run: virtualenv flair_env -# -# - name: Install Python dependencies in virtual environment -# run: | -# source flair_env/bin/activate -# pip install --upgrade pip -# pip install scipy==1.12.0 -# pip install flair -# pip install gensim -# -# - name: Remove Python cache files -# run: find . -name '*.pyc' -delete -# -# - name: Check R environment status -# run: renv::status() -# shell: Rscript {0} -# -# # - name: Check R package (with virtual environment) -# # run: | -# # source flair_env/bin/activate -# # R CMD build . --no-build-vignettes -# # R CMD check *tar.gz --no-build-vignettes --no-manual --no-tests --no-examples -# # shell: bash name: R-ubuntu diff --git a/R/predict_label.R b/R/predict_label.R index a8108ae1..d28fe799 100644 --- a/R/predict_label.R +++ b/R/predict_label.R @@ -1,88 +1,96 @@ #' Predict Text Label Using Flair Classifier #' -#' This function predicts the label of input text using a Flair classifier, -#' with options for confidence thresholding to adjust predictions to NEUTRAL. -#' #' @param text A character string containing the text to be labeled #' @param classifier A Flair TextClassifier object for making predictions -#' @param threshold_score Numeric value between 0 and 1 representing the confidence -#' threshold for label classification. 
Defaults to 0.5 if not specified -#' @param threshold Logical indicating whether to apply the threshold_score to -#' adjust predictions to NEUTRAL. Defaults to FALSE +#' @param sentence Optional Flair Sentence object. If NULL, one will be created from text #' -#' @return A list containing the following elements: +#' @return A list containing: #' \describe{ -#' \item{label}{Character string of the final predicted label (AGAINST/FAVOR/NEUTRAL)} -#' \item{score}{Numeric confidence score from the classifier} -#' \item{token_number}{Integer count of tokens in the input text} -#' \item{threshold_score}{Numeric value of the threshold used} -#' \item{original_label}{Character string of the classifier's original prediction -#' before thresholding} +#' \item{label}{Character string of predicted label} +#' \item{score}{Numeric confidence score from classifier} +#' \item{token_number}{Integer count of tokens in input text} #' } #' #' @examples #' \dontrun{ -#' # Load a pre-trained classifier -#' classifier <- flair$models$TextClassifier$load('stance-classifier') -#' -#' # Predict label without thresholding +#' # Example 1: Using text input +#' classifier <- flair_models()$TextClassifier$load('stance-classifier') #' result1 <- predict_label( #' text = "I strongly support this policy", #' classifier = classifier #' ) #' -#' # Predict with custom threshold +#' # Example 2: Using pre-created and tagged sentence +#' sent <- Sentence("I love Berlin and New York.") +#' tagger <- flair_models()$SequenceTagger$load('pos') +#' tagger$predict(sent) +#' print(sent) # Shows tokens with POS tags +#' #' result2 <- predict_label( -#' text = "I somewhat agree with the proposal", +#' text = NULL, #' classifier = classifier, -#' threshold_score = 0.7, -#' threshold = TRUE +#' sentence = sent #' ) #' } #' -#' @details -#' The function will throw an error if the classifier is NULL or not a -#' Flair TextClassifier. 
-#' #' @import flaiR #' @export -predict_label <- function(text, classifier, threshold_score = NULL, threshold = FALSE) { - # Check if classifier is provided and valid +predict_label <- function(text, classifier, sentence = NULL) { + + # Check if classifier is valid if (is.null(classifier) || !isTRUE(class(classifier)[1] == "flair.models.text_classification_model.TextClassifier")) { stop("Invalid or missing classifier. Please provide a pre-trained Flair TextClassifier model.") } - # Create a sentence object - sentence <- Sentence(text) + # Check if Sentence exists and is correctly loaded + if (!("python.builtin.type" %in% class(Sentence))) { + stop("Sentence class not found or not properly loaded. Please ensure flaiR is properly loaded.") + } - # Use the classifier to predict - classifier$predict(sentence) + # Check if either text or sentence is provided + if (is.null(text) && is.null(sentence)) { + stop("Either text or sentence must be provided") + } - # Get the predicted label and score - predicted_label <- sentence$labels[[1]]$value - score <- sentence$labels[[1]]$score # 移除 as.numeric - token_number <- length(sentence$tokens) + # Create or validate sentence + if (is.null(sentence)) { + tryCatch({ + sentence <- Sentence(text) + }, error = function(e) { + stop("Failed to create Sentence object: ", e$message) + }) + } else { + # Enhanced sentence validation + if (!inherits(sentence, "flair.data.Sentence")) { + stop("Invalid sentence object. 
Must be a Flair Sentence instance.") + } - # Set default threshold_score if NULL - if (is.null(threshold_score)) { - threshold_score <- 0.5 + if (!("tokens" %in% names(sentence)) || length(sentence$tokens) == 0) { + stop("Invalid sentence object: No tokens found.") + } } - # Modify label based on the score threshold and original label - original_label <- predicted_label - if (threshold && score < threshold_score) { - if (predicted_label %in% c("AGAINST", "FAVOR")) { - predicted_label <- "NEUTRAL" - } + # Use the classifier to predict + tryCatch({ + classifier$predict(sentence) + }, error = function(e) { + stop("Prediction failed: ", e$message) + }) + + # Verify prediction results + if (length(sentence$labels) == 0) { + stop("No prediction labels generated") } - # Construct the prediction result - prediction = list(label = predicted_label, - score = score, - token_number = token_number, - threshold_score = threshold_score, - original_label = original_label) + # Get prediction details + predicted_label <- sentence$labels[[1]]$value + score <- sentence$labels[[1]]$score + token_number <- length(sentence$tokens) - # Return the prediction result - return(prediction) + # Return results + return(list( + label = predicted_label, + score = score, + token_number = token_number + )) } diff --git a/man/predict_label.Rd b/man/predict_label.Rd index 76632910..0acd9b57 100644 --- a/man/predict_label.Rd +++ b/man/predict_label.Rd @@ -4,55 +4,45 @@ \alias{predict_label} \title{Predict Text Label Using Flair Classifier} \usage{ -predict_label(text, classifier, threshold_score = NULL, threshold = FALSE) +predict_label(text, classifier, sentence = NULL) } \arguments{ \item{text}{A character string containing the text to be labeled} \item{classifier}{A Flair TextClassifier object for making predictions} -\item{threshold_score}{Numeric value between 0 and 1 representing the confidence -threshold for label classification. 
Defaults to 0.5 if not specified} - -\item{threshold}{Logical indicating whether to apply the threshold_score to -adjust predictions to NEUTRAL. Defaults to FALSE} +\item{sentence}{Optional Flair Sentence object. If NULL, one will be created from text} } \value{ -A list containing the following elements: +A list containing: \describe{ -\item{label}{Character string of the final predicted label (AGAINST/FAVOR/NEUTRAL)} -\item{score}{Numeric confidence score from the classifier} -\item{token_number}{Integer count of tokens in the input text} -\item{threshold_score}{Numeric value of the threshold used} -\item{original_label}{Character string of the classifier's original prediction -before thresholding} +\item{label}{Character string of predicted label} +\item{score}{Numeric confidence score from classifier} +\item{token_number}{Integer count of tokens in input text} } } \description{ -This function predicts the label of input text using a Flair classifier, -with options for confidence thresholding to adjust predictions to NEUTRAL. -} -\details{ -The function will throw an error if the classifier is NULL or not a -Flair TextClassifier. 
+Predict Text Label Using Flair Classifier
}
\examples{
\dontrun{
-# Load a pre-trained classifier
-classifier <- flair$models$TextClassifier$load('stance-classifier')
-
-# Predict label without thresholding
+# Example 1: Using text input
+classifier <- flair_models()$TextClassifier$load('stance-classifier')
result1 <- predict_label(
  text = "I strongly support this policy",
  classifier = classifier
)

-# Predict with custom threshold
+# Example 2: Using pre-created and tagged sentence
+sent <- Sentence("I love Berlin and New York.")
+tagger <- flair_models()$SequenceTagger$load('pos')
+tagger$predict(sent)
+print(sent) # Shows tokens with POS tags
+
result2 <- predict_label(
-  text = "I somewhat agree with the proposal",
+  text = NULL,
  classifier = classifier,
-  threshold_score = 0.7,
-  threshold = TRUE
+  sentence = sent
)
}

diff --git a/vignettes/tutorial.Rmd b/vignettes/tutorial.Rmd
index f7ef0591..f4bdc6d5 100644
--- a/vignettes/tutorial.Rmd
+++ b/vignettes/tutorial.Rmd
@@ -1364,7 +1364,6 @@ old_text <- map(cc_muller_old$text, Sentence)
old_labels <- as.character(cc_muller_old$class)

old_text <- map2(old_text, old_labels, ~ {
-
  .x$add_label("classification", .y)
  .x
})
@@ -1508,7 +1507,51 @@ After fine-tuning for 1 epoch, the model showed improved performance on the same
More R tutorial and documentation see [here](https://github.com/davidycliao/flaiR).

+## Using Your Own Fine-tuned Model in flaiR

This section demonstrates how to use your custom fine-tuned model in flaiR for text classification tasks. Let's explore this process step by step.

__Setting Up Your Environment__

First, we need to load the flaiR package and prepare our model:

```{r}
library(flaiR)
classifier <- flair_models()$TextClassifier$load('vignettes/inst/new-muller-campaign-communication/best-model.pt')
```

It's important to verify your model's compatibility with `$model_card`.
You can check this by examining the version requirements:

```{r}
print(classifier$model_card)
```

```{r}
# Check required versions
print(classifier$model_card$transformers_version) # Required transformers version
print(classifier$model_card$flair_version)        # Required Flair version
```

__Making Predictions__

To make predictions, we first need to prepare our text by creating a Sentence object. This is a key component in Flair's architecture that handles text processing:

```{r}
# Get the Sentence class from flaiR
Sentence <- flair_data()$Sentence

# Create a Sentence object with your text
sentence <- Sentence("And to boost the housing we need, we will start to build a new generation of garden cities.")

# Make prediction
classifier$predict(sentence)

# Access prediction results
prediction <- sentence$labels[[1]]$value # Get predicted label
confidence <- sentence$labels[[1]]$score # Get confidence score
```
</br>
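Finally, the refactored `predict_label()` from this patch can wrap the steps above. Both call patterns below follow the function's new signature; the example text is illustrative:

```{r}
library(flaiR)
# `classifier` is the fine-tuned TextClassifier loaded earlier in this section

# Pattern 1: pass raw text and let predict_label() build the Sentence
res <- predict_label(
  text = "We will invest in a new generation of garden cities.",
  classifier = classifier
)
print(res$label)        # predicted label
print(res$score)        # confidence score
print(res$token_number) # token count

# Pattern 2: pass a pre-built (or pre-tagged) Sentence instead of raw text
Sentence <- flair_data()$Sentence
sent <- Sentence("We will invest in a new generation of garden cities.")
res2 <- predict_label(text = NULL, classifier = classifier, sentence = sent)
```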