
Random Walk Perturbation using BERT #275

Open
wants to merge 25 commits into main

Conversation

sajantanand

We use masked token prediction to perform a random walk on a sentence; i.e., we choose a word and mask it out, relying on a language model like BERT to determine what word should fill the blank. By repeating this several times, we can reach a new sentence that is perturbed from the original one.

@sajantanand
Author

sajantanand commented Sep 1, 2021

Pytest currently fails, but I am not sure why. I can generate the JSON file using the provided code.

old_sentences = copy.deepcopy(sentences)
assert len(sentences) == k**steps
return sentences


I think that this could be refactored a bit for clarity.

import torch
import torch.nn.functional as F
import copy
import re
import numpy as np
import random
from typing import List


def _mask_word(sentence, word_to_mask, tokenizer):
    """ helper function,
    replace word in a sentence with mask-token, as prep for BERT tokenizer"""
    start_index = sentence.find(word_to_mask)
    return sentence[0:start_index] + tokenizer.mask_token + sentence[
        start_index + len(word_to_mask):]


def get_k_replacement_words(tokenized_text, model, tokenizer, k=5):
    """return k most similar words from the model, for a tokenized mask-word in a sentence.

    Args:
        tokenized_text (str): sentence with a word masked out
        model ([type]): model
        tokenizer ([type]): tokenizer
        k (int, optional): how many similar words to find for a given tokenized-word. Defaults to 5.

    Returns:
        [list]: list of top k words
    """
    inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt')
    index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)
    outputs = model(**inputs)
    softmax = F.softmax(outputs.logits, dim=-1)
    mask_word = softmax[0, index_to_mask, :]
    return torch.topk(mask_word, k)[1][0]


def single_sentence_random_step(sentence, tokenizer, model, k=5):
    """For a given sentence, choose a random word to mask, and 
    replace it with a word the top-k most similar words in BERT model.
    Return k sentences, each with a different replacement word for the mask.

    Args:
        sentence ([type]): sentence to perform random walk on
        tokenizer ([type]): tokenizer
        model ([type]): model
        k (int, optional): how many replacement words to try. Defaults to 5.

    Returns:
        [list]: k-sentences with masked word replaced with top-k most similar words
    """
    text_split = re.split('[ ?.,!;"]', sentence)

    # pick a random word to mask
    word_to_mask = random.choice(text_split)
    while len(word_to_mask) == 0:
        word_to_mask = random.choice(text_split)
    # mask word
    new_text = _mask_word(sentence, word_to_mask, tokenizer)

    # get k replacement words

    top_k = get_k_replacement_words(new_text, model, tokenizer, k=k)

    # replace mask-token with the word from the top-k replacements
    return [
        new_text.replace(tokenizer.mask_token, tokenizer.decode([token]))
        for token in top_k
    ]


def single_round(sentences: List[str], tokenizer, model,
                 k: int = 5) -> List[str]:
    """For a given list of sentences, do a random walk step on each sentence.

    Args:
        sentences ([type]): list of sentences to perform random walk on
        tokenizer ([type]): tokenizer
        model ([type]): model
        k (int, optional): how many replacement words to try per sentence. Defaults to 5.

    Returns:
        [List]: list of random-walked sentences
    """
    new_sentences = []

    for sentence in sentences:
        new_sentences.extend(
            single_sentence_random_step(sentence, tokenizer, model, k=k))

    return new_sentences


def random_walk(original_text: str, steps: int, k: int, tokenizer,
                model) -> List[str]:
    """Run `steps` rounds of the random walk, expanding each sentence into
    k perturbed sentences per round."""
    sentences = [original_text]

    # Do `steps` rounds of the random walk procedure
    for _ in range(steps):
        sentences = single_round(sentences, tokenizer, model, k=k)

    assert len(sentences) == k**steps
    return sentences
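
For reference, here is a minimal usage sketch of the functions above, assuming the standard Hugging Face transformers masked-LM API; the model name and example sentence are illustrative choices, not part of the PR:

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Two rounds of the random walk with k=2 replacements per sentence
# should yield 2**2 = 4 perturbed sentences.
perturbed = random_walk("The quick brown fox jumps over the lazy dog.",
                        steps=2, k=2, tokenizer=tokenizer, model=model)
print(perturbed)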

Author

I refactored the code as suggested.

@sajantanand
Author

sajantanand commented Sep 4, 2021

@kaustubhdhole can you explain how the pytest run by GitHub (pytest -s --t=light --f=light) works? I can pass the pytest locally using the command in the README (pytest -s --t=random_walk), but the one run by GitHub, which includes the "light" filter, does not succeed.

"""
inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt', truncation=True, max_length = 512)
index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)
if index_to_mask[0].numel() == 0: # Since we are truncating the input to be 512 tokens (BERT's max), we need to make sure the mask is in these first 512.

nitpick: don't let text go over 80 characters in width (break comments into several lines, and run yapf or black)

Returns:
[list]: k-sentences with masked word replaced with top-k most similar words
"""
#text_split = re.split('[ ?.,!;"]', sentence)

nit: remove comment.

TaskType.TEXT_CLASSIFICATION
]
languages = ["en"]


don't forget to include keywords: https://github.com/GEM-benchmark/NL-Augmenter/blob/main/docs/keywords.md
like:
keywords = [ "model-based", "lexical", "possible-meaning-alteration", "high-coverage", "high-generations" ]

@AbinayaM02
Collaborator

AbinayaM02 commented Sep 14, 2021

> @kaustubhdhole can you explain how the pytest run by GitHub (pytest -s --t=light --f=light) works? I can pass the pytest locally using the command in the README (pytest -s --t=random_walk), but the one run by GitHub, which includes the "light" filter, does not succeed.

Hi @sajantanand: We're running only the test cases for light transformations in the GitHub Actions. Since yours is a heavy transformation, you need to run the test locally and check if it passes.

Edit: I think you haven't set the heavy flag to True in your transformation, so it is considered light.
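
For reference, a minimal sketch of where that flag lives, following the usual NL-Augmenter transformation layout; the class name here is hypothetical and the exact attribute names should be checked against the repository's interface:

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


class RandomWalkPerturbation(SentenceOperation):
    tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
    languages = ["en"]
    # Marking the transformation as heavy excludes it from the "light"-only CI run.
    heavy = True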

Contributor

@juand-r left a comment

Hi @sajantanand,
I was assigned to review your transformation. I think this should be accepted, but have a few comments and suggestions below.

Correctness: The code passed all the tests.
Interface: Looks good.
Applicable Tasks and Keywords: Languages: good. Tasks: I'm not sure TEXT_TO_TEXT_GENERATION would really work correctly; see my comments below. Keywords: please add!
Specificity: Not specific to a particular task. This could be used for data augmentation; however, it should be noted it is low-precision, and generations are sometimes unnatural.
Novelty: This is not already implemented in NL-Augmenter.
Adding new libraries: sentence-transformers and transformers are specified in requirements.txt (with versions).
Description: The README looks good. (But if there are any publications using this kind of transformation, please add them.)
Data and code source: You could rename the "Extras" section of the README to "Data and code provenance" and add license information for the models.
Paraphrasers and data augmenters: while this is low-precision, I still think it is an interesting transformation for data augmentation.
Test cases: 5 cases, good.
Evaluating robustness: not present.
Languages: English only. This could be expanded to other languages in the future.
Documentation: most functions have docstrings, and the ones that don't are short and easy to read.

One suggestion: depending on how this is to be used, it might be helpful to have the option to not change named entities. Actually, as this currently stands, wouldn't there be possible issues when using this for some text to text task such as summarization? (e.g., couldn't it modify some entity in the source but not in the reference, or modify them in different ways?)

@sajantanand
Author

I have addressed Roy's comments, added keywords, added the heavy flag, and am currently running the evaluation scripts. These scripts take quite a while with the default parameters, as 32 sentences are generated for each input. If the scripts finish running on Colab, I'll update the README with the results. Unfortunately, I don't have the compute resources to run this locally.

Now I'll address @juand-r's helpful comments.

  • Tasks: I am not sure any task type other than Task.TEXT_TO_TEXT_GENERATION makes sense, as the transformation generates a new sentence with possibly altered meaning. I agree that the model can generate new text that isn't consistent with the input text for the reason you mentioned: a single word is selected and replaced, so if a named entity appears more than once, only one occurrence would be changed. One approach we have to deal with this is to set a high sim_req when initializing the class, requiring generated sentences to be similar in meaning to the original sentence (a sketch of such a filter appears at the end of this comment).
  • Description: I don't know of any publications that do similar transformations, so I've instead linked some other instances of random walks in NLP.
  • Data and code source: I've made this change.

As to your suggestion, I don't know of any way to exclude named entities. That being said, I am not very experienced in NLP work, so I'm open to any implementation suggestions.
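
A minimal sketch of the kind of similarity filter sim_req could gate on, using sentence-transformers (already listed in requirements.txt); the encoder choice and threshold are illustrative assumptions, not the transformation's actual defaults:

from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; the transformation may use a different model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def passes_similarity(original: str, candidate: str, sim_req: float = 0.8) -> bool:
    """Keep a generated sentence only if its cosine similarity to the
    original sentence meets the sim_req threshold."""
    embeddings = encoder.encode([original, candidate], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= sim_req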

@juand-r
Contributor

juand-r commented Sep 21, 2021

Thanks for adding the references on random walks for sentence similarity.

I thought the list of tasks specifies the kinds of tasks this transformation could be used for in practice? So for my comment regarding Task.TEXT_TO_TEXT_GENERATION -- say you have a summarization task, with a dataset of text pairs (source text, reference summary). If the transformation changes named entities which appear in both the source and the reference summary, but in an inconsistent way, this would introduce noise if used as data augmentation (and it would not make a reliable evaluation dataset either). This would probably not be as much of an issue with TEXT_CLASSIFICATION (although I suppose it could sometimes be; e.g., say the task is sentiment classification and you accidentally swap the sentiment by modifying words like "good" or "great" to "bad").

You can find a list of named entities (and their character or token offsets) with an off-the-shelf package like spacy. Then you would only need to keep track of which words to exclude when looking for replacements.
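
A minimal sketch of that idea, assuming spacy's small English pipeline; the helper name and how it plugs into the transformation are illustrative:

import random
import re

import spacy

# Illustrative pipeline choice; any spaCy model with an NER component would work.
nlp = spacy.load("en_core_web_sm")


def pick_maskable_word(sentence: str) -> str:
    """Choose a random word to mask, skipping tokens that belong to named entities."""
    entity_tokens = {token.text for ent in nlp(sentence).ents for token in ent}
    candidates = [
        word for word in re.split('[ ?.,!;"]', sentence)
        if word and word not in entity_tokens
    ]
    return random.choice(candidates)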

@sajantanand
Author

sajantanand commented Oct 5, 2021

@juand-r Sorry for the long delay in responding. I added an option to exclude named entities using spacy. I also added functionality so that long sentences (over BERT's 512-token limit) are split into chunks that are then randomly perturbed. Hopefully this PR can get a second review and be merged soon.
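
A minimal sketch of the chunking idea described above, assuming a greedy whitespace split under a fixed token budget; the helper name and budget are illustrative, not the PR's actual implementation:

from typing import List


def split_into_chunks(text: str, tokenizer, max_tokens: int = 512) -> List[str]:
    """Greedily group words so each chunk stays within the tokenizer's maximum
    sequence length; each chunk can then be perturbed independently."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks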

@sajantanand
Author

@kaustubhdhole @vgtomahawk any chance this PR can get a second review? I added rudimentary evaluation results from Colab.

@sajantanand
Author

@kaustubhdhole @mille-s One last bump to try and get a second review. Thanks for doing so much work organizing this!

@james-simon
Contributor

@kaustubhdhole @mille-s This transformation has one accepting review, and as far as we can tell, the only reason it wasn't merged was that the second reviewer never reviewed it. We think this is an interesting transformation. We understand it's late in the project, but for the sake of closure, could you either give this a second review or definitively reject it?

Thanks!

@mille-s
Contributor

mille-s commented Nov 17, 2021

@james-simon @sajantanand Sorry, very busy times! It looks like this transformation is part of a small group that escaped our radar, really sorry about that. At this point we have carried out the analysis and plan to finalise the paper soon, so it is a bit difficult to add new things, I'm afraid. I think that at this point the best option is to merge it later (we're actually running a GEM hackathon these days to add a lot of transfos/filters within the next weeks); since you have other accepted perturbations and are thus co-authoring the paper, I hope this is not too much of an inconvenience. @kaustubhdhole does this sound good?

@sajantanand
Author

@mille-s Sounds good to me! Let me know if there are any changes that need to be made before this is eventually merged.
