Pegasus Paraphraser is a text paraphrasing system built on the tuner007/pegasus_paraphrase model. It splits input text into sentences and uses the model's encoder-decoder architecture to generate simplified versions of the text.


Pegasus-Paraphraser

The process of paraphrasing text using a model like tuner007/pegasus_paraphrase involves several stages, from tokenization to text generation and decoding. Below is a detailed explanation of the algorithm and the underlying mathematical and technical processes involved:

1. Tokenization

Objective: Convert input text into a format suitable for the model.

Steps:

  • Input Text: X = {x1, x2, ..., xn}
  • Tokenization: The tokenizer maps the input text to a sequence of subword tokens from its vocabulary (a minimal sketch follows this list).
    • Token Sequence: T = {t1, t2, ..., tm}
    • Tokenizer Mapping: T = Tokenizer(X), where each word xi may correspond to one or more subword tokens
  • Padding and Truncation: Pad or truncate so that all token sequences in a batch have the same length.
  • Conversion to Tensors: Convert the token IDs into PyTorch tensors.
    • Tensor Representation: T = torch.tensor(T)
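
A minimal tokenization sketch, assuming the Hugging Face transformers library and PyTorch are installed; the example sentence and variable names (input_text, batch) are illustrative and not taken from the repository:

  from transformers import PegasusTokenizer

  # Load the SentencePiece tokenizer that ships with the paraphrase checkpoint
  tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")

  input_text = "The cat sat quietly on the warm windowsill."

  # padding='longest' pads all sequences in the batch to the longest one,
  # truncation=True drops anything beyond the model's maximum input length,
  # return_tensors='pt' returns PyTorch tensors (input_ids, attention_mask)
  batch = tokenizer([input_text], padding="longest", truncation=True, return_tensors="pt")
  print(batch["input_ids"])   # tensor of token IDs with shape (1, sequence_length)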

2. Text Generation Using the Model

Objective: Generate paraphrased text from the tokenized input.

Steps:

  • Encoder-Decoder Architecture: The Pegasus model uses an encoder-decoder structure based on the Transformer architecture.
  • Encoder: Encodes the input token sequence into a sequence of hidden states.
    • Encoder Output: H = Encoder(T)
  • Decoder: Generates the output sequence autoregressively from the encoded hidden states (a minimal sketch follows this list).
    • Decoder Step: At each step, the decoder attends to the hidden states H and the previously generated tokens to produce the next token.
    • Output Tokens: O = {o1, o2, ..., ok}
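
A minimal sketch of the two stages, assuming the model and tokenizer load successfully; max_length=60 and num_beams=5 are illustrative values, not settings taken from the repository. model.generate runs the encoder internally; get_encoder() is used here only to make the hidden states H visible:

  import torch
  from transformers import PegasusForConditionalGeneration, PegasusTokenizer

  model_name = "tuner007/pegasus_paraphrase"
  tokenizer = PegasusTokenizer.from_pretrained(model_name)
  model = PegasusForConditionalGeneration.from_pretrained(model_name)

  batch = tokenizer(["The cat sat quietly on the warm windowsill."],
                    truncation=True, padding="longest", return_tensors="pt")

  # Encoder: map the token IDs to a sequence of hidden states H
  with torch.no_grad():
      H = model.get_encoder()(batch["input_ids"]).last_hidden_state
  print(H.shape)   # (batch_size, sequence_length, hidden_size)

  # Decoder: generate() re-encodes the input and decodes autoregressively,
  # returning the output token IDs O
  O = model.generate(**batch, max_length=60, num_beams=5, num_return_sequences=1)
  print(O.shape)   # (num_return_sequences, generated_length)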

Mathematical Details:

  1. Self-Attention Mechanism (a minimal PyTorch sketch follows this list):
    • Query, Key, Value Matrices: Q, K, V
    • Attention Scores: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    • Multi-Head Attention: Combines multiple attention heads to capture different aspects of the input.
  2. Feed-Forward Networks:
    • Linear Transformations and Activation Functions: Applied to the outputs of the attention layers.
    • Layer Normalization and Residual Connections: Ensure stability and efficient gradient propagation.
  3. Beam Search:
    • Objective: Find the most probable sequence of tokens.
    • Score Calculation: Accumulate the log-probabilities of candidate token sequences.
    • Selection: Keep the top num_beams sequences at each step.
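
To make the attention formula concrete, here is a small PyTorch sketch of scaled dot-product attention; it illustrates the equation above, not the Pegasus implementation itself:

  import math
  import torch

  def scaled_dot_product_attention(Q, K, V):
      # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
      d_k = Q.size(-1)
      scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # query-key similarities
      weights = torch.softmax(scores, dim=-1)             # each row sums to 1
      return weights @ V                                   # weighted sum of value vectors

  # Toy example: 4 positions, head dimension 8
  Q, K, V = (torch.randn(4, 8) for _ in range(3))
  print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([4, 8])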

3. Decoding and Cleanup

Objective: Convert the generated token sequence back to human-readable text.

Steps:

  • Token Decoding: Convert the generated token IDs back to text (a short sketch follows this list).
    • Tokenizer Mapping: Y = tokenizer.decode(O)
  • Join Tokens: The decoder output is a sequence of subword tokens; decoding merges them into the final paraphrased sentence.
  • Special Tokens Handling: Remove any special tokens (such as padding and end-of-sequence markers) used during tokenization and generation.
    • Skip Special Tokens: final_output = tokenizer.decode(O, skip_special_tokens=True)
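
A short decoding sketch, assuming O holds the token IDs returned by model.generate and tokenizer is the PegasusTokenizer loaded earlier; batch_decode simply applies decode to every generated sequence:

  # skip_special_tokens=True removes padding and end-of-sequence markers
  paraphrases = tokenizer.batch_decode(O, skip_special_tokens=True)
  final_output = paraphrases[0]   # one clean, human-readable string per returned sequence
  print(final_output)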

Detailed Steps of the Algorithm (a complete end-to-end sketch follows these steps):

  1. Tokenization:

    • Given input text X, apply the tokenizer:
      T = tokenizer.encode(X, padding='longest', truncation=True, return_tensors='pt')
      
  2. Model Forward Pass:

    • Pass the tokenized input through the model’s encoder to obtain the hidden states H (generate does this internally; the encoder can also be called directly for inspection):

      H = model.get_encoder()(T).last_hidden_state
      
    • Use the decoder to generate the output tokens O. generate takes the token IDs T, runs the encoder internally, and decodes with beam search; L is the chosen maximum output length, and temperature only takes effect when sampling is enabled:

      O = model.generate(T, max_length=L, num_beams=5, num_return_sequences=1, temperature=1.5)
      
  3. Decoding:

    • Decode the generated tokens back to text:
      paraphrased_texts = [tokenizer.decode(o, skip_special_tokens=True) for o in O]
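
Putting the three steps together, a minimal end-to-end sketch; the function name get_paraphrase and the max_length, num_beams, and device choices are illustrative, not taken from the repository:

  import torch
  from transformers import PegasusForConditionalGeneration, PegasusTokenizer

  model_name = "tuner007/pegasus_paraphrase"
  device = "cuda" if torch.cuda.is_available() else "cpu"
  tokenizer = PegasusTokenizer.from_pretrained(model_name)
  model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

  def get_paraphrase(text, num_return_sequences=1, num_beams=5, max_length=60):
      # 1. Tokenization
      batch = tokenizer([text], truncation=True, padding="longest",
                        max_length=max_length, return_tensors="pt").to(device)
      # 2. Model forward pass: encoder plus beam-search decoding
      output_ids = model.generate(**batch, max_length=max_length,
                                  num_beams=num_beams,
                                  num_return_sequences=num_return_sequences)
      # 3. Decoding and cleanup
      return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

  print(get_paraphrase("The quick brown fox jumps over the lazy dog."))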
      

Example Paraphrasing Process (a short reproduction sketch follows at the end of this example):

Given the input text:

The process of photosynthesis is how plants convert light energy into chemical energy, storing it in the bonds of sugar molecules. This complex process involves multiple steps including light absorption, electron transport, and the Calvin cycle.

Tokenization:

  • Tokenized input might look like:
    [234, 565, 98, 2354, ...]
    

Generation:

  • Using the encoder-decoder mechanism, the model generates a new sequence of tokens:
    [321, 98, 467, 1234, ...]
    

Decoding:

  • The generated tokens are converted back to text, resulting in the paraphrased sentence:
    Photosynthesis is the process by which plants transform light energy into chemical energy, storing it in sugar molecules. This intricate process includes several stages such as light absorption, electron transport, and the Calvin cycle.
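
Using the get_paraphrase sketch from the previous section, the example can be reproduced roughly as follows; the exact token IDs and wording will vary with the SentencePiece vocabulary and the beam-search settings:

  text = ("The process of photosynthesis is how plants convert light energy into "
          "chemical energy, storing it in the bonds of sugar molecules.")

  # Inspect the token IDs the tokenizer actually produces for this sentence
  print(tokenizer(text)["input_ids"][:10])

  # Generate and print one paraphrase of the sentence
  print(get_paraphrase(text)[0])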
    

Conclusion:

The tuner007/pegasus_paraphrase model leverages the Pegasus architecture, which is based on the Transformer model with encoder-decoder mechanisms. By tokenizing the input, encoding it into hidden states, generating new token sequences using the decoder, and finally decoding these tokens back to text, the model can paraphrase complex sentences while maintaining their original meaning. This process involves sophisticated attention mechanisms, feed-forward networks, and beam search techniques to ensure high-quality paraphrased output.
