
Merge pull request #12 from ionite34/dev
Fixes and performance improvements
ionite34 authored May 18, 2022
2 parents 26ca087 + 4edefd6 commit ed180c4
Showing 19 changed files with 228 additions and 316 deletions.
4 changes: 1 addition & 3 deletions .github/codecov.yml
@@ -23,6 +23,4 @@ comment:

ignore:
- "tests/**"
- "test_*.py"
- "**/__main__.py"
- "**/__init__.py"
- "test_*.py"
63 changes: 47 additions & 16 deletions README.md
@@ -10,19 +10,50 @@

### Augmented Recurrent Neural G2P with Inflectional Orthography

Grapheme-to-phoneme (G2P) conversion is the process of converting the written form of words (Graphemes) to their
pronunciations (Phonemes). Deep learning models for text-to-speech (TTS) synthesis using phoneme / mixed symbols
typically require a G2P conversion method for both training and inference.

Aquila Resolve presents a new approach for accurate and efficient English G2P resolution.
Input text graphemes are translated into their phonetic pronunciations,
using [ARPAbet](https://wikipedia.org/wiki/ARPABET) as the [phoneme symbol set](#Symbol-Set).
Aquila Resolve presents a new approach for accurate and efficient English to
[ARPAbet](https://wikipedia.org/wiki/ARPABET) G2P resolution.
The pipeline employs a context layer, multiple transformer and n-gram morpho-orthographical search layers,
and an autoregressive recurrent neural transformer base.

The current implementation offers state-of-the-art accuracy for out-of-vocabulary (OOV) words, as well as contextual
and an autoregressive recurrent neural transformer base. The current implementation offers state-of-the-art accuracy for out-of-vocabulary (OOV) words, as well as contextual
analysis for correct inferencing of [English Heteronyms](https://en.wikipedia.org/wiki/Heteronym_(linguistics)).

The package is offered in a pre-trained state that is ready for use as a dependency or in
notebook environments. There are no additional resources needed, other than the model checkpoint which is
automatically downloaded on the first usage. See [Installation](#Installation) for more information.
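
A minimal setup sketch for the examples below (assumes the package is installed and the checkpoint can be fetched on first use; `G2p` is re-exported from the package root):

```python
from Aquila_Resolve import G2p

g2p = G2p()  # the model checkpoint is downloaded on first instantiation if not already present
```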

### 1. Dynamic Word Mappings based on context:

```pycon
g2p.convert('I read the book, did you read it?')
# >> '{AY1} {R EH1 D} {DH AH0} {B UH1 K}, {D IH1 D} {Y UW1} {R IY1 D} {IH1 T}?'
```
```pycon
g2p.convert('The researcher was to subject the subject to a test.')
# >> '{DH AH0} {R IY1 S ER0 CH ER0} {W AA1 Z} {T UW1} {S AH0 B JH EH1 K T} {DH AH0} {S AH1 B JH IH0 K T} {T UW1} {AH0} {T EH1 S T}.'
```

| | 'The subject was told to read. Eight records were read in total.' |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| *Ground Truth* | The `S AH1 B JH IH0 K T` was told to `R IY1 D`. Eight `R EH1 K ER0 D Z` were `R EH1 D` in total. |
| Aquila Resolve | The `S AH1 B JH IH0 K T` was told to `R IY1 D`. Eight `R EH1 K ER0 D Z` were `R EH1 D` in total. |
| [Deep Phonemizer](https://github.com/as-ideas/DeepPhonemizer)<br/>([en_us_cmudict_forward.pt](https://github.com/as-ideas/DeepPhonemizer#pretrained-models)) | The **S AH B JH EH K T** was told to **R EH D**. Eight **R AH K AO R D Z** were `R EH D` in total. |
| [CMUSphinx Seq2Seq](https://github.com/cmusphinx/g2p-seq2seq)<br/>([checkpoint](https://github.com/cmusphinx/g2p-seq2seq#running-g2p)) | The `S AH1 B JH IH0 K T` was told to `R IY1 D`. Eight **R IH0 K AO1 R D Z** were **R IY1 D** in total. |
| [ESpeakNG](https://github.com/espeak-ng/espeak-ng) <br/> (with [phonecodes](https://github.com/jhasegaw/phonecodes)) | The **S AH1 B JH EH K T** was told to `R IY1 D`. Eight `R EH1 K ER0 D Z` were **R IY1 D** in total. |

### 2. Leading Accuracy for unseen words:

```pycon
g2p.convert('Did you kalpe the Hevinet?')
# >> (predicted ARPAbet phonemes for the out-of-vocabulary words 'kalpe' and 'Hevinet')
```

| | "tensorflow" | "agglomerative" | "necrophages" |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|------------------------------------|----------------------------------|
| Aquila Resolve | `T EH1 N S ER0 F L OW2` | `AH0 G L AA1 M ER0 EY2 T IH0 V` | `N EH1 K R OW0 F EY2 JH IH0 Z` |
| [Deep Phonemizer](https://github.com/as-ideas/DeepPhonemizer)<br/>([en_us_cmudict_forward.pt](https://github.com/as-ideas/DeepPhonemizer#pretrained-models)) | `T EH N S ER F L OW` | **AH G L AA M ER AH T IH V** | `N EH K R OW F EY JH IH Z` |
| [CMUSphinx Seq2Seq](https://github.com/cmusphinx/g2p-seq2seq)<br/>([checkpoint](https://github.com/cmusphinx/g2p-seq2seq#running-g2p)) | **T EH1 N S ER0 L OW0 F** | **AH0 G L AA1 M ER0 T IH0 V** | **N AE1 K R AH0 F IH0 JH IH0 Z** |
| [ESpeakNG](https://github.com/espeak-ng/espeak-ng) <br/> (with [phonecodes](https://github.com/jhasegaw/phonecodes)) | **T EH1 N S OW0 R F L OW2** | **AA G L AA1 M ER0 R AH0 T IH2 V** | **N EH1 K R AH0 F IH JH EH0 Z** |


## Installation

```bash
pip install aquila-resolve
```

@@ -32,8 +63,8 @@
> automatically downloaded on the first use of relevant public methods that require inferencing. For example,
> when [instantiating `G2p`](#Usage). You can also start this download manually by calling `Aquila_Resolve.download()`.
>
> If you are in an environment where remote file downloads are not possible, you can also download the checkpoint
> manually and instantiate `G2p` with the flag: `G2p(custom_checkpoint='path/model.pt')`
> If you are in an environment where remote file downloads are not possible, you can also transfer the checkpoint
> manually, placing `model.pt` within the `Aquila_Resolve.data` module folder.
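
A short sketch of pre-fetching the checkpoint manually, per the note above (`download` is re-exported from the package root):

```python
from Aquila_Resolve import download

# Fetch model.pt ahead of time, e.g. before moving to an offline environment
download()
```
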
## Usage

@@ -48,10 +79,10 @@ g2p.convert('The book costs $5, will you read it?')

> Additional optional parameters are available when defining a `G2p` instance:
| Parameter | Default | Description |
|--------------------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `device` | `'cpu'` | Device for Pytorch inference model |
| `process_numbers` | `True` | Toggles conversion of some numbers and symbols to their spoken pronunciation forms. See [numbers.py](src/Aquila_Resolve/text/numbers.py) for details on what is covered. |
| Parameter | Default | Description |
|-------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `device` | `'cpu'` | Device for Pytorch inference model. GPU is supported using `'cuda'` |
| `process_numbers` | `True` | Toggles conversion of some numbers and symbols to their spoken pronunciation forms. See [numbers.py](src/Aquila_Resolve/text/numbers.py) for details on what is covered. |
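
A hedged sketch of these optional parameters in use (GPU inference via `'cuda'` assumes a CUDA-enabled PyTorch install):

```python
from Aquila_Resolve import G2p

# GPU inference, with number-to-word conversion disabled
g2p = G2p(device='cuda', process_numbers=False)
```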

## Model Architecture

4 changes: 2 additions & 2 deletions setup.cfg
@@ -1,12 +1,12 @@
[metadata]
name = Aquila-Resolve
version = 0.1.2-dev1
version = 0.1.2
author = ionite
author_email = dev@ionite.io
description = Augmented Recurrent Neural Grapheme-to-Phoneme conversion with Inflectional Orthography.
long_description = file: README.md
long_description_content_type = text/markdown
url = https://github.com/ionite34/Aquila-Resolve'
url = https://github.com/ionite34/Aquila-Resolve
license = Apache 2.0
license_file = LICENSE
classifiers =
2 changes: 1 addition & 1 deletion src/Aquila_Resolve/__init__.py
@@ -4,7 +4,7 @@
Grapheme to Phoneme Resolver
"""
__version__ = "0.1.2-dev1"
__version__ = "0.1.2"

from .g2p import G2p
from .data.remote import download
2 changes: 1 addition & 1 deletion src/Aquila_Resolve/data/__init__.py
@@ -2,7 +2,7 @@

if sys.version_info < (3, 9):
# In Python versions below 3.9, this is needed
from importlib_resources import files
from importlib_resources import files # pragma: no cover
else:
# Since python 3.9+, importlib.resources.files is built-in
from importlib.resources import files
13 changes: 4 additions & 9 deletions src/Aquila_Resolve/g2p.py
@@ -9,15 +9,14 @@
from nltk.stem.snowball import SnowballStemmer

from .h2p import H2p
from .h2p import replace_first
from .text.replace import replace_first
from .format_ph import with_cb
# from .dict_reader import DictReader
from .static_dict import get_cmudict
from .text.numbers import normalize_numbers
from .filter import filter_text
from .processors import Processor
from .infer import Infer
from .symbols import contains_alpha, brackets_match
from .symbols import contains_alpha, valid_braces

re_digit = re.compile(r"\((\d+)\)")
re_bracket_with_digit = re.compile(r"\(.*\)")
@@ -143,13 +142,9 @@ def convert(self, text: str, convert_num: bool = True) -> str | None:
:param convert_num: True to convert numbers to words
"""

# Check that every {} bracket is paired
check = brackets_match(text)
if check is not None:
raise ValueError(check)

# Normalize numbers, if enabled
# Convert numbers, if enabled
if convert_num:
valid_braces(text, raise_on_invalid=True)
text = normalize_numbers(text)

# Filter and Tokenize
23 changes: 7 additions & 16 deletions src/Aquila_Resolve/h2p.py
@@ -1,27 +1,18 @@
import nltk
import re
from nltk.tokenize import TweetTokenizer
from nltk import pos_tag
from nltk import pos_tag_sents
from .dictionary import Dictionary
from .filter import filter_text as ft
from .text.replace import replace_first
from . import format_ph as ph

# Check that the nltk data is downloaded, if not, download it
# Check required nltk data exists, if not, download it
try:
nltk.data.find('taggers/averaged_perceptron_tagger.zip')
except LookupError:
nltk.download('averaged_perceptron_tagger')


# Method to use Regex to replace the first instance of a word with its phonemes
def replace_first(target, replacement, text):
# Skip if target invalid
if target is None or target == '':
return text
# Replace the first instance of a word with its phonemes
# return re.sub(r'(?i)\b' + target + r'\b', replacement, text, 1)
return re.sub(r'(?<!\{)\b' + target + r'\b(?![\w\s]*[}])', replacement, text, count=1, flags=re.IGNORECASE)
from nltk.data import find
find('taggers/averaged_perceptron_tagger.zip')
except LookupError: # pragma: no cover
from nltk.downloader import download
download('averaged_perceptron_tagger', raise_on_error=True)


class H2p:
6 changes: 3 additions & 3 deletions src/Aquila_Resolve/infer.py
@@ -17,13 +17,13 @@ def __init__(self, device='cpu'):
self.lang = 'en_us'
self.batch_size = 32

def __call__(self, words: list[str]) -> list[str]:
def __call__(self, text: list[str]) -> list[str]:
"""
Infers phonemes for a list of words.
:param words: list of words
:param text: list of words
:return: list of phoneme strings, one per input word
"""
res = self.model.phonemise_list(words, lang=self.lang, batch_size=self.batch_size).phonemes
res = self.model.phonemise_list(text, lang=self.lang, batch_size=self.batch_size).phonemes
# Replace all occurrences of '][' with spaces, remove remaining brackets
res = [r.replace('][', ' ').replace('[', '').replace(']', '') for r in res]
return res
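
A hedged call sketch for this internal wrapper (not part of the documented public API; it only illustrates the list-in, list-out contract of `__call__`):

```python
from Aquila_Resolve.infer import Infer

infer = Infer(device='cpu')
phonemes = infer(['tensorflow', 'necrophages'])  # one ARPAbet string per input word
```
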
2 changes: 1 addition & 1 deletion src/Aquila_Resolve/models/__init__.py
@@ -2,7 +2,7 @@

if sys.version_info < (3, 9):
# In Python versions below 3.9, this is needed
from importlib_resources import files
from importlib_resources import files # pragma: no cover
else:
# Since python 3.9+, importlib.resources.files is built-in
from importlib.resources import files
138 changes: 6 additions & 132 deletions src/Aquila_Resolve/models/dp/model/model.py
@@ -4,8 +4,7 @@

import torch
import torch.nn as nn
from torch.nn import TransformerEncoderLayer, LayerNorm, TransformerEncoder
from .utils import get_dedup_tokens, _make_len_mask, _generate_square_subsequent_mask, PositionalEncoding
from .utils import _make_len_mask, _generate_square_subsequent_mask, PositionalEncoding
from ..preprocessing.text import Preprocessor


@@ -17,7 +16,7 @@ def is_autoregressive(self) -> bool:
"""
Returns: bool: Whether the model is autoregressive.
"""
return self in {ModelType.AUTOREG_TRANSFORMER}
return self in {ModelType.AUTOREG_TRANSFORMER} # pragma: no cover


class Model(torch.nn.Module, ABC):
@@ -39,91 +38,7 @@ def generate(self, batch: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, torch.
Tuple[torch.Tensor, torch.Tensor]: The predictions. The first element is a tensor (phoneme tokens)
and the second element is a tensor (phoneme token probabilities)
"""
pass


class ForwardTransformer(Model):

def __init__(self,
encoder_vocab_size: int,
decoder_vocab_size: int,
d_model=512,
d_fft=1024,
layers=4,
dropout=0.1,
heads=1) -> None:
super().__init__()

self.d_model = d_model

self.embedding = nn.Embedding(encoder_vocab_size, d_model)
self.pos_encoder = PositionalEncoding(d_model, dropout)

encoder_layer = TransformerEncoderLayer(d_model=d_model,
nhead=heads,
dim_feedforward=d_fft,
dropout=dropout,
activation='relu')
encoder_norm = LayerNorm(d_model)
self.encoder = TransformerEncoder(encoder_layer=encoder_layer,
num_layers=layers,
norm=encoder_norm)

self.fc_out = nn.Linear(d_model, decoder_vocab_size)

def forward(self,
batch: Dict[str, torch.Tensor]) -> torch.Tensor: # shape: [N, T]
"""
Forward pass of the model on a data batch.
Args:
batch (Dict[str, torch.Tensor]): Input batch entry 'text' (text tensor).
Returns:
Tensor: Predictions.
"""

x = batch['text']
x = x.transpose(0, 1) # shape: [T, N]
src_pad_mask = _make_len_mask(x).to(x.device)
x = self.embedding(x)
x = self.pos_encoder(x)
x = self.encoder(x, src_key_padding_mask=src_pad_mask)
x = self.fc_out(x)
x = x.transpose(0, 1)
return x

@torch.jit.export
def generate(self,
batch: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Inference pass on a batch of tokenized texts.
Args:
batch (Dict[str, torch.Tensor]): Input batch with entry 'text' (text tensor).
Returns:
Tuple: The first element is a Tensor (phoneme tokens) and the second element
is a tensor (phoneme token probabilities).
"""

with torch.no_grad():
x = self.forward(batch)
tokens, logits = get_dedup_tokens(x)
return tokens, logits

@classmethod
def from_config(cls, config: dict) -> 'ForwardTransformer':
preprocessor = Preprocessor.from_config(config)
return ForwardTransformer(
encoder_vocab_size=preprocessor.text_tokenizer.vocab_size,
decoder_vocab_size=preprocessor.phoneme_tokenizer.vocab_size,
d_model=config['model']['d_model'],
d_fft=config['model']['d_fft'],
layers=config['model']['layers'],
dropout=config['model']['dropout'],
heads=config['model']['heads']
)
pass # pragma: no cover


class AutoregressiveTransformer(Model):
@@ -151,42 +66,6 @@ def __init__(self,
dropout=dropout, activation='relu')
self.fc_out = nn.Linear(d_model, decoder_vocab_size)

def forward(self, batch: Dict[str, torch.Tensor]): # shape: [N, T]
"""
Forward pass of the model on a data batch.
Args:
batch (Dict[str, torch.Tensor]): Input batch with entries 'text' (text tensor) and 'phonemes'
(phoneme tensor for teacher forcing).
Returns:
Tensor: Predictions.
"""

src = batch['text']
trg = batch['phonemes'][:, :-1]

src = src.transpose(0, 1) # shape: [T, N]
trg = trg.transpose(0, 1)

trg_mask = _generate_square_subsequent_mask(len(trg)).to(trg.device)

src_pad_mask = _make_len_mask(src).to(trg.device)
trg_pad_mask = _make_len_mask(trg).to(trg.device)

src = self.encoder(src)
src = self.pos_encoder(src)

trg = self.decoder(trg)
trg = self.pos_decoder(trg)

output = self.transformer(src, trg, src_mask=None, tgt_mask=trg_mask,
memory_mask=None, src_key_padding_mask=src_pad_mask,
tgt_key_padding_mask=trg_pad_mask, memory_key_padding_mask=src_pad_mask)
output = self.fc_out(output)
output = output.transpose(0, 1)
return output

@torch.jit.export
def generate(self,
batch: Dict[str, torch.Tensor],
@@ -278,15 +157,10 @@ def create_model(model_type: ModelType, config: Dict[str, Any]) -> Model:
Returns: Model: Model object.
"""

if model_type is ModelType.TRANSFORMER:
model = ForwardTransformer.from_config(config)
elif model_type is ModelType.AUTOREG_TRANSFORMER:
model = AutoregressiveTransformer.from_config(config)
else:
if model_type is not ModelType.AUTOREG_TRANSFORMER: # pragma: no cover
raise ValueError(f'Unsupported model type: {model_type}. '
f'Supported types: {[t.value for t in ModelType]}')
return model
'Supported type: AUTOREG_TRANSFORMER')
return AutoregressiveTransformer.from_config(config)


def load_checkpoint(checkpoint_path: str, device: str = 'cpu') -> Tuple[Model, Dict[str, Any]]: