The objective is to provide tools for preparing your text data without having to install anything. Some text-cleaning libraries cannot be used on professional computers because they need to download files from servers or URLs that are blocked by internet proxies. With pyTCTK, you only need Python and access to GitHub to clean your text data. The goal is a library you can use anywhere, including on your professional computer.
For the moment, three classes with several functions are available:
- The TextNet class implements the general functions to clean up your text (removing punctuation, uppercase letters, email addresses, URLs, HTML tags, etc.);
- The WordNet class implements the functions for more precise cleaning at the word level of your text (removing stopwords, or applying lemmatization or stemming);
- The Tokenize class implements two functions to tokenize and detokenize the words in your text.
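To make the three kinds of processing listed above concrete, here is a minimal sketch that uses only the standard `re` module. It is not pyTCTK's API: the function names (`clean_text`, `tokenize`, `remove_stopwords`, `detokenize`) and the regular expressions are hypothetical, chosen only to illustrate text-level cleaning, word-level cleaning, and tokenization.

```python
import re

# Hypothetical helpers for illustration only; these are NOT pyTCTK's functions.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text):
    """Text-level cleaning: drop HTML tags, URLs, emails and punctuation, then lowercase."""
    text = HTML_TAG_RE.sub(" ", text)   # tags first, so they do not stick to URLs
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Split a cleaned text into words on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopwords):
    """Word-level cleaning: keep only the tokens that are not stopwords."""
    return [token for token in tokens if token not in stopwords]


def detokenize(tokens):
    """Join tokens back into a single string."""
    return " ".join(tokens)


raw = "<p>Hello, World! Visit https://example.com or mail me@example.com.</p>"
tokens = tokenize(clean_text(raw))
print(detokenize(remove_stopwords(tokens, {"or"})))
```

The order of the substitutions matters in this sketch: HTML tags are removed before URLs so that a closing tag glued to a link is not swallowed by the URL pattern.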
- Python version 3.9.7
- Install requirements.txt
$ pip install -r requirements.txt
- Libraries used
import numpy as np
import os
import pandas as pd
import re
from urllib import request
- requirements
  - This folder contains a .txt file with all the packages and versions needed to run the project.
- pyTCTK
  - This folder contains a .py file with all the classes, functions and methods.
- example
  - This folder contains an example notebook that shows how to use the different classes and functions, along with their outputs.
- ressources
  - This folder contains several subfolders holding the .txt vocabulary files used to process and clean the texts.
Here is the project structure:
- project
    > pyTCTK
        > requirements
            - requirements.txt
        > codefile
            - pyTCTK.py
        > example
            - pyTCTK.ipynb
        > ressources
            > stopwords
                - english.txt
                - french.txt
            > lemme
                - english.txt
                - french.txt
            > stemme
                - english.txt
                - french.txt
            > accents
                - accents.txt
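As an illustration of how the vocabulary files listed above could be read from disk, the sketch below builds a path of the form `ressources/<kind>/<language>.txt` and loads it into a set. The helper names `vocabulary_path` and `load_word_list` are hypothetical, and the file format (one entry per line, UTF-8 encoded) is an assumption, not something this README specifies; pyTCTK itself may load these files differently.

```python
from pathlib import Path


def vocabulary_path(root, kind, language):
    """Build <root>/ressources/<kind>/<language>.txt (hypothetical helper)."""
    return Path(root) / "ressources" / kind / f"{language}.txt"


def load_word_list(path):
    """Read a vocabulary file, ASSUMING one entry per line, UTF-8 encoded."""
    with open(path, encoding="utf-8") as handle:
        # Strip whitespace and skip blank lines.
        return {line.strip() for line in handle if line.strip()}


# Assumes the current directory is the folder that contains "ressources".
stopwords_file = vocabulary_path(".", "stopwords", "english")
if stopwords_file.exists():
    stopwords = load_word_list(stopwords_file)
    print(f"Loaded {len(stopwords)} stopwords from {stopwords_file}")
```

Reading the files from a local clone in this way needs no network access beyond the initial download from GitHub.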