tweet-process-classify

Our aim is to classify Tweets as “positive”, “neutral”, or “negative” using a logistic regression classifier, with pipelines for pre-processing and model building.

The program takes two arguments:

• Argument 1: Path of the input file in a publicly accessible location, such as AWS S3.
• Argument 2: Path of the output file in a publicly accessible location, such as AWS S3.
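As a rough illustration of the two arguments, here is a minimal Python sketch using `argparse`. The argument names `input_path` and `output_path` and the example S3 URIs are hypothetical; the actual program may read its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output paths (names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(description="Classify tweet sentiment.")
    parser.add_argument("input_path", help="location of the input file, e.g. an S3 URI")
    parser.add_argument("output_path", help="location where the metrics file is written")
    return parser.parse_args(argv)


# Hypothetical example values, passed explicitly instead of via the command line.
args = parse_args(["s3://example-bucket/input.csv", "s3://example-bucket/metrics.txt"])
print(args.input_path)
print(args.output_path)
```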

1. Loading

The first step is to load the text file from the path given as the first argument. We then remove rows where the text field is null.
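The loading step can be sketched in plain Python as follows. This is not the Spark code used by the program; the column names `label` and `text` and the sample rows are assumptions made for illustration, and an empty field stands in for a null value.

```python
import csv
import io


def load_rows(lines, text_field="text"):
    """Read CSV rows and keep only those whose text field is present and non-empty."""
    reader = csv.DictReader(lines)
    return [row for row in reader if row.get(text_field)]


# Made-up sample data; the second row has no text and is dropped.
sample = io.StringIO(
    "label,text\n"
    "positive,great flight\n"
    "negative,\n"
    "neutral,on time\n"
)
rows = load_rows(sample)
print(len(rows))
```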

2. Pre-Processing

• Tokenizer: Split each sentence in the text column into individual words

• Stop Word Remover: Remove stop words from the tokenized text

• Term Hashing: Convert words to term-frequency vectors

• Label Conversion: The label is a string, e.g. “Positive”, which we convert to a numeric index.

We create a pipeline of the above steps and then transform the raw input dataset to a pre-processed dataset.
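To make the pre-processing steps concrete, here is a minimal plain-Python sketch of each one. It only illustrates the ideas and is not the Spark pipeline itself: the stop-word list is a tiny made-up set, `zlib.crc32` is a stand-in for whatever hash function the real term-hashing stage uses, and the alphabetical label ordering is an assumption.

```python
import zlib

# Tiny illustrative stop-word list (an assumption, not the list the program uses).
STOP_WORDS = {"the", "a", "is", "to", "and"}


def tokenize(text):
    """Lowercase the sentence and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def hash_term_frequencies(tokens, num_features=16):
    """Hashing trick: count each token in the bucket its hash maps to."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


def index_labels(labels):
    """Map each distinct label string to an integer (alphabetical order here)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# Chain the steps in the order tokenizer, stop-word remover, term hashing.
tokens = remove_stop_words(tokenize("The flight is great and great"))
vector = hash_term_frequencies(tokens)
label_index = index_labels(["positive", "negative", "neutral", "positive"])
```

Note that distinct words can collide in the same bucket when `num_features` is small, which is the usual trade-off of term hashing.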

3. Model Creation

We create a logistic regression classification model. We then build a parameter grid with ParamGridBuilder and use a CrossValidator object to find the best model parameters. More details are in the Spark API documentation: https://spark.apache.org/docs/2.2.0/api/scala/index.html
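The idea behind grid search with cross-validation can be sketched in plain Python as below. This is a conceptual illustration only, not Spark's implementation: the parameter names `reg_param` and `max_iter`, the fold assignment scheme, and the toy scoring function are all assumptions, and a real `fit_and_score` would train a logistic regression on the training rows and score it on the validation rows.

```python
from itertools import product


def build_param_grid(options):
    """Expand {"name": [values, ...]} into a list of parameter dicts."""
    names = sorted(options)
    return [dict(zip(names, values)) for values in product(*(options[n] for n in names))]


def k_fold_indices(n_rows, k):
    """Yield (train, validation) index lists, assigning row i to fold i % k."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


def cross_validate(grid, n_rows, k, fit_and_score):
    """Return the parameter dict with the highest mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = [fit_and_score(params, train, val) for train, val in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid and a toy scorer that simply prefers weaker regularization.
grid = build_param_grid({"reg_param": [0.01, 0.1], "max_iter": [10, 50]})
best_params, best_score = cross_validate(
    grid, n_rows=9, k=3, fit_and_score=lambda params, train, val: -params["reg_param"]
)
```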

4. Model Testing & Cross Validation

Next, we train and test our model on the given dataset and output classification evaluation metrics such as accuracy. Details of the multi-class evaluation metrics are at https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html.
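For reference, accuracy and per-class precision and recall can be computed from true and predicted labels as in this plain-Python sketch. The label values and predictions are made up, and the program itself presumably relies on Spark's evaluators rather than code like this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class label."""
    true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Made-up numeric labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
acc = accuracy(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred, label=2)
```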

5. Output

Finally, we write the classification metrics to a file whose location is specified by the second argument.
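A minimal sketch of the output step, assuming a local file path and a simple `name: value` text layout. Both are assumptions: the source does not specify the file format, and writing to a location such as S3 would need a suitable client or Spark's own writer. The metric values are made up.

```python
import os
import tempfile


def write_metrics(metrics, output_path):
    """Write one 'name: value' line per metric, sorted by metric name."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for name, value in sorted(metrics.items()):
            handle.write(f"{name}: {value:.4f}\n")


# Hypothetical metric values written to a temporary local file.
output_path = os.path.join(tempfile.mkdtemp(), "metrics.txt")
write_metrics({"f1": 0.7, "accuracy": 0.75}, output_path)
with open(output_path, encoding="utf-8") as handle:
    written = handle.read()
```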
