Cornell Movie Dialog Dataset Analysis

This repository contains a comprehensive project analyzing the Cornell Movie Dialog Dataset, focusing on Natural Language Processing (NLP) techniques and machine learning models. The project explores conversational patterns, dialogue generation, metadata predictions, and chatbot development. It is a group project from Politecnico di Milano

Dataset

Source: Cornell Movie Dialog Dataset
Paper: Chameleons in Imagined Conversations
Citation:

Danescu-Niculescu-Mizil, C., & Lee, L. (2011).
Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.
ACL 2011.

Features and Objectives

Main Tasks

Exploratory Data Analysis (EDA):
- Dataset statistics
- Dialogue searches using regular expressions
- Topic modeling and clustering
- Learning word embeddings
Dialogue Generation:
- Encoder-decoder models
- Attention mechanisms
- Pre-trained models (Flan-T5, DialoGPT, fine-tuned GPT-2)
Metadata Predictions:
- Multi-label classification for movie genres
- Rating predictions using linear models and fine-tuned BERT
Chatbot Development:
- Speech-to-text and text-to-speech integration
- GPT-2 fine-tuning for character-specific dialogue generation

Key Objectives

Analyze dialogue characteristics using NLP techniques.
Compare dialogue generation models and fine-tuning methods.
Predict metadata with classification and regression models.
Build a movie character chatbot for interactive responses.

Requirements

Python 3.7+
Libraries: TensorFlow, PyTorch, Hugging Face Transformers, nltk, scikit-learn, etc.
Google Colab for running the notebook (optional).

Installation

Clone this repository:

git clone https://github.com/VolkanMazlum/NLPProject.git
cd repository-name

Install the required dependencies:
```
pip install -r requirements.txt
```

Download the dataset:

curl -L -o cornell_movie_dialogs.zip http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
unzip cornell_movie_dialogs.zip

Usage

Open the notebook:

jupyter notebook NLP_cornell_movie_dataset.ipynb

Follow the notebook sections:
- EDA: Explore dataset characteristics.
- Modeling: Train dialogue generation and metadata prediction models.
- Chatbot: Run the chatbot for interactive conversations.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
NLP_cornell_movie_dataset.ipynb		NLP_cornell_movie_dataset.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cornell Movie Dialog Dataset Analysis

Dataset

Features and Objectives

Main Tasks

Key Objectives

Requirements

Installation

Usage

About

Releases

Packages

Languages

VolkanMazlum/NLPProject

Folders and files

Latest commit

History

Repository files navigation

Cornell Movie Dialog Dataset Analysis

Dataset

Features and Objectives

Main Tasks

Key Objectives

Requirements

Installation

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages