
Emrys02/conversation-to-dataset-parser

WhatsApp Chat Parser

This script parses exported WhatsApp chat logs, cleans the messages, groups them by date and sender, and stores the processed messages in a JSON file.

Features

  • Parses WhatsApp Chat Logs: Extracts messages from a user-provided WhatsApp chat export file.
  • Filters System Messages: Removes common system notifications (e.g., encryption notices, contact blocked/unblocked, media omitted, missed calls, deleted messages).
  • Merges Broken Lines: Combines multi-line messages into single, coherent entries.
  • Groups Messages: Orders messages chronologically and places messages sent within an hour of each other in the same group. Each group is keyed by its start and end timestamps, and its messages are split into 'incoming' and 'outgoing' lists based on the specified sender.
  • Stores Processed Data: Saves the final grouped messages into a JSON file.
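The filtering and line-merging steps above can be sketched roughly as follows. This is an illustration, not the script's actual code: the line format ("dd/mm/yyyy, HH:MM - Sender: text"), the MESSAGE_START pattern, the SYSTEM_MARKERS list, and the function names are all assumptions made for the example.

```python
import re

# Assumed export line format: "12/10/2023, 10:00 - Alice: Hello there!"
MESSAGE_START = re.compile(r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - (.*)$")

# Hypothetical fragments that identify system notifications.
SYSTEM_MARKERS = (
    "end-to-end encrypted",
    "<Media omitted>",
    "This message was deleted",
    "Missed voice call",
)


def merge_broken_lines(lines):
    """Append lines that do not start with a timestamp to the previous message."""
    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if MESSAGE_START.match(line) or not merged:
            merged.append(line)
        else:
            merged[-1] += " " + line.strip()
    return merged


def is_system_message(line):
    """Treat lines without a 'Sender: ' prefix, or with a known marker, as system lines."""
    match = MESSAGE_START.match(line)
    if not match:
        return False
    body = match.group(2)
    if ": " not in body:
        return True
    return any(marker in body for marker in SYSTEM_MARKERS)


def clean(lines):
    """Merge continuation lines first, then drop system notifications."""
    return [line for line in merge_broken_lines(lines) if not is_system_message(line)]
```

Merging happens before filtering in this sketch so that a continuation line is never attached to a message that is about to be removed out of order.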

Usage

The script is run from the command line and requires three arguments:

python scripts/main.py --sender "Your Name" --platform whatsapp --file_path "path/to/your/whatsapp_chat.txt"

Arguments:

  • --sender: (Required) Your name as it appears in the WhatsApp chat log. This is used to differentiate between your messages (outgoing) and the other person's messages (incoming). Enclose in quotes if it contains spaces.
  • --platform: (Required) The messaging platform the chat export is from. Currently, only whatsapp is supported.
  • --file_path: (Required) The relative or absolute path to the exported chat file (usually a .txt file).
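For readers curious how such a command line might be declared, here is a minimal sketch using the standard argparse module. Only the three flag names come from this README; the function name build_parser, the help strings, and the use of choices to restrict --platform are assumptions.

```python
import argparse


def build_parser():
    """Declare the three required arguments described above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Parse an exported chat log into grouped JSON."
    )
    parser.add_argument("--sender", required=True,
                        help="Your name as it appears in the chat log.")
    # Restricting values with `choices` is an assumption of this sketch.
    parser.add_argument("--platform", required=True, choices=["whatsapp"],
                        help="Platform the export came from.")
    parser.add_argument("--file_path", required=True,
                        help="Path to the exported chat file.")
    return parser


# Parsing an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["--sender", "Your Name", "--platform", "whatsapp", "--file_path", "chat.txt"]
)
```

With argparse, omitting any of the three flags makes the parser exit with a usage error.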

Output

The script writes two files to an exports directory in the current working directory, creating the directory if it does not exist:

  1. Cleaned Messages (Intermediate): exports/cleaned_messages.txt holds the messages after system notifications have been removed and broken lines merged. This file is mainly useful for debugging or inspection.
  2. Grouped Messages (Final): exports/grouped_messages.json is the primary output and holds the grouped messages.
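Writing the final file could look something like the sketch below. The function name save_grouped and its export_dir parameter are hypothetical, and the indent and ensure_ascii settings are choices made for this example rather than documented behaviour of the script.

```python
import json
from pathlib import Path


def save_grouped(grouped, export_dir="exports"):
    """Write the grouped messages to <export_dir>/grouped_messages.json."""
    out_dir = Path(export_dir)
    # Create the directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "grouped_messages.json"
    with out_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps emoji and non-Latin text readable in the file.
        json.dump(grouped, handle, ensure_ascii=False, indent=2)
    return out_path
```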

JSON Structure:

The grouped_messages.json file contains a single JSON object where:

  • Each key is a string representing a message group, formatted as "{start_timestamp} - {end_timestamp}" (e.g., "10/06/2023, 15:30 - 10/06/2023, 16:15"). Messages within an hour of each other are grouped together.
  • Each value is an object with two keys:
    • "incoming": A list of strings, where each string is a message received from the other participant(s).
    • "outgoing": A list of strings, where each string is a message sent by the user specified with the --sender argument.

Example exports/grouped_messages.json:

{
  "12/10/2023, 10:00 - 12/10/2023, 10:45": {
    "incoming": [
      "Hello there!",
      "How are you doing?"
    ],
    "outgoing": [
      "Hi!",
      "I'm good, thanks for asking.",
      "What about you?"
    ]
  },
  "12/10/2023, 15:30 - 12/10/2023, 15:30": {
    "incoming": [
      "Just checking in."
    ],
    "outgoing": []
  }
}
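The grouping that produces this structure can be approximated as follows. This is a sketch under stated assumptions: the timestamp format is inferred from the example keys, "within an hour of each other" is read as "no more than one hour after the previous message", and the function name group_messages and its tuple input are hypothetical.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%d/%m/%Y, %H:%M"  # assumed from the example keys above


def group_messages(messages, sender, max_gap=timedelta(hours=1)):
    """Group (timestamp, sender_name, text) tuples given in chronological order."""
    buckets = []
    for stamp, name, text in messages:
        when = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        # Start a new group when the gap to the previous message exceeds max_gap.
        if not buckets or when - buckets[-1]["end_time"] > max_gap:
            buckets.append({"start": stamp, "end": stamp, "end_time": when,
                            "incoming": [], "outgoing": []})
        bucket = buckets[-1]
        bucket["end"], bucket["end_time"] = stamp, when
        bucket["outgoing" if name == sender else "incoming"].append(text)
    return {
        f"{b['start']} - {b['end']}": {"incoming": b["incoming"],
                                       "outgoing": b["outgoing"]}
        for b in buckets
    }
```

Under this reading a group can span more than an hour in total, as long as no two consecutive messages are more than an hour apart.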

Prerequisites

  • Python 3.6 or later: The script uses f-strings, which were introduced in Python 3.6, along with type hints. Ensure you have a suitable Python 3 version installed.
  • No External Libraries: The script relies only on standard Python libraries, so no additional pip install steps are required.
