
Noisy Entropy in Languages

Exploring Communication Complexity: A Study of Language Patterns in Social Media and Programming Languages

Introduction

This master's thesis project, conducted as part of the MSc in Computational Software Techniques in Engineering at Cranfield University, aims to explore the inherent complexities in language that reflect the unpredictability and intricate patterns of human communication.

By leveraging entropy as a quantifiable measure of this complexity, the study captures the patterns and nuances within various forms of interaction, including social media, literature, and programming languages.

Overview

Key Metrics

  • Languages Studied: English (EN), French (FR), Spanish (ES)
  • Programming Languages: C++, Python, Java
  • Datasets: 7 Custom Datasets
  • Data Collection: 300+ million tweets collected, ~80 GB of data processed
  • Entropy Estimators: 3 Families, 7 Plug-In Entropy Estimators
  • Analysis: 8 NLP Analyses, 2 Methods of Uncertainty Analysis
  • Literature: 5 Books Examined
  • Global Events Studied: Covid-19, Ukraine War

Objectives

  1. Explore Entropy in Language: Analyzing major languages on Twitter.
  2. Analyze Linguistic Chaos in Twitter: Examine word usage, accents, emojis, punctuation.
  3. Investigate Noisy Entropy in Literature: Assess noisy characteristics in literary works.
  4. Map Noisy Environments: Focus on contexts such as COVID-19 and the Ukraine War.
  5. Examine Human Emotion: Uncover emotional patterns and entropy across languages.
  6. Assess Media Language Styles: Study news outlets' unique entropy patterns.
  7. Contrast Personal Communication Tactics: Compare entropy in public figure tweets.
  8. Analyze Entropy over Time: Investigate entropy changes during significant global events.
  9. Explore Programming Languages Entropy: Study variables, strings, numbers in programming languages.

Entropy Estimators

  • Plug-In Estimators: Non-parametric approach reflecting complexity at the token level without considering context.
  • Entropy Rate: Measures uncertainty per symbol/token, capturing long-term behavior.
  • Prediction by Partial Matching (PPM): Context-based adaptive technique offering a flexible approach.
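As an illustration of the plug-in family, here is a minimal sketch of a Maximum Likelihood estimate with its Miller-Madow bias correction on a token sequence. It is a simplified stand-in, not the project's implementation, which covers the full list of estimators detailed further below.

```python
import math
from collections import Counter

def ml_entropy(tokens):
    """Maximum Likelihood (plug-in) entropy in bits, computed from observed token frequencies."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def miller_madow_entropy(tokens):
    """Miller-Madow estimator: ML entropy plus the (K - 1) / (2N) bias correction, converted to bits."""
    n = len(tokens)
    k = len(set(tokens))  # number of distinct observed tokens
    return ml_entropy(tokens) + (k - 1) / (2 * n) / math.log(2)

tokens = "to be or not to be that is the question".split()
print(ml_entropy(tokens), miller_madow_entropy(tokens))
```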

Some Key Findings

Entropy in Noisy Literature:

  • Character-Level Insights: Novel estimates of ~4.35 bits, surpassing previous works.
  • Word-level Insights: Confirms prior work, observes a 3-bit shift with context consideration.

Entropy on Twitter:

  • Character Complexity: Aligns with literature; chaos resides in word formation.
  • Unigram Entropy: 10-11 bits, driven up by hapax legomena (2 bits above past studies).
  • Contextual Shift: A consistent 3-bit shift reveals logical structure despite noise.

Entropy in Personal Communication:

  • Public Figures: Contrast in entropy between Musk (technical diversity) and Trump (political discourse).
  • Political Structure: Repeated patterns in political discourse converge public opinion, resulting in low entropy.
  • Comparison: Entropy Rate comparison between Donald Trump and Elon Musk.

[Figure: Entropy Rate comparison between Donald Trump and Elon Musk]

Cluster Analysis:

  • Symbols and Structure: Accents, emojis increase entropy; punctuation reduces it, defying traditional beliefs.
  • Emotional Patterns: Universal low entropy in inherent emotions like love & fear.
  • Media Bias: News outlet styles reveal unique entropy, hinting at bias detection capabilities.

Entropy over Time:

  • Global Events: COVID-19 and the Ukraine War show unigram entropy convergence, highlighting societal response and unity.
  • Language Evolution: Decline in entropy signifies societal adaptation, convergence in viewpoints.
  • Trend Analysis: Entropy Rate trend over time for the COVID-19 French dataset.

[Figure: Entropy Rate trend over time for the COVID-19 French dataset]

Entropy in Programming Languages:

  • Universality: Common features in C++, Python, Java.
  • Predictable Structure: Low Entropy Rate highlights codified patterns and conventions.
  • Source of Noise: Variable and function naming contribute most to complexity.

Data

The data used for this project is a mix of Social Media data and Computer Language data.

Social Media Data

For the Social Media data, Twitter was used as the main source. The data was collected through scraping, streaming, and the Twitter API. In total, around 200-300 million tweets were collected across these methods.

To comply with Twitter's Terms of Service, the data used in this project is shared only in dehydrated form: it is distributed as tweet IDs rather than full tweets.

A script to hydrate the data is provided in the dataAcquisition folder, as well as a script to dehydrate the data.
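For illustration, a minimal dehydration/hydration sketch using pandas and Tweepy. The column name, batch size, and fields are assumptions made for this example; the actual scripts are the ones provided in the dataAcquisition folder.

```python
import pandas as pd
import tweepy

def dehydrate(csv_in, csv_out):
    """Keep only the tweet IDs so the data can be shared in line with Twitter's Terms of Service."""
    df = pd.read_csv(csv_in)  # assumes an 'id' column holding the tweet IDs
    df[["id"]].to_csv(csv_out, index=False)

def hydrate(ids, bearer_token):
    """Re-fetch full tweets from a list of IDs via the Twitter API v2, 100 IDs per request."""
    client = tweepy.Client(bearer_token=bearer_token)
    tweets = []
    for i in range(0, len(ids), 100):
        response = client.get_tweets(ids=ids[i:i + 100], tweet_fields=["created_at", "lang"])
        tweets.extend(response.data or [])
    return tweets
```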

Computer Language Data

For the Computer Language data, CodeNet was used as the main source. CodeNet is a large-scale, high-quality dataset of programmatic source code. It contains 14M code samples from 55 programming languages, and is available for download here.

Due to licensing restrictions, the dataset was not created from scratch using GitHub data. Instead, it was downloaded from the official website and used as is.

[Figure: Overview of the datasets]

Data Acquisition

Collect data from Twitter using scraping, streaming, and the Twitter API.

Learn more about the data collection here.
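As a rough illustration of the streaming approach, here is a minimal Tweepy sketch against the Twitter API v2 filtered stream. The rule, fields, and output format are placeholders, not the repository's actual collection scripts.

```python
import tweepy

class TweetCollector(tweepy.StreamingClient):
    """Minimal streaming collector: append matching tweets to a file as they arrive."""

    def on_tweet(self, tweet):
        with open("stream.tsv", "a", encoding="utf-8") as f:
            f.write(f"{tweet.id}\t{tweet.text!r}\n")

collector = TweetCollector("YOUR_BEARER_TOKEN")
collector.add_rules(tweepy.StreamRule("covid lang:fr"))  # hypothetical filter rule
collector.filter(tweet_fields=["created_at", "lang"])
```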

Preprocessing

Pre-processing of the data includes the following steps:

  • Cleaning
  • Data Manipulation
  • Tokenization

Learn more about the pre-processing here.
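For illustration, a minimal sketch of the kind of cleaning and tokenization involved. It is deliberately simplified: the actual pipeline is more configurable, since accents, emojis, and punctuation are themselves objects of study here.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean(text):
    """Basic cleaning: strip URLs and user mentions, collapse whitespace, lowercase."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text, level="word"):
    """Tokenize at the word or character level, the two granularities used for entropy estimation."""
    return list(text) if level == "char" else text.split()

print(tokenize(clean("Check this out @user https://example.com  Great!")))
```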

NLP

Natural Language Processing (NLP) is used to label the tweets. The labels are then used to filter the data and to perform the analysis on clusters.

The NLP analysis includes:

  • Emotion Analysis
  • Hate Speech Detection
  • Irony Detection
  • Language Detection
  • Named Entity Recognition
  • Offensive Language Detection
  • Sentiment Analysis
  • Topic Detection

Learn more about the NLP analysis here.
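As an illustration of how such labels can be produced, a minimal sketch using a Hugging Face pipeline with a Twitter-tuned sentiment model. The model choice is an assumption made for this example; each analysis in the project may rely on a different model.

```python
from transformers import pipeline

# Hypothetical model choice; swap in the model appropriate to each analysis.
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

tweets = ["I love this!", "This is terrible..."]
for tweet, result in zip(tweets, sentiment(tweets)):
    print(tweet, "->", result["label"], round(result["score"], 3))
```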

Entropy Estimation

The entropy estimation uses several methods to estimate the entropy of the text:

  • Plug-in Estimators (unigrams)
    • Maximum Likelihood (ML)
    • Miller-Maddow (MM)
    • Chao-Shen (CS)
    • Schurmann-Grassberger (SG)
    • Shrinkage (SH)
    • Laplace
    • Jeffrey
    • Minimax
    • NSB
  • Entropy Rate
  • Prediction by Partial Matching (PPM)

On top of the entropy estimators, bootstrapping is used to estimate confidence intervals for the entropy estimates.

Learn more about the entropy estimation here.
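For illustration, a minimal percentile-bootstrap sketch around a Maximum Likelihood estimate: tokens are resampled with replacement and the entropy is re-estimated on each resample. The number of resamples and the choice of estimator are placeholders; the project's uncertainty analysis may differ.

```python
import math
import random
from collections import Counter

def ml_entropy(tokens):
    """Maximum Likelihood (plug-in) entropy in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bootstrap_ci(tokens, estimator=ml_entropy, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for an entropy estimate."""
    estimates = sorted(
        estimator(random.choices(tokens, k=len(tokens)))  # resample with replacement
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

tokens = open("tweets.txt", encoding="utf-8").read().split()  # hypothetical tokenized corpus
print(bootstrap_ci(tokens))
```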

Analysis

The analysis is designed to generate graphs and extract insights from the performed analyses.

There are two types of analysis:

  • Analysis (Preliminary Analysis and Exploratory Analysis)
  • Results Analysis (Analysis of the results of the entropy estimation)

Learn more about the analysis here.

Workflows

Here is the general workflow of the project:

[Figure: General project workflow]

Here is the environment of the data acquisition workflow:

[Figure: Data acquisition environment]

Deliverables

The master's thesis is available here.

Find the poster here.

[Figure: Project poster]


Individual Research Project 2022/2023

Copyright © 2023 by Siméon FEREZ. All rights reserved.
