Can Large Language Models Identify Authorship?

  • Overview: This repo contains the code, results, and data for the EMNLP 2024 Findings paper "Can Large Language Models Identify Authorship?"

  • TLDR: We find that LLMs have strong zero-shot capabilities for authorship verification and attribution, surpassing state-of-the-art supervised models while providing explanations grounded in linguistic features.

  • [arXiv] [Project Website]

This work focuses on exploring the capabilities of Large Language Models (LLMs) in authorship analysis tasks, specifically authorship verification and authorship attribution. The primary aim is to investigate whether LLMs can accurately identify the authorship of texts, which is pivotal for verifying content authenticity and mitigating misinformation.

Figure: A comparison between Linguistically Informed Prompting (LIP) and other prompting strategies for authorship verification. "Analysis" and "Answer" are the outputs of prompting GPT-4. Only the LIP strategy correctly identifies that the two given texts belong to the same author. Text colored in orange highlights the differences compared to vanilla prompting with no guidance; text colored in blue indicates the linguistically informed reasoning process and passages referenced from the original documents.

BibTex

```bibtex
@article{huang2024authorship,
    title   = {Can Large Language Models Identify Authorship?},
    author  = {Baixiang Huang and Canyu Chen and Kai Shu},
    year    = {2024},
    journal = {arXiv preprint},
    volume  = {abs/2403.08213},
    url     = {https://arxiv.org/abs/2403.08213},
}
```

Methodology

Traditional authorship analysis methods rely on hand-crafted writing-style features and classifiers, while state-of-the-art approaches use text embeddings from pre-trained language models, often requiring domain-specific fine-tuning. Our approach evaluates LLMs on authorship analysis without any fine-tuning and explores integrating explicit linguistic features to enhance their reasoning, as sketched below.
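As an illustration, the following sketch issues a zero-shot authorship verification query with a linguistically informed prompt through the OpenAI chat API. The prompt wording, the list of linguistic features, and the model name are assumptions for illustration, not the exact ones used in our experiments.

```python
# Hypothetical sketch of zero-shot authorship verification with a
# linguistically informed prompt (LIP); prompt text and model name
# are illustrative, not the exact ones from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LIP_INSTRUCTION = (
    "Verify whether the two input texts were written by the same author. "
    "Base your analysis on writing styles and linguistic features such as "
    "phrasal verbs, modal verbs, punctuation, rare words, affixes, "
    "quantities, humor, sarcasm, typographical errors, and misspellings. "
    "End your response with 'Answer: yes' or 'Answer: no'."
)

def verify_authorship(text1: str, text2: str) -> str:
    """Ask the model whether text1 and text2 share an author."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"{LIP_INSTRUCTION}\n\nText 1: {text1}\n\nText 2: {text2}",
        }],
    )
    return response.choices[0].message.content
```

By contrast, a vanilla prompt simply asks whether the two texts share an author, without directing the model toward any linguistic features.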

Data Preprocessing

For this study, duplicate texts were removed, and authors contributing fewer than two texts were excluded. Non-English texts were filtered out using the py3langid tool, available at the py3langid GitHub repository.
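A minimal sketch of this preprocessing, assuming the corpus is held in a pandas DataFrame with hypothetical `text` and `author` columns:

```python
# Hypothetical preprocessing sketch; the column names and DataFrame layout
# are assumptions, not the repo's actual data schema.
import pandas as pd
import py3langid as langid

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicate texts.
    df = df.drop_duplicates(subset="text")
    # Keep only texts that py3langid classifies as English.
    df = df[df["text"].apply(lambda t: langid.classify(t)[0] == "en")]
    # Remove authors contributing fewer than two texts.
    counts = df["author"].value_counts()
    return df[df["author"].isin(counts[counts >= 2].index)]
```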

Datasets

The datasets used in this research are publicly available on Kaggle.

Code

The code accompanying this research is structured to facilitate the replication of our study and further exploration of LLMs in authorship analysis tasks. It includes scripts for data preprocessing and evaluation.
