[NeurIPS 2024 πŸ”₯] TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs


Why TEGs instead of TAGs? πŸ€”

Textual-Edge Graphs (TEGs) carry textual content on both nodes and edges, unlike Text-Attributed Graphs (TAGs), which feature textual information only on the nodes. Edge text is crucial for understanding document meaning and semantic relationships. For instance, as shown below, to capture the knowledge that "Planck endorsed the uncertainty and probabilistic nature of quantum mechanics," the text on the citation edge (Book D - Paper E) is essential: it reveals the connections and influences among scholarly works, enabling a deeper analysis of document semantics and knowledge networks.

[Figure: an example citation network in which the edge text between Book D and Paper E carries the key knowledge]

Overview πŸ“š

Textual-Edge Graphs Datasets and Benchmark (TEG-DB) is a comprehensive and diverse collection of benchmark textual-edge datasets with rich textual descriptions on both nodes and edges, together with data loaders and performance benchmarks for various baseline models, including pre-trained language models (PLMs), graph neural networks (GNNs), and their combinations. This repository aims to facilitate research on textual-edge graphs by providing standardized data formats and easy-to-use tools for model evaluation and comparison.

Features ✨

  • Unified Data Representation: All TEG datasets are represented in a unified format, which makes it easy to extend the benchmark with new datasets (see the sketch below).
  • Highly Efficient Pipeline: TEG-Benchmark is tightly integrated with PyTorch Geometric (PyG), leveraging its tools and functionality, which keeps the code concise. For each paradigm, we provide a small .py file summarizing all relevant models and a .sh file that runs all baselines in one click.
  • Comprehensive Benchmark and Analysis: We conduct extensive benchmark experiments and a comprehensive analysis of TEG-based methods, covering aspects such as the impact of different models, the effect of embeddings produced by Pre-trained Language Models (PLMs) of various scales, and the influence of datasets from different domains. The statistics of our TEG datasets are listed in the Datasets section below.
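To make the unified representation concrete, here is a minimal sketch of how a textual-edge graph can be expressed as a PyG `Data` object. The field names `text_nodes` and `text_edges` are illustrative assumptions, not necessarily the benchmark's exact schema (see the dataset README for that):

```python
import torch
from torch_geometric.data import Data

# Hypothetical TEG sample: nodes and edges carry raw text alongside
# the usual PyG tensors. Field names here are illustrative only.
graph = Data(
    x=torch.randn(3, 384),                      # node text embeddings (PLM)
    edge_index=torch.tensor([[0, 1], [1, 2]]),  # two directed edges: 0->1, 1->2
    edge_attr=torch.randn(2, 384),              # edge text embeddings (PLM)
)
graph.text_nodes = ["Book D", "Paper E", "Paper F"]      # raw node text
graph.text_edges = ["D cites E: ...", "E cites F: ..."]  # raw edge text
print(graph)
```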

Datasets πŸ“Š

Please visit Huggingface TEG-Benchmark to find the TEG datasets we have uploaded!

We have constructed 10 comprehensive and representative TEG datasets (and we will continue to expand the collection). These datasets cover domains including Book Recommendation, E-commerce, Academic, and Social networks, and they vary in size from small to large. Each dataset contains rich raw text on both nodes and edges, providing a diverse range of information for analysis and modeling.

| Dataset | Nodes | Edges | Graph Domain | Node Classification | Link Prediction | Details |
|---|---|---|---|---|---|---|
| Goodreads-History | 540,796 | 2,066,193 | Book Recommendation | ✅ | ✅ | Link |
| Goodreads-Crime | 422,642 | 1,849,236 | Book Recommendation | ✅ | ✅ | Link |
| Goodreads-Children | 216,613 | 734,640 | Book Recommendation | ✅ | ✅ | Link |
| Goodreads-Comics | 148,658 | 542,338 | Book Recommendation | ✅ | ✅ | Link |
| Amazon-Movie | 174,012 | 1,697,533 | E-commerce | ✅ | ✅ | Link |
| Amazon-Apps | 100,480 | 752,937 | E-commerce | ✅ | ✅ | Link |
| Amazon-Baby | 186,790 | 1,241,083 | E-commerce | ✅ | ✅ | Link |
| Reddit | 512,776 | 256,388 | Social Networks | ✅ | ✅ | Link |
| Twitter | 60,785 | 74,666 | Social Networks | ✅ | ✅ | Link |
| Citation | 169,343 | 1,166,243 | Academic | ✅ | ✅ | Link |

TEG-DB is an ongoing effort, and we are planning to increase our coverage in the future.

Our Experiments πŸ”¬

Please see the experimental results and analysis in our paper.

Package Usage βš™οΈ

Requirements πŸ“‹

You can install the required dependencies with:

pip install -r requirements.txt

More details about packages πŸ“¦

The GNN folder houses all GNN methods for link prediction and node classification (please click here for more training details). The data_preprocess folder contains scripts for generating TEG data. For instance, data_preprocess/generate_llm_emb.py generates text embeddings for TEGs from an LLM at high speed, while data_preprocess/dataset.ipynb facilitates uploading the TEG datasets we provide to, and downloading them from, Hugging Face.
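For illustration, here is a minimal sketch of batched text-embedding extraction with Hugging Face `transformers`. The model choice (`bert-base-uncased`) and mean pooling are placeholder assumptions, not necessarily what `generate_llm_emb.py` does:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder PLM; generate_llm_emb.py may use a different model and pooling.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

texts = ["edge text: user review of book A", "node text: description of book B"]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state           # [batch, seq_len, hidden]
    mask = batch["attention_mask"].unsqueeze(-1)        # [batch, seq_len, 1]
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool over tokens

torch.save(emb, "text_emb.pt")  # analogous to the .pt files in the emb folder
```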

Datasets setup πŸ› οΈ

You can go to the Huggingface TEG-Benchmark to find the datasets we have uploaded. Each dataset folder contains a raw folder with a .py file (downloads the raw data) and a .ipynb file (processes the raw data into a TEG), an emb folder with .pt files (node and edge text embeddings extracted from the PLM), and a .pkl file processed from the raw data (the final textual-edge graph).

Download all datasets:

from huggingface_hub import snapshot_download

snapshot_download(repo_id="ZhuofengLi/TEG-Datasets", repo_type="dataset", local_dir="./Dataset")
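If you only need a single dataset, snapshot_download also accepts an allow_patterns argument to restrict which files are fetched. Once downloaded, each dataset's final graph is a single .pkl file; below is a minimal loading sketch, assuming the pickle holds a PyG graph object (the path is illustrative, and the exact schema is described in the dataset README):

```python
import pickle

# Illustrative path; point this at the .pkl inside the dataset you downloaded.
with open("./Dataset/Goodreads-Children/children.pkl", "rb") as f:
    graph = pickle.load(f)

print(graph)  # node/edge counts plus the raw text fields
```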

Running Experiments πŸš€

GNN for Link Prediction:

bash run_gnn_link.sh

GNN for Node Classification:

bash run_gnn_node.sh

Please click here to see more details.

Create Your Own Dataset and Model πŸ› οΈ

You can follow the steps below:

  • Align your dataset format with ours; please check the dataset README for details.
  • Generate your raw text embeddings at high speed with generate_llm_emb.py.
  • Add your model to GNN/model/GNN_library.py (see the sketch below).
  • Update the training scripts' args and the gen_model() function accordingly.
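As a starting point, here is a hedged sketch of a model that consumes both node and edge embeddings via PyG's `GINEConv`. The class and its constructor signature are assumptions and may need adapting to whatever interface `gen_model()` in the training code expects:

```python
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.nn import GINEConv

class EdgeAwareGNN(torch.nn.Module):
    """Illustrative two-layer GNN over node + edge text embeddings."""

    def __init__(self, in_dim, hidden_dim, out_dim, edge_dim):
        super().__init__()
        # edge_dim lets GINEConv project edge embeddings to the node dimension.
        self.conv1 = GINEConv(Sequential(Linear(in_dim, hidden_dim), ReLU(),
                                         Linear(hidden_dim, hidden_dim)),
                              edge_dim=edge_dim)
        self.conv2 = GINEConv(Sequential(Linear(hidden_dim, hidden_dim), ReLU(),
                                         Linear(hidden_dim, out_dim)),
                              edge_dim=edge_dim)

    def forward(self, x, edge_index, edge_attr):
        x = F.relu(self.conv1(x, edge_index, edge_attr))
        return self.conv2(x, edge_index, edge_attr)
```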

Star and Cite ⭐

Please star our repo 🌟 and cite our paper if you find them useful. Feel free to email us (zhuofengli12345@gmail.com) if you have any questions.

@misc{li2024tegdb,
      title={TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs}, 
      author={Zhuofeng Li and Zixing Gou and Xiangnan Zhang and Zhongyuan Liu and Sirui Li and Yuntong Hu and Chen Ling and Zheng Zhang and Liang Zhao},
      year={2024},
      eprint={2406.10310},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}