farbodbj/iranian-surname-frequencies

Persian Last Names Dataset

Overview

The Persian Last Names Dataset is a collection of over 100,000 Persian surnames, each paired with its frequency. The frequencies are computed from a real-world sample of more than 10 million records, which makes the data suitable for applications that need representative surname statistics.

Dataset Details

  • Total Surnames: 100,000+
  • Frequency Source: Derived from a dataset of more than 10 million records
  • Normalization: All surnames have been standardized using the Hazm library to ensure consistency and accuracy

Features

  • Extensive Coverage: Includes a wide range of both common and rare Persian last names
  • Frequency Information: Each surname is paired with its occurrence frequency, facilitating detailed analysis and research
  • Open Source: Freely available for academic, research, and commercial use
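As a rough illustration of working with the surname and frequency pairs, the sketch below parses a few rows and converts raw counts into relative frequencies. The file layout is not documented here, so the two-column CSV with a `surname,frequency` header, the sample rows, and the counts are all assumptions made for the example; adjust the column names to match the actual file.

```python
import csv
import io

# Hypothetical sample: the real file's name, column names, and counts may differ.
SAMPLE_CSV = """surname,frequency
محمدی,300
حسینی,200
احمدی,100
"""


def load_frequencies(handle):
    """Read surname/count rows from a CSV file object into a dict."""
    reader = csv.DictReader(handle)
    return {row["surname"]: int(row["frequency"]) for row in reader}


def relative_frequencies(counts):
    """Convert raw counts into shares of the total that sum to 1."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


counts = load_frequencies(io.StringIO(SAMPLE_CSV))
shares = relative_frequencies(counts)

# Print surnames from most to least frequent.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}\t{share:.3f}")
```

To read from disk instead, pass a file opened with `encoding="utf-8"` and `newline=""` to `load_frequencies`.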

Data Processing

The surnames in this dataset have been normalized with the Hazm library, a toolkit for processing Persian text. Normalization gives every name a uniform form, so variants of the same surname are counted together rather than as separate entries. This makes the frequency figures more reliable and the dataset easier to integrate into other projects.
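To show why normalization matters for the counts, here is a minimal standard-library sketch of one kind of rule a Persian normalizer applies: mapping the Arabic code points for Yeh and Kaf to their Persian counterparts and tidying whitespace. It is not Hazm's implementation and covers far fewer cases; the function name `simple_normalize` and the sample names are made up for illustration.

```python
from collections import Counter

# Arabic letters that are commonly typed in place of their Persian forms.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf -> Persian Kaf
})


def simple_normalize(name):
    """Illustrative only: unify Yeh/Kaf variants and collapse whitespace."""
    return " ".join(name.translate(ARABIC_TO_PERSIAN).split())


# Three raw spellings of the same surname ("Karimi").
raw_names = [
    "\u06A9\u0631\u06CC\u0645\u06CC",     # Persian Kaf and Yeh
    "\u0643\u0631\u064A\u0645\u064A",     # Arabic Kaf and Yeh
    "  \u06A9\u0631\u06CC\u0645\u06CC ",  # padded with spaces
]

before = Counter(raw_names)
after = Counter(simple_normalize(n) for n in raw_names)
print(len(before), "distinct spellings before,", len(after), "after")
```

Without a step like this, the three spellings would be reported as three separate surnames, each with a lower count than the name actually has.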

Usage

This dataset is ideal for a variety of applications, including but not limited to:

  • Natural Language Processing (NLP): Improving name recognition, entity extraction, and other language-processing tasks
  • Sociological Research: Analyzing surname distributions and trends, and supporting demographic studies of Persian-speaking populations
  • Data Validation: Verifying and cross-referencing user inputs in systems that require accurate name information
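For the data-validation use case, one simple approach is a membership check against the known surnames, with the frequency used as a rough plausibility signal. The sketch below assumes the dataset has already been loaded into a dict; the names `KNOWN` and `check_surname`, the `rare_below` threshold, and the counts are hypothetical.

```python
# Hypothetical excerpt of the dataset: surname -> count.
KNOWN = {
    "محمدی": 300,
    "حسینی": 200,
    "احمدی": 100,
}


def check_surname(value, known=KNOWN, rare_below=150):
    """Classify an input as 'common', 'rare', or 'unknown'.

    The input should be normalized the same way as the dataset before
    lookup; here only surrounding whitespace is stripped.
    """
    count = known.get(value.strip())
    if count is None:
        return "unknown"
    return "rare" if count < rare_below else "common"


for candidate in ["محمدی", " احمدی ", "Smith"]:
    print(candidate.strip(), "->", check_surname(candidate))
```

Because no surname list is exhaustive, an "unknown" result is probably better treated as a prompt for review than as grounds for rejecting the input.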

Contributing

Contributions are welcome! If you have additional data, corrections, or enhancements, please feel free to submit a pull request or open an issue.


We hope this dataset serves as a valuable resource for your projects and research endeavors. For any questions or feedback, please reach out through the repository's issue tracker.
