Skip to content

一个对日常遇到的数据处问题与解决问题的思路、解决方案情况的记录。

Notifications You must be signed in to change notification settings

yangfanconan/DataProcessTools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataProcessTools

一个对日常遇到的数据处问题与解决问题的思路、解决方案情况的记录。

问题1:给定字符串如何判断该字符串为英文姓名。

解决方案:

前提已经安装了一些基础的java/Python环境。

详见代码:is_english_name.py

步骤1: 需要安装 Python nltk 库 , pip install nltk 详见:https://www.nltk.org/install.html 步骤2: 下载 NERTag,也可以从git上直接下载 详见:https://nlp.stanford.edu/software/tagger.html

步骤3: 将stanford-ner.jar 设置到JAVA所需要的系统变量CLASSPATH中。

步骤4: 执行代码提供的测试用例,做你想做的任何事情。

```

斯坦福命名实体识别(Stanford NER)是一个用Java实现的命名实体识别器。命名实体识别(NER)是在文本中标记出表示事物的词语序列的过程,如人名、公司名、基因名、蛋白质名等。

斯坦福NER的特点:

功能强大的特征提取器:它配备了为命名实体识别精心设计的特征提取器,并提供了许多定义特征提取器的选项。

多语言支持:除了为英语提供了优秀的命名实体识别器(特别是针对PERSON、ORGANIZATION、LOCATION这三类)外,还提供了针对不同语言和情境的其他模型,包括仅使用CoNLL 2003英语训练数据训练的模型。

通用性:斯坦福NER也被称为CRFClassifier。该软件提供了线性链条件随机场(CRF)序列模型的通用实现。通过在有标签的数据上训练自己的模型,你可以使用此代码为NER或其他任何任务构建序列模型。

开源与许可:斯坦福NER是开源的,并遵循GNU通用公共许可证(v2或更高版本)。它包含源代码,并提供命令行调用、作为服务器运行和Java API等多种组件。

贡献者:

原始的CRF代码由Jenny Finkel编写。

特征提取器由Dan Klein、Christopher Manning和Jenny Finkel共同开发。

大部分文档和可用性改进归功于Anna Rafferty。

最近的代码开发由斯坦福NLP组的多个成员完成。

总之,斯坦福NER是一个功能强大、多语言支持且开源的命名实体识别工具,广泛应用于自然语言处理领域。

```

English Name Validator (code to see is_english_name.py) Introduction The provided Python code defines a function, is_english_names, which aims to determine whether a given text string represents an English name. The function incorporates several steps to process and analyze the input string, ultimately returning a boolean value (True or False) indicating whether the text is recognized as an English name.

Functionality Input Processing: The function first removes any spaces from the input string using the replace method. Empty Input Check: It then checks if the input is empty, None, or an empty list ([]). If any of these conditions are met, the function returns False. However, note that checking for an empty list is unnecessary since the input is a string and cannot be a list. Handling Names with Slashes: If the input string contains a forward slash (/), the function splits the string into two parts, assuming that the slash separates two names. If no slash is present, the entire string is considered as a single name. Normalization: The function converts the names to a standardized format by capitalizing the first letter and converting the rest to lowercase using the lower and capitalize methods. Concatenation: If the input was split into two names due to the presence of a slash, the function concatenates them back together without the slash. Name Analysis: The function then attempts to use a natural language processing tool (presumably nltk or a similar library, although the code references an undefined st module) to perform part-of-speech tagging on the name. However, this approach is flawed because names typically do not have a specific part-of-speech tag like "PERSON" in standard POS tagging schemes. Additionally, the code as written would not work because st is not defined or imported anywhere in the provided code snippet. Decision Making: Finally, the function checks if the first word of the name (which should be the entire name in most cases) is tagged as "PERSON". If it is, the function returns True; otherwise, it returns False. However, as mentioned earlier, this approach is fundamentally flawed because names do not typically have a POS tag of "PERSON".

About

一个对日常遇到的数据处问题与解决问题的思路、解决方案情况的记录。

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages