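A record of data-processing problems encountered in day-to-day work, with the reasoning and solution for each.
Problem 1: Given a string, how to determine whether it is an English name.
Solution:
Prerequisite: a basic Java and Python environment is already installed.
See the code: is_english_name.py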
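Step 1: Install the Python nltk library: pip install nltk. See: https://www.nltk.org/install.html
Step 2: Download NERTag (it can also be downloaded directly from git). See: https://nlp.stanford.edu/software/tagger.html
Step 3: Add stanford-ner.jar to the CLASSPATH system variable used by Java.
Step 4: Run the test cases provided in the code, then use it however you like.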
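As a rough illustration of steps 1 to 3, the sketch below shows one way a tagger could be constructed through nltk's StanfordNERTagger wrapper. It is an assumption that is_english_name.py builds its tagger this way; the helper names classpath_with_jar and build_tagger are hypothetical, and the jar and model paths must be supplied by you. nltk is imported lazily so the CLASSPATH helper can be used on its own.

```python
import os


def classpath_with_jar(jar_path, current=None):
    """Return a CLASSPATH value that contains jar_path exactly once."""
    if current is None:
        current = os.environ.get("CLASSPATH", "")
    # Drop empty entries so an unset CLASSPATH does not produce a leading separator.
    entries = [entry for entry in current.split(os.pathsep) if entry]
    if jar_path not in entries:
        entries.append(jar_path)
    return os.pathsep.join(entries)


def build_tagger(model_path, jar_path):
    """Create an nltk StanfordNERTagger (needs Java, nltk and the Stanford NER download)."""
    # Imported here so classpath_with_jar works even without nltk installed.
    from nltk.tag import StanfordNERTagger

    # Step 3: make the jar visible to Java for this process only.
    os.environ["CLASSPATH"] = classpath_with_jar(jar_path)
    return StanfordNERTagger(model_path, path_to_jar=jar_path, encoding="utf-8")
```

A call might look like build_tagger("/path/to/classifiers/english.all.3class.distsim.crf.ser.gz", "/path/to/stanford-ner.jar"), where both paths are placeholders for wherever the download was unpacked. The returned object's tag method takes a list of tokens and returns (token, label) pairs.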
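Multilingual support: In addition to a strong named entity recognizer for English (particularly for the three classes PERSON, ORGANIZATION, and LOCATION), Stanford NER provides other models for different languages and circumstances, including models trained only on the CoNLL 2003 English training data.
Generality: Stanford NER is also known as CRFClassifier. The software provides a general implementation of linear-chain Conditional Random Field (CRF) sequence models. By training your own model on labeled data, you can use this code to build sequence models for NER or any other task.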
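To make the remark about training your own model more concrete: CRFClassifier is usually trained from a tab-separated file with one token and its label per line and a blank line between sentences. The helper below is a minimal sketch that writes labeled sentences in that layout. The function name to_training_tsv and the sample labels are hypothetical, and the exact training options should be checked against the Stanford NER documentation.

```python
def to_training_tsv(sentences):
    """Serialize labeled sentences as 'token<TAB>label' lines, one blank line per sentence end.

    sentences: iterable of sentences, each a list of (token, label) pairs.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            # A tab or newline inside a token would corrupt the column layout.
            if "\t" in token or "\n" in token:
                raise ValueError(f"token contains a tab or newline: {token!r}")
            lines.append(f"{token}\t{label}")
        lines.append("")  # blank line marks the end of the sentence
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [[("John", "PERSON"), ("lives", "O"), ("in", "O"), ("Paris", "LOCATION")]]
    print(to_training_tsv(sample))
```

The resulting text would be saved to a file and referenced from a training properties file, where a column mapping such as map = word=0,answer=1 tells the classifier which column holds the token and which holds the label.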
English Name Validator (see is_english_name.py)
Introduction
The provided Python code defines a function, is_english_names, that tries to determine whether a given text string is an English name. It processes and analyzes the input in several steps and returns a boolean (True or False) indicating whether the text was recognized as an English name.
Functionality
Input Processing: The function first removes all spaces from the input string using the replace method.
Empty Input Check: It then checks whether the input is empty, None, or an empty list ([]) and returns False if so. The empty-list check is unnecessary, since the input is a string and cannot be a list. If the steps run in the order described, a None input would also fail at the replace call before this check is reached.
Handling Names with Slashes: If the input contains a forward slash (/), the function splits it into two parts, assuming the slash separates two names. If there is no slash, the whole string is treated as a single name.
Normalization: The function converts each name to a standard form, with the first letter capitalized and the rest in lowercase, using the lower and capitalize methods.
Concatenation: If the input was split into two names at a slash, the function joins them back together without the slash.
Name Analysis: The function then tags the name using an object called st, which is not defined or imported anywhere in the provided snippet, so the code would not run as written. If st were a part-of-speech tagger, the approach would be flawed, because standard POS tag sets have no "PERSON" tag; "PERSON" is a named-entity label of the kind Stanford NER produces. Given the setup steps earlier in this document, st is presumably meant to be an nltk wrapper around the Stanford NER tagger, although the snippet does not confirm this.
Decision Making: Finally, the function checks whether the first token of the name (which should be the entire name in most cases) is tagged "PERSON". If it is, the function returns True; otherwise it returns False. This check is only meaningful when the tagger is a named-entity recognizer.
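The steps described above can be sketched as follows. This is not the code from is_english_name.py: the names normalize_name and is_english_name_sketch are hypothetical, the tagger is passed in as a parameter instead of relying on a global st, and the empty check is moved before the space removal so that None does not raise. The tag argument is assumed to behave like the tag method of nltk's StanfordNERTagger, taking a list of tokens and returning (token, label) pairs.

```python
def normalize_name(text):
    """Clean up a raw name string; return None when there is nothing to tag."""
    if not text:
        return None
    compact = text.replace(" ", "")
    if not compact:
        return None
    # Split at the first slash only, mirroring the "two names" assumption.
    parts = compact.split("/", 1) if "/" in compact else [compact]
    # str.capitalize uppercases the first character and lowercases the rest.
    return "".join(part.capitalize() for part in parts)


def is_english_name_sketch(text, tag):
    """Return True if the tagger labels the normalized name as PERSON.

    tag: callable mapping a list of tokens to a list of (token, label) pairs.
    """
    name = normalize_name(text)
    if name is None:
        return False
    tagged = tag([name])
    return bool(tagged) and tagged[0][1] == "PERSON"


if __name__ == "__main__":
    # Stand-in tagger for demonstration only; a real run would pass a NER tagger's tag method.
    def fake_tag(tokens):
        return [(token, "PERSON" if token == "John" else "O") for token in tokens]

    print(is_english_name_sketch(" john ", fake_tag))
```

Passing the tagger as an argument keeps the string handling testable without Java or the Stanford models. Note that joining two names into one token, as the description specifies, hands the tagger a string such as "SmithJohn", which a real NER model may not recognize as a person.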