Build Your Own Resume Parser Using Python and NLP

A step-by-step guide to building your own resume parser using Python and natural language processing (NLP).

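Let's get one thing clear first. A resume is a brief summary of your skills and experience in one or two pages, while a CV is more detailed and gives a longer account of what the applicant is capable of. With that said, let's dive into building a parser tool using Python and basic natural language processing techniques.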

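Resumes are a great example of unstructured data. Since there is no widely accepted resume layout, each resume has its own formatting style and its own text blocks, and even the section headings vary a lot. I don't even need to mention how big a challenge it is to parse multilingual resumes.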

One misconception about building a resume parser is that it is an easy task. No, it isn't.

Let's talk only about predicting the applicant's name from the resume.

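There are millions of personal names around the world, from Björk Guðmundsdóttir to 毛泽东, from Наина Иосифовна to Nguyễn Tấn Dũng. Many cultures are used to middle initials, as in Maria J. Sampson, while others make extensive use of prefixes, as in Ms. Maria Brown. Trying to build a database of names is a hopeless effort, because you will never keep up with it.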

So, now that we understand how complicated this is, shall we start building our own resume parser?

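Table of contents

Understanding the tools

Converting resumes to plain text

Extracting text from PDF files

Extracting text from docx files

Extracting text from doc files

Extracting fields from resumes

Extracting names from resumes

Extracting phone numbers from resumes

Extracting email addresses from resumes

Extracting skills from resumes

Extracting education and schools from resumes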

Understanding the tools

We will use Python 3 because of its large set of readily available libraries and its general acceptance in the data science field.

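We will also use nltk for NLP (natural language processing) tasks such as stop-word filtering and tokenization, and docx2txt and pdfminer.six to extract text from MS Word and PDF files.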

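We assume you already have Python 3 and pip3 on your system, and that you may be using virtualenv. We won't go into the details of installing those. We also assume you are running on a POSIX-based system such as Linux (Debian-based) or macOS.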

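Converting resumes to plain text

Extracting text from PDF files

Let's start by extracting text from PDF files with pdfminer. You can install it with pip3 (the Python package installer) or compile it from source (not recommended). With pip, it is as simple as running the following at the command prompt.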

pip install pdfminer.six

With pdfminer, you can easily extract text from a PDF file using the following code.


# example_01.py
 
from pdfminer.high_level import extract_text
 
 
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)
 
 
if __name__ == '__main__':
    print(extract_text_from_pdf('./resume.pdf'))  # noqa: T001

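Pretty simple, right? PDF files are very popular for resumes, but some people prefer the docx and doc formats. Let's move on and extract text from those formats as well.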

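Extracting text from docx files

Extracting text from docx files is quite similar to what we did for PDF files. Let's install the required dependency (docx2txt) using pip and then write some code to do the actual work.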

pip install docx2txt

And the code is as follows.


# example_02.py
 
import docx2txt
 
 
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None
 
 
if __name__ == '__main__':
    print(extract_text_from_docx('./resume.docx'))  # noqa: T001

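As simple as that. The trouble starts when we try to extract text from old-fashioned doc files. The docx2txt package does not handle this format correctly, so we will use a small trick to extract text from it. Read on.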

Extracting text from doc files

To extract text from doc files, we will use the neat but extremely powerful catdoc command-line tool provided by Pete Warden.

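catdoc reads an MS Word file and prints readable ASCII text to stdout, just like the Unix cat command. We will install it with apt, since we are running Ubuntu Linux. Use whichever package installer you prefer, or build the tool from source.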

apt-get update
yes | apt-get install catdoc

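Once that is ready, we can write the code that starts a subprocess, captures its stdout into a string variable, and returns it just as we did for the pdf and docx formats.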

# example_03.py
 
import subprocess  # noqa: S404
import sys
 
 
def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()
 
    return (stdout.strip(), stderr.strip())
 
 
if __name__ == '__main__':
    text, err = doc_to_text_catdoc('./resume-word.doc')
 
    if err:
        print(err)  # noqa: T001
        sys.exit(2)
 
    print(text)  # noqa: T001

Now that we have the resume in text format, we can start extracting specific fields from it.
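
Since each file format needs its own extractor, one way to tie the three functions together is to choose one by file extension. The snippet below is only a sketch under our own assumptions: `pick_extractor` and the `extractors` mapping are hypothetical names that do not appear in the examples above, and the mapping is passed in so the helper does not depend on any particular library.

```python
import os


def pick_extractor(file_path, extractors):
    """Return the extractor registered for the file's extension.

    `extractors` is a hypothetical mapping such as
    {'.pdf': extract_text_from_pdf, '.docx': extract_text_from_docx}.
    """
    # splitext keeps the leading dot; lowercase it so '.PDF' matches '.pdf'
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return extractors[ext]
    except KeyError:
        raise ValueError(f'Unsupported resume format: {ext or "(none)"}')
```

Keep in mind that doc_to_text_catdoc returns a (text, error) pair rather than a plain string, so it would need a small wrapper before being registered alongside the other two functions.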

Extracting fields from resumes

Extracting names from resumes

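This may look easy, but in reality extracting the person's name is one of the most challenging tasks in resume parsing. There are millions of names around the world, and in a globalized world we may receive a resume from anywhere.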

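This is where natural language processing comes into play. Let's start by installing a new library called nltk (Natural Language Toolkit), which is quite popular for this kind of task.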


pip install nltk
pip install numpy  # also needed by nltk to run the code below

Now it is time to write some code to test nltk's "named entity recognition" (NER) capabilities.


# example_04.py
 
import docx2txt
import nltk
 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
 
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None
 
 
def extract_names(txt):
    person_names = []
 
    for sent in nltk.sent_tokenize(txt):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                person_names.append(
                    ' '.join(chunk_leave[0] for chunk_leave in chunk.leaves())
                )
 
    return person_names
 
 
if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    names = extract_names(text)
 
    if names:
        print(names[0])  # noqa: T001

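In practice, nltk's person name detection is far from accurate. Run this code and see whether it works for you. If it doesn't, you can try the Stanford NER model; a detailed tutorial is available at Listendata.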

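Extracting phone numbers from resumes

Unlike person names, phone numbers are much easier to deal with. In most cases a simple regular expression will do. Try the code below to extract a phone number from a resume. You can adapt the regular expression to your own needs.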


# example_05.py
 
import re
import subprocess  # noqa: S404
 
PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')
 
 
def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()
 
    return (stdout.strip(), stderr.strip())
 
 
def extract_phone_number(resume_text):
    phone = re.findall(PHONE_REG, resume_text)
 
    if phone:
        number = ''.join(phone[0])
 
        if resume_text.find(number) >= 0 and len(number) < 16:
            return number
    return None
 
 
if __name__ == '__main__':
    # catdoc reads doc files and returns a (text, error) pair
    text, err = doc_to_text_catdoc('resume.doc')
    phone_number = extract_phone_number(text) if text else None
 
    print(phone_number)  # noqa: T001
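
To see how the pattern and the length check interact, it helps to run the same logic on a couple of plain strings. The sketch below repeats PHONE_REG and the extraction logic so it can run on its own; the sample strings are made up for illustration.

```python
import re

PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')


def extract_phone_number(resume_text):
    phone = re.findall(PHONE_REG, resume_text)
    if phone:
        number = ''.join(phone[0])
        # the length guard rejects matches of 16 characters or more
        if resume_text.find(number) >= 0 and len(number) < 16:
            return number
    return None


# a short national number passes the length guard
short_number = extract_phone_number('Phone: 555-123-4567')

# a fully punctuated international number matches the pattern but is
# 17 characters long, so the guard discards it and None is returned
long_number = extract_phone_number('Call +1 (555) 123-4567 now')
```

If your resumes often carry numbers written in the longer international style, the 16-character limit is the first thing you may want to relax.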

Extracting email addresses from resumes

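Similar to phone number extraction, this is pretty straightforward. Just use a regular expression to pull email addresses out of the resume. The first address that appears is generally the applicant's actual email address, since people tend to put their contact details in the header section of their resumes.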

# example_06.py
 
import re
 
from pdfminer.high_level import extract_text
 
EMAIL_REG = re.compile(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+')
 
 
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)
 
 
def extract_emails(resume_text):
    return re.findall(EMAIL_REG, resume_text)
 
 
if __name__ == '__main__':
    text = extract_text_from_pdf('resume.pdf')
    emails = extract_emails(text)
 
    if emails:
        print(emails[0])  # noqa: T001
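
One thing to watch for: the character classes in EMAIL_REG list only lowercase letters, so an address written with capitals would be missed. A possible variation, shown here as a sketch rather than as part of the example above, is to compile the same pattern with re.IGNORECASE. The name EMAIL_REG_CI and the sample addresses are made up for illustration.

```python
import re

# same pattern as EMAIL_REG, compiled case-insensitively
EMAIL_REG_CI = re.compile(
    r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+',
    re.IGNORECASE,
)


def extract_emails(resume_text):
    return re.findall(EMAIL_REG_CI, resume_text)


emails = extract_emails('Jane.Doe@Example.com, hr@company.org')
```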

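Extracting skills from resumes

You have done well so far. This is the part where things get trickier. Extracting skills from text is a very challenging task, and to improve accuracy you need a database or an API to verify whether a piece of text is a skill.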

Take a look at the code below. First, it uses the nltk library to filter out stop words and generate tokens.


# example_07.py
 
import docx2txt
import nltk
 
nltk.download('punkt')  # tokenizer data used by word_tokenize
nltk.download('stopwords')
 
# you may read the database from a csv file or some other database
SKILLS_DB = [
    'machine learning',
    'data science',
    'python',
    'word',
    'excel',
    'english',  # lowercase, because tokens are lowercased before lookup
]
 
 
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None
 
 
def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)
 
    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]
 
    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]
 
    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))
 
    # we create a set to keep the results in.
    found_skills = set()
 
    # we search for each token in our skills database
    for token in filtered_tokens:
        if token.lower() in SKILLS_DB:
            found_skills.add(token)
 
    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if ngram.lower() in SKILLS_DB:
            found_skills.add(ngram)
 
    return found_skills
 
 
if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)
 
    print(skills)  # noqa: T001
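
The core of the lookup is simply "build every one-, two- and three-word phrase and check it against the database". The sketch below shows that idea with the standard library only, swapping nltk's tokenizer and everygrams for a regular expression and list slicing, and skipping stop-word removal for brevity. The names `ngrams` and `match_skills` are our own and are not part of nltk.

```python
import re

SKILLS_DB = {'machine learning', 'data science', 'python', 'excel'}


def ngrams(tokens, n):
    # every run of n consecutive tokens, joined with single spaces
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_skills(text, skills_db=SKILLS_DB, max_n=3):
    # lowercase first so the lookup is case-insensitive
    tokens = re.findall(r'[a-z]+', text.lower())
    found = set()
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in skills_db:
                found.add(gram)
    return found


skills = match_skills('Python, Machine Learning and Excel')
```

Storing the database as a set keeps each lookup cheap, which starts to matter once the skills list grows beyond a handful of entries.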

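It isn't easy to maintain an up-to-date skills database for every industry. You may want to take a look at the Skills API, which offers a simple and affordable alternative to maintaining your own skills database. The Skills API covers more than 70,000 skills, is well organized, and is updated frequently. The code below shows how easy it is to extract skills from a resume using the Skills API.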

Before moving on, you first need to install a new dependency called requests, again using pip.


pip install requests

Now, below is the source code using the Skills API.

# example_08.py
 
import docx2txt
import nltk
import requests
 
nltk.download('punkt')  # tokenizer data used by word_tokenize
nltk.download('stopwords')
 
 
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None
 
 
def skill_exists(skill):
    url = f'https://api.apilayer.com/skills?q={skill}&count=1'
    headers = {'apikey': 'YOUR API KEY'}
    response = requests.request('GET', url, headers=headers)
    result = response.json()
 
    if response.status_code == 200:
        return len(result) > 0 and result[0].lower() == skill.lower()
    raise Exception(result.get('message'))
 
 
def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)
 
    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]
 
    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]
 
    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))
 
    # we create a set to keep the results in.
    found_skills = set()
 
    # we search for each token in our skills database
    for token in filtered_tokens:
        if skill_exists(token.lower()):
            found_skills.add(token)
 
    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if skill_exists(ngram.lower()):
            found_skills.add(ngram)
 
    return found_skills
 
 
if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)
 
    print(skills)  # noqa: T001
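
Note that the code above makes one HTTP request per token and per n-gram, and a resume repeats many of the same words. One possible refinement, which is our own assumption and not something the Skills API requires, is to cache the answers so each distinct term is looked up only once. The helper name `make_cached_lookup` is hypothetical, and the lookup function is passed in so the sketch runs without network access.

```python
from functools import lru_cache


def make_cached_lookup(lookup):
    """Wrap a skill lookup so each distinct lowercased term is checked once."""

    @lru_cache(maxsize=None)
    def cached(term):
        return lookup(term)

    # normalise case before hitting the cache so 'Python' and 'python' share an entry
    return lambda term: cached(term.lower())


# stand-in for skill_exists that records which terms were actually looked up
calls = []


def fake_skill_exists(term):
    calls.append(term)
    return term == 'python'


cached_skill_exists = make_cached_lookup(fake_skill_exists)
results = [cached_skill_exists(t) for t in ('Python', 'python', 'PYTHON', 'excel')]
```

In the example above you would wrap skill_exists the same way and call the wrapped function inside extract_skills.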

Extracting education and schools from resumes

If you have understood the principles of skills extraction, you will be more comfortable with education and school extraction.

Not surprisingly, there are many ways to do this.

First, you can use a database containing all (really?) the school names from around the world.

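You could use our school name database to train your own NER model with Spacy or any other NLP framework, but we will follow a much simpler approach. In the code below, we look for words such as "university", "college" and so on inside the named entities labeled as organizations. Believe it or not, it performs quite well in most cases. You can extend the RESERVED_WORDS list in the code as you wish.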

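Similar to person name extraction, we will first filter out the stop words and the punctuation. Second, we will store all the named entities of "organization" type in a list and check whether they contain any of the reserved words.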

Check the following code; it is fairly self-explanatory.

# example_09.py
 
import docx2txt
import nltk
 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
 
RESERVED_WORDS = [
    'school',
    'college',
    'univers',
    'academy',
    'faculty',
    'institute',
    'faculdades',
    'schola',
    'schule',
    'lise',
    'lyceum',
    'lycee',
    'polytechnic',
    'kolej',
    'ünivers',
    'okul',
]
 
 
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None
 
 
def extract_education(input_text):
    organizations = []
 
    # first get all the organization names using nltk
    for sent in nltk.sent_tokenize(input_text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
                organizations.append(' '.join(c[0] for c in chunk.leaves()))
 
    # keep only the organizations whose name contains a reserved word
    # (college, university etc...)
    education = set()
    for org in organizations:
        for word in RESERVED_WORDS:
            if org.lower().find(word) >= 0:
                education.add(org)
 
    return education
 
 
if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    education_information = extract_education(text)
 
    print(education_information)  # noqa: T001
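
The filtering step can be tried on its own, without the nltk models, by feeding it a hand-written list of organization names. The sketch below uses a shortened reserved-word list and made-up organization names; `filter_education` is a hypothetical helper name, not a function from the example above.

```python
RESERVED_WORDS = ('school', 'college', 'univers', 'academy', 'institute')


def filter_education(organizations, reserved_words=RESERVED_WORDS):
    # substring match on the lowercased name, so 'univers' covers
    # both 'University' and 'Universidad'
    return {
        org
        for org in organizations
        if any(word in org.lower() for word in reserved_words)
    }


education = filter_education(
    ['Stanford University', 'Google Inc', 'Boston College'],
)
```

Because the comparison is made against the lowercased name, every entry in the reserved-word list has to be lowercase as well.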

Final words

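Resume parsing is tricky. There are hundreds of ways to do it. We have covered just one simple approach, and unfortunately you should not expect miracles from it. It may work for some layouts and fail for others.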

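If you need a professional solution, take a look at our hosted solution, the Resume Parser API. It is well maintained and supported, and its API vendor also maintains the Skills API. It is pre-trained on thousands of different resume layouts and, compared with the alternatives, it is the most affordable solution on the market.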