Writing Your Own Resume Parser
18 Dec 2018 · 14 mins read · python, nlp

Why Write Your Own Resume Parser?
Resumes are a great example of unstructured data. Each resume has its own formatting style, its own data blocks, and many different ways of presenting the same information, which makes reading resumes programmatically hard. Recruiters spend a significant amount of time going through resumes to select the ones that are a good fit for their jobs. Tech giants like Google and Facebook receive thousands of resumes each day for various job positions, and recruiters cannot go through each and every one. This is why resume parsers are such a great deal for them: they make it easy to select the most promising resumes from the pile.
In this blog post, we will learn how to write a simple resume parser of our own. For the scope of this post, we will extract names, phone numbers, email IDs, education details and skills from resumes.
Index
- First Step: Reading the Resume
- Second Step: Extracting Names
- Third Step: Extracting Phone Numbers
- Fourth Step: Extracting Emails
- Fifth Step: Extracting Skills
- Sixth Step: Extracting Education
First Step: Reading the Resume
Resumes do not have a fixed file format; they can come in any format such as `.pdf`, `.doc` or `.docx`. So our main challenge is to read the resume and convert it to plain text. For this we can use two Python modules: `pdfminer` and `docx2txt`. These modules help extract text from `.pdf` and `.doc`/`.docx` file formats.
Installing `pdfminer`:

pip install pdfminer      # Python 2
pip install pdfminer.six  # Python 3

Installing `docx2txt`:

pip install docx2txt
Extracting text from PDF:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of the PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # create a resource manager
            resource_manager = PDFResourceManager()
            # create an in-memory file handle to collect the page text
            fake_file_handle = io.StringIO()
            # create a text converter object
            converter = TextConverter(
                resource_manager,
                fake_file_handle,
                codec='utf-8',
                laparams=LAParams()
            )
            # create a page interpreter
            page_interpreter = PDFPageInterpreter(
                resource_manager,
                converter
            )
            # process the current page
            page_interpreter.process_page(page)
            # extract text
            text = fake_file_handle.getvalue()
            yield text
            # close open handles
            converter.close()
            fake_file_handle.close()
# calling the above function and extracting text
text = ''
for page in extract_text_from_pdf(file_path):
    text += ' ' + page
Extracting text from doc and docx:

import docx2txt

def extract_text_from_doc(doc_path):
    # docx2txt extracts text from .docx files; legacy .doc files
    # may need to be converted to .docx first
    temp = docx2txt.process(doc_path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ' '.join(text)
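With both extractors in place, a small helper can pick the right one based on the file extension. Here is a minimal sketch; `extract_text` is a hypothetical wrapper name, assuming the two functions defined above:

```python
import os

def extract_text(file_path):
    # dispatch to the right extractor based on the file extension
    ext = os.path.splitext(file_path)[1].lower()
    if ext == '.pdf':
        return ' '.join(extract_text_from_pdf(file_path))
    elif ext == '.docx':
        return extract_text_from_doc(file_path)
    raise ValueError('Unsupported file format: {}'.format(ext))
```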
Second Step: Extracting Names
For extracting names from resumes, we could make use of regular expressions, but we will use a more sophisticated tool called spaCy. spaCy is an industrial-strength natural language processing module used for text and language processing. It comes with pre-trained models for tagging, parsing and entity recognition. Our main aim here is to use spaCy's rule-based matching over part-of-speech tags for extracting names (after all, a name is an entity!). No doubt, spaCy has become my favorite tool for language processing these days. So let's get started by installing spaCy.
Installing spaCy:
pip install spacy
Now we need to download spaCy's pre-trained English model. To do so, execute:
python -m spacy download en_core_web_sm
Rule-Based Matching:
spaCy gives us the ability to process text based on rule-based matching. We will use this feature of spaCy to extract the first name and last name from our resumes.
import spacy
from spacy.matcher import Matcher

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# initialize matcher with a vocab
matcher = Matcher(nlp.vocab)

def extract_name(resume_text):
    nlp_text = nlp(resume_text)
    # first name and last name are always proper nouns
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    # the pattern is a single list of token specs, so it is passed as-is
    matcher.add('NAME', None, pattern)
    matches = matcher(nlp_text)
    # return the first span of two consecutive proper nouns
    for match_id, start, end in matches:
        span = nlp_text[start:end]
        return span.text
As you can observe above, we first define the pattern that we want to search for in our text. Here, we have created a simple pattern based on the fact that the first name and last name of a person are always proper nouns. Hence, we tell spaCy to search for two consecutive words whose part-of-speech tag equals PROPN (proper noun).
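A quick sanity check on a sample line (the surrounding text is made up for illustration, and the output assumes the tagger labels both words as proper nouns):

```python
# hypothetical usage example
print(extract_name('Omkar Pathak Software Engineer Pune'))
# expected: 'Omkar Pathak'
```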
Third Step: Extracting Phone Numbers
For extracting phone numbers, we will make use of regular expressions. Phone numbers come in multiple forms such as `(+91) 1234567890`, `+911234567890`, `+91 123 456 7890` or `+91 1234567890`. Hence, we need to define a generic regular expression that can match all similar combinations of phone numbers. Thanks to this blog, I was able to extract phone numbers from resume text by making slight tweaks.
Our phone number extraction function will be as follows:
import re

def extract_mobile_number(text):
    # generic pattern covering optional country code, optional area code,
    # separators and extensions
    phone_regex = re.compile(r'(?:(?:\+?([1-9]|[0-9][0-9]|[0-9][0-9][0-9])\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([0-9][1-9]|[0-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?')
    phone = re.findall(phone_regex, text)
    if phone:
        number = ''.join(phone[0])
        # prefix with '+' when a country code is present
        if len(number) > 10:
            return '+' + number
        else:
            return number
For more explanation of the above regular expression, visit this website.
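For instance, running the function on a sample line (the number is the one from the final output below; the exact result depends on which regex groups match):

```python
# hypothetical usage example
print(extract_mobile_number('Contact me at +91 8087996634.'))
# expected: '+918087996634'
```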
Fourth Step: Extracting Emails
For extracting email IDs from a resume, we can use a similar approach to the one we used for extracting mobile numbers. Email IDs have a fixed form: an alphanumeric string, followed by an `@` symbol, followed by another string, a `.` (dot) and a final string at the end. We can use a regular expression to extract such expressions from text.
import re

def extract_email(email):
    # match strings of the form <string>@<string>.<string>
    email = re.findall(r"([^@|\s]+@[^@]+\.[^@|\s]+)", email)
    if email:
        try:
            # strip trailing punctuation such as ';'
            return email[0].split()[0].strip(';')
        except IndexError:
            return None
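A quick illustration (the address is a placeholder, not a real one):

```python
# hypothetical usage example
print(extract_email('Reach me at john.doe@example.com; or on LinkedIn.'))
# expected: 'john.doe@example.com'
```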
Fifth Step: Extracting Skills
Now that we have extracted some basic information about the person, let's extract the thing that matters most from a recruiter's point of view, i.e. skills. We can extract skills using a technique called tokenization. Tokenization is simply the breaking down of text into paragraphs, paragraphs into sentences, and sentences into words. Hence, there are two major techniques of tokenization: sentence tokenization and word tokenization.
Before implementing tokenization, we have to create a dataset against which we can compare the skills in a particular resume. For this we will make a comma-separated values file (`.csv`) containing the desired skillsets. For example, if I am the recruiter and I am looking for a candidate with skills including NLP, ML and AI, then I can make a csv file with the contents:
machine learning,ml,artificial intelligence,ai,natural language processing,nlp
Assuming we name the above file `skills.csv`, we can move on to tokenizing our extracted text and comparing the skills against the ones in the `skills.csv` file. For reading the csv file, we will use the `pandas` module. After reading the file, we will remove all the stop words from our resume text. In short, a stop word is a word which does not change the meaning of the sentence even if it is removed, as the short sketch below illustrates.
Installing `pandas`:
pip install pandas
Word Tokenization and Extraction:
import pandas as pd
import spacy

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

def extract_skills(resume_text):
    nlp_text = nlp(resume_text)
    # removing stop words and implementing word tokenization
    tokens = [token.text for token in nlp_text if not token.is_stop]
    # reading the csv file
    data = pd.read_csv("skills.csv")
    # extract values
    skills = list(data.columns.values)
    skillset = []
    # check for one-grams (example: python)
    for token in tokens:
        if token.lower() in skills:
            skillset.append(token)
    # check for bi-grams and tri-grams (example: machine learning);
    # noun_chunks is a property of the processed document
    for token in nlp_text.noun_chunks:
        token = token.text.lower().strip()
        if token in skills:
            skillset.append(token)
    return [i.capitalize() for i in set([i.lower() for i in skillset])]
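Given the `skills.csv` above, a quick run on a made-up snippet of resume text (the order of the returned list may vary, since it is built from a set):

```python
# hypothetical usage example
print(extract_skills('Experienced in machine learning and NLP with Python'))
# e.g. ['Machine learning', 'Nlp']
```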
Sixth Step: Extracting Education
Now, moving towards the last step of our resume parser, we will extract the candidate's education details. The details that we will specifically extract are the degree and the year of passing. For example, if XYZ completed an MS in 2018, we will extract a tuple like `('MS', '2018')`. For this we will need to discard all the stop words. We will use the `nltk` module to load a list of stop words and later discard those from our resume text.
Installing `nltk`:

pip install nltk
python -m nltk.downloader stopwords
Recruiters are very specific about the minimum education/degree required for a particular job. Hence, we will prepare a list EDUCATION that specifies all the equivalent degrees that match the requirements.
import re
import spacy
from nltk.corpus import stopwords

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# grab all general stop words
STOPWORDS = set(stopwords.words('english'))

# education degrees
EDUCATION = [
    'BE', 'B.E.', 'B.E', 'BS', 'B.S',
    'ME', 'M.E', 'M.E.', 'MS', 'M.S',
    'BTECH', 'B.TECH', 'M.TECH', 'MTECH',
    'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII'
]

def extract_education(resume_text):
    nlp_text = nlp(resume_text)
    # sentence tokenization
    nlp_text = [sent.text.strip() for sent in nlp_text.sents]
    edu = {}
    # extract education degrees
    for index, text in enumerate(nlp_text):
        for tex in text.split():
            # replace all special symbols
            tex = re.sub(r'[?|$|.|!|,]', r'', tex)
            if tex.upper() in EDUCATION and tex not in STOPWORDS:
                # keep the degree sentence plus the following one,
                # since the year of passing often appears right after
                edu[tex] = text + (nlp_text[index + 1] if index + 1 < len(nlp_text) else '')
    # extract year
    education = []
    for key in edu.keys():
        year = re.search(r'(20|19)\d{2}', edu[key])
        if year:
            education.append((key, year.group(0)))
        else:
            education.append(key)
    return education
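A quick illustration on a made-up sentence (assuming spaCy's sentence tokenizer and the year regex behave as described above):

```python
# hypothetical usage example
print(extract_education('Omkar completed his BE in 2014 from Pune University.'))
# expected: [('BE', '2014')]
```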
This is how we can implement our own resume parser. It's fun, isn't it? You can play with words, sentences and, of course, grammar too! On integrating the above steps together, we can extract the entities and get our final result as:
{
'education': [('BE', '2014')],
'email': '[email protected]',
'mobile_number': '8087996634',
'name': 'Omkar Pathak',
'skills': [
'Flask',
'Django',
'Mysql',
'C',
'Css',
'Html',
'Js',
'Machine learning',
'C++',
'Algorithms',
'Github',
'Php',
'Python',
'Opencv'
]
}
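For completeness, here is a minimal sketch of how the pieces could be wired together. The `parse_resume` function and the `extract_text` helper are illustrative names from the sketches above, assuming all the extraction functions are in scope:

```python
def parse_resume(file_path):
    # read the resume into plain text, then run each extractor on it
    text = extract_text(file_path)
    return {
        'name': extract_name(text),
        'email': extract_email(text),
        'mobile_number': extract_mobile_number(text),
        'skills': extract_skills(text),
        'education': extract_education(text),
    }
```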
The entire code can be found on GitHub. Feel free to open any issues you are facing. You can contribute too! Have an idea to make the code even better? Open a pull request :)