Identify tech words from a text

As part of the pairing project activity at Qxf2 – pick a project that can produce a meaningful output in 5 hours and work on it with your teammates – I chose to identify the tech keywords in a text using the NLTK module. This post covers the steps I followed to find the tech-related words in a corpus.
Note: This post is not an introduction to NLTK; it is about my attempt to use NLTK to find tech-related words in sentences. If you are looking for an introduction to NLTK, this post – Introduction to Natural Language Processing for Text – is a good starting point.
I have outlined my work on this project in the following three steps:

1. Breaking down sentences into words and filtering useful words
2. Obtaining input data to validate the output
3. Identifying tech words in a text

Note: This post contains condensed snippets from my project. The complete project is available here – identify-tech-words


Breaking down sentences into words and filtering useful words:

The first step in any text processing is breaking the text down into sentences and then into words so that they can be evaluated (word tokenization).

import nltk
input_lines = "This sentence will be tokenized"
tokenized = nltk.word_tokenize(input_lines)
print(tokenized)
['This', 'sentence', 'will', 'be', 'tokenized']

The next step is to evaluate the tokenized words. Certain words are included in a sentence to make it complete; while those words are necessary for the sentence to make sense, they may be irrelevant to a computer trying to parse the sentence and identify tech-related words in it. These words are called stop words in natural language processing.

from nltk.corpus import stopwords
set(stopwords.words('english'))
{'those', "wouldn't", 'by', 'yours', 'these', 'any', 'ours', 'than', 'myself', 'just', "you've", 'haven', 'be', 'doesn', 'too', 'whom', 'nor', 'herself', 'under', 'wouldn', 'as', 'her', 'y', 'because', 'should', 'few', 'is', 'on', 's'n', 'too', 'whom', 'nor', 'herself', , "hadn't", "isn't", 'more', 'won', 'up', 'and', "you'd", 'we', "don't", 'shan', 'how', 'down', 'you', 'them', "you'lup', 'and', "you'd", 'we', "don't", 'l", 'who', 're', 'each', "haven't", 'they', "it's", 'doing', 've', 'when', "that'll", 'yourself', 'i', 'there', 'at', "that'll", 'yourself', 'i', 'there', 'did', "doesn't", 'it', "didn't", 'can', 'very', "shouldn't", 'out', 'so', 'have', 'but', 'if', 'what', 'are', 'then 'then', 'mightn', 'an', 'in', 'o', "', 'mightn', 'an', 'in', 'o', "she's", 'below', 'being', 'hers', 'with', 'does', 'needn', 'other', 'of', "needn't", "ugh', 'between', 'no', 'had', "mustn'aren't", "couldn't", 'd', 'through', 'between', 'no', 'had', "mustn't", 'above', 'about', "mightn't", "should've", 'sin', 'she', 'was', 'own', 'were', 'thhouldn', 'ma', 'my', 'am', 'same', 'before', 'been', "weren't", 'again', 'she', 'was', 'own', 'were', 'the', 'during'', 'its', 'which', 'this', 'from', 'f, 'will', 'for', 'after', 'your', 'to', 'into', 'that', 'until', 'don', "shan't", 'weren', "hasn't", 'why', 'its', 'w'having', 'his', 'both', 'ourselves',hich', 'this', 'from', 'further', 'off', 'over', 'didn', 'wasn', 'm', 'some', 'their', 'do', "wasn't", 'he', 'hasn', ', 'not', 'or', 'only', 'mustn', 'isn'our', 'hadn', "you're", 'having', 'his', 'both', 'ourselves', 'll', 'such', 'most', 'himself', 'ain', 'aren', 'him', 'theirs', 'against', "won't", 'me', 'once', 'a', 'here', 'all', 'not', 'or', 'only', 'mustn', 'isn', 'where', 'itself', 'themselves', 'couldn', 'has', 'while', 'yourselves', 't', 'now'}

These are a few of the stopwords that NLTK provides out of the box. In addition to this list, I added a file – input.conf – to hold the stopwords that I considered more relevant to this project. This allowed me to filter my output and pass only the more useful words to the next step of processing.
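As a rough sketch of that filtering step (the extra words below are placeholder examples; the actual project reads its project-specific stopwords from input.conf):

import nltk
from nltk.corpus import stopwords

# NLTK's built-in English stopwords plus a few project-specific words
stop_words = set(stopwords.words('english'))
stop_words.update(['post', 'example', 'also'])   # placeholder entries standing in for input.conf

tokenized = nltk.word_tokenize("This sentence will be tokenized")
useful_words = [word for word in tokenized if word.lower() not in stop_words]
print(useful_words)
['sentence', 'tokenized']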
Now that I had a list of words broken down from the sentence with the useless words removed, I wanted to find the nouns in it. NLTK's part-of-speech tagging helped me here: NLTK can categorize words and tag them based on their part of speech.

import nltk
input_lines = "This sentence will be tokenized"
tokenized = nltk.word_tokenize(input_lines)
print(tokenized)
['This', 'sentence', 'will', 'be', 'tokenized']
is_noun = lambda pos: 'NN' in pos[:2]  # noun tags in the Penn Treebank tagset start with 'NN'
pos_tagged = [word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]  # keep only the nouns
print(pos_tagged)
['sentence']

NLTK has correctly identified the only noun in that sentence – 'sentence' in this case (I have noticed a few discrepancies while pos tagging large bodies of text).
The tech_extractor.py file in my project reads the input text from the sample_nltk.txt file. I used the contents of this post about cProfile – Saving cProfile stats to a csv file – as my input. tech_extractor.py breaks the input from the file down into sentences and then tokenizes them. The stopwords are then filtered out of the tokenized words.
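The sentence-splitting part of that pipeline relies on NLTK's sentence tokenizer; a minimal sketch of how it might look (the file name matches the project, the rest of the handling is simplified here):

import nltk

# Read the input text and split it into sentences before tokenizing the words
with open('sample_nltk.txt', mode='r', encoding='utf-8') as fp:
    text = fp.read()

for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    print(tokens)   # these tokens are then stripped of stopwords and pos-tagged as shown earlier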


Obtaining input data to validate the output:

With the steps to break down the input text and get the individual nouns in place, the next step is to verify whether a word is related to tech. To do this, there needs to be an input list of words that are considered tech related. One option is to create a Python list and add every tech-related word in the universe to it individually. That would be a huge effort and would take far more than the 5 hours stipulated for the project. The other alternatives I considered were

1. Google every noun from the input text and verify if it is related to tech
2. Make the machine read a few tech-related articles and collect the commonly used nouns

Approach #1 is feasible, but the execution time for the script and the resources needed for it would be huge. Approach #2, on the other hand, does not require as many resources and the execution time is much shorter. I chose approach #2.
I created a training-url.txt file to hold the URLs of posts that are related to tech. The following functions read the URLs from this file, fetch the contents of each URL, and parse them with BeautifulSoup.

import re
import nltk
import requests
from bs4 import BeautifulSoup

def read_training_urls(url_file):
    "Read the urls from the training url file"
    try:
        fp = open(url_file, mode='r', encoding='utf-8')
        lines = fp.readlines()
        fp.close()
        urls = []
        for line in lines:
            # Keep only lines that look like http(s) URLs
            if re.search(r'http(s)?', line):
                urls.append(line.strip('\n'))
    except Exception as e:
        print(str(e))
        urls = None
    finally:
        return urls

def read_url_contents():
    "Open the urls from the url file, read the contents and return the most frequently used words"
    urls = read_training_urls('training-url.txt')
    relatednouns = []
    for url in urls:
        response = requests.get(url)
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        # get_tech_jargon tokenizes the page text and returns the nouns (defined in tech_extractor.py)
        nouns = get_tech_jargon(soup.text)
        relatednouns = relatednouns + nouns
    freq_words = nltk.FreqDist(relatednouns)

    return freq_words.most_common()

nltk.FreqDist(relatednouns) builds a frequency distribution of the nouns collected from the contents of the URLs, and most_common() returns them ordered by how often they occur.
The tech_extractor.py file collects the most frequently repeated nouns from the URLs. The quality of the words obtained as output depends on the URLs that are chosen.
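To make that concrete, here is FreqDist on a small made-up list of nouns:

import nltk

relatednouns = ['python', 'profiler', 'python', 'output', 'python', 'profiler']
freq_words = nltk.FreqDist(relatednouns)
print(freq_words.most_common())
[('python', 3), ('profiler', 2), ('output', 1)]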


Identifying tech words in a text:

Now for the last step – checking whether there are any tech-related words in the input. With the common nouns from the URLs in my inventory, I created a new parts-of-speech tag called TECH. The common nouns form this new tag.

def create_pos_tag():
    "Create a new TECH pos tag"
    model = {}
    words_list = read_url_contents()
    # Map every frequently used noun from the training URLs to the TECH tag
    for words in words_list:
        model.update({words[0]: 'TECH'})
    # A UnigramTagger built from this model tags known words as TECH and everything else as None
    tagger = nltk.tag.UnigramTagger(model=model)

    return tagger

Creating this new tag and tagging the tokenized input words with the new tagger allowed me to label them as TECH, just like NLTK would tag a noun or a verb in a sentence.
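The tagging step itself might look roughly like the sketch below (the exact filtering lives in tech_extractor.py; the example sentence here is my own):

tagger = create_pos_tag()
tokens = nltk.word_tokenize("Save the cProfile output to a csv file")
tagged = tagger.tag(tokens)
# The UnigramTagger tags words it knows as TECH and everything else as None
tech_words = {word.lower() for (word, tag) in tagged if tag == 'TECH'}
print(tech_words)

Which words come back tagged as TECH depends entirely on the nouns collected from the training URLs. Running tech_extractor.py against my input resulted in the following output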

{'output', 'data', 'about', 'binary', 'memory', 'times', 'all', 'w', 'functions', 'use', 'performs', 'save', 'profile', 'file', 'string', 'never', 'call', 
'my', 'way', 'related', 'simple', 'example', 'function', 'one', 'print', 'sorting', 'part', 'line', 'which', 'go', 'pandas', 'a', 'need', 'bit', 'option','details', 'f', '0', 'article', 'result', 'get', 'find', 'libraries', 'getting', 'profiling', 'how', 'ways', 'code', 'column', 'number', 'shell', 'python', 'results', 'using'}

While most of the words here are tech related, there are some words that have to be filtered further. Since the objective of this activity is to work on a project that produces substantial output in 5 hours, I stopped here. These are the next steps I would like to focus on if I get to work on it in the future:

1. Filter the tech words that are returned
2. Add a default tagger to the newly created tagger
3. Add CLI support to read input from a URL

References:
1. https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
2. https://www.nltk.org/book/ch05.html

