{"id":11241,"date":"2019-08-21T15:23:28","date_gmt":"2019-08-21T19:23:28","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=11241"},"modified":"2019-08-21T15:23:28","modified_gmt":"2019-08-21T19:23:28","slug":"identify-tech-words-from-a-text","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/identify-tech-words-from-a-text\/","title":{"rendered":"Identify tech words from a text"},"content":{"rendered":"<p>As part of the pairing project activity at <a href=\"https:\/\/qxf2.com\/\">Qxf2<\/a> &#8211; pick a project that can produce a meaningful output in 5 hours and work on it collaborating with your team mates, I picked this project to identify the tech keywords from a text using <a href=\"https:\/\/www.nltk.org\/\">NLTK<\/a> module. This post covers the steps I followed to find the tech related words in a corpus.<br \/>\n<strong>Note:<\/strong>This post is not an introduction to NLTK. This post is about my attempt with NLTK to find the tech related words in sentences. If you are looking for an introduction to NLTK, this post &#8211; <a href=\"https:\/\/towardsdatascience.com\/introduction-to-natural-language-processing-for-text-df845750fb63\">Introduction to Natural Language Processing for Text<\/a> is a good starting point.<br \/>\nI have tried to outline my work on this project in the following 3 steps,<\/p>\n<pre>\r\n1.Breaking down sentence into words and filtering useful words\r\n2.Obtaning input data to validate the output\r\n3.Identifying tech words in a text\r\n<\/pre>\n<p><strong>Note:<\/strong>This post contains constricted snippets from my project. The complete project is present here &#8211; <a href=\"https:\/\/github.com\/shivahari\/identify-tech-words\">identify-tech-words<\/a><\/p>\n<hr \/>\n<h3>Breaking down sentence into words and filtering useful words:<\/h3>\n<p>The first step in any text processing is breaking it down to sentences and then into words to evaluate them(tokenize words). 
<\/p>\n<pre lang=\"python\">import nltk\r\ninput_lines = \"This sentence will be tokenized\"\r\ntokenized = nltk.word_tokenize(input_lines)\r\nprint(tokenized)\r\n['This', 'sentence', 'will', 'be', 'tokenized']\r\n<\/pre>\n<p>The next step is to evaluate the tokenized words. Certain words are included in a sentence to make it complete, while those words are necessary for the sentence to make complete sense, they may be irrelevant to a computer trying to parse the sentence and identify tech related words from it. These words are called stop words in natural language processing.<\/p>\n<pre lang=\"python\">from nltk.corpus import stopwords\r\nset(stopwords.words('english'))\r\n{'those', \"wouldn't\", 'by', 'yours', 'these', 'any', 'ours', 'than', 'myself', 'just', \"you've\", 'haven', 'be', 'doesn', 'too', 'whom', 'nor', 'herself', 'under', 'wouldn', 'as', 'her', 'y', 'because', 'should', 'few', 'is', 'on', 's'n', 'too', 'whom', 'nor', 'herself', , \"hadn't\", \"isn't\", 'more', 'won', 'up', 'and', \"you'd\", 'we', \"don't\", 'shan', 'how', 'down', 'you', 'them', \"you'lup', 'and', \"you'd\", 'we', \"don't\", 'l\", 'who', 're', 'each', \"haven't\", 'they', \"it's\", 'doing', 've', 'when', \"that'll\", 'yourself', 'i', 'there', 'at', \"that'll\", 'yourself', 'i', 'there', 'did', \"doesn't\", 'it', \"didn't\", 'can', 'very', \"shouldn't\", 'out', 'so', 'have', 'but', 'if', 'what', 'are', 'then 'then', 'mightn', 'an', 'in', 'o', \"', 'mightn', 'an', 'in', 'o', \"she's\", 'below', 'being', 'hers', 'with', 'does', 'needn', 'other', 'of', \"needn't\", \"ugh', 'between', 'no', 'had', \"mustn'aren't\", \"couldn't\", 'd', 'through', 'between', 'no', 'had', \"mustn't\", 'above', 'about', \"mightn't\", \"should've\", 'sin', 'she', 'was', 'own', 'were', 'thhouldn', 'ma', 'my', 'am', 'same', 'before', 'been', \"weren't\", 'again', 'she', 'was', 'own', 'were', 'the', 'during'', 'its', 'which', 'this', 'from', 'f, 'will', 'for', 'after', 'your', 'to', 'into', 'that', 'until', 'don', 
\"shan't\", 'weren', \"hasn't\", 'why', 'its', 'w'having', 'his', 'both', 'ourselves',hich', 'this', 'from', 'further', 'off', 'over', 'didn', 'wasn', 'm', 'some', 'their', 'do', \"wasn't\", 'he', 'hasn', ', 'not', 'or', 'only', 'mustn', 'isn'our', 'hadn', \"you're\", 'having', 'his', 'both', 'ourselves', 'll', 'such', 'most', 'himself', 'ain', 'aren', 'him', 'theirs', 'against', \"won't\", 'me', 'once', 'a', 'here', 'all', 'not', 'or', 'only', 'mustn', 'isn', 'where', 'itself', 'themselves', 'couldn', 'has', 'while', 'yourselves', 't', 'now'}\r\n<\/pre>\n<p>These are a few stopwords that NLTK come out of the box. In addition to this list, I added a file &#8211; <code>input.conf<\/code> to hold the stopwords that I considered more relevant to this project. This allowed me to filter my output and pass on the more useful words to the next step of processing.<br \/>\nNow that I had a list of words broken down from the sentence and the useless words removed, I wanted to find the nouns from it. NLTK&#8217;s parts-of-speech tagging helped me here. NLTK also allows categorizing of words and tagging them based on the part-of-speech.<\/p>\n<pre lang=\"python\">import nltk\r\ninput_lines = \"This sentence will be tokenized\"\r\ntokenized = nltk.word_tokenize(input_lines)\r\nprint(tokenized)\r\n['This', 'sentence', 'will', 'be', 'tokenized']\r\nis_noun = lambda pos: 'NN' in pos[:2]\r\npos_tagged = [ word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos) ]\r\nprint(pos_tagged)\r\n['sentence']\r\n<\/pre>\n<p>NLTK has correctly identified the only noun in that sentence &#8211; <code>sentence<\/code> in this case(I have noticed a few discrepancies while pos tagging large text).<br \/>\nThe <code>tech_extractor.py<\/code> file in my project reads the input text from the <code>sample_nltk.txt<\/code> file. 
I used the contents of this post about cProfile &#8211; <a href=\"https:\/\/qxf2.com\/blog\/saving-cprofile-stats-to-a-csv-file\/\">Saving cProfile stats to a csv file<\/a> &#8211; as my input. <code>tech_extractor.py<\/code> breaks the input from the file down into sentences and then tokenizes them. The stopwords are then filtered from the tokenized words.<\/p>\n<hr \/>\n<h3>Obtaining input data to validate the output:<\/h3>\n<p>With the steps to break down the input text and get the individual nouns in place, the next step is to verify whether a word is related to tech. To achieve this, there needs to be some input list of words that are considered tech related. One way is to create a Python list and add every tech related word in the universe individually. That would be a huge effort and would require more than the 5 hours stipulated for the project. The other alternatives I pondered were<\/p>\n<pre>\r\n1. Google every noun from the input text and verify if it is related to tech\r\n2. Make the machine read a few tech related articles and collect the commonly used nouns\r\n<\/pre>\n<p><code>Approach #1<\/code> is feasible, but the execution time of the script and the resources needed for it are huge. <code>Approach #2<\/code>, on the other hand, does not require as many resources and the execution time is much shorter. I chose <code>approach #2<\/code>.<br \/>\nI created a <code>training_urls.py<\/code> file to hold the URLs for posts that are related to tech. 
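<\/p>\n<p>For reference, the training URL file is a plain text file with one URL per line. Illustrative contents (any tech related posts work &#8211; the quality of the output depends on this choice):<\/p>\n<pre>\r\nhttps:\/\/qxf2.com\/blog\/saving-cprofile-stats-to-a-csv-file\/\r\nhttps:\/\/www.nltk.org\/book\/ch05.html\r\n<\/pre>\n<p>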
The following functions helped read the contents of the URLs; I used <a href=\"https:\/\/pypi.org\/project\/beautifulsoup4\/\">Beautifulsoup<\/a> to parse the contents.<\/p>\n<pre lang=\"python\">import re\r\n\r\nimport nltk\r\nimport requests\r\nfrom bs4 import BeautifulSoup\r\n\r\ndef read_training_urls(url_file):\r\n    \"Read the urls from the training url file\"\r\n    try:\r\n        with open(url_file, mode='r', encoding='utf-8') as fp:\r\n            lines = fp.readlines()\r\n        urls = []\r\n        for line in lines:\r\n            # match both http and https lines\r\n            if re.search(r'https?', line):\r\n                urls.append(line.strip('\\n'))\r\n    except Exception as e:\r\n        print(str(e))\r\n        urls = None\r\n    finally:\r\n        return urls\r\n\r\ndef read_url_contents():\r\n    \"Open the urls from the url file, read the contents and return the most frequently used words\"\r\n    urls = read_training_urls('training-url.txt')\r\n    relatednouns = []\r\n    for url in urls:\r\n        response = requests.get(url)\r\n        html = response.text\r\n        # pass an explicit parser to avoid BeautifulSoup's parser warning\r\n        soup = BeautifulSoup(html, 'html.parser')\r\n        nouns = get_tech_jargon(soup.text)\r\n        relatednouns = relatednouns + nouns\r\n    freq_words = nltk.FreqDist(relatednouns)\r\n\r\n    return freq_words.most_common()\r\n<\/pre>\n<p><code>nltk.FreqDist(relatednouns)<\/code> helps get the most commonly used words from the content obtained from the URLs.<br \/>\nThe <code>tech_extractor.py<\/code> file collects the most repeated nouns from the URLs. The quality of the output words depends on the URLs that are chosen.<\/p>\n<hr \/>\n<h3>How I get the tech words from the input:<\/h3>\n<p>Now to the last step &#8211; verify if there are any tech related words in the input. With the common nouns from the URLs in my inventory, I created a new parts-of-speech tag called <code>TECH<\/code>. The common nouns form this new tag.  
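<\/p>\n<p>A minimal, self-contained sketch of this idea &#8211; the noun list here is hypothetical and stands in for the nouns collected by <code>read_url_contents()<\/code>:<\/p>\n<pre lang=\"python\">from nltk import FreqDist\r\nfrom nltk.tag import UnigramTagger\r\n\r\nnouns = ['python', 'pandas', 'python', 'profiling', 'pandas', 'python']\r\nfreq_words = FreqDist(nouns)\r\nprint(freq_words.most_common(2))\r\n[('python', 3), ('pandas', 2)]\r\nmodel = {word: 'TECH' for word, count in freq_words.most_common()}\r\ntagger = UnigramTagger(model=model)\r\nprint(tagger.tag(['profiling', 'code', 'with', 'pandas']))\r\n[('profiling', 'TECH'), ('code', None), ('with', None), ('pandas', 'TECH')]\r\n<\/pre>\n<p>Words absent from the model are tagged <code>None<\/code>; a default tagger used as a backoff would handle those.<\/p>\n<p>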
<\/p>\n<pre lang=\"python\">def create_pos_tag():\r\n    \"Create a new TECH pos tag\"\r\n    model = {}\r\n    words_list = read_url_contents()\r\n    for words in words_list:\r\n        model.update({words[0]:'TECH'})\r\n    tagger = nltk.tag.UnigramTagger(model=model)\r\n\r\n    return tagger\r\n<\/pre>\n<p>Creating this new tag and tagging the tokenized input words using the new tagger allowed me to tag them as <code>TECH<\/code> like it would tag a <code>NOUN<\/code> or <code>VERB<\/code> in a sentence. Running it against my input resulted in the following output<\/p>\n<pre lang=\"python\">\r\n{'output', 'data', 'about', 'binary', 'memory', 'times', 'all', 'w', 'functions', 'use', 'performs', 'save', 'profile', 'file', 'string', 'never', 'call', \r\n'my', 'way', 'related', 'simple', 'example', 'function', 'one', 'print', 'sorting', 'part', 'line', 'which', 'go', 'pandas', 'a', 'need', 'bit', 'option','details', 'f', '0', 'article', 'result', 'get', 'find', 'libraries', 'getting', 'profiling', 'how', 'ways', 'code', 'column', 'number', 'shell', 'python', 'results', 'using'}\r\n<\/pre>\n<p>While most of the words here are tech related there are some words that has to be filtered further!. Since the objective of this activity is to work on a project that produces substantial output in 5 hours, I have stopped here. These are the next steps I would like to focus on, if I get to work on it in the future<\/p>\n<pre>\r\n1. filter the tech that is returned\r\n2. add a default tagger to the newly created tagger\r\n3. add CLI support to read input from a url\r\n<\/pre>\n<hr \/>\n<p><strong>References:<\/strong><br \/>\n1. https:\/\/pythonprogramming.net\/tokenizing-words-sentences-nltk-tutorial\/<br \/>\n2. 
https:\/\/www.nltk.org\/book\/ch05.html<\/p>\n<hr \/>\n","protected":false},"excerpt":{"rendered":"<p>As part of the pairing project activity at Qxf2 &#8211; pick a project that can produce a meaningful output in 5 hours and work on it collaborating with your team mates, I picked this project to identify the tech keywords from a text using NLTK module. This post covers the steps I followed to find the tech related words in [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[141,130,155,1],"tags":[],"class_list":["post-11241","post","type-post","status-publish","format-standard","hentry","category-extracting-data","category-machine-learning","category-natural-language-generation","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11241","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=11241"}],"version-history":[{"count":51,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11241\/revisions"}],"predecessor-version":[{"id":14836,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11241\/revisions\/14836"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=11241"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=11241"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=11241"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}