{"id":19938,"date":"2023-09-25T08:39:01","date_gmt":"2023-09-25T12:39:01","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=19938"},"modified":"2023-09-25T08:39:01","modified_gmt":"2023-09-25T12:39:01","slug":"build-semantic-search-faiss","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/build-semantic-search-faiss\/","title":{"rendered":"Build a semantic search tool using FAISS"},"content":{"rendered":"<p>This post provides an overview of implementing semantic search. Why? Because too often, we notice testers skip testing more complex features like autocomplete. This might be ok in most applications. But in domain specific applications, testing autocomplete capabilities of the product is important. Since testers can benefit from understanding implementation details, in this post, we will look at how autocomplete is usually implemented. We will follow up in a later post on how to write some interesting tests that build upon the insights in this post.<\/p>\n<h3>An example to play along:<\/h3>\n<p>For this post, let us imagine a text box. You start to type the name of a technology and related technologies pop up. <\/p>\n<p>Aside: we needed to write a similar tool recently and this post borrows heavily from our work on the internal tool. We recognized the need to address a data duplication issue within our internal survey app at <a href='https:\/\/qxf2.com\/?utm_source=faiss&#038;utm_medium=click&#038;utm_campaign=From%20blog'>Qxf2 Services<\/a>. Every week, our team completes a survey, listing the technologies used during that week. This process aids us in tracking and evaluating our efforts to remain current with evolving developments in the tech industry. You can read more about it in our blog post &#8216;<a href='https:\/\/qxf2.com\/blog\/qxf2-techs-used-2021\/'>Qxf2 Tech Used in 2021<\/a>&#8216;. 
However, with several years&#8217; worth of data collected, we noticed a challenge: the same technologies were represented differently, causing redundancy. For instance, SQS was sometimes recorded as just &#8216;SQS,&#8217; other times as &#8216;AWS-SQS,&#8217; or &#8216;Amazon Simple Queue Service.&#8217; This inconsistency was causing database bloat. To tackle this issue, we sought a solution that would suggest existing words when someone filled in the survey.<\/p>\n<hr>\n<h3>About semantic search:<\/h3>\n<p>Remember autocomplete capabilities from many years ago? You start typing a word and get suggestions on ways to complete the word. This sort of autocomplete relies on the spelling of the word. It has a fancy name &#8211; lexical search. Semantic search improves on lexical search by retrieving contextually related terms, not just spelling matches. For instance, a search for &#8216;dog&#8217; might also yield &#8216;German shepherd&#8217; because the two are semantically related. This is achieved by transforming words like &#8216;dog&#8217; and &#8216;German shepherd&#8217; into vectors and positioning them close together within a vector space. During a search, the query is similarly transformed into a vector and placed within the same vector space, and the closest vectors are identified and retrieved, yielding relevant results.<\/p>\n<hr>\n<h3>Tools used to implement semantic search:<\/h3>\n<p>During my recent exploration of <a href='https:\/\/qxf2.com\/blog\/context-based-question-answering-using-llm\/'>context-based question answering using LLM<\/a>, I came across <a href='https:\/\/engineering.fb.com\/2017\/03\/29\/data-infrastructure\/faiss-a-library-for-efficient-similarity-search\/'>FAISS<\/a>. We realized that this library could assist us in resolving the data duplication problem. In the initial phase of addressing this issue, I developed a semantic search tool using the FAISS library, leveraging a Stack Overflow dataset. 
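Before walking through the FAISS implementation, the retrieval idea from the previous section can be sketched in plain Python: represent each word as a unit-length vector and rank candidates by inner product, which for unit vectors equals cosine similarity. This is a minimal sketch; the toy 3-dimensional vectors below are invented for illustration (real embeddings from all-mpnet-base-v2 are 768-dimensional).

```python
import math

def normalize(vector):
    "Scale a vector to unit length so that inner product equals cosine similarity"
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy stand-ins for real embeddings: related terms point in similar directions
VECTORS = {
    "dog":             normalize([0.9, 0.1, 0.0]),
    "german shepherd": normalize([0.8, 0.2, 0.1]),
    "python":          normalize([0.0, 0.1, 0.9]),
}

def search(query_vector, k=2):
    "Brute-force nearest neighbours by inner product, the same metric IndexFlatIP uses"
    scores = {word: sum(q * v for q, v in zip(query_vector, vector))
              for word, vector in VECTORS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(search(VECTORS["dog"]))  # -> ['dog', 'german shepherd']
```

A real FAISS index performs the same ranking over millions of vectors with optimized linear algebra; the brute-force loop above is only meant to make the geometry concrete.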
I built my application by referencing the example provided in <a href=\"https:\/\/deepnote.com\/blog\/semantic-search-using-faiss-and-mpnet\">Tutorial: semantic search using Faiss &#038; MPNet<\/a>. I am sharing this post with the hope of aiding fellow engineers in their own related tasks.<\/p>\n<p>I used the following tools:<br \/>\n1. <a href=\"https:\/\/huggingface.co\/sentence-transformers\/all-mpnet-base-v2\">sentence-transformers\/all-mpnet-base-v2<\/a>, to create vector embeddings<br \/>\n2. <a href=\"https:\/\/engineering.fb.com\/2017\/03\/29\/data-infrastructure\/faiss-a-library-for-efficient-similarity-search\/\">FAISS<\/a>, to index the vectors; it also provides APIs to search and retrieve relevant vectors<\/p>\n<hr>\n<h3>Environment setup:<\/h3>\n<p>The setup involves installing the following packages:<\/p>\n<pre lang=\"python\">\r\n# Install torch using the following command\r\npip install torch==2.0.1\r\npip install torchvision==0.15.2\r\n\r\n# Install Transformers module using the following command\r\npip install transformers==4.30.2\r\n\r\n# Install FAISS library using the following command\r\npip install faiss-cpu==1.7.4\r\n\r\n# Install beautifulsoup using the following command\r\npip install bs4==0.0.1\r\n<\/pre>\n<hr>\n<h3>Creating a semantic search tool with FAISS:<\/h3>\n<p>I have created objects for the following purposes:<br \/>\n1. Read the rows from the Stack Overflow XML data file<br \/>\n2. Convert words to vectors<br \/>\n3. Create a FAISS index<br \/>\n4. Create a pickle<\/p>\n<h5>Read the rows from the XML file:<\/h5>\n<p>I have used this <a href=\"https:\/\/gist.github.com\/shivahari\/ebd304a317672923f1e46e5675aee04b\">Stack Overflow dataset<\/a> to create the tool.<\/p>\n<pre lang=\"python\">\r\nclass XMLReader:\r\n    \"An XML object to read the values from an XML file\"\r\n    @staticmethod\r\n    def read_from_file(xml_file, html_property):\r\n        \"Read from XML file\"\r\n        html_property = html_property.lower()\r\n        with open(xml_file, 'r') as xmlfile:\r\n            xml = xmlfile.read()\r\n\r\n        soup = BeautifulSoup(xml, \"html.parser\")\r\n        rows = soup.find_all('row')\r\n        return [row[html_property] for row in rows]\r\n<\/pre>\n<p>The <strong>XMLReader<\/strong> object reads the XML data file and returns the requested attribute of every row as a list.<\/p>\n<h5>Convert words to vectors:<\/h5>\n<pre lang=\"python\">\r\nclass SemanticEmbedding:\r\n    \"A semantic embedding object to get the word embeddings\"\r\n    def __init__(self, model_name='sentence-transformers\/all-mpnet-base-v2'):\r\n        \"Object initialization\"\r\n        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\r\n        self.model = AutoModel.from_pretrained(model_name)\r\n\r\n    def mean_pooling(self, model_output, attention_mask):\r\n        \"\"\"\r\n        Mean Pooling - Take attention mask into account for correct averaging\r\n        Although this is primarily useful to create a vector for a sentence,\r\n        it also works in our case, where we embed a single word\r\n        \"\"\"\r\n        # First element of model_output contains all token embeddings\r\n        token_embeddings = model_output[0]\r\n        input_mask_exp = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()\r\n        return torch.sum(token_embeddings*input_mask_exp, 1)\/torch.clamp(input_mask_exp.sum(1),\r\n                                                                         
min=1e-9)\r\n\r\n    def get_embedding(self, word):\r\n        \"Create word embeddings\"\r\n        encoded_input = self.tokenizer(word, padding=True, truncation=True, return_tensors='pt')\r\n        with torch.no_grad():\r\n            model_output = self.model(**encoded_input)\r\n        # Perform pooling\r\n        word_embedding = self.mean_pooling(model_output, encoded_input['attention_mask'])\r\n\r\n        # Normalize embeddings\r\n        word_embedding = torch.nn.functional.normalize(word_embedding, p=2, dim=1)\r\n        return word_embedding.detach().numpy()\r\n\r\n<\/pre>\n<p>The <strong>SemanticEmbedding<\/strong> object provides methods to create embeddings for words; the <em>get_embedding<\/em> method takes text as input and returns a tensor of size <em>[1, 768]<\/em>.<\/p>\n<h5>Create a FAISS index<\/h5>\n<pre lang=\"python\">\r\nclass FaissIdxObject:\r\n    \"A FAISS object to create, add docs to, search and save an index\"\r\n    def __init__(self, dim=768):\r\n        \"Object initialization\"\r\n        self.dim = dim\r\n\r\n    def create_index(self):\r\n        \"Create a new index\"\r\n        return faiss.IndexFlatIP(self.dim)\r\n\r\n    @staticmethod\r\n    def get_index(index_name):\r\n        \"Get the index\"\r\n        try:\r\n            return faiss.read_index(index_name)\r\n        except FileNotFoundError as err:\r\n            raise FileNotFoundError(f\"Unable to find {index_name}, does the file exist?\") from err\r\n\r\n    @staticmethod\r\n    def add_doc_to_index(index, embedded_document_text):\r\n        \"Add doc to index\"\r\n        index.add(embedded_document_text)\r\n\r\n    @staticmethod\r\n    def search_index(embedded_query, index, doc_map, k=5, return_scores=False):\r\n        \"Search through the index\"\r\n        D, I = index.search(embedded_query, k)\r\n        if return_scores:\r\n            value = [{doc_map[idx]: str(score)} for idx, score in zip(I[0], D[0]) if idx in doc_map]\r\n        else:\r\n            value = [doc_map[idx] for idx in I[0] if idx in doc_map]\r\n        return value\r\n\r\n    @staticmethod\r\n    def save_index(index, index_name):\r\n        \"Save the index to local disk\"\r\n        faiss.write_index(index, index_name)\r\n<\/pre>\n<p>The <strong>FaissIdxObject<\/strong> object provides methods to create an index, add vectors to it, search it and save it. The <em>search_index<\/em> method returns the distances to the nearest neighbours <em>D<\/em> and their indices <em>I<\/em>.<br \/>\nFor my application, I opted for the <em>IndexFlatIP<\/em> index. This choice was driven by its use of the inner product as the similarity metric, which, for normalized embeddings, equates to cosine similarity.<\/p>\n<h5>Create a pickle<\/h5>\n<pre lang=\"python\">\r\nclass PickleObject:\r\n    \"A pickle object to save and read the human-readable dataset\"\r\n    def create_dict(self):\r\n        \"Create a new dict\"\r\n        return {}\r\n\r\n    @staticmethod\r\n    def get_pickle(pickle_name):\r\n        \"Get the local pickle file\"\r\n        try:\r\n            with open(pickle_name, 'rb') as pickled_file:\r\n                return pickle.load(pickled_file)\r\n        except FileNotFoundError as err:\r\n            raise FileNotFoundError(f\"Unable to find {pickle_name}, does the file exist?\") from err\r\n\r\n    @staticmethod\r\n    def add_doc_to_pickle(pickle_dict, counter, doc):\r\n        \"Add entry to the pickle\"\r\n        pickle_dict[counter] = doc\r\n\r\n    @staticmethod\r\n    def save_pickle(pickle_file, pickle_name):\r\n        \"Save the pickle file to local disk\"\r\n        with open(pickle_name, 'wb') as pf:\r\n            pickle.dump(pickle_file, pf, protocol=pickle.HIGHEST_PROTOCOL)\r\n<\/pre>\n<p>The <strong>PickleObject<\/strong> provides methods to save and retrieve human-readable data locally. It stores each FAISS index position as a key in a Python dictionary against the actual text data &#038; helps retrieve the text based on the index.<\/p>\n<h5>Putting it all together:<\/h5>\n<pre lang=\"python\">\r\n\"\"\"\r\nA semantic search tool\r\n\"\"\"\r\nimport pickle\r\nfrom pathlib import Path\r\nimport faiss\r\nimport torch\r\nfrom bs4 import BeautifulSoup\r\nfrom transformers import AutoTokenizer, AutoModel\r\n\r\nclass SemanticEmbedding:\r\n    \"A semantic embedding object to get the word embeddings\"\r\n    def __init__(self, model_name='sentence-transformers\/all-mpnet-base-v2'):\r\n        \"Object initialization\"\r\n        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\r\n        self.model = AutoModel.from_pretrained(model_name)\r\n\r\n    def mean_pooling(self, model_output, attention_mask):\r\n        \"\"\"\r\n        Mean Pooling - Take attention mask into account for correct averaging\r\n        Although this is primarily useful to create a vector for a sentence,\r\n        it also works in our case, where we embed a single word\r\n        \"\"\"\r\n        # First element of model_output contains all token embeddings\r\n        token_embeddings = model_output[0]\r\n        input_mask_exp = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()\r\n        return torch.sum(token_embeddings*input_mask_exp, 1)\/torch.clamp(input_mask_exp.sum(1),\r\n                                                                         min=1e-9)\r\n\r\n    def get_embedding(self, word):\r\n        \"Create word embeddings\"\r\n        encoded_input = self.tokenizer(word, padding=True, truncation=True, return_tensors='pt')\r\n        with torch.no_grad():\r\n            model_output = self.model(**encoded_input)\r\n        # Perform pooling\r\n        word_embedding = self.mean_pooling(model_output, encoded_input['attention_mask'])\r\n\r\n        # Normalize embeddings\r\n        word_embedding = torch.nn.functional.normalize(word_embedding, p=2, dim=1)\r\n        return word_embedding.detach().numpy()\r\n\r\nclass FaissIdxObject:\r\n    \"A FAISS object to create, add docs to, search and save an index\"\r\n    def __init__(self, dim=768):\r\n        \"Object initialization\"\r\n        self.dim = dim\r\n\r\n    def create_index(self):\r\n        \"Create a new index\"\r\n        return faiss.IndexFlatIP(self.dim)\r\n\r\n    @staticmethod\r\n    def get_index(index_name):\r\n        \"Get the index\"\r\n        try:\r\n            return faiss.read_index(index_name)\r\n        except FileNotFoundError as err:\r\n            raise FileNotFoundError(f\"Unable to find {index_name}, does the file exist?\") from err\r\n\r\n    @staticmethod\r\n    def add_doc_to_index(index, embedded_document_text):\r\n        \"Add doc to index\"\r\n        index.add(embedded_document_text)\r\n\r\n    @staticmethod\r\n    def search_index(embedded_query, index, doc_map, k=5, return_scores=False):\r\n        \"Search through the index\"\r\n        D, I = index.search(embedded_query, k)\r\n        if return_scores:\r\n            value = [{doc_map[idx]: str(score)} for idx, score in zip(I[0], D[0]) if idx in doc_map]\r\n        else:\r\n            value = [doc_map[idx] for idx in I[0] if idx in doc_map]\r\n        return value\r\n\r\n    @staticmethod\r\n    def save_index(index, index_name):\r\n        \"Save the index to local disk\"\r\n        faiss.write_index(index, index_name)\r\n\r\nclass PickleObject:\r\n    \"A pickle object to save and read the human-readable dataset\"\r\n    def create_dict(self):\r\n        \"Create a new dict\"\r\n        return {}\r\n\r\n    @staticmethod\r\n    def get_pickle(pickle_name):\r\n        \"Get the local pickle file\"\r\n        try:\r\n            with open(pickle_name, 'rb') as pickled_file:\r\n                return pickle.load(pickled_file)\r\n        except FileNotFoundError as err:\r\n            raise FileNotFoundError(f\"Unable to find {pickle_name}, does the file exist?\") from err\r\n\r\n    @staticmethod\r\n    def add_doc_to_pickle(pickle_dict, counter, doc):\r\n        \"Add entry to the pickle\"\r\n        pickle_dict[counter] = doc\r\n\r\n    @staticmethod\r\n    def save_pickle(pickle_file, pickle_name):\r\n        \"Save the pickle file to local disk\"\r\n        with open(pickle_name, 'wb') as pf:\r\n            pickle.dump(pickle_file, pf, protocol=pickle.HIGHEST_PROTOCOL)\r\n\r\nclass XMLReader:\r\n    \"An XML object to read the values from an XML file\"\r\n    @staticmethod\r\n    def read_from_file(xml_file, html_property):\r\n        \"Read from XML file\"\r\n        html_property = html_property.lower()\r\n        with open(xml_file, 'r') as xmlfile:\r\n            xml = xmlfile.read()\r\n\r\n        soup = BeautifulSoup(xml, \"html.parser\")\r\n        rows = soup.find_all('row')\r\n        return [row[html_property] for row in rows]\r\n\r\nif __name__ == '__main__':\r\n    embedder = SemanticEmbedding()\r\n\r\n    if not Path('Tags.index').is_file() or not Path('Tags.pickle').is_file():\r\n        faiss_obj = FaissIdxObject()\r\n        pickle_obj = PickleObject()\r\n        xml_reader = XMLReader()\r\n        faiss_index = faiss_obj.create_index()\r\n        doc_dict = pickle_obj.create_dict()\r\n        input_rows = xml_reader.read_from_file(xml_file='Tags.xml',\r\n                                               html_property='TagName')\r\n        COUNTER = 0\r\n        for row in input_rows:\r\n            embedded_content = embedder.get_embedding(row)\r\n            faiss_obj.add_doc_to_index(index=faiss_index,\r\n                                       embedded_document_text=embedded_content)\r\n            pickle_obj.add_doc_to_pickle(pickle_dict=doc_dict,\r\n                                         counter=COUNTER,\r\n                                         doc=row)\r\n            COUNTER += 1\r\n\r\n        faiss_obj.save_index(index=faiss_index, index_name='Tags.index')\r\n        pickle_obj.save_pickle(pickle_file=doc_dict, pickle_name='Tags.pickle')\r\n\r\n    else:\r\n        faiss_index = FaissIdxObject.get_index(index_name='Tags.index')\r\n        doc_dict = PickleObject.get_pickle(pickle_name='Tags.pickle')\r\n\r\n    while True:\r\n        tech = input(\"\\nEnter a tech: \")\r\n        if tech == \"exit\":\r\n            break\r\n        if tech.strip() == \"\":\r\n            continue\r\n        embedded_input = embedder.get_embedding(tech)\r\n        output = FaissIdxObject.search_index(embedded_query=embedded_input,\r\n                                             index=faiss_index,\r\n                                             doc_map=doc_dict,\r\n                                             k=10,\r\n                                             return_scores=True)\r\n        print(output)\r\n<\/pre>\n<p>You can find the snippet <a href=\"https:\/\/gist.github.com\/shivahari\/bb944467a9ca46d653041279464cf2c0\" rel=\"noopener\" target=\"_blank\">here<\/a>.<br \/>\n<strong>Note:<\/strong> The script is very rudimentary and is yet to go through our stringent code review process.<\/p>\n<h5>Output<\/h5>\n<pre lang=\"python\">\r\nEnter a tech: python\r\n[{'python': '1.0'}, {'java': '0.62871754'}, {'coding': '0.62170094'}, {'c#': '0.5832066'}, {'unix': '0.575704'}, {'python-3.x': '0.57134223'}, {'programming-languages': '0.5683065'}, {'languages': '0.55639744'}, {'c++': '0.5557853'}, {'javascript': '0.5530219'}]\r\n<\/pre>\n<p>From the output, it is clear that the tool returned not just <em>python<\/em> but also a few other entries it considered similar to it.<\/p>\n<hr>\n<h3>What next?<\/h3>\n<p>Now that we have a mental model of what happens under the hood of an autocomplete feature, the next obvious question arises &#8211; how can we effectively test this feature? Our team engaged in a discussion regarding the testing strategies for this application. 
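One simple check such a test suite might include is an identity query: a term that is already indexed, queried verbatim, should come back as the top hit with a similarity of roughly 1.0, just as &#8216;python&#8217; did in the output above. The sketch below is hypothetical: a brute-force inner-product search over made-up 3-dimensional vectors stands in for the real FAISS index and embedding model, and the helper names are invented.

```python
import math

def unit(vector):
    "Normalize a vector to unit length"
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Made-up stand-ins for the 768-d embeddings stored in the FAISS index
INDEX = {
    "python": unit([0.9, 0.1, 0.2]),
    "java":   unit([0.7, 0.3, 0.2]),
}

def top_hit(query_vector):
    "Return the (word, score) pair with the highest inner product, like search_index with k=1"
    scored = [(sum(q * v for q, v in zip(query_vector, vector)), word)
              for word, vector in INDEX.items()]
    score, word = max(scored)
    return word, score

def test_indexed_term_matches_itself():
    "Querying with an already-indexed term should return that term with score ~1.0"
    word, score = top_hit(INDEX["python"])
    assert word == "python"
    assert abs(score - 1.0) < 1e-6

test_indexed_term_matches_itself()
```

The same invariant can be run against the real tool by embedding an indexed tag and asserting on the first entry that search returns.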
In the next post, we will delve into some testing techniques.<\/p>\n<hr>\n<h3>Hire technical testers from Qxf2<\/h3>\n<p>You will not come across many technical testers that can understand the technical aspects of semantic search, implement FAISS and then apply this knowledge to test products better. But Qxf2 is stacked with such QA engineers. We enjoy the technical aspects of testing. Our approach goes well beyond traditional test automation. If you are working in a highly technical domain and would like good testers on your team &#8211; <a href='https:\/\/qxf2.com\/contact?utm_source=faiss&#038;utm_medium=click&#038;utm_campaign=From%20blog'>reach out<\/a>!<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>This post provides an overview of implementing semantic search. Why? Because too often, we notice testers skip testing more complex features like autocomplete. This might be ok in most applications. But in domain specific applications, testing autocomplete capabilities of the product is important. 
Since testers can benefit from understanding implementation details, in this post, we will look at how autocomplete [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[365,368,130],"tags":[],"class_list":["post-19938","post","type-post","status-publish","format-standard","hentry","category-hugging-face","category-llm","category-machine-learning"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19938","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=19938"}],"version-history":[{"count":23,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19938\/revisions"}],"predecessor-version":[{"id":20229,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19938\/revisions\/20229"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=19938"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=19938"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=19938"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}