Auto-generate XPaths using Python

In this post, we will present a way to auto-generate robust and short XPaths for the two most common HTML elements automation interacts with – buttons and input elements. We have tested this against more than 50 commonly used websites like Facebook, LinkedIn, Citibank, IRCTC, etc.


Why this post?

The foundation for robust GUI automated checks is writing good element locators. XPath is one locator strategy, used for selecting nodes from the Document Object Model (DOM) of XML and HTML documents. Generating XPaths manually is a routine and time-consuming task.

As part of simplifying our test writing process, we decided to write a utility script that identifies and auto-generates robust and simple XPaths. We started by generating XPaths for the input and button fields of a webpage using general locators like id, name, class, etc. The Python program in this post demonstrates how to create XPath expressions automatically for input and button tags. Whenever there is a requirement to automate a webpage, the tester can simply run this script, which generates a set of XPaths without any manual effort.


What should be considered when writing an XPath?

Most of us rely on extracting XPaths from browser plugins or tools. These tools have limitations. One is that they frequently produce absolute XPaths (meaning all the way from the HTML tag!), which are long and extremely flaky. Other, smarter tools seem to have trouble producing unique XPaths in anything but the simplest conditions. Generally, we locate elements using their unique attributes, but some elements do not have any. Locating such elements is difficult because the generated XPath will match multiple elements. It is important to consider the following points while choosing an XPath. A good locator is:

  • Unique – The XPath should match exactly one element.
  • Descriptive – A descriptive XPath makes it easy to tell which element it identifies.
  • Short – You will often have several XPath options. Prefer the shorter one because it is more readable.
  • Valid even after changes to a page – Choose an XPath that stays valid even after changes to the DOM.
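
To make the last point concrete, here is a small sketch that is not part of the utility. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, as a stand-in for a browser, and the two page snippets are invented for illustration. It counts how many elements an absolute, position-based path and a short attribute-based path match before and after a hypothetical redesign.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, invented for this illustration.
ORIGINAL = """
<html><body>
  <form>
    <input id="email" type="text"/>
    <input id="phone" type="text"/>
  </form>
</body></html>
"""

# The same page after a redesign wrapped the form in an extra <div>.
REDESIGNED = """
<html><body>
  <div class="card">
    <form>
      <input id="email" type="text"/>
      <input id="phone" type="text"/>
    </form>
  </div>
</body></html>
"""

ABSOLUTE = "./body/form/input"        # position-based, starts at the root
RELATIVE = ".//input[@id='email']"    # attribute-based, short and descriptive


def count_matches(markup, path):
    """Return how many elements in `markup` match `path`."""
    root = ET.fromstring(markup.strip())
    return len(root.findall(path))


for label, markup in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(label, count_matches(markup, ABSOLUTE), count_matches(markup, RELATIVE))
```

The absolute path matches two elements on the original page and none after the redesign, while the attribute-based path matches exactly one element in both versions.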

Overview of our Python Utility script

We wrote a script that auto-generates XPaths for the input and button elements in a webpage and also checks that each generated XPath is unique. This saves automation time and effort. In the coming sections, we will cover the following items:

  • Accept a URL and parse the page content using BeautifulSoup
  • For each element, check for the existence of the attribute and guess the XPath
  • Check for uniqueness of the generated XPath
  • How we tested this utility
  • Putting it all together

Note: If a generated XPath is not unique, or if an element has none of the attributes we look for, our script does not generate an XPath for it.


Accept a URL and Parse the page content using BeautifulSoup

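First, we need to import all the libraries that we are going to use. We used the Python module BeautifulSoup to extract and parse the HTML content, Selenium to drive the browser, and the standard re module for the regular-expression checks further down.

import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup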

Our main method starts with declaring a variable which accepts the input URL of the page.

    #Get the URL and parse
    url = input("Enter URL: ")

The next step is to parse the page into a BeautifulSoup object. Once the URL is open in a browser session (the driver object, created in the full script at the end of this post), we use Selenium's execute_script() function to get the inner HTML of the page and return it as a string to the Python script. This method takes as a parameter a string of JavaScript code that it executes inside the browser.

    #Parsing a page with BeautifulSoup
    page = driver.execute_script("return document.body.innerHTML").encode('utf-8').decode('latin-1') #returns the inner HTML as a string
    soup = BeautifulSoup(page, 'html.parser')

For each element, check for existence of the attribute and guess the XPath

Now we have a variable, soup, containing the HTML of the page. Here’s where we can start coding the part that extracts the data. BeautifulSoup's find_all() method lets us fetch every tag of a given name, and we use it inside generate_xpath() to fetch all the input and button tags from the page. We pass soup as the argument to generate_xpath().

    #execute generate_xpath
    if xpath_obj.generate_xpath(soup) is False:
        print ("No XPaths generated for the URL:%s"%url)

If the script cannot generate a unique XPath for any input or button on the page (for example, because the page has none), it prints a message saying that no XPaths were generated for this URL.
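
To give a feel for what "fetch all the input and button tags" means, here is a rough sketch that uses only the standard library's html.parser instead of BeautifulSoup. The TagCollector class and the sample markup are hypothetical and are not part of the utility; the utility itself relies on soup.find_all().

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, attributes) pairs for the tags we care about."""

    def __init__(self, wanted=("input", "button")):
        super().__init__()
        self.wanted = wanted
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Hidden fields are never clicked or typed into, so skip them.
        if tag in self.wanted and attributes.get("type") != "hidden":
            self.found.append((tag, attributes))


# Hypothetical markup, invented for this illustration.
SAMPLE_PAGE = """
<form>
  <input type="hidden" name="csrf" value="abc123">
  <input id="email" name="email" placeholder="Email">
  <button class="btn primary">Sign in</button>
</form>
"""

collector = TagCollector()
collector.feed(SAMPLE_PAGE)
for tag, attributes in collector.found:
    print(tag, attributes)
```

The hidden input is dropped, which mirrors the type check at the top of generate_xpath().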

Now let us look into the generate_xpath() logic. We first initialize two lists: guessable_elements holds the tags we handle (for now only input and button) and known_attribute_list holds the attributes we try. We loop over the elements of each tag, skipping hidden inputs, and for each element we loop over the attribute list, check whether the attribute exists and then guess the XPath. Please note that the attribute list is ordered by importance. For example, if ‘id‘ is not available for an element, we check for the ‘name‘ attribute next, and so on. For a button that lacks the attribute being checked but has visible text, we build the XPath from the button text instead. Each unique XPath is also given a variable name by the get_variable_names() method shown further below.

    def generate_xpath(self,soup):
        "generate the xpath and assign the variable names"
        result_flag = False
        for guessable_element in self.guessable_elements:
            self.elements = soup.find_all(guessable_element)
            for element in self.elements:
                if (not element.has_attr("type")) or (element.has_attr("type") and element['type'] != "hidden"):
                    for attr in self.known_attribute_list:
                        if element.has_attr(attr):
                            locator = self.guess_xpath(guessable_element,attr,element)
                            if len(driver.find_elements(By.XPATH,locator))==1:
                                result_flag = True
                                variable_name = self.get_variable_names(element)
                                # checking for the unique variable names
                                if  variable_name != '' and variable_name not in self.variable_names:
                                    self.variable_names.append(variable_name)
                                    print ("%s_%s = %s"%(guessable_element, variable_name.encode('utf-8').decode('latin-1'), locator.encode('utf-8').decode('latin-1')))
                                    break
                                else:
                                    print (locator.encode('utf-8').decode('latin-1') + "----> Couldn't generate appropriate variable name for this xpath")
                        elif guessable_element == 'button' and element.getText():
                            button_text = element.getText()
                            if element.getText() == button_text.strip():
                                locator = self.guess_xpath_button(guessable_element,"text()",element.getText())
                            else:
                                locator = self.guess_xpath_using_contains(guessable_element,"text()",button_text.strip())
                            if len(driver.find_elements(By.XPATH, locator))==1:
                                result_flag = True
                                #Check for utf-8 characters in the button_text
                                matches = re.search(r"[^\x00-\x7F]",button_text)
                                if button_text.lower() not in self.button_text_lists:
                                    self.button_text_lists.append(button_text.lower())
                                    if not matches:
                                        # Striping and replacing characters before printing the variable name
                                        print ("%s_%s = %s"%(guessable_element,button_text.strip().strip("!?.").encode('utf-8').decode('latin-1').lower().replace(" + ","_").replace(" & ","_").replace(" ","_"), locator.encode('utf-8').decode('latin-1')))
                                    else:
                                        # printing the variable name with utf-8 characters along with language counter
                                        print ("%s_%s_%s = %s"%(guessable_element,"foreign_language",self.language_counter, locator.encode('utf-8').decode('latin-1')) + "---> Foreign language found, please change the variable name appropriately")
                                        self.language_counter +=1
                                else:
                                    # if the variable name is already taken
                                    print (locator.encode('utf-8').decode('latin-1') + "----> Couldn't generate appropriate variable name for this xpath")
                                break
 
                        elif guessable_element not in self.guessable_elements:
                            print("We are not supporting this guessable element")
 
        return result_flag
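
Stripped of the variable naming and the button-text fallback, the core of the loop above is "try attributes in priority order and keep the first XPath that matches exactly once". The sketch below isolates that idea. The function first_unique_locator and the FAKE_COUNTS table are hypothetical; the table stands in for the live driver.find_elements() lookup so that the example runs without a browser.

```python
def first_unique_locator(tag, attributes, priority, count_matches):
    """Return the first XPath built from `priority` that matches exactly once.

    `count_matches` is any callable that takes an XPath string and returns
    the number of matching elements (in the post, a Selenium lookup).
    """
    for name in priority:
        if name not in attributes:
            continue
        locator = "//%s[@%s='%s']" % (tag, name, attributes[name])
        if count_matches(locator) == 1:
            return locator
    return None


# Hypothetical match counts standing in for a live browser.
FAKE_COUNTS = {
    "//input[@id='q']": 2,        # duplicate id on the page
    "//input[@name='query']": 1,  # unique
}

locator = first_unique_locator(
    "input",
    {"id": "q", "name": "query", "type": "text"},
    ["id", "name", "placeholder", "value", "title", "type", "class"],
    lambda xpath: FAKE_COUNTS.get(xpath, 0),
)
print(locator)
```

Passing the counting function in as a parameter is a design choice of this sketch only; it keeps the priority logic testable without Selenium.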

The get_variable_names() method below generates the variable name for each XPath.

    def get_variable_names(self,element):
        "generate the variable names for the xpath"
        # condition to check the length of the 'id' attribute and ignore it if there are numerics in the 'id' attribute. Also ignoring id values having "input" and "button" strings.
        if (element.has_attr('id') and len(element['id'])>2) and bool(re.search(r'\d', element['id'])) == False and ("input" not in element['id'].lower() and "button" not in element['id'].lower()):
            self.variable_name = element['id'].strip("_")
        # condition to check if the 'value' attribute exists and not having date and time values in it.
        elif element.has_attr('value') and element['value'] != '' and bool(re.search(r'([\d]{1,}([/-]|\s|[.])?)+(\D+)?([/-]|\s|[.])?[[\d]{1,}',element['value']))== False and bool(re.search(r'\d{1,2}[:]\d{1,2}\s+((am|AM|pm|PM)?)',element['value']))==False:
            # condition to check if the 'type' attribute exists
            # getting the text() value if the 'type' attribute value is in 'radio','submit','checkbox','search'
            # if the text() is not '', getting the getText() value else getting the 'value' attribute
            # for the rest of the type attributes printing the 'type'+'value' attribute values. Doing a check to see if 'value' and 'type' attributes values are matching.
            if (element.has_attr('type')) and (element['type'] in ('radio','submit','checkbox','search')):
                if element.getText() !='':
                    self.variable_name = element['type']+ "_" + element.getText().strip().strip("_.")
                else:
                    self.variable_name = element['type']+ "_" + element['value'].strip("_.")
            else:
                # .get() avoids a KeyError when the element has no 'type' attribute
                if element.get('type','').lower() == element['value'].lower():
                    self.variable_name = element['value'].strip("_.")
                else:
                    self.variable_name = element.get('type','')+ "_" + element['value'].strip("_.")
        # condition to check if the "name" attribute exists and if the length of "name" attribute is more than 2 printing variable name
        elif element.has_attr('name') and len(element['name'])>2:
            self.variable_name = element['name'].strip("_")
        # condition to check if the "placeholder" attribute exists and is not having any numerics in it.
        elif element.has_attr('placeholder') and bool(re.search(r'\d', element['placeholder'])) == False:
            self.variable_name = element['placeholder']
        # condition to check if the "type" attribute exists and is not in 'text','button','radio','checkbox','search'
        # and printing the variable name
        elif (element.has_attr('type')) and (element['type'] not in ('text','button','radio','checkbox','search')):
            self.variable_name = element['type']
        # condition to check if the "title" attribute exists
        elif element.has_attr('title'):
            self.variable_name = element['title']
        # condition to check if the "role" attribute exists
        elif element.has_attr('role') and element['role']!="button":
            self.variable_name = element['role']
        else:
            self.variable_name = ''
 
        return self.variable_name.lower().replace("+/- ","").replace("| ","").replace(" / ","_").  \
        replace("/","_").replace(" - ","_").replace(" ","_").replace("&","").replace("-","_").      \
        replace("[","_").replace("]","").replace(",","").replace("__","_").replace(".com","").strip("_")
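
The chain of replace() calls at the end of the method turns a raw attribute value into something usable as a variable name. As a rough alternative sketch (not equivalent to the code above, and stricter about what it keeps), a single regular expression can collapse every run of characters outside a-z and 0-9 into one underscore. The helper name to_variable_name is hypothetical.

```python
import re


def to_variable_name(raw):
    """Lower-case `raw` and collapse anything non-alphanumeric to '_'."""
    name = re.sub(r"[^0-9a-z]+", "_", raw.lower())
    return name.strip("_")


print(to_variable_name("First Name"))
print(to_variable_name("Search & Filter"))
print(to_variable_name("__user-email__"))
```

Note that this sketch also replaces non-ASCII letters with underscores, so a value written entirely in a non-Latin script comes back as an empty string and would still need the manual naming step the utility asks for.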

Check for uniqueness of the generated XPath

A good XPath should match one and only one element. To make sure of this, generate_xpath() checks the number of elements each locator returns; if the count is anything other than 1, it moves on to the next attribute. In this way we make sure that every XPath we print is unique.

if len(driver.find_elements(By.XPATH,locator))==1:
    result_flag = True                                   
    print (locator.encode('utf-8').decode('latin-1'))
    break
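
The same "exactly one match" test can be tried without a browser. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree, which understands a subset of XPath, on a small invented form, and the helper is_unique is hypothetical. It shows that the check rejects both locators that match several elements and locators that match none.

```python
import xml.etree.ElementTree as ET

# Hypothetical form, invented for this illustration.
PAGE = ET.fromstring("""
<form>
  <input id="first" type="text" name="first_name"/>
  <input id="last" type="text" name="last_name"/>
  <button class="btn">Save</button>
</form>
""".strip())


def is_unique(root, path):
    """True when `path` selects exactly one element under `root`."""
    return len(root.findall(path)) == 1


CANDIDATES = (
    ".//input[@type='text']",       # matches both inputs
    ".//input[@name='last_name']",  # matches one input
    ".//input[@id='missing']",      # matches nothing
    ".//button[@class='btn']",      # matches the only button
)

for path in CANDIDATES:
    print(path, is_unique(PAGE, path))
```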

The generate_xpath() code above calls three different methods: guess_xpath(), guess_xpath_button() and guess_xpath_using_contains(). We discuss each of them in the sections below.

guess_xpath()
The guess_xpath() method accepts three arguments (tag, attr and element) and builds the XPath from them. Multi-valued attributes such as class are returned as a list, so we join them into a single string; we used encode() to work around Unicode errors caused by non-ASCII characters.

def guess_xpath(self,tag,attr,element):
        "Guess the xpath"
        # A multi-valued attribute such as class comes back as a list, so join it into a single string
        if type(element[attr]) is list:
            element[attr] = [i.encode('utf-8').decode('latin-1') for i in element[attr]]
            element[attr] = ' '.join(element[attr])
        self.xpath = "//%s[@%s='%s']"%(tag,attr,element[attr])
 
        return  self.xpath
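
One thing the format string above does not handle is an attribute value that itself contains an apostrophe (for example title="What's new"), which would produce an invalid XPath. The sketch below shows one common way to deal with that. The helper xpath_literal and the standalone guess_xpath function are hypothetical additions and not part of the utility. XPath 1.0 has no escape character, so a value containing both quote types has to be assembled with concat().

```python
def xpath_literal(value):
    """Return `value` as an XPath 1.0 string literal."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # Both quote types present: stitch the pieces together with concat().
    pieces = []
    for index, part in enumerate(value.split("'")):
        if index:
            pieces.append('"\'"')  # a lone apostrophe, wrapped in double quotes
        pieces.append("'%s'" % part)
    return "concat(%s)" % ", ".join(pieces)


def guess_xpath(tag, attr, value):
    """Build //tag[@attr=<literal>] with the value quoted safely."""
    return "//%s[@%s=%s]" % (tag, attr, xpath_literal(value))


print(guess_xpath("input", "placeholder", "Enter email"))
print(guess_xpath("input", "title", "What's new"))
print(guess_xpath("input", "title", 'Say "it\'s fine"'))
```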

guess_xpath_button()
We use the guess_xpath_button() method when a button has text. There will be situations where you cannot use any HTML attribute and have to rely on the text inside the element. The text() node test lets us find an element based on the text it contains. Since text() is not an attribute, it does not need the ‘@’ prefix.

def guess_xpath_button(self,tag,attr,element):
        "Guess the xpath for buttons"
        self.button_xpath = "//%s[%s='%s']"%(tag,attr,element)
 
        return  self.button_xpath

guess_xpath_using_contains()
In some cases we have to use the ‘contains‘ function, which matches on part of a value and so helps with partial or dynamically changing text. While testing our code against a few URLs, we encountered button tags whose text had leading and trailing spaces. In such cases an exact-match XPath like the one from guess_xpath_button() may not work. Hence we wrote another method called guess_xpath_using_contains(), which uses the ‘contains‘ function and generates the XPath as shown below.

def guess_xpath_using_contains(self,tag,attr,element):
        "Guess the xpath using contains keyword"
        self.button_contains_xpath = "//%s[contains(%s,'%s')]"%(tag,attr,element)
 
        return self.button_contains_xpath
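
The choice between the exact match and the contains() match can be summarised in a few lines. The sketch below mirrors that choice in a standalone function and also shows an alternative that the utility does not use: the standard XPath 1.0 function normalize-space(), which trims and collapses ordinary whitespace inside the XPath itself. Both function names are hypothetical.

```python
def button_text_xpath(raw_text):
    """Build a text-based XPath for a <button>, mirroring the post's choice."""
    stripped = raw_text.strip()
    if raw_text == stripped:
        # No surrounding whitespace: an exact match is safe.
        return "//button[text()='%s']" % stripped
    # Padded text: fall back to a partial match on the trimmed text.
    return "//button[contains(text(),'%s')]" % stripped


def button_normalized_xpath(raw_text):
    """Alternative: let XPath trim and collapse whitespace itself."""
    return "//button[normalize-space()='%s']" % " ".join(raw_text.split())


print(button_text_xpath("Log in"))
print(button_text_xpath("  Log in \n"))
print(button_normalized_xpath("  Log   in \n"))
```

A contains() match is looser than an exact match (a "Log in" locator would also match a "Log in with SSO" button), so the uniqueness check described earlier still matters for whichever form is used.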

How we tested this utility

We tested this utility with almost 50 different pages that have multiple input and button fields. We also tested with pages that have no input or button fields. We came across a few issues, which we investigated and fixed.

One of the issues was that when we ran the script in the Windows command prompt, the input() prompt appeared properly, but when we ran it in Git Bash there was a delay before the prompt showed up. The key to the problem is that the Windows console prints text to the screen as soon as possible, while MinGW (Git Bash) waits until the application tells it to update the screen. To fix this, you can pass the -u flag to the Python interpreter when running the script, which stops Python from buffering output in Git Bash, as shown below.

   python -u xpath_util.py
Screenshot of result of xpath_util.py
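
If passing -u every time is inconvenient, another option is to flush the prompt explicitly from inside the script. The sketch below is an assumption about how that could look and is not what the utility does; the helper ask is hypothetical. It writes the prompt, flushes it, and then reads a line, and the demonstration uses in-memory streams so that it runs without a terminal.

```python
import io
import sys


def ask(prompt, stdin=sys.stdin, stdout=sys.stdout):
    """Write `prompt`, flush it immediately, then read one line of input."""
    stdout.write(prompt)
    stdout.flush()  # push the prompt out even when stdout is block-buffered
    return stdin.readline().rstrip("\n")


# Demonstration with in-memory streams instead of a real terminal.
fake_in = io.StringIO("https://example.com\n")
fake_out = io.StringIO()
url = ask("Enter URL: ", stdin=fake_in, stdout=fake_out)
print(url)
```

In the real script this would be called as ask("Enter URL: ") in place of input("Enter URL: ").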

Putting it all together

"""
Qxf2 Services: Utility script to generate XPaths for the given URL
* Take the input URL from the user
* Parse the HTML content using BeautifulSoup
* Find all Input and Button tags
* Guess the XPaths
* Generate Variable names for the xpaths
* To run the script in Gitbash use command 'python -u utils/xpath_util.py'
 
"""
 
import re
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
 
class Xpath_Util:
    "Class to generate the xpaths"
 
    def __init__(self):
        "Initialize the required variables"
        self.elements = None
        self.guessable_elements = ['input','button']
        self.known_attribute_list = ['id','name','placeholder','value','title','type','class']
        self.variable_names = []
        self.button_text_lists = []
        self.language_counter = 1
 
    def generate_xpath(self,soup):
        "generate the xpath and assign the variable names"
        result_flag = False
        for guessable_element in self.guessable_elements:
            self.elements = soup.find_all(guessable_element)
            for element in self.elements:
                if (not element.has_attr("type")) or (element.has_attr("type") and element['type'] != "hidden"):
                    for attr in self.known_attribute_list:
                        if element.has_attr(attr):
                            locator = self.guess_xpath(guessable_element,attr,element)
                            if len(driver.find_elements(By.XPATH,locator))==1:
                                result_flag = True
                                variable_name = self.get_variable_names(element)
                                # checking for the unique variable names
                                if  variable_name != '' and variable_name not in self.variable_names:
                                    self.variable_names.append(variable_name)
                                    print ("%s_%s = %s"%(guessable_element, variable_name.encode('utf-8').decode('latin-1'), locator.encode('utf-8').decode('latin-1')))
                                    break
                                else:
                                    print (locator.encode('utf-8').decode('latin-1') + "----> Couldn't generate appropriate variable name for this xpath")
                        elif guessable_element == 'button' and element.getText():
                            button_text = element.getText()
                            if element.getText() == button_text.strip():
                                locator = self.guess_xpath_button(guessable_element,"text()",element.getText())
                            else:
                                locator = self.guess_xpath_using_contains(guessable_element,"text()",button_text.strip())
                            if len(driver.find_elements(By.XPATH, locator))==1:
                                result_flag = True
                                #Check for utf-8 characters in the button_text
                                matches = re.search(r"[^\x00-\x7F]",button_text)
                                if button_text.lower() not in self.button_text_lists:
                                    self.button_text_lists.append(button_text.lower())
                                    if not matches:
                                        # Striping and replacing characters before printing the variable name
                                        print ("%s_%s = %s"%(guessable_element,button_text.strip().strip("!?.").encode('utf-8').decode('latin-1').lower().replace(" + ","_").replace(" & ","_").replace(" ","_"), locator.encode('utf-8').decode('latin-1')))
                                    else:
                                        # printing the variable name with utf-8 characters along with language counter
                                        print ("%s_%s_%s = %s"%(guessable_element,"foreign_language",self.language_counter, locator.encode('utf-8').decode('latin-1')) + "---> Foreign language found, please change the variable name appropriately")
                                        self.language_counter +=1
                                else:
                                    # if the variable name is already taken
                                    print (locator.encode('utf-8').decode('latin-1') + "----> Couldn't generate appropriate variable name for this xpath")
                                break
 
                        elif guessable_element not in self.guessable_elements:
                            print("We are not supporting this guessable element")
 
        return result_flag
 
    def get_variable_names(self,element):
        "generate the variable names for the xpath"
        # condition to check the length of the 'id' attribute and ignore it if there are numerics in the 'id' attribute. Also ignoring id values having "input" and "button" strings.
        if (element.has_attr('id') and len(element['id'])>2) and bool(re.search(r'\d', element['id'])) == False and ("input" not in element['id'].lower() and "button" not in element['id'].lower()):
            self.variable_name = element['id'].strip("_")
        # condition to check if the 'value' attribute exists and not having date and time values in it.
        elif element.has_attr('value') and element['value'] != '' and bool(re.search(r'([\d]{1,}([/-]|\s|[.])?)+(\D+)?([/-]|\s|[.])?[[\d]{1,}',element['value']))== False and bool(re.search(r'\d{1,2}[:]\d{1,2}\s+((am|AM|pm|PM)?)',element['value']))==False:
            # condition to check if the 'type' attribute exists
            # getting the text() value if the 'type' attribute value is in 'radio','submit','checkbox','search'
            # if the text() is not '', getting the getText() value else getting the 'value' attribute
            # for the rest of the type attributes printing the 'type'+'value' attribute values. Doing a check to see if 'value' and 'type' attributes values are matching.
            if (element.has_attr('type')) and (element['type'] in ('radio','submit','checkbox','search')):
                if element.getText() !='':
                    self.variable_name = element['type']+ "_" + element.getText().strip().strip("_.")
                else:
                    self.variable_name = element['type']+ "_" + element['value'].strip("_.")
            else:
                # .get() avoids a KeyError when the element has no 'type' attribute
                if element.get('type','').lower() == element['value'].lower():
                    self.variable_name = element['value'].strip("_.")
                else:
                    self.variable_name = element.get('type','')+ "_" + element['value'].strip("_.")
        # condition to check if the "name" attribute exists and if the length of "name" attribute is more than 2 printing variable name
        elif element.has_attr('name') and len(element['name'])>2:
            self.variable_name = element['name'].strip("_")
        # condition to check if the "placeholder" attribute exists and is not having any numerics in it.
        elif element.has_attr('placeholder') and bool(re.search(r'\d', element['placeholder'])) == False:
            self.variable_name = element['placeholder']
        # condition to check if the "type" attribute exists and is not in 'text','button','radio','checkbox','search'
        # and printing the variable name
        elif (element.has_attr('type')) and (element['type'] not in ('text','button','radio','checkbox','search')):
            self.variable_name = element['type']
        # condition to check if the "title" attribute exists
        elif element.has_attr('title'):
            self.variable_name = element['title']
        # condition to check if the "role" attribute exists
        elif element.has_attr('role') and element['role']!="button":
            self.variable_name = element['role']
        else:
            self.variable_name = ''
 
        return self.variable_name.lower().replace("+/- ","").replace("| ","").replace(" / ","_").  \
        replace("/","_").replace(" - ","_").replace(" ","_").replace("&","").replace("-","_").      \
        replace("[","_").replace("]","").replace(",","").replace("__","_").replace(".com","").strip("_")
 
 
 
    def guess_xpath(self,tag,attr,element):
        "Guess the xpath based on the tag,attr,element[attr]"
        # A multi-valued attribute such as class comes back as a list, so join it into a single string
        if type(element[attr]) is list:
            element[attr] = [i.encode('utf-8').decode('latin-1') for i in element[attr]]
            element[attr] = ' '.join(element[attr])
        self.xpath = "//%s[@%s='%s']"%(tag,attr,element[attr])
 
        return  self.xpath
 
 
    def guess_xpath_button(self,tag,attr,element):
        "Guess the xpath for button tag"
        self.button_xpath = "//%s[%s='%s']"%(tag,attr,element)
 
        return  self.button_xpath
 
    def guess_xpath_using_contains(self,tag,attr,element):
        "Guess the xpath using contains function"
        self.button_contains_xpath = "//%s[contains(%s,'%s')]"%(tag,attr,element)
 
        return self.button_contains_xpath
 
 
#-------START OF SCRIPT--------
if __name__ == "__main__":
    print ("Start of %s"%__file__)
 
    #Initialize the xpath object
    xpath_obj = Xpath_Util()
 
    #Get the URL and parse
    url = input("Enter URL: ")
 
    #Create a chrome session
    driver = webdriver.Chrome()
    driver.get(url)
 
    #Parsing the HTML page with BeautifulSoup
    page = driver.execute_script("return document.body.innerHTML").\
    encode('utf-8').decode('latin-1')#returns the inner HTML as a string
    soup = BeautifulSoup(page, 'html.parser')
 
    #execute generate_xpath
    if xpath_obj.generate_xpath(soup) is False:
        print ("No XPaths generated for the URL:%s"%url)
 
    driver.quit()

We hope this post helps you build robust and simple XPaths for input and button fields using Python code, without having to depend on any tools or browser plugins. We are also working on getting good human-friendly variable names for the generated XPaths. We will post about it shortly.

NOTE: If you found this post useful, do check out our open-sourced Python test automation framework based on the page object pattern on GitHub.




42 thoughts on “Auto-generate XPaths using Python”

  1. This is very good code, and I am giving this code to my manual testers who struggle to get the XPath from a web page. Thanks Indira.

    1. Hi, You can generate Xpaths for other elements by adding the element name in the guessable_elements list and the attribute name in the known_attribute_list. You can then go ahead and set the desired variable name for your XPath under the get_variable_names method

  2. Hi Indira,
    I am trying to generate an XPath in this format, (//label[contains(text(),'NPI')]//following::input[1])[1], using your code. Can you please guide me on how that can be achieved? As there is no unique locator, I tend to fetch the text above the button/textbox and use following to fetch it.

    1. Hi Jithendra,
      Are there multiple label elements in the DOM? And are there input elements after each label element?
      Your XPath – (//label[contains(text(),'NPI')]//following::input[1])[1] suggests that you are trying to find the first input element after the first label in the form.

      Can you please share the node tree in your DOM

  3. if len(driver.find_elements(By.XPATH(locator).text))==1:
    ^^^^^^^^^^^^^^^^^
    TypeError: ‘str’ object is not callable im gettin this error

    1. Hi,
      The error points to driver variable. Can you check if you accidentally set driver to a string value?

  4. Hi,
    The auto-generator design is great, but many repeated XPaths are still being generated, and some XPaths are not recognized at all.
    I have created a new version of the auto-generator that also produces advanced locators and verifies the XPaths. My code was inspired by yours, so I would like to share it with you; mail me if you would like the code.

  5. Hi Sir,

    I am looking forward to using the script to auto-generate XPaths and save them to an external file like xls or csv.
    I would also like a UI that is integrated with a drop-down list of my POM model for each page.
    Kindly advise if there is a way to accomplish it.
    Your guidance would be very helpful.

    Thank you
    Joe

    1. Hi Joe,
      You could make use of Python’s csv module to write the generated XPaths to a CSV file. This utility has been written to get XPaths for button and input elements. However, you could add the “select” element to guessable_elements list and enhance the generate_xpath() class method to generate XPaths for drop-down lists as well.

  6. Hi. I am using this code and it only prints the XPath for certain elements, and it does not print the variable name for every element's XPath. So, do you have a newer version of this code that prints the XPath of all the web elements present in the webpage along with their variable names?

  7. I used the same code, but I want to generate the XPath for all types of tags, like [ ‘input’, ‘button’, ‘div’, ‘p’, ‘nav’, ‘a’, ‘img’, ‘table’, ‘label’, ‘ul’, ‘ol’, ‘li’, ‘tr’, ‘th’, ‘td’, ‘form’, ‘span’, ‘iframe’ ]. Do you have any idea on this?

    1. Hello,
      To include all tags like [‘input’,….., ‘iframe’], you need to modify the script to have additional logic for elements without these attributes and extend the script to create xpaths for these elements based on their tag names and positions as in the DOM.

    1. This script doesn’t support shadow DOM as of now. But I did some research on shadow DOM (https://stackoverflow.com/questions/55761810/how-to-automate-shadow-dom-elements-using-selenium) and it looks like Selenium 4 has a method WebElement.getShadowRoot() to get the shadow DOM element. I created a gist file with the updated code.
      https://gist.github.com/avinash010/03e09b94e0627f51db6b2d5020dc69c8

      I am finding the element based on “label” and tag name “input”. In case your page uses something different, you can update the handle_shadow_dom function accordingly and try it.
