Useful Python Packages For Parsing HTML Report

Static analysis tools such as IAR C-STAT, Cppcheck, and Flawfinder are able to generate reports in the form of HTML files. To learn how and why it is necessary to add a static analysis tool to your firmware builds, check out our 10 Easy Steps to Add Static Analysis to Your Firmware Build and Static Analysis: Is It Necessary for Your Automatic Build? posts. Although the generated reports can be uploaded to the build platform as artifacts, your team will appreciate it if they can find the analysis messages right in the build log. To achieve this, the analysis details need to be parsed from HTML reports. In this blog post, we will introduce three useful Python packages that you can use in your Python script to make life easier. We’ll demonstrate how to quickly and easily integrate these tools into your builds for CI Awesomeness.

Beautiful Soup

Beautiful Soup (bs4) is a Python library that is used to parse information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches.

Install it using pip (the package is published on PyPI as beautifulsoup4):

pip3 install beautifulsoup4

To start parsing an HTML file, import the Beautiful Soup library and create a Beautiful Soup object as shown in the following code example. The parser is selected with the second argument: "html.parser" is the parser built into Python's standard library, while "lxml" and "html5lib" are third-party parsers. If the argument is omitted, Beautiful Soup picks the best parser it finds installed and issues a warning, so the result can differ between machines. The downside of a third-party parser is that it has to be installed separately; it does not need to be imported into the script.

from bs4 import BeautifulSoup

# Open an HTML file and create a Beautiful Soup object
# using the parser from Python's standard library
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "html.parser")

# Use the lxml parser instead (requires: pip3 install lxml)
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "lxml")
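To try the calls above without a real report, the markup can also be passed in as a string. The following is a minimal sketch; SAMPLE_HTML and load_report are hypothetical names used only for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down report used only for illustration.
SAMPLE_HTML = """
<html><body>
<div id="content_index">
 <table>
  <tr><th>Line</th><th>Message</th></tr>
  <tr><td>74</td><td>The function 'NMI_Handler' is never used.</td></tr>
 </table>
</div>
</body></html>
"""


def load_report(path):
    """Parse an HTML report from disk and close the file afterwards."""
    with open(path, encoding="utf-8") as report_file:
        return BeautifulSoup(report_file, "html.parser")


# Markup can be passed directly as a string instead of a file object.
bs_object = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute access returns the first tag with that name.
first_header = bs_object.th.text
print(first_header)
```

Naming the parser explicitly keeps the behavior the same on every build machine, whether or not lxml happens to be installed there.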

find()

This method searches through the HTML file and finds the first tag that matches the requested tag. In our case, it is used to search for the section we are interested in. The following is a section of a Cppcheck HTML report with messages.

<div id="content_index">
 <table>
 <tr><th>Line</th><th>Id</th><th>CWE</th><th>Severity</th><th>Message</th></tr>
 <tr><td colspan='4'><a href='1.html'>application/Src/stm32l4xx_it.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-74">74</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'NMI_Handler' is never used.</td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-87">87</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'HardFault_Handler' is never used.</td></tr>
 <tr><td colspan='4'><a href='0.html'>application/Src/system_stm32l4xx.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="0.html#line-180">180</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemInit' is never used.</td></tr>
 <tr class="unusedFunction"><td><a href="0.html#line-256">256</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemCoreClockUpdate' is never used.</td></tr>
 <tr class="ConfigurationNotChecked"><td><a href="0.html#line-280">280</a></td><td>ConfigurationNotChecked</td><td></td><td>information</td><td >Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.</td></tr>
 </table>
</div>

Since there is more than one table in the file, instead of looking for a table tag, we will go for the division tag with a specific attribute as shown in the following.

# find() signature:
#   find(name, attrs, recursive, string, **kwargs)
content_index_section = bs_object.find("div", attrs={"id": "content_index"})
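One detail worth guarding against in a build script is that find() returns None when nothing matches, for example if the report layout changes between tool versions. The sketch below uses a small hypothetical HTML string with two div sections; only the id content_index comes from the report shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two sections that each contain a table.
html = """
<div id="header"><table><tr><td>summary</td></tr></table></div>
<div id="content_index"><table><tr><th>Line</th></tr></table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on the id attribute through the attrs dictionary...
section = soup.find("div", attrs={"id": "content_index"})
# ...or through the equivalent keyword argument.
same_section = soup.find("div", id="content_index")

# find() returns None when no tag matches.
missing = soup.find("div", attrs={"id": "does_not_exist"})

if section is None:
    raise SystemExit("content_index section not found in the report")

header_text = section.table.th.text
print(header_text)
```

Failing early with a clear message is usually easier to debug in a CI log than an AttributeError raised later on a None value.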

find_all()

This method searches through the HTML file and finds all tags that match the requested tag. Once we find the section we are interested in, we can use find_all() to get all rows within the section because they contain the messages that we want to extract.

# find_all() signature:
#   find_all(name, attrs, recursive, string, limit, **kwargs)
first_row = True
for row in content_index_section.table.find_all("tr"):
    headers = row.find_all("th")
    for header in headers:
        # Remove any newlines and extra spaces from left and right
        content = header.text.replace('\n', ' ').strip()
        # Print the headers
        print(content)

    # Skip the first row because it only has the headers
    # and data.attrs will throw an error if there is no
    # td tag in the row
    if not first_row:
        data = row.find("td")
        # Note: This is an assumption made by looking at the html file
        # The first cell of a row with the file name has an attribute
        # (colspan) while the first cell of a row with messages detail
        # doesn't have an attribute
        if data.attrs:
            file_name = data.text.replace('\n', ' ').strip()
            print('\n')
            # Print the file name on the build log
            print(file_name)
        else:
            # If it is the row with messages detail,
            # we want all the columns with messages detail
            columns = row.find_all("td")
            for column in columns:
                # Remove any newlines and extra spaces from left and right
                messages_detail = column.text.replace('\n', ' ').strip()
                # Print the messages detail
                print(messages_detail)
    else:
        first_row = False

The script above goes through all rows and columns within the content index section to print out the headers and messages detail of the table shown above. The logic for retrieving information varies depending on the HTML report. Now we have access to the information we are interested in, but it is not readable, as shown in the following example:

Line
Id
CWE
Severity
Message

application/Src/stm32l4xx_it.c
74
unusedFunction
561
style
The function 'NMI_Handler' is never used.
87
unusedFunction
561
style
The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
180
unusedFunction
561
style
The function 'SystemInit' is never used.
256
unusedFunction
561
style
The function 'SystemCoreClockUpdate' is never used.
280
ConfigurationNotChecked

information
Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
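Instead of printing each cell as it is found, the same traversal can collect the cells into Python lists grouped by file, which makes the later formatting steps simpler. This is a sketch under the same assumption as the script above (the file-name cell carries an attribute, message cells do not); REPORT is a shortened, hypothetical copy of the Cppcheck section and collect_messages is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the Cppcheck section shown earlier.
REPORT = """
<div id="content_index">
 <table>
 <tr><th>Line</th><th>Id</th><th>CWE</th><th>Severity</th><th>Message</th></tr>
 <tr><td colspan='4'><a href='1.html'>application/Src/stm32l4xx_it.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-74">74</a></td><td>unusedFunction</td><td>561</td><td>style</td><td >The function 'NMI_Handler' is never used.</td></tr>
 <tr><td colspan='4'><a href='0.html'>application/Src/system_stm32l4xx.c</a></td></tr>
 <tr class="ConfigurationNotChecked"><td><a href="0.html#line-280">280</a></td><td>ConfigurationNotChecked</td><td></td><td>information</td><td >Skipping configuration 'HSI_VALUE'.</td></tr>
 </table>
</div>
"""


def collect_messages(soup):
    """Return (headers, {file_name: [row, ...]}) from the content index table."""
    section = soup.find("div", attrs={"id": "content_index"})
    headers = []
    messages = {}
    current_file = None
    for row in section.table.find_all("tr"):
        header_cells = row.find_all("th")
        if header_cells:
            headers = [cell.get_text(strip=True) for cell in header_cells]
            continue
        cells = row.find_all("td")
        if not cells:
            continue
        if cells[0].attrs:
            # Assumption: only the file-name cell has an attribute (colspan).
            current_file = cells[0].get_text(strip=True)
            messages[current_file] = []
        elif current_file is not None:
            messages[current_file].append(
                [cell.get_text(strip=True) for cell in cells]
            )
    return headers, messages


headers, messages = collect_messages(BeautifulSoup(REPORT, "html.parser"))
for file_name, rows in messages.items():
    print(file_name, len(rows))
```

Keeping the rows per file in a dictionary means each file's messages can be turned into its own table in the next step.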

Pandas

After parsing out information from an HTML file, we need a place to store it. Pandas is a Python data analysis library that can be used to populate a table with the data parsed out of the HTML file. Similar to Beautiful Soup, Pandas needs to be installed using pip:

pip3 install pandas

Import the Pandas library to your Python script as shown in the following.

import pandas

To have a better view of the table, the display options need to be configured. By default, Pandas truncates a printed table once it exceeds display.max_rows rows or display.max_columns columns, shortens any cell longer than display.max_colwidth characters, and wraps a table wider than display.width characters onto several blocks of lines. The exact default values depend on the Pandas version and on whether the output is a terminal.

Here’s a configuration for pandas that will remove these restrictions:

# This disables the threshold of the number of rows.
pandas.set_option('display.max_rows', None)
# This disables the threshold of the number of columns.
pandas.set_option('display.max_columns', None)
# This allows the table to be fit in a single line.
pandas.set_option('display.width', None)
# This allows whole string to be shown even though it is long.
pandas.set_option('display.max_colwidth', None)
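If the script prints other tables that should keep the default limits, the same options can be applied temporarily with pandas.option_context. The sketch below assumes default display settings at the start; the one-cell table and the variable names are hypothetical.

```python
import pandas

# Hypothetical one-cell table with a message longer than the default cell limit.
long_message = "x" * 80
table = pandas.DataFrame({"Message": [long_message]})

width_before = pandas.get_option("display.max_colwidth")

# With the default settings the long cell is shortened with "...".
default_repr = repr(table)

# The options are changed only inside the with block.
with pandas.option_context("display.max_colwidth", None, "display.width", None):
    full_repr = repr(table)

# After the block the previous setting is restored.
width_after = pandas.get_option("display.max_colwidth")
print(full_repr)
```

set_option changes the settings for the rest of the script, while option_context restores them when the block ends; which one fits better depends on how many tables the script prints.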

After updating the settings, we can now create a table by calling DataFrame with the number of columns and rows as shown. See the Pandas documentation for the definition of DataFrame.

# Create a table
table = pandas.DataFrame(columns=range(0, <NUM_OF_COLUMNS>), index=range(0, <NUM_OF_ROWS>))

Then, we populate the table with data and print the table on the CI build log.

# Populate the table
table.iat[<ROW_INDEX>, <COLUMN_INDEX>] = messages_detail

# Print the table
print(table)

application/Src/stm32l4xx_it.c
      0               1    2         3                                                4
0  Line              Id  CWE  Severity                                          Message
1    74  unusedFunction  561     style        The function 'NMI_Handler' is never used.
2    87  unusedFunction  561     style  The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
      0                        1    2            3                                                                                                                                                    4
0  Line                       Id  CWE     Severity                                                                                                                                              Message
1   180           unusedFunction  561        style                                                                                                             The function 'SystemInit' is never used.
2   256           unusedFunction  561        style                                                                                                  The function 'SystemCoreClockUpdate' is never used.
3   280  ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
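As an alternative to sizing the table up front and filling it cell by cell, a DataFrame can also be built in one call from a list of rows. This is a sketch with hand-written sample rows taken from the report above; using the header row as column labels is a design choice of this example, not something the original script does.

```python
import pandas

# Sample data copied by hand from the Cppcheck report shown earlier.
headers = ["Line", "Id", "CWE", "Severity", "Message"]
rows = [
    ["74", "unusedFunction", "561", "style",
     "The function 'NMI_Handler' is never used."],
    ["87", "unusedFunction", "561", "style",
     "The function 'HardFault_Handler' is never used."],
]

# Each inner list becomes one row; the headers become the column labels.
table = pandas.DataFrame(rows, columns=headers)

# Cells can still be read or updated by position with iat.
second_line = table.iat[1, 0]

print(table.to_string(index=False))
```

With the headers stored as column labels, the table no longer needs a data row that repeats them.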

There you have it, a simple table with the details of Cppcheck messages. There are a few improvements that can be made to the table. The row and column indexes can be removed by converting the table to a string with index=False and header=False before printing it:

print(table.to_string(index=False, header=False))

application/Src/stm32l4xx_it.c
 Line              Id  CWE  Severity                                          Message
   74  unusedFunction  561     style        The function 'NMI_Handler' is never used.
   87  unusedFunction  561     style  The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
 Line                       Id  CWE     Severity                                                                                                                                              Message
  180           unusedFunction  561        style                                                                                                             The function 'SystemInit' is never used.
  256           unusedFunction  561        style                                                                                                  The function 'SystemCoreClockUpdate' is never used.
  280  ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.

Left-aligning the strings in each cell is not straightforward with Pandas, but the next package will do the job.

Tabulate

When you print out the tables created by using Pandas, you will notice that the right-aligned string in each cell is hard to read. Tabulate pretty-prints the table on the CI build log.

Tabulate needs to be installed using pip as well:

pip3 install tabulate

Import the tabulate function into your Python script as shown in the following statement:

from tabulate import tabulate

To fix the right-aligned strings, pass the Pandas table into tabulate, and print the return value.

print(tabulate(table, showindex=False))

application/Src/stm32l4xx_it.c
----  --------------  ---  --------  -----------------------------------------------
Line  Id              CWE  Severity  Message
74    unusedFunction  561  style     The function 'NMI_Handler' is never used.
87    unusedFunction  561  style     The function 'HardFault_Handler' is never used.
----  --------------  ---  --------  -----------------------------------------------

application/Src/system_stm32l4xx.c
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------
Line  Id                       CWE  Severity     Message
180   unusedFunction           561  style        The function 'SystemInit' is never used.
256   unusedFunction           561  style        The function 'SystemCoreClockUpdate' is never used.
280   ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------

Having static analysis messages in the CI build log can result in a faster development cycle for your team because they can get build failure and health information directly from the build log without having to download the artifacts and go through the HTML reports. We hope you’ll try it out – and please let us know if you have any trouble. We’d be HAPPY to help!

And if you have questions about an embedded project you’re working on, Dojo Five can help you with all aspects of your EmbedOps journey! We are always happy to hear about cool projects or interesting problems to solve, so don’t hesitate to reach out and chat with us on LinkedIn or through email!
