Useful Python Packages For Parsing HTML Report

Static analysis tools such as IAR C-STAT, Cppcheck, and Flawfinder are able to generate reports in the form of HTML files. To learn how and why it is necessary to add a static analysis tool to your firmware builds, check out our 10 Easy Steps to Add Static Analysis to Your Firmware Build and Static Analysis: Is It Necessary for Your Automatic Build? posts. Although the generated reports can be uploaded to the build platform as artifacts, your team will appreciate it if they can find the analysis messages right in the build log. To achieve this, the analysis details need to be parsed from HTML reports. In this blog post, we will introduce three useful Python packages that you can use in your Python script to make life easier. We’ll demonstrate how to quickly and easily integrate these tools into your builds for CI Awesomeness.

Beautiful Soup

Beautiful Soup (bs4) is a Python library that is used to parse information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches.

It needs to be installed using pip. The package is published on PyPI as beautifulsoup4 and is imported as bs4:

pip3 install beautifulsoup4

To start parsing an HTML file, import the Beautiful Soup library and create a Beautiful Soup object as shown in the following code example. The second parameter names the parser. Python's built-in html.parser needs no extra installation. If the parameter is omitted, Beautiful Soup picks the best parser it finds installed and issues a warning, so it is better to name one explicitly. To use a different parser such as lxml or html5lib, pass its name as the second parameter. The downside of using another parser is that it has to be installed separately; it does not need to be imported into the script.

from bs4 import BeautifulSoup

# Open an HTML file and create a Beautiful Soup object
# using Python's built-in parser
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "html.parser")

# Use the lxml parser instead (requires: pip3 install lxml)
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "lxml")
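The step above can be tried without a report file. The following is a minimal sketch that parses a small inline HTML string; the sample markup and the variable names are made up for illustration and are not taken from a real report.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a report file, kept inline so the sketch runs as-is
sample_html = "<html><body><div id='content_index'><p>Hello</p></div></body></html>"

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(sample_html, "html.parser")

# The parsed document can be navigated by tag name
first_paragraph = soup.p
print(first_paragraph.text)
```

Naming the parser explicitly keeps the result independent of which optional parsers happen to be installed on the build machine.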

find()

This method searches through the HTML file and finds the first tag that matches the requested tag. In our case, it is used to search for the section we are interested in. The following is a section of a Cppcheck HTML report with messages.

<div id="content_index">
 <table>
 <tr><th>Line</th><th>Id</th><th>CWE</th><th>Severity</th><th>Message</th></tr>
 <tr><td colspan='4'><a href='1.html'>application/Src/stm32l4xx_it.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-74">74</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'NMI_Handler' is never used.</td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-87">87</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'HardFault_Handler' is never used.</td></tr>
 <tr><td colspan='4'><a href='0.html'>application/Src/system_stm32l4xx.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="0.html#line-180">180</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemInit' is never used.</td></tr>
 <tr class="unusedFunction"><td><a href="0.html#line-256">256</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemCoreClockUpdate' is never used.</td></tr>
 <tr class="ConfigurationNotChecked"><td><a href="0.html#line-280">280</a></td><td>ConfigurationNotChecked</td><td></td><td>information</td><td >Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.</td></tr>
 </table>
</div>

Since there is more than one table in the file, instead of looking for a table tag, we will go for the division tag with a specific attribute as shown in the following.

# find() usage: find(name, attrs, recursive, string, **kwargs)
content_index_section = bs_object.find("div", attrs={"id": "content_index"})
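One detail worth guarding against in a build script is that find() returns None when nothing matches. The sketch below illustrates this with a made-up two-table snippet; the "summary" id and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables, only one of which we care about
sample_html = """
<div id="summary"><table><tr><td>summary table</td></tr></table></div>
<div id="content_index"><table><tr><th>Line</th></tr></table></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Match on the id attribute rather than on the tag name alone
section = soup.find("div", attrs={"id": "content_index"})

# find() returns None when nothing matches, so check before using the result
missing = soup.find("div", attrs={"id": "does_not_exist"})
if missing is None:
    print("no such section")

print(section.table.th.text)
```

Checking for None up front gives a clear message in the build log if the report layout changes, instead of an AttributeError further down.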

find_all()

This method searches through the HTML file and finds all tags that match the requested tag. Once we find the section we are interested in, we can use find_all() to get all rows within the section because they contain the messages that we want to extract.

# find_all() usage: find_all(name, attrs, recursive, string, limit, **kwargs)
first_row = True
for row in content_index_section.table.find_all("tr"):
    headers = row.find_all("th")
    for header in headers:
        # Remove any newlines and extra spaces from left and right
        content = header.text.replace('\n', ' ').strip()
        # Print the headers
        print(content)

    # Skip the first row because it only has the headers
    # and data.attrs will throw an error if there is no
    # td tag in the row
    if not first_row:
        data = row.find("td")
        # Note: This is an assumption made by looking at the html file
        # The first cell of a file name row has an attribute (colspan)
        # while the first cell of a message detail row has none
        if data.attrs:
            file_name = data.text.replace('\n', ' ').strip()
            # Print a blank line to separate the files
            print()
            # Print the file name on the build log
            print(file_name)
        else:
            # If it is the row with message details,
            # we want all the columns with message details
            columns = row.find_all("td")
            for column in columns:
                # Remove any newlines and extra spaces from left and right
                messages_detail = column.text.replace('\n', ' ').strip()
                # Print the message details
                print(messages_detail)
    else:
        first_row = False

The script above goes through all rows and columns within the content index section and prints the headers and message details of the table shown above. The logic for retrieving information varies depending on the HTML report. We now have access to the information we are interested in, but it is not very readable, as the following output shows:

Line
Id
CWE
Severity
Message

application/Src/stm32l4xx_it.c
74
unusedFunction
561
style
The function 'NMI_Handler' is never used.
87
unusedFunction
561
style
The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
180
unusedFunction
561
style
The function 'SystemInit' is never used.
256
unusedFunction
561
style
The function 'SystemCoreClockUpdate' is never used.
280
ConfigurationNotChecked

information
Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
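Instead of printing each cell as it is found, the same traversal can collect the cells into lists, which makes them easier to format later. The following is a minimal sketch under the same assumption as the script above (a file name row is recognized by the colspan attribute on its first cell). The function name group_messages_by_file and the trimmed sample report are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the report excerpt shown earlier (two files, one message each)
sample_html = """
<div id="content_index">
 <table>
 <tr><th>Line</th><th>Id</th><th>CWE</th><th>Severity</th><th>Message</th></tr>
 <tr><td colspan='4'><a href='1.html'>application/Src/stm32l4xx_it.c</a></td></tr>
 <tr class="unusedFunction"><td><a href="1.html#line-74">74</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'NMI_Handler' is never used.</td></tr>
 <tr><td colspan='4'><a href='0.html'>application/Src/system_stm32l4xx.c</a></td></tr>
 <tr class="ConfigurationNotChecked"><td><a href="0.html#line-280">280</a></td><td>ConfigurationNotChecked</td><td></td><td>information</td><td >Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.</td></tr>
 </table>
</div>
"""


def group_messages_by_file(html):
    """Return (headers, {file_name: [row, ...]}) for a Cppcheck-style index table."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", attrs={"id": "content_index"})
    headers = []
    messages = {}
    current_file = None
    for row in section.table.find_all("tr"):
        header_cells = row.find_all("th")
        if header_cells:
            # The header row only contains th tags
            headers = [cell.get_text(strip=True) for cell in header_cells]
            continue
        cells = row.find_all("td")
        if not cells:
            continue
        if cells[0].has_attr("colspan"):
            # Assumption: a file name row spans several columns
            current_file = cells[0].get_text(strip=True)
            messages[current_file] = []
        elif current_file is not None:
            messages[current_file].append(
                [cell.get_text(strip=True) for cell in cells]
            )
    return headers, messages


headers, messages = group_messages_by_file(sample_html)
for file_name, rows in messages.items():
    print(file_name, len(rows))
```

Empty cells, such as the missing CWE number on the last row, come back as empty strings, so every row keeps the same number of columns.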

Pandas

After parsing out information from an HTML file, we need a place to store it. Pandas is a Python data analysis library that can be used to populate a table with the data parsed out of the HTML file. Similar to Beautiful Soup, Pandas needs to be installed using pip:

pip3 install pandas

Import the Pandas library to your Python script as shown in the following.

import pandas

To get a better view of the table, the Pandas display options need to be configured. By default, printing a table shows at most 60 rows and 20 columns (in a terminal, as many columns as fit the terminal width), and a table beyond those thresholds is truncated. When the string in a cell is longer than 50 characters, only part of the string is shown. When the table is wider than the display width (80 characters by default), it is wrapped across several lines.

Here’s a configuration for pandas that will remove these restrictions:

# This disables the threshold of the number of rows.
pandas.set_option('display.max_rows', None)
# This disables the threshold of the number of columns.
pandas.set_option('display.max_columns', None)
# This lets Pandas detect the display width from the terminal
# instead of using the default of 80 characters.
pandas.set_option('display.width', None)
# This allows the whole string in a cell to be shown even though it is long.
pandas.set_option('display.max_colwidth', None)
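If changing the options for the whole script is not desirable, pandas.option_context applies them only inside a with block. The following is a minimal sketch with a made-up one-column table; the variable names are hypothetical.

```python
import pandas

# A made-up message that is longer than the default column width limit
long_message = "x" * 80
table = pandas.DataFrame({"Message": [long_message]})

# With the default options, the long string is shortened in the printed table
default_text = repr(table)

# Inside the with block, the column width limit is lifted temporarily
with pandas.option_context("display.max_colwidth", None):
    full_text = repr(table)

print(long_message in default_text)
print(long_message in full_text)
```

The previous option values are restored when the with block exits, so other parts of the script keep their usual formatting.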

After updating the settings, we can create a table by calling DataFrame with the column and row indexes as shown. See the Pandas documentation for the definition of DataFrame.

# Create a table
table = pandas.DataFrame(columns=range(0, <NUM_OF_COLUMNS>), index=range(0, <NUM_OF_ROWS>))

Then, we populate the table with data and print the table on the CI build log.

# Populate the table
table.iat[<ROW_INDEX>, <COLUMN_INDEX>] = messages_detail

# Print the table
print(table)
application/Src/stm32l4xx_it.c
      0               1    2         3                                                4
0  Line              Id  CWE  Severity                                          Message
1    74  unusedFunction  561     style        The function 'NMI_Handler' is never used.
2    87  unusedFunction  561     style  The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
      0                        1    2            3                                                                                                                                                    4
0  Line                       Id  CWE     Severity                                                                                                                                              Message
1   180           unusedFunction  561        style                                                                                                             The function 'SystemInit' is never used.
2   256           unusedFunction  561        style                                                                                                  The function 'SystemCoreClockUpdate' is never used.
3   280  ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
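When the rows are already available as lists, the table can also be built in one call instead of being filled cell by cell. This is a minimal sketch of that alternative, not the approach used above; the rows are copied from the first file in the example report.

```python
import pandas

headers = ["Line", "Id", "CWE", "Severity", "Message"]
rows = [
    ["74", "unusedFunction", "561", "style",
     "The function 'NMI_Handler' is never used."],
    ["87", "unusedFunction", "561", "style",
     "The function 'HardFault_Handler' is never used."],
]

# Each inner list becomes one row; the headers become the column labels
table = pandas.DataFrame(rows, columns=headers)

# to_string() renders the whole table as text for the build log
print(table.to_string(index=False))
```

With this layout the column labels carry the header names, so the header no longer has to be stored as the first data row.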

There you have it, a simple table with the details of Cppcheck messages. There are a few improvements that can be made to the table. The row indexes and the numeric column labels can be removed by printing the table with to_string():

print(table.to_string(index=False, header=False))
application/Src/stm32l4xx_it.c
 Line              Id  CWE  Severity                                          Message
   74  unusedFunction  561     style        The function 'NMI_Handler' is never used.
   87  unusedFunction  561     style  The function 'HardFault_Handler' is never used.

application/Src/system_stm32l4xx.c
 Line                       Id  CWE     Severity                                                                                                                                              Message
  180           unusedFunction  561        style                                                                                                             The function 'SystemInit' is never used.
  256           unusedFunction  561        style                                                                                                  The function 'SystemCoreClockUpdate' is never used.
  280  ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.

Pandas right-aligns the strings in each cell by default, and left-aligning them takes extra work. The next package does the job with a single call.

Tabulate

When you print out the tables created by using Pandas, you will realize that the right-aligned string in each cell is hard to read. Tabulate pretty-prints the table on the CI build log.

Tabulate needs to be installed using pip as well:

pip3 install tabulate

Import the tabulate function into your Python script as shown in the following statement:

from tabulate import tabulate

To fix the right-aligned strings, pass the Pandas table into tabulate, and print the return value.

print(tabulate(table, showindex=False))
application/Src/stm32l4xx_it.c
----  --------------  ---  --------  -----------------------------------------------
Line  Id              CWE  Severity  Message
74    unusedFunction  561  style     The function 'NMI_Handler' is never used.
87    unusedFunction  561  style     The function 'HardFault_Handler' is never used.
----  --------------  ---  --------  -----------------------------------------------

application/Src/system_stm32l4xx.c
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------
Line  Id                       CWE  Severity     Message
180   unusedFunction           561  style        The function 'SystemInit' is never used.
256   unusedFunction           561  style        The function 'SystemCoreClockUpdate' is never used.
280   ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------

Having static analysis messages in the CI build log can result in a faster development cycle for your team because they can get build failure and health information directly from the build log without having to download the artifacts and go through the HTML reports. We hope you’ll try it out – and please let us know if you have any trouble. We’d be HAPPY to help!

Next Steps

We have talented engineers on hand ready to help you with all aspects of your EmbedOps journey. You can Book a Call with us to discuss your embedded project, or sign up for free for our EmbedOps CI and build management product.

