Useful Python Packages for Parsing HTML Reports

Static analysis tools such as IAR C-STAT, Cppcheck, and Flawfinder are able to generate reports in the form of HTML files. To learn how and why it is necessary to add a static analysis tool to your firmware builds, check out our 10 Easy Steps to Add Static Analysis to Your Firmware Build and Static Analysis: Is It Necessary for Your Automatic Build? posts. Although the generated reports can be uploaded to the build platform as artifacts, your team will appreciate it if they can find the analysis messages right in the build log. To achieve this, the analysis details need to be parsed from HTML reports. In this blog post, we will introduce three useful Python packages that you can use in your Python script to make life easier. We’ll demonstrate how to quickly and easily integrate these tools into your builds for CI Awesomeness.
Beautiful Soup
Beautiful Soup (bs4) is a Python library that is used to parse information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches.
It needs to be installed using pip:
pip3 install beautifulsoup4
To start parsing an HTML file, import the Beautiful Soup library and create a Beautiful Soup object as shown in the following code example. The second parameter names the parser to use. Python's built-in html.parser module needs no extra installation. If the parameter is omitted, Beautiful Soup picks the best parser that is installed and prints a warning, so it is safer to name the parser explicitly. To use a different parser such as lxml or html5lib, pass its name as the second parameter. The downside of using another parser is that it has to be installed separately.
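from bs4 import BeautifulSoup
# Open an HTML file and create a Beautiful Soup object
# using Python's built-in parser
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "html.parser")
# Use the parser from the lxml package instead
# (install it with pip3 install lxml; importing lxml is not required)
with open(<PATH_TO_HTML_FILE>, encoding='utf-8') as html_file:
    bs_object = BeautifulSoup(html_file, "lxml")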
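For a quick experiment without a real report, the same steps can be tried on a throwaway file. This is a minimal sketch: the `sample_html` string and the `index.html` file name are made up for illustration and are not taken from any tool's output.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of an HTML report
sample_html = "<html><body><div id='content_index'><p>Hello</p></div></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample to disk so it can be opened like a real report
    report_path = Path(tmp_dir) / "index.html"
    report_path.write_text(sample_html, encoding="utf-8")

    # The file is read while the object is created, so it can be closed afterwards
    with open(report_path, encoding="utf-8") as report_file:
        bs_object = BeautifulSoup(report_file, "html.parser")

# Tags can be reached as attributes of the parsed object
print(bs_object.div["id"])
print(bs_object.p.text)
```

Opening the file in a `with` block closes the file handle as soon as parsing is done, which matters little in a short script but is a good habit in a long-running build step.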
find()
This method searches through the parsed HTML and returns the first tag that matches the requested tag, or None if nothing matches. In our case, it is used to search for the section we are interested in. The following is a section of a Cppcheck HTML report with messages.
<div id="content_index">
<table>
<tr><th>Line</th><th>Id</th><th>CWE</th><th>Severity</th><th>Message</th></tr>
<tr><td colspan='4'><a href='1.html'>application/Src/stm32l4xx_it.c</a></td></tr>
<tr class="unusedFunction"><td><a href="1.html#line-74">74</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'NMI_Handler' is never used.</td></tr>
<tr class="unusedFunction"><td><a href="1.html#line-87">87</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'HardFault_Handler' is never used.</td></tr>
<tr><td colspan='4'><a href='0.html'>application/Src/system_stm32l4xx.c</a></td></tr>
<tr class="unusedFunction"><td><a href="0.html#line-180">180</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemInit' is never used.</td></tr>
<tr class="unusedFunction"><td><a href="0.html#line-256">256</a></td><td>unusedFunction</td><td><a href='https://cwe.mitre.org/data/definitions/561.html'>561</a></td><td>style</td><td >The function 'SystemCoreClockUpdate' is never used.</td></tr>
<tr class="ConfigurationNotChecked"><td><a href="0.html#line-280">280</a></td><td>ConfigurationNotChecked</td><td></td><td>information</td><td >Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.</td></tr>
</table>
</div>
Since there is more than one table in the file, instead of looking for a table tag, we search for the div tag with a specific id attribute, as shown in the following example.
# find() usage:
# find(name, attrs, recursive, string, **kwargs)
content_index_section = bs_object.find("div", attrs={"id": "content_index"})
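The behavior of find() can be checked on a small document. The sketch below uses a made-up two-section HTML string (the `header` section and the `no_such_id` lookup are illustrative assumptions) to show that the attribute filter selects the right section and that a failed search returns None instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two div sections, each holding a table
sample_html = """
<div id="header"><table><tr><td>summary</td></tr></table></div>
<div id="content_index"><table><tr><th>Line</th></tr></table></div>
"""
bs_object = BeautifulSoup(sample_html, "html.parser")

# Select the section by its id rather than by tag name alone
section = bs_object.find("div", attrs={"id": "content_index"})
first_header = section.find("th").text

# A search with no match returns None
missing = bs_object.find("div", attrs={"id": "no_such_id"})

print(first_header)
print(missing)
```

Because a failed search returns None, it is worth checking the result before chaining further calls on it when the report layout might change between tool versions.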
find_all()
This method searches through the HTML file and finds all tags that match the requested tag. Once we find the section we are interested in, we can use find_all() to get all rows within the section because they contain the messages that we want to extract.
# find_all() usage:
# find_all(name, attrs, recursive, string, limit, **kwargs)
first_row = True
for row in content_index_section.table.find_all("tr"):
    headers = row.find_all("th")
    for header in headers:
        # Remove any newlines and extra spaces from left and right
        content = header.text.replace('\n', ' ').strip()
        # Print the headers
        print(content)
    # Skip the first row because it only has the headers
    # and data.attrs will throw an error if there is no
    # td tag in the row
    if not first_row:
        data = row.find("td")
        # Note: This is an assumption made by looking at the html file
        # The row with the file name has an attribute while the row
        # with messages detail doesn't have an attribute
        if data.attrs:
            file_name = data.text.replace('\n', ' ').strip()
            print('\n')
            # Print the file name on the build log
            print(file_name)
        else:
            # If it is the row with messages detail,
            # we want all the columns with messages detail
            columns = row.find_all("td")
            for column in columns:
                # Remove any newlines and extra spaces from left and right
                messages_detail = column.text.replace('\n', ' ').strip()
                # Print the messages detail
                print(messages_detail)
    else:
        first_row = False
The script above goes through all rows and columns within the content index section to print out the headers and messages detail of the table shown above. The logic of retrieving information varies depending on the HTML report. Now we have access to the information we are interested in, but it is not readable, as shown in the following example:
Line
Id
CWE
Severity
Message
application/Src/stm32l4xx_it.c
74
unusedFunction
561
style
The function 'NMI_Handler' is never used.
87
unusedFunction
561
style
The function 'HardFault_Handler' is never used.
application/Src/system_stm32l4xx.c
180
unusedFunction
561
style
The function 'SystemInit' is never used.
256
unusedFunction
561
style
The function 'SystemCoreClockUpdate' is never used.
280
ConfigurationNotChecked
information
Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
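Instead of printing each cell as it is found, the rows can be collected first and formatted later. The following is a minimal sketch of that idea, assuming the same layout as the Cppcheck excerpt (a header row, a file-name row carrying a colspan attribute, and message rows). The shortened `sample_html` and the names `messages_by_file` and `current_file` are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened, made-up report in the same shape as the Cppcheck excerpt
sample_html = """
<div id="content_index">
<table>
<tr><th>Line</th><th>Id</th><th>Severity</th><th>Message</th></tr>
<tr><td colspan='4'><a href='1.html'>src/main.c</a></td></tr>
<tr class="unusedFunction"><td>74</td><td>unusedFunction</td><td>style</td><td >The function 'foo' is never used.</td></tr>
</table>
</div>
"""
bs_object = BeautifulSoup(sample_html, "html.parser")
section = bs_object.find("div", attrs={"id": "content_index"})

headers = []
messages_by_file = {}  # file name -> list of message rows
current_file = None

for row in section.find_all("tr"):
    header_cells = row.find_all("th")
    if header_cells:
        # Header row: remember the column names
        headers = [cell.get_text(strip=True) for cell in header_cells]
        continue
    cells = row.find_all("td")
    if not cells:
        continue
    if cells[0].has_attr("colspan"):
        # File-name row: start a new group
        current_file = cells[0].get_text(strip=True)
        messages_by_file.setdefault(current_file, [])
    else:
        # Message row: store every column as a list of strings
        values = [cell.get_text(strip=True) for cell in cells]
        messages_by_file.setdefault(current_file, []).append(values)

print(headers)
print(messages_by_file)
```

Keeping the data in a dictionary keyed by file name separates parsing from presentation, so the same rows can later be handed to any table formatter.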
Pandas
After parsing out information from an HTML file, we need a place to store it. Pandas is a Python data analysis library that can be used to populate a table with the data parsed out of the HTML file. Similar to Beautiful Soup, Pandas needs to be installed using pip:
pip3 install pandas
Import the Pandas library to your Python script as shown in the following.
import pandas
To have a better view of the table, its display needs to be configured. By default, Pandas limits what it prints: display.max_rows defaults to 60, display.max_colwidth to 50 characters, and display.width to 80 characters, and the number of columns shown is limited as well. When the table has more rows or columns than the threshold, it is truncated. When the string in a column is too long, only part of the string is shown. When the table is wider than the display width, it is wrapped and no longer fits on a single line.
Here’s a configuration for pandas that will remove these restrictions:
# This disables the threshold of the number of rows.
pandas.set_option('display.max_rows', None)
# This disables the threshold of the number of columns.
pandas.set_option('display.max_columns', None)
# This allows the table to be fit in a single line.
pandas.set_option('display.width', None)
# This allows whole string to be shown even though it is long.
pandas.set_option('display.max_colwidth', None)
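If changing the settings globally is not desirable, Pandas also offers option_context, which applies options only inside a with block. The sketch below uses a made-up one-cell table with a 120-character string to show the difference between the default view and the unrestricted view.

```python
import pandas

# Hypothetical table with one very long string
table = pandas.DataFrame({"Message": ["x" * 120]})

# With the default settings the long string is cut short
truncated_view = repr(table)

# Inside the with block the limits are lifted; they are restored afterwards
with pandas.option_context("display.max_colwidth", None, "display.width", None):
    full_view = repr(table)

print(truncated_view)
print(full_view)
```

This keeps the relaxed display settings local to the code that prints the report, which can help when the same script prints other tables with the defaults.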
After updating the settings, we can now create a table by calling DataFrame with the number of columns and rows as shown. See the Pandas documentation for the definition of DataFrame.
# Create a table
table = pandas.DataFrame(columns=range(0, <NUM_OF_COLUMNS>), index=range(0, <NUM_OF_ROWS>))
Then, we populate the table with data and print the table on the CI build log.
# Populate the table
table.iat[<ROW_INDEX>, <COLUMN_INDEX>] = messages_detail
# Print the table
print(table)
application/Src/stm32l4xx_it.c
0 1 2 3 4
0 Line Id CWE Severity Message
1 74 unusedFunction 561 style The function 'NMI_Handler' is never used.
2 87 unusedFunction 561 style The function 'HardFault_Handler' is never used.
application/Src/system_stm32l4xx.c
0 1 2 3 4
0 Line Id CWE Severity Message
1 180 unusedFunction 561 style The function 'SystemInit' is never used.
2 256 unusedFunction 561 style The function 'SystemCoreClockUpdate' is never used.
3 280 ConfigurationNotChecked information Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
There you have it, a simple table with the details of Cppcheck messages. There are a few improvements that can be made to the table. The row and column indexes can be removed by printing the table with to_string():
print(table.to_string(index=False, header=False))
application/Src/stm32l4xx_it.c
Line Id CWE Severity Message
74 unusedFunction 561 style The function 'NMI_Handler' is never used.
87 unusedFunction 561 style The function 'HardFault_Handler' is never used.
application/Src/system_stm32l4xx.c
Line Id CWE Severity Message
180 unusedFunction 561 style The function 'SystemInit' is never used.
256 unusedFunction 561 style The function 'SystemCoreClockUpdate' is never used.
280 ConfigurationNotChecked information Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
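The table above was filled cell by cell. As an alternative sketch, when the rows have already been collected into lists, a DataFrame can be built from them in one call. The `headers` and `rows` variables below are assumptions for illustration, with values copied from the Cppcheck excerpt.

```python
import pandas

# Column names and rows as they might be collected from the report
headers = ["Line", "Id", "CWE", "Severity", "Message"]
rows = [
    ["74", "unusedFunction", "561", "style",
     "The function 'NMI_Handler' is never used."],
    ["87", "unusedFunction", "561", "style",
     "The function 'HardFault_Handler' is never used."],
]

# Build the whole table at once; the headers become the column labels
table = pandas.DataFrame(rows, columns=headers)

# Print without the row index
print(table.to_string(index=False))
```

Using the header names as column labels avoids storing the header as a data row, so the numeric column index no longer needs to be hidden.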
Pandas has no simple option for left-aligning the strings in each cell, but the next package does the job.
Tabulate
When you print out the tables created by using Pandas, you will realize that the right-aligned string in each cell is hard to read. Tabulate pretty-prints the table on the CI build log.
Tabulate needs to be installed using pip as well:
pip3 install tabulate
Import the Tabulate library to your Python script as shown in the following statement:
from tabulate import tabulate
To fix the right-aligned strings, pass the Pandas table into tabulate, and print the return value.
print(tabulate(table, showindex=False))
application/Src/stm32l4xx_it.c
----  --------------  ---  --------  -----------------------------------------------
Line  Id              CWE  Severity  Message
74    unusedFunction  561  style     The function 'NMI_Handler' is never used.
87    unusedFunction  561  style     The function 'HardFault_Handler' is never used.
----  --------------  ---  --------  -----------------------------------------------
application/Src/system_stm32l4xx.c
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------
Line  Id                       CWE  Severity     Message
180   unusedFunction           561  style        The function 'SystemInit' is never used.
256   unusedFunction           561  style        The function 'SystemCoreClockUpdate' is never used.
280   ConfigurationNotChecked       information  Skipping configuration 'HSI_VALUE' since the value of 'HSI_VALUE' is unknown. Use -D if you want to check it. You can use -U to skip it explicitly.
----  -----------------------  ---  -----------  ---------------------------------------------------------------------------------------------------------------------------------------------------
Having static analysis messages in the CI build log can result in a faster development cycle for your team because they can get build failure and health information directly from the build log without having to download the artifacts and go through the HTML reports. We hope you’ll try it out – and please let us know if you have any trouble. We’d be HAPPY to help!
And if you have questions about an embedded project you’re working on, Dojo Five can help you with all aspects of your EmbedOps journey! We are always happy to hear about cool projects or interesting problems to solve, so don’t hesitate to reach out and chat with us on LinkedIn or through email!