Reading HTML Tables from Web Pages using Pyramid and the Python Source Connector
Introduction
You’re working on an analytic project and have all the data you need, minus some crucial data that you suspect may be available on the internet. You do a search and Bingo! You find exactly what you need online nicely formatted as an HTML table.
http://volcano.oregonstate.edu/volcano_table
Volcano Name |
Country |
Type |
Latitude (dd) |
Longitude (dd) |
Elevation (m) |
Japan |
Shield |
34.50 |
131.60 |
641 |
|
Chile |
Stratovolcano |
-23.30 |
-67.62 |
6046 |
|
Guatemala |
Stratovolcano |
14.50 |
-90.88 |
3976 |
|
Turkey |
Caldera |
38.57 |
34.52 |
1689 |
Etc.
Now how to get that data into your Pyramid data flow and subsequent model?
Pyramid doesn’t provide a Web Page scraper source block, and there isn’t an API for the web site to access to retrieve the data. So, what’s the solution?
Fortunately, Pyramid does offer the ability to use a Python script as a data source, and Python provides libraries for reading and processing web pages.
Approach
Indeed, research anything data analytics-based involving Python, and you will soon come across a Python Package called Pandas. In the words of its homepage, “Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.
Part of the “easy to use data structures” referred to includes the ability to parse an HTML page, identify any HTML tables embedded in the page, and convert them to Python data frames. This is precisely what we want!
The Python code required is almost ridiculously easy; essentially just one line of code:
import pandas as pd
tables = pd.read_html("http://volcano.oregonstate.edu/volcano_table", header=0)[0]
The read_html() function contains the URL to read as the first parameter, then which row contains column headers as the second. There are many other parameters that may be specified for various data wrangling processes, but I’ll refer you to the Pandas website and documentation for those. The index value [0] returns the first table found to the tables data frame; read_html() will return an array of HTML tables, but we just want the first one.
Pyramid Implementation.
Pandas relies on other Python packages and these should automatically load when you tell Pyramid to load the Pandas Package. However, in testing it seemed that one specific package, lxml, needed to be explicitly declared to Pyramid for it to be loaded.
Currently, only one data frame can be passed from a Python Script back to Pyramid. So, if there’s a requirement to read more than one table from a page, multiple Python source blocks need to be used, each one retrieving a specific table. See the Multiple Tables section later in this article.
Create the Model
- Drag the Python Source Connector onto the canvas and open the Python Quick Script to enter the code as detailed above.
- Declare the Python packages to load, Pandas and lxml.
NOTE: at the end of the article you can see how to add lxml package in newer versions. - Set the output fields by using the “Auto Detect” button. This will execute the script and return the column headings and data types from the data frame generated, given the data frame name (tables) in the script.
- Add a Select node for the script
- Add a Target, in this case a Pyramid IMDB database
Process the model
431 rows of data read from the web table and loaded into the database.
Visualize the data
Multiple Table Pages
That all worked extremely well and very easily!
However, it is often the case that web pages use tables for layout or formatting purposes, and such tables may not be immediately obvious to the casual reader. There is the likelihood therefore that when you click on the auto detect button to generate the output data table from the Python DataFrame, you get an error, or garbage.
In this instance, you can try adjusting the index value of the read_html() function to examine all the tables that the function may have detected. Remember, the read_html() function returns an array of tables; the numeric suffix in the square brackets, [0], defines the table to be returned. The suffix [0] returns the first table, while increasing the index value to, say, [4] will return the fifth table.
Consider this page from Wiki on Impact Craters found on Earth:
https://en.wikipedia.org/wiki/List_of_impact_craters_on_Earth#Large_craters_(10_ka_to_1_Ma)
We can clearly see that there’s a nice table of data towards the bottom of the screen, but in fact, tables are also used for the Contents box, and the legend under the map on the right.
Using an index value of [0] for our read_html() function will return the items in the content box, or more likely nothing at all. So, you need to increase the index value of the read_html() function until you hit the required table. In this case, the index value is [2]; the third table on the page, which contains the data we want.
If you scroll further down this Wiki page, you’ll see that there are four tables of data concerning impact craters. It would be great to read all four, combine the data together, perform transformations on the coordinates column to derive Latitude and Longitude, then visualize the complete data from all four tables.
This model does exactly that. Each Python source uses the same code, except the read_html() index is incremented by one for each, from [2] to [5]. The data is then unionized, and the Latitude and Longitude derived from the coordinates column.
And visualize:
Summary
Pyramid offers many direct data connectors out of the box, but the ability to use Python (or R) as a data source means that any data is accessible via existing Python libraries, or by custom coding the access using your own scripts.
I hope this article has been of interest and I’m eager to see any other novel uses of Python (or R) to access, retrieve, and process data into Pyramid.
Files
Attached to this article is a Pyramid Import Export (.pie) file that contains the two models discussed. You can download this file and import into your own Pyramid installation for a closer look at the model definitions and processing steps. You will need to update the IMDB database block to point to your own instance of the Pyramid IMDB database engine, or change the destination to write to any SQL database you have access to.
Acknowledgements
Thanks to Nadav Sofi, in Pyramid Development, who helped me with my Python understanding and debugged a couple of issues encountered during the writing of this article.
How to add lxml package in newer versions
You can see your packages by going to admin console-> AI and ML-> ML environments-> Pyramid Python environment->packages
If there is no lxml package you will need to create a new Python environment and then add the package as there is no option to add new packages to the Python default installation environment.
To create a new Python environment:
1. please go to admin console-> AI and ML-> ML environments-> Add environment
2. in the environments setting choose a name and version you would like to use
3. click add
After the environment had been installed you can add the lxml package.
To add it please follow these steps:
1. go to admin console-> AI and ML-> ML environments-> Pyramid Python environment->packages
2. click add packages
3. in the name write: lxml
4. click ok
5. click save
Now wait until the status will change from yellow to green and when it does you will be able to use this package.
Please make sure that in the model you have changed the environment to the new environment you have just created.
1 reply
-
Thanks for the code example. The simple one works great. Be aware that the multi-table example no longer works as Wikipedia has changed the page a bit. However, the example contains enough code to get the point across even if it doesn't work anymore.
Content aside
- 3 Likes
- 1 yr agoLast active
- 1Replies
- 301Views
- 2 Following