Harvesting Data From the Internet

Introduction

One of the most powerful things we can do with a script is to harvest large amounts of data from the Internet. If you look at the HTML code of any web site, you can see that you could parse the HTML to extract what you are looking for. You can even search entire web sites by finding the links in their HTML pages and then “following” those links. This is referred to as “crawling the web” and is how Google and other search engines keep their indexes up to date. The problem is that the organization supporting the web site may change what it displays, making your HTML “scraper” stop working. Where one is available, a more robust solution is to use a web service instead of scraping web pages.
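As a rough illustration of "finding the links" in a page, here is a minimal sketch that uses Python's built-in html.parser module. The names LinkCollector and find_links, and the sample HTML, are made up for this example; a real crawler would need much more (error handling, duplicate detection, respecting robots.txt).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself
                self.links.append(urljoin(self.base_url, value))


def find_links(html_text, base_url):
    """Return a list of absolute URLs linked from html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Invented sample page so the sketch runs without touching the network
sample = '<p><a href="/about.html">About</a> <a href="http://example.org/">Other</a></p>'
print(find_links(sample, "http://example.com/index.html"))
```

Each URL returned could then be opened in turn, which is the "following" step described above.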

A web service is a "service" on the web that you call from a program rather than through a browser. There are a variety of types of web services, but we'll only be dealing with a few of the most common for GIS applications.

Accessing the Internet From Python

There are two main libraries to access Internet data in Python: urllib and urllib2. urllib2 is the preferred one for Python 2.7, so we'll use it here. Enter the code below to see the default output from Google:

import urllib2 # Include the URL library

TheService=urllib2.urlopen('http://www.google.com/') # Open the URL

TheHTML=TheService.read() # Read the HTML data into a variable

print(TheHTML)

However, for Python 3.x, we will use urllib and we need to change the import a bit:

import urllib.request # Include the URL library

TheService=urllib.request.urlopen('http://www.google.com/') # Open the URL

TheHTML=TheService.read() # Read the HTML data into a variable

print(TheHTML)

Google's code is not very pretty but this does show that we can programmatically go to any URL and access the content of the web page that is returned. Try a couple more web sites and see what is returned.
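One thing to be aware of in Python 3.x is that read() returns a "bytes" object rather than a string, so print() shows it with a b'...' prefix. The sketch below shows one way to turn those bytes into text. The function name decode_page is just for illustration, and falling back to UTF-8 is an assumption that will not suit every web site.

```python
def decode_page(raw_bytes, charset=None):
    """Turn the bytes returned by read() into a regular string."""
    # Fall back to UTF-8 when the server did not declare a character set
    return raw_bytes.decode(charset or "utf-8", errors="replace")


# With a live response this might be called as follows (not run here):
#   TheService = urllib.request.urlopen('http://www.google.com/')
#   TheText = decode_page(TheService.read(), TheService.headers.get_content_charset())

print(decode_page(b"<html>caf\xc3\xa9</html>"))
```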

Note: As we go further, please realize that you can break web sites by "pounding" on them with thousands of requests from a script. Please do not do this. Instead, test your scripts by making just one or two requests, and when you need to make a lot of calls, put a "time.sleep(1)" function call into the loop that is making the requests. This helps keep your script from overloading the server you are calling.
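A minimal sketch of such a loop is shown here. The function fetch_politely and the stand-in fake_fetch are invented for this example so that it runs without contacting any server; the example.com URLs are placeholders.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # Give the server a rest between calls
        results.append(fetch(url))
    return results


# A stand-in for a real request so the sketch does not touch the network
def fake_fetch(url):
    return "content of " + url


pages = fetch_politely(
    ["http://example.com/a", "http://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

For real requests you could pass something like lambda url: urllib.request.urlopen(url).read() as the fetch argument and leave the delay at one second or more.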

Harvesting Images

It is also very easy to obtain images from the web. The code below will make a request to a NASA web site and save the result as a JPG file. Note the "b" after the "w" in the open call, which makes the file a "binary" file (as opposed to text), and that the file extension we write out MUST match the type of data we are saving.

In Python 3.x:

import urllib.request # Include the URL library

# Define and open the URL
TheURL="https://www.nasa.gov/sites/default/files/thumbnails/image/20160418_003000_4096_hmiic.jpg"
TheService=urllib.request.urlopen(TheURL)

# Open a file to store the response into and make sure it is binary ("wb")
TheFile = open("C:/Temp/Sun.jpg", "wb" )

# Get the response from the URL
TheResponse=TheService.read()

# Write the response (Image) to the file
TheFile.write(TheResponse)

# Close the File
TheFile.close()
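Since the file extension must match the data, it can be handy to check what was actually downloaded before choosing a file name. The sketch below looks at the first few bytes of the response, which identify common image formats. The function name guess_image_extension is made up for this example and it only covers three formats.

```python
def guess_image_extension(data):
    """Guess a file extension from the first bytes of downloaded image data."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return ".gif"
    return None  # Unknown type; inspect the response before saving it


# The bytes below only imitate the start of a JPEG file
print(guess_image_extension(b"\xff\xd8\xff\xe0 rest of a JPEG"))
```

If the function returns None, the server may have sent an HTML error page instead of an image.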

Try the code above with a few other images from the web and then we'll move to web services.

JSON Files From BISON

JSON stands for JavaScript Object Notation and has become one of the main data exchange formats on the web. To obtain a data set from BISON, you would have used a URL like the following, but this site is no longer available (so sad!):

https://bison.usgs.gov/api/search.json?species=Bison%20bison&type=scientific_name&start=0&count=1

This one for GBIF works:

http://api.gbif.org/v1/occurrence/search?limit=1000&offset=0&taxonKey=8971274
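Once a JSON response has been read, Python's json module can convert it into dictionaries and lists. The sketch below assumes the response has a "results" list whose records contain "scientificName", "decimalLatitude" and "decimalLongitude" fields; check a real response to confirm the field names. The sample text is invented so the code runs without a network connection.

```python
import json


def summarize_occurrences(json_text):
    """Return (name, latitude, longitude) tuples from an occurrence response."""
    data = json.loads(json_text)
    rows = []
    for record in data.get("results", []):
        rows.append((
            record.get("scientificName"),
            record.get("decimalLatitude"),
            record.get("decimalLongitude"),
        ))
    return rows


# Invented sample, shaped like the response we assume the service returns
sample = """
{"offset": 0, "limit": 1, "count": 1, "results": [
  {"scientificName": "Example species", "decimalLatitude": 44.6, "decimalLongitude": -110.5}
]}
"""
print(summarize_occurrences(sample))
```

With a live service, json_text would be the decoded result of the read() call on the response.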

Google Geocoding API

Google provides a number of geospatial web services. The Google Geocoding API allows you to convert street addresses to coordinates. This is the same service that Google Maps uses to convert addresses to points in Google Maps. The documentation for this is available at:

https://developers.google.com/maps/documentation/geocoding/

It's best to follow these instructions closely. I recommend using the XML output format, but JSON will work as well. You will need to apply for an API key; follow those instructions carefully or your script will return an error.

import urllib.request # Include the URL library

# You'll need to replace this key with one from Google
APIKey="AIzaSyBYNUKkf6C6Qxz3OvYe9V_mvohZvpfbO38"

# This is the address, notice the spaces have been replaced with pluses ("+")
Address="1600+Amphitheatre+Parkway,+Mountain+View"

# The URL that will be sent to Google
TheURL="https://maps.googleapis.com/maps/api/geocode/xml?address="+Address+",+CA&sensor=false&key="+APIKey

# Open the URL and get the response object
TheResponse=urllib.request.urlopen(TheURL)

# Get the XML data from the response
TheData=TheResponse.read()

print(TheData)
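Two refinements may be useful here: letting urllib.parse.urlencode replace the spaces for us, and pulling the coordinates out of the XML with the built-in ElementTree module. The sketch below assumes the XML response contains a "status" element and a "result/geometry/location" element with "lat" and "lng" children; compare this against the Google documentation. The function names, the key "YOUR_KEY" and the sample XML are placeholders invented for this example.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def build_geocode_url(address, api_key):
    """Build a geocoding URL, letting urlencode escape spaces and commas."""
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    return "https://maps.googleapis.com/maps/api/geocode/xml?" + query


def parse_location(xml_text):
    """Return (latitude, longitude) from a geocoding response, or None."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        return None
    lat = root.findtext("result/geometry/location/lat")
    lng = root.findtext("result/geometry/location/lng")
    if lat is None or lng is None:
        return None
    return float(lat), float(lng)


# Invented sample, shaped like the response we assume the service returns
sample_xml = (
    "<GeocodeResponse><status>OK</status><result><geometry><location>"
    "<lat>37.4220000</lat><lng>-122.0840000</lng>"
    "</location></geometry></result></GeocodeResponse>"
)
print(build_geocode_url("1600 Amphitheatre Parkway, Mountain View, CA", "YOUR_KEY"))
print(parse_location(sample_xml))
```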

Text-Based Services

There are a large number of services available that return data formatted in XML or another text-based format. Below are several, with some notes on accessing them. Note that these services may change at any time.

Tides and currents from NOAA: https://tidesandcurrents.noaa.gov/web_services_info.html

You'll need a "Station ID" and then create a "GetObservation Request via HTTP/GET".

NOAA weather info: http://www.nws.noaa.gov/xml/index.php

This one is pretty complicated and uses SOAP.

USGS water gage water services: https://waterservices.usgs.gov/

This has buttons for a number of different services.

USGS water gage web service test tool: https://waterservices.usgs.gov/test-tools/?service=iv&siteType=&statTypeCd=all&major-filters=sites&format=json&date-type=type-none&statReportType=daily&statYearType=calendar&missingData=off&siteStatus=all&siteNameMatchOperator=start

Go to "Testing the service" and the testing page will have a button to generate a URL at the bottom.

National Digital Forecast Database REST service: http://graphical.weather.gov/xml/rest.php

Just see the documentation and examples on the web page above.

NOAA Buoy Center: http://www.ndbc.noaa.gov/

There are instructions at: http://www.ndbc.noaa.gov/rt_data_access.shtml

The real time data is at: http://www.ndbc.noaa.gov/data/realtime2/

GBIF (see GBIF.org)

The GBIF API is well documented but is more complicated than BISON. There are some examples on the GBIF site. Obtaining occurrence data takes two steps:

1. Use http://api.gbif.org/v1/species/match?name=Beta+vulgaris to obtain information on a species, including the "taxonKey". "Beta vulgaris" is the scientific name of the species.
2. Use http://api.gbif.org/v1/occurrence/search?limit=1000&offset=0&taxonKey=8971274 where "limit" is the maximum number of records to return, "offset" is the index of the first record to return, and "taxonKey" is the unique ID for the species from step 1.
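The two GBIF requests above can be sketched in Python as shown here. The function names are invented for this example. It is an assumption that the species match response stores the key in a field called "usageKey", so check a real response. The service may also cap "limit" at a smaller value than the one you ask for, in which case you would need a smaller page size.

```python
import json
import urllib.parse

GBIF_API = "http://api.gbif.org/v1"


def species_match_url(name):
    """Build the URL that looks up a species by scientific name."""
    return GBIF_API + "/species/match?" + urllib.parse.urlencode({"name": name})


def taxon_key_from_match(json_text):
    """Pull the species key out of a match response."""
    # Assumption: the key is stored in the "usageKey" field
    return json.loads(json_text).get("usageKey")


def occurrence_page_urls(taxon_key, total_records, page_size):
    """Build one URL per page of results, stepping the offset each time."""
    urls = []
    for offset in range(0, total_records, page_size):
        query = urllib.parse.urlencode(
            {"limit": page_size, "offset": offset, "taxonKey": taxon_key}
        )
        urls.append(GBIF_API + "/occurrence/search?" + query)
    return urls


print(species_match_url("Beta vulgaris"))
for url in occurrence_page_urls(8971274, 2500, 1000):
    print(url)
```

Remember to pause between the page requests when you actually send them.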

KML Web Sites

The following web sites have interesting KML/KMZ files for download. Note that KMZ files are actually zip files. To access the contents, you'll need to change the file extension to ".zip" and then uncompress the file. This can be done manually or with Python.

Active Fire Mapping Program: Current Large Incidents: https://fsapps.nwcg.gov/googleearth.php

See the link at the bottom of the page.

USGS Earthquake Hazards Program: http://earthquake.usgs.gov/earthquakes/feed/v1.0/kml.php

See the links on the right.
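Because a KMZ file is a zip file, Python's zipfile module can read it directly without renaming it first. The sketch below builds a tiny KMZ in memory so that it runs without downloading anything; the function name read_kml_from_kmz and the demo content are made up for this example, and it assumes the KML inside is UTF-8 text.

```python
import io
import zipfile


def read_kml_from_kmz(kmz_bytes):
    """Return the text of the first .kml file inside KMZ data, or None."""
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".kml"):
                return archive.read(name).decode("utf-8")
    return None


# Build a tiny KMZ in memory to stand in for a downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("doc.kml", "<kml><Document><name>Demo</name></Document></kml>")
demo_kmz = buffer.getvalue()

print(read_kml_from_kmz(demo_kmz))
```

With a real download, kmz_bytes would be the result of the read() call on the response.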

Others

Let me know of any web services you find that future students might want to use!

OpenGIS Web Services

The real power in web services and GIS comes from being able to download large numbers of files without having to click on links all day. Many of these services are based on OpenGIS standards. The Web Map Service (WMS) standard allows us to download raster files from a web service.

NASA's Jet Propulsion Laboratory (JPL) maintains a number of web services to access streams of data from satellites. These web services are based on the OpenGIS standards published by the Open Geospatial Consortium (OGC) and you should read about that at the JPL web site. Click on the link in the web site to see the parameters they prefer you send to return tiles (portions of a raster) at high speed.

Below is an example of code that will download one of these tiles:

This service is no longer available but there are a number of other services including this one: https://www.ngdc.noaa.gov/dscovr/portal/#/

import urllib2 # Python 2.7 only; in Python 3.x use urllib.request as in the earlier examples

Parameters="request=GetMap&layers=global_mosaic&srs=EPSG:4326&format=image/jpeg&styles=visual&width=512&height=512&bbox=-180,-38,-52,90"
TheURL="http://wms.jpl.nasa.gov/wms.cgi?"+Parameters

TheService=urllib2.urlopen(TheURL)
		
TheFile = open("C:/ProjectsPython/World.jpg", "wb" )

TheResponse=TheService.read()
TheFile.write(TheResponse)

TheFile.close()
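Since WMS is a standard, the same kind of request should work against other WMS servers. The Python 3 sketch below builds a GetMap URL from its parts using the standard WMS 1.1.1 parameter names. The server address http://example.com/wms is a placeholder, not a real service, and the function name build_getmap_url is invented for this example. Layer and style names differ for every server.

```python
import urllib.parse


def build_getmap_url(base_url, layer, bbox, width=512, height=512,
                     image_format="image/jpeg", style=""):
    """Build a WMS 1.1.1 GetMap URL for one tile in geographic coordinates."""
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": layer,
        "styles": style,
        "srs": "EPSG:4326",
        # bbox is (min longitude, min latitude, max longitude, max latitude)
        "bbox": ",".join(str(value) for value in bbox),
        "width": width,
        "height": height,
        "format": image_format,
    }
    # Keep ":", "," and "/" readable instead of percent-encoding them
    return base_url + "?" + urllib.parse.urlencode(params, safe=":,/")


TheURL = build_getmap_url("http://example.com/wms", "global_mosaic",
                          (-180, -38, -52, 90), style="visual")
print(TheURL)
```

The resulting URL could then be opened with urllib.request.urlopen and the response written to a binary file, as in the image example earlier.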

SOAP-Based Services

There are a number of web services that are based on the Simple Object Access Protocol (SOAP), and there are a number of developers who will say that these are the only web services that exist. This is nonsense. While SOAP provides great connectivity between Java-based programs, SOAP is not simple, has serious performance problems, and is not well supported outside the Java community. Fortunately, the bulk of web services that are used by the scientific world are not SOAP based and can be used with just the code we have mentioned above. A further problem in Python is that there are a large number of SOAP libraries instead of one standard library. See the post at StackOverflow.

Additional Resources

Python Documentation: How to Fetch Internet Resources

Geospatial Activities