Harvesting Data From the Internet
In Review
Introduction
One of the most powerful things we can do with a script is harvest large amounts of data from the Internet. If you look at the HTML code of any web site, you will see that you can parse the HTML to extract what you are looking for. You can even search entire web sites by finding the links in their HTML pages and then “following” those links. This is referred to as “crawling the web” and is how Google and other search engines keep their indexes up to date. The problem is that the organization supporting the web site may change what it displays, making your HTML “scraper” stop working. The solution, where one is available, is to use a web service instead of scraping web pages.
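As a rough illustration of "finding the links" in a page, here is a minimal sketch written for Python 3 using only the standard library. The class name LinkCollector, the function extract_links, and the sample page are made up for this example, and a real crawler would need much more (error handling, politeness, duplicate detection).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about.html" into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the absolute URLs of all links found in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Hypothetical page content, so the example runs without a network call
SamplePage = '<p><a href="/about.html">About</a> <a href="http://example.org/data">Data</a></p>'
TheLinks = extract_links(SamplePage, "http://example.com/index.html")
print(TheLinks)
```

A crawler would repeat this step for each link it has not visited yet, which is also why a change in a site's HTML can silently break it.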
A web service is a "service" on the web that you call from a program rather than through a browser. There are many types of web services, but we'll only deal with a few of the most common for GIS applications.
Accessing the Internet From Python
There are two main libraries to access Internet data in Python 2: urllib and urllib2. urllib2 is the preferred version so we'll use it here. (In Python 3 the two were merged into the urllib package, and urlopen lives in urllib.request.) Enter the code below to see the default output from Google:
import urllib2

# Request the page and read the HTML that comes back
TheResponse=urllib2.urlopen('http://www.google.com/')
TheHTML=TheResponse.read()
print(TheHTML)
Google's code is not very pretty but this does show that we can programmatically go to any URL and access the content of the web page that is returned. Try a couple more web sites and see what is returned.
Note: As we go further, please realize that you can break web sites by "pounding" on them with thousands of requests from a script. Please do not do this. Instead, test your scripts by making just one or two requests, and when you need to make a lot of calls, put a "time.sleep(1)" function call (after an "import time") into the loop that makes the requests. This will help keep the server you are calling from crashing under the load.
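A minimal sketch of such a throttled loop is shown below. The function name fetch_all and its parameters are assumptions made for this example; the fetch argument stands in for whatever call actually makes the request.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request after the first one
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

In a real script, fetch could be something like "lambda url: urllib2.urlopen(url).read()". The sleep parameter exists only so the loop can be tried out without actually waiting.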
It is also very easy to obtain images from the web. The code below will make a request to the Forestry Images web site and save the result as a JPG file. Note the "b" after the "w" in the open to make the file into a "binary" file (as opposed to text), and that the file extension we write out MUST match the type of data we are saving.
import urllib2

TheURL="http://www.forestryimages.org/images/768x512/1455160.jpg"
TheService=urllib2.urlopen(TheURL)

# Open the output file in binary mode ("wb") and write the image data to it
TheFile = open("C:/ProjectsPython/Beetle.jpg", "wb" )
TheResponse=TheService.read()
TheFile.write(TheResponse)
TheFile.close()
Try the code above with a few other images from the web and then we'll move to web services.
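One way to make sure the file extension matches the data is to look at the Content-Type header the server sends back. The sketch below is written for Python 3 (urllib.request rather than urllib2). The lookup table and the function names extension_for and save_image are assumptions for this example, and the table is deliberately incomplete.

```python
import urllib.request

# Small lookup table from MIME type to file extension (an assumption for
# this sketch; extend it for the image types you expect to download)
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/tiff": ".tif",
}


def extension_for(content_type):
    """Return the extension for a Content-Type value, or None if unknown."""
    # Drop optional parameters such as "; charset=..." before the lookup
    mime_type = content_type.split(";")[0].strip().lower()
    return EXTENSIONS.get(mime_type)


def save_image(url, path_without_extension):
    """Download url and save it with an extension matching its Content-Type."""
    with urllib.request.urlopen(url) as response:
        extension = extension_for(response.headers.get("Content-Type", ""))
        if extension is None:
            raise ValueError("Unrecognized image type for " + url)
        data = response.read()
    file_path = path_without_extension + extension
    with open(file_path, "wb") as image_file:  # "wb" = write binary
        image_file.write(data)
    return file_path
```

A call such as save_image(TheURL, "C:/ProjectsPython/Beetle") would then pick ".jpg" on its own if the server reports image/jpeg.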
Harvesting Data from GBIF
Note: GBIF has changed their web services to be JSON based so this section needs to be updated to JSON.
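Until this section is updated, here is a small sketch of what handling a JSON response generally looks like in Python. The sample text is hand-written, not a real GBIF response, and the field names "results", "decimalLatitude" and "decimalLongitude" are assumptions that should be checked against the current GBIF documentation.

```python
import json

# Hand-written stand-in for a JSON response; the field names are assumptions
SampleResponse = """
{
  "count": 2,
  "results": [
    {"scientificName": "Dendroctonus ponderosae",
     "decimalLatitude": 40.5, "decimalLongitude": -105.1},
    {"scientificName": "Dendroctonus ponderosae"}
  ]
}
"""


def extract_points(json_text):
    """Return (latitude, longitude) pairs for records that have both values."""
    records = json.loads(json_text).get("results", [])
    points = []
    for record in records:
        latitude = record.get("decimalLatitude")
        longitude = record.get("decimalLongitude")
        # Skip records without coordinates instead of failing on them
        if latitude is not None and longitude is not None:
            points.append((latitude, longitude))
    return points


ThePoints = extract_points(SampleResponse)
print(ThePoints)
```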
Google Geocoding API
Google provides a number of geospatial web services. The Google Geocoding API allows you to convert street addresses to coordinates. This is the same service that Google Maps uses to turn addresses into points on the map. The documentation for this is available at:
https://developers.google.com/maps/documentation/geocoding/
Follow these instructions closely. I recommend using the XML output format, but JSON will work as well. You will need to apply for an API key; follow those instructions closely too, or your script will get an error back.
import urllib2

# You'll need to replace this key with one from Google
APIKey="AIzaSyBYNUKkf6C6Qxz3OvYe9V_mvohZvpfbO38"

# This is the address, notice the spaces have been replaced with pluses ("+")
Address="1600+Amphitheatre+Parkway,+Mountain+View"

# The URL that will be sent to Google
TheURL="https://maps.googleapis.com/maps/api/geocode/xml?address="+Address+",+CA&sensor=false&key="+APIKey

# Open the URL and get the response object
TheResponse=urllib2.urlopen(TheURL)

# Get the XML data from the response
TheData=TheResponse.read()
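The XML that comes back still has to be parsed to get at the coordinates. The sketch below is written for Python 3 and uses only the standard library. The function names and the sample XML are made up for this example; the element names (status, result, geometry, location, lat, lng) follow the layout the Geocoding documentation describes for XML output, but check them against a real response before relying on them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_geocode_url(address, api_key):
    """Build the request URL, letting urlencode escape the address."""
    # urlencode replaces spaces with "+" and escapes other special characters
    query = urlencode({"address": address, "key": api_key})
    return "https://maps.googleapis.com/maps/api/geocode/xml?" + query


def parse_location(xml_text):
    """Return (lat, lng) of the first result, or None if there is none."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        return None
    location = root.find("result/geometry/location")
    if location is None:
        return None
    return float(location.findtext("lat")), float(location.findtext("lng"))


# Hand-written sample shaped like a response, so no request is needed
SampleXML = """<GeocodeResponse>
  <status>OK</status>
  <result>
    <geometry>
      <location><lat>37.4220</lat><lng>-122.0841</lng></location>
    </geometry>
  </result>
</GeocodeResponse>"""

TheURL = build_geocode_url("1600 Amphitheatre Parkway, Mountain View, CA", "YOUR_KEY")
TheLocation = parse_location(SampleXML)
print(TheURL)
print(TheLocation)
```

Building the query with urlencode means the address can be written with ordinary spaces instead of replacing them with pluses by hand.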
Other Text-Based Services
There are a large number of services available that return data formatted in XML or another text-based format. These include:
Yahoo’s weather service: http://developer.yahoo.com/weather/
Yahoo’s weather service is an example of a Really Simple Syndication (RSS) feed, which is just another XML text format, so it is pretty easy to parse for what you are interested in.
http://opendap.co-ops.nos.noaa.gov/ioos-dif-sos/ - tides and currents from NOAA
http://www.nws.noaa.gov/xml/index.php - NOAA weather information
http://www.eia.gov/pub/oil_gas/natural_gas/feature_articles/2003/market_hubs/mkthubsweb.html - requires signing up
http://waterservices.usgs.gov/rest/DV-Service.html - USGS water gage data
http://graphical.weather.gov/xml/rest.php - National Digital Forecast Database REST service
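Since RSS and most of the services listed above return XML text, parsing usually comes down to walking the element tree. Below is a minimal Python 3 sketch using the standard library. The sample feed is hand-written and does not come from any of the services listed, and read_items is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of an RSS 2.0 feed
SampleFeed = """<rss version="2.0">
  <channel>
    <title>Example Weather Feed</title>
    <item><title>Conditions at 9:00 am</title><description>Fair, 52 F</description></item>
    <item><title>Forecast</title><description>Sunny, high of 70 F</description></item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a (title, description) pair for each item in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.findall("channel/item"):
        items.append((item.findtext("title"), item.findtext("description")))
    return items


TheItems = read_items(SampleFeed)
for TheTitle, TheDescription in TheItems:
    print(TheTitle + ": " + TheDescription)
```

Real feeds often add their own namespaced elements, which need the namespace spelled out in the find calls.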
OpenGIS Web Services
The real power of web services in GIS comes from being able to download large numbers of files without having to click on links all day. Many of these services are based on OpenGIS standards. The Web Map Service (WMS) standard allows us to download raster files from a web service.
NASA's Jet Propulsion Laboratory (JPL) maintains a number of web services to access streams of data from satellites. These web services are based on the OpenGIS standards published by the Open Geospatial Consortium (OGC), and you should read about them at the JPL web site. Click on the link in the web site to see the parameters they prefer you send to return tiles (portions of a raster) at high speed.
Below is an example of code that will download one of these tiles:
import urllib2

# The WMS parameters: which layer to draw, in what format, and for which area
Parameters="request=GetMap&layers=global_mosaic&srs=EPSG:4326&format=image/jpeg&styles=visual&width=512&height=512&bbox=-180,-38,-52,90"
TheURL="http://wms.jpl.nasa.gov/wms.cgi?"+Parameters
TheService=urllib2.urlopen(TheURL)

# Write the returned tile to a binary file
TheFile = open("C:/ProjectsPython/World.jpg", "wb" )
TheResponse=TheService.read()
TheFile.write(TheResponse)
TheFile.close()
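To download many tiles rather than one, the bounding box is the part of the request that changes. The Python 3 sketch below builds the same kind of URL from its parts and splits a larger area into tiles. The function names are made up for this example, the parameter values are copied from the request above, and the sketch only builds URLs: it does not check that the server still answers at that address.

```python
from urllib.parse import urlencode


def build_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a GetMap URL for one tile; bbox is (west, south, east, north)."""
    west, south, east, north = bbox
    params = {
        "request": "GetMap",
        "layers": layer,
        "srs": "EPSG:4326",
        "format": "image/jpeg",
        "styles": "visual",
        "width": width,
        "height": height,
        "bbox": "{},{},{},{}".format(west, south, east, north),
    }
    # Keep ":", "/" and "," unescaped so the URL matches the hand-written one
    return base_url + "?" + urlencode(params, safe=":/,")


def tile_bboxes(west, south, east, north, tile_size):
    """Split an area into square tiles, row by row starting at the north edge.

    Assumes the width and height are whole multiples of tile_size; otherwise
    the last tiles extend past the east and south edges.
    """
    boxes = []
    top = north
    while top > south:
        left = west
        while left < east:
            boxes.append((left, top - tile_size, left + tile_size, top))
            left += tile_size
        top -= tile_size
    return boxes


TheTiles = tile_bboxes(-180, -90, 180, 90, 90)
TheURLs = [build_getmap_url("http://wms.jpl.nasa.gov/wms.cgi", "global_mosaic", Tile)
           for Tile in TheTiles]
print(len(TheURLs))
print(TheURLs[0])
```

Each URL in TheURLs could then be requested and saved the same way as the single tile, with a pause between requests.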
SOAP-Based Services
A number of web services are based on the Simple Object Access Protocol (SOAP), and some developers will say that these are the only web services that exist. This is nonsense. While SOAP provides great connectivity between Java-based programs, it is not simple, has serious performance problems, and is not well supported outside the Java community. Fortunately, the bulk of the web services used by the scientific world are not SOAP based and can be used with just the code shown above. Another problem, specific to Python, is that there are a large number of SOAP libraries instead of one standard library. See the post at StackOverflow.
Additional Resources
Python Documentation: urllib2
Python Documentation: How to Fetch Internet Resources