How the Internet Works
In Review
Introduction
All of the computers on the Internet are physically connected to one another. Some folks believe the Internet runs on satellites, but for most traffic it just takes too long to bounce signals out into space and back again. Microwaves are used for short distances, but the rest of the Internet is connected with millions of miles of cables strung over telephone poles, buried in the ground, and laid under the ocean. Think about this for a minute. You have a physical connection through copper wire and fiber optic cable to millions of computers scattered all over the world. This may be the single most important development of the last century, as it allows very low-cost communication of a large variety of data between most of the people on the earth. The implications of the Internet are only just starting to be felt by society.
The Internet can be accessed through browsers on a variety of devices, reaching over 2 billion web sites. The Internet is also used for a wide variety of communication that most people never see, including bank transactions and secure government communications. The Internet contains a huge variety of GIS data, much of which is available for public use; other sources are available for purchase. All of these sources of GIS data can be accessed with a few simple scripting commands. You can even create your own "services" to provide data to others on the Internet. Enough philosophy, now back to work.
How the Web Works
Every computer has an address that uniquely identifies it on the Internet. This is referred to as an “Internet Protocol” or “IP” address. “Protocol” is another word for a language that allows two computers to talk to one another. An example of an IP address would be “128.54.123.1”. Each address in this format (IP version 4) contains 4 bytes, and each byte can be 0 to 255. Servers on the Internet have “fixed addresses” that never change, and other computers have “dynamic addresses” that are assigned when they are turned on or connect to the network. These addresses allow two computers to talk with each other on what is otherwise a party line (see Wikipedia if you are too young to know what a party line is) with millions of callers.
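To make the "4 bytes, each 0 to 255" idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name `describe_ipv4` is made up for this illustration.

```python
import ipaddress


def describe_ipv4(text):
    """Return the four byte values (each 0 to 255) of a dotted IPv4 address."""
    # IPv4Address raises a ValueError if the text is not a valid address,
    # for example if one of the four numbers is larger than 255.
    address = ipaddress.IPv4Address(text)
    # "packed" is the address as 4 raw bytes; list() turns them into integers.
    return list(address.packed)


print(describe_ipv4("128.54.123.1"))
```

Passing an address such as "128.54.123.256" raises an error, because 256 does not fit in one byte.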
When you get on the Internet you probably type in a “domain name” such as “www.Google.com”. This name is converted into an IP address by a “domain name server” (part of the Domain Name System, or DNS), and the browser then uses that IP address, which is hidden from you. If you want to find out what the IP address is you can perform a DNS lookup (for example with the “nslookup” command), but you really don’t need to.
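A script can ask for the same name-to-address conversion. The sketch below uses `socket.gethostbyname` from Python's standard library, which hands the question to the operating system's resolver; the helper name `lookup_ip` is hypothetical.

```python
import socket


def lookup_ip(domain_name):
    """Return one IPv4 address for domain_name as a dotted string."""
    # The operating system's resolver contacts a domain name server if needed.
    # A string that is already an IPv4 address is returned unchanged.
    return socket.gethostbyname(domain_name)
```

Calling `lookup_ip("www.google.com")` requires a network connection, and the address returned can differ from one lookup to the next because large sites use many servers.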
In any Internet transaction there is a client and a server. The client is the one making the request and the server is the one who hopefully responds with what the client wants. If you type “www.Google.com” into your web browser’s Uniform Resource Locator (URL) text box and then hit return you will see Google’s home page. Right-click on the page and select “View Source”. This will show you the Hypertext Markup Language (HTML) code that was returned from Google’s web site. HTML is the primary language that is used to create web pages and is what is typically returned in a response from a server.
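The client and server roles can be tried out on a single machine. The following is only a sketch: it starts a tiny server from Python's standard `http.server` module on the local address 127.0.0.1, then acts as the client with `http.client`. The names `HelloHandler`, `PAGE`, and `run_demo` are invented for this example, and a real web server does far more.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# The HTML that our toy server returns for every request.
PAGE = b"<html><body><h1>Hello, client</h1></body></html>"


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with the same small page."""

    def do_GET(self):
        self.send_response(200)  # 200 means the request succeeded
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet instead of logging each request


def run_demo():
    """Start the server, make one client request, return (status, html)."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: send one request and read the response.
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        body = response.read().decode("utf-8")
        connection.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the status code and the HTML text, which is the same kind of text a browser shows under “View Source”.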
If you type a search into Google such as “HTTP” you will see a list of web sites that contain information on Hypertext Transfer Protocol (HTTP). Take a closer look at what Google’s web page put into the URL field when you entered this search.
http://www.google.com/search?hl=en&q=HTTP&btnG=Google+Search
This link is a URL but it is also the core of a Hypertext Transfer Protocol (HTTP) request. The first part, “http”, indicates that we will be using HTTP to make the request. HTTP is the protocol used by the World Wide Web to communicate web pages, images, and other web content. Having a specified protocol is very important to make sure the client and server are using the same language. There are other protocols such as File Transfer Protocol (FTP) and Simple Mail Transfer Protocol (SMTP). You have probably used both of these protocols to move files and send email even though your software hides it from you.
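To show how a URL becomes the core of an HTTP request, the sketch below builds the text of a simple HTTP/1.1 GET request from a URL. The function name `build_get_request` is made up for illustration, and real browsers add many more header lines than the two shown here.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    # The request line carries the path and the parameters, not the domain.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {parts.netloc}",  # the domain name travels in a header
        "Connection: close",
        "",  # a blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return and a line feed.
    return "\r\n".join(lines)


print(build_get_request(
    "http://www.google.com/search?hl=en&q=HTTP&btnG=Google+Search"))
```

This only builds the message text; nothing is sent over the network.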
Network Stacks
To have data go from a script to another server and back, a large number of different protocols and hardware “layers” are involved. These layers make up a “Network Stack”. The Transmission Control Protocol (TCP) and Internet Protocol (IP) are part of this stack for the Internet. Fortunately we only need to worry about part of HTTP and the contents of its messages to be able to harvest data from the Internet.
The next part of the URL, “www.google.com”, is the domain name and indicates the server you wish to communicate with. This value is converted into an IP address, using the domain name lookup described earlier, before the actual request is sent to the server. You can substitute an IP address for the domain name if you know the address of a server.
After the domain name is the path to the file that the client wants the server to return or run. In the example above this is simply “search”. This file is typically in a special “web site” folder on the server so that only part of the path to the file is shown in the URL. The file path can also be hidden by some web servers. If you right-click on any image that is displayed in a web page and select “properties” you will see the location of the file within the server’s web folder. If you right-click on the Google logo you will see an address such as shown below. The “/intl/en_ALL/images/logo.gif” is the address of the image within the web folder on Google’s server.
http://www.google.com/intl/en_ALL/images/logo.gif
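The pieces of a URL described so far (protocol, domain name, and file path) can be pulled apart with `urlsplit` from Python's standard `urllib.parse` module. This is a small sketch; the helper name `split_url` is hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (protocol, domain name, file path) parts of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


protocol, domain, path = split_url(
    "http://www.google.com/intl/en_ALL/images/logo.gif")
print(protocol)  # the protocol used to make the request
print(domain)    # the server to contact
print(path)      # the file's address within the server's web folder
```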
After the question mark (“?”) there can be parameters with information for the web page. In the case of Google the parameters include information on the search to be executed. The parameters in the example above are:
hl=en
q=HTTP
btnG=Google+Search
These parameters are defined by Google so they can be a little hard to decipher, but we can guess that “hl” is the language, “q” is the query, and “btnG” tells Google to do a search. These parameters are passed to a program on one of Google’s servers, which then performs a search and returns the results in HTML.
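Scripts rarely need to pick parameters apart by hand. As a sketch, Python's standard `urllib.parse` module can both read the parameters out of a URL (`parse_qs`) and build a parameter string from a dictionary (`urlencode`). The parameter names are simply the ones from the Google example above; what they mean to Google is still a guess.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://www.google.com/search?hl=en&q=HTTP&btnG=Google+Search"

# Everything after the "?" is the query string.
query = urlsplit(url).query

# parse_qs returns a dictionary. Each value is a list because a parameter
# may appear more than once, and "+" is decoded back into a space.
params = parse_qs(query)
print(params)

# urlencode goes the other way, turning spaces into "+" signs.
rebuilt = urlencode({"hl": "en", "q": "HTTP", "btnG": "Google Search"})
print(rebuilt)
```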
To access the Internet from a script all we have to do is send HTTP requests to a server and then parse the data that is returned.
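As a rough sketch of those two steps, the code below sends an HTTP request with `urlopen` and then parses the returned HTML with `HTMLParser`, both from Python's standard library. Pulling out the page title is just one simple example of parsing; the names `TitleParser`, `extract_title`, and `fetch_text` are invented for this illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html_text):
    """Parse step: return the page title, or an empty string if none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


def fetch_text(url, timeout=10):
    """Request step: send an HTTP GET request and return the response text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the character set the server reports, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With a network connection, `extract_title(fetch_text("http://www.google.com/"))` would request the page and return its title. Some servers refuse or change their responses for requests that do not come from a browser, so results can vary.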