If you view GIS programming as writing individual, isolated programs to solve small problems, you will never get its full benefit. Instead, view GIS programming in the larger context of a community of GIS users, analysts, managers, and developers working to create datasets and results that help solve problems that may take decades to truly tackle. This view requires that we take a longer-term view of programming.
Programming is the act of writing code and is fairly easy to master. A "developer" is someone who develops or "creates" something for others to use to help them solve problems. This broadens our expertise to include software lifecycles, data management plans, project management, and becoming part of a community that supports us in writing software and that we support in turn.
Data management is as critical as any other step in research and can mean the difference between success and failure. Datasets have been thrown away because field crews took geographic coordinates in different Datums (see section X) and did not record the format of the data. The data were integrated before the problem was found, some measurements were off by thousands of meters, and all the data had to be discarded. You may know of datasets that are sitting in a box on disks that cannot be read, or files of incomprehensible data that have no explanatory metadata. There are also international efforts to increase data availability on the web, including the Global Biodiversity Information Facility (www.gbif.org). Unfortunately, most researchers do not know how to access these data. A framework for data management can turn these problems into opportunities.
Researchers in natural resources manage large databases which include datasets from a variety of sources. These datasets may be distributed in space and time and collected from different organizations and individuals. This makes integration difficult, and can limit the types of analyses available. Developments in computer technology such as programming, relational databases, and web services provide the opportunity to conquer many of these challenges. Unfortunately, these technologies remain unfamiliar to most researchers. This book will create a framework for management of natural resource data, make researchers comfortable with programming, and introduce them to relational databases and web services.
A traditional research framework generally includes Data Collection, Analysis, and Publication. To facilitate better data management, this book adds Planning, Integration, Maintenance, and Distribution steps to this traditional framework (Figure 1). In this book, we will focus on 1) how to plan for successful data integration, 2) how to acquire data through the Internet and other means, 3) how to integrate datasets while maintaining the highest possible accuracy, precision, and applicability, 4) how to ensure data are maintained over time, and 5) how to distribute the data to others through the Internet and other means.
Figure 1. Data management flow chart
2.1 Planning
Planning is the most critical phase of any project, and even more so for your overall data management. Planning should include looking at each of the steps described below and making sure the time and resources are available to complete each one.
2.2 Data Collection
The traditional step of data collection changes only in that we need to structure our field data to ensure that it can be used in a broader context.
2.2.1 Categorical Variables
It is easier to integrate disparate datasets if the data are recorded as continuous measurements rather than categories. Consider what happens if we take two datasets: one has recorded the size of trees as small, medium, and large, while another recorded them simply as small and large. How do we combine these data? To make matters worse, let's say that the first dataset used a diameter at breast height (DBH) of 0.5 meters as the separation between small and medium and a DBH of 1 meter for the separation between medium and large. The second dataset then used a DBH of 0.75 meters for the separation between small and large. Because the categories were not created with the same definitions, the data can never be integrated in a meaningful way. If the DBH had been recorded in meters, the data could be integrated.
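The tree-size example above can be sketched in a few lines of code (Python here purely for illustration; the measurements and thresholds are hypothetical). Because both crews recorded the continuous DBH value, one shared category scheme can be applied after the fact:

```python
# Hypothetical DBH measurements (meters) from two field crews.
crew_a = [0.3, 0.6, 1.2]
crew_b = [0.4, 0.9]

def size_class(dbh_m):
    """Shared definition: small < 0.5 m <= medium < 1.0 m <= large."""
    if dbh_m < 0.5:
        return "small"
    if dbh_m < 1.0:
        return "medium"
    return "large"

# Any category scheme can be applied to the combined continuous data.
combined = [size_class(d) for d in crew_a + crew_b]
print(combined)  # ['small', 'medium', 'large', 'small', 'medium']
```

Had the crews recorded only their own incompatible categories, no such function could recover a common classification.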
2.2.2 Accuracy
Many projects do not record the accuracy of their measurements. Accuracy is also critical when combining datasets: one group may be using very sophisticated equipment while another may be guessing. If you record the accuracy of the measurements, you can determine the accuracy of an integrated database that uses them. It is also important to maintain the accuracy of your data for predictive studies.
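One simple way to act on this advice is to store the stated accuracy alongside each source dataset; the integrated database can then report an honest combined figure. In this sketch (the source names and accuracies are made up), the integrated positional accuracy is taken as that of the least accurate source:

```python
# Hypothetical positional accuracy (+/- meters) recorded for each source.
sources = {
    "survey_gps": 0.05,    # differential GPS, +/- 5 cm
    "handheld_gps": 10.0,  # consumer receiver, +/- 10 m
    "map_estimate": 50.0,  # digitized from a paper map, +/- 50 m
}

# A conservative claim: the integrated database is only as accurate
# as its least accurate source.
integrated_accuracy = max(sources.values())
print(integrated_accuracy)  # 50.0
```

Without the recorded accuracies, there would be no defensible way to state what the combined database's accuracy is.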
2.2.3 Acquisition
You may find data in old paper forms, Excel files, or on the Internet. The paper-based data can simply be typed into the computer. Excel files can be opened either manually or automatically to have their data copied into an integrated spreadsheet. This book will show you how to use VBA to create “web crawlers” that obtain data from the Internet and reformat the data to include it in your research.
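This book builds its crawlers in VBA; as a language-neutral sketch of the same idea (Python standard library, with a hypothetical URL), a crawler boils down to "download text, then parse it into rows you can integrate":

```python
from urllib.request import urlopen

def parse_csv(text):
    """Split downloaded comma-separated text into rows of fields."""
    return [line.split(",") for line in text.strip().splitlines()]

def fetch_csv(url):
    """Download a text file from the web and parse it.
    The URL you pass is a placeholder for a real data source."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    return parse_csv(text)

# Parsing works the same whether the text came from the web or a file:
rows = parse_csv("plot,dbh_m\n1,0.6\n2,1.2")
print(rows)  # [['plot', 'dbh_m'], ['1', '0.6'], ['2', '1.2']]
```

Real web pages usually need more parsing than a simple split, but the download-then-reformat pattern is the same regardless of language.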
2.3 Integration
Integrating data will be the most challenging and frustrating part of managing your data. Sometimes you will have the data you need but no documentation that describes it in enough detail to use it correctly. Or you will find disks that contain data in a format that cannot be read. I have encountered both of these problems repeatedly. I recommend you look for datasets with good documentation or with contact information for the original creators of the data. If the documentation is thin, you can interview the original creators to fill in the gaps. Of course, you should include this new information when you republish the data.
2.4 Analysis
The biggest challenge I have faced with integrated data is that the analysis tools cannot handle large datasets (or the analyses take a very long time). This book will provide some potential solutions for this problem, but if your datasets are really large (over 100,000 records) you will want to bring in professional database managers and possibly programmers (or contact the author).
2.5 Maintenance
Computers and data are like cars: they need regular maintenance, or when you go back to them they just won't work. It is essential that you establish a plan for backing up your data and the associated documentation in at least two places. The first place is usually your work computer. Check with your Information Technology (IT) personnel to see if they have "offsite" backup, that is, a backup in a separate physical location. If this doesn't seem important, consider that most of the servers for Colorado State University administration were in the basement of the Engineering building. When it flooded during the Spring Creek flood of 1997, they were all destroyed. If they had not taken the time to keep offsite backups, all the administrative records would have been lost (including all of the students' grades!). If your IT folks do not have offsite backup, it's relatively easy to copy your programs and data to your home computer on a periodic basis.
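That periodic copy can be automated with a short script. The sketch below (Python, with placeholder paths, not a prescription from this book) copies a project folder into a new dated folder at a backup location:

```python
import datetime
import pathlib
import shutil

def backup(src_dir, backup_root):
    """Copy a project folder into a new dated folder under backup_root.

    src_dir and backup_root are placeholders -- point them at your own
    data folder and your home/offsite backup location.
    """
    stamp = datetime.date.today().isoformat()  # e.g. "2018-05-01"
    dest = pathlib.Path(backup_root) / ("backup_" + stamp)
    shutil.copytree(src_dir, dest)  # fails if today's backup already exists
    return dest
```

Keeping each backup in its own dated folder also gives you a crude history, so a file corrupted last month can still be recovered from an older copy.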
Data and documentation have traditionally been stored on "floppy" disks and, more recently, on external hard drives and "thumb" drives. All of these devices age with time and will eventually become unusable. Update your backups periodically so that if you need them, they actually work!
Another critical maintenance item is to make sure that all the documentation for your work is easily accessible and complete enough that if you leave, someone else can come in and use your data. An easy check is to ask a new student to review your documentation and data to see if they could use it. A great first project for a new graduate student is to do some analyses on an existing dataset; this gets them started without having to do field work.
2.6 Publication
The only thing to remember about publication within the data management framework is that you may be publishing results and even data that were collected by someone else. Always give credit to the original project and make sure you have permission to publish the data. This means you need to keep track of the sources of the data for any analysis based on integrated datasets.
2.7 Distribution
It is relatively easy to make your data available to others by simply placing an Excel file on the Internet with some associated documentation. When finished with this book you will also be able to publish your data on the Internet using an Access database.
As in publication, you must make sure you have permission to distribute the data you have used. Some individuals and organizations will make their data available for analysis but not publication or distribution. Make sure you check first.
Exercises
1. Create a plan for how you will manage data over the course of your research career. Include all the items above. This should be a relatively general description that identifies the tools you will use for data management.
© Copyright 2018 HSU - All rights reserved.