Wednesday, January 14, 2015

Answer: Where's the lake with the...



So.....  THAT was interesting!  


As you remember, before the break I posted a little story with a few "simple" questions to answer.  And it's taken me this long to get back to you with my commentary... 

Because this story is so interesting, I'm going to break my answer up into 3 separate posts.  (That way, you won't be overwhelmed with the long, slightly complicated story.)  Read my posts today, tomorrow, and Friday, and I hope you'll learn a lot along the way--just as I did.  



Our Story: 
   Little Rob was found dead of unknown causes in his apartment.  Near his body were scattered the contents of the rucksack he was packing, clearly anticipating a trip to the mountains.  In the pack was a vial of Acetazolamide pills, lots of fishing gear, and a road map of northern California.  For food he only had breakfast and lunch, so the suspicion for the murder falls on his hiking partner, Big Jim, who dropped one packet of freeze-dried dinner on his way out the front door.  There is also a list of larger lakes and reservoirs that have recently been stocked, with several circled as places to go fishing. Unfortunately, the piece of paper was torn in half, with most of the information missing.  

Your job is to try and find Big Jim.  Given what you know, you decide to start your search on high-altitude lakes in California that have been stocked with fish.  


1.  Can you find all of the lakes and reservoirs in California that have been stocked with fish in 2014, in Northern California, and are above 8,000 feet in elevation?  
2.  Since you know they're interested in fishing in larger lakes, can you find those Californian lakes more than 8,000 feet high that are also greater than 500 acres in surface area? 
3.  How do you know that Little Rob and Big Jim were planning on fishing at high altitude?  


This is a Challenge that requires you to find several different data sources.  

For #1, you need a (a) list of all lakes and reservoirs in California, (b) a list of all lakes and reservoirs in California that have been stocked with fish, (c) the latitude of all the lakes in California, and (d) the altitude of the lakes.  

You might ask, "isn't (b) overlapping with (a)"  (that is, wouldn't a list of stocked lakes be sufficient?), but I like to get multiple data sources for something like this.  Lists of things (like lakes) differ in their completeness, and sometimes you'll learn from the overlaps or items that are missing on one list, but present on the other.  

For #2, you need a (e) list of the surface area of lakes.  

IF you got incredibly lucky, you might find a single table that gives all of these features (lake-name, stocking-date, lat-long, altitude, surface-area), but that seems unlikely.  

I started my search with the obvious: 

     [ list of lakes in California ] 

Which leads to the Wikipedia result.  The article says there are ~3,000 named bodies of water in CA, and this is a good list to start with, but it's only around 420 lakes long.  


Fig 1


And I didn't really want to deal with working out the lat-long from the GNIS Id(s).  I thought I could do a little better by searching for a table from a slightly more authoritative source.  

So... I did another query to find the lake: 

     [ USGS database California lakes reservoirs ] 

and that led me to the Geographic Names Information Service (what GNIS stands for!). 

They have an interface that looks like this, making it easy to download a data file with lots of information about the lakes and reservoirs of California.  


Fig 2


I can grab that file (which, as you can see, uses vertical bars--aka "pipes"--to separate the values).  


Fig 3


This is getting pretty close.  This file has 2000 lakes in California, with the lat-long broken out, AND the elevation.  

I saved this file to my drive and then imported into a Fusion Table.  

(Why  a Fusion Table?  A:  Because I know I'm going to have to combine multiple data sets together--such this one above with another one that has fish-stocking information and surface areas.  A FT is the right tool to combine--or fuse--tables of information together.)  

In this case, I just created the FT (in Google Drive) and when it asked for the delimiting character, I said |  (the vertical bar, or "pipe") character.  

That FT is Lakes in CA, which looks like this:


Fig 4. Note that the attribution link is filled out telling where the data is from  (USGS/GNIS system)

Now we're starting to get somewhere.  This table is 2000 rows long (which is a suspicious number--exactly 2000 rows?).  

Let's now try to find the master list of stocked lakes in California.  

Now, since I live in California I happen to know that the agency here is called "California Department of Fish and Wildlife" (but you could easily figure that out on your own).  So my first query was: 

      [ California Department of Fish and Wildlife fish planting schedule ] 

Interesting.  Why did I say "planting" rather than "stocking"?  Truth is, I tried the term "stocking" first, and quickly learned that the preferred term of art was "planting."  Fine.  I can learn quickly, so I did this query which took me to the CDFW schedule at nrm.dfg.ca.gov/FishPlants/

Fig. 5

(You can also see their map of planted stocks, although I didn't use this map in my solution. It's very slow to load.)    
From this data source, I selected the "Export" button, which gave me a big CSV of all the lakes by planting (i.e., "stocking") date.  Here, it's the Fish Planting spreadsheet.  


Fig 6


Note that this list in the Fish Planting spreadsheet also includes rivers and such, but I don't care at this point because we're going to fuse this table with the previous table, and all the rivers won't match anything, so they'll just drop away.  So I just made a NEW Fusion Table with this data about fish planting in it, giving me the Fish Planting2 Fusion Table.  


So now I have two Fusion Tables:  (1) Lakes in CA, and  (2) FishPlantingSchedule2.  

As you can see, they both have a "Name" column in common between the two.  I'm going to fuse the tables on that column. In essence, the two tables will be joined together, with any row that shares the same name (such as "Avocado Lake") will be merged together.  

I do this by using the "Merge tables" option in the Lakes in CA FT.  

Fig 7

Which gives me, the  Merge of Lakes in CA and FishPlantingSchedule2   

If you scroll down a bit, you'll see a couple of the lakes that were found to be in common across the two tables: 

Fig 8


Note that Avocado Lake was actually planted twice--once on 12/21/14 and again at 1/11/2015, that's why it shows up in 2 separate rows.  

To see JUST the lakes that have been planted (out of the 2000 rows), I create a Filter to show me any row that has a date between Jan-1-2014 and Jan-31-2015.  

Fig 9


(To do that, just click on the column--Announce--and then click on the big blue Filter button.  Add the dates you want to filter on...)  

Now I can sort by Elevation ("Ele (ft)")  to see which lakes are highest... 

Fig 10


It's easy to see that there are 9 lakes (that have been planted) that are above 8,000 feet in elevation.  

But notice that Little Lake appears three times, while Horseshoe Lake and Pyramid Lake appear twice (they were all planted on separate days).  So there are REALLY only 4 lakes here.  

Now, which ones of these are in Northern California?  

The easiest way to figure this out is to map these lakes onto Google's My Maps.  So I did that... exporting the top lakes from the Fusion Table to a Google spreadsheet, and then IMPORTING that sheet into a new Google My Map.  (Here's how to import the data.)  

This is great, right? My map of the top 9 lakes above 8,000 feet now looks like this: 

Fig 11


There's just one little problem...  None of these lakes are high altitude!  What went wrong? 

In the map above the "Horseshoe Lake" shown is in Alameda county, at an altitude of about 10 feet.  Not 10,000 feet--just 10 feet.  (It's actually an old quarry near San Francisco Bay, just barely above sea level.)  

If  you zoom in on the map, you'll see it there:  Horseshoe Lake, part of a big park in Alameda County.  Obviously, it was planted with fish at some point.  So what happened?  


We need to go deeper.... 


Clearly, something is wonky with our data.  But what?  

Now I have to track my data really carefully.  Where did each item come from, and how was it handled?  

As I've said in previous posts, you have to track your metadata.  Whenever you create a data table (either as a spreadsheet or a Fusion Table), be sure to write down where it came from and what you did to it.  

To figure this out, I decided to track down the Horseshoe Lake example.  

First, I went back to my original FT, Lakes in CA, which has all of the lakes and lat/longs from USGS.  

I then filtered that data set to see ALL of the lakes with "horse" in the name.  (Just click on "Filter" and enter the text "horse.")  As you can see, here they all are... 

Fig 12


There are 6 lakes names "Horseshoe Lake" (and 1 named "Horseshoes Lake"), and as you can see, several of them are above 8,000 feet.  

Oddly, none of them are in Alameda county!  What gives?  

Let's go look at the other datasource--the Fish Planting spreadsheet and do the same thing, look for all lakes with the string "horse" in it.  When I do that, I see: 

Fig 13


Okay.. this is interesting.  The lat/long of this lake is 37.57, -122.0014 -- which is what we see in the map above.  BUT it's in Alameda!  And more strangely, this Horseshoe Lake doesn't appear in the master list of California lakes!    

So I thought I'd plot out all of the lakes in the list shown in Figure 12... And--guess what--they all seem wrong!  That is, the lat/longs don't seem to point to the lakes they list.  That is, the lat/long for Horseshoe Lake in Sierra county (393912N, 1203930W) is close, but not EXACTLY on the lake. 

Another mystery... 


I'm going to stop here for today, but let's look at what we need to figure out. 

1.  Why are the lat/longs for the lakes in USGS (FT, Lakes in CA, Fig 12) wrong?  That doesn't make sense--they're the people who provided the original data!  

2.  Why are the Horseshoe lakes plotted on the map in Fig 11 all in the wrong place? 

3.  (Important clue) Why is the lat/long of the Horseshoe Lake in Alameda county not the same as any lat/long in the master list?  

4.  (Another important clue) Why do all of the Horseshoe lakes in the Fusion Table of Fig 10 have the same lat/long??  


All these questions will be answered in TOMORROW's installment of SearchResearch.  

The quest continues... 




13 comments:

  1. I came upon the same issue when I had identified Pyramid, Little Lake, Cunningham Lake. Not Horseshoe Lake because no planting had been done in our time frame and in fact that still is in the future. When I checked Google map the lakes above were low elevations and small lakes. I didn’t understand it but they didn’t seem to impact my results. One of them was even, I believe, in a municipal park. Had any lakes been of interest I would have dug deeper. If you had not seen the fish planting at Horselake would you have dug deeper?

    ReplyDelete
    Replies
    1. I would have dug deeper. I used the Horseshoe Lake as an example, but I actually checked several other lakes and found similar problems. This is one of the points of this Challenge--when doing complex data analysis tasks, you ALWAYS have to double (and triple) check your work.

      Delete
  2. Good morning, Dr. Russell.

    I'm rereading the first answer. It is very interesting all your work and how you are solving it.

    I have some technical questions:
    A) When you downloaded lakes, why you didn't select only the ones with elevation 8000 or more?
    B)In fish planting, you selected announced date, that is data not just for the 15 time window, right? Why you first created a spreadsheet and not directly a FT? I did first in all cases spreadsheet first, and now I noticed that I can avoid that.

    About your questions, I think that maybe coordinates only give one specific point not all the space that Lakes and Reservoirs have. In the fishing planting, maybe it needs a special entry point and that is why coordinates vary so much.

    Looking forward for answer part 2 and 3. Have great day

    ReplyDelete
    Replies
    1. I noticed that your merged FT already has the two values for "Lat/Latitude" I am working on the formula to get the result in Spreadsheet. Did you, Dr. Russell, created the function or you downloaded results with both data on it?

      As already mentioned by Rosemary, Fish and Wildlife gives us DEC degrees. USGS for downloading gives DMS, and if we click on any name on the page shows both, but no way to download both.

      Delete
    2. That's right. This is one of the data-wrangling issues that comes up.

      Delete
  3. I have on a few occasions tackled a similar type of challenge to this, comparing data on First Nation reserves in Canada, for example mapping the communities that have boil water advisories. It reminds me of this challenge because each reserve can have multiple different names with multiple different spelling, or sometimes be referred to by a number, and their are unassociated communities in different parts of Canada that have the same name.

    In these cases doing a merge by name, gives you some very messy results. I cleaned the data I was using manually, by looking for results that didn't have any coordinates in my merged table and looking at each of those communities one by one to find their Lat/Long. I am curious to see how you tackled this lake data.

    ReplyDelete
    Replies
    1. This is a real problem (making sure all of the names of the lakes are the same). I'd hoped to tackle this question using OpenRefine, but will cover this topic in a future post. (Not this week!)

      Delete
  4. Tim brings up an excellent example of how difficult managing the data can be. In your merged table first we have the list of lakes data and to the right we have the fish planting data merged. The list of lakes uses the DMS coordinates (degrees minutes and seconds) and to the right we see coordinates lat/long. The list of lakes DMS coordinates are different for each record but to the right through merging we get lat/long repeats exactly the same. I think that occurred based on how the sheets are merged by Fusion. We likely need to use the same coordinates systems to get true merging. Names alone are not sufficient. Ideally we would have a single reference that matched in both sets of data but this is not the case as is.

    ReplyDelete
    Replies
    1. You are right, Rosemary. We can add a column with coordinates in the same unit Decimal or "degrees minutes and seconds". My approach was different. I believe it is easier to find match between lakes planted and elevation and with those search for coordinates. Other option is select in USGS only North part of California.

      Delete
    2. Whether we use a batch coordinate converter http://www.earthpoint.us/BatchConvert.aspx
      or if a function can be completed within Google Sheets I would say this can be done without one by one conversions. That being said, I haven't tried. Its getting both sets of data using one definite common reference that makes each row of data unique & cross-matched. At least I think that may be the necessary step required.

      Delete
    3. Nice to see Ramón still working on the challenge. I spent some time yesterday & I was going to wait and see what Dr.Dan had for us. But perhaps he has discovered the same problem I have seen. As you may recall I ignored the “inconsistencies” for lakes that didn’t fit the criteria we needed. But going back I looked again at Cunningham Lake. Here is a link http://cartographic.info/usa/map.php?id=258988 Specifically I refer you to the coordinates, the name and the map. If you check the List of Lakes provided by Dr. Dan you will see the degree coordinates match. If you check the Fish plantings table you’ll see decimal coordinates don’t match. If you zoom in on the map you’ll see the lake is Lockwood Lake not Cunningham Lake. What is going on? It doesn’t seem possible that CDFW would have the wrong coordinates because fishermen using this database would use their gps & these coordinates. I am absolutely stumped. Any ideas?

      Delete
    4. Ok just had a thought. Thinking back to my navigating skills when using legends at the bottom of topographical maps and the settings required for gps units you needed to know which datum has been used. Datums such as NAD27, NAD83 & WGS84. Heres a wiki article on the subject. http://en.wikipedia.org/wiki/North_American_Datum

      As well declinations (magnetic vs true north) are always changing which is why before planning a trip you need up to date declinations to get accurate bearings which is available online.

      I don’t think any of this has a bearing on our results. Or does it?

      Delete
  5. …is Big Jim bank fishing, trolling or setting a 32 mile modified drift net for Goldens on Cottonwood number 3…? I feel like I've been banging my head at the 5.9' elevation at the base of the Dawn Wall with this prolonged sear…zzzzZZZzzzzch.
    NatGeo - i became distracted…
    TC or KJ may have had a fish in their pockets?
    afterwards, NYT
    9.5 fingers
    good visuals
    …where did I leave my Fission Table in Fig 14? …is today tomorrow yet?

    ReplyDelete