NUFORC UFO sighting reports, Part I: Geocoding for UFO sightings

For thousands more years the mighty ships tore across the empty wastes of space and finally dived screaming on to the first planet they came across - which happened to be the Earth - where due to a terrible miscalculation of scale the entire battle fleet was accidentally swallowed by a small dog. - The Hitchhiker’s Guide to the Galaxy, Douglas Adams

Source

There are many versions of the UFO sightings dataset; I got mine here. It is only minimally cleaned, but it has enough information for me to do the cleaning myself, which I quite like since it lets me ensure data quality at the same time.

Missing Data

At the beginning, I had these columns:

-----------------------------------
columns           | missing values
------------------+----------------
summary           | 36
city              | 244
state             | 8373
date_time         | 2369
shape             | 3515
duration          | 3713
stats             | 53
report_link       | 0
text              | 54
posted            | 2369
city_latitude     | 22736
city_longitude    | 22736

A quick look at the data shows that the stats column provides information for every column except summary, text, report_link, city_latitude, and city_longitude. We can fill in a lot of these columns with the following function:

import numpy as np


def fill_na(data,
            column_missing,
            column_source,
            before,
            after,
            min_length):
    """ Takes the following as input:
            data {pandas dataframe} : data
            column_missing {string} : name of column that needs filling
            column_source {string} : name of column with data to fill missing
                                     data
            before {list of string} : string(s) that occur before the data
                                      needed to fill na.
            after {list of string or None} : string(s) that occur after the
                                             data needed to fill na. If None,
                                             read to the end of the source.
            min_length {int} : minimum length a candidate value must have to
                               be accepted.
        Reports errors.
        Returns:
            data {pandas dataframe} : data with nas filled if data is available."""
    missing = data[column_missing].isna()
    # Positional indices of the missing rows (assumes a default RangeIndex,
    # so positions line up with .loc labels).
    missing_index = np.arange(data.shape[0])[missing.values]
    for i in missing_index:
        source = data.loc[i, column_source]
        # The value starts right after the last "before" marker found.
        before_index = []
        for b in before:
            try:
                index = source.index(b) + len(b)
            except (ValueError, AttributeError):
                continue
            before_index.append(index)
        if len(before_index) < 1:
            print(f"{i}: Unable to find {before}. \n {source} \n")
            continue
        # The value ends at the first "after" marker found, or at the end of
        # the source text if no "after" markers were given.
        if after is None:
            after_index = [len(source)]
        else:
            after_index = []
            for a in after:
                try:
                    index = source.index(a)
                    after_index.append(index)
                except ValueError:
                    continue
        if len(after_index) < 1:
            print(f"{i}: Unable to find {after}. \n {source} \n")
            continue
        start_index = max(before_index)
        end_index = min(after_index)
        replacement = source[start_index:end_index].strip()
        # Skip candidates that are too short to be meaningful.
        if len(replacement) < min_length:
            continue
        data.loc[i, column_missing] = replacement
    return data
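
For example, to fill shape and duration from stats, the calls might look like the sketch below. The marker strings ("Shape: ", "Duration:") are assumptions about how the stats text is laid out in this particular scrape, so check a few rows before relying on them.

# Hypothetical markers; verify them against your own stats column first.
ufo = fill_na(ufo,
              column_missing="shape",
              column_source="stats",
              before=["Shape: "],
              after=["Duration:"],
              min_length=2)

ufo = fill_na(ufo,
              column_missing="duration",
              column_source="stats",
              before=["Duration:"],
              after=None,          # read to the end of the stats text
              min_length=2)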

After filling in missing values with existing data in the dataset (from the stats column):

--------------------------------------------------------------------------
columns           | missing values | now       | notes
------------------+----------------+-----------+--------------------------
summary           | 36             | 36        | did nothing
city              | 244            | dropped   | replaced by "Location"
state             | 8373           | dropped   | replaced by "Location"
date_time         | 2369           | 63        | older dates were missing
shape             | 3515           | 3515      | filled from stats
duration          | 3713           | 3708      | filled from stats
stats             | 53             | 53        | did nothing
report_link       | 0              | 0         | did nothing
text              | 54             | 54        | did nothing
posted            | 2369           | dropped   | replaced by "reported"
city_latitude     | 22736          | 22715     | filled w/ existing data
city_longitude    | 22736          | 22715     | filled w/ existing data

new columns:
-------------------------------------------------------------------------
columns           | missing values | notes
------------------+----------------+-------------------------------------
Location          | 53             | will be easier to geocode
reported          | 103            | more meaningful than "posted"

Where exactly?

So as indicated in the table, we are only missing 53 locations. A quick glance at the actual text will reveal that many people who reported to this service elected to remain anonymous. Some of them were not describing unidentified flying objects but abduction experiences that happened at home or other locations. So it makes sense that some fields are left out, including the location field.

As for the remaining locations, most are not detailed addresses but cities and states/provinces speckled across the globe. It would be very nice to plot the location data on maps, so we need the latitude and longitude of these locations.

Obviously, we can google our addresses/descriptions of locations and get the coordinates. But with 30 thousand unique locations, we can’t do it by hand; we need to request the information from Google automatically. However, this is a costly service for Google to provide if we bombard it with 30 thousand queries every minute for the whole day. Geocoding is expensive. For unlimited and fast service, we have to pay, or host our own 20GB server (DataScienceToolkit), which obviously also involves paying.

I’m doing this free and slow. Google allows, for free, 2500 queries per day at a rate of 50 requests per second. There are many other geocoding options, and you can switch to another service once you hit the limit with one, maximizing your output per day without paying. But not all geocoding options are created equal. For example, DataScienceToolkit only has street-level detail for locations in the US and UK. These are things to check and take into account when making choices.

For this project, most of the locations were already geocoded by the contributor who scraped the dataset.

Google meets its match:

1519: error in response! 
 Location: Moon (near the moon surface),.

{'results': [], 'status': 'ZERO_RESULTS'}

That’s actually pretty funny; it would be interesting to see how many people called in describing an experience that wasn’t on the Earth’s surface. Hopefully, by the time we’re done, we’ll only have 100 or so missing values left, and we can manually inspect them for extraplanetary elements.

Ok anyway, geocoding:

1. Get a Google API key

Here. Like all things Google, it’s very easy to set up.

2. Reduce workload by removing redundancies

We can remove duplicates from Location. I wasn’t expecting this to do much, but surprisingly, there are only about 30k unique locations among 110k entries.

>>> ufo.shape
(112095, 11)
>>> len(ufo.Location.unique())
30431

Many locations have already been geocoded; once we remove those, only about 14k unique locations still need geocoding. Using only the Google API, we can get this done in 6 days. If you’re new to geocoding, it may feel ridiculous that 14k data points take 6 days to obtain, because you’re used to lightning-fast results from computing tasks. But some operations are simply expensive, and this is an acceptable timeline for me since there are other analyses we can do in the meantime.

>>> len(locs_w_geocode)
16014 # locations with geocode
>>> len(locs_wo_geocode)
14418 # need to be geocoded
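
The split itself is just a matter of checking which Location values already come with coordinates. A minimal sketch of how that might look, assuming existing coordinates are indicated by a non-missing city_latitude:

# Unique locations that already appear with coordinates vs. those that do not.
locs_w_geocode = ufo.loc[ufo.city_latitude.notna(), "Location"].dropna().unique()
locs_wo_geocode = ufo.loc[ufo.city_latitude.isna(), "Location"].dropna().unique()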

3. Dealing with API limits and catching errors

There are several things to take into account while geocoding. Firstly, we need to read and understand the rules of the service we’re using. For example, Google allows 2500 queries per day, and a certain number of queries per second. An error will be thrown and the process terminated if the code goes beyond this limit. Secondly, because it’s not feasible for me to look through each entry, there will very likely be queries that make Google throw an error (joke entries, invalid formats, typos, lack of information, etc.). If we are making 500 queries in one go and there’s an error at the 500th query, we need to ensure we do not lose the results from the first 499 queries. And finally, we need a function that stops when there’s an error and reports it, so we can manually correct it and restart the process.

The following geocode function does all of this. There’s a sleep timer to slow down the search, since this is not a time-sensitive task for me; it was going to take 6 days anyway, so does it matter whether it takes 10 minutes or one hour per day? Not at all, so there’s really no need to hit top speed and risk an error. There are options for two APIs, but only the Google API was used.

The dictionaries are saved into a pickle file as soon as the function finishes. My Mac is overworked from my Masters and things crash without warning all the time, so leaving the dictionaries containing our precious query results sitting in memory over lunch, let alone overnight, is not a good idea.

The columns can then be filled with the remaining functions once all locations that can be geocoded have been geocoded.

def geocode(
    lat,
    long,
    country,
    locations,
    start,
    end,
    api,
    apikey):
    """Geocode locations[start:end] via the chosen API, filling the lat, long
    and country dictionaries. Returns the three dictionaries (possibly
    partially filled) as soon as an error or the query limit is hit, so no
    results are lost."""
    import requests
    import time
    from requests.packages.urllib3.exceptions import InsecureRequestWarning

    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    # DSTK mirrors the Google geocoder API, so only the base URL changes.
    baseurl = {"google" : "https://maps.googleapis.com/maps/api/geocode/json?address=",
               "dstk" : "http://www.datasciencetoolkit.org/maps/api/geocode/json?sensor=false&address="}
    base = baseurl[api]
    for i in range(start, end):
        # Skip locations that are already geocoded or too short to be useful.
        if locations[i] in lat:
            continue
        if locations[i] in long:
            continue
        if len(locations[i]) < 5:
            continue
        add = "+".join(locations[i].split(" "))
        # apikey is concatenated directly, so it should include its
        # query-string prefix (e.g. "&key=...").
        url = base + add + apikey
        # Sleep to stay well under the per-second rate limit.
        time.sleep(0.2)
        try:
            resp = requests.get(url, verify=False)
        except Exception:
            print(f"{i}: error in retrieval! \n Location: {locations[i]}.\n")
            return lat, long, country
        resp_json = resp.json()
        if resp_json["status"] != "OK":
            if resp_json["status"] != "OVER_QUERY_LIMIT":
                # Bad query (joke entry, typo, not enough information):
                # report it and move on.
                print(f"{i}: error in response! \n Location: {locations[i]}.\n")
                print(resp_json)
                continue
            else:
                # Daily quota exhausted: stop and hand back what we have.
                print(f"{i}: Over query limit! \n Location: {locations[i]}.\n")
                print(resp_json)
                return lat, long, country
        try:
            lat[locations[i]] = resp_json["results"][0]["geometry"]["location"]["lat"]
            long[locations[i]] = resp_json["results"][0]["geometry"]["location"]["lng"]
            # Pull the country name out of the address components.
            for a in resp_json["results"][0]["address_components"]:
                if "country" in a["types"]:
                    country[locations[i]] = a["long_name"]
                    break
        except Exception:
            print(f"{i}: error in parsing!\n Location: {locations[i]}.")
            print(resp_json)
            return lat, long, country
        # Print a progress line every 10 queries.
        if i % 10 == 0:
            print(f"{i} \n Location: {locations[i]}, {(lat[locations[i]]), (long[locations[i]])}, {country[locations[i]]} \n")
    return lat, long, country

def fill_col_from_dict(data, column, dictionary):
    # Fill a column by looking up each row's Location in the geocode dictionary.
    for i in data.index:
        try:
            data.loc[i, column] = dictionary[data.loc[i, "Location"]]
        except KeyError:
            # Location was never geocoded; report it with any existing coordinates.
            print(data.loc[i, "Location"],
                  "long: ",
                  data.loc[i, "city_longitude"],
                  "lat: ",
                  data.loc[i, "city_latitude"])
    # Rows with no Location at all get no coordinates.
    data.loc[data.Location.isna(), ["city_latitude"]] = np.nan
    data.loc[data.Location.isna(), ["city_longitude"]] = np.nan
    return data

def fill_lat_long(data, lat, long):
    print("filling latitude.")
    data = fill_col_from_dict(data, "city_latitude", lat)
    print("filling longitude.")
    data = fill_col_from_dict(data, "city_longitude", long)
    print("done.")
    return data
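
Putting it all together, one day's run looks roughly like the sketch below. The dictionary and file names are my own, and the key string is assumed to carry its "&key=" prefix since geocode concatenates it directly onto the URL.

import pickle

lat, long, country = {}, {}, {}

# Geocode one day's quota worth of locations with the Google API.
lat, long, country = geocode(lat, long, country,
                             locs_wo_geocode,
                             start=0, end=2500,
                             api="google",
                             apikey="&key=YOUR_API_KEY")  # hypothetical key

# Save the dictionaries immediately, before anything can crash.
with open("geocodes.pkl", "wb") as f:
    pickle.dump((lat, long, country), f)

# Once everything that can be geocoded has been geocoded:
ufo = fill_lat_long(ufo, lat, long)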

After running all functions:

>>> print(sum(ufo.city_latitude.isna()), "entries are missing in city latitude.")
357 entries are missing in city latitude.
>>> print(sum(ufo.city_longitude.isna()), "entries are missing in city longitude.")
357 entries are missing in city longitude.

Cleaning is mostly done now, and we’ve filled in a lot of missing data. I also converted the date_time and reported columns into datetime.date format, which is fairly straightforward with the datetime package.
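
As a rough sketch of that conversion (the format strings are guesses; check how the dates actually appear in your copy of the data):

from datetime import datetime

def to_date(value, fmt):
    # Return a datetime.date, or None if the value is missing or malformed.
    try:
        return datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        return None

ufo["date_time"] = ufo["date_time"].apply(lambda v: to_date(v, "%m/%d/%Y %H:%M"))
ufo["reported"] = ufo["reported"].apply(lambda v: to_date(v, "%m/%d/%Y"))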

Next up will be some data visualization/analysis of this dataset.