NUFORC UFO sighting reports, Part I: Geocoding for UFO sightings
For thousands more years the mighty ships tore across the empty wastes of space and finally dived screaming on to the first planet they came across - which happened to be the Earth - where due to a terrible miscalculation of scale the entire battle fleet was accidentally swallowed by a small dog. - The Hitchhiker’s Guide to the Galaxy, Douglas Adams
Source
There are many versions of the UFO sightings dataset; I got mine here. It is minimally cleaned, but has enough information for me to do the cleaning myself, which I quite like since it lets me ensure data quality at the same time.
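If you want to follow along, loading it is a one-liner; the file name here is just a placeholder for whatever you saved the download as.
import pandas as pd
# "nuforc_reports.csv" is a placeholder name for the downloaded file
ufo = pd.read_csv("nuforc_reports.csv")
ufo.isna().sum()  # the missing-value counts shown below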
Missing Data
At the beginning, I had these columns:
-----------------------------------
columns | missing values
------------------+----------------
summary | 36
city | 244
state | 8373
date_time | 2369
shape | 3515
duration | 3713
stats | 53
report_link | 0
text | 54
posted | 2369
city_latitude | 22736
city_longitude | 22736
A quick look at the data shows that the stats column provides information for every column except summary, text, report_link, city_latitude, and city_longitude. We can fill in a lot of these columns with the following function:
import numpy as np

def fill_na(data,
            column_missing,
            column_source,
            before,
            after,
            min_length):
    """Takes the following as input:
        data {pandas dataframe} : data
        column_missing {string} : name of column that needs filling
        column_source {string} : name of column with data to fill missing data
        before {list of string} : string(s) that occur before the data needed to fill na
        after {list of string or None} : string(s) that occur after the data needed to fill na;
            if None, take everything up to the end of the source string
        min_length {int} : minimum length a candidate value must have to be accepted
    Reports errors.
    Returns:
        data {pandas dataframe} : data with nas filled if data is available."""
    missing = data[column_missing].isna()
    # assumes a default RangeIndex, so positions double as labels for .loc
    missing_index = np.arange(data.shape[0])[missing.values]
    for i in missing_index:
        source = data.loc[i, column_source]
        # the wanted value starts right after one of the "before" markers
        before_index = []
        for b in before:
            try:
                index = source.index(b) + len(b)
            except (ValueError, AttributeError):
                continue
            before_index.append(index)
        if len(before_index) < 1:
            print(f"{i}: Unable to find {before}. \n {source} \n")
            continue
        # the wanted value ends at one of the "after" markers,
        # or at the end of the string if no markers were given
        if after is None:
            after_index = [len(source)]
        else:
            after_index = []
            for a in after:
                try:
                    index = source.index(a)
                    after_index.append(index)
                except (ValueError, AttributeError):
                    continue
        if len(after_index) < 1:
            print(f"{i}: Unable to find {after}. \n {source} \n")
            continue
        min_index = max(before_index)
        max_index = min(after_index)
        try:
            replacement = source[min_index:max_index].strip()
        except Exception:
            print(f"{i}: Can't replace. \n")
            print(f" min_index = {min_index}, max_index = {max_index} \n")
            print(f" source: {source}")
            continue
        # skip matches that are too short to be meaningful
        if len(replacement) < min_length:
            continue
        data.loc[i, column_missing] = replacement
    return data
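As an example of how it gets used, here is roughly how the duration column could be filled from stats. The exact marker strings depend on how the stats text is written, so take these as illustrative rather than the precise ones used for every column.
# Illustrative call: fill missing durations from the stats text,
# assuming it contains something like "Duration:20 minutes" ending in a newline.
ufo = fill_na(ufo,
              column_missing="duration",
              column_source="stats",
              before=["Duration:"],
              after=["\n"],
              min_length=2)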
After filling in the missing values with existing data from the stats column:
--------------------------------------------------------------------------
columns | missing values | now | notes
------------------+----------------+-----------+--------------------------
summary | 36 | 36 | did nothing
city | 244 | dropped | replaced by "Location"
state | 8373 | dropped | replaced by "Location"
date_time | 2369 | 63 | older dates were missing
shape | 3515 | 3515 | filled from stats
duration | 3713 | 3708 | filled from stats
stats | 53 | 53 | did nothing
report_link | 0 | 0 | did nothing
text | 54 | 54 | did nothing
posted | 2369 | dropped | replaced by "reported"
city_latitude | 22736 | 22715 | filled w/ existing data
city_longitude | 22736 | 22715 | filled w/ existing data
new columns:
-------------------------------------------------------------------------
columns | missing values | notes
------------------+----------------+-------------------------------------
Location | 53 | will be easier to geocode
reported | 103 | more meaningful than "posted"
Where exactly?
So as indicated in the table, we are only missing 53 locations. A quick glance at the actual text reveals that many people who reported to this service elected to remain anonymous. Some of them were not describing unidentified flying objects but abduction experiences that happened at home or in other places. So it makes sense that some fields are left out, including the location field.
As for the remaining entries, most are not detailed addresses but cities and states/provinces speckled across the globe. It would be very nice to plot the location data on maps, so we need the longitude and latitude of these locations.
Obviously, we can google our addresses/descriptions of locations and get the coordinates, but with 30 thousand unique locations we can’t do it by hand. We need to request the information from Google automatically. However, that is a costly service for Google to provide if we bombard it with 30 thousand queries every minute for the whole day. Geocoding is expensive: for unlimited and fast service, we have to pay, or host our own 20 GB server (DataScienceToolkit), which also involves paying.
I’m doing this free and slow. Google allows, for free, a usage limit of 2,500 queries per day at a rate of 50 requests per second. There are many other geocoding options, so you can switch to another service once you hit the limit with one and maximize your output per day without paying. But not all geocoding options are created equal; for example, DataScienceToolkit only has street-level detail for locations in the US and UK. These are things to check and take into account when making choices.
For this project, most of the locations were already geocoded by the contributor who scraped the dataset.
Google meets its match:
1519: error in response!
Location: Moon (near the moon surface),.
{'results': [], 'status': 'ZERO_RESULTS'}
That’s actually pretty funny; it would be interesting to see how many people called in describing an experience that wasn’t on the Earth’s surface. Hopefully, by the time we’re done, we’ll only have 100 or so missing values left, and we’ll manually inspect those for extraplanetary elements.
Ok anyway, geocoding:
1. Get a Google API key
Here. Like all things Google, it’s very easy to set up.
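Since the geocode function below just glues the key onto the request URL, I keep it in a string that already carries the query parameter. Reading it from an environment variable (the variable name here is my own choice) keeps it out of the notebook.
import os
# geocode() builds its request as base + address + apikey,
# so the key string carries its own "&key=" query parameter.
# GOOGLE_API_KEY is an assumed environment variable name.
apikey = "&key=" + os.environ["GOOGLE_API_KEY"]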
2. Reduce workload by removing redundancies
We can remove duplicates from Location. I wasn’t expecting this to do much, but surprisingly, there are only about 30k unique locations in roughly 112k entries.
>>> ufo.shape
(112095, 11)
>>> len(ufo.Location.unique())
30431
Many locations have been geocoded before; when we remove those, only about 14k unique locations need geocoding. Using only the Google API, we can get this done in 6 days. If you’re new to geocoding, it will feel ridiculous that 14k data points take 6 days to obtain, because you’re used to lightning-fast results from computing tasks. But some operations are just that expensive. This is an acceptable timeline for me, since there are other analyses we can do in the meantime.
>>> len(locs_w_geocode)
16014 # locations with geocode
>>> len(locs_wo_geocode)
14418 # need to be geocoded
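Roughly how those two sets can be built (the variable names mirror the output above, but the exact bookkeeping in my notebook may differ):
# Every distinct location string in the dataset
all_locs = set(ufo.Location.dropna().unique())
# Locations that already came with coordinates from the original scrape
locs_w_geocode = set(ufo.loc[ufo.city_latitude.notna(), "Location"].dropna().unique())
# Everything else still needs a trip to the geocoding API
locs_wo_geocode = sorted(all_locs - locs_w_geocode)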
3. Dealing with API limits and catching errors
There are several things to take into account while geocoding. Firstly, we need to read and understand the rules of the service we’re using. For example, Google allows 2,500 queries per day and a certain number of queries per second; an error will be thrown and the process terminated if the code goes beyond this limit. Secondly, because it’s not feasible for me to look through each entry, there are very likely queries that will make Google throw an error (joke entries, invalid formats, typos, lack of information, etc.). If we are making 500 queries in one go and there’s an error at the 500th query, we need to make sure we don’t lose the results from the first 499. And finally, we need a function that stops when there’s an error and reports it, so we can manually correct it and restart the process.
The following geocode function does all of this. There’s a sleep timer to slow down the search, since this is not a time-sensitive task for me: it was going to take 6 days anyway, so does it matter whether it takes 10 minutes or one hour per day? Not at all, so there’s really no need to hit the top speed and risk getting an error. There are options for two APIs, but only the Google API was used.
The dictionaries are saved into a pickle file as soon as the function finishes. My Mac is overworked from my Masters and things crash without warning all the time, so leaving the dictionaries containing our precious queries lying there over lunch, let alone overnight, is not a good idea.
The columns can then be filled with the remaining functions once all locations that can be geocoded have been geocoded.
def geocode(lat,
            long,
            country,
            locations,
            start,
            end,
            api,
            apikey):
    """Geocode locations[start:end], filling the lat, long and country
    dictionaries. Returns the three dictionaries so that partial results
    survive any error or quota limit."""
    import requests
    import time
    from requests.packages.urllib3.exceptions import InsecureRequestWarning
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    baseurl = {"google": "https://maps.googleapis.com/maps/api/geocode/json?address=",
               "dstk": "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="}
    base = baseurl[api]
    for i in range(start, end):
        # skip locations we already have, and ones too short to be useful
        if locations[i] in lat:
            continue
        if locations[i] in long:
            continue
        if len(locations[i]) < 5:
            continue
        add = "+".join(locations[i].split(" "))
        url = base + add + apikey
        time.sleep(0.2)  # stay well below the requests-per-second limit
        try:
            resp = requests.get(url, verify=False)
        except requests.exceptions.RequestException:
            print(f"{i}: error in retrieval! \n Location: {locations[i]}.\n")
            return lat, long, country
        resp_json = resp.json()
        if resp_json["status"] != "OK":
            if resp_json["status"] != "OVER_QUERY_LIMIT":
                # bad query (joke entry, typo, too little information): report and move on
                print(f"{i}: error in response! \n Location: {locations[i]}.\n")
                print(resp_json)
                continue
            else:
                # daily quota reached: stop and hand back what we have so far
                print(f"{i}: Over query limit! \n Location: {locations[i]}.\n")
                print(resp_json)
                return lat, long, country
        try:
            lat[locations[i]] = resp_json["results"][0]["geometry"]["location"]["lat"]
            long[locations[i]] = resp_json["results"][0]["geometry"]["location"]["lng"]
            # find country
            for a in resp_json["results"][0]["address_components"]:
                if "country" in a["types"]:
                    country[locations[i]] = a["long_name"]
                    break
        except (IndexError, KeyError):
            print(f"{i}: error in parsing!\n Location: {locations[i]}.")
            print(resp_json)
            return lat, long, country
        if i % 10 == 0:
            # progress report every 10 locations (.get avoids a crash if no country came back)
            print(f"{i} \n Location: {locations[i]}, {(lat[locations[i]]), (long[locations[i]])}, {country.get(locations[i])} \n")
    return lat, long, country
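As mentioned above, the dictionaries go straight into a pickle file after every run. Here is a rough sketch of one day's batch, with an assumed pickle file name and the 2,500-query daily quota as the batch size.
import pickle

# Resume from earlier runs if the pickle exists, otherwise start fresh.
# "geocode_dicts.pkl" is just the name used in this sketch.
try:
    with open("geocode_dicts.pkl", "rb") as f:
        lat, long, country = pickle.load(f)
except FileNotFoundError:
    lat, long, country = {}, {}, {}

# One day's worth of Google's free quota; bump "day" each day
day = 0
lat, long, country = geocode(lat, long, country,
                             locs_wo_geocode,
                             start=day * 2500,
                             end=min((day + 1) * 2500, len(locs_wo_geocode)),
                             api="google", apikey=apikey)

# Save immediately so a crash doesn't cost us the day's queries
with open("geocode_dicts.pkl", "wb") as f:
    pickle.dump((lat, long, country), f)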
def fill_col_from_dict(data, column, dictionary):
    """Fill a column by looking each row's Location up in a dictionary."""
    for i in data.index:
        try:
            data.loc[i, column] = dictionary[data.loc[i, "Location"]]
        except KeyError:
            # location was never geocoded; report what we already have
            print(data.loc[i, "Location"],
                  "long: ",
                  data.loc[i, "city_longitude"],
                  "lat: ",
                  data.loc[i, "city_latitude"])
    # rows with no Location at all get no coordinates
    data.loc[data.Location.isna(), ["city_latitude"]] = np.nan
    data.loc[data.Location.isna(), ["city_longitude"]] = np.nan
    return data
def fill_lat_long(data, lat, long):
    """Fill the coordinate columns from the lat and long dictionaries."""
    print("filling latitude.")
    data = fill_col_from_dict(data, "city_latitude", lat)
    print("filling longitude.")
    data = fill_col_from_dict(data, "city_longitude", long)
    print("done.")
    return data
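Tying it together, the last step is a single call with the dictionaries accumulated above:
# Hand the geocoded coordinates back to the dataframe
ufo = fill_lat_long(ufo, lat, long)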
After running all functions:
>>> print(sum(ufo.city_latitude.isna()), "entries are missing in city latitude.")
357 entries are missing in city latitude.
>>> print(sum(ufo.city_longitude.isna()), "entries are missing in city longitude.")
357 entries are missing in city longitude.
Cleaning is mostly done now, and we’ve filled in a lot of missing data. I also converted the date_time and reported columns into datetime.date format, which is fairly straightforward with the datetime package.
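Roughly like this; the format strings are guesses at how the scraped dates look, so treat it as a sketch rather than the exact conversion I ran.
from datetime import datetime

def to_date(value, fmt="%m/%d/%Y %H:%M"):
    """Parse a date string into a datetime.date, leaving anything unparseable alone."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except (AttributeError, ValueError):
        return value  # NaN or an unexpected format stays as-is

ufo["date_time"] = ufo["date_time"].apply(to_date)                # date plus time
ufo["reported"] = ufo["reported"].apply(to_date, fmt="%m/%d/%Y")  # date only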
Next up will be some data visualization/analysis of this dataset.