Data Collection and Potential Analysis
Some factors, such as the education and income levels of the criminals, are known to correlate strongly and directly with crime rates, and many studies have explained these relationships. Indirect factors, however, are rarely given credit. We therefore decided to examine one direct factor alongside several indirect ones. The indirect factors we are investigating are moon phases, obesity rates and smoking rates, which may correlate indirectly with crime rates, and the direct factor of interest is the overall education level, so we plan to collect data on all of them. We will also collect 10 datasets, one for each of 10 major cities, to allow better comparison and more comprehensive insights.
-
Crime Data
To obtain crime data for the 10 cities, we collected datasets from Data.gov, including crime data for Los Angeles, Chicago, Seattle, New York City, etc. We chose 2017 as the study year, so we filtered each dataset and extracted the records from 2017. Each record describes one crime reported in that year and contains the victim’s information, location, crime type and other crime-related information.
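The year filter described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample rows are made up, and the date format `%m/%d/%Y` is an assumption that would need to be checked against each city's file.

```python
from datetime import datetime

# Hypothetical sample rows; real files have many more columns.
# The date format is an assumption and may differ between cities.
records = [
    {"Date Occurred": "03/14/2017", "Crime Code Description": "BURGLARY"},
    {"Date Occurred": "12/31/2016", "Crime Code Description": "ROBBERY"},
    {"Date Occurred": "07/04/2017", "Crime Code Description": "VANDALISM"},
]


def filter_by_year(rows, year, date_field="Date Occurred", fmt="%m/%d/%Y"):
    """Return only the rows whose date field falls in the given year."""
    return [
        row for row in rows
        if datetime.strptime(row[date_field], fmt).year == year
    ]


crimes_2017 = filter_by_year(records, 2017)
print(len(crimes_2017), "records kept for 2017")
```

Parsing the date instead of matching the string "2017" avoids accidentally keeping rows where 2017 appears in another part of the field.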
-
Demographics Data
The crime dataset includes three demographic features: victim’s age, victim’s gender and victim’s race. Based on this demographic information, we’ll analyze how crime rates and crime types vary by age, gender and race. Starting from the victims gives us essential demographic information that will form the basis of further investigation.
-
Weather and Moon Phase Data
Moon phase datasets are retrieved from worldweatheronline.com through its API and include each day’s moon phase, moon illumination, sun hours and temperature. The output CSV files are named Outfile_Moonphase_[City Name], e.g. Outfile_Moonphase_Huston. Although we mainly focus on the moon phase, the other attributes in this dataset may also contribute to the effect, so we list them all; some of them may be excluded later if they prove disruptive. We will also check whether there is any correlation between moon phase and certain types of crime. Ideally, the results from the moon phase datasets will agree with the “Human Tidal Wave” study, which holds that the moon phase can cause “biological tides”. We may be able to find patterns in how the moon phase affects the rate of crimes committed. Sun hours and temperature may have a similar influence, but the pattern may be more erratic. The weather attribute contains 35 weather types, including blizzard, light snow, overcast, sunny and heavy rain.
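One simple way to start looking for a moon phase pattern is to compare the average number of crimes per day across phases. The sketch below assumes the moon phase data has already been loaded into a date-to-phase mapping and that crime records have been reduced to a list of dates; all values are invented for illustration, and a real analysis would also need a significance test.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical inputs: one moon phase per day, one date per crime record.
moon_phase_by_day = {
    "2017-01-01": "New Moon",
    "2017-01-02": "New Moon",
    "2017-01-12": "Full Moon",
}
crime_dates = [
    "2017-01-01",
    "2017-01-02", "2017-01-02",
    "2017-01-12", "2017-01-12", "2017-01-12",
]

# Count crimes per day, then group the daily counts by moon phase.
crimes_per_day = Counter(crime_dates)
counts_by_phase = defaultdict(list)
for day, phase in moon_phase_by_day.items():
    counts_by_phase[phase].append(crimes_per_day.get(day, 0))

# Average daily crime count for each phase.
avg_crimes_by_phase = {
    phase: mean(counts) for phase, counts in counts_by_phase.items()
}
print(avg_crimes_by_phase)
```

Iterating over the days in the moon phase table, instead of over the crime records, keeps days with zero crimes in the average.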
-
Education Data
For the education datasets, we plan to collect school data for elementary, secondary and postsecondary institutions. We’ll first count the density of schools by zip code and categorize schools into types (regular, special education, vocational/technical, etc.). We can then analyze how school densities and school types in each city/county relate to different types of crime, and further investigate whether the school population of each area has an effect on crime rates.
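The counting step could look like the sketch below. The field names `zip` and `type` and the sample rows are hypothetical. The raw count per zip code is only a proxy for density; dividing by area or population would require additional data that is not described here.

```python
from collections import Counter

# Hypothetical school records; field names are assumptions.
schools = [
    {"zip": "90001", "type": "regular"},
    {"zip": "90001", "type": "vocational/technical"},
    {"zip": "60601", "type": "regular"},
    {"zip": "90001", "type": "regular"},
]

# Number of schools in each zip code.
schools_per_zip = Counter(s["zip"] for s in schools)

# Number of schools of each type within each zip code.
schools_per_zip_and_type = Counter((s["zip"], s["type"]) for s in schools)

print(schools_per_zip.most_common())
print(schools_per_zip_and_type[("90001", "regular")])
```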
-
Health Data
To study the relationship between health status and crime rates at the county level, we collected data on smoking rates, obesity, exercise facilities and food insecurity for all counties in the United States. We believe these factors may affect crime rates geographically. The dataset contains three years of data, from 2015 to 2017. We will investigate whether health status shows a trend of affecting crime rates over these years and how much each of these factors contributes to crime rates.
Data Issues and Data Cleaning
-
Crime Data
Each city’s crime dataset has over 20 columns; however, some of the columns (features) are not necessary for analyzing crime, for example weapon usage and weapon description. In addition, because information is often lacking in the crime reports, some columns contain many missing values (over 60%). Such columns should be removed to reduce the dimensionality of the data.
After checking the variables’ information and descriptions, these 12 columns have been retained for further analysis:
'Date Occurred', 'Area ID', 'Area Name', 'Crime Code', 'Crime Code Description', 'Victim Age', 'Victim Sex', 'Victim Descent', 'Premise Code', 'Premise Description', 'Status Description', 'Location '.
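A possible way to apply the 60% missing-value rule is sketched below. The sample rows are invented, and treating `None` and blank strings as missing is an assumption about how gaps appear in the raw files.

```python
def missing_fraction(rows, column):
    """Fraction of rows where the column is None or blank."""
    missing = 0
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            missing += 1
    return missing / len(rows)


def drop_sparse_columns(rows, threshold=0.6):
    """Drop every column whose missing fraction exceeds the threshold."""
    columns = list(rows[0].keys())
    kept = [c for c in columns if missing_fraction(rows, c) <= threshold]
    return [{c: row.get(c) for c in kept} for row in rows], kept


# Hypothetical sample: "Weapon Description" is missing in 4 of 5 rows.
rows = [
    {"Date Occurred": "01/05/2017", "Victim Age": "34", "Weapon Description": None},
    {"Date Occurred": "02/11/2017", "Victim Age": "", "Weapon Description": None},
    {"Date Occurred": "03/20/2017", "Victim Age": "51", "Weapon Description": "HAND GUN"},
    {"Date Occurred": "04/02/2017", "Victim Age": "27", "Weapon Description": None},
    {"Date Occurred": "05/17/2017", "Victim Age": "19", "Weapon Description": ""},
]

cleaned, kept = drop_sparse_columns(rows)
print(kept)
```

Columns that pass the threshold but are still irrelevant to the analysis would have to be removed by name in a separate step.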
-
Weather and Moon Phase Data
The moon phase data was retrieved through the API in JSON format. After checking for missing and noisy data with a Python script, we found that both the numerical and the categorical data in the weather and moon phase datasets are within appropriate ranges. These datasets therefore have no data issues: they contain no missing values or noise.
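A check of this kind might look like the following sketch. The JSON field names and the temperature bounds are assumptions made for illustration and do not reflect the actual API response; only the 0 to 100 range for illumination (a percentage) and 0 to 24 for sun hours follow from the meaning of the fields.

```python
import json

# Hypothetical JSON payload; field names are assumptions.
raw = """[
  {"date": "2017-01-01", "moon_phase": "Waxing Crescent",
   "moon_illumination": 9, "sun_hour": 8.7, "temp_c": 12},
  {"date": "2017-01-02", "moon_phase": "Waxing Crescent",
   "moon_illumination": 17, "sun_hour": 6.1, "temp_c": 11}
]"""

# Allowed ranges per numeric field; the temperature bounds are a loose guess.
RANGES = {"moon_illumination": (0, 100), "sun_hour": (0, 24), "temp_c": (-60, 60)}


def find_issues(rows, ranges):
    """Return (row index, field, problem) for missing or out-of-range values."""
    issues = []
    for i, row in enumerate(rows):
        for field, value in row.items():
            if value is None or value == "":
                issues.append((i, field, "missing"))
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not low <= value <= high:
                issues.append((i, field, "out of range"))
    return issues


days = json.loads(raw)
issues = find_issues(days, RANGES)
print(issues)
```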
-
Education Data
For the three education datasets, which were collected separately, there is no direct sign of empty values in any attribute, but there are several other issues with the datasets:
-
Inconsistent attributes: the same column (Type, Status, Level) uses different attribute directories in different datasets
-
Primary zip codes and secondary zip codes are missing the leading 0
-
Attribute “Population” contains the values 0 and -999
For the education data, the unique values of the categorical attributes are printed first, and they show no sign of abnormal data. The numerical attributes mentioned above, such as “zip codes” and “population”, are then checked. In addition, a variable directory layout is available for each dataset, and several categorical attributes are already grouped. Checking against the layout files shows that some attribute values (such as 0 and -999) are in fact labels for missing values.
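The zip code and population issues listed above could be handled along the lines of this sketch. The field names and sample rows are hypothetical; the 5-digit zip width and the treatment of both 0 and -999 as missing are assumptions, the latter based on the layout files mentioned above.

```python
# Values the layout files label as missing (per the description above).
MISSING_CODES = {0, -999}


def clean_zip(value):
    """Restore a leading zero lost when a zip code was stored as a number."""
    return str(value).strip().zfill(5)


def clean_population(value):
    """Map the missing-value codes to None; keep real counts as integers."""
    number = int(value)
    return None if number in MISSING_CODES else number


# Hypothetical school rows with the issues described above.
schools = [
    {"zip": 2139, "population": 412},
    {"zip": "90001", "population": -999},
    {"zip": 6511, "population": 0},
]

cleaned = [
    {"zip": clean_zip(s["zip"]), "population": clean_population(s["population"])}
    for s in schools
]
print(cleaned)
```

Storing zip codes as strings after padding prevents the leading zero from being dropped again when the file is reloaded.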
-
Health Data
The dataset identifies counties by a unique geographical ID number. To create a column giving each observation’s county name, we collected another dataset from the same website, containing the ID number and name of each county, and merged it with the health dataset.
From the county ID and name dataset, unnecessary attributes such as “display_name”, “image_author”, “image_link” and “image_meta” are deleted. Deleting these trivial columns yields a clean merged dataset of counties’ health status.
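The merge can be sketched as a lookup by county ID. The key name `id`, the field `name`, and all sample values below are hypothetical; only the dropped attribute names come from the description above.

```python
# Hypothetical health rows keyed by a county ID; all values are made up.
health = [
    {"id": 101, "smoking_rate": 0.11, "obesity_rate": 0.22},
    {"id": 102, "smoking_rate": 0.14, "obesity_rate": 0.27},
    {"id": 999, "smoking_rate": 0.10, "obesity_rate": 0.20},
]

# Hypothetical county lookup table with extra columns that are not needed.
counties = [
    {"id": 101, "name": "County A", "display_name": "A", "image_link": "x"},
    {"id": 102, "name": "County B", "display_name": "B", "image_link": "y"},
]


def add_county_names(health_rows, county_rows, key="id"):
    """Left-join county names onto health rows; unmatched IDs get None."""
    # Building the lookup from only the key and the name drops the
    # unnecessary columns (display_name, image_link, ...) implicitly.
    name_by_id = {c[key]: c["name"] for c in county_rows}
    return [
        {**row, "county_name": name_by_id.get(row[key])}
        for row in health_rows
    ]


merged = add_county_names(health, counties)
print(merged[0])
```

Keeping unmatched rows with a `None` name, instead of dropping them, makes any ID mismatch between the two files visible during cleaning.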
When checking the cleanliness of the data, we found neither missing values nor abnormal data for the obesity and smoking rates. Therefore, there are no data issues in the health dataset.