Explore the SAT test in NYC¶
The SAT, or Scholastic Aptitude Test, is a test that high schoolers take in the US before applying to college. Colleges take the test scores into account when making admissions decisions, so it's fairly important to do well on. The test is divided into 3 sections, each of which is scored out of 800 points. The total score is out of 2400 (although this has changed back and forth a few times, the scores in this dataset are out of 2400). High schools are often ranked by their average SAT scores, and high SAT scores are considered a sign of how good a school district is.
We combine several related datasets; their descriptions can be found in the README. Briefly, the datasets we use are as follows:
SAT results - contains average SAT scores on the three subjects by each school and the number of SAT takers.
High School Directory - contains information about each high school.
Math test results - math test results for each school.
Class size - class size of each high school.
AP test results - Advanced Placement exam results for each high school. Passing AP exams can get you college credit in the US.
Graduation outcomes - percentage of students who graduated and other outcome information.
Demographics and accountability - demographic information for each school.
School district maps - contains information on the layout of the school districts, so that we can map them out.
School survey - surveys of parents, teachers, and students at each school (scores).
Showcase:
- Data cleaning
- Data wrangling
- Data preprocessing
- Mapping and visualization
- Insights
Understanding the data¶
We first take a quick look at the data, i.e. what each file contains. We read in all the files and see what attributes are in each dataset.
import pandas as pd
import numpy as np
files = ['ap_college_board.csv', 'class_size.csv',
         'demographics_and_accountability.csv',
         'graduation_outcomes.csv', 'high_school_directory.csv',
         'math_test_results.csv', 'sat_results.csv']
data = {}
for f in files:
    d = pd.read_csv("data/{}".format(f))
    data[f.replace(".csv", "")] = d
for k, v in data.items():
    print("\n" + k + "\n")
    print(v.head(5))
Combine Files into a Single Dataset¶
A DBN column appears in most of the datasets; it is a unique code for each school in NYC. The problem is that two of the datasets, class_size and high_school_directory, don't have a DBN field. In the high_school_directory data it's just named dbn, so we can simply rename the column, or copy it over into a new column called DBN. In the class_size data, we'll need to take a different approach.
A DBN value in the other datasets looks like "01M448": the first two digits are the district number and the rest is the school code. For example:
data["ap_college_board"]["DBN"].head(3)
Although the class_size dataset does not contain a DBN column, it has columns that can be used to build one. Let's look at the columns in class_size. The first three columns are just what we need.
data["class_size"].head()
Now we have enough information to build the DBN column for class_size and to copy it over for high_school_directory. We add a new DBN column to these two datasets.
data["class_size"]["DBN"] = (data["class_size"].apply
(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis = 1))
data["high_school_directory"]["DBN"] = data["high_school_directory"]["dbn"]
Adding in the survey data¶
Student, parent, and teacher surveys about the quality of schools.
One of the most potentially interesting datasets to look at is the dataset on student, parent, and teacher surveys about the quality of schools. These surveys include information about the perceived safety of each school, academic standards, and more. Before we combine our datasets, let's add in the survey data.
In this case, we'll add the survey data into our data dictionary and then combine all the datasets afterwards. The survey data consists of 2 files, one for all schools and one for school district 75. We tag each row with a d75 flag so we know which file it came from, then concatenate the two.
survey1 = pd.read_excel("data/survey_all.xlsx", sheet_name = 'Sheet1')
survey2 = pd.read_excel("data/survey_d75.xlsx", sheet_name = 'Sheet1')
print(survey1.head())
print(survey2.head())
survey1["d75"] = False
survey2["d75"] = True
survey = pd.concat([survey1, survey2], axis = 0, sort = False)
print(survey1.shape)
print(survey2.shape)
print(survey.shape)
survey.head()
The survey data has too many columns, and most of them are extraneous. Guided by the survey's column dictionary, let's keep only the important fields and remove the extraneous columns.
import warnings
warnings.filterwarnings('ignore')
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_10", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11",]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
survey.shape
Condensing data, more preprocessing¶
Datasets such as class_size and demographics_and_accountability have multiple rows for each school, whereas datasets such as sat_results and high_school_directory have only one row per school. We need to find a way to combine these datasets. For example:
print(data["demographics_and_accountability"].head())
print(data["class_size"].head())
print(data["sat_results"].head())
print(data["high_school_directory"].head())
In order to combine these datasets into a single one, we need to make every dataset contain only one row per school; otherwise we cannot compare, say, SAT results to class size. We can accomplish this by first understanding the data better, then doing some aggregation. In the class_size dataset, GRADE and PROGRAM TYPE have multiple values for each school. By restricting each field to a single value, we can filter out most of the duplicate rows. In the code below, we:
- Only select values from class_size where the GRADE field is 09-12.
- Only select values from class_size where the PROGRAM TYPE field is GEN ED.
- Group the class_size dataset by DBN and take the average of each numeric column. Essentially, we find the average class size values for each school.
- Reset the index, so DBN is added back in as a column.
class_size = data["class_size"]
# we only keep the high school data, which is grade 9 to 12 with general education
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
#the aggregation by mean will drop non-numerical columns
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
# condensing the demographics_and_accountability dataset
# we choose the school year of 20112012
demographics_and_accountability = data["demographics_and_accountability"]
demographics_and_accountability = demographics_and_accountability[demographics_and_accountability["schoolyear"] == 20112012]
data["demographics_and_accountability"] = demographics_and_accountability
#condensing the math_test_results dataset
# we choose the year of 2011, and Grade 8
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Year"] == 2011]
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Grade"] == '8']
data["math_test_results"].head()
# condensing the graduation_outcomes dataset
data["graduation_outcomes"] = data["graduation_outcomes"][data["graduation_outcomes"]["Cohort"] == "2006"]
data["graduation_outcomes"] = data["graduation_outcomes"][data["graduation_outcomes"]["Demographic"] == "Total Cohort"]
data["graduation_outcomes"].head()
Feature engineering¶
For the sat_results dataset, we add a new column, sat_score, to store the total of 'SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', and 'SAT Writing Avg. Score'.
cols = ['SAT Math Avg. Score',
'SAT Critical Reading Avg. Score',
'SAT Writing Avg. Score']
# convert the score columns from strings to numbers
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data["sat_results"]["sat_score"] = (data["sat_results"][cols[0]]+
data["sat_results"][cols[1]]+
data["sat_results"][cols[2]])
data["sat_results"]["sat_score"].head()
Now we find the coordinates of each school so that we can make maps. The coordinates can be parsed out of the "Location 1" column in the high_school_directory dataset, whose last line holds the latitude and longitude in parentheses.
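As a minimal sketch of the parsing (using a made-up address, not a real record):
loc = "123 Example Street\nNew York, NY 10001\n(40.7500, -73.9900)"  # hypothetical "Location 1" value
coords = loc.split("\n")[-1].replace("(", "").replace(")", "").split(", ")
print(coords)  # -> ['40.7500', '-73.9900']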
data["high_school_directory"]['latitude'] = data["high_school_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(", ")[0])
data["high_school_directory"]['longitude'] = data["high_school_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(", ")[1])
# convert latitude and longitude from strings to numbers
for c in ["latitude", "longitude"]:
    data["high_school_directory"][c] = pd.to_numeric(data["high_school_directory"][c], errors="coerce")
print(data["high_school_directory"]['latitude'].head())
print(data["high_school_directory"]['longitude'].head())
We have finished the preliminaries. Let's see what is in each dataset.
for k, v in data.items():
    print("\n" + k + "\n")
    print(v.head())
Combining all datasets together¶
From the display above, we see that each dataset has a "DBN" column. We can use this column to combine all the datasets into one.
Handle missing values during combining¶
When we join them, it's important to note that some of the datasets are missing high schools that exist in the sat_results dataset. To resolve this, we'll need to merge the datasets that have missing rows using the outer join strategy, so we don't lose data.
Having missing values after such merges is common; we will deal with them shortly.
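As a quick illustration with toy data (not taken from these datasets), an inner join drops schools that are missing from one side, while an outer join keeps them with NaNs that we can fill in later:
# toy example: one school appears only in the left frame
left = pd.DataFrame({"DBN": ["01M448", "01M450"], "sat_score": [1200, 1100]})
right = pd.DataFrame({"DBN": ["01M448"], "grad_rate": [0.85]})
print(left.merge(right, on="DBN", how="inner"))  # 01M450 is dropped
print(left.merge(right, on="DBN", how="outer"))  # 01M450 is kept, grad_rate is NaN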
flat_data_names = [k for k, v in data.items()]
flat_data = [data[k] for k in flat_data_names]
# flat_data[0] is our first dataset, ap_college_board
full = flat_data[0]
for i, f in enumerate(flat_data[1:]):
    name = flat_data_names[i + 1]
    print(name)
    # number of duplicate DBN values in this dataset
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "inner"
    if name in ["sat_results", "ap_college_board", "graduation_outcomes"]:
        join_type = "outer"
    if name not in ["ap_college_board"]:
        full = full.merge(f, on="DBN", how=join_type)
full.shape
full.head()
Filling in missing values¶
The full dataset contains almost all the information we need for analysis, but there are a few missing pieces. We may want to correlate the AP (Advanced Placement) exam results with SAT results, but first we need to convert those columns to numbers, then fill in the missing values.
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
# convert the AP columns from strings to numbers, then fill missing values with 0
for col in cols:
    full[col] = pd.to_numeric(full[col], errors="coerce")
full[cols] = full[cols].fillna(value=0)
Add a new column "school_dist" that indicates the school district of each school. It can be extracted from the first two characters of the "DBN" column.
full["school_dist"] = full["DBN"].apply(lambda x: x[:2])
Finally, fill any missing values with the mean of the column.
full = full.fillna(full.mean())
full.head()
Computing variable correlations¶
A good way to explore a dataset and see which columns are related to the one you care about is to compute correlations. The sat_score column is the one we care about, so let's see its correlations with the other columns.
full.corr()['sat_score']
With so many variables, it's hard to spot interesting patterns. Let's look at the 10 largest absolute correlations.
full.corr()['sat_score'].abs().sort_values(ascending = False).head(10)
Unsurprisingly, the columns most correlated with sat_score are the three subject scores themselves; this is expected, since sat_score is simply their sum.
We can use a pair plot to visualize these correlations.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
sns.set(style = "ticks", color_codes = True)
cols = ['sat_score', 'SAT Writing Avg. Score', 'SAT Critical Reading Avg. Score',
'SAT Math Avg. Score']
vis_corr = full[cols]
sns.pairplot(vis_corr)
plt.show()
Gender influence¶
Now let's look at how gender correlates with sat_score by examining the female percentage and male percentage at each school.
Interestingly, the female percentage is positively correlated with sat_score, whereas the male percentage is negatively correlated.
Perhaps female students are somehow better at studying than male students, but keep in mind this is only a school-level correlation.
print(full.corr()['sat_score'][['female_per', 'male_per']])
full.corr()['sat_score'][['female_per', 'male_per']].plot.bar()
Race influence¶
We have the percentage of students in four racial groups (white, black, Asian, Hispanic) at each school. Let's find out how they correlate with sat_score.
I won't editorialize about race here, just present the facts. Findings:
- The Asian and white percentages have positive correlations with sat_score, whereas the black and Hispanic percentages have negative correlations.
- The Asian percentage has the strongest positive correlation among the four groups.
- The Hispanic percentage has the strongest negative correlation among the four groups.
full.corr()['sat_score'][['white_per', 'black_per', 'asian_per', 'hispanic_per']]
full.corr()['sat_score'][['white_per', 'black_per', 'asian_per', 'hispanic_per']].plot.bar(color = 'b')
Survey response rate influence¶
Now let's see how the survey response rates of students, parents, and teachers relate to schools' SAT scores. (In the survey data, rr_s, rr_p, and rr_t are the student, parent, and teacher response rates.)
Findings:
- The student response rate has a strong positive correlation with SAT results; it may be that schools with more engaged students both get more survey responses and score higher.
- The parent response rate also has a positive correlation with SAT results.
- The teacher response rate has a negative correlation, which is interesting.
# rr_s: student response rate
# rr_p: parent response rate
# rr_t: teacher response rate
full.corr()['sat_score'][['rr_s', 'rr_p', 'rr_t']]
# visualize the correlations of the response rates with SAT scores
full.corr()['sat_score'][['rr_s', 'rr_p', 'rr_t']].plot(kind = 'bar', color = 'b')
Foreign language speaker differences¶
Here, "foreign language speakers" means students whose first language is not English; in other words, they are English language learners (the ell_percent column gives their share at each school).
print(full.corr()['sat_score']['ell_percent'])
full.plot.scatter(x='ell_percent', y='sat_score')
Now let's put each school on a map of NYC with folium, clustering nearby schools together.
import folium
from folium import plugins
schools_map = folium.Map(location=[40.7425, -73.9250], zoom_start=10)
marker_cluster = plugins.MarkerCluster().add_to(schools_map)
full["school_name"].fillna("Missing", inplace=True)
full["DBN"].fillna("Missing", inplace=True)
# add one marker per school, labeled with its DBN
for _, row in full.iterrows():
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup="{}".format(row['DBN'])).add_to(marker_cluster)
# schools_map.save('generated_maps/schools.html')
schools_map
Let's use a heatmap to better visualize the density of schools in the area.
schools_heatmap = folium.Map(location=[full['latitude'].mean(), full['longitude'].mean()], zoom_start=10)
schools_heatmap.add_child(plugins.HeatMap([[row["latitude"], row["longitude"]] for name, row in full.iterrows()]))
schools_heatmap.save("generated_maps/heatmap.html")
schools_heatmap
District level mapping¶
We can investigate the SAT results at the district level by:
- Grouping the full dataset by district
- Computing the average of each numeric column for each school district
- Converting the 'school_dist' field to remove leading 0s, so it matches the district values in our geographic data
district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
def show_district_map(col, df):
    geo_path = 'data/school_districts.geojson'
    districts = folium.Map(location=[full['latitude'].mean(), full['longitude'].mean()],
                           zoom_start=10)
    districts.choropleth(
        geo_data=geo_path,
        data=df,
        columns=['school_dist', col],
        key_on='feature.properties.school_dist',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
    )
    # folium.GeoJson(geo_path, name='geojson').add_to(districts)
    districts.save("generated_maps/{}.html".format(col))
    return districts
sat_districts = show_district_map("sat_score", district_data)
from IPython.display import IFrame
IFrame('generated_maps/sat_score.html', width=1000, height=400)
Now we can clearly see the distribution of SAT scores across NYC. Dark green areas on the map indicate high average scores, while areas closer to yellow indicate lower average scores.
# English language learners distribution
ell_districts = show_district_map("ell_percent", district_data)
IFrame('generated_maps/ell_percent.html', width=1000, height=400)