Explore the SAT test in NYC¶
The SAT, or Scholastic Aptitude Test, is a test that high schoolers take in the US before applying to college. Colleges take the test scores into account when making admissions decisions, so it's fairly important to do well on. The test is divided into 3 sections, each of which is scored out of 800 points. The total score is out of 2400 (although this has changed back and forth a few times, the scores in this dataset are out of 2400). High schools are often ranked by their average SAT scores, and high SAT scores are considered a sign of how good a school district is.
We combine several related datasets; their descriptions can be found in the README. Briefly, the datasets we use are as follows:
SAT results - contains average SAT scores on the three subjects by each school and the number of SAT takers.
High School Directory - contains information about each high school.
Math test results - math test results for each school.
Class size - class size of each high school.
AP test results - Advanced Placement exam results for each high school. Passing AP exams can get you college credit in the US.
Graduation outcomes - percentage of students who graduated and other outcome information.
Demographics and accountability - demographic information for each school.
School district maps - contains information on the layout of the school districts, so that we can map them out.
School survey - surveys of parents, teachers, and students at each school (scores).
Showcase:
- Data cleaning
- Data wrangling
- Data preprocessing
- Mapping and visualization
- Insights
Understanding the data¶
We first take a quick look at the data, i.e. what each file contains. We read in all the files and see what attributes are in each dataset.
import pandas as pd
import numpy as np
files = ['ap_college_board.csv', 'class_size.csv',
         'demographics_and_accountability.csv',
         'graduation_outcomes.csv', 'high_school_directory.csv',
         'math_test_results.csv', 'sat_results.csv']
data = {}
for f in files:
    d = pd.read_csv("data/{}".format(f))
    data[f.replace(".csv", "")] = d
for k, v in data.items():
    print("\n" + k + "\n")
    print(v.head(5))
Combine Files into a Single Dataset¶
A DBN column appears in most of the datasets; it is a unique code for each school in NYC. The problem is that two of the datasets, class_size and high_school_directory, don't have a DBN field. In the high_school_directory data it's just named dbn, so we can simply rename the column, or copy it over into a new column called DBN. In the class_size data, we'll need to take a different approach.
A DBN value in the other datasets looks like "01M448": the first two digits are the district number and the rest is the school code. For example:
data["ap_college_board"]["DBN"].head(3)
Although the class_size dataset does not contain a DBN column, it has columns that can be used to build one. Let's look at the columns in class_size. The first three columns are just what we need.
data["class_size"].head()
Now we have enough information to build the DBN column for class_size and to copy it over for high_school_directory. We add a new DBN column to these two datasets.
data["class_size"]["DBN"] = (data["class_size"].apply
(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis = 1))
data["high_school_directory"]["DBN"] = data["high_school_directory"]["dbn"]
Adding in the survey data¶
Student, parent, and teacher surveys about the quality of schools.
One of the most potentially interesting datasets to look at is the dataset on student, parent, and teacher surveys about the quality of schools. These surveys include information about the perceived safety of each school, academic standards, and more. Before we combine our datasets, let's add in the survey data.
In this case, we'll add the survey data into our data dictionary and then combine all the datasets afterwards. The survey data consists of 2 files, one for all schools and one for school district 75. We tag each row with a d75 flag so we know which file it came from, then concatenate the two.
survey1 = pd.read_excel("data/survey_all.xlsx", sheet_name = 'Sheet1')
survey2 = pd.read_excel("data/survey_d75.xlsx", sheet_name = 'Sheet1')
print(survey1.head())
print(survey2.head())
survey1["d75"] = False
survey2["d75"] = True
survey = pd.concat([survey1, survey2], axis = 0, sort = False)
print(survey1.shape)
print(survey2.shape)
print(survey.shape)
survey.head()
The survey data has too many columns, and most of them are extraneous. Guided by the survey's column dictionary, let's keep only the important fields and remove the extraneous columns.
import warnings
warnings.filterwarnings('ignore')
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_10", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11",]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
survey.shape
Condensing data, more preprocessing¶
Datasets such as class_size and demographics_and_accountability have multiple rows for each school, whereas datasets such as sat_results and high_school_directory have only one row per school. We need to find a way to combine these datasets. For example:
print(data["demographics_and_accountability"].head())
print(data["class_size"].head())
print(data["sat_results"].head())
print(data["high_school_directory"].head())
In order to combine these datasets into a single one, we need to make every dataset contain only one row per school; otherwise we cannot compare, say, SAT results to class size. We can accomplish this by first understanding the data better, then doing some aggregation. In the class_size dataset, GRADE and PROGRAM TYPE have multiple values for each school. By restricting each field to a single value, we can filter out most of the duplicate rows. In the code below, we:
- Only select values from class_size where the GRADE field is 09-12.
- Only select values from class_size where the PROGRAM TYPE field is GEN ED.
- Group the class_size dataset by DBN and take the average of each numeric column. Essentially, we find the average class size values for each school.
- Reset the index, so DBN is added back in as a column.
class_size = data["class_size"]
# we only keep the high school data, which is grade 9 to 12 with general education
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
#the aggregation by mean will drop non-numerical columns
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
# condensing the demographics_and_accountability dataset
# we choose the school year of 20112012
demographics_and_accountability = data["demographics_and_accountability"]
demographics_and_accountability = demographics_and_accountability[demographics_and_accountability["schoolyear"] == 20112012]
data["demographics_and_accountability"] = demographics_and_accountability
#condensing the math_test_results dataset
# we choose the year of 2011, and Grade 8
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Year"] == 2011]
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Grade"] == '8']
data["math_test_results"].head()
# condensing the graduation_outcomes dataset
data["graduation_outcomes"] = data["graduation_outcomes"][data["graduation_outcomes"]["Cohort"] == "2006"]
data["graduation_outcomes"] = data["graduation_outcomes"][data["graduation_outcomes"]["Demographic"] == "Total Cohort"]
data["graduation_outcomes"].head()
Feature engineering¶
For the sat_results dataset, we add a new column, sat_score, to store the total of 'SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', and 'SAT Writing Avg. Score'.
cols = ['SAT Math Avg. Score',
'SAT Critical Reading Avg. Score',
'SAT Writing Avg. Score']
# convert the score columns from strings to numbers
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data["sat_results"]["sat_score"] = (data["sat_results"][cols[0]]+
data["sat_results"][cols[1]]+
data["sat_results"][cols[2]])
data["sat_results"]["sat_score"].head()
Now we find the coordinates of each school so that we can make maps. The coordinates can be parsed out of the "Location 1" column in the high_school_directory dataset, whose last line holds the latitude and longitude in parentheses.
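As a minimal sketch of the parsing (using a made-up address, not a real record):
loc = "123 Example Street\nNew York, NY 10001\n(40.7500, -73.9900)"  # hypothetical "Location 1" value
coords = loc.split("\n")[-1].replace("(", "").replace(")", "").split(", ")
print(coords)  # -> ['40.7500', '-73.9900']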
data["high_school_directory"]['latitude'] = data["high_school_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(", ")[0])
data["high_school_directory"]['longitude'] = data["high_school_directory"]['Location 1'].apply(lambda x: x.split("\n")[-1].replace("(","").replace(")","").split(", ")[1])
# convert latitude and longitude from strings to numbers
for c in ["latitude", "longitude"]:
    data["high_school_directory"][c] = pd.to_numeric(data["high_school_directory"][c], errors="coerce")
print(data["high_school_directory"]['latitude'].head())
print(data["high_school_directory"]['longitude'].head())
We have finished the preliminaries. Let's see what is in each dataset.
for k, v in data.items():
    print("\n" + k + "\n")
    print(v.head())
Combining all datasets together¶
From the display above, we see that each dataset has a "DBN" column. We can use this column to combine all the datasets into one.
Handle missing values during combining¶
When we join them, it's important to note that some of the datasets are missing high schools that exist in the sat_results dataset. To resolve this, we'll need to merge the datasets that have missing rows using the outer join strategy, so we don't lose data.
Having missing values after such merges is common; we will deal with them shortly.
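As a quick illustration with toy data (not taken from these datasets), an inner join drops schools that are missing from one side, while an outer join keeps them with NaNs that we can fill in later:
# toy example: one school appears only in the left frame
left = pd.DataFrame({"DBN": ["01M448", "01M450"], "sat_score": [1200, 1100]})
right = pd.DataFrame({"DBN": ["01M448"], "grad_rate": [0.85]})
print(left.merge(right, on="DBN", how="inner"))  # 01M450 is dropped
print(left.merge(right, on="DBN", how="outer"))  # 01M450 is kept, grad_rate is NaN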
flat_data_names = [k for k, v in data.items()]
flat_data = [data[k] for k in flat_data_names]
# flat_data[0] is our first dataset, ap_college_board
full = flat_data[0]
for i, f in enumerate(flat_data[1:]):
    name = flat_data_names[i + 1]
    print(name)
    # number of duplicate DBN values in this dataset
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "inner"
    if name in ["sat_results", "ap_college_board", "graduation_outcomes"]:
        join_type = "outer"
    if name not in ["ap_college_board"]:
        full = full.merge(f, on="DBN", how=join_type)
full.shape
full.head()
Filling in missing values¶
The full dataset contains almost all the information we need for analysis, but there are a few missing pieces. We may want to correlate the AP (Advanced Placement) exam results with SAT results, but first we need to convert those columns to numbers, then fill in the missing values.
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
# convert the AP columns from strings to numbers, then fill missing values with 0
for col in cols:
    full[col] = pd.to_numeric(full[col], errors="coerce")
full[cols] = full[cols].fillna(value=0)
Add a new column "school_dist" that indicates the school district of each school. It can be extracted from the first two characters of the "DBN" column.
full["school_dist"] = full["DBN"].apply(lambda x: x[:2])
Finally, fill any missing values with the mean of the column.
full = full.fillna(full.mean())
full.head()
Computing variable correlations¶
A good way to explore a dataset and see which columns are related to the one you care about is to compute correlations. The sat_score column is the one we care about, so let's see its correlations with the other columns.
full.corr()['sat_score']
With so many variables, it's hard to spot interesting patterns. Let's look at the 10 largest absolute correlations.
full.corr()['sat_score'].abs().sort_values(ascending = False).head(10)
Unsurprisingly, the columns most correlated with sat_score are the three subject scores themselves; this is expected, since sat_score is simply their sum.
We can use a pair plot to visualize these correlations.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
sns.set(style = "ticks", color_codes = True)
cols = ['sat_score', 'SAT Writing Avg. Score', 'SAT Critical Reading Avg. Score',
'SAT Math Avg. Score']
vis_corr = full[cols]
sns.pairplot(vis_corr)
plt.show()
Gender influence¶
Now let's look at how gender correlates with sat_score by examining the female percentage and male percentage at each school.
Interestingly, the female percentage is positively correlated with sat_score, whereas the male percentage is negatively correlated.
Perhaps female students are somehow better at studying than male students, but keep in mind this is only a school-level correlation.
print(full.corr()['sat_score'][['female_per', 'male_per']])
full.corr()['sat_score'][['female_per', 'male_per']].plot.bar()
Race influence¶
We have the percentage of students in four racial groups (white, black, Asian, Hispanic) at each school. Let's find out how they correlate with sat_score.
I won't editorialize about race here, just present the facts. Findings:
- The Asian and white percentages have positive correlations with sat_score, whereas the black and Hispanic percentages have negative correlations.
- The Asian percentage has the strongest positive correlation among the four groups.
- The Hispanic percentage has the strongest negative correlation among the four groups.
full.corr()['sat_score'][['white_per', 'black_per', 'asian_per', 'hispanic_per']]
full.corr()['sat_score'][['white_per', 'black_per', 'asian_per', 'hispanic_per']].plot.bar(color = 'b')
Survey response rate influence¶
Now let's see how the survey response rates of students, parents, and teachers relate to schools' SAT scores. (In the survey data, rr_s, rr_p, and rr_t are the student, parent, and teacher response rates.)
Findings:
- The student response rate has a strong positive correlation with SAT results; it may be that schools with more engaged students both get more survey responses and score higher.
- The parent response rate also has a positive correlation with SAT results.
- The teacher response rate has a negative correlation, which is interesting.
# rr_s: student response rate
# rr_p: parent response rate
# rr_t: teacher response rate
full.corr()['sat_score'][['rr_s', 'rr_p', 'rr_t']]
# visualize the correlations of the response rates with SAT scores
full.corr()['sat_score'][['rr_s', 'rr_p', 'rr_t']].plot(kind = 'bar', color = 'b')
Foreign language speaker differences¶
Here, "foreign language speakers" means students whose first language is not English; in other words, they are English language learners (the ell_percent column gives their share at each school).
print(full.corr()['sat_score']['ell_percent'])
full.plot.scatter(x='ell_percent', y='sat_score')
Now let's put each school on a map of NYC with folium, clustering nearby schools together.
import folium
from folium import plugins
schools_map = folium.Map(location=[40.7425, -73.9250], zoom_start=10)
marker_cluster = plugins.MarkerCluster().add_to(schools_map)
full["school_name"].fillna("Missing", inplace=True)
full["DBN"].fillna("Missing", inplace=True)
# add one marker per school, labeled with its DBN
for _, row in full.iterrows():
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup="{}".format(row['DBN'])).add_to(marker_cluster)
# schools_map.save('generated_maps/schools.html')
schools_map
Let's use a heatmap to better visualize the density of schools in the area.
schools_heatmap = folium.Map(location=[full['latitude'].mean(), full['longitude'].mean()], zoom_start=10)
schools_heatmap.add_child(plugins.HeatMap([[row["latitude"], row["longitude"]] for name, row in full.iterrows()]))
schools_heatmap.save("generated_maps/heatmap.html")
schools_heatmap
District level mapping¶
We can investigate the SAT results at the district level by:
- Grouping the full dataset by district
- Computing the average of each numeric column for each school district
- Converting the 'school_dist' field to remove leading 0s, so it matches the district values in our geographic data
district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
def show_district_map(col, df):
    geo_path = 'data/school_districts.geojson'
    districts = folium.Map(location=[full['latitude'].mean(), full['longitude'].mean()],
                           zoom_start=10)
    districts.choropleth(
        geo_data=geo_path,
        data=df,
        columns=['school_dist', col],
        key_on='feature.properties.school_dist',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
    )
    # folium.GeoJson(geo_path, name='geojson').add_to(districts)
    districts.save("generated_maps/{}.html".format(col))
    return districts
sat_districts = show_district_map("sat_score", district_data)
from IPython.display import IFrame
IFrame('generated_maps/sat_score.html', width=1000, height=400)
Now we can clearly see the distribution of SAT scores across NYC. Dark green areas on the map indicate high average scores, while areas closer to yellow indicate lower average scores.
# English language learners distribution
ell_districts = show_district_map("ell_percent", district_data)
IFrame('generated_maps/ell_percent.html', width=1000, height=400)