Use Pandas in ArcGIS#

pandaslogo

Pandas is widely used in data science and data analysis, and is a popular choice for cleaning, transforming, and analyzing data in Python.

The name “pandas” is derived from “panel data”, which is an econometric term for data sets that include observations over multiple time periods for the same individuals.

  • pandas is built on top of NumPy

  • the “excel spreadsheet” in Python

  • Powerful indexing selection

  • easily integrated with other libraries, such as matplotlib, scikit-learn, and plotly.

Like NumPy, the convention of importing pandas is following. And as yet another routine, the two packages are usually imported together with numpy being the first.

import numpy as np
import pandas as pd

1. pandas.Series#

Series is the fundamental data structure of pandas. A pandas.Series object is essentially a 1-d array of indexed values.

Let’s again start by creating a pd.Series using a list.

my_list = list(range(10))
my_list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
pd.Series(my_list) # note that the S in Series is upper case
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

The above code converts a list to a Series. Note that the default index for Series starts with index 0.

But pd.Series can have non-numeric and non-sequential indices, whereas a np.ndarray is only indexed by whole numbers.

Let’s customize the index of this Series.

import string
string.ascii_lowercase[:10]
'abcdefghij'
labels = list(string.ascii_lowercase[:10])
labels
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
pd.Series(my_list, index=labels)
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64
my_sr = pd.Series(my_list, labels)

We can use values to obtain the values as a 1-d np.ndarray.

my_sr.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

Likewise, we can also retrieve a Series’s index.

my_sr.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

1.1 Indexing and slicing for Series#

  • indexing by label

  • indexing by location

my_sr['c']   # indexing by label
2
my_sr[2]   # indexing by position
2
my_sr[2:8]
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64

1.2 Operations on Series#

Similar to np.ndarray, operations on pd.Series applies to every single element in it.

Let’s create pd.Series to represent the following tables.

Top 3 U.S. State by Population

State

Population

CA

39512223

TA

28995881

FL

21477737

Top 3 U.S. State by Area

State

Area(sq mi)

AL

665384

TA

268596

CA

163694

Let’s use split(), a string manipulation function, to split the string by whitespace, which will result in a list.

"CA TA FL".split(" ")
['CA', 'TA', 'FL']
"CA TA FL".split()  # split() acts on space by default
['CA', 'TA', 'FL']
sr1 = pd.Series(data=[39512223, 28995881, 21477737],
                index="CA TA FL".split())
sr1
CA    39512223
TA    28995881
FL    21477737
dtype: int64

What about population in millions?

round(sr1 / 1000000, 2)
CA    39.51
TA    29.00
FL    21.48
dtype: float64

Similarly, we can split a string by “comma” (or any other characters).

"AL,TA,CA".split(",")
['AL', 'TA', 'CA']
sr2 = pd.Series([665384, 268596, 163694], "AL,TA,CA".split(","))
sr2
AL    665384
TA    268596
CA    163694
dtype: int64
Pandas perform operations based on indicies. In other word, only values of the same
index can be used together to perform some operation.
sr1 / sr2
AL           NaN
CA    241.378566
FL           NaN
TA    107.953510
dtype: float64

Note that only California and Texas have values returned because their indices exist in both Series. On the contrary, Alaska doesn’t appear in sr1 and Florida isn’t in sr2, the division for those returned NaN meaning “not a number.”

2. DataFrame object#

DataFrame: Two-dimensional (rows & columns), size-mutable, potentially heterogeneous tabular data.

DataFrame consists of one or more Series that shares the same labels (indicies).

It is the DataFrame structure makes pandas a powerful tool for scientific data analysis.

2.1 Create DataFrame from Series#

We can convert a pd.Series to a pd.DataFrame of a single column. And note that the column names must be supplied as a list.

pd.DataFrame(sr1, columns=['population'])
population
CA 39512223
TA 28995881
FL 21477737

concat function: series are supplied as a list

pd.concat([sr1, sr2], axis=1) # axis=1 means on columns
0 1
CA 39512223.0 163694.0
TA 28995881.0 268596.0
FL 21477737.0 NaN
AL NaN 665384.0
pd.concat([sr1, sr2]) # by default, axis=0, meaning on rows
CA    39512223
TA    28995881
FL    21477737
AL      665384
TA      268596
CA      163694
dtype: int64

concat function: series are supplied as a dictionary

pd.concat({'population': sr1, 'area (sq mi)': sr2}, axis=1)
population area (sq mi)
CA 39512223.0 163694.0
TA 28995881.0 268596.0
FL 21477737.0 NaN
AL NaN 665384.0

2.2 Convert NumPy’s Structured Array to pd.DataFrame#

Recall how we convert a feature class to a structured ndarray

import arcpy

gdb_worksp = r"..\data\class_data.gdb"
arcpy.env.workspace = gdb_worksp
school_fc = "schools"

school_arr = arcpy.da.FeatureClassToNumPyArray(school_fc, '*') # use '*' to get all fields
school_arr.dtype.names
('OBJECTID_1',
 'Shape',
 'OBJECTID',
 'STATUS',
 'SCORE',
 'SIDE',
 'MATCH_ADDR',
 'FEDERAL_ID',
 'STATE_ID',
 'SCHOOL_ID',
 'NAME',
 'ADDRESS',
 'CITY',
 'ZIPCODE',
 'PHONE',
 'COUNTY',
 'OPERATING',
 'OP_CLASS',
 'ENROLLMENT',
 'PROGRAMS',
 'COMMON_USE',
 'USE',
 'TYPE',
 'ACTIVITY',
 'GRADES',
 'LOW_GRADE',
 'HIGH_GRADE',
 'PRINCIPAL',
 'TEACHERS',
 'STDTCH_RT',
 'MIGRNT_STD',
 'TITLE1SCHO',
 'MAGNETINFO',
 'FREE_LUNCH',
 'REDUCED_LU',
 'FISH_FAC1',
 'FISH_FAC2',
 'COMMENTS',
 'BBSERVICE',
 'BBPROVIDER',
 'BBSPEED',
 'DSTREAMSPD',
 'YR_BUILT',
 'PARCEL_ID',
 'LAT_DD',
 'LONG_DD',
 'USNG_FL_1K',
 'FDOE_MSID',
 'NCES_PUB',
 'NCES_PRIV',
 'FDOE_PRV',
 'SOURCE',
 'DESCRIPT',
 'FLAG',
 'UPDATE_DAY',
 'FGDLAQDATE',
 'AUTOID')
columns = ["NAME", 'OP_CLASS', 'ENROLLMENT', 'TYPE', 'TEACHERS']
school_arr = arcpy.da.FeatureClassToNumPyArray(school_fc, columns)

pd.DataFrame(school_arr, columns=columns)
NAME OP_CLASS ENROLLMENT TYPE TEACHERS
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
1 FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
2 FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
3 VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
4 BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
... ... ... ... ... ...
107 SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... PUBLIC 0.0 COLLEGE/UNIVERSITY 0.0
108 SANTA FE COLLEGE - KIRKPATRICK CENTER PUBLIC 0.0 COLLEGE/UNIVERSITY 0.0
109 ALACHUA COUNTY JAIL PUBLIC 0.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
110 ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... PUBLIC 0.0 MIGRANT EDUCATION PROGRAM 4.0
111 OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

112 rows × 5 columns

2.3 Preview a DataFrame#

The head and tail functions will allow us to preview the top and bottom of a DataFrame, respectively.

school_df = pd.DataFrame(school_arr, columns=columns)
school_df.head()
NAME OP_CLASS ENROLLMENT TYPE TEACHERS
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
1 FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
2 FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
3 VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
4 BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0

And, by default (without giving an argument), both functions return 5 rows.

school_df.tail()
NAME OP_CLASS ENROLLMENT TYPE TEACHERS
107 SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... PUBLIC 0.0 COLLEGE/UNIVERSITY 0.0
108 SANTA FE COLLEGE - KIRKPATRICK CENTER PUBLIC 0.0 COLLEGE/UNIVERSITY 0.0
109 ALACHUA COUNTY JAIL PUBLIC 0.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
110 ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... PUBLIC 0.0 MIGRANT EDUCATION PROGRAM 4.0
111 OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

But, we can specify the number we want to preview.

school_df.head(10)
NAME OP_CLASS ENROLLMENT TYPE TEACHERS
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
1 FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
2 FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
3 VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
4 BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
5 DESTINY CHRISTIAN ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
6 INCAF MONTESSORI SCHOOL PRIVATE 0.0 ELEMENTARY 0.0
7 GREAT AMERICAN VISIONS ENTERPRISES,INC PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
8 JORDAN GLEN SCHOOL INC. PRIVATE 115.0 COMBINATION ELEMENTARY & MIDDLE 14.5
9 QUEEN OF PEACE CATHOLIC ACADEMY PRIVATE 358.0 COMBINATION ELEMENTARY & MIDDLE 28.4

2.4 Indexing and Slicing for DataFrame#

Select single column returns a pd.Series.

school_df["NAME"]
0                  GRACE CHRISTIAN SCHOOL OF ALACHUA CO.
1                                    FAMILY LIFE ACADEMY
2                         FOREST GROVE CHRISTIAN ACADEMY
3                            VAISHNAVA ACADEMY FOR GIRLS
4                                  BHAKTIVEDANTA ACADEMY
                             ...                        
107    SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C...
108                SANTA FE COLLEGE - KIRKPATRICK CENTER
109                                  ALACHUA COUNTY JAIL
110    ALACHUA  COUNTY STUDENT SERVICES/ MIGRANT/ VIR...
111                                OAK HALL LOWER SCHOOL
Name: NAME, Length: 112, dtype: object

Select multiple columns return a pandas DataFrame

school_df[["NAME", "ENROLLMENT"]] # multiple columns supplied as a list
NAME ENROLLMENT
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO. 0.0
1 FAMILY LIFE ACADEMY 0.0
2 FOREST GROVE CHRISTIAN ACADEMY 53.0
3 VAISHNAVA ACADEMY FOR GIRLS 19.0
4 BHAKTIVEDANTA ACADEMY 0.0
... ... ...
107 SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... 0.0
108 SANTA FE COLLEGE - KIRKPATRICK CENTER 0.0
109 ALACHUA COUNTY JAIL 0.0
110 ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... 0.0
111 OAK HALL LOWER SCHOOL 365.0

112 rows × 2 columns

Select row using the .loc[] function (label/index-based). Note that .loc is followed by square brackets not parentheses.

school_df.loc[2]
NAME              FOREST GROVE CHRISTIAN ACADEMY
OP_CLASS                                 PRIVATE
ENROLLMENT                                  53.0
TYPE          COMBINATION ELEMENTARY & SECONDARY
TEACHERS                                     9.4
Name: 2, dtype: object

Select row using the .iloc[] function (position/order-based)

school_df.iloc[2]
NAME              FOREST GROVE CHRISTIAN ACADEMY
OP_CLASS                                 PRIVATE
ENROLLMENT                                  53.0
TYPE          COMBINATION ELEMENTARY & SECONDARY
TEACHERS                                     9.4
Name: 2, dtype: object

Use set_index() to choose a column as index.

  • inplace=True

  • reassign the result of set_index()

school_df.set_index('NAME').head()
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
school_df.head()
NAME OP_CLASS ENROLLMENT TYPE TEACHERS
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
1 FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
2 FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
3 VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
4 BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
school_df.set_index('NAME', inplace=True) # or
# school_df = school_df.set_index('NAME')
school_df.head()
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. PRIVATE 0.0 SENIOR HIGH 0.0
FAMILY LIFE ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
BHAKTIVEDANTA ACADEMY PRIVATE 0.0 COMBINATION ELEMENTARY & MIDDLE 0.0
school_df.loc['FOREST GROVE CHRISTIAN ACADEMY']
OP_CLASS                                 PRIVATE
ENROLLMENT                                  53.0
TYPE          COMBINATION ELEMENTARY & SECONDARY
TEACHERS                                     9.4
Name: FOREST GROVE CHRISTIAN ACADEMY, dtype: object
school_df.loc[:, 'ENROLLMENT']
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO.                                                                 0.0
FAMILY LIFE ACADEMY                                                                                   0.0
FOREST GROVE CHRISTIAN ACADEMY                                                                       53.0
VAISHNAVA ACADEMY FOR GIRLS                                                                          19.0
BHAKTIVEDANTA ACADEMY                                                                                 0.0
                                                                                                    ...  
SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER                                                   0.0
SANTA FE COLLEGE - KIRKPATRICK CENTER                                                                 0.0
ALACHUA COUNTY JAIL                                                                                   0.0
ALACHUA  COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM      0.0
OAK HALL LOWER SCHOOL                                                                               365.0
Name: ENROLLMENT, Length: 112, dtype: float64

.iloc for both row indices and column indices.

school_df.iloc[10:20, 1:4]
ENROLLMENT TYPE TEACHERS
NAME
THE ROCK SCHOOL 207.0 COMBINATION ELEMENTARY & SECONDARY 17.9
CHRISTIAN LIFE ACADEMY 52.0 COMBINATION ELEMENTARY & SECONDARY 5.9
SAINT FRANCIS CATHOLIC HIGH SCHOOL 260.0 SENIOR HIGH 19.8
COUNTRYSIDE CHRISTIAN SCHOOL 106.0 COMBINATION ELEMENTARY & SECONDARY 6.5
TRILOGY SCHOOL OF LEARNING ALTERNATIVE 84.0 COMBINATION ELEMENTARY & SECONDARY 11.5
MILLHOPPER MONTESSORI SCHOOL 209.0 COMBINATION ELEMENTARY & MIDDLE 19.1
BNAI ISRAEL DAY SCHOOL 22.0 ELEMENTARY 0.0
GAINESVILLE CONDUCTIVE EDUCATION ACADEMY 0.0 COMBINATION ELEMENTARY & SECONDARY 0.0
BRENTWOOD SCHOOL 246.0 ELEMENTARY 10.8
CORNERSTONE ACADEMY 216.0 COMBINATION ELEMENTARY & SECONDARY 24.8

Similar to indexing on a 2-d NumPy array, we can access individual values of DataFrame based on row index and column index. Below is an example of finding out The Rock School’s (row) size of Enrollment (column).

school_df.loc['THE ROCK SCHOOL', 'ENROLLMENT']
207.0

3. Selection Query on DataFrame#

A single condition returns a Series of booleans again with the same index

school_df["ENROLLMENT"] > 0
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO.                                                               False
FAMILY LIFE ACADEMY                                                                                 False
FOREST GROVE CHRISTIAN ACADEMY                                                                       True
VAISHNAVA ACADEMY FOR GIRLS                                                                          True
BHAKTIVEDANTA ACADEMY                                                                               False
                                                                                                    ...  
SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER                                                 False
SANTA FE COLLEGE - KIRKPATRICK CENTER                                                               False
ALACHUA COUNTY JAIL                                                                                 False
ALACHUA  COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM    False
OAK HALL LOWER SCHOOL                                                                                True
Name: ENROLLMENT, Length: 112, dtype: bool
school_df.ENROLLMENT.head()
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO.     0.0
FAMILY LIFE ACADEMY                       0.0
FOREST GROVE CHRISTIAN ACADEMY           53.0
VAISHNAVA ACADEMY FOR GIRLS              19.0
BHAKTIVEDANTA ACADEMY                     0.0
Name: ENROLLMENT, dtype: float64
school_df.ENROLLMENT.index.name == school_df.index.name
True
school_df.index.values
array(['GRACE CHRISTIAN SCHOOL OF ALACHUA CO.', 'FAMILY LIFE ACADEMY',
       'FOREST GROVE CHRISTIAN ACADEMY', 'VAISHNAVA ACADEMY FOR GIRLS',
       'BHAKTIVEDANTA ACADEMY', 'DESTINY CHRISTIAN ACADEMY',
       'INCAF MONTESSORI SCHOOL',
       'GREAT AMERICAN VISIONS ENTERPRISES,INC',
       'JORDAN GLEN SCHOOL INC.', 'QUEEN OF PEACE CATHOLIC ACADEMY',
       'THE ROCK SCHOOL', 'CHRISTIAN LIFE ACADEMY',
       'SAINT FRANCIS CATHOLIC HIGH SCHOOL',
       'COUNTRYSIDE CHRISTIAN SCHOOL',
       'TRILOGY SCHOOL OF LEARNING ALTERNATIVE',
       'MILLHOPPER MONTESSORI SCHOOL', 'BNAI ISRAEL DAY SCHOOL',
       'GAINESVILLE CONDUCTIVE EDUCATION ACADEMY', 'BRENTWOOD SCHOOL',
       'CORNERSTONE ACADEMY', 'WESTWOOD HILLS CHRISTIAN SCHOOL',
       'FLOWERS MONTESSORI SCHOOL', 'Z.L. SUNG S.D.A. SCHOOL',
       'GAINESVILLE CONDUCTIVE EDUCATION ACADEMY',
       'KIDS N ALL CHRISTIAN ACADEMY', 'OAK HALL SCHOOL',
       'FREEDOM CHRISTIAN ACADEMY', 'GAINESVILLE COUNTRY DAY SCHOOL',
       'FAITH TABERNACLE OF PRAISE SCHOOL OF MINISTRY',
       "THE CHILDREN'S CENTER", 'CITY COLLEGE',
       'SAINT PATRICK INTERPARISH SCHOOL',
       'STAR CHRISTIAN CENTER AND ACADEMY',
       'COMPASSIONATE OUTREACH MINISTRIES', 'WINDSOR CHRISTIAN ACADEMY',
       'OAK HILL COMMUNITY PRIVATE SCHOOL SYSTEM',
       'NORTH FLORIDA SDA ELEMENTARY',
       'MICANOPY AREA COOPERATIVE SCHOOL, INC.',
       'ARCHER COMMUNITY SCHOOL', 'SANTA FE COLLEGE - DAVIS CENTER',
       'MICANOPY MIDDLE SCHOOL, INC.', 'OAK VIEW MIDDLE SCHOOL',
       'NEWBERRY ELEMENTARY SCHOOL', 'NEWBERRY HIGH SCHOOL',
       'HIGH SPRINGS COMMUNITY - ELEMENTARY AND MIDDLE SCHOOL',
       'UNIVERSITY OF FLORIDA - COMPARITIVE MEDICINE',
       'NORTH AMERICAN FAMILY INSTITUTE ALACHUA ACADEMY', 'PACE',
       'HOSPITAL HOMEBOUND', 'SWEETWATER BRANCH ACADEMY',
       'PROFESSIONAL ACADEMY MAGNET AT LOFTEN HIGH SCHOOL',
       'SANTA FE HIGH SCHOOL', 'MEBANE MIDDLE SCHOOL',
       'ALACHUA LEARNING CENTER - PUBLIC CHARTER SCHOOL',
       'IRBY ELEMENTARY SCHOOL', 'ALACHUA ELEMENTARY SCHOOL',
       'UNIVERSITY OF FLORIDA', 'UNIVERSITY OF FLORIDA - AGRONOMY LAB',
       'SANTA FE COLLEGE - NORTHWEST CAMPUS', 'BUCHHOLZ HIGH SCHOOL',
       'FINLEY ELEMENTARY SCHOOL',
       'P.K. YONGE DEVELOPMENTAL RESEARCH SCHOOL',
       'CHILES ELEMENTARY SCHOOL', 'FT CLARKE MIDDLE SCHOOL',
       'HIDDEN OAK ELEMENTARY SCHOOL', 'TERWILLIGER ELEMENTARY SCHOOL',
       'TALBOT ELEMENTARY SCHOOL', 'NORTON ELEMENTARY SCHOOL',
       'LITTLEWOOD ELEMENTARY SCHOOL', 'WESTWOOD MIDDLE SCHOOL',
       'HEALTHY LEARNING ACADEMY CHARTER SCHOOL',
       'GLEN SPRINGS ELEMENTARY SCHOOL', 'GAINESVILLE HIGH SCHOOL',
       'FOSTER ELEMENTARY SCHOOL', 'GENESIS PREPARATORY SCHOOL',
       'LANIER CENTER & ANCHOR CENTER', 'UNIVERSITY OF FLORIDA',
       'A. QUINN JONES CENTER', 'CHARACTER COUNTS CENTER',
       'SANTA FE COMMUNITY COLLEGE', 'KANAPAHA MIDDLE SCHOOL',
       'WILES ELEMENTARY SCHOOL',
       'FAMILY PROGRESS CENTER - WILES HEAD START',
       'EINSTEIN MONTESSORI SCHOOL',
       'IDYLWILD ELEMENTARY SCHOOL AND MULTI-COUNTY MIGRANT PROGRAM',
       'EXPRESSIONS LEARNING ARTS ACADEMY',
       'GAINESVILLE WILDERNESS INSTITUTE',
       'CARING & SHARING LEARNING SCHOOL',
       'PRAIRIE VIEW ELEMENTARY SCHOOL', 'LINCOLN MIDDLE SCHOOL',
       'WILLIAMS ELEMENTARY SCHOOL',
       'ALACHUA COUNTY SUPERINTENDENT OFFICE - KIRBY-SMITH CENTER',
       'DUVAL ELEMENTARY SCHOOL', 'BISHOP MIDDLE SCHOOL',
       'METCALFE ELEMENTARY SCHOOL', 'THE ONE ROOM SCHOOL HOUSE PROJECT',
       'RAWLINGS ELEMENTARY SCHOOL',
       'FEARNSIDE FAMILY SERVICES CENTER AND HEAD START / PREK E I CENTER',
       'HOGGETOWNE MIDDLE SCHOOL', 'HORIZON CENTER. ALTERNATIVE SCHOOL',
       'LAKE FOREST ELEMENTARY SCHOOL', 'EASTSIDE HIGH SCHOOL',
       'FLORIDA SIATECH AT GAINESVILLE,INC.', 'SHELL ELEMENTARY SCHOOL',
       'HAWTHORNE JR/SR HIGH SCHOOL', 'WALDO COMMUNITY SCHOOL',
       'UNIVERSITY OF FLORIDA',
       'SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER',
       'SANTA FE COLLEGE - KIRKPATRICK CENTER', 'ALACHUA COUNTY JAIL',
       'ALACHUA  COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM',
       'OAK HALL LOWER SCHOOL'], dtype=object)

Similar to NumPy, we can use boolean indexing for Pandas as well, in which we create a condition (return Booleans in a Series) and then use it to filter the original DataFrame.

school_df[school_df["ENROLLMENT"] > 0] # without loc
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
JORDAN GLEN SCHOOL INC. PRIVATE 115.0 COMBINATION ELEMENTARY & MIDDLE 14.5
QUEEN OF PEACE CATHOLIC ACADEMY PRIVATE 358.0 COMBINATION ELEMENTARY & MIDDLE 28.4
THE ROCK SCHOOL PRIVATE 207.0 COMBINATION ELEMENTARY & SECONDARY 17.9
... ... ... ... ...
FLORIDA SIATECH AT GAINESVILLE,INC. PUBLIC 202.0 SENIOR HIGH 5.0
SHELL ELEMENTARY SCHOOL PUBLIC 193.0 ELEMENTARY 16.0
HAWTHORNE JR/SR HIGH SCHOOL PUBLIC 395.0 COMBINATION JR. HIGH & SENIOR HIGH 23.0
WALDO COMMUNITY SCHOOL PUBLIC 215.0 ELEMENTARY 17.0
OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

84 rows × 4 columns

school_df.loc[school_df["ENROLLMENT"] > 0] # with loc (recommended)
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
JORDAN GLEN SCHOOL INC. PRIVATE 115.0 COMBINATION ELEMENTARY & MIDDLE 14.5
QUEEN OF PEACE CATHOLIC ACADEMY PRIVATE 358.0 COMBINATION ELEMENTARY & MIDDLE 28.4
THE ROCK SCHOOL PRIVATE 207.0 COMBINATION ELEMENTARY & SECONDARY 17.9
... ... ... ... ...
FLORIDA SIATECH AT GAINESVILLE,INC. PUBLIC 202.0 SENIOR HIGH 5.0
SHELL ELEMENTARY SCHOOL PUBLIC 193.0 ELEMENTARY 16.0
HAWTHORNE JR/SR HIGH SCHOOL PUBLIC 395.0 COMBINATION JR. HIGH & SENIOR HIGH 23.0
WALDO COMMUNITY SCHOOL PUBLIC 215.0 ELEMENTARY 17.0
OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

84 rows × 4 columns

For multiple conditions, use parentheses to enclose each condition and then connect multiple conditions by either one of the following symbols.

  • AND operation: &

  • OR operation: |

Let’s find out schools with both positive enrollment and positive number of teachers.

school_df[(school_df["ENROLLMENT"] > 0) & (school_df["TEACHERS"] > 0)]
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
JORDAN GLEN SCHOOL INC. PRIVATE 115.0 COMBINATION ELEMENTARY & MIDDLE 14.5
QUEEN OF PEACE CATHOLIC ACADEMY PRIVATE 358.0 COMBINATION ELEMENTARY & MIDDLE 28.4
THE ROCK SCHOOL PRIVATE 207.0 COMBINATION ELEMENTARY & SECONDARY 17.9
CHRISTIAN LIFE ACADEMY PRIVATE 52.0 COMBINATION ELEMENTARY & SECONDARY 5.9
... ... ... ... ...
FLORIDA SIATECH AT GAINESVILLE,INC. PUBLIC 202.0 SENIOR HIGH 5.0
SHELL ELEMENTARY SCHOOL PUBLIC 193.0 ELEMENTARY 16.0
HAWTHORNE JR/SR HIGH SCHOOL PUBLIC 395.0 COMBINATION JR. HIGH & SENIOR HIGH 23.0
WALDO COMMUNITY SCHOOL PUBLIC 215.0 ELEMENTARY 17.0
OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

76 rows × 4 columns

school_df[(school_df["ENROLLMENT"] > 0) | (school_df["TEACHERS"] > 0)]
OP_CLASS ENROLLMENT TYPE TEACHERS
NAME
FOREST GROVE CHRISTIAN ACADEMY PRIVATE 53.0 COMBINATION ELEMENTARY & SECONDARY 9.4
VAISHNAVA ACADEMY FOR GIRLS PRIVATE 19.0 COMBINATION JR. HIGH & SENIOR HIGH 0.0
JORDAN GLEN SCHOOL INC. PRIVATE 115.0 COMBINATION ELEMENTARY & MIDDLE 14.5
QUEEN OF PEACE CATHOLIC ACADEMY PRIVATE 358.0 COMBINATION ELEMENTARY & MIDDLE 28.4
THE ROCK SCHOOL PRIVATE 207.0 COMBINATION ELEMENTARY & SECONDARY 17.9
... ... ... ... ...
SHELL ELEMENTARY SCHOOL PUBLIC 193.0 ELEMENTARY 16.0
HAWTHORNE JR/SR HIGH SCHOOL PUBLIC 395.0 COMBINATION JR. HIGH & SENIOR HIGH 23.0
WALDO COMMUNITY SCHOOL PUBLIC 215.0 ELEMENTARY 17.0
ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM PUBLIC 0.0 MIGRANT EDUCATION PROGRAM 4.0
OAK HALL LOWER SCHOOL PRIVATE 365.0 ELEMENTARY 16.0

86 rows × 4 columns

4. groupby() function#

groupby is a useful function to aggregate data.

A groupby operation involves some combination of

  • splitting the object,

  • applying a function, and

  • combining the results.

This can be used to group large amounts of data and compute operations on these groups to understand subgroups in the data.

school_df = school_df.loc[
    (school_df["ENROLLMENT"] > 0) & (school_df["TEACHERS"] > 0),
    ['TYPE','ENROLLMENT', 'TEACHERS']
]
school_df.head()
TYPE ENROLLMENT TEACHERS
NAME
FOREST GROVE CHRISTIAN ACADEMY COMBINATION ELEMENTARY & SECONDARY 53.0 9.4
JORDAN GLEN SCHOOL INC. COMBINATION ELEMENTARY & MIDDLE 115.0 14.5
QUEEN OF PEACE CATHOLIC ACADEMY COMBINATION ELEMENTARY & MIDDLE 358.0 28.4
THE ROCK SCHOOL COMBINATION ELEMENTARY & SECONDARY 207.0 17.9
CHRISTIAN LIFE ACADEMY COMBINATION ELEMENTARY & SECONDARY 52.0 5.9

Without the aggregation function, the return is simply a groupby object.

school_df.groupby("TYPE")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000248CBB228B0>
by_type = school_df.groupby("TYPE")
by_type.sum()
ENROLLMENT TEACHERS
TYPE
COMBINATION ELEMENTARY & MIDDLE 2664.0 176.0
COMBINATION ELEMENTARY & SECONDARY 3094.0 323.3
COMBINATION JR. HIGH & SENIOR HIGH 768.0 63.0
ELEMENTARY 13231.0 961.3
MIDDLE/JR. HIGH 4844.0 286.0
PRE-KINDERGARTEN 163.0 20.0
PRE-KINDERGARTEN-KINDERGARTEN 45.0 6.1
SENIOR HIGH 7885.0 390.8
by_type.max()
ENROLLMENT TEACHERS
TYPE
COMBINATION ELEMENTARY & MIDDLE 938.0 54.0
COMBINATION ELEMENTARY & SECONDARY 1139.0 122.0
COMBINATION JR. HIGH & SENIOR HIGH 395.0 23.0
ELEMENTARY 836.0 56.0
MIDDLE/JR. HIGH 1047.0 62.0
PRE-KINDERGARTEN 100.0 15.0
PRE-KINDERGARTEN-KINDERGARTEN 45.0 6.1
SENIOR HIGH 2221.0 100.0
school_df.groupby("TYPE").mean()
ENROLLMENT TEACHERS
TYPE
COMBINATION ELEMENTARY & MIDDLE 333.000000 22.000000
COMBINATION ELEMENTARY & SECONDARY 238.000000 24.869231
COMBINATION JR. HIGH & SENIOR HIGH 192.000000 15.750000
ELEMENTARY 400.939394 29.130303
MIDDLE/JR. HIGH 605.500000 35.750000
PRE-KINDERGARTEN 81.500000 10.000000
PRE-KINDERGARTEN-KINDERGARTEN 45.000000 6.100000
SENIOR HIGH 1126.428571 55.828571

agg stands for “aggregation” allows us to specify different aggregation functions for different columns.

by_type.agg({'ENROLLMENT': np.sum, 'TEACHERS': np.mean})
ENROLLMENT TEACHERS
TYPE
COMBINATION ELEMENTARY & MIDDLE 2664.0 22.000000
COMBINATION ELEMENTARY & SECONDARY 3094.0 24.869231
COMBINATION JR. HIGH & SENIOR HIGH 768.0 15.750000
ELEMENTARY 13231.0 29.130303
MIDDLE/JR. HIGH 4844.0 35.750000
PRE-KINDERGARTEN 163.0 10.000000
PRE-KINDERGARTEN-KINDERGARTEN 45.0 6.100000
SENIOR HIGH 7885.0 55.828571