Use Pandas in ArcGIS
Contents
Use Pandas in ArcGIS#
Pandas is widely used in data science and data analysis, and is a popular choice for cleaning, transforming, and analyzing data in Python.
The name “pandas” is derived from “panel data”, which is an econometric term for data sets that include observations over multiple time periods for the same individuals.
pandas is built on top of NumPy
the “excel spreadsheet” in Python
Powerful indexing selection
easily integrated with other libraries, such as matplotlib, scikit-learn, and plotly.
Like NumPy, the convention of importing pandas
is following. And as yet another
routine, the two packages are usually imported together with numpy
being the first.
import numpy as np
import pandas as pd
1. pandas.Series
#
Series is the fundamental
data structure of pandas.
A pandas.Series
object is essentially a 1-d array of indexed values.
Let’s again start by creating a pd.Series
using a list.
my_list = list(range(10))
my_list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
pd.Series(my_list) # note that the S in Series is upper case
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
The above code converts a list to a Series. Note that the default index for
Series starts with index 0
.
But pd.Series
can have non-numeric and non-sequential indices, whereas a
np.ndarray
is only indexed by whole numbers.
Let’s customize the index of this Series.
import string
string.ascii_lowercase[:10]
'abcdefghij'
labels = list(string.ascii_lowercase[:10])
labels
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
pd.Series(my_list, index=labels)
a 0
b 1
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int64
my_sr = pd.Series(my_list, labels)
We can use values
to obtain the values as a 1-d np.ndarray
.
my_sr.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)
Likewise, we can also retrieve a Series’s index.
my_sr.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
1.1 Indexing and slicing for Series#
indexing by label
indexing by location
my_sr['c'] # indexing by label
2
my_sr[2] # indexing by position
2
my_sr[2:8]
c 2
d 3
e 4
f 5
g 6
h 7
dtype: int64
1.2 Operations on Series#
Similar to np.ndarray
, operations on pd.Series
applies to every single element in it.
Let’s create pd.Series
to represent the following tables.
Top 3 U.S. State by Population
State |
Population |
---|---|
CA |
39512223 |
TA |
28995881 |
FL |
21477737 |
Top 3 U.S. State by Area
State |
Area(sq mi) |
---|---|
AL |
665384 |
TA |
268596 |
CA |
163694 |
Let’s use split()
, a string manipulation function, to split the string
by whitespace, which will result in a list
.
"CA TA FL".split(" ")
['CA', 'TA', 'FL']
"CA TA FL".split() # split() acts on space by default
['CA', 'TA', 'FL']
sr1 = pd.Series(data=[39512223, 28995881, 21477737],
index="CA TA FL".split())
sr1
CA 39512223
TA 28995881
FL 21477737
dtype: int64
What about population in millions?
round(sr1 / 1000000, 2)
CA 39.51
TA 29.00
FL 21.48
dtype: float64
Similarly, we can split a string by “comma” (or any other characters).
"AL,TA,CA".split(",")
['AL', 'TA', 'CA']
sr2 = pd.Series([665384, 268596, 163694], "AL,TA,CA".split(","))
sr2
AL 665384
TA 268596
CA 163694
dtype: int64
Pandas perform operations based on indicies. In other word, only values of the same
index can be used together to perform some operation.
sr1 / sr2
AL NaN
CA 241.378566
FL NaN
TA 107.953510
dtype: float64
Note that only California and Texas have values returned because their indices exist
in both Series. On the contrary, Alaska doesn’t appear in sr1
and Florida isn’t in sr2
,
the division for those returned NaN
meaning “not a number.”
2. DataFrame object#
DataFrame: Two-dimensional (rows & columns), size-mutable, potentially heterogeneous tabular data.
DataFrame
consists of one or more Series
that shares the same labels (indicies).
It is the DataFrame structure makes pandas a powerful tool for scientific data analysis.
2.1 Create DataFrame from Series#
We can convert a pd.Series
to a pd.DataFrame
of a single column.
And note that the column names must be supplied as a list.
pd.DataFrame(sr1, columns=['population'])
population | |
---|---|
CA | 39512223 |
TA | 28995881 |
FL | 21477737 |
concat
function: series are supplied as a list
pd.concat([sr1, sr2], axis=1) # axis=1 means on columns
0 | 1 | |
---|---|---|
CA | 39512223.0 | 163694.0 |
TA | 28995881.0 | 268596.0 |
FL | 21477737.0 | NaN |
AL | NaN | 665384.0 |
pd.concat([sr1, sr2]) # by default, axis=0, meaning on rows
CA 39512223
TA 28995881
FL 21477737
AL 665384
TA 268596
CA 163694
dtype: int64
concat
function: series are supplied as a dictionary
pd.concat({'population': sr1, 'area (sq mi)': sr2}, axis=1)
population | area (sq mi) | |
---|---|---|
CA | 39512223.0 | 163694.0 |
TA | 28995881.0 | 268596.0 |
FL | 21477737.0 | NaN |
AL | NaN | 665384.0 |
2.2 Convert NumPy’s Structured Array to pd.DataFrame
#
Recall how we convert a feature class to a structured ndarray
import arcpy
gdb_worksp = r"..\data\class_data.gdb"
arcpy.env.workspace = gdb_worksp
school_fc = "schools"
school_arr = arcpy.da.FeatureClassToNumPyArray(school_fc, '*') # use '*' to get all fields
school_arr.dtype.names
('OBJECTID_1',
'Shape',
'OBJECTID',
'STATUS',
'SCORE',
'SIDE',
'MATCH_ADDR',
'FEDERAL_ID',
'STATE_ID',
'SCHOOL_ID',
'NAME',
'ADDRESS',
'CITY',
'ZIPCODE',
'PHONE',
'COUNTY',
'OPERATING',
'OP_CLASS',
'ENROLLMENT',
'PROGRAMS',
'COMMON_USE',
'USE',
'TYPE',
'ACTIVITY',
'GRADES',
'LOW_GRADE',
'HIGH_GRADE',
'PRINCIPAL',
'TEACHERS',
'STDTCH_RT',
'MIGRNT_STD',
'TITLE1SCHO',
'MAGNETINFO',
'FREE_LUNCH',
'REDUCED_LU',
'FISH_FAC1',
'FISH_FAC2',
'COMMENTS',
'BBSERVICE',
'BBPROVIDER',
'BBSPEED',
'DSTREAMSPD',
'YR_BUILT',
'PARCEL_ID',
'LAT_DD',
'LONG_DD',
'USNG_FL_1K',
'FDOE_MSID',
'NCES_PUB',
'NCES_PRIV',
'FDOE_PRV',
'SOURCE',
'DESCRIPT',
'FLAG',
'UPDATE_DAY',
'FGDLAQDATE',
'AUTOID')
columns = ["NAME", 'OP_CLASS', 'ENROLLMENT', 'TYPE', 'TEACHERS']
school_arr = arcpy.da.FeatureClassToNumPyArray(school_fc, columns)
pd.DataFrame(school_arr, columns=columns)
NAME | OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|---|
0 | GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
1 | FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
2 | FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
3 | VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
4 | BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
... | ... | ... | ... | ... | ... |
107 | SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... | PUBLIC | 0.0 | COLLEGE/UNIVERSITY | 0.0 |
108 | SANTA FE COLLEGE - KIRKPATRICK CENTER | PUBLIC | 0.0 | COLLEGE/UNIVERSITY | 0.0 |
109 | ALACHUA COUNTY JAIL | PUBLIC | 0.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
110 | ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... | PUBLIC | 0.0 | MIGRANT EDUCATION PROGRAM | 4.0 |
111 | OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
112 rows × 5 columns
2.3 Preview a DataFrame#
The head
and tail
functions will allow us to preview the top
and bottom of a DataFrame, respectively.
school_df = pd.DataFrame(school_arr, columns=columns)
school_df.head()
NAME | OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|---|
0 | GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
1 | FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
2 | FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
3 | VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
4 | BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
And, by default (without giving an argument), both functions return 5 rows.
school_df.tail()
NAME | OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|---|
107 | SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... | PUBLIC | 0.0 | COLLEGE/UNIVERSITY | 0.0 |
108 | SANTA FE COLLEGE - KIRKPATRICK CENTER | PUBLIC | 0.0 | COLLEGE/UNIVERSITY | 0.0 |
109 | ALACHUA COUNTY JAIL | PUBLIC | 0.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
110 | ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... | PUBLIC | 0.0 | MIGRANT EDUCATION PROGRAM | 4.0 |
111 | OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
But, we can specify the number we want to preview.
school_df.head(10)
NAME | OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|---|
0 | GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
1 | FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
2 | FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
3 | VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
4 | BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
5 | DESTINY CHRISTIAN ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
6 | INCAF MONTESSORI SCHOOL | PRIVATE | 0.0 | ELEMENTARY | 0.0 |
7 | GREAT AMERICAN VISIONS ENTERPRISES,INC | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
8 | JORDAN GLEN SCHOOL INC. | PRIVATE | 115.0 | COMBINATION ELEMENTARY & MIDDLE | 14.5 |
9 | QUEEN OF PEACE CATHOLIC ACADEMY | PRIVATE | 358.0 | COMBINATION ELEMENTARY & MIDDLE | 28.4 |
2.4 Indexing and Slicing for DataFrame#
Select single column returns a pd.Series
.
school_df["NAME"]
0 GRACE CHRISTIAN SCHOOL OF ALACHUA CO.
1 FAMILY LIFE ACADEMY
2 FOREST GROVE CHRISTIAN ACADEMY
3 VAISHNAVA ACADEMY FOR GIRLS
4 BHAKTIVEDANTA ACADEMY
...
107 SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C...
108 SANTA FE COLLEGE - KIRKPATRICK CENTER
109 ALACHUA COUNTY JAIL
110 ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR...
111 OAK HALL LOWER SCHOOL
Name: NAME, Length: 112, dtype: object
Select multiple columns return a pandas DataFrame
school_df[["NAME", "ENROLLMENT"]] # multiple columns supplied as a list
NAME | ENROLLMENT | |
---|---|---|
0 | GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | 0.0 |
1 | FAMILY LIFE ACADEMY | 0.0 |
2 | FOREST GROVE CHRISTIAN ACADEMY | 53.0 |
3 | VAISHNAVA ACADEMY FOR GIRLS | 19.0 |
4 | BHAKTIVEDANTA ACADEMY | 0.0 |
... | ... | ... |
107 | SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN C... | 0.0 |
108 | SANTA FE COLLEGE - KIRKPATRICK CENTER | 0.0 |
109 | ALACHUA COUNTY JAIL | 0.0 |
110 | ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIR... | 0.0 |
111 | OAK HALL LOWER SCHOOL | 365.0 |
112 rows × 2 columns
Select row using the .loc[]
function (label/index-based).
Note that .loc
is followed by square brackets not parentheses.
school_df.loc[2]
NAME FOREST GROVE CHRISTIAN ACADEMY
OP_CLASS PRIVATE
ENROLLMENT 53.0
TYPE COMBINATION ELEMENTARY & SECONDARY
TEACHERS 9.4
Name: 2, dtype: object
Select row using the .iloc[]
function (position/order-based)
school_df.iloc[2]
NAME FOREST GROVE CHRISTIAN ACADEMY
OP_CLASS PRIVATE
ENROLLMENT 53.0
TYPE COMBINATION ELEMENTARY & SECONDARY
TEACHERS 9.4
Name: 2, dtype: object
Use set_index()
to choose a column as index.
inplace=True
reassign the result of
set_index()
school_df.set_index('NAME').head()
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
school_df.head()
NAME | OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|---|
0 | GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
1 | FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
2 | FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
3 | VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
4 | BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
school_df.set_index('NAME', inplace=True) # or
# school_df = school_df.set_index('NAME')
school_df.head()
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. | PRIVATE | 0.0 | SENIOR HIGH | 0.0 |
FAMILY LIFE ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
BHAKTIVEDANTA ACADEMY | PRIVATE | 0.0 | COMBINATION ELEMENTARY & MIDDLE | 0.0 |
school_df.loc['FOREST GROVE CHRISTIAN ACADEMY']
OP_CLASS PRIVATE
ENROLLMENT 53.0
TYPE COMBINATION ELEMENTARY & SECONDARY
TEACHERS 9.4
Name: FOREST GROVE CHRISTIAN ACADEMY, dtype: object
school_df.loc[:, 'ENROLLMENT']
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. 0.0
FAMILY LIFE ACADEMY 0.0
FOREST GROVE CHRISTIAN ACADEMY 53.0
VAISHNAVA ACADEMY FOR GIRLS 19.0
BHAKTIVEDANTA ACADEMY 0.0
...
SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER 0.0
SANTA FE COLLEGE - KIRKPATRICK CENTER 0.0
ALACHUA COUNTY JAIL 0.0
ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM 0.0
OAK HALL LOWER SCHOOL 365.0
Name: ENROLLMENT, Length: 112, dtype: float64
.iloc
for both row indices and column indices.
school_df.iloc[10:20, 1:4]
ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|
NAME | |||
THE ROCK SCHOOL | 207.0 | COMBINATION ELEMENTARY & SECONDARY | 17.9 |
CHRISTIAN LIFE ACADEMY | 52.0 | COMBINATION ELEMENTARY & SECONDARY | 5.9 |
SAINT FRANCIS CATHOLIC HIGH SCHOOL | 260.0 | SENIOR HIGH | 19.8 |
COUNTRYSIDE CHRISTIAN SCHOOL | 106.0 | COMBINATION ELEMENTARY & SECONDARY | 6.5 |
TRILOGY SCHOOL OF LEARNING ALTERNATIVE | 84.0 | COMBINATION ELEMENTARY & SECONDARY | 11.5 |
MILLHOPPER MONTESSORI SCHOOL | 209.0 | COMBINATION ELEMENTARY & MIDDLE | 19.1 |
BNAI ISRAEL DAY SCHOOL | 22.0 | ELEMENTARY | 0.0 |
GAINESVILLE CONDUCTIVE EDUCATION ACADEMY | 0.0 | COMBINATION ELEMENTARY & SECONDARY | 0.0 |
BRENTWOOD SCHOOL | 246.0 | ELEMENTARY | 10.8 |
CORNERSTONE ACADEMY | 216.0 | COMBINATION ELEMENTARY & SECONDARY | 24.8 |
Similar to indexing on a 2-d NumPy array, we can access individual values
of DataFrame
based on row index and column index.
Below is an example of finding out The Rock School’s (row) size of Enrollment (column).
school_df.loc['THE ROCK SCHOOL', 'ENROLLMENT']
207.0
3. Selection Query on DataFrame#
A single condition returns a Series of booleans again with the same index
school_df["ENROLLMENT"] > 0
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. False
FAMILY LIFE ACADEMY False
FOREST GROVE CHRISTIAN ACADEMY True
VAISHNAVA ACADEMY FOR GIRLS True
BHAKTIVEDANTA ACADEMY False
...
SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER False
SANTA FE COLLEGE - KIRKPATRICK CENTER False
ALACHUA COUNTY JAIL False
ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM False
OAK HALL LOWER SCHOOL True
Name: ENROLLMENT, Length: 112, dtype: bool
school_df.ENROLLMENT.head()
NAME
GRACE CHRISTIAN SCHOOL OF ALACHUA CO. 0.0
FAMILY LIFE ACADEMY 0.0
FOREST GROVE CHRISTIAN ACADEMY 53.0
VAISHNAVA ACADEMY FOR GIRLS 19.0
BHAKTIVEDANTA ACADEMY 0.0
Name: ENROLLMENT, dtype: float64
school_df.ENROLLMENT.index.name == school_df.index.name
True
school_df.index.values
array(['GRACE CHRISTIAN SCHOOL OF ALACHUA CO.', 'FAMILY LIFE ACADEMY',
'FOREST GROVE CHRISTIAN ACADEMY', 'VAISHNAVA ACADEMY FOR GIRLS',
'BHAKTIVEDANTA ACADEMY', 'DESTINY CHRISTIAN ACADEMY',
'INCAF MONTESSORI SCHOOL',
'GREAT AMERICAN VISIONS ENTERPRISES,INC',
'JORDAN GLEN SCHOOL INC.', 'QUEEN OF PEACE CATHOLIC ACADEMY',
'THE ROCK SCHOOL', 'CHRISTIAN LIFE ACADEMY',
'SAINT FRANCIS CATHOLIC HIGH SCHOOL',
'COUNTRYSIDE CHRISTIAN SCHOOL',
'TRILOGY SCHOOL OF LEARNING ALTERNATIVE',
'MILLHOPPER MONTESSORI SCHOOL', 'BNAI ISRAEL DAY SCHOOL',
'GAINESVILLE CONDUCTIVE EDUCATION ACADEMY', 'BRENTWOOD SCHOOL',
'CORNERSTONE ACADEMY', 'WESTWOOD HILLS CHRISTIAN SCHOOL',
'FLOWERS MONTESSORI SCHOOL', 'Z.L. SUNG S.D.A. SCHOOL',
'GAINESVILLE CONDUCTIVE EDUCATION ACADEMY',
'KIDS N ALL CHRISTIAN ACADEMY', 'OAK HALL SCHOOL',
'FREEDOM CHRISTIAN ACADEMY', 'GAINESVILLE COUNTRY DAY SCHOOL',
'FAITH TABERNACLE OF PRAISE SCHOOL OF MINISTRY',
"THE CHILDREN'S CENTER", 'CITY COLLEGE',
'SAINT PATRICK INTERPARISH SCHOOL',
'STAR CHRISTIAN CENTER AND ACADEMY',
'COMPASSIONATE OUTREACH MINISTRIES', 'WINDSOR CHRISTIAN ACADEMY',
'OAK HILL COMMUNITY PRIVATE SCHOOL SYSTEM',
'NORTH FLORIDA SDA ELEMENTARY',
'MICANOPY AREA COOPERATIVE SCHOOL, INC.',
'ARCHER COMMUNITY SCHOOL', 'SANTA FE COLLEGE - DAVIS CENTER',
'MICANOPY MIDDLE SCHOOL, INC.', 'OAK VIEW MIDDLE SCHOOL',
'NEWBERRY ELEMENTARY SCHOOL', 'NEWBERRY HIGH SCHOOL',
'HIGH SPRINGS COMMUNITY - ELEMENTARY AND MIDDLE SCHOOL',
'UNIVERSITY OF FLORIDA - COMPARITIVE MEDICINE',
'NORTH AMERICAN FAMILY INSTITUTE ALACHUA ACADEMY', 'PACE',
'HOSPITAL HOMEBOUND', 'SWEETWATER BRANCH ACADEMY',
'PROFESSIONAL ACADEMY MAGNET AT LOFTEN HIGH SCHOOL',
'SANTA FE HIGH SCHOOL', 'MEBANE MIDDLE SCHOOL',
'ALACHUA LEARNING CENTER - PUBLIC CHARTER SCHOOL',
'IRBY ELEMENTARY SCHOOL', 'ALACHUA ELEMENTARY SCHOOL',
'UNIVERSITY OF FLORIDA', 'UNIVERSITY OF FLORIDA - AGRONOMY LAB',
'SANTA FE COLLEGE - NORTHWEST CAMPUS', 'BUCHHOLZ HIGH SCHOOL',
'FINLEY ELEMENTARY SCHOOL',
'P.K. YONGE DEVELOPMENTAL RESEARCH SCHOOL',
'CHILES ELEMENTARY SCHOOL', 'FT CLARKE MIDDLE SCHOOL',
'HIDDEN OAK ELEMENTARY SCHOOL', 'TERWILLIGER ELEMENTARY SCHOOL',
'TALBOT ELEMENTARY SCHOOL', 'NORTON ELEMENTARY SCHOOL',
'LITTLEWOOD ELEMENTARY SCHOOL', 'WESTWOOD MIDDLE SCHOOL',
'HEALTHY LEARNING ACADEMY CHARTER SCHOOL',
'GLEN SPRINGS ELEMENTARY SCHOOL', 'GAINESVILLE HIGH SCHOOL',
'FOSTER ELEMENTARY SCHOOL', 'GENESIS PREPARATORY SCHOOL',
'LANIER CENTER & ANCHOR CENTER', 'UNIVERSITY OF FLORIDA',
'A. QUINN JONES CENTER', 'CHARACTER COUNTS CENTER',
'SANTA FE COMMUNITY COLLEGE', 'KANAPAHA MIDDLE SCHOOL',
'WILES ELEMENTARY SCHOOL',
'FAMILY PROGRESS CENTER - WILES HEAD START',
'EINSTEIN MONTESSORI SCHOOL',
'IDYLWILD ELEMENTARY SCHOOL AND MULTI-COUNTY MIGRANT PROGRAM',
'EXPRESSIONS LEARNING ARTS ACADEMY',
'GAINESVILLE WILDERNESS INSTITUTE',
'CARING & SHARING LEARNING SCHOOL',
'PRAIRIE VIEW ELEMENTARY SCHOOL', 'LINCOLN MIDDLE SCHOOL',
'WILLIAMS ELEMENTARY SCHOOL',
'ALACHUA COUNTY SUPERINTENDENT OFFICE - KIRBY-SMITH CENTER',
'DUVAL ELEMENTARY SCHOOL', 'BISHOP MIDDLE SCHOOL',
'METCALFE ELEMENTARY SCHOOL', 'THE ONE ROOM SCHOOL HOUSE PROJECT',
'RAWLINGS ELEMENTARY SCHOOL',
'FEARNSIDE FAMILY SERVICES CENTER AND HEAD START / PREK E I CENTER',
'HOGGETOWNE MIDDLE SCHOOL', 'HORIZON CENTER. ALTERNATIVE SCHOOL',
'LAKE FOREST ELEMENTARY SCHOOL', 'EASTSIDE HIGH SCHOOL',
'FLORIDA SIATECH AT GAINESVILLE,INC.', 'SHELL ELEMENTARY SCHOOL',
'HAWTHORNE JR/SR HIGH SCHOOL', 'WALDO COMMUNITY SCHOOL',
'UNIVERSITY OF FLORIDA',
'SANTA FE COLLEGE - CHARLES L BLOUNT DOWNTOWN CENTER',
'SANTA FE COLLEGE - KIRKPATRICK CENTER', 'ALACHUA COUNTY JAIL',
'ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM',
'OAK HALL LOWER SCHOOL'], dtype=object)
Similar to NumPy, we can use boolean indexing for Pandas as well, in which we create a condition (return Booleans in a Series) and then use it to filter the original DataFrame.
school_df[school_df["ENROLLMENT"] > 0] # without loc
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
JORDAN GLEN SCHOOL INC. | PRIVATE | 115.0 | COMBINATION ELEMENTARY & MIDDLE | 14.5 |
QUEEN OF PEACE CATHOLIC ACADEMY | PRIVATE | 358.0 | COMBINATION ELEMENTARY & MIDDLE | 28.4 |
THE ROCK SCHOOL | PRIVATE | 207.0 | COMBINATION ELEMENTARY & SECONDARY | 17.9 |
... | ... | ... | ... | ... |
FLORIDA SIATECH AT GAINESVILLE,INC. | PUBLIC | 202.0 | SENIOR HIGH | 5.0 |
SHELL ELEMENTARY SCHOOL | PUBLIC | 193.0 | ELEMENTARY | 16.0 |
HAWTHORNE JR/SR HIGH SCHOOL | PUBLIC | 395.0 | COMBINATION JR. HIGH & SENIOR HIGH | 23.0 |
WALDO COMMUNITY SCHOOL | PUBLIC | 215.0 | ELEMENTARY | 17.0 |
OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
84 rows × 4 columns
school_df.loc[school_df["ENROLLMENT"] > 0] # with loc (recommended)
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
JORDAN GLEN SCHOOL INC. | PRIVATE | 115.0 | COMBINATION ELEMENTARY & MIDDLE | 14.5 |
QUEEN OF PEACE CATHOLIC ACADEMY | PRIVATE | 358.0 | COMBINATION ELEMENTARY & MIDDLE | 28.4 |
THE ROCK SCHOOL | PRIVATE | 207.0 | COMBINATION ELEMENTARY & SECONDARY | 17.9 |
... | ... | ... | ... | ... |
FLORIDA SIATECH AT GAINESVILLE,INC. | PUBLIC | 202.0 | SENIOR HIGH | 5.0 |
SHELL ELEMENTARY SCHOOL | PUBLIC | 193.0 | ELEMENTARY | 16.0 |
HAWTHORNE JR/SR HIGH SCHOOL | PUBLIC | 395.0 | COMBINATION JR. HIGH & SENIOR HIGH | 23.0 |
WALDO COMMUNITY SCHOOL | PUBLIC | 215.0 | ELEMENTARY | 17.0 |
OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
84 rows × 4 columns
For multiple conditions, use parentheses to enclose each condition and then connect multiple conditions by either one of the following symbols.
AND operation:
&
OR operation:
|
Let’s find out schools with both positive enrollment and positive number of teachers.
school_df[(school_df["ENROLLMENT"] > 0) & (school_df["TEACHERS"] > 0)]
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
JORDAN GLEN SCHOOL INC. | PRIVATE | 115.0 | COMBINATION ELEMENTARY & MIDDLE | 14.5 |
QUEEN OF PEACE CATHOLIC ACADEMY | PRIVATE | 358.0 | COMBINATION ELEMENTARY & MIDDLE | 28.4 |
THE ROCK SCHOOL | PRIVATE | 207.0 | COMBINATION ELEMENTARY & SECONDARY | 17.9 |
CHRISTIAN LIFE ACADEMY | PRIVATE | 52.0 | COMBINATION ELEMENTARY & SECONDARY | 5.9 |
... | ... | ... | ... | ... |
FLORIDA SIATECH AT GAINESVILLE,INC. | PUBLIC | 202.0 | SENIOR HIGH | 5.0 |
SHELL ELEMENTARY SCHOOL | PUBLIC | 193.0 | ELEMENTARY | 16.0 |
HAWTHORNE JR/SR HIGH SCHOOL | PUBLIC | 395.0 | COMBINATION JR. HIGH & SENIOR HIGH | 23.0 |
WALDO COMMUNITY SCHOOL | PUBLIC | 215.0 | ELEMENTARY | 17.0 |
OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
76 rows × 4 columns
school_df[(school_df["ENROLLMENT"] > 0) | (school_df["TEACHERS"] > 0)]
OP_CLASS | ENROLLMENT | TYPE | TEACHERS | |
---|---|---|---|---|
NAME | ||||
FOREST GROVE CHRISTIAN ACADEMY | PRIVATE | 53.0 | COMBINATION ELEMENTARY & SECONDARY | 9.4 |
VAISHNAVA ACADEMY FOR GIRLS | PRIVATE | 19.0 | COMBINATION JR. HIGH & SENIOR HIGH | 0.0 |
JORDAN GLEN SCHOOL INC. | PRIVATE | 115.0 | COMBINATION ELEMENTARY & MIDDLE | 14.5 |
QUEEN OF PEACE CATHOLIC ACADEMY | PRIVATE | 358.0 | COMBINATION ELEMENTARY & MIDDLE | 28.4 |
THE ROCK SCHOOL | PRIVATE | 207.0 | COMBINATION ELEMENTARY & SECONDARY | 17.9 |
... | ... | ... | ... | ... |
SHELL ELEMENTARY SCHOOL | PUBLIC | 193.0 | ELEMENTARY | 16.0 |
HAWTHORNE JR/SR HIGH SCHOOL | PUBLIC | 395.0 | COMBINATION JR. HIGH & SENIOR HIGH | 23.0 |
WALDO COMMUNITY SCHOOL | PUBLIC | 215.0 | ELEMENTARY | 17.0 |
ALACHUA COUNTY STUDENT SERVICES/ MIGRANT/ VIRTUAL PROGRAM/ MCKAY SCHOLARSHIP/ HOMEBOUND PROGRAM | PUBLIC | 0.0 | MIGRANT EDUCATION PROGRAM | 4.0 |
OAK HALL LOWER SCHOOL | PRIVATE | 365.0 | ELEMENTARY | 16.0 |
86 rows × 4 columns
4. groupby()
function#
groupby is a useful function to aggregate data.
A groupby
operation involves some combination of
splitting the object,
applying a function, and
combining the results.
This can be used to group large amounts of data and compute operations on these groups to understand subgroups in the data.
school_df = school_df.loc[
(school_df["ENROLLMENT"] > 0) & (school_df["TEACHERS"] > 0),
['TYPE','ENROLLMENT', 'TEACHERS']
]
school_df.head()
TYPE | ENROLLMENT | TEACHERS | |
---|---|---|---|
NAME | |||
FOREST GROVE CHRISTIAN ACADEMY | COMBINATION ELEMENTARY & SECONDARY | 53.0 | 9.4 |
JORDAN GLEN SCHOOL INC. | COMBINATION ELEMENTARY & MIDDLE | 115.0 | 14.5 |
QUEEN OF PEACE CATHOLIC ACADEMY | COMBINATION ELEMENTARY & MIDDLE | 358.0 | 28.4 |
THE ROCK SCHOOL | COMBINATION ELEMENTARY & SECONDARY | 207.0 | 17.9 |
CHRISTIAN LIFE ACADEMY | COMBINATION ELEMENTARY & SECONDARY | 52.0 | 5.9 |
Without the aggregation function, the return is simply a groupby
object.
school_df.groupby("TYPE")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000248CBB228B0>
by_type = school_df.groupby("TYPE")
by_type.sum()
ENROLLMENT | TEACHERS | |
---|---|---|
TYPE | ||
COMBINATION ELEMENTARY & MIDDLE | 2664.0 | 176.0 |
COMBINATION ELEMENTARY & SECONDARY | 3094.0 | 323.3 |
COMBINATION JR. HIGH & SENIOR HIGH | 768.0 | 63.0 |
ELEMENTARY | 13231.0 | 961.3 |
MIDDLE/JR. HIGH | 4844.0 | 286.0 |
PRE-KINDERGARTEN | 163.0 | 20.0 |
PRE-KINDERGARTEN-KINDERGARTEN | 45.0 | 6.1 |
SENIOR HIGH | 7885.0 | 390.8 |
by_type.max()
ENROLLMENT | TEACHERS | |
---|---|---|
TYPE | ||
COMBINATION ELEMENTARY & MIDDLE | 938.0 | 54.0 |
COMBINATION ELEMENTARY & SECONDARY | 1139.0 | 122.0 |
COMBINATION JR. HIGH & SENIOR HIGH | 395.0 | 23.0 |
ELEMENTARY | 836.0 | 56.0 |
MIDDLE/JR. HIGH | 1047.0 | 62.0 |
PRE-KINDERGARTEN | 100.0 | 15.0 |
PRE-KINDERGARTEN-KINDERGARTEN | 45.0 | 6.1 |
SENIOR HIGH | 2221.0 | 100.0 |
school_df.groupby("TYPE").mean()
ENROLLMENT | TEACHERS | |
---|---|---|
TYPE | ||
COMBINATION ELEMENTARY & MIDDLE | 333.000000 | 22.000000 |
COMBINATION ELEMENTARY & SECONDARY | 238.000000 | 24.869231 |
COMBINATION JR. HIGH & SENIOR HIGH | 192.000000 | 15.750000 |
ELEMENTARY | 400.939394 | 29.130303 |
MIDDLE/JR. HIGH | 605.500000 | 35.750000 |
PRE-KINDERGARTEN | 81.500000 | 10.000000 |
PRE-KINDERGARTEN-KINDERGARTEN | 45.000000 | 6.100000 |
SENIOR HIGH | 1126.428571 | 55.828571 |
agg
stands for “aggregation” allows us to specify different aggregation functions
for different columns.
by_type.agg({'ENROLLMENT': np.sum, 'TEACHERS': np.mean})
ENROLLMENT | TEACHERS | |
---|---|---|
TYPE | ||
COMBINATION ELEMENTARY & MIDDLE | 2664.0 | 22.000000 |
COMBINATION ELEMENTARY & SECONDARY | 3094.0 | 24.869231 |
COMBINATION JR. HIGH & SENIOR HIGH | 768.0 | 15.750000 |
ELEMENTARY | 13231.0 | 29.130303 |
MIDDLE/JR. HIGH | 4844.0 | 35.750000 |
PRE-KINDERGARTEN | 163.0 | 10.000000 |
PRE-KINDERGARTEN-KINDERGARTEN | 45.0 | 6.100000 |
SENIOR HIGH | 7885.0 | 55.828571 |