Intro to Pandas

Setting up your coding environment

Create a folder containing the data we are going to use, then open that folder in VS Code.

Import the package

import pandas

You can give the package a shorter alias, which makes it easier to type. pd is the conventional choice.

import pandas as pd

Load Data

Let’s use this Kaggle dataset on house prices as an example. I downloaded the data and saved it in a folder called data.

data_frame = pd.read_csv("data/US houuse price of 10 states.csv")
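
To try pd.read_csv without the Kaggle file, here is a minimal sketch that reads a few made-up rows from an in-memory string. The rows and the names csv_text and sample are illustrative assumptions, not the real data.

```python
import io

import pandas as pd

# Made-up rows in the style of the house price file (not the real data).
csv_text = """date,bed,price,zip_code
"AUG 29, 2024",3bd,"$375,000",32317
"AUG 29, 2024",Studio,,32507
"AUG 28, 2024",,"$250,000",32407
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # text columns are not numeric, zip_code is an integer
```

Empty fields come back as missing values, and a value such as "$375,000" is read as text rather than as a number.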

Inspect the data

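data_frame.head() # first five rows
data_frame.info() # column dtypes and non-null counts
data_frame.describe() # summary statistics
data_frame.shape # (number of rows, number of columns)
data_frame.columns # column names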
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12075 entries, 0 to 12074
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        12075 non-null  object
 1   house_size  11107 non-null  object
 2   bed         11410 non-null  object
 3   bath        11410 non-null  object
 4   price       11009 non-null  object
 5   broker      9569 non-null   object
 6   street      12075 non-null  object
 7   city        12075 non-null  object
 8   state_name  12075 non-null  object
 9   zip_code    12075 non-null  int64 
dtypes: int64(1), object(9)
memory usage: 943.5+ KB
Index(['date', 'house_size', 'bed', 'bath', 'price', 'broker', 'street',
       'city', 'state_name', 'zip_code'],
      dtype='object')
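
In the info() output above, several columns have fewer than 12075 non-null values, which means they contain missing data. As a rough sketch, the same inspection calls can be tried on a tiny hypothetical frame (the values below are invented for illustration), with isna().sum() counting the missing values per column directly.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only.
sample = pd.DataFrame(
    {
        "bed": ["3bd", "Studio", None, "4bd"],
        "price": ["$375,000", None, "$250,000", "$416,402"],
        "zip_code": [32317, 32507, 32407, 32570],
    }
)

print(sample.shape)          # (rows, columns) as a tuple
print(list(sample.columns))  # column names as a plain list
print(sample.isna().sum())   # number of missing values per column
sample.info()                # prints dtypes and non-null counts
```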

Inspect variables

data_frame["bed"].head()
data_frame[["bed", "bath"]].head()
      bed    bath
0     4bd     4bd
1     2bd     2bd
2     3bd     3bd
3  Studio  Studio
4     3bd     3bd
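
The two selections above return different types. A small sketch on made-up values (the frame sample is an assumption for illustration) shows the difference: a single column name gives a pandas.Series, a list of names gives a DataFrame.

```python
import pandas as pd

# Made-up values that mimic the first rows shown above.
sample = pd.DataFrame(
    {"bed": ["4bd", "2bd", "Studio"], "bath": ["4bd", "2bd", "Studio"]}
)

one_column = sample["bed"]             # single name -> pandas.Series
two_columns = sample[["bed", "bath"]]  # list of names -> pandas.DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```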

Inspect variables

data_frame["bed"].count() # number of non-null values
data_frame["bed"].nunique() # count of unique values
data_frame["bed"].value_counts() # count for each unique value
bed
3bd       5172
4bd       2998
2bd       1797
5bd        762
1bd        290
Studio     153
6bd        150
7bd         31
8bd         31
9bd          9
12bd         4
10bd         3
14bd         2
21bd         2
15bd         2
11bd         2
16bd         1
13bd         1
Name: count, dtype: int64
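
By default value_counts() leaves out missing values. The sketch below, on an invented Series, shows two optional arguments that are often useful here: dropna=False adds a row for missing values and normalize=True returns shares instead of counts.

```python
import pandas as pd

# Invented values, including one missing entry.
beds = pd.Series(["3bd", "3bd", "4bd", "Studio", None], name="bed")

print(beds.count())                       # non-null values only
print(beds.nunique())                     # unique non-null values
print(beds.value_counts(dropna=False))    # include a row for missing values
print(beds.value_counts(normalize=True))  # shares of the non-null values
```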

pandas.Series.str

A pandas.Series is a single column of our data frame. Its .str accessor applies string methods to every value in the column.

Read the documentation on pandas.Series.str – how can we create a numeric variable based on the "bed" column?
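
One possible answer, sketched on invented values: extract the digits with .str.extract and convert them with pd.to_numeric. Treating "Studio" as 0 bedrooms is an assumption made for this example, and the name bed_numeric is chosen only for illustration.

```python
import pandas as pd

# Invented values in the same format as the "bed" column.
beds = pd.Series(["3bd", "Studio", "12bd", None], name="bed")

# Pull out the digits; values without digits become NaN.
digits = beds.str.extract(r"(\d+)", expand=False)
bed_numeric = pd.to_numeric(digits)

# Assumption: count a studio as 0 bedrooms. Real missing values stay NaN.
bed_numeric = bed_numeric.mask(beds == "Studio", 0)

print(bed_numeric.tolist())
```

On the real data the result could be stored as a new column, for example data_frame["bed_numeric"]. Other routes, such as .str.replace("bd", ""), would also work.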

Filtering the data

data_frame[data_frame["bed"] == "3bd"]
       date          house_size                  bed  bath  price       broker                                             street                  city         state_name  zip_code
2      AUG 29, 2024  1,926 sqft (on 0.45 acres)  3bd  3bd   $375,000    Coldwell Banker Hartung                            6761 Landover Cir       Tallahassee  Florida     32317
4      AUG 29, 2024  1,205 sqft                  3bd  3bd   $233,900    D R Horton Realty of NW Florida, LLC               6274 June Bug Dr        Milton       Florida     32583
14     AUG 28, 2024  1,820 sqft                  3bd  3bd   $330,500    EXP Realty, LLC                                    8564 Westview Ln        Pensacola    Florida     32514
15     AUG 28, 2024  1,370 sqft                  3bd  3bd   $173,000    Better Homes And Gardens Real Estate Main Stre...  6905 Woodley Dr         Pensacola    Florida     32503
17     AUG 28, 2024  2,681 sqft (on 1.82 acres)  3bd  3bd   $525,000    American Valor Realty LLC                          8021 Quiet Dr           Pensacola    Florida     32526
...    ...           ...                         ...  ...   ...         ...                                                ...                     ...          ...         ...
12059  AUG 30, 2024  2,610 sqft                  3bd  3bd   $1,143,909  BALBOA REAL ESTATE, INC.                           50525 Spyglass Hill Dr  La Quinta    CA          92253
12063  AUG 30, 2024  1,983 sqft (on 1 acre)      3bd  3bd   $1,385,000  Compass                                            10241 McBroom St        Sunland      CA          91040
12065  AUG 30, 2024  1,757 sqft                  3bd  3bd   $568,000    Berkshire Hathaway Home Serv.                      26848 Hanford St        Menifee      CA          92584
12069  AUG 30, 2024  2,000 sqft                  3bd  3bd   $1,880,000  Real Estate Legends USA                            16 Riveroaks            Irvine       CA          92602
12074  AUG 30, 2024  1,615 sqft                  3bd  3bd   $508,000    Starlitloan&Realty                                 1668 Ravenswood Rd      Beaumont     CA          92223

5172 rows × 10 columns
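
The expression inside the brackets is a boolean Series with one True/False value per row, so conditions can be combined. A minimal sketch on an invented frame: & means "and", | means "or", and .isin() matches any of several values.

```python
import pandas as pd

# Invented rows; the state labels mimic the ones in the output above.
sample = pd.DataFrame(
    {
        "bed": ["3bd", "4bd", "3bd", "Studio"],
        "state_name": ["Florida", "Florida", "CA", "CA"],
    }
)

is_three_bed = sample["bed"] == "3bd"            # boolean Series
in_florida = sample["state_name"] == "Florida"   # boolean Series

both = sample[is_three_bed & in_florida]          # both conditions hold
either = sample[is_three_bed | in_florida]        # at least one holds
three_or_four = sample[sample["bed"].isin(["3bd", "4bd"])]

print(both)
print(either)
print(three_or_four)
```

When the comparisons are written inline, each one needs its own parentheses, for example (sample["bed"] == "3bd") & (sample["state_name"] == "CA").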

Handling Missing Data – drop missing data

Documentation on .dropna()

data_frame.dropna()
       date          house_size                  bed     bath    price       broker                                street                          city               state_name  zip_code
2      AUG 29, 2024  1,926 sqft (on 0.45 acres)  3bd     3bd     $375,000    Coldwell Banker Hartung               6761 Landover Cir               Tallahassee        Florida     32317
3      AUG 29, 2024  1,132 sqft                  Studio  Studio  $190,000    EXP Realty, LLC                       1701 S Fairfield Dr             Perdido Key        Florida     32507
4      AUG 29, 2024  1,205 sqft                  3bd     3bd     $233,900    D R Horton Realty of NW Florida, LLC  6274 June Bug Dr                Milton             Florida     32583
5      AUG 29, 2024  3,044 sqft (on 0.34 acres)  4bd     4bd     $416,402    ADAMS HOME REALTY, INC                6528 Benelli Dr                 Milton             Florida     32570
13     AUG 28, 2024  1,254 sqft                  2bd     2bd     $250,000    JANET COULTER REALTY                  520 Richard Jackson Blvd #2810  Panama City Beach  Florida     32407
...    ...           ...                         ...     ...     ...         ...                                   ...                             ...                ...         ...
12067  AUG 30, 2024  2,352 sqft                  4bd     4bd     $795,000    Realty Masters                        12549 Navel Ct                  Riverside          CA          92503
12069  AUG 30, 2024  2,000 sqft                  3bd     3bd     $1,880,000  Real Estate Legends USA               16 Riveroaks                    Irvine             CA          92602
12070  AUG 30, 2024  3,835 sqft                  5bd     5bd     $1,935,000  Realty ONE Group Pacific              1154 Via Vera Cruz              San Marcos         CA          92078
12072  AUG 30, 2024  2,616 sqft                  4bd     4bd     $685,000    Anderson Real Estate Group            5509 W Modoc Avenue             Visalia            CA          93291
12074  AUG 30, 2024  1,615 sqft                  3bd     3bd     $508,000    Starlitloan&Realty                    1668 Ravenswood Rd              Beaumont           CA          92223

8068 rows × 10 columns
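
Called without arguments, dropna() removes every row that has at least one missing value, which is why only 8068 of the 12075 rows remain above. The sketch below, on an invented frame, shows two arguments that make it less aggressive: subset checks only the listed columns and how="all" drops a row only when every value is missing.

```python
import pandas as pd

# Invented rows with different patterns of missing values.
sample = pd.DataFrame(
    {
        "bed": ["3bd", None, "4bd", None],
        "price": ["$375,000", "$190,000", None, None],
        "broker": ["Compass", None, "Realty Masters", None],
    }
)

any_missing = sample.dropna()                  # drop rows with any missing value
price_known = sample.dropna(subset=["price"])  # only check the price column
all_missing = sample.dropna(how="all")         # drop rows that are entirely empty

print(len(sample), len(any_missing), len(price_known), len(all_missing))
```

dropna() returns a new data frame and leaves the original unchanged, so assign the result to a variable to keep it.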

Handling Missing Data – fill missing data

Documentation on .fillna()

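data_frame.fillna(0) # replace every missing value with 0
data_frame.ffill() # forward fill: copy the last valid value downwards
data_frame.bfill() # backward fill: copy the next valid value upwards
data_frame.fillna(data_frame.mean(numeric_only=True)) # column means, numeric columns only
# "bed_numeric" is the numeric column created from "bed" in the pandas.Series.str exercise
data_frame["bed_numeric"].fillna(data_frame["bed_numeric"].mean())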
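
Like dropna(), the fill methods return a new object, so the result has to be assigned to be kept. A minimal sketch on an invented frame: the column bed_numeric and its values are assumptions for illustration, and "unknown" is a placeholder label chosen for this example.

```python
import pandas as pd

# Invented frame with one missing value in each column.
sample = pd.DataFrame(
    {
        "bed": ["3bd", None, "4bd"],
        "bed_numeric": [3.0, None, 4.0],
    }
)

# Fill a numeric column with its own mean and store the result.
mean_beds = sample["bed_numeric"].mean()
sample["bed_numeric"] = sample["bed_numeric"].fillna(mean_beds)

# Fill a text column with a placeholder label instead of a number.
sample["bed"] = sample["bed"].fillna("unknown")

print(sample)
```

Whether the mean is a sensible fill value depends on the variable. For something like a bedroom count, the median or leaving the value missing may be more defensible.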