Guide to Pandas Dataframes

Pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools.

In this guide, we will learn about one of the most important Pandas Data Structures, the DataFrame, and how to use it.

What is a DataFrame?

DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel. The data is represented in rows and columns.

A pandas DataFrame can be created using the following constructor −

pandas.DataFrame( data, index, columns, dtype, copy)
  • data: ndarray, series, map, lists, dict, constants or another DataFrame
  • index: row labels
  • columns: column labels
  • dtype: data type of each column
  • copy: used for copying data (default = False)

Example: Creating a DataFrame from Dict

wnba_stats = { 
'first_name': ['Candace', 'Sue', 'Diana', 'Brittney', 'Arike', 'Courtney', "A'ja"],
'last_name': ['Parker', 'Bird', 'Taurasi', 'Griner', 'Ogunbowale', 'Vandersloot', "Wilson"],
'pts': [14.7, 9.8, 18.7, 17.7, 22.8, 13.6, 20.5],
'reb': [9.7, 1.7, 4.2, 7.5, 2.8, 3.5, 8.5],
'ast': [4.6, 5.2, 4.5, 3.0, 3.4, 10.0, 2.0]
}
df = pandas.DataFrame(wnba_stats)
df
2020 Player Stats per game from wnba.com

Attributes and Methods

Viewing/Inspecting Data

Shape and dtypes are attributes you can call on a DataFrame

  • shape: returns the dimensions for the DataFrame (rows, columns)
  • dtypes: returns the data types of each column
df.shape
df (2020 player stats) has 7 rows and 5 columns
df.dtypes
datatypes of each column in df

Why are first_name and last_name objects? The string lengths in these columns are not fixed so Pandas stores them in an object ndarray. Read more here.

To view a small sample of a DataFrame object, use the head(), tail(), and sample() methods.

  • head: returns the first n rows (default is 5)
  • tail: returns the last n rows (default is 5)
  • sample: returns n random rows (default is 1)
df.head(3)
first 3 rows of df
  • info() is used to get a concise summary of the DataFrame
df.info()
  • describe() is used to view some basic statistical details of a DataFrame
df.describe()

Selection

You can select a specific subset of a DataFrame using [], loc, and iloc

df['first_name'] #returns the first_name column as a Series

loc vs iloc

  • loc: by label — loc[‘row_name’, ‘column_name’]
  • iloc: by index — iloc[row_index, column_index]

In this example, the labels (0, 1, 2, 3, 4, 5, 6) are the same as the indices

If you want to select a range of rows or columns, separate them with :

#locates rows with labels 0 - 3 and columns with label 'first_name'
df.loc[0:3, 'first_name']
df.iloc[4, :] #locates data with index 4 and all columns

Merging DataFrames

merge() will join two DataFrames together similar to how SQL tables are joined. You can choose how they are merged (left, right, inner=default), what column they are merged on, and much more. Check the documentation

Let’s make another DataFrame (from lists this time) with two columns: first_name and team

team_stats = [['Parker', 'LA'], ['Bird', 'SEA'], ['Taurasi', 'PHO'], ['Griner', 'PHO'], ['Ogunbowale', 'DAL'], ['Vandersloot', 'CHI'], ['Wilson', 'LV']]team_df = pd.DataFrame(team_stats, columns=['last_name', 'team'])
team_df
team data

Now we have two DataFrames: df and team_df and they have a last_name column in common. Let’s merge the tables on the last_name column so we have all our data in one DataFrame

merged_df = df.merge(team_df, on='last_name')
merged_df
merged data

DataFrames from CSV

One crucial feature of Pandas is its ability to write and read Excel, CSV, and many other types of files. We will focus on reading CSV files, but you can look through the other reader and writer functions in the Pandas documentation

read_csv() reads data from the CSV files and creates a DataFrame object.

p.s. you should give your DataFrames better names, but this is just easier to keep track of for now

df_one = pd.read_csv('WNBA-Stats.csv')
df_one.head()
WNBA Player stats Season 2016–2017 from Kaggle

We don’t need all these columns, so let’s only keep the important ones.

It’s good practice to make your changes and save them to a new DataFrame instead of editing the original

columns_we_want = ['Name', 'Team', 'Pos', 'Age', 'GP', 'PTS', 'REB', 'AST']
df_two = df_one[columns_we_want]
df_two.sample(5) #returns 5 random rows

Sorting Data

Let’s sort our DataFrame by points (PTS) and set ignore_index=True so our data is reindexed in sorted order

df_three = df_two.sort_values(by='PTS', ascending=False, ignore_index=True)
df_three.head()
data sorted from most to least points scored

Adding new columns

You can add columns to a DataFrame the same way you select them and Pandas will recognize whether it exists already (it’s that cool)

Let’s add a points per game column. To calculate PPG, we will divide each players total points for the season by the number of games they played (GP)

df_three['PPG'] = round(df_two['PTS'] / df_two['GP'], 2)
new ppg column rounded to 2 decimals

Filtering Data

Let’s just look at stats for Seattle Storm players with more than 100 rebounds

sea = (df_three['Team'] == 'SEA')
reb = (df_three['REB'] > 100)
df_three[sea & reb]

Groupby

A groupby operation groups data based on particular indicators. It returns a DataFrameGroupBy object that contains information about the different groups. Here are the steps to using groupby() effectively

  1. Split a DataFrame into groups.
  2. Apply some operations to each of those smaller DataFrames.
  3. Combine the results.

Let’s find the average age players by team

# pass groupby the name of the column you want to group on
grouped = df_three.groupby('Team')
# Age is the column we want to perform our mean() function on
grouped['Age'].mean()

The End

Resources

Sources