Getting to Know the Pandas DataFrame
The Pandas DataFrame is a data structure that allows us to manipulate and analyze tabular data. A “tabular” data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It’s a structure that you would find in a SQL database or Excel spreadsheet. Let’s say we have a tabular dataset about movies.
In this case, each row represents a movie and each column represents a characteristic of the movie, such as its genre, rating, and director. The “index” column holds a label for each row. By default, that label is the row’s position in the dataframe, and a Pandas DataFrame’s index starts at 0.
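As a minimal sketch of the default index, the snippet below builds a small illustrative dataframe (the variable name `df_example` is an assumption made for this example) and inspects its row labels.

```python
import pandas as pd

# Small illustrative dataset, used only to show the default index.
df_example = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'rating': [6, 9, 8],
})

# With no index specified, pandas labels the rows 0, 1, 2, ...
print(df_example.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(df_example.index))  # [0, 1, 2]

# The label 0 therefore refers to the first row.
first_row = df_example.loc[0]
print(first_row['movie'])      # Batman
```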
Importing the Pandas package
In order to create and use a Pandas DataFrame, we need to have the pandas package readily available in our environment. Let’s import pandas and give it the alias “pd” so that we don’t have to write out “pandas” every time we call a function.
import pandas as pd
Creating a dataframe
There are several ways to create a Pandas DataFrame. Here, we’ll describe 2 approaches.
Converting a dictionary to a dataframe
You can create a dataframe from a dictionary. Each key of the dictionary represents a column name and the value of the dictionary is a list that represents values belonging to that particular column. Each element of the list represents the value of a row in the dataframe.
Let’s create a dataframe called df_movies.
Note: df is short for “dataframe”. It’s common for data scientists to name their dataframe “df”.
data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']
}
df_movies = pd.DataFrame(data)
We can confirm that df_movies is indeed a dataframe:
type(df_movies)
pandas.core.frame.DataFrame
Now let’s see how it looks 👀:
df_movies
| | movie | genre | rating | director |
|---|---|---|---|---|
| 0 | Batman | action | 6 | Tim Burton |
| 1 | Jungle Book | kids | 9 | Wolfgang Reitherman |
| 2 | Titanic | romance | 8 | James Cameron |
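The mapping from dictionary to dataframe can be checked directly. The sketch below rebuilds the same dataframe and confirms that the keys became column names and the list positions became rows. The final `try` block is an added illustration, not part of the original walkthrough: it shows that lists of unequal length are rejected with a `ValueError`.

```python
import pandas as pd

data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
}
df_movies = pd.DataFrame(data)

# Dictionary keys become the column names, in insertion order.
print(list(df_movies.columns))    # ['movie', 'genre', 'rating', 'director']

# Three list elements per key give three rows; four keys give four columns.
print(df_movies.shape)            # (3, 4)

# The second element of each list ends up in the row labelled 1.
print(df_movies.loc[1, 'movie'])  # Jungle Book

# Every list must have the same length, otherwise pandas raises a ValueError.
unequal_lengths_rejected = False
try:
    pd.DataFrame({'movie': ['Batman', 'Titanic'], 'rating': [6]})
except ValueError as err:
    unequal_lengths_rejected = True
    print(f"Could not build dataframe: {err}")
```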
Loading a csv file into a dataframe
You can also create a dataframe by importing tabular data from a comma-separated values (csv) file or an Excel spreadsheet. A csv file looks something like this:
To load this csv file into a Pandas DataFrame, we will need to use the Pandas read_csv() function. For data in Excel format, you can use read_excel(). We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud.
Let’s load the movies_data.csv file as a dataframe. The original file is located on my local machine in a folder called data/.
df_movies = pd.read_csv("data/movies_data.csv")
df_movies
| | movie | genre | rating | director |
|---|---|---|---|---|
| 0 | Batman | action | 6 | Tim Burton |
| 1 | Jungle Book | kids | 9 | Wolfgang Reitherman |
| 2 | Titanic | romance | 8 | James Cameron |
This csv-loaded dataframe is identical to the one that was generated from a dictionary.
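To try this without the file on disk, here is a hedged sketch that reads the same rows from an in-memory string. The contents of `csv_text` are an assumption about what movies_data.csv contains, reconstructed from the table above, and `io.StringIO` stands in for the file path.

```python
import io

import pandas as pd

# Assumed contents of data/movies_data.csv, reconstructed from the table above.
csv_text = """movie,genre,rating,director
Batman,action,6,Tim Burton
Jungle Book,kids,9,Wolfgang Reitherman
Titanic,romance,8,James Cameron
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same data built from a dictionary, for comparison.
df_from_dict = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
})

# equals() checks that shape, values, and column types all match.
print(df_from_csv.equals(df_from_dict))  # True
```

With a real file, the call would simply take the path instead, as in `pd.read_csv("data/movies_data.csv")`.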
Pandas Series
An important part of the Pandas DataFrame is the Pandas Series. While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). A Series represents a single column of a DataFrame. This is how you get an individual column (represented as a Pandas Series) from a dataframe:
dataframe['column_name']
Let’s say we want to pull the rating column from our df_movies dataframe.
df_movies['rating']
0 6
1 9
2 8
Name: rating, dtype: int64
The rating column is a Pandas Series! We can confirm its datatype:
type(df_movies['rating'])
pandas.core.series.Series
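The relationship between a dataframe and its columns can be sketched as follows. A two-column version of df_movies is rebuilt here so the snippet runs on its own, and the variable name `ratings` is an assumption made for this example.

```python
import pandas as pd

df_movies = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'rating': [6, 9, 8],
})

# Selecting one column with square brackets returns a Series.
ratings = df_movies['rating']
print(type(ratings).__name__)        # Series

# A Series has one dimension, while the dataframe has two.
print(ratings.ndim, df_movies.ndim)  # 1 2

# The Series keeps the column name and shares the dataframe's index.
print(ratings.name)                           # rating
print(ratings.index.equals(df_movies.index))  # True

# Values can be looked up by index label.
print(ratings.loc[1])                # 9
```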
There is a wide range of built-in functions that come with the Pandas Series. Some examples include:
- .mean(): if the column is numeric, it gets the average value of the column
- .nunique(): counts the number of unique values belonging to a particular column
- .fillna(value='value'): fills missing values with ‘value’ (or any other value of your choosing)
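A short sketch of these three functions on the movies data follows. The Series `ratings_with_gap` is a hypothetical example with one missing value, added here because the original dataframe has no gaps to fill.

```python
import pandas as pd

df_movies = pd.DataFrame({
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
})

# .mean(): average of a numeric column, here (6 + 9 + 8) / 3.
average_rating = df_movies['rating'].mean()
print(round(average_rating, 2))  # 7.67

# .nunique(): number of distinct values in a column.
genre_count = df_movies['genre'].nunique()
print(genre_count)               # 3

# .fillna(): replace missing values with a value of your choosing.
# Hypothetical ratings with one missing entry (None is stored as NaN).
ratings_with_gap = pd.Series([6, None, 8], name='rating')
filled = ratings_with_gap.fillna(value=0)
print(filled.tolist())           # [6.0, 0.0, 8.0]
```

Note that the missing value forces the Series to a floating-point type, which is why the filled values print as floats.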
The official documentation on Pandas Series provides a list of all available functions. We’ll explore the functions of Pandas Series in more detail in the upcoming chapter, Data Exploration.