Tutorial: Palmer Penguins Dataset
In this tutorial, we will explore the Palmer Penguins dataset using the tidyversetopandas package. This package simplifies data manipulation in Python by bringing R’s tidyverse-like functionality to pandas. We’ll demonstrate how to use its key functions: select, mutate, filter, and arrange.
Loading the Palmer Penguins Dataset
The Palmer Penguins dataset includes various measurements from three penguin species. It’s ideal for demonstrating data manipulation techniques.
First, let’s load the dataset into a pandas DataFrame:
# Load Penguins dataset
import pandas as pd
from tidyversetopandas import tidyversetopandas as ttp
penguins = pd.read_csv("penguins.csv")
penguins.head()
| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
Removing NAs with filter
Let’s start by counting the missing (NA) values in each column.
penguins.isna().sum()
rowid 0
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
year 0
dtype: int64
There are 11 missing values in sex. Let’s remove those rows with the filter function from ttp, combined with the built-in isnull method from pandas.
newPenguins = ttp.filter(penguins, "~ sex.isnull()")
newPenguins.isna().sum()
rowid 0
species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
year 0
dtype: int64
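Looks great! We successfully removed all NA values from sex. The counts for the other columns dropped to zero as well, which tells us that the rows with missing measurements were also missing sex.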
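For comparison, here is a minimal sketch of the same row filter in plain pandas. It runs on a tiny made-up DataFrame (the values are illustrative, not taken from penguins.csv) and does not claim to show how ttp.filter works internally.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for penguins.csv (illustrative values only)
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750.0, np.nan, 5000.0, 3700.0],
        "sex": ["male", None, "female", None],
    }
)

# Boolean mask: keep only the rows where sex is present
kept_mask = toy[toy["sex"].notna()]

# dropna restricted to a single column keeps the same rows
kept_dropna = toy.dropna(subset=["sex"])

print(kept_mask.equals(kept_dropna))
print(kept_mask.isna().sum())
```

Both approaches keep the original index labels, so the surviving rows can still be matched back to the unfiltered data.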
Filtering species and size with filter
Next, let’s limit our study to “Adelie” penguins with a body_mass_g greater than 3000 grams. The filter function is perfect for this job. We start with 333 rows, so we should see fewer rows after filtering.
print(newPenguins.shape)
newPenguins = ttp.filter(
newPenguins, "species == 'Adelie' & body_mass_g > 3000")
print(newPenguins.shape)
(333, 9)
(138, 9)
The row count does drop to 138, which is a good sign. Let’s check the DataFrame to make sure it contains only “Adelie” penguins heavier than 3000 grams.
print(newPenguins.species.unique())
print(newPenguins.body_mass_g.min())
['Adelie']
3050.0
We did it again! “Adelie” is the only species left, and the lightest penguin weighs 3050 grams.
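The same two-part condition can be sketched with a plain pandas boolean mask. The toy data below is made up for illustration, and the sketch makes no claim about how ttp.filter evaluates its string expression.

```python
import pandas as pd

# Made-up rows: one Adelie below the threshold, two above, and one Gentoo
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
        "body_mass_g": [2900.0, 3050.0, 5000.0, 3750.0],
    }
)

# In plain Python, & binds tighter than == and >,
# so each comparison needs its own parentheses
heavy_adelie = toy[(toy["species"] == "Adelie") & (toy["body_mass_g"] > 3000)]

print(heavy_adelie.shape)
print(heavy_adelie["species"].unique())
print(heavy_adelie["body_mass_g"].min())
```

Leaving out the parentheses is a common source of errors with boolean masks, which is one reason a string-based filter can be more convenient.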
Creating new columns with mutate
Now, let’s create a new column called body_mass_kg that converts body_mass_g to kilograms. We can do this with the mutate function.
penguins = ttp.mutate(penguins, "body_mass_kg = body_mass_g / 1000")
penguins.head()
| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | body_mass_kg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 3.75 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 3.80 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 3.25 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 | NaN |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 3.45 |
The new body_mass_kg column now appears on the far right of the DataFrame.
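For readers more familiar with plain pandas, a rough equivalent of this step is the assign method. This is a sketch on made-up values, not a description of what mutate does internally.

```python
import pandas as pd

# Made-up body masses in grams
toy = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# assign returns a new DataFrame and leaves the original untouched
with_kg = toy.assign(body_mass_kg=toy["body_mass_g"] / 1000)

print(with_kg)
print("body_mass_kg" in toy.columns)
```

Because assign returns a copy, the result has to be bound to a name, just as the tutorial rebinds penguins to the return value of ttp.mutate.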
Converting all lengths to cm
Now, let us convert the bill_length_mm, bill_depth_mm, and flipper_length_mm columns to centimeters. We can do this with the mutate function as well.
penguins = ttp.mutate(penguins, "bill_length_cm = bill_length_mm / 10")
penguins = ttp.mutate(penguins, "bill_depth_cm = bill_depth_mm / 10")
penguins = ttp.mutate(penguins, "flipper_length_cm = flipper_length_mm / 10")
penguins.head()
| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | body_mass_kg | bill_length_cm | bill_depth_cm | flipper_length_cm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 3.75 | 3.91 | 1.87 | 18.1 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 3.80 | 3.95 | 1.74 | 18.6 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 3.25 | 4.03 | 1.80 | 19.5 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 | NaN | NaN | NaN | NaN |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 3.45 | 3.67 | 1.93 | 19.3 |
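As an aside, when many columns need the same conversion, the repeated calls can be generated from the column names. The sketch below uses plain pandas and made-up values. The "_mm" suffix convention comes from the dataset, while the variable names are just illustrative.

```python
import pandas as pd

# Made-up measurements in millimeters
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, 39.5],
        "bill_depth_mm": [18.7, 17.4],
        "flipper_length_mm": [181.0, 186.0],
    }
)

# Collect every column whose name ends in _mm
mm_columns = [name for name in toy.columns if name.endswith("_mm")]

# Build one *_cm column per *_mm column in a single assign call
converted = toy.assign(
    **{name[: -len("_mm")] + "_cm": toy[name] / 10 for name in mm_columns}
)

print(converted.columns.tolist())
```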
Now we see that we have all these new columns in our dataframe, but some are redundant. Let’s use the select function to remove the old columns.
Selecting Columns with select
The select function in the tidyversetopandas package is a powerful tool designed to bring the simplicity and intuitiveness of R’s tidyverse to Python’s pandas library. This function specifically mirrors the functionality of dplyr’s select in R, allowing users to easily choose specific columns from a DataFrame for focused analysis.
In practical terms, select lets users streamline their datasets by including only the columns that are relevant to their current analysis, thereby simplifying the data manipulation process. By using select, Python users can enjoy a more R-like syntax for data manipulation, making the transition between R and Python smoother and more intuitive.
Example 1: Selecting One Column
Let’s try selecting one column:
# Selecting species
penguins_subset = ttp.select(penguins, "species")
penguins_subset.head()
| | species |
|---|---|
| 0 | Adelie |
| 1 | Adelie |
| 2 | Adelie |
| 3 | Adelie |
| 4 | Adelie |
The output is a subset of the original penguins DataFrame, containing only the species column.
Example 2: Selecting Multiple Columns for Comparative Analysis
For a more detailed comparative analysis, let’s select columns that would provide insight into the physical characteristics of the penguins. We’ll choose species, bill length, bill depth, and body mass.
# Selecting species, bill_length_mm, bill_depth_mm, and body_mass_g columns
penguins_physical = ttp.select(
penguins, "species", "bill_length_mm", "bill_depth_mm", "body_mass_g"
)
penguins_physical.head()
| | species | bill_length_mm | bill_depth_mm | body_mass_g |
|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 3750.0 |
| 1 | Adelie | 39.5 | 17.4 | 3800.0 |
| 2 | Adelie | 40.3 | 18.0 | 3250.0 |
| 3 | Adelie | NaN | NaN | NaN |
| 4 | Adelie | 36.7 | 19.3 | 3450.0 |
Here, the output is a DataFrame that includes a different set of columns: species, bill_length_mm, bill_depth_mm, and body_mass_g.
This subset is intended for a more detailed comparative analysis, focusing on the physical characteristics of the penguins, such as bill length, bill depth, and body mass.
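The two selections above have a direct counterpart in plain pandas indexing, sketched below on made-up data. One detail worth noting is the difference between selecting with a list of names and selecting with a bare string.

```python
import pandas as pd

# Made-up rows with a subset of the penguins columns
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "island": ["Torgersen", "Biscoe"],
        "bill_length_mm": [39.1, 46.1],
        "body_mass_g": [3750.0, 4500.0],
    }
)

# A list of names returns a DataFrame with those columns, in that order
physical = toy[["species", "bill_length_mm", "body_mass_g"]]

# A one-element list still returns a DataFrame; a bare string returns a Series
one_col_frame = toy[["species"]]
one_col_series = toy["species"]

print(physical.columns.tolist())
print(type(one_col_frame).__name__, type(one_col_series).__name__)
```

The table in Example 1 has a header row and an index column, which suggests that ttp.select returned a DataFrame rather than a Series there.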
Example 3: Only selecting the columns we want to keep after mutate
Earlier in the mutate section, we created a few new columns. Let’s use select to remove the old columns.
penguins = ttp.select(
penguins,
"species",
"bill_length_cm",
"bill_depth_cm",
"flipper_length_cm",
"body_mass_kg",
)
penguins.head()
| | species | bill_length_cm | bill_depth_cm | flipper_length_cm | body_mass_kg |
|---|---|---|---|---|---|
| 0 | Adelie | 3.91 | 1.87 | 18.1 | 3.75 |
| 1 | Adelie | 3.95 | 1.74 | 18.6 | 3.80 |
| 2 | Adelie | 4.03 | 1.80 | 19.5 | 3.25 |
| 3 | Adelie | NaN | NaN | NaN | NaN |
| 4 | Adelie | 3.67 | 1.93 | 19.3 | 3.45 |
Now our penguins DataFrame has only the columns we want to keep.
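When only a few columns need to go, it can be shorter to list what to discard rather than what to keep. The sketch below shows that alternative with the pandas drop method on a made-up one-row frame. It is a plain pandas technique, not a feature of tidyversetopandas.

```python
import pandas as pd

# Made-up row holding both the old and the converted mass column
toy = pd.DataFrame(
    {
        "species": ["Adelie"],
        "body_mass_g": [3750.0],
        "body_mass_kg": [3.75],
        "year": [2007],
    }
)

# Name the columns to discard instead of the columns to keep
trimmed = toy.drop(columns=["body_mass_g", "year"])

print(trimmed.columns.tolist())
```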
Sorting the data using arrange
In pandas, sorting is done with the sort_values() method. In R’s tidyverse (specifically dplyr), the equivalent function is arrange(). It takes the whole dataset as its first parameter, followed by the names of the columns to sort on. To make sorting in Python friendlier to people who are used to R and the tidyverse, we wrapped the pandas method in a function that follows the structure of arrange().
To get a better understanding of how it works, let’s sort the Palmer Penguins dataset to find the heaviest penguins.
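To make the idea of wrapping concrete, here is a hypothetical sketch of an arrange-style function built on sort_values. The name arrange_sketch and its body are assumptions for illustration only and are not the package’s source code. The argument order (DataFrame, a boolean for ascending order, then column names) is inferred from how ttp.arrange is called in this tutorial.

```python
import pandas as pd


def arrange_sketch(df: pd.DataFrame, ascending: bool, *columns: str) -> pd.DataFrame:
    """Hypothetical stand-in for an arrange-style wrapper (not the package's code)."""
    # sort_values accepts a list of column names and returns a new DataFrame
    return df.sort_values(by=list(columns), ascending=ascending)


# Made-up rows for illustration
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"], "body_mass_kg": [3.75, 6.3, 3.7]}
)

heaviest_first = arrange_sketch(toy, False, "body_mass_kg")
print(heaviest_first["species"].tolist())
print(heaviest_first.index.tolist())
```

Note that sort_values keeps the original index labels, which is why sorted output shows row labels out of order.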
penguins_sorted = ttp.arrange(penguins, False, "body_mass_kg")
penguins_sorted.head()
| | species | bill_length_cm | bill_depth_cm | flipper_length_cm | body_mass_kg |
|---|---|---|---|---|---|
| 169 | Gentoo | 4.92 | 1.52 | 22.1 | 6.30 |
| 185 | Gentoo | 5.96 | 1.70 | 23.0 | 6.05 |
| 269 | Gentoo | 4.88 | 1.62 | 22.2 | 6.00 |
| 229 | Gentoo | 5.11 | 1.63 | 22.0 | 6.00 |
| 263 | Gentoo | 4.98 | 1.59 | 22.9 | 5.95 |
Above, we can see that the heaviest penguins are all Gentoo penguins. They all come from Biscoe island, although this table cannot show it because the island column was dropped in the previous section.
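If only the top few rows are needed, plain pandas also offers nlargest, which avoids sorting and then calling head. The sketch below compares the two on made-up data. It is an alternative technique, not something arrange is known to use.

```python
import pandas as pd

# Made-up rows with distinct masses, so there are no ties to consider
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
        "body_mass_kg": [3.75, 6.3, 3.7, 6.05],
    }
)

# Two routes to the two heaviest rows
top_by_sort = toy.sort_values(by="body_mass_kg", ascending=False).head(2)
top_by_nlargest = toy.nlargest(2, "body_mass_kg")

print(top_by_sort.equals(top_by_nlargest))
```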
We can also sort by multiple columns with arrange(). Suppose we want to find the penguin with the shortest bill length, using bill depth to break ties:
penguins_small_bill = ttp.arrange(
penguins, True, "bill_length_cm", "bill_depth_cm")
penguins_small_bill.head(1)
| | species | bill_length_cm | bill_depth_cm | flipper_length_cm | body_mass_kg |
|---|---|---|---|---|---|
| 142 | Adelie | 3.21 | 1.55 | 18.8 | 3.05 |
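When sorting on two columns, the order of the column names matters. The sketch below uses sort_values directly on made-up data to show that the second column only breaks ties in the first. This assumes arrange passes its column names through to sort_values in the order given, which the tutorial suggests but does not state.

```python
import pandas as pd

# Made-up bills: rows 1 and 2 tie on length, row 3 has the smallest depth
toy = pd.DataFrame(
    {
        "bill_length_cm": [3.5, 3.2, 3.2, 4.0],
        "bill_depth_cm": [1.6, 1.9, 1.5, 1.4],
    }
)

# Primary key is bill_length_cm; bill_depth_cm only orders rows with equal length
by_length_then_depth = toy.sort_values(by=["bill_length_cm", "bill_depth_cm"])

print(by_length_then_depth.index.tolist())
```

Row 3 has the smallest bill depth but lands last, because its bill length is the largest.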
Putting it all together with pipe
We can chain all four functions in a single expression using the pipe method from pandas. It plays a role similar to the %>% operator in R’s tidyverse.
penguins = pd.read_csv("penguins.csv")
penguins_subset2 = (
penguins.pipe(ttp.filter, "~ sex.isnull()")
.pipe(ttp.mutate, "body_mass_kg = body_mass_g / 1000")
.pipe(ttp.select, "species", "island", "body_mass_kg")
.pipe(ttp.arrange, False, "body_mass_kg")
)
penguins_subset2.head()
| | species | island | body_mass_kg |
|---|---|---|---|
| 169 | Gentoo | Biscoe | 6.30 |
| 185 | Gentoo | Biscoe | 6.05 |
| 269 | Gentoo | Biscoe | 6.00 |
| 229 | Gentoo | Biscoe | 6.00 |
| 263 | Gentoo | Biscoe | 5.95 |
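The chain above works because pipe passes the DataFrame as the first argument of the function it is given, followed by any extra arguments. The sketch below demonstrates this with a hypothetical helper, keep_heavier_than, which is not part of pandas or tidyversetopandas, applied to made-up data.

```python
import pandas as pd


def keep_heavier_than(df: pd.DataFrame, column: str, threshold: float) -> pd.DataFrame:
    """Hypothetical helper that takes the DataFrame as its first argument."""
    return df[df[column] > threshold]


# Made-up rows for illustration
toy = pd.DataFrame({"species": ["Adelie", "Gentoo"], "body_mass_kg": [3.75, 6.3]})

# df.pipe(f, a, b) is equivalent to f(df, a, b)
piped = toy.pipe(keep_heavier_than, "body_mass_kg", 4.0)
direct = keep_heavier_than(toy, "body_mass_kg", 4.0)

print(piped.equals(direct))
```

Any function that accepts a DataFrame first and returns a DataFrame can take part in such a chain, which is the shape all four ttp functions have in this tutorial.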