Tutorial: Palmer Penguins Dataset

In this tutorial, we will explore the Palmer Penguins dataset using the tidyversetopandas package. This package simplifies data manipulation in Python by bringing the style of R’s tidyverse functions to pandas. We’ll demonstrate how to use its key functions: select, mutate, filter, and arrange.

Loading the Palmer Penguins Dataset

The Palmer Penguins dataset includes various measurements from three penguin species. It’s ideal for demonstrating data manipulation techniques.

First, let’s load the dataset into a pandas DataFrame:

# Load Penguins dataset
import pandas as pd
from tidyversetopandas import tidyversetopandas as ttp

penguins = pd.read_csv("penguins.csv")
penguins.head()

	rowid	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	1	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	2	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	3	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	4	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	5	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

Removing NAs with `filter`

Let’s start by counting the missing (NA) values in each column.

penguins.isna().sum()

rowid                 0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

There are 11 missing values in sex. Let’s remove those rows using the filter function from ttp together with the built-in isnull method from pandas.

newPenguins = ttp.filter(penguins, "~ sex.isnull()")
newPenguins.isna().sum()

rowid                0
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
dtype: int64

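Looks great! We successfully removed all NAs from sex. The rows that were missing measurements were also missing sex, so the other columns no longer contain NAs either.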
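
For comparison, the same NA removal can be sketched in plain pandas. This is not a description of how ttp.filter works internally; the toy DataFrame below is made-up data standing in for penguins.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for penguins.csv (values are illustrative only).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, np.nan, 5000.0, 4800.0],
        "sex": ["male", None, "female", None],
    }
)

# Keep only the rows where sex is present.
with_sex = toy[toy["sex"].notna()]

# dropna restricted to the sex column keeps the same rows.
with_sex_alt = toy.dropna(subset=["sex"])

print(with_sex.isna().sum())
```

Both forms keep the original index labels, so the surviving rows can still be traced back to the source data.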

Filtering species and size with `filter`

Next, let’s limit our study to “Adelie” penguins with a body_mass_g greater than 3000 grams. The filter function is perfect for this job. We start with 333 rows and should see the number of rows decrease after filtering.

print(newPenguins.shape)
newPenguins = ttp.filter(
    newPenguins, "species == 'Adelie' & body_mass_g > 3000")
print(newPenguins.shape)

(333, 9)
(138, 9)

The number of rows did drop to 138, which is a good sign. Let’s check the DataFrame to make sure that only “Adelie” penguins remain and that every body mass is larger than 3000 grams.

print(newPenguins.species.unique())
print(newPenguins.body_mass_g.min())

['Adelie']
3050.0

We did it again! “Adelie” is the only species left, and the lightest remaining penguin weighs 3050 grams.
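
As a rough point of comparison, the same two-condition filter can be written in plain pandas, either with boolean masks or with DataFrame.query. This is a sketch on made-up toy data and makes no claim about how ttp.filter is implemented.

```python
import pandas as pd

# Toy data standing in for the penguins DataFrame (values are illustrative only).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [2900.0, 3450.0, 5000.0, 3700.0],
    }
)

# With boolean masks, each comparison needs its own parentheses,
# because & binds more tightly than == and > in plain Python.
mask = (toy["species"] == "Adelie") & (toy["body_mass_g"] > 3000)
adelie_heavy = toy[mask]

# DataFrame.query accepts the condition as a string instead.
adelie_heavy_q = toy.query("species == 'Adelie' and body_mass_g > 3000")

print(adelie_heavy)
```

Only the second row satisfies both conditions, so both results contain a single Adelie penguin.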

Creating new columns with `mutate`

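Now, let’s create a new column called body_mass_kg that converts body_mass_g to kilograms. We can do this with the mutate function. Note that we go back to the original penguins DataFrame here, so the rows with missing values are still present.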

penguins = ttp.mutate(penguins, "body_mass_kg = body_mass_g / 1000")

penguins.head()

	rowid	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	body_mass_kg
0	1	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007	3.75
1	2	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007	3.80
2	3	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007	3.25
3	4	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007	NaN
4	5	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007	3.45

The new body_mass_kg column now appears as the rightmost column.

Converting all lengths to cm

Now, let us convert the bill_length_mm, bill_depth_mm, and flipper_length_mm columns to centimeters. We can do this with the mutate function as well.

penguins = ttp.mutate(penguins, "bill_length_cm = bill_length_mm / 10")
penguins = ttp.mutate(penguins, "bill_depth_cm = bill_depth_mm / 10")
penguins = ttp.mutate(penguins, "flipper_length_cm = flipper_length_mm / 10")

penguins.head()

	rowid	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year	body_mass_kg	bill_length_cm	bill_depth_cm	flipper_length_cm
0	1	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007	3.75	3.91	1.87	18.1
1	2	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007	3.80	3.95	1.74	18.6
2	3	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007	3.25	4.03	1.80	19.5
3	4	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007	NaN	NaN	NaN	NaN
4	5	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007	3.45	3.67	1.93	19.3
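
The three conversions above follow the same pattern, so in plain pandas they could also be generated in one step with DataFrame.assign. This is a sketch on made-up toy data, not part of the ttp API; the name mm_columns is introduced here only for illustration.

```python
import pandas as pd

# Toy data with the three millimeter columns (values are illustrative only).
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, 39.5],
        "bill_depth_mm": [18.7, 17.4],
        "flipper_length_mm": [181.0, 186.0],
    }
)

# Hypothetical helper list: the columns to convert from mm to cm.
mm_columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]

# Build {new_name: converted_values} and add all columns at once.
new_cols = {col.replace("_mm", "_cm"): toy[col] / 10 for col in mm_columns}
converted = toy.assign(**new_cols)

print(converted.columns.tolist())
```

assign returns a new DataFrame and leaves toy unchanged, which matches the reassignment style used with mutate above.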

Now we see that we have all these new columns in our dataframe, but some are redundant. Let’s use the select function to remove the old columns.

Selecting Columns with `select`

The select function in the tidyversetopandas package is a powerful tool designed to bring the simplicity and intuitiveness of R’s tidyverse to Python’s pandas library. This function specifically mirrors the functionality of dplyr’s select in R, allowing users to easily choose specific columns from a DataFrame for focused analysis.

In practical terms, select lets users streamline their datasets by including only the columns that are relevant to their current analysis, thereby simplifying the data manipulation process. By using select, Python users can enjoy a more R-like syntax for data manipulation, making the transition between R and Python smoother and more intuitive.

Example 1: Selecting One Column

Let’s try selecting one column:

# Selecting species
penguins_subset = ttp.select(penguins, "species")
penguins_subset.head()

	species
0	Adelie
1	Adelie
2	Adelie
3	Adelie
4	Adelie

The output is a subset of the original penguins DataFrame containing only the species column.

Example 2: Selecting Multiple Columns for Comparative Analysis

For a more detailed comparative analysis, let’s select columns that would provide insight into the physical characteristics of the penguins. We’ll choose species, bill length, bill depth, and body mass.

# Selecting species, bill_length_mm, bill_depth_mm, and body_mass_g columns
penguins_physical = ttp.select(
    penguins, "species", "bill_length_mm", "bill_depth_mm", "body_mass_g"
)
penguins_physical.head()

	species	bill_length_mm	bill_depth_mm	body_mass_g
0	Adelie	39.1	18.7	3750.0
1	Adelie	39.5	17.4	3800.0
2	Adelie	40.3	18.0	3250.0
3	Adelie	NaN	NaN	NaN
4	Adelie	36.7	19.3	3450.0

Here, the output is a DataFrame that includes a different set of columns: species, bill_length_mm, bill_depth_mm, and body_mass_g. This subset is intended for a more detailed comparative analysis, focusing on the physical characteristics of the penguins, such as bill length, bill depth, and body mass.

Example 3: Only selecting the columns we want to keep after `mutate`

Earlier in the mutate section, we created a few new columns. Let’s use select to remove the old columns.

penguins = ttp.select(
    penguins,
    "species",
    "bill_length_cm",
    "bill_depth_cm",
    "flipper_length_cm",
    "body_mass_kg",
)

penguins.head()

	species	bill_length_cm	bill_depth_cm	flipper_length_cm	body_mass_kg
0	Adelie	3.91	1.87	18.1	3.75
1	Adelie	3.95	1.74	18.6	3.80
2	Adelie	4.03	1.80	19.5	3.25
3	Adelie	NaN	NaN	NaN	NaN
4	Adelie	3.67	1.93	19.3	3.45

Now our penguins DataFrame has only the columns we want to keep.
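
For readers coming from pandas, column selection can be sketched with list indexing or with drop. The example below uses made-up toy data and plain pandas only; it is not a statement about how ttp.select is implemented.

```python
import pandas as pd

# Toy data with one redundant column (values are illustrative only).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "body_mass_g": [3750.0, 5000.0],
        "body_mass_kg": [3.75, 5.0],
    }
)

# Keep columns by name, in the order listed.
keep = ["species", "body_mass_kg"]
selected = toy[keep]

# Alternatively, name the columns to discard.
dropped = toy.drop(columns=["body_mass_g"])

print(selected)
```

Listing the columns to keep is usually safer when new columns may be added later, while drop is shorter when only one or two columns need to go.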

Sorting the data using `arrange`

In pandas, sorting is done with the DataFrame method sort_values(). In R’s tidyverse, the equivalent function is named arrange(). It takes the whole dataset as its first argument, followed by the names of the columns to sort on. To make sorting in Python friendlier to people who are used to working with R and the tidyverse, we wrapped the pandas method in a function that follows the structure of arrange().

To get a better understanding of how it works, we can sort the Palmer Penguins dataset to find the heaviest penguins. In the call below, False as the second argument requests descending order.

penguins_sorted = ttp.arrange(penguins, False, "body_mass_kg")
penguins_sorted.head()

	species	bill_length_cm	bill_depth_cm	flipper_length_cm	body_mass_kg
169	Gentoo	4.92	1.52	22.1	6.30
185	Gentoo	5.96	1.70	23.0	6.05
269	Gentoo	4.88	1.62	22.2	6.00
229	Gentoo	5.11	1.63	22.0	6.00
263	Gentoo	4.98	1.59	22.9	5.95

Above, we can see that the heaviest penguins are all Gentoo penguins. The island column was dropped by the earlier select, so it does not appear in this output.

We can also sort the data on multiple columns using the arrange() function. Suppose we want to find the penguin with the shortest bill length, using bill depth to break ties:

penguins_small_bill = ttp.arrange(
    penguins, True, "bill_length_cm", "bill_depth_cm")
penguins_small_bill.head(1)

	species	bill_length_cm	bill_depth_cm	flipper_length_cm	body_mass_kg
142	Adelie	3.21	1.55	18.8	3.05
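
The multi-column sort can be sketched directly with sort_values, the pandas method named above. The toy data is made up, and the sketch does not claim to reproduce the internals of ttp.arrange.

```python
import pandas as pd

# Toy data with a tie on bill_length_cm (values are illustrative only).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
        "bill_length_cm": [3.67, 4.92, 3.21, 3.21],
        "bill_depth_cm": [1.93, 1.52, 1.55, 1.40],
    }
)

# Sort by bill length first; bill depth only breaks ties.
by_bill = toy.sort_values(by=["bill_length_cm", "bill_depth_cm"], ascending=True)

# Descending order puts the longest bill first.
longest_first = toy.sort_values(by="bill_length_cm", ascending=False)

print(by_bill)
```

sort_values keeps the original index labels, which is why the sorted outputs in this section show labels such as 169 or 142 rather than 0, 1, 2.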

Putting it all together with `pipe`

We can chain all four functions in a single expression using the pandas pipe method, which passes the DataFrame as the first argument to each function. It plays a role similar to the %>% operator in R’s tidyverse.

penguins = pd.read_csv("penguins.csv")

penguins_subset2 = (
    penguins.pipe(ttp.filter, "~ sex.isnull()")
    .pipe(ttp.mutate, "body_mass_kg = body_mass_g / 1000")
    .pipe(ttp.select, "species", "island", "body_mass_kg")
    .pipe(ttp.arrange, False, "body_mass_kg")
)

penguins_subset2.head()

	species	island	body_mass_kg
169	Gentoo	Biscoe	6.30
185	Gentoo	Biscoe	6.05
269	Gentoo	Biscoe	6.00
229	Gentoo	Biscoe	6.00
263	Gentoo	Biscoe	5.95
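
To see what pipe does on its own, here is a minimal sketch using two small functions written just for this example. The names drop_missing and add_kg are hypothetical and are not part of ttp or pandas; the data is made up.

```python
import pandas as pd

# Toy data (values are illustrative only).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, 5000.0, 4800.0],
        "sex": ["male", None, "female"],
    }
)


def drop_missing(df, column):
    """Hypothetical helper: keep rows where the given column is present."""
    return df[df[column].notna()]


def add_kg(df):
    """Hypothetical helper: add body mass in kilograms."""
    return df.assign(body_mass_kg=df["body_mass_g"] / 1000)


# pipe passes the DataFrame as the first argument of each function.
piped = toy.pipe(drop_missing, "sex").pipe(add_kg)

# The equivalent nested calls read inside-out instead of top-to-bottom.
nested = add_kg(drop_missing(toy, "sex"))

print(piped)
```

Any function that takes a DataFrame as its first argument and returns a DataFrame can be chained this way, which is why the ttp functions fit into pipe without extra glue.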
