{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 4: Pandas and visulization I." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will start using Pandas. Pandas is the standard way of working with tabular data. A great resource to learn more about Pandas is: http://pandas.pydata.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation and downloads" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# downloading the example dataset that we'll use for this class. \n", "# Pandas is already installed in colab by default as its very frequently used.\n", "!pip -q install palmerpenguins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is a library for working with tabular data. It was orignally based on the R data.frame library. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading and Displaying the Dataset\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from palmerpenguins import load_penguins\n", "df = load_penguins()\n", "\n", "print(\"Type of the dataset:\", type(df))\n", "print(\"\\nDataFrame using head command:\")\n", "display(df.head())\n", "print(\"\\nDataFrame using tail command:\")\n", "display(df.tail())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key Concepts:\n", "\n", "- A DataFrame is a 2-dimensional data structure in pandas. It's like a spreadsheet or a table with rows and columns.\n", "- Generally, \n", " - Columns represent features (variables we measure or observe)\n", " - Rows represent individual observations (in this case, each penguin)\n", "\n", "- In this dataset:\n", " - Each row represents a single penguin\n", " - Each column represents a feature (e.g., species, island, bill length)\n", "\n", "- Common functions for initial data exploration:\n", " - df.head() Shows the first 5 rows by default\n", " - df.info() Provides a concise summary of the DataFrame\n", " - df.describe() Generates descriptive statistics for numerical columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Whole DataFrame Exploration\n", "print(\"Summary statistics:\")\n", "display(df.describe())\n", "\n", "print(\"\\nData types of columns:\")\n", "display(df.dtypes)\n", "\n", "print(\"\\nColumn names:\")\n", "print(df.columns)\n", "\n", "print(\"\\nDataFrame info:\")\n", "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column Specific Operations\n", "\n", "# Using value counts to get the number of penguins for each species\n", "species_counts = df['species'].value_counts()\n", "print(\"Penguins per species:\")\n", "print(species_counts)\n", "\n", "# Get the mean bill length\n", "bill_length = df['bill_length_mm'].mean() # Can get the mode and sum\n", "print(f'\\nAverage bill length: {bill_length}')\n", "\n", "# Get the mean bill length per species\n", "species_bill_length = df.groupby('species')['bill_length_mm'].mean()\n", "print(\"\\nAverage bill length per species:\")\n", "print(species_bill_length)\n", "\n", "# Get the unique species\n", "unique_species = df['species'].unique()\n", "print(f\"\\nUnique species: {unique_species}\")\n", "\n", "# Get number of null values\n", "boolean_vec = df.isnull()\n", "print(f\"\\nBoolean_vec: {boolean_vec}\")\n", "sum = boolean_vec.sum()\n", "print(f\"\\nBoolean_vec sum: {sum}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data Cleaning\n", "\n", "# Examine original dataframe\n", "print(\"Orifinal dataframe:\")\n", "display(df)\n", "\n", "# Drop all rows with a null value\n", "print(\"\\nDataframe with non-null rows:\")\n", "display(df.dropna())\n", "\n", "# Drop all columns with a null value\n", "print(\"\\nDataframe with non-null columns:\")\n", "display(df.dropna(axis=1))\n", "\n", "# Replace all null values\n", "print(\"\\nDataframe with nulls replaced:\")\n", "display(df.fillna(1))\n", "\n", "# Sort df based on column of interest\n", "print(\"\\nDataframe sorted based on column:\")\n", "df_sorted = df.sort_values('bill_length_mm')\n", "display(df_sorted)\n", "\n", "# Resetting the index to the new order after sorting\n", "df_sorted_reset = df_sorted.reset_index(drop=True)\n", "\n", "print(\"\\nDataFrame with reset index:\")\n", "display(df_sorted_reset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reset Index Function Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inplace vs. Not Inplace\n", "\n", "tester = df_sorted.copy()\n", "# Resetting the index to the new order after sorting not inplace\n", "print(\"DataFrame with reset index not inplace:\")\n", "display(tester.reset_index(drop=True))\n", "\n", "print(\"\\nOriginal Dataframe:\")\n", "display(tester)\n", "\n", "# Resetting the index to the new order after sorting inplace\n", "print(\"DataFrame with reset index inplace:\")\n", "tester.reset_index(drop=True,inplace=True)\n", "display(tester)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Another common example dataset\n", "iris = sns.load_dataset(\"iris\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Excercises 1\n", "\n", "# 1. Display the first and then the last 15 rows in a dataframe \n", "\n", "# 2. Use any function to explore the new dataset\n", "\n", "# 3. Display only the sepal_length column in the format of a pandas dataframe\n", "\n", "# 4. Display only sepal_length column in the format of a pandas series and check the type\n", "\n", "# 5. Get the names of all the species and check the type\n", "\n", "# 6. Get the total number of null values in the entire dataFrame\n", "\n", "# 7. Replace all null values w/ the mean value of that column (note: operation only works w/ numeric columns)\n", "\n", "# 8. Sort the dataframe in decending order based on a petal_length and reset the index without writing over the original dataframe\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Several different ways to subset/query a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### By Column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Selecting Data from DataFrame\n", "\n", "# Understanding Pandas Series\n", "# 1. A Series is like a single column of data\n", "# 2. It has an index (labels for each item) but no column names\n", "# 3. A Series can have a name, which labels the whole Series\n", "\n", "print(\"Single column (as Series):\")\n", "series = df['bill_length_mm']\n", "display(series)\n", "\n", "print(\"\\nSeries index (labels for each item):\")\n", "print(series.index)\n", "\n", "print(\"\\nSeries name:\")\n", "print(series.name)\n", "\n", "# We can change the Series name\n", "series.name = \"Bill Length\"\n", "print(\"\\nUpdated Series name:\")\n", "print(series.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Selecting Data from DataFrame Pt. 2\n", "\n", "# When we use double brackets, we get a DataFrame instead\n", "print(\"Single column (as DataFrame):\")\n", "single_column_df = df[['bill_length_mm']]\n", "display(single_column_df)\n", "print(\"Type:\", type(single_column_df))\n", "\n", "# Note: A Series is like a single labeled column, while a DataFrame \n", "# is like a table with potentially many columns. The Series index is \n", "# like row labels, and its name can become a column name in a DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key Concept: Series vs DataFrame in Column Selection\n", "* Using double brackets [[]] selects columns and returns a DataFrame\n", "* Using single brackets [] for a single column returns a Series\n", "* A DataFrame is a 2D structure with both row and column labels\n", "* A Series is a 1D structure with only row labels (index)\n", "\n", "* Note: Some operations work on Series, others on DataFrames, so it's helpful to know which one you're using." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How can we visulize the distribution of this single variable better?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic Data Visualization\n", "plt.figure(figsize=(10, 6))\n", "sns.histplot(df['bill_length_mm'], kde=True, bins=20)\n", "plt.title('Distribution of Bill Length')\n", "plt.xlabel('Bill Length (mm)')\n", "plt.ylabel('Count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helpful resources for data visulization in python: \n", "- https://python-graph-gallery.com\n", "- https://github.com/cxli233/FriendsDontLetFriends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### By Row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data Selection\n", "# Select specific rows and columns using iloc (integer-location based indexing)\n", "print(\"Using iloc:\")\n", "print(df.iloc[2, 2]) # Value at 3rd row, 3rd column\n", "\n", "# Select rows using loc (label-based indexing)\n", "print(\"\\nUsing loc:\")\n", "display(df.loc[3]) # 4th row\n", "display(df.loc[0:3]) # First 4 rows\n", "\n", "# Query the DataFrame by condition\n", "print(\"\\nUsing query:\")\n", "display(df.query(\"year > 2008\"))\n", "\n", "# Query dataframe based on two conditions\n", "\n", "# Method 1\n", "print(\"\\nMethod 1 Dataframe:\")\n", "yr_vec = df['year'] > 2008\n", "island_vec = df['island'] == 'Biscoe'\n", "\n", "combine_vec = np.logical_and(yr_vec,island_vec)\n", "display(df[combine_vec])\n", "\n", "#Method 2\n", "print(\"\\nMethod 2 Dataframe:\")\n", "display(df[(yr_vec) & (island_vec)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Excercises 2\n", "\n", "display(iris.head())\n", "\n", "# 1. Get out the sepal_length column as a pandas series and rename the series\n", "\n", "# 2. Get the 5th row, 2nd column element using the iloc function\n", "\n", "# 3. Get the last 5 rows using the loc function and then iloc\n", "\n", "# 4. Visualize the distribution of the sepal length of the iris dataset\n", "\n", "# 5. Display only the rows that have a petal length greater than or equal to 1.5 and petal width greater than or equal to 0.5\n", "\n", "# 6. Display only the rows that meet either of two conditions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histograms allowed us to visualize the distribution of a singel feature, what about the relation between features? How do we decide our filtering condition with multiple features?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create jointplots to determine relationships between specific variables\n", "sns.jointplot(data=df, x=\"bill_length_mm\", y=\"bill_depth_mm\")\n", "plt.show()\n", "\n", "sns.jointplot(data=df, x=\"body_mass_g\", y=\"flipper_length_mm\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create pairplots to determine relationships between all numeric variables\n", "sns.pairplot(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Different ways of visualizing the same data can lead to different interpretations\n", "import matplotlib.pyplot as plt\n", "fig , ax = plt.subplots(figsize=(15,12), ncols=2,nrows=2)\n", "sns.swarmplot(data=df,x='species',y='body_mass_g',ax=ax[0,0],hue='species')\n", "sns.violinplot(data=df,x='species',y='body_mass_g',ax=ax[0,1])\n", "sns.boxplot(data=df,x='species',y='body_mass_g',ax=ax[1,0])\n", "sns.barplot(data=df,x='species',y='body_mass_g',ax=ax[1,1])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Excercise 3\n", "\n", "# 1. Plot relation petal length and petal width\n", "\n", "# 2. Plot relationship between all numeric values\n", "\n", "# 3. Use two different plots to visualize the data for petal length for only the setosa and virginica species" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### After editing our dataframes how can we save them in a way that allows us to read them back in without having to redo our analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving Pandas Dataframes\n", "\n", "df.to_csv('penguins.csv', index=False)\n", "\n", "df.to_excel('penguins.xlsx', index=False, sheet_name='Sheet1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reading Pandas Dataframes\n", "\n", "df_csv = pd.read_csv('penguins.csv')\n", "print(\"DataFrame read from CSV:\")\n", "display(df_csv)\n", "\n", "df_excel = pd.read_excel('penguins.xlsx', sheet_name='Sheet1')\n", "print(\"DataFrame read from Excel:\")\n", "display(df_excel)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Excercise 4\n", "\n", "# 1. Write the iris dataset to a csv\n", "\n", "# 2. Write the iris dataset to a excel\n", "\n", "# 3. Read in the iris dataset from the csv\n", "\n", "# 4. Read in the iris dataset from the excel\n", "\n", "# 5. Write to a file w semicolon as a seperator\n", "\n", "# 6. Read in the semicolon file\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.-1" } }, "nbformat": 4, "nbformat_minor": 4 }