{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 4: Pandas and visulization I." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will start using Pandas. Pandas is the standard way of working with tabular data. A great resource to learn more about Pandas is: http://pandas.pydata.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation and downloads" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# downloading the example dataset that we'll use for this class. \n", "# Pandas is already installed in colab by default as its very frequently used.\n", "!pip -q install palmerpenguins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is a library for working with tabular data. It was orignally based on the R data.frame library. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading and Displaying the Dataset\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from palmerpenguins import load_penguins\n", "df = load_penguins()\n", "\n", "print(\"Type of the dataset:\", type(df))\n", "print(\"\\nDataFrame using head command:\")\n", "display(df.head())\n", "print(\"\\nDataFrame using tail command:\")\n", "display(df.tail())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How can we visulize the distribution of this single variable better?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic Data Visualization\n", "plt.figure(figsize=(10, 6))\n", "sns.histplot(df['bill_length_mm'], kde=True, bins=20)\n", "plt.title('Distribution of Bill Length')\n", "plt.xlabel('Bill Length (mm)')\n", "plt.ylabel('Count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helpful resources for data visulization in python: \n", "- https://python-graph-gallery.com\n", "- https://github.com/cxli233/FriendsDontLetFriends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subset/query a dataframe by row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data Selection\n", "# Select specific rows and columns using iloc (integer-location based indexing)\n", "print(\"Using iloc:\")\n", "print(df.iloc[2, 2]) # Value at 3rd row, 3rd column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select rows using loc (label-based indexing)\n", "print(\"Using loc:\")\n", "print(\"\\nGetting 4th row:\")\n", "display(df.loc[3]) #4th row\n", "print(\"\\nGetting first 4 rows:\")\n", "display(df.loc[0:3]) # Rows 0, 1, 2, and 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: key difference between loc and iloc is endpoint Inclusion\n", "display(df.iloc[0:3]) #Rows 0, 1, and 2 (excluding 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Query the DataFrame by condition\n", "print(\"Using query:\")\n", "display(df.query(\"year > 2008\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Query dataframe based on two conditions: Method 1\n", "print(\"Method 1 Dataframe:\")\n", "yr_vec = df['year'] > 2008\n", "island_vec = df['island'] == 'Biscoe'\n", "\n", "combine_vec = np.logical_and(yr_vec,island_vec)\n", "display(df[combine_vec])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Query dataframe based on two conditions: Method 2\n", "print(\"Method 2 Dataframe:\")\n", "display(df[(yr_vec) & (island_vec)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris = sns.load_dataset(\"iris\")\n", "display(iris.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excercises 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Get the 5th row, 2nd column element using the iloc function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 2. Get the last 5 rows using the loc function and then iloc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 3. Visualize the distribution of the sepal length of the iris dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 4. Display only the rows that have a petal length greater than or equal to 1.5 and petal width greater than or equal to 0.5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5. Display only the rows that meet either of two conditions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histograms allowed us to visualize the distribution of a singel feature, what about the relation between features? How do we decide our filtering condition with multiple features?\n", "\n", "- Helpful plotting resources: \n", " - https://seaborn.pydata.org/tutorial/function_overview.html\n", " - https://www.geeksforgeeks.org/plotting-graph-using-seaborn-python/\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create jointplots to determine relationships between specific variables\n", "sns.jointplot(data=df, x=\"bill_length_mm\", y=\"bill_depth_mm\")\n", "plt.show()\n", "\n", "sns.jointplot(data=df, x=\"body_mass_g\", y=\"flipper_length_mm\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What if we wanted to see how species impacts these variables?\n", "\n", "# Create jointplots to determine relationships between specific variables\n", "sns.jointplot(data=df, x=\"bill_length_mm\", y=\"bill_depth_mm\",hue='species')\n", "plt.show()\n", "\n", "sns.jointplot(data=df, x=\"body_mass_g\", y=\"flipper_length_mm\",hue='species')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Different ways of visualizing the same data can lead to different interpretations\n", "fig , ax = plt.subplots(figsize=(15,12), ncols=2,nrows=2)\n", "sns.swarmplot(data=df,x='species',y='body_mass_g',ax=ax[0,0],hue='species')\n", "sns.violinplot(data=df,x='species',y='body_mass_g',ax=ax[0,1],hue='species')\n", "sns.boxplot(data=df,x='species',y='body_mass_g',ax=ax[1,0],hue='species')\n", "sns.barplot(data=df,x='species',y='body_mass_g',ax=ax[1,1],hue='species')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Customizing Plots" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Change the palette to a default Blues palette\n", "sns.boxplot(data=df,x='species',y='body_mass_g',hue='species',palette=\"Blues\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Change the palette to a custom palette\n", "# Common colors: b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white\n", "my_pal = {\"Adelie\": \"g\", \"Gentoo\": \"b\", \"Chinstrap\":\"m\"}\n", "sns.boxplot(data=df,x='species',y='body_mass_g',hue='species',palette=my_pal)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Change axes style for each plot\n", "print('With dark grid:')\n", "with sns.axes_style(\"darkgrid\"):\n", " sns.boxplot(data=df,x='species',y='body_mass_g',hue='species')\n", " plt.show()\n", "\n", "print('\\nWith white grid:')\n", "with sns.axes_style(\"whitegrid\"):\n", " sns.boxplot(data=df,x='species',y='body_mass_g',hue='species')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Change axes style for all subsequent plots\n", "sns.set_style(\"whitegrid\") \n", "\n", "sns.swarmplot(data=df,x='species',y='body_mass_g',hue='species')\n", "plt.show()\n", "sns.violinplot(data=df,x='species',y='body_mass_g',hue='species')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reset sns to default settings\n", "sns.reset_defaults()\n", "\n", "sns.swarmplot(data=df,x='species',y='body_mass_g',hue='species')\n", "plt.show()\n", "sns.violinplot(data=df,x='species',y='body_mass_g',hue='species')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excercise 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Plot relation petal length and petal width" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 2. Use two different plots to visualize the data for petal length for only the setosa and virginica species" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### After editing our dataframes how can we save them in a way that allows us to read them back in without having to redo our analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving Pandas Dataframes\n", "\n", "# To CSV\n", "df.to_csv('penguins.csv', index=False)\n", "\n", "# To Excel\n", "df.to_excel('penguins.xlsx', index=False, sheet_name='Sheet1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reading Pandas Dataframes from CSV\n", "\n", "df_csv = pd.read_csv('penguins.csv')\n", "print(\"DataFrame read from CSV:\")\n", "display(df_csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reading Pandas Dataframes from Excel\n", "\n", "df_excel = pd.read_excel('penguins.xlsx', sheet_name='Sheet1')\n", "print(\"DataFrame read from Excel:\")\n", "display(df_excel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Excercise 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Write the iris dataset to a csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 2. Write the iris dataset to a excel" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 3. Read in the iris dataset from the csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 4. Read in the iris dataset from the excel\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 5. Write to a file w semicolon as a seperator" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 6. Read in the semicolon file\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }