{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Notebook 4: Pandas and visulization I."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we will start using Pandas. Pandas is the standard way of working with tabular data. A great resource to learn more about Pandas is: http://pandas.pydata.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation and downloads"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# downloading the example dataset that we'll use for this class. \n",
"# Pandas is already installed in colab by default as its very frequently used.\n",
"!pip -q install palmerpenguins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas is a library for working with tabular data. It was orignally based on the R data.frame library. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Loading and Displaying the Dataset\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"from palmerpenguins import load_penguins\n",
"df = load_penguins()\n",
"\n",
"print(\"Type of the dataset:\", type(df))\n",
"print(\"\\nDataFrame using head command:\")\n",
"display(df.head())\n",
"print(\"\\nDataFrame using tail command:\")\n",
"display(df.tail())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How can we visulize the distribution of this single variable better?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic Data Visualization\n",
"plt.figure(figsize=(10, 6))\n",
"sns.histplot(df['bill_length_mm'], kde=True, bins=20)\n",
"plt.title('Distribution of Bill Length')\n",
"plt.xlabel('Bill Length (mm)')\n",
"plt.ylabel('Count')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Helpful resources for data visulization in python: \n",
"- https://python-graph-gallery.com\n",
"- https://github.com/cxli233/FriendsDontLetFriends"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subset/query a dataframe by row"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data Selection\n",
"# Select specific rows and columns using iloc (integer-location based indexing)\n",
"print(\"Using iloc:\")\n",
"print(df.iloc[2, 2]) # Value at 3rd row, 3rd column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select rows using loc (label-based indexing)\n",
"print(\"Using loc:\")\n",
"print(\"\\nGetting 4th row:\")\n",
"display(df.loc[3]) #4th row\n",
"print(\"\\nGetting first 4 rows:\")\n",
"display(df.loc[0:3]) # Rows 0, 1, 2, and 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: key difference between loc and iloc is endpoint Inclusion\n",
"display(df.iloc[0:3]) #Rows 0, 1, and 2 (excluding 3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the DataFrame by condition\n",
"print(\"Using query:\")\n",
"display(df.query(\"year > 2008\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query dataframe based on two conditions: Method 1\n",
"print(\"Method 1 Dataframe:\")\n",
"yr_vec = df['year'] > 2008\n",
"island_vec = df['island'] == 'Biscoe'\n",
"\n",
"combine_vec = np.logical_and(yr_vec,island_vec)\n",
"display(df[combine_vec])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query dataframe based on two conditions: Method 2\n",
"print(\"Method 2 Dataframe:\")\n",
"display(df[(yr_vec) & (island_vec)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris = sns.load_dataset(\"iris\")\n",
"display(iris.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Excercises 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Get the 5th row, 2nd column element using the iloc function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. Get the last 5 rows using the loc function and then iloc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 3. Visualize the distribution of the sepal length of the iris dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 4. Display only the rows that have a petal length greater than or equal to 1.5 and petal width greater than or equal to 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 5. Display only the rows that meet either of two conditions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Histograms allowed us to visualize the distribution of a singel feature, what about the relation between features? How do we decide our filtering condition with multiple features?\n",
"\n",
"- Helpful plotting resources: \n",
" - https://seaborn.pydata.org/tutorial/function_overview.html\n",
" - https://www.geeksforgeeks.org/plotting-graph-using-seaborn-python/\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create jointplots to determine relationships between specific variables\n",
"sns.jointplot(data=df, x=\"bill_length_mm\", y=\"bill_depth_mm\")\n",
"plt.show()\n",
"\n",
"sns.jointplot(data=df, x=\"body_mass_g\", y=\"flipper_length_mm\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# What if we wanted to see how species impacts these variables?\n",
"\n",
"# Create jointplots to determine relationships between specific variables\n",
"sns.jointplot(data=df, x=\"bill_length_mm\", y=\"bill_depth_mm\",hue='species')\n",
"plt.show()\n",
"\n",
"sns.jointplot(data=df, x=\"body_mass_g\", y=\"flipper_length_mm\",hue='species')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Different ways of visualizing the same data can lead to different interpretations\n",
"fig , ax = plt.subplots(figsize=(15,12), ncols=2,nrows=2)\n",
"sns.swarmplot(data=df,x='species',y='body_mass_g',ax=ax[0,0],hue='species')\n",
"sns.violinplot(data=df,x='species',y='body_mass_g',ax=ax[0,1],hue='species')\n",
"sns.boxplot(data=df,x='species',y='body_mass_g',ax=ax[1,0],hue='species')\n",
"sns.barplot(data=df,x='species',y='body_mass_g',ax=ax[1,1],hue='species')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customizing Plots"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Change the palette to a default Blues palette\n",
"sns.boxplot(data=df,x='species',y='body_mass_g',hue='species',palette=\"Blues\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Change the palette to a custom palette\n",
"# Common colors: b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white\n",
"my_pal = {\"Adelie\": \"g\", \"Gentoo\": \"b\", \"Chinstrap\":\"m\"}\n",
"sns.boxplot(data=df,x='species',y='body_mass_g',hue='species',palette=my_pal)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Change axes style for each plot\n",
"print('With dark grid:')\n",
"with sns.axes_style(\"darkgrid\"):\n",
" sns.boxplot(data=df,x='species',y='body_mass_g',hue='species')\n",
" plt.show()\n",
"\n",
"print('\\nWith white grid:')\n",
"with sns.axes_style(\"whitegrid\"):\n",
" sns.boxplot(data=df,x='species',y='body_mass_g',hue='species')\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Change axes style for all subsequent plots\n",
"sns.set_style(\"whitegrid\") \n",
"\n",
"sns.swarmplot(data=df,x='species',y='body_mass_g',hue='species')\n",
"plt.show()\n",
"sns.violinplot(data=df,x='species',y='body_mass_g',hue='species')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reset sns to default settings\n",
"sns.reset_defaults()\n",
"\n",
"sns.swarmplot(data=df,x='species',y='body_mass_g',hue='species')\n",
"plt.show()\n",
"sns.violinplot(data=df,x='species',y='body_mass_g',hue='species')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Excercise 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Plot relation petal length and petal width"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. Use two different plots to visualize the data for petal length for only the setosa and virginica species"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### After editing our dataframes how can we save them in a way that allows us to read them back in without having to redo our analysis?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Saving Pandas Dataframes\n",
"\n",
"# To CSV\n",
"df.to_csv('penguins.csv', index=False)\n",
"\n",
"# To Excel\n",
"df.to_excel('penguins.xlsx', index=False, sheet_name='Sheet1')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reading Pandas Dataframes from CSV\n",
"\n",
"df_csv = pd.read_csv('penguins.csv')\n",
"print(\"DataFrame read from CSV:\")\n",
"display(df_csv)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reading Pandas Dataframes from Excel\n",
"\n",
"df_excel = pd.read_excel('penguins.xlsx', sheet_name='Sheet1')\n",
"print(\"DataFrame read from Excel:\")\n",
"display(df_excel)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Excercise 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Write the iris dataset to a csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. Write the iris dataset to a excel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 3. Read in the iris dataset from the csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 4. Read in the iris dataset from the excel\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 5. Write to a file w semicolon as a seperator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 6. Read in the semicolon file\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}