Files To Get

  1. Use wget or drag-and-drop the _notebooks/CSP/big-ideas/big-idea-2 folder for this and other ipynb on pandas.

  2. Use wget or drag-and-drop, in a subfolder named data in your _notebookx to grab data files.

  • data.csv
  • grade.json
  1. Use wget or drag-and-drop, then copy image file and place into subfolder named data_structures in your images folder. Grab the entire folder.

Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. “the ability to process data depends on users capabilities and their tools”
    • Combining Data. “combine county data sets”
    • Status on Data”determining the artist with the greatest attendance during a particular month”
    • Data poses challenge. “the need to clean data”, “incomplete data”
  • From Pandas Overview – When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

  • DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to:
    • a spreadsheet
    • an SQL table
    • a JSON object with rows [] with nexted key-values {}

DataFrame

# uncomment the following line to install the pandas library
# !pip install pandas 

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned

# Read the JSON file and convert it to a Pandas DataFrame 
# pd.read_json:  a method that reads a JSON and converts it to a DataFrame (df)
# df: a variable that holds the DataFrame
df = pd.read_json('data/grade.json')

# Print the DataFrame
print(df)

# Additional print statements to understand the DataFrame:
# print(df.info()) # prints a summary of the DataFrame, simmilar to database schema
# print(df.describe()) # prints statistics of the DataFrame
# print(df.head()) # prints the first 5 rows of the DataFrame
# print(df.tail()) # prints the last 5 rows of the DataFrame
# print(df.columns) # prints the columns of the DataFrame
# print(df.index) # prints the index of the DataFrame

# Questions:
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  
# Could you hav Garbage in, Garbage out problem if you don't clean the data?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75

Extracting Info

Take a look at some features that the Pandas library has that extracts info from the dataset

DataFrame Extract Column

#print the values in the points column with column header
print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

DataFrame Sort

#sort values
print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
10        469             11  3.45
3         469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
10        469             11  3.45
3         469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

DataFrame Selection or Filter

#print only values with a specific criteria 
print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
dict = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
print("-------------Dictionary------------------")
print(dict)

#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(dict)
print(df)

print("----------Dict_to_DF_labels--------------")
#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["day1", "day2", "day3"])
print(df)
-------------Dictionary------------------
{'calories': [420, 380, 390], 'duration': [50, 40, 45]}
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64

Pandas DataFrame Information

#print info about the data set
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, day1 to day3
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  3 non-null      int64
 1   duration  3 non-null      int64
dtypes: int64(2)
memory usage: 180.0+ bytes
None

Example of larger data set

Pandas can read CSV and many other types of files, run the following code to see more features with a larger data set

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('data/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))

--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
60        210    108       160    1376.0
109       210    137       184    1860.4
90        180    101       127     600.1
65        180     90       130     800.4
106       180     90       120     800.3
61        160    110       137    1034.4
62        160    109       135     853.0
70        150     97       129    1115.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
64         20    110       130     131.4
68         20    106       136     110.4
89         20     83       107      50.3
100        20     95       112      77.7
95         20    151       168     229.4
94         20    150       171     127.4
139        20    141       162     222.4
135        20    136       156     189.0
93         15     80       100      50.5
112        15    124       139     124.2

APIs are a Source for Panda Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and resulting json can be placed into a Data Frame
  • Observe output, this looks very similar to a Database
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://devops.nighthawkcodingsociety.com/api/users/"
    fetch = requests.get(url)
    json = fetch.json()

    # filter data for requirement
    df = pd.DataFrame(json)
 
    # Check if 'active_classes' column exists in the DataFrame
    if 'active_classes' in df.columns:
        # Split the 'active_classes' strings into lists of class names and expand the lists into separate rows
        classes_series = df['active_classes'].str.split(',').explode()

        # Count the unique class names and print the counts
        print(classes_series.str.strip().value_counts())
    else:
        print("Column 'active_classes' does not exist in the DataFrame")

fetch()
active_classes
APCSP    160
APCSA     62
CSSE      60
          20
Name: count, dtype: int64
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://devops.nighthawkcodingsociety.com/api/users/"
    fetch = requests.get(url)
    json = fetch.json()

    # filter data for requirement
    df = pd.DataFrame(json)
    
    # Check if 'active_classes' column exists in the DataFrame
    if 'active_classes' in df.columns:
        # Split the 'active_classes' strings into lists of class names
        df['active_classes'] = df['active_classes'].str.split(',')

        # Get a list of unique class names by using a set comprehension
        unique_classes = pd.Series([unique_class.strip() for class_list in df['active_classes'] for unique_class in class_list]).unique()
                                    
        # Iterate over the each class name
        for current_class in unique_classes:
            # Filter the DataFrame for students in the current class using a lambda function
            class_df = df[df['active_classes'].apply(lambda classes: current_class in classes)]

            # Select the desired data frame column
            students = class_df[['active_classes','id', 'first_name', 'last_name']]

            # Print the list of students in the current class
            print(students.sort_values(by='last_name').head()) # avoids jupyter notebook truncation, remove .head() to print all students
            print()
    else:
        print("Column 'active_classes' does not exist in the DataFrame")

fetch()
    active_classes   id first_name last_name
60         [APCSA]   86     Aditya          
33         [APCSA]   55       Finn          
30         [APCSA]   52    [Edwin]   Abraham
247        [APCSA]  316   [Vishnu]   Aravind
117        [APCSA]  161  [Anthony]  Bazhenov

    active_classes   id first_name last_name
298        [APCSP]  369       Test          
94         [APCSP]  134      Cindy          
296        [APCSP]  367   testUser          
12         [APCSP]   29     Saaras          
150        [APCSP]  199      Gavin          

    active_classes   id           first_name last_name
263             []  334                 Pele          
254             []  325                 Pele          
161             []  212              Varnika          
193             []  246       [Alyssa-Allen]    Abrams
258             []  329  [Alexander, Graham]      Bell

    active_classes   id first_name   last_name
286         [CSSE]  357     Amelia            
205         [CSSE]  260    Gabriel            
265         [CSSE]  336     Yoseph            
211         [CSSE]  267      Timur            
91          [CSSE]  130   [Maryam]  Abdul-Aziz

Hacks

Early Seed award. Don’t tell anyone. Show to Teacher.

  • Add this Blog to you own Blogging site.
  • Have all lecture files saved to your files directory before Tech Talk starts.
  • Add this Blog to you own Blogging site. In the Blog add notes and observations on each code cell.

The next 6 weeks, the Teachers want you to improve your understanding of data structures and data science. Your intention is to find some things to differentiate your individual College Board project, particularly if your project looks like all other projects.

  • Look at this blog and others on data structures for todays date.
  • Create or Find your own dataset. The suggestion is to use a JSON file, integrating with your CPT/PBL project would be Amazing.
  • Build frontend to backend to filter or use your data set in your CPT/PBL.
  • When choosing a data set, think about the following…
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?

Early Seed Award: Notes and Observations on Each Code Cell


1. Importing Pandas Library

  • Code:
    import pandas as pd
    
  • Observation:
    • The import pandas as pd line imports the pandas library, a powerful tool in Python for data manipulation and analysis. By importing it with the alias pd, it makes it easier to reference and use throughout the code.
  • Note:
    • Pandas is essential for handling tabular data, such as CSV or Excel files, and helps to simplify data processing.

2. Reading a JSON File into a DataFrame

  • Code:
    df = pd.read_json('data/grade.json')
    print(df)
    
  • Observation:
    • This code reads a JSON file (grade.json) and converts it into a DataFrame. A DataFrame is essentially a table of data, making it easier to work with and analyze.
  • Note:
    • JSON is a common data format, and pandas simplifies converting it into a tabular form for analysis. This is useful when working with APIs or data sources that return JSON.

3. Exploring DataFrame Information

  • Code:
    print(df.info())  # Prints a summary of the DataFrame
    print(df.describe())  # Prints statistics of the DataFrame
    print(df.head())  # Prints the first 5 rows of the DataFrame
    print(df.tail())  # Prints the last 5 rows of the DataFrame
    print(df.columns)  # Prints the columns of the DataFrame
    print(df.index)  # Prints the index of the DataFrame
    
  • Observation:
    • These commands provide an overview of the dataset:
      • info() shows a concise summary of the DataFrame, including column types and missing values.
      • describe() provides statistical details (mean, standard deviation, etc.) for numeric columns.
      • head() and tail() display the first and last 5 rows to get an idea of the data.
      • columns and index show the column names and index labels of the DataFrame.
  • Note:
    • These functions are helpful for exploratory data analysis (EDA), enabling you to quickly understand the structure of your data and identify any missing or problematic values.

4. Extracting Specific Columns

  • Code:
    print(df[['GPA']])
    print(df[['Student ID','GPA']].to_string(index=False))
    
  • Observation:
    • The first line extracts only the GPA column.
    • The second line extracts both the Student ID and GPA columns and removes the index from the output using .to_string(index=False).
  • Note:
    • Extracting specific columns is a common task when you want to focus on certain features of your dataset. In this case, it’s useful when analyzing specific attributes like student performance (GPA).

5. Sorting Values

  • Code:
    print(df.sort_values(by=['GPA']))
    print(df.sort_values(by=['GPA'], ascending=False))
    
  • Observation:
    • This code sorts the DataFrame by the GPA column in both ascending and descending order.
  • Note:
    • Sorting is a crucial operation in data analysis, helping you organize data and make comparisons (e.g., finding students with the highest or lowest GPA).

6. Filtering Data Based on Conditions

  • Code:
    print(df[df.GPA > 3.00])
    
  • Observation:
    • This filters the DataFrame to display students with a GPA greater than 3.00.
  • Note:
    • Filtering data allows you to extract relevant subsets based on specific criteria, which is common when analyzing students’ performance or other attributes.

7. Finding Maximum and Minimum Values

  • Code:
    print(df[df.GPA == df.GPA.max()])
    print(df[df.GPA == df.GPA.min()])
    
  • Observation:
    • The first line finds the student with the highest GPA (df.GPA.max()), and the second line finds the student with the lowest GPA (df.GPA.min()).
  • Note:
    • This technique helps identify extreme values, such as top performers or those needing extra support. It’s useful in performance analysis and decision-making.

8. Creating a DataFrame from a Dictionary

  • Code:
    dict = {
      "calories": [420, 380, 390],
      "duration": [50, 40, 45]
    }
    df = pd.DataFrame(dict)
    print(df)
    
  • Observation:
    • This code creates a DataFrame from a Python dictionary, where keys represent column names and values represent the data in those columns.
  • Note:
    • Pandas can easily convert data from different formats, like dictionaries or lists, into DataFrames, making it highly versatile for working with raw data.

9. Using Custom Index Labels for Rows

  • Code:
    df = pd.DataFrame(dict, index=["day1", "day2", "day3"])
    print(df)
    
  • Observation:
    • This adds custom index labels (day1, day2, day3) to the rows of the DataFrame.
  • Note:
    • Custom indices improve readability and make data more meaningful, especially in time-series data or data where rows represent distinct categories (e.g., days, events).

10. Selecting Specific Rows Using .loc[]

  • Code:
    print(df.loc[["day1", "day3"]])
    print(df.loc["day1"])
    
  • Observation:
    • The first line selects multiple rows based on index labels (day1 and day3), while the second line selects a single row by its index label (day1).
  • Note:
    • The .loc[] indexer allows for flexible row selection based on labels. It’s particularly useful for working with labeled data where specific entries are required.

11. Reading CSV Files and Sorting Data

  • Code:
    df = pd.read_csv('data/data.csv').sort_values(by=['Duration'], ascending=False)
    print("--Duration Top 10---------")
    print(df.head(10))
    print("--Duration Bottom 10------")
    print(df.tail(10))
    
  • Observation:
    • This reads data from a CSV file (data.csv) and sorts it by the Duration column. It then prints the top 10 and bottom 10 rows based on that sorting.
  • Note:
    • Reading and sorting CSV data is a common use case in data analysis, as CSV files are a popular data storage format. Sorting by a specific column helps to organize data based on significance (e.g., event duration).

12. Fetching Data from an API

  • Code:
    def fetch():
        url = "https://devops.nighthawkcodingsociety.com/api/users/"
        fetch = requests.get(url)
        json = fetch.json()
        df = pd.DataFrame(json)
        # Further processing...
    fetch()
    
  • Observation:
    • This function fetches data from an external API and converts the returned JSON data into a DataFrame.
  • Note:
    • APIs are an excellent source for real-time data. Pandas makes it easy to fetch data from APIs and process it for analysis. This approach is commonly used for web scraping or working with live data sources.