4  Exploratory Data Analysis in Python

5 Overview

There’s no data science without data. And the “learning” in ML is based on the data we observe. The field of statistics is all about the mathematics of data. Over the past century, the field engaged in a rigorous investigation of the origin, nature, transformation, and prediction of data from a myriad of angles. That legacy feeds into the computer age and forms the theoretical foundation upon which data science and machine learning have pushed the boundaries of knowledge. What statistics could not do effectively is extend itself beyond tabular data composed of quantitative and qualitative variables to new data types. With help from computer science, analysis of these new kinds of data became practical.

Find a video and a webpage tutorial on one of the topics below. Basics: read in the data and produce one basic visualization. Email me which topic from the list below you want to do.

6 Tabular Data

6.1 Tabular Data in Python

Let’s take a look at how the above is performed in Python. We install and load the reticulate package to run Python code in RStudio.

Website
https://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html
Youtube
https://youtu.be/5_QXMwezPJE

The following line imports the pandas library:

import pandas as pd

The following code loads the data from an online repository. If the CSV file were saved on the computer instead, we would use pd.read_csv('directory') with the path to the file.
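As a sketch of the local-file case, pd.read_csv can be demonstrated on an inline string standing in for a saved file (the contents below are made up for illustration). Note that read_csv defaults to comma-separated input, while read_table defaults to tab-separated input:

```python
import io
import pandas as pd

# A small inline CSV standing in for a file saved on disk
# (the contents are hypothetical, for illustration only).
csv_text = "order_id,quantity,item_name\n1,1,Chips\n2,2,Chicken Bowl\n"

# read_csv parses comma-separated values by default.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```

For a real file, replace the StringIO object with the file's path.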


orders=pd.read_table("http://bit.ly/chiporders")

The above data gives the orders for a Chipotle store. We can get the head of the data just to have a general idea of how the data is structured:

orders.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
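Before doing arithmetic on the orders, note that item_price is read in as a string because of the leading "$". A minimal sketch of the cleanup, on a tiny inline sample copied from the head() output above (not the full dataset):

```python
import pandas as pd

# Tiny inline sample mimicking the orders data (values taken from
# the head() output above; illustrative, not the full dataset).
orders = pd.DataFrame({
    "item_name": ["Chips and Fresh Tomato Salsa", "Izze", "Chicken Bowl"],
    "item_price": ["$2.39", "$3.39", "$16.98"],
})

# Strip the "$" and convert the prices to floats before summing.
orders["item_price"] = (orders["item_price"]
                        .str.replace("$", "", regex=False)
                        .astype(float))
print(orders["item_price"].sum())  # 22.76
```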

Another dataset is the following movie-user data:


movie = pd.read_table("http://bit.ly/movieusers")
movie.head()
1|24|M|technician|85711
0 2|53|F|other|94043
1 3|23|M|writer|32067
2 4|24|M|technician|43537
3 5|33|F|other|15213
4 6|42|M|executive|98101

This data has an issue: the columns are all squished together because the file uses a different delimiter. The columns also have no header, and both of these need to be specified:

movie=pd.read_table("http://bit.ly/movieusers",sep="|",header=None)
movie.head()
0 1 2 3 4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

We need to add headers to the data so that it is more presentable:

user_col = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

Now we can pass the column names to the names argument so they become the header of the imported data:

movie = pd.read_table("http://bit.ly/movieusers", sep = "|", header = None, names = user_col)
movie.head()
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
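With proper headers in place, quick summaries are a natural first step before plotting. A sketch on a small inline sample built from the rows shown above:

```python
import pandas as pd

# Small inline sample with the same columns as the movie-user data
# (rows copied from the head() output above).
movie = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [24, 53, 23, 24, 33],
    "gender": ["M", "F", "M", "M", "F"],
    "occupation": ["technician", "other", "writer", "technician", "other"],
    "zip_code": ["85711", "94043", "32067", "43537", "15213"],
})

# Numeric summary of the quantitative column...
print(movie["age"].describe())
# ...and counts of each category in a qualitative column.
print(movie["occupation"].value_counts())
```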

We use the seaborn library to do some visualization:

import seaborn as sn
import matplotlib.pyplot as plt

The following plot uses the seaborn library to plot age against occupation, colored by gender:


sn.scatterplot(x="age",y="occupation",data=movie,hue='gender')
plt.show()

The following plot uses the seaborn library to plot a histogram of the ages of the users in the dataset. Unlike matplotlib, seaborn easily overlays a kernel density estimate on the histogram:


sn.histplot(movie['age'],kde=True,bins=15)
plt.show()

The following plot uses barplot to compare the ages of users by gender, to see whether gender and age are related:

sn.barplot(x = 'gender', y = 'age', data = movie)
plt.show()
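By default, seaborn's barplot shows the mean of y for each category of x, so the bar heights can be checked numerically with a groupby. A sketch on a small inline sample taken from the head() rows above:

```python
import pandas as pd

# Inline sample of the movie-user data (rows from the head() output).
movie = pd.DataFrame({
    "age": [24, 53, 23, 24, 33],
    "gender": ["M", "F", "M", "M", "F"],
})

# barplot's default estimator is the mean, so these values are the
# bar heights in the plot above (for this sample of the data).
mean_age = movie.groupby("gender")["age"].mean()
print(mean_age)
```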

From the following figure we can see that each kind of data calls for a specific type of plot, and not all plot types work for all data: zip_code is a nominal code rather than a quantity, so a boxplot against it is not informative.

sn.boxplot(x = 'age', y = 'zip_code', data = movie, hue = 'gender', palette = 'YlGnBu')
plt.show()

7 Text Data

7.1 Text Data in Python

8 Image Data

8.1 Image Data in Python

Website  https://www.tutorialspoint.com/how-to-read-a-jpeg-or-png-image-in-pytorch

To import image data in Python we use the following libraries. The image in this case is downloaded from an online source so that the program produces the same result every time it is run.

import urllib.request
from PIL import Image
  
urllib.request.urlretrieve(
  "https://upload.wikimedia.org/wikipedia/commons/7/72/2012._Vasarely_hair_cut%2C_acr%C3%ADlico_sobre_lienzo.jpg",
   "gfg.png")
img = Image.open("gfg.png")
img.show()

The above image is an art piece retrieved from the Wikimedia library, and the following code operates on it. First, import torch:

import torch

Also import torchvision:

import torchvision
from torchvision.io import read_image
import torchvision.transforms as T

If the photo you want to use is saved in the same directory as the script, it can be read directly instead of being downloaded:

#replace 'image_name.png' with the name of your image
#image = read_image('image_name.png')

The following code converts the image to tensor using the transform method:

# Running this code twice may throw an error, because a tensor
# that has already been transformed cannot be transformed again.
img = T.ToTensor()(img)

This line prints the image tensor data (the output is truncated):

print("Image data:", img)
Image data: tensor([[[0.4902, 0.8706, 0.8353,  ..., 0.7020, 0.6980, 0.7294],
         [0.4471, 0.8392, 0.8196,  ..., 0.6745, 0.6627, 0.6941],
         [0.4471, 0.8431, 0.8157,  ..., 0.6745, 0.6549, 0.6863],
         ...,
         [0.2353, 0.6157, 0.7098,  ..., 0.6392, 0.6314, 0.6627],
         [0.3059, 0.6706, 0.7216,  ..., 0.6431, 0.6392, 0.6706],
         [0.3882, 0.7373, 0.7608,  ..., 0.6902, 0.6902, 0.7255]],

        [[0.3020, 0.3843, 0.4941,  ..., 0.1608, 0.1804, 0.2196],
         [0.2471, 0.3490, 0.4667,  ..., 0.1333, 0.1451, 0.1843],
         [0.2314, 0.3294, 0.4471,  ..., 0.1333, 0.1373, 0.1765],
         ...,
         [0.1373, 0.2000, 0.2392,  ..., 0.1137, 0.1216, 0.1725],
         [0.1843, 0.2392, 0.2471,  ..., 0.1176, 0.1294, 0.1804],
         [0.2510, 0.2941, 0.2667,  ..., 0.1647, 0.1804, 0.2353]],

        [[0.2941, 0.2745, 0.3882,  ..., 0.1608, 0.1725, 0.2196],
         [0.2353, 0.2392, 0.3647,  ..., 0.1333, 0.1373, 0.1843],
         [0.2196, 0.2196, 0.3451,  ..., 0.1333, 0.1294, 0.1765],
         ...,
         [0.1098, 0.1373, 0.1333,  ..., 0.1098, 0.1137, 0.1569],
         [0.1647, 0.1804, 0.1412,  ..., 0.1137, 0.1216, 0.1647],
         [0.2353, 0.2392, 0.1686,  ..., 0.1608, 0.1725, 0.2196]]])
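What ToTensor does can be sketched without torch at all: it rescales 8-bit pixel values from [0, 255] to floats in [0.0, 1.0] and moves the channel axis first, from (height, width, channels) to (channels, height, width). A minimal numpy imitation, using a made-up 2x2 image:

```python
import numpy as np

# A hypothetical 2x2 RGB image as uint8 pixels, laid out (H, W, C).
img_hwc = np.array([[[255, 0, 0], [0, 255, 0]],
                    [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Scale [0, 255] -> [0.0, 1.0], then reorder (H, W, C) -> (C, H, W),
# mimicking what torchvision's ToTensor does to a PIL image.
tensor_like = np.transpose(img_hwc.astype(np.float32) / 255.0, (2, 0, 1))
print(tensor_like.shape)  # (3, 2, 2)
print(tensor_like.max())  # 1.0
```

This is why the printed tensor above contains values between 0 and 1 rather than raw pixel intensities.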

9 Sound Data

9.1 Sound Data in Python

# import the torch and torchaudio dataset packages.
import torch
import torchaudio

# access the dataset in torchaudio package using
# datasets followed by dataset name.
# './' makes sure that the dataset is stored
# in a root directory.
# download = True ensures that the
# data gets downloaded
yesno_data = torchaudio.datasets.YESNO('./',
                                    download=True)

# loading the first 5 data from yesno_data
for i in range(5):
    waveform, sample_rate, labels = yesno_data[i]
    print("Waveform: {}\nSample rate: {}\nLabels: {}".format(
        waveform, sample_rate, labels))
Waveform: tensor([[ 3.0518e-05,  6.1035e-05,  3.0518e-05,  ..., -1.8616e-03,
         -2.2583e-03, -1.3733e-03]])
Sample rate: 8000
Labels: [0, 0, 0, 0, 1, 1, 1, 1]
Waveform: tensor([[ 3.0518e-05,  6.1035e-05,  3.0518e-05,  ..., -2.7466e-03,
         -3.6926e-03, -1.6174e-03]])
Sample rate: 8000
Labels: [0, 0, 0, 1, 0, 0, 0, 1]
Waveform: tensor([[-3.0518e-05,  3.0518e-05, -3.0518e-05,  ..., -2.1973e-03,
          2.1362e-04, -9.4604e-04]])
Sample rate: 8000
Labels: [0, 0, 0, 1, 0, 1, 1, 0]
Waveform: tensor([[ 3.0518e-05,  6.1035e-05,  3.0518e-05,  ..., -1.8311e-04,
          4.2725e-04,  6.7139e-04]])
Sample rate: 8000
Labels: [0, 0, 1, 0, 0, 0, 1, 0]
Waveform: tensor([[ 3.0518e-05,  6.1035e-05,  3.0518e-05,  ..., -1.0071e-03,
         -1.2207e-03, -8.5449e-04]])
Sample rate: 8000
Labels: [0, 0, 1, 0, 0, 1, 1, 0]
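The printed output can be read off directly: the sample rate is 8000 Hz, so a clip's duration is its sample count divided by 8000, and each of the eight labels records one spoken word (by the dataset's convention, 1 for "yes" and 0 for "no"). A small sketch, with a made-up sample count:

```python
# Decode the first label list shown above: 1 = "yes", 0 = "no".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
words = ["yes" if label else "no" for label in labels]
print(" ".join(words))  # no no no no yes yes yes yes

# Duration follows from the 8000 Hz sample rate.
sample_rate = 8000
num_samples = 52000               # hypothetical sample count
print(num_samples / sample_rate)  # 6.5 seconds
```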

Eventually we’ll want to delve into speech recognition using Wav2Vec2 and PyTorch.

10 Geospatial Data

10.1 Geospatial Data in Python

11 Database Query with SQL

12 Interactive Data Visualization

12.1 Enhanced Graphics with Plotly

13 Building Apps for Data Exploration

13.1 Dash Apps