Data Science – Tables

The datascience package was written for use in Berkeley’s Data8 course, and it contains useful functionality for investigating and graphically display of data. The most important functionality in the package is a Table, and its structure is used to represent columns of data.

We will learn the Table structure and its functionalities. Along with this, we will also do hands-on on the IRIS dataset so that the concept would be cleared in a much better way.

IRIS Dataset

It is a simple dataset containing three flowers of the Iris species family, and these are Iris setosaIris virginica, and Iris versicolor. Four features are selected and measured in centimeters per flower, and the four features are sepal length, sepal width, petal length, and petal width. The selection of features was completely subjective to the domain expert. There are fifty samples per Iris species family, and the combination of these fifty samples per species forms 150 records. We will use the Iris dataset to perform all the operations on the table object.

Table Structure

Tables are used to organized data to perform computation and visualization operations easily. It is a sequence of labeled columns (header) and data within a column representing one attribute or feature of the individual. Each row represents a tuple with all features.

Table Operations

Various operations can be performed on the table object, and this we will see in the subsequent sections. First, we will import the Iris spreadsheet into a table object using the following operation.

Import Data from CSV File

To import data, we need to import the Table class from the datascience package. Once it is imported, we can use the below code to load the CSV file.

# Import datascience table class

from datascience import Table

# Load iris data from the specified location
iris_data = Table.read_table('data/iris.csv')

# Print the iris data
print(iris_data)
Iris data - Using datascience.table
Iris data – Using datascience.table

Show Table Tuples

To view the dataset, we can use show() function with or without argument. If there is no agreement, it shows the entire dataset; otherwise, it shows dataset rows based on the specified number.

# Import Table from datascience package
from datascience import Table

# Load iris data
iris_data = Table.read_table('data/iris.csv') 

# Show first 10 rows
iris_data.show(10)
Iris Dataset - Datascience.Table
Iris Dataset – Datascience.Table

Select Table Columns

This is one of the interesting features of the Table class. We can select one or more features using the select() function. Thus, this function makes our life very easy for operations or data visualizations using specific features.

# Import Table from datascience package
from datascience import Table

# Load iris dataset
iris_data = Table.read_table('data/iris.csv')

# Create a new table with Sapal Length and Petal Length only
iris_sepal_length_data = iris_data.select('SepalLengthCm','PetalLengthCm')

# Print the table
print(iris_sepal_length_data)
Iris Dataset - Datascience.Table.select
Iris Dataset – Datascience.Table.select

Drop Table Columns

This is another important function of the Table class. Using this function, we can drop one or more features from the dataset. However, it does not actually manipulate the dataset. Instead, it simply returns an object which we need to points to another variable.

# Import Table from datascience packages
from datascience import Table

# Load Iris dataset
iris_data = Table.read_table('data/iris.csv') 

# Drop Column(s)
iris_selected_feature_data = iris_data.drop('SepalLengthCm','PetalLengthCm','SepalWidthCm')

# Print the updated table
print(iris_selected_feature_data)

In this, we have dropped Sepal Length, Petal Length, and Sepal Width features and stored the remaining features into another table object called iris_selected_feature_data.

Iris Dataset - Datascience.Table.drop
Iris Dataset – Datascience.Table.drop

Selection Using Where

It allows us to select tuples from the dataset based on the specified condition. It takes the first argument as a column label and the second argument as one of the column values, and the selection will be made. In the end, the function returns another table filled with selected tuples if specified conditions are met.

# Import Table from datascience package
from datascience import Table

# Load Iris dataset
iris_data = Table.read_table('data/iris.csv') 

# Select datapoints from a specific column based on the condition
iris_species_versicolor_data = iris_data.where('Species','Iris-versicolor') 

# Print the updated table object
print(iris_species_versicolor_data)

In the above code, we have specified a condition to provide all the tuples whose Species is Iris-versicolor.  In the below diagram, we can see that all the rows are selected whose Species are Iris-versicolor.

Iris Dataset – Datascience.Table.where

Sort Table Column Datapoints 

This function will sort the tuples based on the column label and descending = {true or false} order.  This sorting can be performed both on the qualitative and quantitative data types

# Import Table from datascience package
from datascience import Table

# Load Iris dataset
iris_data = Table.read_table('data/iris.csv') 

# Sort Column datapoints
iris_sorted_data = iris_data.sort('SepalLengthCm',descending=True) 

# Print sorted dataset
print(iris_sorted_data)

We can see that; Sepal Length is in descending order starting from 7.9 cm at the top.

Iris Dataset - Datascience.Table.sort
Iris Dataset – Datascience.Table.sort

Sometimes, we want to change the table column header name in a table for our own convenience. To relabel the column name, we can use relabeled function with two arguments, the first one is column index number, and the second one is the new name.

# Import Table from datascience package
from datascience import Table 

# Rename Column Name
iris_data = Table.read_table('data/iris.csv').relabeled(5,'Custom Species') 

# Print updated column name
print(iris_data)

After execution of the above code, our last column name changed from Species to Custom Species.

Iris Dataset - Datascience.Table.relabeled
Iris Dataset – Datascience.Table.relabeled

Extract Column As Numpy Array

To export the Table column into the NumPy array object. We can use the column function with the argument name as the column header name.

from datascience import *

iris_data = Table.read_table('data/iris.csv')

print(type(iris_data))

SepalLengthCm = iris_data.column('PetalLengthCm')

print(type(SepalLengthCm))

print(SepalLengthCm)
Iris Dataset - Datascience.Table.column
Iris Dataset – Datascience.Table.column

In the above code, we have converted one of the column data of the Table into Numpy Array. Once we have a Numpy array, we can perform all the functionalities that are provided by the Numpy package. 

Create Table and Add Columns

A table object is automatically created when we load data from a spreadsheet. However, this is not the only way to create an empty table by using Table() functions and later adding columns and labels individually. The below code snippet shows how to create a table object from scratch. 

# Import table from datascience packages
from datascience import Table

# make an array with Indian States.
indian_states = make_array('ANDAMAN & NICOBAR ISLANDS', 'ANDHRA PRADESH', 'ASSAM', 'BIHAR',
       'CHANDIGARH', 'CHHATTISGARH', 'GUJARAT', 'HARYANA',
       'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'JHARKHAND', 'KARNATAKA',
       'KERALA', 'MADHYA PRADESH', 'MAHARASHTRA', 'MANIPUR ', 'MEGHALAYA',
       'MIZORAM', 'NAGALAND', 'NCT OF DELHI', 'ORISSA', 'PUDUCHERRY',
       'PUNJAB', 'RAJASTHAN', 'TAMIL NADU', 'TRIPURA', 'UTTAR PRADESH',
       'UTTARAKHAND', 'WEST BENGAL')

# Add this information in the table object
states_table = Table().with_column("States", indian_states)

In the above code, we have created a table and added a column in the newly created table object. In the below code, we are appending one more column in the exiting table object.

# Import table from datascience packages
from datascience import Table

# Make an array of Indian States
state_name = make_array('ANDAMAN & NICOBAR ISLANDS', 'ANDHRA PRADESH', 'ASSAM', 'BIHAR',
       'CHANDIGARH', 'CHHATTISGARH', 'GUJARAT', 'HARYANA',
       'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'JHARKHAND', 'KARNATAKA',
       'KERALA', 'MADHYA PRADESH', 'MAHARASHTRA', 'MANIPUR ', 'MEGHALAYA',
       'MIZORAM', 'NAGALAND', 'NCT OF DELHI', 'ORISSA', 'PUDUCHERRY',
       'PUNJAB', 'RAJASTHAN', 'TAMIL NADU', 'TRIPURA', 'UTTAR PRADESH',
       'UTTARAKHAND', 'WEST BENGAL')

# Create table and add state_name into the table object.
state_table = Table().with_column("States", state_name)

# Make an array of State Code
state_code = make_array(1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 27, 28, 29, 32, 33, 34, 35)

# Add state_code to the existing table object.
state_table = state_table.with_column("States_Code", state_code)
state_table
Iris Dataset - Datascience.Table
Iris Dataset – Datascience.Table

In this note, we have seen how to create a table, add columns to the table, sort the entire dataset based on column attribute, export table column data points in Numpy array, and a lot more. 

This note is published under CC BY-NC-SA 4.0 license.

References

 130 total views,  1 views today

Scroll to Top
Scroll to Top