The datascience
package was written for use in Berkeley’s Data8 course, and it contains useful functionality for investigating and graphically display of data. The most important functionality in the package is a Table
, and its structure is used to represent columns of data.
We will learn the Table structure and its functionalities. Along with this, we will also do hands-on on the IRIS dataset so that the concept would be cleared in a much better way.
IRIS Dataset
It is a simple dataset containing three flowers of the Iris species family, and these are Iris setosa, Iris virginica, and Iris versicolor. Four features are selected and measured in centimeters per flower, and the four features are sepal length, sepal width, petal length, and petal width. The selection of features was completely subjective to the domain expert. There are fifty samples per Iris species family, and the combination of these fifty samples per species forms 150 records. We will use the Iris dataset to perform all the operations on the table object.
Table Structure
Tables are used to organized data to perform computation and visualization operations easily. It is a sequence of labeled columns (header) and data within a column representing one attribute or feature of the individual. Each row represents a tuple with all features.
Table Operations
Various operations can be performed on the table object, and this we will see in the subsequent sections. First, we will import the Iris spreadsheet into a table object using the following operation.
Import Data from CSV File
To import data, we need to import the Table class from the datascience
package. Once it is imported, we can use the below code to load the CSV file.
# Import datascience table class from datascience import Table # Load iris data from the specified location iris_data = Table.read_table('data/iris.csv') # Print the iris data print(iris_data)
Show Table Tuples
To view the dataset, we can use show()
function with or without argument. If there is no agreement, it shows the entire dataset; otherwise, it shows dataset rows based on the specified number.
# Import Table from datascience package from datascience import Table # Load iris data iris_data = Table.read_table('data/iris.csv') # Show first 10 rows iris_data.show(10)
Select Table Columns
This is one of the interesting features of the Table class. We can select one or more features using the select()
function. Thus, this function makes our life very easy for operations or data visualizations using specific features.
# Import Table from datascience package from datascience import Table # Load iris dataset iris_data = Table.read_table('data/iris.csv') # Create a new table with Sapal Length and Petal Length only iris_sepal_length_data = iris_data.select('SepalLengthCm','PetalLengthCm') # Print the table print(iris_sepal_length_data)
Drop Table Columns
This is another important function of the Table class. Using this function, we can drop one or more features from the dataset. However, it does not actually manipulate the dataset. Instead, it simply returns an object which we need to points to another variable.
# Import Table from datascience packages from datascience import Table # Load Iris dataset iris_data = Table.read_table('data/iris.csv') # Drop Column(s) iris_selected_feature_data = iris_data.drop('SepalLengthCm','PetalLengthCm','SepalWidthCm') # Print the updated table print(iris_selected_feature_data)
In this, we have dropped Sepal Length, Petal Length, and Sepal Width features and stored the remaining features into another table object called iris_selected_feature_data.
Selection Using Where
It allows us to select tuples from the dataset based on the specified condition. It takes the first argument as a column label and the second argument as one of the column values, and the selection will be made. In the end, the function returns another table filled with selected tuples if specified conditions are met.
# Import Table from datascience package from datascience import Table # Load Iris dataset iris_data = Table.read_table('data/iris.csv') # Select datapoints from a specific column based on the condition iris_species_versicolor_data = iris_data.where('Species','Iris-versicolor') # Print the updated table object print(iris_species_versicolor_data)
In the above code, we have specified a condition to provide all the tuples whose Species is Iris-versicolor. In the below diagram, we can see that all the rows are selected whose Species are Iris-versicolor.
Sort Table Column Datapoints
This function will sort the tuples based on the column label and descending = {true or false} order. This sorting can be performed both on the qualitative and quantitative data types.
# Import Table from datascience package from datascience import Table # Load Iris dataset iris_data = Table.read_table('data/iris.csv') # Sort Column datapoints iris_sorted_data = iris_data.sort('SepalLengthCm',descending=True) # Print sorted dataset print(iris_sorted_data)
We can see that; Sepal Length is in descending order starting from 7.9 cm at the top.
Sometimes, we want to change the table column header name in a table for our own convenience. To relabel the column name, we can use relabeled
function with two arguments, the first one is column index number, and the second one is the new name.
# Import Table from datascience package from datascience import Table # Rename Column Name iris_data = Table.read_table('data/iris.csv').relabeled(5,'Custom Species') # Print updated column name print(iris_data)
After execution of the above code, our last column name changed from Species to Custom Species.
Extract Column As Numpy Array
To export the Table column into the NumPy array object. We can use the column function with the argument name as the column header name.
from datascience import * iris_data = Table.read_table('data/iris.csv') print(type(iris_data)) SepalLengthCm = iris_data.column('PetalLengthCm') print(type(SepalLengthCm)) print(SepalLengthCm)
In the above code, we have converted one of the column data of the Table into Numpy Array. Once we have a Numpy array, we can perform all the functionalities that are provided by the Numpy package.
Create Table and Add Columns
A table object is automatically created when we load data from a spreadsheet. However, this is not the only way to create an empty table by using Table()
functions and later adding columns and labels individually. The below code snippet shows how to create a table object from scratch.
# Import table from datascience packages from datascience import Table # make an array with Indian States. indian_states = make_array('ANDAMAN & NICOBAR ISLANDS', 'ANDHRA PRADESH', 'ASSAM', 'BIHAR', 'CHANDIGARH', 'CHHATTISGARH', 'GUJARAT', 'HARYANA', 'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'JHARKHAND', 'KARNATAKA', 'KERALA', 'MADHYA PRADESH', 'MAHARASHTRA', 'MANIPUR ', 'MEGHALAYA', 'MIZORAM', 'NAGALAND', 'NCT OF DELHI', 'ORISSA', 'PUDUCHERRY', 'PUNJAB', 'RAJASTHAN', 'TAMIL NADU', 'TRIPURA', 'UTTAR PRADESH', 'UTTARAKHAND', 'WEST BENGAL') # Add this information in the table object states_table = Table().with_column("States", indian_states)
In the above code, we have created a table and added a column in the newly created table object. In the below code, we are appending one more column in the exiting table object.
# Import table from datascience packages from datascience import Table # Make an array of Indian States state_name = make_array('ANDAMAN & NICOBAR ISLANDS', 'ANDHRA PRADESH', 'ASSAM', 'BIHAR', 'CHANDIGARH', 'CHHATTISGARH', 'GUJARAT', 'HARYANA', 'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'JHARKHAND', 'KARNATAKA', 'KERALA', 'MADHYA PRADESH', 'MAHARASHTRA', 'MANIPUR ', 'MEGHALAYA', 'MIZORAM', 'NAGALAND', 'NCT OF DELHI', 'ORISSA', 'PUDUCHERRY', 'PUNJAB', 'RAJASTHAN', 'TAMIL NADU', 'TRIPURA', 'UTTAR PRADESH', 'UTTARAKHAND', 'WEST BENGAL') # Create table and add state_name into the table object. state_table = Table().with_column("States", state_name) # Make an array of State Code state_code = make_array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 27, 28, 29, 32, 33, 34, 35) # Add state_code to the existing table object. state_table = state_table.with_column("States_Code", state_code) state_table
In this note, we have seen how to create a table, add columns to the table, sort the entire dataset based on column attribute, export table column data points in Numpy array, and a lot more.
This note is published under CC BY-NC-SA 4.0 license.
References
130 total views, 1 views today