Why Pandas
Pandas is essential for data analysis. It allows us to work with data.frames which makes it easy for us to use as input for machine learning models and for graphics packages. Dash also works well with data.frames, and hence this tutorial is included.
Install pandas
pip install pandas
Import pandas
import pandas as pd
Read csv as data.frame
df = pd.read_csv("salaries.csv")
CSV stand for comma-separated values. Most modern data is formatted into csv files for easier analysis with pandas data.frames.
Converting Numpy array to data.frame
Matrix of 5 rows and 10 columns
mat = np.arange(0,50).reshape(5,10)
Converting to pandas data.frame
df = pd.DataFrame(data = mat)
Data.frame format
Data.frames have 2 attributes:
- rows or indices
- columns
Row indices are useful for keeping track of observations and for subsetting rows of a data.frame with loc
Finding column names
df.columns
columns is an attribute, not a method, so there are no parentheses.
Renaming columns
df.columns = ["f1","f2","f3","f4","label"]
Referencing columns
Columns in dataa.frames are referenced in square brackets using double quotes:
df["<column name>"]
A single column that is referenced returns a Series object.
df["col 1"]
- returns series
Referencing columns to return data.frame
To return a data.frame, place the reference inside a list:
df[["<column name>"]]
- returns data.frame
You can also call multiple columns by name by placing them inside lists
Referencing rows and columns with loc
loc
is used to return certain rows and columns of a data.frame
df.loc["<row>", "<column>"]
loc
is useful if you know the "index" or name of the row (and column) that you want to subset. However, it is easier to use iloc
if you want to reference rows and columns by integers
Referencing rows and columns with iloc
If you want to reference rows and columns by their position, use iloc
df.iloc[<rownumber>,<colnumber>]
To reference multiple rows and multiple columns, put the rows and/columns into a list