For example there is a input file categorical_input.csv:
patientID,age,diabetesType,status
1,25,Type1,Poor
2,34,Type2,Improved
3,28,Type1,Excellent
4,52,Type1,Poor
Notice there are NO space after comma, or the data will be wrong after loaded
We need load it into a dataframe with right type: the diabetesType is a nominal feature, and status is a ordinal feature.
R
R encodes category data via function factor
.
To encode nominal feature, use factor(vec)
.
To encode ordinal feature, use
factor(vec, order=TRUE, levels=c("Poor", "Improved", "Excellent"))
.
See section 2.2.5 "Factors" in "R in Action" for details.
> diabetes <- read.table("categorical_input.csv", header = TRUE, row.names = "patientID", sep = ",", stringsAsFactors = FALSE)
> diabetes <- transform(diabetes, status = factor(status, ordered = TRUE, levels = c("Poor", "Improved", "Excellent")))
> diabetes <- transform(diabetes, diabetesType = factor(diabetesType))
> str(diabetes)
'data.frame': 4 obs. of 3 variables:
$ age : int 25 34 28 52
$ diabetesType: Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Poor"<"Improved"<..: 1 2 3 1
> diabetes$status <- as.integer(diabetes$status) # convert categorial feature to numeric
> str(diabetes)
'data.frame': 4 obs. of 3 variables:
$ age : int 25 34 28 52
$ diabetesType: Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : int 1 2 3 1
Python
import pandas as pd
import numpy as np
print(pd.__version__)
0.20.3
diabetes = pd.read_csv('categorical_input.csv', index_col='patientID', dtype={'age': np.float, 'diabetesType': 'category', 'status': 'category'})
print(diabetes['status'])
patientID
1 Poor
2 Improved
3 Excellent
4 Poor
Name: status, dtype: category
Categories (3, object): [Excellent, Improved, Poor]
status = diabetes['status']
status.cat.set_categories(['Poor', 'Improved', 'Excellent'], ordered=True, inplace=True)
diabetes['status']
patientID
1 Poor
2 Improved
3 Excellent
4 Poor
Name: status, dtype: category
Categories (3, object): [Poor < Improved < Excellent]
# convert categorical feature to numeric
status_mapping = {'Poor': 1, 'Improved': 2, 'Excellent': 3}
diabetes['status'] = diabetes['status'].map(status_mapping)
# convert it back
inv_status_mapping = {v: k for k, v in status_mapping.items()}
diabetes['status'] = diabetes['status'].map(inv_status_mapping)
Ref: * Categorical Data in pandas (0.20.3) doc.
- Section Mapping ordinal features in book "Python Machine Learning" 2ed by Sebastian Raschka.
For categorial data encoding, R is more mature than Python.