DarkMatter in Cyberspace
  • Home
  • Categories
  • Tags
  • Archives

Encoding Categorial Data in Machine Learning


For example there is a input file categorical_input.csv:

patientID,age,diabetesType,status
1,25,Type1,Poor
2,34,Type2,Improved
3,28,Type1,Excellent
4,52,Type1,Poor

Notice there are NO space after comma, or the data will be wrong after loaded

We need load it into a dataframe with right type: the diabetesType is a nominal feature, and status is a ordinal feature.

R

R encodes category data via function factor.

To encode nominal feature, use factor(vec).

To encode ordinal feature, use factor(vec, order=TRUE, levels=c("Poor", "Improved", "Excellent")).

See section 2.2.5 "Factors" in "R in Action" for details.

> diabetes <- read.table("categorical_input.csv", header = TRUE, row.names = "patientID", sep = ",", stringsAsFactors = FALSE)
> diabetes <- transform(diabetes, status = factor(status, ordered = TRUE, levels = c("Poor", "Improved", "Excellent")))
> diabetes <- transform(diabetes, diabetesType = factor(diabetesType))
> str(diabetes)
'data.frame':    4 obs. of  3 variables:
$ age        : int  25 34 28 52
$ diabetesType: Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status      : Ord.factor w/ 3 levels "Poor"<"Improved"<..: 1 2 3 1
> diabetes$status <- as.integer(diabetes$status)   # convert categorial feature to numeric
> str(diabetes)
'data.frame':   4 obs. of  3 variables:
 $ age         : int  25 34 28 52
 $ diabetesType: Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status      : int  1 2 3 1

Python

import pandas as pd
import numpy as np
print(pd.__version__)

    0.20.3

diabetes = pd.read_csv('categorical_input.csv', index_col='patientID', dtype={'age': np.float, 'diabetesType': 'category', 'status': 'category'})
print(diabetes['status'])

    patientID
    1        Poor
    2    Improved
    3    Excellent
    4        Poor
    Name: status, dtype: category
    Categories (3, object): [Excellent, Improved, Poor]

status = diabetes['status']
status.cat.set_categories(['Poor', 'Improved', 'Excellent'], ordered=True, inplace=True)
diabetes['status']

    patientID
    1        Poor
    2    Improved
    3    Excellent
    4        Poor
    Name: status, dtype: category
    Categories (3, object): [Poor < Improved < Excellent]

# convert categorical feature to numeric
status_mapping = {'Poor': 1, 'Improved': 2, 'Excellent': 3}
diabetes['status'] = diabetes['status'].map(status_mapping)

# convert it back
inv_status_mapping = {v: k for k, v in status_mapping.items()}
diabetes['status'] = diabetes['status'].map(inv_status_mapping)

Ref: * Categorical Data in pandas (0.20.3) doc.

  • Section Mapping ordinal features in book "Python Machine Learning" 2ed by Sebastian Raschka.

For categorial data encoding, R is more mature than Python.



Published

Jan 25, 2018

Last Updated

Jan 29, 2018

Category

Tech

Tags

  • category 1
  • pandas 5
  • rlang 17

Contact

  • Powered by Pelican. Theme: Elegant by Talha Mansoor