Estimator, Transformer and Pipeline in Machine Learning

The "Main concepts in Pipelines" section of the Spark documentation (Programming Guides > Machine Learning Library (MLlib) Guide > ML Pipelines) gives a good introduction to these three concepts.

An estimator's fit() method takes a DataFrame and produces a model, which is itself a transformer. A transformer's transform() method takes a DataFrame and produces another DataFrame. A pipeline is an estimator that consists of a sequence of stages, each of which is either a transformer or an estimator.
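To make these contracts concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: a plain list of dicts stands in for a DataFrame, and every class name in it (Lowercase, MeanLengthEstimator, MeanLengthModel, Pipeline, PipelineModel) is a hypothetical toy chosen for illustration. It only shows how fit() and transform() relate, and how fitting a pipeline could turn each estimator stage into a transformer.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Assumption: a list of dicts stands in for a real DataFrame.
Row = Dict[str, Any]
DataFrame = List[Row]


class Lowercase:
    """Toy transformer: DataFrame -> DataFrame, nothing to learn."""

    def transform(self, df: DataFrame) -> DataFrame:
        return [{**row, "text": row["text"].lower()} for row in df]


@dataclass
class MeanLengthModel:
    """Toy model (a transformer) produced by MeanLengthEstimator.fit()."""

    threshold: float

    def transform(self, df: DataFrame) -> DataFrame:
        # Predict 1 when the text is longer than the learned threshold.
        return [
            {**row, "prediction": int(len(row["text"]) > self.threshold)}
            for row in df
        ]


class MeanLengthEstimator:
    """Toy estimator: fit() learns a value from the data and returns a model."""

    def fit(self, df: DataFrame) -> MeanLengthModel:
        mean_length = sum(len(row["text"]) for row in df) / len(df)
        return MeanLengthModel(threshold=mean_length)


class PipelineModel:
    """Fitted pipeline: a transformer that applies its stages in order."""

    def __init__(self, stages: List[Any]) -> None:
        self.stages = stages

    def transform(self, df: DataFrame) -> DataFrame:
        for stage in self.stages:
            df = stage.transform(df)
        return df


class Pipeline:
    """Toy pipeline: an estimator whose stages are transformers or estimators."""

    def __init__(self, stages: List[Any]) -> None:
        self.stages = stages

    def fit(self, df: DataFrame) -> PipelineModel:
        fitted_stages = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator stage: fit it to get a model (a transformer).
                stage = stage.fit(df)
            # Pass the transformed data on to the next stage.
            df = stage.transform(df)
            fitted_stages.append(stage)
        return PipelineModel(fitted_stages)


training = [{"text": "Hi"}, {"text": "Hello World!"}]
model = Pipeline([Lowercase(), MeanLengthEstimator()]).fit(training)
result = model.transform([{"text": "A"}, {"text": "A much longer sentence"}])
```

After fit(), the returned PipelineModel holds only transformers, so calling transform() on new data simply runs each stage in order.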

The documentation illustrates these concepts with a text-processing pipeline built from two transformers (Tokenizer and HashingTF) and one estimator (LogisticRegression). The input to the pipeline's fit() method is a DataFrame of raw text. The output is a model (a transformer), which can then be used to make predictions on new text.

As the documentation mentions, the pipeline and its related concepts are mostly inspired by scikit-learn, so they carry essentially the same meanings there.
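As a rough scikit-learn counterpart to the text pipeline described above, the sketch below chains HashingVectorizer (a transformer that tokenizes and hashes text in one step) with LogisticRegression (an estimator). The training sentences, labels, step names and parameter values are made up for illustration. One difference worth noting: in scikit-learn, fit() fits the pipeline in place and returns the same object, rather than returning a separate model object.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 1 when the text mentions "spark".
train_texts = [
    "spark is fast",
    "hadoop mapreduce job",
    "spark streaming api",
    "plain text file",
]
train_labels = [1, 0, 1, 0]

# Transformer step followed by a final estimator step.
pipeline = Pipeline(steps=[
    ("hashing_tf", HashingVectorizer(n_features=1024)),
    ("lr", LogisticRegression(max_iter=100)),
])

# fit() learns from raw text plus labels and returns the fitted pipeline.
fitted = pipeline.fit(train_texts, train_labels)

# The fitted pipeline is then used to predict labels for new text.
predictions = fitted.predict(["spark machine learning", "another text file"])
```

The fitted pipeline plays the role that the fitted model plays in Spark: it takes new raw text and produces predictions.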
