A code block in Zeppelin is called a paragraph, which corresponds to a cell in Jupyter.
Jupyter has the concept of modes and lets you operate on cells with keyboard shortcuts; for example, in normal mode x deletes a cell. Zeppelin only has an edit mode.
The current Zeppelin (0.8.1) does not seem to have a per-note working directory (a note corresponds to a script file); only a global zeppelin.home can be set, so you have to spell out the full path when reading data files.
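For example, a minimal sketch of loading a file in a %python paragraph (the path below is hypothetical; adjust it to your host):
%python
import pandas as pd
# no per-note working directory, so spell out the absolute path
df = pd.read_csv("/home/user/data/iris.csv")
print(df.head())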
Zeppelin implements data visualization differently from Jupyter: there is no plotting package like matplotlib. Instead, you cache a dataframe as a view via its createOrReplaceTempView method, create a SQL paragraph, and use the chart buttons to turn the result set returned by the SQL into a plot.
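A minimal sketch of this workflow, assuming Spark 2.x (where the spark session and createOrReplaceTempView are available); the file path and column names are hypothetical:
%pyspark
# register the DataFrame as a temp view so a %sql paragraph can query and chart it
df = spark.read.csv("/data/iris.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("iris_view")
Then query the view in a separate SQL paragraph and pick a chart type with the plot buttons:
%sql
SELECT species, AVG(sepal_length) AS avg_sepal_length
FROM iris_view
GROUP BY species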
Run $ZEPPELIN_HOME/bin/zeppelin-daemon.sh start to start the service; the default port is 8080.
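To verify the daemon is running, check its status and open the web UI (localhost is just an example host):
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh status
# then visit http://localhost:8080 in a browser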
Customize
To modify the server port, run:
cd $ZEPPELIN_HOME/conf
cp zeppelin-site.xml.template zeppelin-site.xml
cp zeppelin-env.sh.template zeppelin-env.sh
Modify the value of zeppelin.server.port in zeppelin-site.xml, and restart the server.
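For example, the property in zeppelin-site.xml looks roughly like this (8888 is just an example port):
<property>
  <name>zeppelin.server.port</name>
  <value>8888</value>
  <description>Server port.</description>
</property>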
Using Anaconda
To use scikit-learn, pandas, and numpy in Zeppelin, first install Anaconda, then start the Zeppelin daemon in an environment where the Anaconda packages are accessible.
For example, I built the environment with the following steps:
- Downloaded and installed Miniconda on a host (Ubuntu 16.04);
- Created a new virtual environment containing the Anaconda packages (named anaconda);
- Downloaded and installed Zeppelin 0.7.3;
- Activated the virtual env anaconda;
- Started the Zeppelin daemon.
As the following commands show:
conda create --name anaconda python=3.5
conda install --name anaconda anaconda
. activate anaconda
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start
Note that you can't build the environment with Python 3.6 if you use pyspark, because pyspark 2.1.0 and earlier do not support Python 3.6. If you do not use pyspark, or your Spark version is newer than 2.1, Python 3.6 works. Ref: Unable to run pyspark.
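Before starting the daemon, you can double-check which interpreter the activated environment exposes:
which python
python --version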
Now test the Anaconda modules with the following code:
%python
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
print("Prediction: %s" % clf.predict(X[0:1]))
print("Actual: %d" % y[0])
Execute it with Shift+Enter. The result should be:
Prediction: [0]
Actual: 0
Print Python interpreter information:
%python
import sys
print("Version: %s\nSys Path: %s" % (sys.version, sys.path))
Verify plot functions:
%python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
Note 1: Use Ctrl+. for code autocompletion instead of Tab.
Note 2: I tried to switch conda environments in Zeppelin (v0.7.3), but failed. In a Zeppelin notebook you can run %conda env list to list all virtual envs, but %conda activate ... fails with an error saying the activate command is not supported, even though this is the method described in the official doc Python 2 & 3 Interpreter for Apache Zeppelin.
Using PySpark
If your Spark version is 2.1 or lower, make sure the Python on the PATH is 3.5 or lower. In a Zeppelin notebook, run the following code:
%pyspark
# group words by their length; groupBy returns (key, iterable) pairs
a = sc.parallelize(["black", "blue", "white", "green", "grey"], 2)
b = a.groupBy(lambda x: len(x)).collect()
print(b)
# sort groups and members for readable output, e.g. [(4, ['blue', 'grey']), (5, ['black', 'green', 'white'])]
print(sorted([(x, sorted(y)) for (x, y) in b]))
Use Ctrl+. for autocompletion.