你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

在 HDInsight on AKS 中的 Apache Spark™ 群集上提交和管理作业

项目
11/15/2023

重要

此功能目前以预览版提供。 Microsoft Azure 预览版的补充使用条款包含适用于 beta 版、预览版或其他尚未正式发布的 Azure 功能的更多法律条款。有关此特定预览版的信息，请参阅 Azure HDInsight on AKS 预览版信息。如有疑问或功能建议，请在 AskHDInsight 上提交请求并附上详细信息，并关注我们以获取 Azure HDInsight Community 的更多更新。

创建群集后，用户可通过以下操作使用各种界面来提交和管理作业

使用 Jupyter
使用 Zeppelin
使用 ssh (spark-submit)

使用 Jupyter

先决条件

HDInsight on AKS 上的 Apache Spark™ 群集。有关详细信息，请参阅创建 Apache Spark 群集。

Jupyter Notebook 是支持各种编程语言的交互式笔记本环境。

创建 Jupyter Notebook

导航到 Apache Spark™ 群集页，打开“概述”选项卡。单击 Jupyter，它会要求你进行身份验证并打开 Jupyter 网页。
从 Jupyter 网页中，选择“新建”>“PySpark”来创建笔记本。

这会创建新的笔记本，并以 Untitled(Untitled.ipynb) 名称打开。

注意

如果使用 PySpark 或 Python 3 内核创建笔记本，在运行第一个代码单元时，会自动创建 spark 会话。不需要显式创建会话。

在 Jupyter Notebook 的空单元格中粘贴以下代码，然后按 Shift+Enter 运行这些代码。有关 Jupyter 上的更多控件，请查看此处。

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
data1 = [22,40,10,50,70]
s1 = pd.Series(data1)   #One-dimensional ndarray with axis labels (including time series).

data2 = data1
index = ['John','sam','anna','smith','ben']
s2 = pd.Series(data2,index=index)

data3 = {'John':22, 'sam':40, 'anna':10,'smith':50,'ben':70}
s3 = pd.Series(data3)

s3['jp'] = 32     #insert a new row
s3['John'] = 88

names = ['John','sam','anna','smith','ben']
ages = [10,40,50,48,70]
name_series = pd.Series(names)
age_series = pd.Series(ages)

data_dict = {'name':name_series, 'age':age_series}
dframe = pd.DataFrame(data_dict)   
#create a pandas DataFrame from dictionary

dframe['age_plus_five'] = dframe['age'] + 5   
#create a new column
dframe.pop('age_plus_five')
#dframe.pop('age')

salary = [1000,6000,4000,8000,10000]
salary_series = pd.Series(salary)
new_data_dict = {'name':name_series, 'age':age_series,'salary':salary_series}
new_dframe = pd.DataFrame(new_data_dict)
new_dframe['average_salary'] = new_dframe['age']*90

new_dframe.index = new_dframe['name']
print(new_dframe.loc['sam'])

绘制一张 X 和 Y 轴分别为“工资”和“年龄”的图
在同一个笔记本中，在 Jupyter Notebook 的空单元格中粘贴以下代码，然后按 Shift+Enter 运行该代码。
```
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

plt.plot(age_series,salary_series)
plt.show()
```

保存笔记本

从笔记本菜单栏中，导航到“文件”>“保存和检查点”。
关闭笔记本以释放群集资源：从笔记本菜单栏，导航到“文件”>“关闭并停止”。还可在示例文件夹下运行任何笔记本。

使用 Apache Zeppelin 笔记本

HDInsight on AKS 中的 Apache Spark 群集包括 Apache Zeppelin 笔记本。使用笔记本运行 Apache Spark 作业。本文介绍如何在 HDInsight on AKS 群集中使用 Zeppelin 笔记本。

先决条件

HDInsight on AKS 上的 Apache Spark 群集。有关说明，请参阅创建 Apache Spark 群集。

启动 Apache Zeppelin 笔记本

导航到 Apache Spark 群集的概述页面，从“群集”仪表板选择 Zeppelin 笔记本。它会提示进行身份验证并打开 Zeppelin 页面。
创建新的笔记本。在标题窗格中，导航到“笔记本”>“创建新笔记”。确保笔记本标题显示“已连接”状态。它表示右上角的一个绿点。

在 Zeppelin 笔记本中运行以下代码：

%livy.pyspark
import pandas as pd
import matplotlib.pyplot as plt
data1 = [22,40,10,50,70]
s1 = pd.Series(data1)   #One-dimensional ndarray with axis labels (including time series).

data2 = data1
index = ['John','sam','anna','smith','ben']
s2 = pd.Series(data2,index=index)

data3 = {'John':22, 'sam':40, 'anna':10,'smith':50,'ben':70}
s3 = pd.Series(data3)

s3['jp'] = 32     #insert a new row

s3['John'] = 88

names = ['John','sam','anna','smith','ben']
ages = [10,40,50,48,70]
name_series = pd.Series(names)
age_series = pd.Series(ages)

data_dict = {'name':name_series, 'age':age_series}
dframe = pd.DataFrame(data_dict)   #create a pandas DataFrame from dictionary

dframe['age_plus_five'] = dframe['age'] + 5   #create a new column
dframe.pop('age_plus_five')
#dframe.pop('age')

salary = [1000,6000,4000,8000,10000]
salary_series = pd.Series(salary)
new_data_dict = {'name':name_series, 'age':age_series,'salary':salary_series}
new_dframe = pd.DataFrame(new_data_dict)
new_dframe['average_salary'] = new_dframe['age']*90

new_dframe.index = new_dframe['name']
print(new_dframe.loc['sam'])

选择“播放”按钮，使段落运行内容片段。段落右上角的状态应从“就绪”逐渐变成“挂起”、“正在运行”和“已完成”。输出会显示在同一段落的底部。屏幕截图如下图所示：

输出：

使用 Spark 提交作业

使用 `#vim samplefile.py' 命令创建文件
此命令会打开 vim 文件

将以下代码粘贴到 vim 文件中

import pandas as pd
import matplotlib.pyplot as plt

From pyspark.sql import SparkSession
Spark = SparkSession.builder.master('yarn').appName('SparkSampleCode').getOrCreate()
# Initialize spark context
data1 = [22,40,10,50,70]
s1 = pd.Series(data1)   #One-dimensional ndarray with axis labels (including time series).

data2 = data1
index = ['John','sam','anna','smith','ben']
s2 = pd.Series(data2,index=index)

data3 = {'John':22, 'sam':40, 'anna':10,'smith':50,'ben':70}
 s3 = pd.Series(data3)

s3['jp'] = 32     #insert a new row

s3['John'] = 88

names = ['John','sam','anna','smith','ben']
ages = [10,40,50,48,70]
name_series = pd.Series(names)
age_series = pd.Series(ages)

data_dict = {'name':name_series, 'age':age_series}
dframe = pd.DataFrame(data_dict)   #create a pandas DataFrame from dictionary

dframe['age_plus_five'] = dframe['age'] + 5   #create a new column
dframe.pop('age_plus_five')
#dframe.pop('age')

salary = [1000,6000,4000,8000,10000]
salary_series = pd.Series(salary)
new_data_dict = {'name':name_series, 'age':age_series,'salary':salary_series}
new_dframe = pd.DataFrame(new_data_dict)
new_dframe['average_salary'] = new_dframe['age']*90

new_dframe.index = new_dframe['name']
print(new_dframe.loc['sam'])