Setting Up the Main Spark Environments
Spark on Local Mode
:::note This guide assumes you have already downloaded and configured a Hadoop environment. It is not intended for readers without basic Linux experience. :::
Download Spark
Visit the Spark download page and choose a suitable version.
Download the chosen archive with wget:
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
:::note If you downloaded the archive on your local machine, transfer it to the container that has a working Hadoop environment and extract it there. :::
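A minimal extraction sketch, assuming the archive sits in the current directory and you want Spark under /opt (matching the SPARK_HOME used later in this guide):
tar -zxvf spark-3.4.1-bin-hadoop3.tgz -C /opt/
# The extracted directory is /opt/spark-3.4.1-bin-hadoop3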
Download Anaconda and Set Up the Python Environment
Visit the Anaconda download page and choose a suitable version.
Download the installer with wget and run it:
wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh
bash ./Anaconda3-2023.07-2-Linux-x86_64.sh
# If you let the installer run conda init, your shell environment is configured automatically
:::note If you downloaded the installer on your local machine, transfer it to the container that has a working Hadoop environment before running it. :::
- Anaconda depends on a number of Linux libraries; the official documentation lists the packages to install for each distribution:
# Debian / Ubuntu
apt-get install libgl1-mesa-glx libegl1-mesa libxrandr2 libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
# RHEL / CentOS / Fedora
yum install libXcomposite libXcursor libXi libXtst libXrandr alsa-lib mesa-libEGL libXdamage mesa-libGL libXScrnSaver
# Arch Linux
pacman -Sy libxau libxi libxss libxtst libxcursor libxcomposite libxdamage libxfixes libxrandr libxrender mesa-libgl alsa-lib libglvnd
# openSUSE / SLES
zypper install libXcomposite1 libXi6 libXext6 libXau6 libX11-6 libXrandr2 libXrender1 libXss1 libXtst6 libXdamage1 libXcursor1 libxcb1 libasound2 libX11-xcb1 Mesa-libGL1 Mesa-libEGL1
# Gentoo
emerge x11-libs/libXau x11-libs/libxcb x11-libs/libX11 x11-libs/libXext x11-libs/libXfixes x11-libs/libXrender x11-libs/libXi x11-libs/libXcomposite x11-libs/libXrandr x11-libs/libXcursor x11-libs/libXdamage x11-libs/libXScrnSaver x11-libs/libXtst media-libs/alsa-lib media-libs/mesa
- If the installation succeeds but the conda command is not found, reload your shell configuration:
source ~/.bashrc
# Apply the new shell configuration
- Create a virtual environment:
conda create -n pyspark python=3.8
- Check that the environment was created:
conda env list
- Test that it can be activated:
conda activate pyspark
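With the environment active, you can check the interpreter it provides; this is the path that PYSPARK_PYTHON points at in the next step (the /root/anaconda3 prefix assumes Anaconda was installed as root in its default location):
python -V            # should report Python 3.8.x
which python3.8      # e.g. /root/anaconda3/envs/pyspark/bin/python3.8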
Edit the Relevant Configuration Files
- /etc/profile
# Append the following to the end of the file
export HADOOP_HOME=/opt/hadoop
export HBASE_HOME=/opt/hbase
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/opt/spark-3.4.1-bin-hadoop3
export PYSPARK_PYTHON=/root/anaconda3/envs/pyspark/bin/python3.8
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# If vim is not available, append the lines with tee instead
echo 'export HADOOP_HOME=/opt/hadoop' | sudo tee -a /etc/profile
echo 'export HBASE_HOME=/opt/hbase' | sudo tee -a /etc/profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' | sudo tee -a /etc/profile
echo 'export SPARK_HOME=/opt/spark-3.4.1-bin-hadoop3' | sudo tee -a /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/envs/pyspark/bin/python3.8' | sudo tee -a /etc/profile
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' | sudo tee -a /etc/profile
echo 'export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH' | sudo tee -a /etc/profile
source /etc/profile
- ~/.bashrc
# Append the following to the end of the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PYSPARK_PYTHON=/root/anaconda3/envs/pyspark/bin/python3.8
# If vim is not available, use these commands instead
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=/root/anaconda3/envs/pyspark/bin/python3.8' >> ~/.bashrc
source ~/.bashrc
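To confirm the variables are visible, a quick check in a fresh shell (sketch):
echo "$SPARK_HOME"
echo "$PYSPARK_PYTHON"
"$PYSPARK_PYTHON" --version   # should report Python 3.8.x from the pyspark environment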
Run Spark
# Run this from Spark's bin directory
./pyspark
Spark starts successfully.
Run a Spark API call:
sc.parallelize([1,2,3,4,5]).map(lambda x:x*10).collect()
The Spark operation runs successfully.
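Besides the interactive shell, you can submit a script in local mode with spark-submit; a minimal sketch using the Pi example that ships with the Spark distribution:
# Run from Spark's bin directory; local[*] uses all local CPU cores
./spark-submit --master "local[*]" "$SPARK_HOME/examples/src/main/python/pi.py" 10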
Spark on Standalone Mode
Edit the Spark configuration files (located in Spark's conf directory)
- spark-defaults.conf
# Append the following settings to the end of the file
# Enable Spark event logging
spark.eventLog.enabled true
# Path where Spark event logs are stored (this HDFS directory must exist; see the note after these configuration files)
spark.eventLog.dir hdfs://hadoop-master1:8020/sparklog/
# Whether to compress Spark event logs
spark.eventLog.compress true
- spark-env.sh
# Java and Hadoop environment variables
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
YARN_CONF_DIR=/opt/hadoop/etc/hadoop
# Spark Master configuration
export SPARK_MASTER_HOST=hadoop-master1
export SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
# Spark Worker configuration
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
## History server settings
# Store Spark application history logs in the /sparklog directory on HDFS
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop-master1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
- workers
hadoop-worker1
hadoop-worker2
hadoop-worker3
- log4j2.properties
# Change the root log level to WARN
rootLogger.level = WARN
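The spark.eventLog.dir configured above points at HDFS, and that directory must exist before applications write to it; the history server is also started separately. A sketch, assuming HDFS is already running:
hdfs dfs -mkdir -p /sparklog
hdfs dfs -chmod 777 /sparklog   # relax permissions so any user can write event logs; tighten for production
# Optionally start the history server (it picks up SPARK_HISTORY_OPTS from spark-env.sh)
$SPARK_HOME/sbin/start-history-server.sh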
Start the Services
# Log in to the machine you chose as the Master
# Run from Spark's sbin directory
./start-all.sh
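Once start-all.sh finishes, the Master Web UI should be reachable at http://hadoop-master1:8080, and you can attach a PySpark shell to the cluster to verify it (run from Spark's bin directory):
./pyspark --master spark://hadoop-master1:7077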
Spark on Standalone HA Mode (ZooKeeper)
Edit the Spark configuration files (located in Spark's conf directory)
- spark-defaults.conf
# Append the following settings to the end of the file
# Enable Spark event logging
spark.eventLog.enabled true
# Path where Spark event logs are stored
spark.eventLog.dir hdfs://hadoop-master1:8020/sparklog/
# Whether to compress Spark event logs
spark.eventLog.compress true
# Use ZooKeeper as the recovery mode to make the Master highly available
spark.deploy.recoveryMode ZOOKEEPER
# ZooKeeper ensemble URL; separate multiple nodes with commas
spark.deploy.zookeeper.url zoo1:2181,zoo2:2181,zoo3:2181
# ZooKeeper path under which the Spark Master state is stored
spark.deploy.zookeeper.dir /spark
- spark-env.sh
# Java and Hadoop environment variables
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
YARN_CONF_DIR=/opt/hadoop/etc/hadoop
# Spark Worker configuration
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
## History server settings
# Store Spark application history logs in the /sparklog directory on HDFS
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop-master1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
# Use ZooKeeper to manage Master high availability
# (SPARK_MASTER_HOST is intentionally not set here: every node on which start-master.sh is run becomes a Master candidate)
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=zoo1:2181,zoo2:2181,zoo3:2181 \
-Dspark.deploy.zookeeper.dir=/spark"
- workers
hadoop-worker1
hadoop-worker2
hadoop-worker3
- log4j2.properties
# Change the root log level to WARN
rootLogger.level = WARN
Start the Services
# Running start-all.sh on the first master also starts all the worker nodes
docker exec hadoop-master2 /opt/hadoop/spark-3.4.1-bin-hadoop3/sbin/start-all.sh
docker exec hadoop-master3 /opt/hadoop/spark-3.4.1-bin-hadoop3/sbin/start-master.sh
docker exec hadoop-master1 /opt/hadoop/spark-3.4.1-bin-hadoop3/sbin/start-master.sh
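With several Masters running, clients list all of them in the master URL; Spark connects to whichever one is currently ALIVE, and running applications survive a Master failover. A sketch using the three masters above (run from Spark's bin directory):
./pyspark --master spark://hadoop-master1:7077,hadoop-master2:7077,hadoop-master3:7077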
This is an original article, licensed under CC BY-NC-ND 4.0. Please credit David when republishing it in full.