Spark Standalone Configuration
01. Configuration variables supported by spark-env.sh
| Environment Variable | Meaning |
| --- | --- |
| SPARK_MASTER_IP | The IP address the master binds to, e.g. a public IP. |
| SPARK_MASTER_PORT | Start the master on a different port (default: 7077). |
| SPARK_MASTER_WEBUI_PORT | Port for the master web UI (default: 8080). |
| SPARK_MASTER_OPTS | Configuration properties that apply only to the master daemon, in the form "-Dx=y" (default: none). See below for a list of possible options. |
| SPARK_LOCAL_DIRS | Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. |
| SPARK_WORKER_CORES | Total number of cores to allow Spark applications to use on the machine (default: all available cores). |
| SPARK_WORKER_MEMORY | Total amount of memory to allow Spark applications to use on the machine, e.g. 1000M or 2G (default: total memory minus 1 GB). Note that each application's own memory is set by its spark.executor.memory property. |
| SPARK_WORKER_PORT | Port the Spark worker binds to (default: random). |
| SPARK_WORKER_WEBUI_PORT | Port for the worker web UI (default: 8081). |
| SPARK_WORKER_INSTANCES | Number of worker instances to run on each machine (default: 1). You can raise this above 1 if you have a very powerful machine and want multiple Spark worker processes. If you do set it, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores. |
| SPARK_WORKER_DIR | Directory to run applications in, which will also contain logs and scratch space (default: SPARK_HOME/work). |
| SPARK_WORKER_OPTS | Configuration properties that apply only to the worker, in the form "-Dx=y" (default: none). See below for a list of possible options. |
| SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons (default: 512m). |
| SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons, in the form "-Dx=y" (default: none). |
| SPARK_PUBLIC_DNS | The public DNS name of the Spark master and workers (default: none). |
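For concreteness, a minimal spark-env.sh sketch is shown below. The host IP, core and memory figures, and directory paths are placeholder assumptions for illustration, not recommendations.

```bash
# conf/spark-env.sh -- sourced by the launch scripts on each node.
# All values below are illustrative placeholders; adjust to your hardware.

export SPARK_MASTER_IP=192.168.1.100          # hypothetical IP the master binds to
export SPARK_MASTER_PORT=7077                 # default master port
export SPARK_MASTER_WEBUI_PORT=8080           # default master web UI port

export SPARK_WORKER_CORES=4                   # cores this worker offers to applications
export SPARK_WORKER_MEMORY=6g                 # memory this worker offers to applications
export SPARK_WORKER_DIR=/data/spark/work      # assumed work/log directory
export SPARK_LOCAL_DIRS=/ssd1/spark,/ssd2/spark   # assumed fast local scratch disks
```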
Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
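Starting the daemons by hand, as the note above suggests for Windows, looks roughly like the following (the spark://IP:PORT master URL is whatever your master's web UI reports):

```bash
# On the master node:
./bin/spark-class org.apache.spark.deploy.master.Master

# On each worker node, pointing at the master URL:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
```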
02. SPARK_MASTER_OPTS supports the following system properties:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.deploy.retainedApplications | 200 | The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit. |
| spark.deploy.retainedDrivers | 200 | The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit. |
| spark.deploy.spreadOut | true | Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. |
| spark.deploy.defaultCores | (infinite) | Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default. |
| spark.worker.timeout | 60 | Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. |
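As the first table notes, these properties are passed to the master daemon through SPARK_MASTER_OPTS in "-Dx=y" form. A sketch, with placeholder values chosen only for illustration:

```bash
# conf/spark-env.sh on the master node -- values are illustrative assumptions.
# Keep fewer finished apps/drivers in the UI and cap the default core grant
# so one application doesn't take the whole shared cluster by default.
export SPARK_MASTER_OPTS="-Dspark.deploy.retainedApplications=100 \
  -Dspark.deploy.retainedDrivers=100 \
  -Dspark.deploy.defaultCores=4 \
  -Dspark.deploy.spreadOut=true"
```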
03. SPARK_WORKER_OPTS supports the following system properties:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.worker.cleanup.enabled | false | Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Application directories are cleaned up regardless of whether the application is still running. |
| spark.worker.cleanup.interval | 1800 (30 minutes) | Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine. |
| spark.worker.cleanup.appDataTtl | 7 * 24 * 3600 (7 days) | The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently. |
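These properties are likewise passed to each worker through SPARK_WORKER_OPTS in "-Dx=y" form. The sketch below enables periodic cleanup; the one-hour interval and three-day TTL are assumptions to tune against your own disk space, not defaults.

```bash
# conf/spark-env.sh on each worker node -- interval and TTL are illustrative.
# Check for old application work dirs every hour (3600 s) and delete any
# older than 3 days (3 * 24 * 3600 = 259200 s).
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=3600 \
  -Dspark.worker.cleanup.appDataTtl=259200"
```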