Spark Standalone Configuration

01. Configuration variables supported by spark-env.sh

Environment Variable	Meaning
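SPARK_MASTER_IP	Bind the master to a specific IP address, for example a public one.
SPARK_MASTER_PORT	Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT	Port for the master web UI (default: 8080).
SPARK_MASTER_OPTS	Configuration properties that apply only to the master daemon, in the form "-Dx=y" (default: none). See below for a list of possible options.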
SPARK_LOCAL_DIRS	Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
SPARK_WORKER_CORES	Total number of cores to allow Spark applications to use on the machine (default: all available cores).
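SPARK_WORKER_MEMORY	Total amount of memory to allow Spark applications to use on the machine, e.g. 1000M or 2G (default: total memory minus 1 GB). Note that each application's individual memory is determined by its spark.executor.memory property.
SPARK_WORKER_PORT	Start the Spark worker on a specific port (default: random).
SPARK_WORKER_WEBUI_PORT	Port for the worker web UI (default: 8081).
SPARK_WORKER_INSTANCES	Number of worker instances to run on each machine (default: 1). You can set this above 1 if you have a very powerful machine and want multiple Spark worker processes. If you do set it, make sure SPARK_WORKER_CORES explicitly limits the number of CPU cores per worker; otherwise each worker will try to use all the cores.
SPARK_WORKER_DIR	Directory in which applications run, which also holds logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_OPTS	Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.
SPARK_DAEMON_MEMORY	Memory to allocate to the Spark master and worker daemons themselves (default: 512m).
SPARK_DAEMON_JAVA_OPTS	JVM options for the Spark master and worker daemons themselves, in the form "-Dx=y" (default: none).
SPARK_PUBLIC_DNS	The public DNS name of the Spark master and workers (default: none).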

Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
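
As a rough illustration of how the variables in section 01 might be combined, the following is a minimal sketch of a conf/spark-env.sh fragment. All values (the IP address, core and memory sizes, and the /mnt/... directories) are illustrative assumptions, not recommendations; adjust them to your own hardware.

```shell
# Hypothetical spark-env.sh fragment; every value below is an assumed example.

# Master settings
export SPARK_MASTER_IP="192.0.2.10"      # example address only
export SPARK_MASTER_PORT=7077            # same as the documented default
export SPARK_MASTER_WEBUI_PORT=8080      # same as the documented default

# Worker settings: two workers per machine, so cores are capped explicitly,
# otherwise each worker would try to use every core on the host.
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g

# Scratch space spread over two (assumed) local disks, comma-separated.
export SPARK_LOCAL_DIRS="/mnt/disk1/spark,/mnt/disk2/spark"

# Memory for the master and worker daemons themselves.
export SPARK_DAEMON_MEMORY=512m
```

With SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=4, the two workers on a machine together offer 8 cores to applications, so this sketch assumes a host with at least that many.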

02. SPARK_MASTER_OPTS supports the following system properties:

Property Name	Default	Meaning
spark.deploy.retainedApplications	200	The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
spark.deploy.retainedDrivers	200	The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
spark.deploy.spreadOut	true	Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
spark.deploy.defaultCores	(infinite)	Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
spark.worker.timeout	60	Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
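
The master properties listed above are passed as "-Dx=y" pairs inside SPARK_MASTER_OPTS. A minimal sketch for a shared cluster, assuming you want to cap the default core grant, consolidate applications onto fewer nodes, and wait longer before declaring a worker lost (the specific values 4, false, and 120 are illustrative assumptions):

```shell
# Hypothetical spark-env.sh fragment; the values are assumed examples.
# Multiple properties are separated by spaces inside one quoted string.
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 \
-Dspark.deploy.spreadOut=false \
-Dspark.worker.timeout=120"
```

Applications that set spark.cores.max themselves are not affected by spark.deploy.defaultCores; the cap only applies when they leave it unset.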

03. SPARK_WORKER_OPTS supports the following system properties:

Property Name	Default	Meaning
spark.worker.cleanup.enabled	false	Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Application directories are cleaned up regardless of whether the application is still running.
spark.worker.cleanup.interval	1800 (30 minutes)	Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl	7 * 24 * 3600 (7 days)	The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
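
The cleanup properties above all take values in seconds, which are easy to get wrong by hand. A minimal sketch, assuming you want to enable cleanup with the documented interval and a 7-day retention; APP_DATA_TTL is a hypothetical helper variable introduced here for illustration, not a Spark setting:

```shell
# Hypothetical spark-env.sh fragment.
# APP_DATA_TTL is an assumed helper name; shell arithmetic computes
# 7 days * 24 hours * 3600 seconds = 604800 seconds.
APP_DATA_TTL=$((7 * 24 * 3600))

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
-Dspark.worker.cleanup.interval=1800 \
-Dspark.worker.cleanup.appDataTtl=${APP_DATA_TTL}"
```

Because cleanup is disabled by default, setting only the interval or the TTL has no effect unless spark.worker.cleanup.enabled is also set to true.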