Setting up Executor Nodes¶
We shall set up the executor nodes by first preparing the hardware and operating system, and then setting up the executor service itself.
Considerations and tuning for hardware and operating system¶
In the following sections we discuss some aspects of scheduling, hardware and operating system settings that help ensure good performance.
CPU and Memory considerations¶
The executor nodes are the servers that host and run the actual docker containers. Drove takes the NUMA topology of these machines into consideration to optimize container placement and extract the maximum performance. Along with this, Drove will cpuset the containers to the allocated cores in a non-overlapping manner, so that the cores allocated to a container are dedicated to it. Memory allocated to a container is pinned as well and selected from the same NUMA node.
Needless to say, the minimum amount of CPU that can be given to an application or task is 1 core. Fractional CPU allocation can be achieved in a predictable manner by configuring over provisioning on executor nodes.
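One way to verify this behaviour on a running executor is to look at the NUMA layout of the node and at the cpuset applied to a container. A minimal sketch, assuming Docker as the container engine and a hypothetical container name:
# View the NUMA topology of the node (requires the numactl package)
numactl --hardware
# Check which cores and memory node a running container is pinned to
docker inspect --format 'cpus={{.HostConfig.CpusetCpus}} mems={{.HostConfig.CpusetMems}}' <container-name>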
Over Provisioning of CPU and Memory¶
Drove does not do any kind of burst scaling or overcommitment, so that application performance remains predictable even under load. Instead, Drove has a feature to make an executor appear to have more cores (and memory) than it actually has. This can be used to get more utilization out of executor nodes in clusters that do not need guaranteed performance (for example staging or dev testing clusters). This is achieved by enabling over provisioning.
Over provisioning needs to be configured in the executor configuration. It primarily consists of two configs:
- CPU Multiplier - an integral multiplier which will be used to multiply the number of available cores
- Memory Multiplier - an integral multiplier which will be used to multiply available memory
VCores (virtual cores) are the internal representation of a CPU core on the executor. If over provisioning is disabled, a vcore corresponds to a physical core. If over provisioning is enabled, 1 CPU core will generate the configured CPU multiplier number of vcores. Drove still does cpuset on containers running on nodes that have over provisioning enabled; however, the physical cores that the containers get bound to are chosen at random, albeit from the same NUMA node. cpuset-mem is always done on the same NUMA node as well.
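A minimal sketch of the corresponding executor configuration (the multiplier values are illustrative; see the over provisioning configuration reference below for the full set of options):
resources:
  exposedMemPercentage: 90
  overProvisioning:
    enabled: true
    cpuMultiplier: 4     # a node with 40 physical cores is exposed as 160 vcores
    memoryMultiplier: 2  # exposed memory is doubled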
Mixed clusters
In some production clusters you might have applications that are non-critical in terms of performance and unable to utilize a full core. These can be tagged to be spun up on nodes where over provisioning is enabled. Adopting such a cluster topology ensures that containers that need high performance run on nodes without over provisioning, while smaller apps (for example operations consoles) run on separate nodes with over provisioning enabled. Just ensure the latter nodes are tagged properly, and specify this tag in the application spec or task spec during deployment.
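A hedged sketch of the resources section for such an over provisioned node (the tag name is hypothetical; applications meant for these nodes would then reference the same tag in their placement policy):
resources:
  exposedMemPercentage: 90
  tags:
    - small-apps          # hypothetical tag used by non-critical applications
  overProvisioning:
    enabled: true
    cpuMultiplier: 4
    memoryMultiplier: 2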
Disable NUMA Pinning¶
There is an option to disable memory and core pinning. In this situation, all cores from all NUMA nodes show up as being part of one node. cpuset-mems is not called if NUMA pinning is disabled, and therefore you will be leaving some memory performance on the table. We recommend not dabbling with this unless you have tasks and containers that need more than the number of cores available on a single NUMA node. This is enabled at the executor level by setting disableNUMAPinning: true.
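In the executor configuration this flag goes under the resources section, for example:
resources:
  exposedMemPercentage: 90
  disableNUMAPinning: true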
Hyper-threading¶
Whether Hyper-Threading should be enabled depends somewhat on the deployed applications and how effectively they can utilize individual CPU cores. For mixed workloads, we recommend enabling Hyper-Threading on the executor nodes.
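To check whether Hyper-Threading is currently active on a node, one option is lscpu; a value of 2 or more threads per core indicates that Hyper-Threading is on:
lscpu | grep -i 'thread(s) per core'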
Isolating container and OS processes¶
Typically we would not want containers to share CPU resources with operating system processes, the Drove Executor Service, the Docker engine (if using docker) and so on. While complete isolation would need more drastic measures (such as passing isolcpus in the GRUB kernel parameters), we can get a good middle ground by ensuring such processes utilize only a few CPU cores on the system, and letting the Drove executor deploy and pin containers to the rest.
This is achieved in two steps:
- Make changes to systemd to use only specific cores
- Exclude these cores in the drove executor configuration
Let's say our server has 2 NUMA nodes, each with 40 hyper-threaded cores. We want to reserve the first 2 cores from each CPU for the OS processes. So we reserve cores [0,1,2,3] for the OS processes.
The following line in /etc/systemd/system.conf
#CPUAffinity=
needs to be changed to
CPUAffinity=0 1 2 3
Tip
Reboot the machine for this to take effect.
The changes can be validated post reboot by running the following command:
grep Cpus_allowed_list /proc/1/status
The expected output should be:
Cpus_allowed_list: 0-3
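The second step is to exclude the same cores in the Drove executor configuration so that no container is ever pinned to them. A minimal sketch of the relevant part of the resources section (covered in detail in the configuration reference below):
resources:
  osCores: [ 0, 1, 2, 3 ]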
Note
Refer to this for more details.
GPU Computation¶
Nvidia based GPU compute can be enabled at the executor level by installing the relevant drivers. Please follow the setup guide to enable this. Remember to tag these nodes to isolate them from the primary cluster, and use tags to deploy the apps and tasks that need a GPU.
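Once the drivers are installed, GPU visibility on the node can be verified with the standard Nvidia tooling before setting enableNvidiaGpu in the executor configuration:
nvidia-smi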
Storage consideration¶
On executor nodes the disk might come under pressure if container (re)deployments are frequent or the containers log very heavily. As such, we recommend mounting the Drove logging directory on hardware that can handle this load. Similar consideration should be given to the log and package directories for docker or podman.
Executor Configuration Reference¶
The Drove Executor is written on the Dropwizard framework. The configuration for the service is set using a YAML file which needs to be injected into the container. A typical executor configuration file will look like the following:
server: #(1)!
applicationConnectors: #(2)!
- type: http
port: 3000
adminConnectors: #(3)!
- type: http
port: 3001
applicationContextPath: /
requestLog:
appenders:
- type: console
timeZone: ${DROVE_TIMEZONE}
- type: file
timeZone: ${DROVE_TIMEZONE}
currentLogFilename: /logs/drove-executor-access.log
archivedLogFilenamePattern: /logs/drove-executor-access.log-%d-%i
archivedFileCount: 3
maxFileSize: 100MiB
logging:
level: INFO
loggers:
com.phonepe.drove: ${DROVE_LOG_LEVEL}
appenders: #(4)!
- type: console #(5)!
threshold: ALL
timeZone: ${DROVE_TIMEZONE}
logFormat: "%(%-5level) [%date] [%logger{0} - %X{instanceLogId}] %message%n"
- type: file #(6)!
threshold: ALL
timeZone: ${DROVE_TIMEZONE}
currentLogFilename: /logs/drove-executor.log
archivedLogFilenamePattern: /logs/drove-executor.log-%d-%i
archivedFileCount: 3
maxFileSize: 100MiB
logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
archive: true
- type: drove #(7)!
logPath: "/logs/applogs/"
archivedLogFileSuffix: "%d"
archivedFileCount: 3
threshold: TRACE
timeZone: ${DROVE_TIMEZONE}
logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
archive: true
zookeeper: #(8)!
connectionString: ${ZK_CONNECTION_STRING}
clusterAuth: #(9)!
secrets:
- nodeType: CONTROLLER
secret: ${DROVE_CONTROLLER_SECRET}
- nodeType: EXECUTOR
secret: ${DROVE_EXECUTOR_SECRET}
resources: #(10)!
osCores: [ 0, 1 ]
exposedMemPercentage: 60
disableNUMAPinning: ${DROVE_DISABLE_NUMA_PINNING}
enableNvidiaGpu: ${DROVE_ENABLE_NVIDIA_GPU}
options: #(11)!
cacheImages: true
maxOpenFiles: 10_000
logBufferSize: 5m
cacheFileSize: 10m
cacheFileCount: 3
- Server listener configuration. See Dropwizard Server Configuration for the different options.
- Main port configuration. This is where the UI and APIs will be exposed. Check connector configuration docs for details.
- Admin port. You can take thread dumps, collect metrics and run healthchecks on the Drove executor on this port.
- Logging configuration. See logging docs.
- Log to console. Useful in docker-compose.
- Log to rotating files. Useful for running servers.
- Drove application logger configuration. See drove logger config for details.
- Configure how to connect to Zookeeper. See Zookeeper Config for details.
- Configuration for authentication between nodes in the cluster. Please check intra node auth config for details.
- Resource configuration for this node.
- Options to configure executor behaviour. Check executor options section for details.
Tip
In case you do not want to expose the admin APIs outside the host, set bindHost in the admin connectors section.
adminConnectors:
- type: http
port: 10001
bindHost: 127.0.0.1
Zookeeper Connection Configuration¶
The following details can be configured.
Name | Option | Description
---|---|---
Connection String | connectionString | The connection string of the form: zkserver:2181,zkserver2:2181...
Data namespace | namespace | The top level node inside which all Drove data will be scoped. Defaults to drove if not set.
Sample
zookeeper:
connectionString: "192.168.3.10:2181,192.168.3.11:2181,192.168.3.12:2181"
namespace: drovetest
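As a quick sanity check before bringing up Drove nodes, connectivity to the ensemble can be verified from the executor host (srvr is one of ZooKeeper's built-in four-letter commands; the address below is from the sample above and assumes nc is installed):
echo srvr | nc 192.168.3.10 2181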
Note
This section is the same across the cluster, for both controllers and executors.
Intra Node Authentication Configuration¶
Communication between controllers and executors is protected by shared-secret based authentication. The following configuration sets this up. This section consists of a list with two members:
- Config for the controller to talk to executors
- Config for executors to talk to the controller
Each section consists of the following:
Name | Option | Description
---|---|---
Node Type | nodeType | Type of node in the cluster. Can be CONTROLLER or EXECUTOR
Secret | secret | The actual secret to be passed.
Sample
clusterAuth:
secrets:
- nodeType: CONTROLLER
secret: ControllerSecretValue
- nodeType: EXECUTOR
secret: ExecutorSecret
Note
This section is the same across the cluster, for both controllers and executors.
Drove Application Logger Configuration¶
Drove will segregate application and task instance logs in a directory of your choice. The path for such files is set as:
- <application id>/<instance id> for Application Instances
- <sourceAppName>/<task id> for Task Instances
The Drove Log Appender is based on Logback's SiftingAppender.
The following configuration options are supported:
Name | Option | Description |
---|---|---|
Path | logPath |
Directory to host the logs |
Archive old logs | archive |
Whether to enable log rotation |
Archived File Suffix | archivedLogFileSuffix |
Suffix for archived log files. |
Archived File Count | archivedFileCount |
Count of archived log files. Older files are deleted. |
File Size | maxFileSize |
Size of current log file after which it is archived and a new file is created. Unit: DataSize. |
Total Size | totalSizeCap |
total size after which deletion takes place. Unit: DataSize. |
Buffer Size | bufferSize |
Buffer size for the logger. (Set to 8KB by default). Used if immediateFlush is turned off. |
Immediate Flush | immediateFlush |
Flush logs immediately. Set to true by default (recommended) |
Sample
logging:
level: INFO
...
appenders:
# Setup appenders for the executor process itself first
...
- type: drove
logPath: "/logs/applogs/"
archivedLogFileSuffix: "%d"
archivedFileCount: 3
threshold: TRACE
timeZone: ${DROVE_TIMEZONE}
logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
archive: true
Resource Configuration¶
This section can be used to configure how resources are exposed from an executor to the cluster. The considerations that drive this configuration have been discussed in the sections above.
Name | Option | Description
---|---|---
OS Cores | osCores | A list of cores reserved for use by operating system processes. See the relevant section for details on the pre-steps needed to achieve this.
Exposed Memory | exposedMemPercentage | What percentage of the system memory can be used collectively by the containers running on the host. Range: 50-100 integer
NUMA Pinning | disableNUMAPinning | Disable NUMA and CPU core pinning for containers. Pinning is on by default. (default: false)
Nvidia GPU | enableNvidiaGpu | Enable GPU support on containers. This setting makes all available Nvidia GPUs on the current executor machine available to any container running on this executor. GPU resources are not discovered, managed or rationed between containers by the executor. Needs to be used in conjunction with tagging (see tags below) to ensure that only the applications which require a GPU end up on the executors with GPUs.
Tags | tags | A set of strings that can be used in the MATCH_TAG placement policy to route application and task instances to this executor.
Over Provisioning | overProvisioning | Setup over provisioning configuration.
Tagging
The current hostname is always added as a tag by default, and is handled specially to allow non-tagged deployments to be routed to this executor. If any tag is specified in the tags config, this node will receive containers only when MATCH_TAG placement is used. Please check the relevant sections on specifying placement policies for applications and tasks.
Sample
resources:
osCores: [0,1,2,3]
exposedMemPercentage: 90
Over provisioning configuration¶
Drove strives to ensure that containers can run unencumbered on the CPU cores allocated to them. This means that the minimum possible allocation unit is 1 core; fractional CPU is not supported.
However, there are situations where we would want some non-critical applications to run on the cluster without wasting CPU. The overProvisioning configuration provides a way to turn off NUMA pinning on the executor and run more containers than it normally would.
To ensure predictability, we do not want pinned and non-pinned containers running on the same host. Hence, an executor host can either run in pinned mode or in non-pinned mode.
To allow more containers than could usually be deployed, while still retaining some control over how small a container can go, multipliers are specified on CPU and memory.
Example:
- Let's say your executor server has 40 cores available. If you set cpuMultiplier to 4, this node will now show up to the controller as having 160 cores.
- Let's say your server has 512GB of memory. Setting memoryMultiplier to 2 will make Drove see it as 1TB.
Name | Option | Description
---|---|---
Enabled | enabled | Set this to true to enable over provisioning. Default: false
CPU Multiplier | cpuMultiplier | Multiplier to be applied to enable cpu over provisioning. Default: 1. Range: 1-20
Memory Multiplier | memoryMultiplier | Multiplier to be applied to enable memory over provisioning. Default: 1. Range: 1-20
Sample
resources:
exposedMemPercentage: 90
overProvisioning:
enabled: true
memoryMultiplier: 1
cpuMultiplier: 3
Tip
This feature was developed to allow us to run our development environments more cheaply. In such environments there is not much pressure on CPU or memory, but a large number of containers run, as developers spin up containers for the features they are working on. There was no point in wasting a full core on containers that get hit twice a minute or less. On production we tend to err on the side of caution and, as of the time of writing, allocate at least one core even to the most trivial applications.
Executor Options¶
The following options can be set to influence the behaviour of the Drove executors.
Name | Option | Description
---|---|---
Hostname | hostname | Override the hostname that gets exposed to the controller. Make sure this is resolvable.
Cache Images | cacheImages | Cache container images. If this is not passed, a container image is removed when a container dies and no other instance is using the image.
Command Timeout | containerCommandTimeout | Timeout used by the container engine client when issuing container commands to docker or podman.
Container Socket Path | dockerSocketPath | The path of the container engine socket. Comes in handy to configure the socket path when using podman etc.
Max Open Files | maxOpenFiles | Override the maximum number of file descriptors a container can open. Default: 470,000
Log Buffer Size | logBufferSize | The size of the buffer the executor uses to read logs from containers. Unit: DataSize. Range: 1-128MB. Default: 10MB
Cache File Size | cacheFileSize | To limit disk usage, configure a fixed size for container log cache files. Unit: DataSize. Range: 10MB-100GB. Default: 20MB. Compression is always enabled.
Cache File Count | cacheFileCount | To limit disk usage, configure a fixed count of container log cache files. Unit: integer. Max: 1024. Default: 3
Sample
options:
logBufferSize: 20m
cacheFileSize: 30m
cacheFileCount: 3
cacheImages: true
Relevant directories¶
Locations for data and logs are as follows:
- /etc/drove/executor/ - Configuration files
- /var/log/drove/executor/ - Executor Logs
- /var/log/drove/executor/instance-logs - Application/Task Instance Logs
We shall be volume mounting the config and log directories with the same names.
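If these directories were not already created during the prerequisite setup, a minimal sketch for creating them on the host (adjust ownership and permissions to your environment):
mkdir -p /etc/drove/executor
mkdir -p /var/log/drove/executor/instance-logs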
Prerequisite Setup
If not done already, please complete the prerequisite setup on all machines earmarked for the cluster.
Setup the config file¶
Create the relevant configuration file in /etc/drove/executor/executor.yml.
Sample
server:
applicationConnectors:
- type: http
port: 11000
adminConnectors:
- type: http
port: 11001
requestLog:
appenders:
- type: file
timeZone: IST
currentLogFilename: /var/log/drove/executor/drove-executor-access.log
archivedLogFilenamePattern: /var/log/drove/executor/drove-executor-access.log-%d-%i
archivedFileCount: 3
maxFileSize: 100MiB
logging:
level: INFO
loggers:
com.phonepe.drove: INFO
appenders:
- type: file
threshold: ALL
timeZone: IST
currentLogFilename: /var/log/drove/executor/drove-executor.log
archivedLogFilenamePattern: /var/log/drove/executor/drove-executor.log-%d-%i
archivedFileCount: 3
maxFileSize: 100MiB
logFormat: "%(%-5level) [%date] [%logger{0} - %X{appId}] %message%n"
- type: drove
logPath: "/var/log/drove/executor/instance-logs"
archivedLogFileSuffix: "%d-%i"
archivedFileCount: 0
maxFileSize: 1GiB
threshold: INFO
timeZone: IST
logFormat: "%(%-5level) | %-23date | %-30logger{0} | %message%n"
archive: true
zookeeper:
connectionString: "192.168.56.10:2181"
clusterAuth:
secrets:
- nodeType: CONTROLLER
secret: "0v8XvJrDc7r86ZY1QCByPTDPninI4Xii"
- nodeType: EXECUTOR
secret: "pOd9sIEXhv0wrGOVc7ebwNvR7twZqyTN"
resources:
osCores: []
exposedMemPercentage: 90
disableNUMAPinning: true
overProvisioning:
enabled: true
memoryMultiplier: 10
cpuMultiplier: 10
options:
cacheImages: true
logBufferSize: 20m
cacheFileSize: 30m
cacheFileCount: 3
Setup required environment variables¶
Environment variables needed to run the Drove executor are set up in /etc/drove/executor/executor.env.
CONFIG_FILE_PATH=/etc/drove/executor/executor.yml
JAVA_PROCESS_MIN_HEAP=1g
JAVA_PROCESS_MAX_HEAP=1g
ZK_CONNECTION_STRING="192.168.56.10:2181"
JAVA_OPTS="-Xlog:gc:/var/log/drove/executor/gc.log -Xlog:gc:::filecount=3,filesize=10M -Xlog:gc::time,level,tags -XX:+UseNUMA -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Dfile.encoding=utf-8 -Djute.maxbuffer=0x9fffff"
Create systemd file¶
Create a systemd file. Put the following in /etc/systemd/system/drove.executor.service:
[Unit]
Description=Drove Executor Service
After=docker.service
Requires=docker.service
[Service]
User=drove
TimeoutStartSec=0
Restart=always
ExecStartPre=-/usr/bin/docker pull ghcr.io/phonepe/drove-executor:latest
ExecStart=/usr/bin/docker run \
--env-file /etc/drove/executor/executor.env \
--volume /etc/drove/executor:/etc/drove/executor:ro \
--volume /var/log/drove/executor:/var/log/drove/executor \
--volume /var/run/docker.sock:/var/run/docker.sock \
--publish 11000:11000 \
--publish 11001:11001 \
--hostname %H \
--rm \
--name drove.executor \
ghcr.io/phonepe/drove-executor:latest
[Install]
WantedBy=multi-user.target
Verify the unit file:
systemd-analyze verify drove.executor.service
Set permissions
chmod 664 /etc/systemd/system/drove.executor.service
Start the service on all servers¶
Use the following to start the service:
systemctl daemon-reload
systemctl enable drove.executor
systemctl start drove.executor
You can tail the logs at /var/log/drove/executor/drove-executor.log.
The executor should now show up on the Drove Console.
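Since the executor is a Dropwizard service, a quick liveness check can also be run against the admin port (assuming the sample ports used above; the healthcheck endpoint is provided by Dropwizard's admin servlet):
curl http://localhost:11001/healthcheck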