Jenkins: The Strange Case of Killed Jobs

May 4th, 2019 · Posted in Cloudbees, DevOps

I have Jenkins configured with a dozen jobs that support a microservices application. The setup worked fine for several months until, suddenly, jobs started to fail with errors similar to this (console log fragment):
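An illustrative reconstruction of such a failure (paths and PIDs are made up; the key detail is the 137 exit code):

```
+ mvn clean install
/var/jenkins_home/workspace/my-app/script.sh: line 2:  1234 Killed    mvn clean install
script returned exit code 137
Finished: FAILURE
```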

 

 

It caused all jobs to fail most of the time, Bash scripts and Maven builds alike. The main Jenkins log had no additional information. A web search turned up many discussions about jobs failing (not Jenkins itself), all pointing to a memory shortage such as exhausted heap space or virtual memory. But the same processes worked flawlessly when executed manually, i.e. not by Jenkins. So the problem was unrelated to memory; something else was going on.

My Jenkins service was being subjected to a denial of service attack. This article is about identifying the cause and taking preventative action.

 

A note on SIGKILL

Exit code 137 corresponds to SIGKILL, a termination signal that cannot be caught or blocked. It's equivalent to this Linux command:
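In its simplest form that's `kill -9 <PID>`; a short demonstration with a throwaway sleep process (my addition) shows where the 137 comes from:

```shell
# Forcefully terminate a process; <PID> is the target's process ID:
#   kill -9 <PID>

# Demonstration with a throwaway background process:
sleep 60 &            # example victim
PID=$!
kill -9 "$PID"        # SIGKILL is signal number 9
wait "$PID"           # collect the victim's exit status
echo "exit code: $?"  # prints "exit code: 137" (128 + 9)
```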

 

 

<PID> is a placeholder for a process ID. SIGKILL is the name of signal number 9, the argument to kill above. A process terminated by a signal exits with code 128 plus the signal number, hence 137 = 128 + 9 (see UNIX exit codes).

 

Jenkins setup

My Jenkins setup is relatively simple: a single server to manage and execute a dozen or so Multibranch Pipeline jobs. It runs inside a Docker container, so the Jenkins environment is portable across different machines; in this case it’s a cloud-hosted virtual machine with 8 GB of RAM. For that environment I have a custom Docker image that includes the tools below:

  • Maven CLI, for jobs that build Java projects.
  • Docker CLI, for jobs that build container images (some other tools are available: Kaniko and Jib from Google, Buildah from Red Hat).
  • Docker Compose CLI, for automated testing of an app composed of loosely-coupled services.
  • Kubernetes CLI, for jobs that manage a Kubernetes cluster.

 

Profiling Jenkins with VisualVM

Apart from failing builds, the observable concern was that Jenkins had started to use almost all available CPU resources, with every core consistently around 100% usage. That was easy to see with htop.

My first idea was to profile CPU usage with VisualVM. For that to work I enabled RMI in JAVA_OPTS and exposed an additional container port (10099) for debugging:
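A sketch of that configuration, assuming the standard com.sun.management.jmxremote system properties (the image name is a placeholder; authentication and SSL are disabled here only for a short debugging session):

```shell
JAVA_OPTS="\
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=10099 \
  -Dcom.sun.management.jmxremote.rmi.port=10099 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Djava.rmi.server.hostname=<public-host-name>"

docker run -d -p 8080:8080 -p 10099:10099 \
  -e JAVA_OPTS="$JAVA_OPTS" my-jenkins-image
```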

 

 

VisualVM does include a CPU profiler, but its CPU usage graph stayed in the 0-5% range, even though htop showed 100%.

 

Understanding the problem – incoming HTTP requests

As a starting point for troubleshooting, the Monitoring plugin for Jenkins is much simpler to use than VisualVM. The plugin generates a report of aggregated resource usage: CPU, memory, threads, HTTP requests, errors, and so on.

I installed the plugin and opened it in Jenkins to generate a report (URL: [Jenkins home]/monitoring). The screenshot below is consistent with htop, showing 100% CPU usage. The system load of 2.83 also indicates overload: on average almost three processes were competing for the virtual machine's 2 CPU cores.

 

JavaMelody - system info

 

After 20 minutes I refreshed the page to generate a second report. Comparing the two, it was obvious that HTTP requests kept pouring in, even though there was no reason for any. The screenshot below shows the total number of requests (circled in red) since the Jenkins process was started:

 

JavaMelody - HTTP statistics

 

Apart from jobs triggered by webhooks, I was the only user of the Jenkins web interface at this stage. Also, while troubleshooting this problem, there were no code changes that would invoke webhooks.

This suggested that someone else was making these requests, so I killed Jenkins and started an echo server on the same port. It started logging incoming requests right away (long lines folded for readability):
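The echo server itself can be anything that accepts a TCP connection and logs the bytes it receives. A minimal stand-in with netcat (a sketch; assumes a netcat build that accepts `nc -l <port>`, such as netcat-openbsd):

```shell
# Listen on the Jenkins port and append each incoming request to a log.
# The loop restarts the listener after every connection.
while true; do
  nc -l 8080 >> incoming-requests.log
done
```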

 

 

So, what's happening here is that some kind of malicious software is trying to break into Jenkins by cycling through simple login/password combinations, such as manager/admin.

I validated that by restarting the container with the non-standard port 8081 exposed instead of 8080. Jenkins then ran fully operational with reasonable CPU usage: 0-2% while idle.
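Switching the published port is a one-flag change on the host side (the image name is a placeholder):

```shell
# Publish container port 8080 on host port 8081 instead of 8080.
docker run -d -p 8081:8080 my-jenkins-image
```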

What is surprising is that Jenkins needs 100% of the CPU just to deny the login attempts.

The chain of events is:

  1. Malicious login attempts
  2. 100% CPU usage
  3. Jenkins kills jobs by sending signal SIGKILL

 

Whitelisting IP addresses

Some cloud providers, AWS for example, offer rule-based tools that control access to virtual machines. Many cloud providers don't. I wanted a generic solution that works anywhere a Docker container can run.

My idea is to build a container image that allows connections only from trusted IP addresses. The list of trusted addresses, or whitelist, is a file kept external to the image, so that the allowed connections can change without building a new image. The most straightforward way to achieve that is to run the iptables command in the startup script of the Jenkins container. So, in the Dockerfile:

  • Install iptables
  • Copy Bash script: Docker custom entrypoint
  • Copy Bash script: run iptables commands

The custom entrypoint script sets firewall rules first with iptables, then starts Jenkins.

The iptables script reads the whitelist file, then creates iptables rules for these addresses.

 

Issue #1 – iptables must run as root

The Jenkins container runs as the jenkins user, and in that context iptables can't be used. We also can't use su to run commands as root, because there is no root password in the first place.

The only option I see is to change the container's user to root, but that means Jenkins itself would start as root, which is bad practice and should be avoided. So I changed the Jenkins startup command to execute as the jenkins user, leaving only the firewall setup to run as root.

 

Issue #2 – container can’t run iptables without additional privileges

The docker run command provides a --privileged switch, which widens the container's permissions considerably. But that is overkill for this case.

There is another switch, --cap-add=NET_ADMIN, that grants exactly the network-administration capability iptables needs.

 

Solution

The code listings below show my updated Dockerfile and script files.

 

Dockerfile
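A sketch along these lines, assuming the official jenkins/jenkins base image (tag and paths are my assumptions):

```dockerfile
FROM jenkins/jenkins:lts

# Install iptables; stay root so the entrypoint can create firewall rules.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends iptables \
    && rm -rf /var/lib/apt/lists/*

# Custom entrypoint plus the script that builds the iptables rules.
COPY custom-entrypoint.sh create-iptables-rules.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/custom-entrypoint.sh \
             /usr/local/bin/create-iptables-rules.sh

ENTRYPOINT ["/usr/local/bin/custom-entrypoint.sh"]
```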

 

Docker custom entrypoint script: custom-entrypoint.sh
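A sketch of the entrypoint: apply the rules as root when CREATE_IPTABLES_RULES is set, then start Jenkins as the jenkins user (the su invocation and the jenkins.sh path are assumptions based on the official image):

```shell
#!/bin/bash
set -e

# Optionally apply the firewall rules; we are still root at this point.
if [ "${CREATE_IPTABLES_RULES}" = "true" ]; then
  /usr/local/bin/create-iptables-rules.sh
fi

# Drop privileges and hand over to the stock Jenkins startup script.
exec su -s /bin/bash jenkins -c /usr/local/bin/jenkins.sh
```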


 

Bash script for creating iptables rules: create-iptables-rules.sh
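A sketch of the rules script: accept each address listed in $IP_WHITELIST_FILE for the Jenkins port, then drop everything else (the port number and the one-address-per-line file format are assumptions):

```shell
#!/bin/bash
set -e

JENKINS_PORT=8080

# Allow each whitelisted source address to reach Jenkins.
while IFS= read -r ip; do
  [ -z "$ip" ] && continue   # skip blank lines
  iptables -A INPUT -p tcp --dport "$JENKINS_PORT" -s "$ip" -j ACCEPT
done < "$IP_WHITELIST_FILE"

# Everything else targeting the Jenkins port gets dropped.
iptables -A INPUT -p tcp --dport "$JENKINS_PORT" -j DROP
```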

 

How to run
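A sketch of the run command, combining the NET_ADMIN capability with the environment variables and the whitelist mount (image name and host path are placeholders):

```shell
docker run -d \
  --cap-add=NET_ADMIN \
  -p 8080:8080 \
  -e CREATE_IPTABLES_RULES=true \
  -e IP_WHITELIST_FILE=/whitelist.txt \
  -v "$PWD/whitelist.txt:/whitelist.txt:ro" \
  my-jenkins-image
```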

 

Note the environment variables: CREATE_IPTABLES_RULES, IP_WHITELIST_FILE, and a volume mapping to mount whitelist.txt inside the container.

 

Final notes

The original observation, failing jobs, is quite far removed from the underlying cause. That kind of indirection is not uncommon in systems with many moving parts.

Initial research strongly suggested that insufficient memory caused the problem. But a quick test eliminated that suspect, distinguishing the Jenkins jobs (failing) from the same jobs executed without Jenkins (working).

The main Jenkins log had no information about the failing jobs, and Jenkins has no built-in monitoring. The Monitoring plugin helped to understand the problem by profiling the behavior with snapshots of process metrics. Internally the plugin uses JavaMelody, a general-purpose monitoring tool for Java applications.

This was a good lesson in not forgetting about malicious attempts to break into all kinds of public servers.
