possible workarounds for the fortigate VPN client changing the DNS server

Since VPN technology connects sites and users, most implementations can switch DNS servers to ones available on the remote side. But not all of them do this neatly and without breaking the existing configuration. And while the routing part usually works fine in most use cases, mangling resolv.conf can dramatically slow the operating system down or even break it, because countless routine operations rely on quick and correct name resolution.
So some VPN clients, in my case the FortiGate VPN client, treat one of the most important system configuration files in the most barbaric way: the client rewrites the file according to the FortiGate server's instructions, flushing existing changes, and (this really drives me crazy) does not restore them after the connection terminates! And there is no way (at least none described in the manuals) to change or disable this behavior. In my case it broke the mail service (postfix), since postfix has no way of its own to specify which DNS server to use.
So we need to fix this magical piece of software. Logic suggests three ways:

1. Somehow isolate the intruder from the system it tries to break. This means containerisation or running it on a separate machine. Not always possible, since not every infrastructure can adopt containerisation or change its routing policy just because of a software developer's idiotic decisions;
2. Set up a quasi-watchdog that watches the file for changes and rolls them back, forcing the correct configuration. Pros: better for complex installations. Cons: some queries will hit the wrong DNS server, which, at least for mail, can result in information loss or large delivery delays, which is unacceptable;
3. Forbid all users and processes from touching the resolv.conf file, handling it manually.
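For reference, the watchdog from option 2 could be as simple as the following sketch. It assumes inotify-tools is installed and that /etc/resolv.conf.good is a pristine copy you prepared yourself; both are my assumptions, not part of any FortiGate tooling:

```shell
#!/bin/sh
# Watchdog sketch: restore a known-good resolv.conf whenever something rewrites it.
# GOOD is a pristine copy you prepare yourself; both paths are assumptions.
GOOD=${GOOD:-/etc/resolv.conf.good}
TARGET=${TARGET:-/etc/resolv.conf}

# copy the known-good config over whatever the VPN client wrote
restore() {
    cp "$GOOD" "$TARGET"
}

# requires inotify-tools; fires on in-place writes, renames and attribute changes
watch_loop() {
    while inotifywait -e modify,move_self,attrib "$TARGET"; do
        restore
    done
}
```

As noted above, there is still a window between the rewrite and the restore during which queries hit the wrong server.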

I chose the third way, since my DNS configuration is pretty static, and used chattr +i to make resolv.conf read-only for ALL users and processes.
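The whole fix boils down to a single chattr call (needs root and a filesystem with ext attribute support, e.g. ext4); pin_resolv is just an illustrative helper name:

```shell
#!/bin/sh
# pin_resolv is an illustrative wrapper; the actual fix is the single chattr call.
pin_resolv() {
    chattr +i /etc/resolv.conf   # set the immutable flag: no process may rewrite the file
    lsattr /etc/resolv.conf      # verify: an 'i' should appear in the flag column
}
# to edit DNS settings later, drop the flag first: chattr -i /etc/resolv.conf
```

Remember to drop the flag with chattr -i before any legitimate DNS change, or tools like resolvconf and DHCP clients will fail silently.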

ansible template parsing workaround

Whilst trying to template a Python script with Ansible 2.4, I got a weird error:

fatal: [HOSTNAME]: FAILED! => {
"changed": false,
"failed": true,
"msg": "AnsibleError: template error while templating string: Missing end of comment tag. String: #!/usr/bin/python\nimport os\nimport json\n\nif __name__ == \"__main__\":\n # Iterate over all block devices, but ignore them if they are in the skippable set\n skippable = (\"sr\", \"loop\", \"ram\")\n devices = (device for device in os.listdir(\"/sys/class/block\")\n if not any(ignore in device for ignore in skippable))\n data = [{\"{#DEVICENAME}\": device} for device in devices]\n print(json.dumps({\"data\": data}, indent=4))\n"

Turns out Ansible runs templated files through Jinja2, and the {#DEVICENAME} macro in the script looks like an unterminated Jinja2 comment ({# opens a comment). We can work around it by wrapping the file contents in {% raw %} … {% endraw %} tags (easier) or by marking the value as !unsafe.
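For example, the template can be wrapped like this; everything between the tags is passed through verbatim:

```jinja
#!/usr/bin/python
{% raw %}
import os
import json

if __name__ == "__main__":
    # Iterate over all block devices, ignoring the skippable ones;
    # the {#DEVICENAME} macro below is what trips up Jinja2
    skippable = ("sr", "loop", "ram")
    devices = (device for device in os.listdir("/sys/class/block")
               if not any(ignore in device for ignore in skippable))
    data = [{"{#DEVICENAME}": device} for device in devices]
    print(json.dumps({"data": data}, indent=4))
{% endraw %}
```

The shebang can stay outside the raw block since it contains no Jinja2 syntax.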

crontab truncates file paths to 100 characters

I use crontab’s ability to load its configuration from a file in my deployment scripts. But today it refused to load the configuration, claiming it could not find the file, even though all permissions on the file were correct. Using stdin (cat $ABSOLUTE_PATH | crontab -) worked, but I was curious why this happens in the first place. Here’s an example of the error (I obviously changed the paths for security reasons):

$ crontab /opt/my_long_path/file.crontab
/opt/my_long_path/file.cron: No such file or directory

At first I was confused by the name change and thought it was related to filename parsing, like sudo ignoring files with a . (dot) in their names in /etc/sudoers.d/. But then I found that the file imports correctly if I invoke crontab from the directory containing it (using the short relative path). Renaming the file to something shorter also worked. So the problem seemed to be the long path, and I counted the characters in it:

$ echo "/opt/my_long_path/file.crontab" | wc -m

…and it was exactly three characters more than 100, and those are exactly the three characters that went missing (..tab). It seems crontab doesn’t accept paths longer than 100 characters and silently truncates them. I didn’t find an adequate explanation for this behavior; neither the operating system nor the file system imposes such a weird restriction. So we have to either shorten the file name or use stdin, as shown at the beginning of the note.
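The stdin fallback can be wrapped so deployment scripts don’t have to care about path length; crontab_load is a hypothetical helper name, a sketch of the workaround:

```shell
#!/bin/sh
# crontab_load: crontab(1) here truncates file paths longer than 100
# characters, so fall back to feeding the file via stdin in that case.
crontab_load() {
    f=$1
    if [ "${#f}" -gt 100 ]; then
        crontab - < "$f"   # long path: pass the contents, not the path
    else
        crontab "$f"
    fi
}
```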

nginx IP rules behind a DDoS guard

I use the allow/deny directives of ngx_http_access_module to protect sensitive parts of my websites from public access. But if a website sits behind CloudFlare, or a similar DDoS protection/CDN provider, your nginx will only see CloudFlare’s IPs, so your blocking rules (or any other IP-based rules, e.g. GeoIP) will not work. This HOWTO is written for CloudFlare.

P.S. Please check your access.log to make sure you are really getting the wrong IPs before changing anything.

So, we can use the built-in nginx module ngx_http_realip_module to handle this. It can recover the real client IP address from the headers CloudFlare sets.

We’ll use the set_real_ip_from directive. You can get CloudFlare’s subnet list from their official page, which is regularly updated. We need to build a list that looks like the following, for example:

Please DO NOT copy this; use the actual IP list provided by CloudFlare, as any example here may be out of date.
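A sketch of what the list looks like (the ranges below are documentation placeholders, not CloudFlare’s real subnets):

```nginx
set_real_ip_from 192.0.2.0/24;
set_real_ip_from 198.51.100.0/24;
set_real_ip_from 2001:db8::/32;
```

One set_real_ip_from directive per subnet, IPv4 and IPv6 alike.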


…And add the directive real_ip_header X-Forwarded-For;. This tells nginx to take the real client address from that header.
Now we’ll add all of this to our virtualhost config at the server level. You can also use it at the location level, but make sure you put it in every location, including those defined by regular expressions. Example:

server {
        <...> there goes your virtualhost config <...>

        set_real_ip_from <CloudFlare subnets, one directive per subnet>;
        real_ip_header X-Forwarded-For;

        location / {
                <...> there goes your location config <...>
        }
}
Now check your config with nginx -t and restart the web server to apply the changes.
nginx now receives the valid client IPs from CloudFlare, and your rules will work correctly.
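With the real client address restored, the access rules from ngx_http_access_module behave as expected again; for example (the path and IP are placeholders):

```nginx
location /admin/ {
    allow 192.0.2.10;   # the only client allowed in
    deny  all;          # everyone else gets 403
}
```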

unable to find LVM volume pve/root

I found out that my recently installed Proxmox hypervisor sporadically fails to boot because it can’t find its LVM logical volumes.

This situation means that our initramfs was successfully loaded, unpacked and has finished its work, and now, when the OS should mount the real root filesystem, the disks are not present.
First of all, activate the VG(s) with vgchange -ay. If this succeeds, the LVM configuration is fine, and you can proceed with the boot by pressing Ctrl+D.

When the server finishes booting, check dmesg for any hardware errors. If everything is OK, we can conclude that the root cause of the problem is a slow RAID controller (I use 3ware) that can’t present the disks to the system in time. So the solution is to add the rootdelay parameter to the linux line in grub.cfg. 10 seconds should be enough:

linux /boot/vmlinuz-4.2.6-1-pve root=/dev/mapper/pve-root ro rootdelay=10 quiet

Be careful when changing this file manually and double-check which menuentry you are editing.
Now reboot your server; it should boot as planned.
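Note that grub.cfg is regenerated on every kernel update, so the manual edit will not survive one. On Debian-based systems (an assumption; adjust for your distro) the durable way is to put the parameter into /etc/default/grub and regenerate the config:

```shell
# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"
```

Then run update-grub, and the parameter will be added to every menuentry automatically.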

jenkins can’t connect to slaves after update to 2.55 or higher

Jenkins is a continuous integration tool written in Java that provides a very useful toolchain for the DevOps software cycle.
After updating to version 2.55 my master server was unable to connect to its own slaves. Whilst trying to connect to a slave, I began receiving messages like this:

ERROR: Connection terminated
java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2298)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2767)
	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
	at java.io.ObjectInputStream.&lt;init&gt;(ObjectInputStream.java:298)
	at hudson.remoting.ObjectInputStreamEx.&lt;init&gt;(ObjectInputStreamEx.java:40)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
ERROR: Socket connection to SSH server was lost

This stack trace points at a remoting issue, which is misleading: according to the changelog, Jenkins requires Java 8 starting from this version. Although it isn’t clearly stated, the requirement is mandatory only for the slave jar files; the server installation itself works fine on Java 7 and even Java 6 (I checked both). And if we try to connect to an obsolete slave with the new slave.jar, the connection fails with this stack trace even when the required Java is installed on the slave machine but is not the default.

So we have two options now: either make Java 8 the default on the slave machines, or downgrade Jenkins / somehow keep using the obsolete slave.jar. The second option is far more complicated and should be used only if you cannot run a modern Java on your hosts.
I’ll skip the Java installation itself, as the process is perfectly described in many HOWTOs.
So, we have two JDKs installed:

# update-java-alternatives --list
java-1.7.0-openjdk-amd64 1071 /usr/lib/jvm/java-1.7.0-openjdk-amd64
java-1.8.0-openjdk-amd64 1069 /usr/lib/jvm/java-1.8.0-openjdk-amd64

Now we need to switch the default to the desired one:

# update-java-alternatives --set java-1.8.0-openjdk-amd64                                                                                                                                                                         
update-alternatives: error: no alternatives for mozilla-javaplugin.so                                                                                                                                                                        
update-java-alternatives: plugin alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/IcedTeaPlugin.so

Since we rarely use Firefox on deployment servers, this message can be safely dismissed. Now our Jenkins slaves should operate as usual.

atop log does not rotate in debian stretch

atop is a sar-like tool that records system diagnostic data and lets you view it in a handy readable way. By default it’s configured to rotate its logs every midnight, but recently I found out that they hadn’t been rotated for a month, which made finding any useful info impossible:

# ls -ltrh /var/log/atop/
total 7.8G
-rw-r--r-- 1 root root 3.4M Jul 6 00:15 atop_20160705
-rw-r--r-- 1 root root 290M Jul 21 20:20 atop_20160706
-rw-r--r-- 1 root root 864M Aug 22 17:53 atop_20160721
-rw-r--r-- 1 root root 669M Aug 30 20:26 atop_20160822
-rw-r--r-- 1 root root 87M Aug 31 16:27 atop_20160830
-rw-r--r-- 1 root root 2.7G Sep 30 01:30 atop_20160831
-rw-r--r-- 1 root root 0 Sep 30 01:32 daily.log
-rw-r--r-- 1 root root 3.3G Oct 31 16:56 atop_20160930

atop doesn’t use logrotate to rotate its files; instead there is a file in /etc/cron.d which invokes the init script with the _cron argument every day at 00:00:

# cat /etc/cron.d/atop

# start atop daily at midnight
0 0 * * * root invoke-rc.d atop _cron

I tried to run it manually, but the result was the same: no logs were rotated. Logic suggests that if atop keeps writing to the same file, it may not have been restarted successfully. atop’s init script contains a _cron case, which uses the do_stop function:

      [ "$1" = "_cron" ] && VERBOSE="no"
      [ "$VERBOSE" != no ] && log_daemon_msg "Restarting $DESC" "$NAME"

Let’s check with strace (I won’t post its whole output):

# strace invoke-rc.d atop stop
execve("/usr/sbin/invoke-rc.d", ["invoke-rc.d", "atop", "stop"], [/* 20 vars */]) = 0
+++ exited with 0 +++
# /etc/init.d/atop status
● atop.service - LSB: Monitor for system resources and process activity
   Loaded: loaded (/etc/init.d/atop)
   Active: active (running) since Thu 2016-11-10 13:33:39 MSK; 1 day 5h ago
   CGroup: /system.slice/atop.service
           └─24693 /usr/bin/atop -a -w /var/log/atop/atop_20161110 20

So, although the return code is 0, the process is still running and writing to the old log file. invoke-rc.d has a --force flag, let’s try it:

# invoke-rc.d --force atop stop
# systemctl status atop
● atop.service - LSB: Monitor for system resources and process activity
   Loaded: loaded (/etc/init.d/atop)
   Active: inactive (dead) since Fri 2016-11-11 19:38:57 MSK; 1s ago
  Process: 24994 ExecStop=/etc/init.d/atop stop (code=exited, status=0/SUCCESS)
  Process: 23446 ExecStart=/etc/init.d/atop start (code=exited, status=0/SUCCESS)

Excellent, it works. Now we need to update the /etc/cron.d/atop file to add this flag to the command that runs every midnight. The file should look like this:

# cat /etc/cron.d/atop

# start atop daily at midnight
0 0 * * * root invoke-rc.d --force atop _cron

Now logs will rotate as usual.

hairpin nat alternative

Hairpin NAT (aka NAT loopback) is a technique used to resolve the situation when a resource, usually a web server, is located in the internal network but has an external IP address. It is accessible from the outside interface (in Cisco terminology), but since it shares a gateway with your machines (of course, I mean the backbone router, not the subnet’s), internal machines cannot reach it by the external address.
The problem has at least three solutions:
1. Use a non-RFC1918 IP address with identity NAT or similar. But this implies serious changes in routing and security policies;
2. Configure hairpin NAT, which has the same weaknesses and is not an option if you don’t want to change your routing policies;
3. And, this article’s subject: configure the internal DNS server to resolve the usual FQDN to the internal IP. I’ll use BIND as an example.

You must set up a new zone describing the domain you have (e.g. .com, or, if you’re configuring a third-level domain, use yoursite.com to simplify the configuration), configure your site in there, and add this line at the end:

* IN NS $upstream_dns_ip

And restart bind9.
From now on, all users who connect to your web server from inside will get its internal IP, and users from outside will use the external one.
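A minimal sketch of such a zone (every name and address here is a placeholder; the SOA/NS boilerplate is the usual BIND zone-file syntax):

```
$TTL 300
@       IN  SOA  ns1.yoursite.com. admin.yoursite.com. (
                 2016010101 ; serial
                 3600       ; refresh
                 600        ; retry
                 86400      ; expire
                 300 )      ; negative TTL
@       IN  NS   ns1.yoursite.com.
www     IN  A    10.0.0.10  ; internal address of the web server
```

The wildcard NS line from above then goes at the end so everything else in the domain is still resolved upstream.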

atlassian crowd hangs on login after server ip change

Unlike other Atlassian products, Crowd has its server IP hardcoded in its settings. So if you get a timeout error on login (in my case it was a 504) after the server IP has changed, you need to update it in the crowd.properties file or in /etc/hosts, depending on your configuration.
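For example, in crowd.properties the relevant property looks like this (host and port are placeholders; crowd.server.url is the property name from Atlassian’s documentation):

```
crowd.server.url=http://crowd.example.com:8095/crowd/services/
```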

Here are messages from crowd log file which describe the problem:

[timestamp]http-8095-2 ERROR [crowd.integration.springsecurity.CrowdSSOAuthenticationProcessingFilter] Unable to unset Crowd SSO token
org.codehaus.xfire.XFireRuntimeException: Could not invoke service.. Nested exception is org.codehaus.xfire.fault.XFireFault: Couldn't send message.
org.codehaus.xfire.fault.XFireFault: Couldn't send message.
Caused by: java.net.ConnectException: Connection timed out: connect
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:529)
	at java.net.Socket.connect(Socket.java:478)
	at java.net.Socket.&lt;init&gt;(Socket.java:375)
	at java.net.Socket.&lt;init&gt;(Socket.java:249)