Help! Cloudflare 522 Errors every once in a while for 5 to 15 minutes

michacassola

I have ruled out several things and I am going crazy with this, can't solve it.

I use a larger VPS with LXD/LXC containers on it; packets get routed from eth0 to the containers, which have globally routable IPv6 addresses. My containers are connected to Cloudflare over IPv6 (AAAA records) only, and Cloudflare provides IPv4 compatibility. Cloudflare IPs are whitelisted in UFW on both host and containers.

The 522 outage happens when I am very active on a site or on the Webmin panels I have installed in each of my containers. When Cloudflare gets blocked, it seems to affect only my IP, possibly through the rate limiting of nginx (though I am not sure why it also happens with Webmin): I can still access the site through a VPN connection (that is, from another IP) or through a GTmetrix retest, and I can access Webmin directly with the IPv6 address and the port. Uptime testing doesn't report any downtime either. In the Webmin panel during an outage I see barely any load on the container or host system.

So far so good, it has to be some kind of filtering happening.
I already tried increasing:

# Limit Request
	limit_req_status 403;
	limit_req_zone $binary_remote_addr zone=one:10m rate=150r/s;
	limit_req_zone $binary_remote_addr zone=two:10m rate=550r/s;

to ridiculous amounts in one container for one site to rule that out. But it happened again while I was working on the site in that container.
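For anyone following along, here is a minimal sketch of how a limit_req-style limiter behaves, assuming a simplified leaky bucket keyed on the client address with no burst. The class and method names are made up for illustration and this is not nginx's actual code. It is meant to show two things: the limit is tracked per key, and a rejected request still gets an HTTP answer from nginx (403 with the limit_req_status above), which is not the same thing as a 522, where Cloudflare times out trying to reach the origin.

```python
class LimitReqSketch:
    """Simplified per-key leaky bucket, loosely modelled on limit_req."""

    def __init__(self, rate_per_second, burst=0):
        self.rate = rate_per_second
        self.burst = burst
        # key -> (excess requests, time of the last accepted request)
        self.buckets = {}

    def allow(self, key, now):
        """Return True if a request from `key` at time `now` is accepted."""
        if key not in self.buckets:
            self.buckets[key] = (0.0, now)
            return True
        excess, last = self.buckets[key]
        # Drain the bucket for the elapsed time, then add this request.
        excess = max(0.0, excess - self.rate * (now - last) + 1.0)
        if excess > self.burst:
            # Rejected: the state is left untouched in this sketch.
            return False
        self.buckets[key] = (excess, now)
        return True


if __name__ == "__main__":
    limiter = LimitReqSketch(rate_per_second=150)
    # 2001:db8::/32 is the IPv6 documentation prefix, used as a placeholder.
    print(limiter.allow("2001:db8::1", now=0.000))  # True
    print(limiter.allow("2001:db8::1", now=0.001))  # False, too soon
    print(limiter.allow("2001:db8::2", now=0.001))  # True, different key
    print(limiter.allow("2001:db8::1", now=0.010))  # True, bucket drained
```

If the key nginx sees were a Cloudflare address instead of the visitor's, every visitor behind that address would share one bucket, so it may be worth checking what $binary_remote_addr actually resolves to in the access log.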

Could it be that keepalive with 500 connections is still too low?
Should we use another nginx where we can use "allow cloudflare-ipv6"?
What do I do?

Please, anybody who can help will be much appreciated!

Whoever helps me actually figuring this out will get a few cups of coffee worth from me through paypal or alternatively 200g of the best black tea in the world, I'll send it anywhere you want.

Thanks in advance!

JuanMaia

Hey man, this article has something about error 522: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors There may be something for you there.

michacassola

@JuanMaia Went through it twice at least already. I know the problem must be some kind of rate limiting but I can't pin-point it.

And trust me, I know how to google. 😉 Thanks though!

michacassola

I am not sure but I think this is what happens to the CPU etc. when the outage happens:

You see from top to bottom: CPU, Memory, Processes, Disk IO, Network IO

This Cloudflare Article is even better https://community.cloudflare.com/t/community-tip-fixing-error-522-connection-timed-out/42325 @JuanMaia

JuanMaia

Did you look at your firewall logs?

michacassola

JuanMaia Yes, ufw only blocked incoming ipv4 traffic, the containers run ipv6 only, so that can be ruled out. Logging is set to low in ufw.
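A quick way to double-check a UFW log for this, sketched under a few assumptions: the log uses the usual kernel-style "[UFW BLOCK] ... SRC=... DST=..." lines, and the ranges below are only a hand-copied sample of Cloudflare's published ranges (the current list is at https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6). The function name is made up for illustration.

```python
import ipaddress
import re

# Assumption: a small sample of Cloudflare's published ranges, copied by hand.
# Replace it with the full, current list before relying on the result.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(net)
    for net in (
        "173.245.48.0/20",
        "104.16.0.0/13",
        "2400:cb00::/32",
        "2606:4700::/32",
        "2a06:98c0::/29",
    )
]

SRC_FIELD = re.compile(r"\bSRC=(\S+)")


def blocked_cloudflare_sources(log_lines):
    """Return source addresses of blocked packets that fall in the ranges."""
    hits = []
    for line in log_lines:
        if "[UFW BLOCK]" not in line:
            continue
        match = SRC_FIELD.search(line)
        if match is None:
            continue
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # not a parseable address, skip the line
        if any(addr in net for net in CLOUDFLARE_RANGES):
            hits.append(str(addr))
    return hits


if __name__ == "__main__":
    # Hypothetical log lines, only to show the expected shape of the input.
    sample = [
        "kernel: [UFW BLOCK] IN=eth0 OUT= SRC=2606:4700:10::1 DST=2001:db8::10 PROTO=TCP DPT=443",
        "kernel: [UFW BLOCK] IN=eth0 OUT= SRC=203.0.113.5 DST=198.51.100.7 PROTO=TCP DPT=22",
    ]
    print(blocked_cloudflare_sources(sample))
```

Keep in mind that with logging set to low, UFW rate-limits its log entries, so an empty result is a hint rather than proof.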

portofacil

@michacassola
Is Brotli enabled in your Nginx? (Just checking a hunch.)

michacassola

The brotli.conf file has .disabled appended, so I think it is disabled. Do I have to check in other conf files for it?

portofacil

No, that's enough. I thought of it because if I keep Brotli enabled on my servers, the CPU usage slowly escalates to an unmanageable level.

By default, WO whitelists all Cloudflare IPs, so no rate limiting should be applied to them.

What's the hardware of your server?

michacassola

Brotli is enabled on Cloudflare. I will disable HTTP/3 for now on Cloudflare and see what that does.

The hardware is good, Intel Xeon, KVM virtualized, lots of RAM. I really do not think it is the hardware. Or are you getting at something else?

I do think it might be something escalating the CPU.

portofacil

michacassola Cloudflare should have Brotli and HTTP/3 enabled. Their action has nothing to do with Nginx.

I suspected underpowered hardware because your CPU graph shows usage increasing steadily over time. It's weird for a server to act like this. I would expect the CPU line to be completely irregular, full of spikes and dips, not a line that keeps ascending until the crash.

michacassola

That image is from inside a container limited to 2 of the 8 CPUs. I had another outage of the container and this time the CPU load was up again, but not as linearly; I saw that when logging in through a VPN. But that might also be due to loading the Webmin stats (might Webmin be a cause...?) and the container checking for updates and whatnot...

What is also weird is that the outage hits all containers: when I can reach one I can reach them all, and when I can't reach one I can't reach any of them.

portofacil

michacassola That's how containers work, they don't have true isolation.

michacassola

It means that I do not understand the root of the issue. The containers are not overloaded and not really down, as I can access them through another network or directly through their IPv6 addresses. Usually the load is down long before they come back up for my original network, which suggests that the IP of my original network gets limited in some way on the root server and I get my very own 522 timeout error from Cloudflare.
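A 522 means Cloudflare timed out reaching the origin, so one way to narrow this down during an outage is a direct TCP probe against the container. Below is a small sketch; the function name, the address and the port are placeholders, not values from this setup. It only distinguishes "connection accepted", "refused" and "timed out", which is roughly the difference between a reachable origin, a closed port and silently dropped packets.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a plain TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout, e.g. packets dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or a name resolution error.
        return f"error: {exc}"


if __name__ == "__main__":
    # Placeholder address from the IPv6 documentation prefix; use the
    # container's real address and the port nginx listens on.
    print(probe_tcp("2001:db8::10", 443, timeout=3.0))
```

Running it at the same moment from the affected network and from a second network gives a before/after comparison that does not depend on Cloudflare at all.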

Please help me to figure out where this limit can come from, is there a kernel module or iptables/ufw thing I am not aware of? I read there is no standard rate limiting being applied by iptables/ufw and I am sure I did not set it up...
Is it another bug with netplan that affects this in some way... On-Link is broken for ipv6 in netplan and I did not specify it in my config... Or maybe networkd of systemd has some issues...

So many possibilities.

tersor

Can you reproduce the issue even if UFW is disabled?

michacassola

tersor
I had the firewall completely disabled and the issue occurred once again. So it is not UFW/iptables.
Also, I think I got hacked/attacked, as I saw some weird swap activity utilizing 100% of all of the CPU cores...

I also noticed that the Redis hugepage setting is not set to never as per the kernel tweak script. Is the kernel tweak script not automatically applied when installing or updating?
Also, is the ufw script automatically applied on install?
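To check the hugepage setting without guessing, the active transparent hugepage mode can be read from sysfs, where the kernel marks the current value with square brackets (for example "always madvise [never]"). A minimal sketch, assuming the standard Linux path; the function names are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location on Linux; it may be absent inside some containers.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs file contents."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_mode(path=THP_PATH):
    """Read the active mode from sysfs, or None if the file is unreadable."""
    try:
        return active_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_thp_mode())
```

Since containers share the host kernel, the value that matters here is most likely the one on the host.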

michacassola

That is part of the problem, I seem to not be able to reproduce the issue if I want to. I hammer the sites and systems in the containers with updates at the same time, nothing happens. But then just when I wrote here the other day and loaded webmin in 1 or 2 containers my good old friend 522 came back.

Since then I suspect that the new Webmin stats history is too much for the systems and temporarily overloads them. But then why can I usually access the containers directly long before they come back through Cloudflare? The mystery continues.

I will take your comment though as a tip to test. Will shut off ufw everywhere and hope I won't get attacked.
I do not get any block messages besides some ipv4 addresses in the ufw log though...
Thanks @tersor !

renatofrota

Are the errors 522 or 502?

I faced a problem recently with 502 errors through Cloudflare. The PHP-FPM sockets were marked offline by /etc/nginx/conf.d/upstreamd.conf due to PHP returning invalid error codes (PHP scripts with malformed header() functions).

I added "max_fails=0" to the FPM socket entries and restarted nginx, which resolved the issue immediately (nginx ignored the errors coming from the PHP scripts and never marked the sockets as offline). After that, calmer now that my server was back online 😆, I monitored the PHP error logs and finally found the problematic script.
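For readers wondering why max_fails=0 has that effect, here is a simplified model of nginx's passive failure accounting for an upstream server, assuming the documented defaults of max_fails=1 and fail_timeout=10s. The class is an illustration only, not nginx's implementation, and it ignores details such as how a group with a single server is handled.

```python
class UpstreamPeerSketch:
    """Toy model of max_fails / fail_timeout for one upstream server."""

    def __init__(self, max_fails=1, fail_timeout=10.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.last_fail = None

    def record_failure(self, now):
        """Count a failed attempt (error, timeout or invalid header)."""
        if self.max_fails == 0:
            return  # accounting disabled, failures are never counted
        if self.last_fail is not None and now - self.last_fail > self.fail_timeout:
            self.fails = 0  # the old failure window has expired
        self.fails += 1
        self.last_fail = now

    def record_success(self):
        """A successful attempt clears the failure counter."""
        self.fails = 0

    def is_available(self, now):
        """Return True if requests may be sent to this server at `now`."""
        if self.max_fails == 0 or self.fails < self.max_fails:
            return True
        # Marked as failed: skipped until fail_timeout has passed.
        return now - self.last_fail > self.fail_timeout


if __name__ == "__main__":
    default_peer = UpstreamPeerSketch()
    default_peer.record_failure(now=0.0)
    print(default_peer.is_available(now=5.0))   # False, still in the window
    print(default_peer.is_available(now=11.0))  # True, window has passed

    tolerant_peer = UpstreamPeerSketch(max_fails=0)
    tolerant_peer.record_failure(now=0.0)
    print(tolerant_peer.is_available(now=0.1))  # True, failures ignored
```

Note that this mechanism produces 502 responses from nginx itself, so it explains the 502 case described above rather than a 522.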

michacassola

Thanks renatofrota , the errors are always 522. But I will check the PHP logs for errors. Thanks.

michacassola

The PHP log shows a known error from a WordPress builder but nothing else. This error is shown all the time, so it is not specific to the outage. So there is no bad plugin there.

portofacil

Error 522 is on the network layer.

Have you tried a VPS on another provider, just for testing purposes?