Troubleshooting 502 Errors in a Docker-Based Deployment

Background: Business Architecture

[Business architecture diagram]

Deployment details: both containers run on the same machine, are orchestrated by docker-compose, and are connected through Docker's legacy link mechanism.
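For reference, a minimal docker-compose sketch of this kind of setup might look like the following. The service and image names are illustrative (the actual compose file isn't shown here); the relevant part is the links entry wiring the UI container to the API container:

ui:
  image: example/ui            # hypothetical image: nginx serving the frontend
  ports:
    - "80:80"
  links:
    - api:detectapi            # the alias `detectapi` shows up later in /etc/hosts
api:
  image: example/api           # hypothetical image: uwsgi + Flask backend
  expose:
    - "8080"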

Issue Description

After a code update, the frontend page would still load, but every API request it made returned 502 (Bad Gateway).


Issue Troubleshooting

Checking the logs of the frontend container compose_ui_1 showed a flood of 502 (Bad Gateway) responses:
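Since (as it turns out later) nginx inside compose_ui_1 logs straight to standard output and standard error, a quick way to spot the flood is to filter docker logs for the status code. A sketch, assuming the default combined log format:

docker logs --tail 200 compose_ui_1 | grep ' 502 '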

If the UI itself is fine, the first suspicion is that compose_api_1 has failed, so I went straight into that container to check its logs.

The container logs look normal: no crashes, and in fact they read as if no request ever arrived. Yet I'm sure the frontend called it, which is odd. I tried calling the API endpoints separately to see whether anything changed:

Calling the endpoint separately still returns a blunt 502 (Bad Gateway), so something is off. Could the request be going to the wrong host or port? I used Wireshark on the host to confirm the host and port the requests were actually hitting:
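The original capture was done in Wireshark; an equivalent command-line check with tcpdump (a substitute shown here purely for illustration, with the interface and port as assumptions) would be roughly:

# on the host: watch which host:port the browser's API requests are actually sent to
# adjust 'port 80' to whatever port the frontend is published on
tcpdump -i any -nn -A 'tcp port 80'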

This confirms that the host and port accessed through the frontend container compose_ui_1 are correct, and the response really is a 502 (Bad Gateway). So the investigation has to move to compose_api_1.

I've run into something similar before: compose_api_1 runs a Python Flask app behind uwsgi, which has caused trouble in the past. Tweaking the uwsgi configuration kept it quiet for a while, and now the problem seemed to be back.

First, let’s determine if compose_api_1 truly failed… even though hope is low…

Accessing the backend API directly:
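Roughly what that direct check looks like; the endpoint path is hypothetical, and the address uses the backend container's bridge IP that shows up later in the write-up:

# bypass the UI container and call the backend directly from the host
curl -i http://172.17.0.4:8080/api/detect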

Uh… awkward… it looks like I accused the wrong party. That can't be right; let's capture packets again to confirm:

It really does seem fine… let's take another look at the container logs:

Uh…okay…I’ve made a mistake, compose_api_1 hasn’t failed.

So here's the puzzle: the backend API works fine on its own, but requests through the frontend fail. What's going on?

I have a hunch it's something specific to how the containers are wired together. Hopefully not…

Let's go into the compose_ui_1 container, capture packets there, and see whether anything is wrong along the request chain:
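A sketch of that capture, assuming tcpdump is available inside the image (the port filter follows the 8080 backend port seen throughout):

docker exec -it compose_ui_1 bash
# inside the container: watch what happens when nginx proxies a request
tcpdump -i eth0 -nn 'tcp port 8080'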

Something fishy is going on: Flags [R.] in the capture means the TCP connection is being reset. But why would it be reset out of nowhere?

The resets are coming back from 172.17.0.5.8080 (tcpdump's IP.port notation), so let's ask that address directly with telnet first:
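A check along these lines, assuming telnet is installed in the image:

# still inside compose_ui_1: can we open a TCP connection to that address at all?
telnet 172.17.0.5 8080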

What??? This is puzzling. First, where did 172.17.0.5:8080 come from? And second, why is the port unreachable?

Then I suddenly remembered an important question:


How does the container know where to send its requests?

As mentioned earlier, the two containers are connected through Docker's link mechanism, like this:

A Google search turned up how the link mechanism works:


The link mechanism exposes connection information through environment variables (things like database passwords can be passed the same way). Docker imports the environment variables defined in the source container into the receiving container, where they can be read to obtain connection details. After linking, you can also communicate with the target container by a given name, which is implemented by adding a name-to-IP mapping to /etc/hosts.
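For example, the variables injected for a link aliased detectapi on port 8080/tcp have roughly this shape (exact names depend on the alias and exposed ports; the listing below is illustrative, not captured output):

docker exec compose_ui_1 env | grep DETECTAPI
# DETECTAPI_PORT=tcp://172.17.0.4:8080
# DETECTAPI_PORT_8080_TCP_ADDR=172.17.0.4
# DETECTAPI_PORT_8080_TCP_PORT=8080
# DETECTAPI_PORT_8080_TCP_PROTO=tcp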

So inside compose_ui_1, the linked name gets resolved to a concrete IP through /etc/hosts, and that is how the containers talk to each other. What does that mapping actually look like?

compose_ui_1’s /etc/hosts


root@e23430ed1ed7:/# cat /etc/hosts
127.0.0.1    localhost
::1    localhost ip6-localhost ip6-loopback
fe00::0    ip6-localnet
ff00::0    ip6-mcastprefix
ff02::1    ip6-allnodes
ff02::2    ip6-allrouters
172.17.0.4    detectapi fc1537d83fdf compose_api_1
172.17.0.3    authapi ff83f8e3adf2 compose_authapi_1
172.17.0.3    authapi_1 ff83f8e3adf2 compose_authapi_1
172.17.0.3    compose_authapi_1 ff83f8e3adf2
172.17.0.4    api_1 fc1537d83fdf compose_api_1
172.17.0.4    compose_api_1 fc1537d83fdf
172.17.0.6    e23430ed1ed7

If this information is right, 172.17.0.4:8080 should be the correct address for compose_api_1. Let's test that first:
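A sketch of that test from inside compose_ui_1; the endpoint path is hypothetical, the address comes straight from the hosts file above:

# by IP, and by the linked name, which should be equivalent
curl -i http://172.17.0.4:8080/api/detect
curl -i http://detectapi:8080/api/detect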

Even though the response was "auth product is None", the request itself went through fine.

Next, check the logs of the compose_api_1 container:

So there's no need to look any further: the frontend gets a 502 because the UI container is proxying requests to the wrong address.

But why is it like this? Did it go berserk for no reason?

As the experiment above showed, sending API requests to the address mapped in /etc/hosts works without issue:

Checking the nginx logs of compose_ui_1

Awkward… the nginx logs go straight to standard output and standard error inside the container… so the simplest way to read them is docker logs.

It looks like nginx is forwarding to the wrong place. Why would it forward to 172.17.0.5? Let's look at nginx's proxy configuration:
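The actual config isn't reproduced here, but it boils down to proxy_pass pointing straight at the linked name, roughly like this (the location path is illustrative):

location /api/ {
    proxy_pass http://detectapi:8080;    # linked name, resolved via /etc/hosts
}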

The name detectapi resolves to the correct address 172.17.0.4 in the hosts table above, so I can't figure out why nginx forwards to 172.17.0.5.

Could it be a domain name resolution error in the system?

This is truly bizarre.

A man’s intuition tells me there’s something suspicious with nginx!

I tried restarting nginx inside the container, which ended up restarting the whole container…
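That's expected if nginx runs as PID 1 in this image (an assumption about the image, not something confirmed in the write-up): stopping the master process stops the container. Two ways to bounce it:

docker restart compose_ui_1
# or, reload the configuration without killing the container:
docker exec compose_ui_1 nginx -s reload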

Reaccessing the page, and it works now…

Rechecking the container's nginx logs confirms the requests are now forwarded successfully.

With that, the problem can be pinned down to nginx itself.

Fault Localization

But why would nginx have such an error? It shouldn’t be. It feels like it’s an internal nginx domain name resolution cache issue.

Searching around confirms it: this is a known issue. https://www.zhihu.com/questio…

This is a bit embarrassing. Still skeptical, I asked a senior expert, and his reply was:


If proxy_pass is followed by a domain name, it is resolved once when nginx starts and that value is reused afterwards; refer to the ngx_http_upstream_init_round_robin function. If proxy_pass is followed by an upstream, then the resolving and caching logic applies.

Improvement Measures

  1. Instead of pointing proxy_pass directly at a real domain name, forward to an upstream block (see the sketch after this list);
  2. Alternatively, follow the treatment plan in the earlier Zhihu link: https://www.zhihu.com/questio…;
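A sketch of measure 1, following the names used earlier (the real location path and ports may differ):

upstream detect_backend {
    server detectapi:8080;        # linked name from /etc/hosts
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://detect_backend;
    }
}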

Additional Problems

  1. Why did the address configured for compose_api_1 inside compose_ui_1 go wrong in the first place?
  2. If proxy_pass is followed by a real domain name, is the resolved address reused indefinitely, or is it cached with some TTL?

I originally wanted to use gdb to debug this issue, but after spending a day, nothing came of it. However, there was a small takeaway, which is how to configure nginx to support gdb:

1. Edit the compile configuration file: auto/cc/conf


Original: ngx_compile_opt="-c"
Change to: ngx_compile_opt="-c -g"

2. When running ./configure, add the compile flag --with-cc-opt='-O0' to avoid compiler optimization. Example: ./configure --prefix=/usr/local/nginx --with-cc-opt='-O0' .... Without it, the compiler optimizes the code, making it impossible to print some variable values during debugging, which produces errors like the one below:


value optimized out

Here you can see the debugging effect. Entry point for processing in the nginx worker process: ngx_http_static_handler
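The gdb session itself isn't reproduced here, but with the debug build above it goes roughly like this (the breakpoint matches the handler named above; the worker PID is looked up at run time):

# attach to an nginx worker process of the debug build
gdb -p "$(pgrep -f 'nginx: worker' | head -n 1)"
(gdb) break ngx_http_static_handler
(gdb) continue
# now issue a request against nginx; gdb stops at the handler and variables can be inspected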