Troubleshooting in Different Environments
Development Environment
Any familiar tool can be used for troubleshooting freely. As long as the issue can be reproduced, it won't be too difficult to track down. Most of the time this means debugging the program down into the framework source code, which is why interviews often ask about source code. You are not expected to have read all of it, but you should have a strategy for digging into it when a problem demands it.
Test Environment
Compared to the development environment, debugging is less convenient, but you can use tools like jvisualvm or Arthas and attach them to the remote JVM process for troubleshooting. Note that the test environment allows creating data to simulate the scenarios we need, so if issues occur, you can ask testers to create some data to reproduce the bug.
Production Environment
This is where troubleshooting is most difficult:
- Permission controls are strict in production; debugging tools usually cannot be attached to processes remotely.
- Recovery takes priority when production problems occur, so it is hard to set aside enough time for troubleshooting. Yet because production traffic is real, volumes are high, and network permissions are strict, problems are more likely to surface there, making it the environment with the most issues.
Monitoring
When issues occur in the production environment, the application must be restored quickly, so it is impossible to keep a complete live environment around for troubleshooting and testing. What becomes crucial is whether there is enough information (logs, monitoring, and snapshots) to understand the history and recreate the bug scenario. Logs collected with ELK are the most commonly used source; note the following:
- Ensure that error and exception information is recorded completely in the file logs;
- In production, set the program's log level to INFO or above, and assign levels sensibly: DEBUG for development debugging, INFO for significant process information, WARN for issues that need attention, and ERROR for errors that break the flow (see the logging sketch below).
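As a rough illustration of these level conventions, here is a minimal SLF4J sketch; the class, method, and messages are invented for the example:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        // DEBUG: detail that only matters while developing or debugging
        log.debug("Entering placeOrder with orderId={}", orderId);

        // WARN: something unexpected that needs attention but does not break the flow
        if (orderId == null || orderId.isEmpty()) {
            log.warn("Received an order with an empty id, generating one instead");
            orderId = java.util.UUID.randomUUID().toString();
        }

        try {
            // ... business logic ...

            // INFO: a significant milestone in the business process
            log.info("Order {} accepted", orderId);
        } catch (RuntimeException e) {
            // ERROR: the flow is broken; log the full exception, not just its message
            log.error("Failed to place order {}", orderId, e);
            throw e;
        }
    }
}
```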
Regarding monitoring: to troubleshoot production issues, the development and operations teams must first put a comprehensive, multi-layer monitoring plan in place:
- At the host level, monitor basic resources such as CPU, memory, disk, and network. If applications are deployed on virtual machines or a Kubernetes cluster, then in addition to the physical machine's basic resources, apply the same monitoring to the virtual machines or Pods. How many layers of monitoring you need depends on the deployment scheme: there should be one layer of monitoring for each OS layer.
- At the network level, monitor dedicated-line bandwidth, the basic condition of switches, and network latency.
- All middleware and storage should be monitored as well, not just their processes' basic metrics such as CPU, memory, disk IO, and network usage, but more importantly the components' internal key metrics. For example, the well-known monitoring tool Prometheus offers numerous exporters for interfacing with all kinds of middleware and storage systems.
- At the application layer, the metrics to monitor include class loading, memory, GC, and threads of the JVM process (for example, using Micrometer for application monitoring; a minimal sketch appears below). In addition, make sure application logs and GC logs can be collected and retained.

Now let's look at snapshots. Here, "snapshots" refer to the application's state at a specific moment. Typically, we start production Java applications with the JVM parameters -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=… so that a heap snapshot is retained when an OOM occurs. In this course, we have frequently used the MAT tool for heap snapshot analysis.
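To illustrate the application-layer monitoring mentioned above, here is a minimal Micrometer sketch that binds the library's built-in JVM meters (class loading, memory, GC, threads) to a registry. The use of SimpleMeterRegistry is an assumption to keep the example self-contained; in a real setup you would use the registry for your monitoring backend, such as Prometheus.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class JvmMonitoringSetup {

    public static MeterRegistry createRegistry() {
        // In a real deployment this would typically be the registry for your
        // monitoring backend; SimpleMeterRegistry keeps the sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();

        // The four JVM concerns called out above: class loading, memory, GC, threads.
        new ClassLoaderMetrics().bindTo(registry);
        new JvmMemoryMetrics().bindTo(registry);
        new JvmGcMetrics().bindTo(registry);
        new JvmThreadMetrics().bindTo(registry);

        return registry;
    }
}
```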
After understanding the past and recreating the scene, let's explore how to diagnose the problem.
Methods for Diagnosing Problems
Identifying an issue begins with determining at which level the problem originates: for example, is it a problem with the Java application itself, or is it caused by external factors? We can start by checking whether the program has thrown exceptions, since exception information is usually specific and can quickly point to the general direction of the problem. Resource-consumption issues, however, may not manifest as exceptions; for those, use metric monitoring combined with the visible symptoms to locate the problem.
Generally, program issues arise from three aspects.
First, bugs introduced by a new release. These can be resolved quickly by rolling back; the differences between the versions can then be analyzed at a slower pace after the rollback.
Second, problems caused by external factors such as the host, middleware, or the database. These can be troubleshot systematically. Host-level issues can often be diagnosed with standard tools:
- For CPU-related issues, use tools like top, vmstat, pidstat, and ps;
- For memory-related issues, use tools such as free, top, ps, vmstat, cachestat, and sar;
- For IO-related issues, use tools like lsof, iostat, pidstat, sar, iotop, df, and du;
- For network-related issues, use tools like ifconfig, ip, nslookup, dig, ping, tcpdump, and iptables.

Component issues can be probed from several angles:
- Check whether the host running the component has issues;
- Examine the basic status of the component process and its various monitoring metrics;
- Review the component's log output, particularly the error logs;
- Log in to the component's console and use its commands to check the operational status.

Third, system resource constraints that make the whole system appear to hang. These usually have to be handled by restarting and scaling out first and analyzing afterwards, although ideally one node should be kept aside in its faulty state as a reference. System resource limitations typically show up as high CPU usage, memory leaks or OOM, IO issues, and network-related issues.
For high CPU usage, if the faulty state is still present, the analysis procedure is:
- First, run the top -Hp pid command on the Linux server to see which threads inside the process are using a lot of CPU;
- Then press uppercase P to sort the threads by CPU usage, and convert the ID of the thread that is clearly hogging the CPU to hexadecimal;
- Finally, search for that thread ID in the output of the jstack command to find the problematic thread's call stack at that moment.

If running top directly on the server is not feasible, we can identify the issue by sampling instead: run the jstack command at a fixed interval (for example, every 10 seconds), compare the samples after a few iterations to see which threads stay consistently busy, and analyze those threads (a sketch of the same idea appears below).
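For illustration only, the same sampling idea can be approximated from inside the JVM with the standard ThreadMXBean API. This is a hedged sketch, not a substitute for top -Hp and jstack: it assumes thread CPU time measurement is supported and enabled, and the 10-second interval and 1-second threshold are arbitrary choices.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class HotThreadSampler {

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.err.println("Thread CPU time not supported on this JVM");
            return;
        }
        threads.setThreadCpuTimeEnabled(true);

        // First sample: CPU time per thread id, in nanoseconds.
        Map<Long, Long> before = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            before.put(id, threads.getThreadCpuTime(id));
        }

        Thread.sleep(10_000); // sample interval, mirrors "every 10 seconds" above

        // Second sample: report threads that burned the most CPU in between.
        for (long id : threads.getAllThreadIds()) {
            long now = threads.getThreadCpuTime(id);
            long prev = before.getOrDefault(id, 0L);
            if (now < 0 || prev < 0) {
                continue; // thread died or CPU time unavailable
            }
            long deltaMillis = (now - prev) / 1_000_000;
            if (deltaMillis > 1_000) { // arbitrary threshold for "noticeably busy"
                ThreadInfo info = threads.getThreadInfo(id, 20);
                if (info != null) {
                    System.out.println(info.getThreadName() + " used " + deltaMillis + " ms CPU");
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }
}
```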
If the faulty state is no longer present, we can analyze by elimination. High CPU usage is generally caused by factors such as the following:
- Burst pressure. This can be confirmed from the traffic recorded by the load balancer in front of the application or from the application's log volume; reverse proxies such as Nginx log URLs, so the proxy's access log allows finer-grained identification, and JVM thread count monitoring is another signal. If there is no obviously improper resource usage and CPU is high purely because of pressure, stress testing combined with a profiler (jvisualvm offers this feature) can locate the hotspot methods; if resources are being used improperly, for example thousands of threads have been created, then the configuration should be tuned.
- GC. This can be confirmed from GC-related monitoring metrics and the JVM's GC log. If it really is GC pressure, memory usage is most likely also improper, so continue with the memory-issue analysis process below.
- Dead loops or improper processing flows in the program. These can be diagnosed from application logs: logs are normally produced steadily as the application runs, so pay attention to the sections where the log volume is abnormal.

For memory leaks or OOM issues, the simplest analysis is to take a heap dump and analyze it with MAT. A heap dump contains a complete snapshot of the heap and the thread stacks; the dominator tree view or the histogram quickly highlights the objects occupying the most memory, allowing the source of the memory problem to be identified rapidly (a toy example appears below). This aspect will be elaborated in the fifth installment.
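To make the heap-dump-plus-MAT workflow concrete, here is a deliberately leaky toy program, entirely invented for illustration. Run it with the -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=… parameters mentioned earlier (and a small heap such as -Xmx256m so it fails quickly), then open the resulting .hprof file in MAT; the dominator tree should point straight at the static list.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {

    // A static collection that only ever grows is a classic leak pattern:
    // everything added here stays reachable for the lifetime of the JVM.
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            // Each iteration pins another 1 MB in the "cache" and never removes it,
            // eventually triggering java.lang.OutOfMemoryError: Java heap space.
            CACHE.add(new byte[1024 * 1024]);
        }
    }
}
```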
Note that a Java process's memory usage is not limited to the heap; it also includes the memory used by threads (number of threads × per-thread stack size) and the metaspace. Each of these areas can produce its own OOM, which can be analyzed by monitoring metrics such as the thread count and the loaded class count (see the sketch below). Also check the JVM parameter settings for unreasonable constraints on resource usage.
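As a small sketch of how those non-heap indicators can be observed from inside the process, the standard java.lang.management MXBeans expose thread counts, loaded-class counts, and per-pool memory usage. Printing to stdout here is purely illustrative; in practice these values would feed the monitoring system described earlier.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class NonHeapIndicators {

    public static void main(String[] args) {
        // Thread count drives stack memory: roughly threads * per-thread stack size (-Xss).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Live threads: " + threads.getThreadCount());

        // Loaded class count drives Metaspace usage.
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("Loaded classes: " + classes.getLoadedClassCount());

        // Per-pool usage shows heap and non-heap areas (e.g. Metaspace) separately.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) {
                System.out.println(pool.getName() + ": used=" + usage.getUsed()
                        + " max=" + usage.getMax());
            }
        }
    }
}
```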
IO-related issues, unless caused by code problems such as failing to release resources, are typically not triggered by factors inside the Java process itself.
Network-related issues typically stem from external factors. Connectivity problems are relatively easy to pinpoint when combined with the exception information; performance or transient issues can first be assessed with tools like ping and, if necessary, analyzed more deeply with tcpdump or Wireshark.
Points to Consider When Analyzing and Identifying Issues
Sometimes we hit a wall or lose direction while analyzing and locating an issue. In such situations, consider these nine insights of mine:
First, consider the "chicken and egg" question. For instance, if you notice that business logic is executing slowly while the thread count is increasing, consider two possibilities:
- One, the program logic has a problem or an external dependency has slowed down, so the business logic executes more slowly; with the access rate unchanged, more threads are needed. For example, a 10 TPS workload whose requests used to complete in 1 second needed about 10 threads, but once each request takes 10 seconds, about 100 threads are required (see the calculation below).
- Two, the request volume has increased, so more threads are created; CPU resources become insufficient and, combined with the extra context switching, processing slows down.

When an issue arises, correlate the application's internal performance with the incoming traffic to determine whether the "slowness" is the cause or merely a symptom.
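The arithmetic behind the first possibility is essentially Little's Law, a standard queueing result rather than anything specific to this lesson: the number of in-flight requests, and hence the threads needed to carry them, is roughly throughput times response time. Writing $N$ for threads in use, $X$ for throughput, and $R$ for response time:

$$
N \approx X \times R, \qquad 10\ \mathrm{req/s} \times 1\ \mathrm{s} = 10\ \text{threads}, \qquad 10\ \mathrm{req/s} \times 10\ \mathrm{s} = 100\ \text{threads}.
$$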
Second, try categorizing to find patterns. When there are no leads for pinpointing the issue, try to summarize regularities from the data you have.
For instance, with 10 application servers behind a load balancer, log analysis can reveal whether the issue is spread evenly across machines or concentrated on one of them. Application logs usually record thread names, so the logs may show that the problem is concentrated in a particular class of threads. And if a huge number of TCP connections is observed, netstat can show which service most of those connections are targeting.
Recognizing patterns might highlight breakthrough points.
Third, analyze problems against the call topology rather than by assumption. For example, a 502 error from Nginx is usually taken to mean that the downstream service failed, so the gateway could not forward the request. But it would be premature to conclude that the downstream service is our Java program: the topology may show that Nginx actually proxies Kubernetes's Traefik Ingress, so the chain is Nginx -> Traefik -> application. If you only dig into the health of the Java program, you may never find the root cause.
Similarly, when services are called through Spring Cloud Feign, a connection timeout is not necessarily a server-side problem: the client may be calling the server via a fixed URL rather than through Eureka's client-side load balancing with a direct connection, in which case a connection timeout on the client may actually mean the Nginx proxy it points at is down (see the sketch below).
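As a hedged illustration of that setup (the service name and URL below are invented), a Feign client pinned to a fixed url goes straight to whatever sits at that address, for example an Nginx proxy, instead of being load-balanced through the Eureka registry; if that proxy is down, the client sees a connection timeout even though the service instances behind it may be healthy:

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Because url is set, requests bypass Eureka's client-side load balancing and go
// directly to this address; connection timeouts therefore reflect the proxy's health.
@FeignClient(name = "user-service", url = "http://nginx.internal:8080")
public interface UserClient {

    @GetMapping("/users/{id}")
    String getUser(@PathVariable("id") long id);
}
```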
Fourth, learn to spot resource-limit issues. Watch the curves of the various metrics: a curve that climbs slowly and then flattens out at a fixed level usually means some resource has reached its limit or bottleneck.
For example, a network bandwidth curve that rises to about 120 MB/s and stays there likely means a gigabit NIC or link (roughly 125 MB/s of usable bandwidth) is saturated. Likewise, active database connections flat at 10 suggest the connection pool is maxed out. Curves like these in the monitoring deserve attention.
Fifth, consider how resources interact. CPU, memory, IO, and network are like the organs of a body: a bottleneck in one can set off a chain reaction in the others.
For instance, when a memory leak leaves objects unreclaimable, frequent full GCs follow, a large amount of CPU is spent on GC, and CPU utilization rises. Another common pattern is caching data in in-memory queues for asynchronous IO processing; if the network or disk has problems, memory usage is likely to surge. Keep these interactions in mind when problems arise to avoid misjudging the cause.
Sixth, analyze network-related issues along three lines: is the problem on the client side, on the server side, or in transmission? For example, slow database access might come from the client: the connection pool is too small so acquiring a connection is slow, GC pauses occur, or the CPU is maxed out. It might also come from transmission: fiber links, firewalls, or routing table configuration. Only after ruling these out, step by step, can it be pinned on the server.
Server-side slowness usually shows up in MySQL's slow query log, and transmission slowness can be checked with a simple ping; once those are ruled out, slowness affecting only certain clients points to a client-side issue. When access to a third-party system, service, or storage is slow, do not simply assume the server side is at fault.
Seventh, combine snapshot tools with trend tools. jstat, top, and the various monitoring charts are trend tools: they show how metrics change over time and locate the general problem area. jstack and MAT working on heap dumps are snapshot tools: they reveal the application's state at one instant.
Generally, use trend tools first to identify the pattern, then use snapshots to drill into the details. Working in the reverse order can be misleading, because a snapshot captures only a single moment of the program; without trend data it is easy to draw the wrong conclusion. If you have no trend support, at the very least compare several snapshots taken at different times.
Eighth, do not be quick to dismiss the monitoring. In one air-crash analysis, every fuel gauge showed low fuel, but the pilot's first instinct was that the gauges were faulty rather than that fuel was actually running out, and the engines flamed out shortly afterwards. Likewise, when application issues arise, we rely on numerous monitoring systems, yet sometimes we trust our gut over the charts, sending the troubleshooting down the wrong path.
If you doubt the monitoring, check whether it reads normally for an application that has no faults; if it does, trust the monitoring over your intuition.
Ninth, if the root cause cannot be traced because monitoring is missing, the same issue is at risk of recurring, so three things need to be done:
- Fix the gaps in logging, monitoring, and snapshot collection so that the root cause can be pinpointed the next time the issue occurs;
- Build real-time alerting on the symptoms so that the issue is detected promptly;
- Consider a hot-standby solution so that traffic can be switched to a backup system quickly when the issue strikes, while the old system is preserved in its faulty state.
Key Takeaways
Today, I summarized the thinking behind analyzing production issues.
First, analyzing a problem requires solid monitoring to be in place beforehand. Monitoring spans the infrastructure (operations), application, and business levels, and locating a problem usually means correlating indicators across several of these layers.
Second, locating a problem starts with broad classification: is it internal or caused by external factors? Is it CPU-related or memory-related? Does it affect only interface A or the whole application? Then drill down into the details. When the analysis stalls, it sometimes helps to set the detailed investigation aside, step back, and reassess everything that has been touched as a whole.
Third, locating a problem relies heavily on experience and intuition; no methodology is fully comprehensive. Under time pressure we quickly probe the most likely directions based on intuition, and sometimes a bit of luck is involved. I shared nine insights above; I encourage you to keep reflecting and summarizing, building up your own problem-solving playbook and fluency with the tools.
Lastly, once the cause of an issue has been found, take the time to document it and hold a retrospective. Every fault and incident is a valuable learning opportunity; a retrospective is not just a record, it is about improvement. A good retrospective covers four aspects:
- Lay out the complete timeline, the interventions taken, and the reporting process;
- Analyze the root cause;
- Propose short-, medium-, and long-term improvements, including but not limited to code changes, SOPs, and process changes, and follow them through to closure;
- Review past incidents regularly as a team.