[Repost] Troubleshooting 4xx and 5xx Errors with Azure APIM services
[Reference Link: https://techcommunity.microsoft.com/t5/azure-paas-blog/troubleshooting-4xx-and-5xx-errors-with-azure-apim-services/ba-p/2115744]
This is a continuation of the troubleshooting series and covers 5xx errors. The 4xx errors are covered in a separate article of the same series.
In the sections below, “Diagnostic/Gateway Logs” refers to the diagnostic logs stored in the ApiManagementGatewayLogs table in Log Analytics.
Error Code: 500
Scenario 1: HTTP error code 500 with BackendResponseCode logged as 500
Symptom:
A certain API call fails with the error message “500 – Internal Server Error”.
The diagnostic log for this specific failure shows 500 as the value of the BackendResponseCode column.
Cause:
If the BackendResponseCode value in the diagnostic logs is 500, the backend API returned a 500 response to the APIM service.
When the backend API itself returns status code 500 for an incoming request, the APIM service forwards the same response back to the client.
Resolution:
The issue has to be investigated further from the backend API perspective: the backend API provider has to verify why the backend servers are returning HTTP 500 errors.
Scenario 2: Expression Value Evaluation Failures
Symptom:
A few API requests may return a 500 response code because the policy expression that the API request invokes fails to evaluate.
The error message would be logged as follows:
“ExpressionValueEvaluationFailure: Expression evaluation failed. Object reference not set to an instance of an object.”
Cause:
This error normally occurs due to a “NullReferenceException”, where the policy expression attempts to read a parameter value that has not been defined yet or is set to null.
The ErrorSource column in the diagnostic logs would indicate the name of the policy that is causing the error during the evaluation.
Resolution:
The recommendation is to revisit the policy definition of the API operation that fails evaluation during request processing and fix the null reference exception.
Scenario 3: APIM Client Connection Failure with response code 0 or response code 500
Symptom:
In the gateway logs, you may observe scenarios where:
- The Response code column contains either 0 or 500
- The Error Reason column contains the value “ClientConnectionFailure”
- The Error Message column contains messages such as “The operation was cancelled”, “A task was cancelled”, et cetera.
Cause:
The term ‘Client Connection Failure’ means that the client application (which initiated the API call) terminated the connection with the APIM service before the backend API could return the expected response and before APIM could forward that response back to the client.
In other words, the client abandoned the request before the response could be received. APIM has no control over when or why the client decides to abandon the request.
These failures generally occur when the request takes too long to complete, so the client either gives up (a user may close the browser) or the client application times out.
Here are a few possible causes for such failures:
- Issues with client network
- Azure Virtual Network stability
- Issues with client application
- Low time-out value in client application
- Increased request processing time
- The Backend API takes abnormally long to respond (possibly due to large payload)
Most of the time, you can observe from the diagnostic logs that the clientTime values for these requests are quite high and contribute to most of the totalTime.
These fields indicate the following:
- totalTime - Total time for the request, measured from the first byte received to the last byte sent to the client. This includes the backend round trip and the client's ability to read the response.
- backendTime - Number of milliseconds spent on overall backend I/O (connecting, sending, and receiving bytes). If this time is high, the backend is slow and the performance investigation needs to focus there.
- clientTime - Number of milliseconds spent on overall client I/O (connecting, sending, and receiving bytes). If this time is high, the client bandwidth or processing might not allow it to read the response fast enough.
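As a rough illustration of how these three fields can be read together, the Python sketch below computes what share of totalTime was spent on client and backend I/O. The function names and the 50% threshold are assumptions made for this example; they are not part of APIM or of the log schema.

```python
def time_shares(total_ms, backend_ms, client_ms):
    """Return the fraction of totalTime spent on client and backend I/O."""
    if total_ms <= 0:
        return {"client": 0.0, "backend": 0.0}
    return {"client": client_ms / total_ms, "backend": backend_ms / total_ms}


def dominant_side(total_ms, backend_ms, client_ms, threshold=0.5):
    """Name the side that accounts for at least `threshold` of totalTime.

    The 0.5 threshold is an arbitrary assumption for this sketch.
    """
    shares = time_shares(total_ms, backend_ms, client_ms)
    side = "client" if shares["client"] >= shares["backend"] else "backend"
    return side if shares[side] >= threshold else "neither"


# Hypothetical row: 10 s in total, of which 9 s were spent on client I/O.
print(dominant_side(total_ms=10_000, backend_ms=500, client_ms=9_000))  # client
```

A result of "client" suggests looking at the caller first, while "backend" points the investigation at the backend API.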
Resolution:
In most scenarios, Client Connection Failures have to be investigated further from the client perspective, since it is the client that terminates the connection with the APIM service.
A few possible suggestions are increasing the timeout value at the client end, decreasing the response processing time, et cetera, depending on the scenario.
Additionally, using the diagnostic logs, you can find the specific stage at which the client abandons the request by looking at the ErrorSource column.
For example,
- If the column contains the value “forward-request”, it means that the client terminated the connection while the APIM service was still forwarding the request to the backend API.
- If the column contains the value “transfer-response”, it means that the client terminated the connection after the APIM service had received the response from the backend API and while it was forwarding that response back to the client.
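The two ErrorSource values above can be turned into a small lookup when reviewing many log rows. The following Python sketch is only an illustration; the dictionary and function names are made up for this example, and any other ErrorSource value is reported as not covered rather than guessed at.

```python
# Stage descriptions taken from the two cases described above.
CLIENT_ABANDON_STAGE = {
    "forward-request": "while APIM was still forwarding the request to the backend API",
    "transfer-response": "while APIM was forwarding the backend response back to the client",
}


def describe_client_abandon(error_source):
    """Explain at which stage the client dropped the connection."""
    stage = CLIENT_ABANDON_STAGE.get(error_source.strip().lower())
    if stage is None:
        # Other policies can appear in ErrorSource; this sketch does not interpret them.
        return f"client disconnected during '{error_source}' (stage not covered here)"
    return f"client disconnected {stage}"


print(describe_client_abandon("transfer-response"))
```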
Scenario 4: APIM Backend Connection Failures
When the APIM service logs “BackendConnectionFailure” under the ErrorReason column in the diagnostic logs, it means that the APIM service failed to establish a connection with the backend API.
This error can happen for various reasons and with multiple types of error messages.
A few of the commonly observed error messages for Backend Connection Failures are listed below. The corresponding error message for a failure is logged under the ErrorMessage column in the diagnostic logs.
Scenario 5: Unable to connect to the remote server
Symptom:
API requests fail with a Backend Connection Failure, and the error message “Unable to connect to the remote server” is logged in the Ocp-Apim traces/diagnostic logs.
Cause and Resolution:
The error “Unable to connect to the remote server” normally occurs due to the below reasons:
- APIM performance/capacity issues.
- SNAT port exhaustion on the APIM VMs
- There is an additional network device (like a firewall) that is blocking the APIM service from communicating with the backend API
- Backend API isn’t responding to the APIM requests (backend down or not responding)
- Network issues/latencies between the APIM service and the backend.
Using the Capacity dashboard on the Metrics blade of the APIM service, you can verify whether there have been any abnormal fluctuations with the average capacity which could have possibly contributed to the issue.
SNAT Port Exhaustion is a hardware specific failure.
According to the APIM documentation, the maximum number of concurrent requests from APIM to a backend is 1024 for the Developer tier and 2048 for the other tiers.
Let’s take the example of a Developer tier service to understand what this means.
In the Developer tier, the APIM service is hosted on a single underlying VM/node/host machine.
Each VM is internally assigned 1024 SNAT ports for communication. Hence, in the Developer tier you cannot have more than 1024 concurrent outbound connections to the same destination. If the number exceeds 1024 (possibly due to a huge influx of incoming requests), the service encounters SNAT port exhaustion and fails to establish a connection with the backend server.
NOTE: You can have more than 1024 connections open at the same time if they go to different destinations; the limit applies to concurrent connections to the same destination.
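To make the per-destination nature of this limit concrete, here is a minimal Python sketch that counts open outbound connections per destination and flags the ones approaching 1024. The input format (one `(host, port)` tuple per open connection), the function name and the 80% warning ratio are all assumptions for illustration; APIM does not expose its connection table in this form.

```python
from collections import Counter

SNAT_PORTS_PER_VM = 1024  # figure quoted above for a single Developer tier VM


def destinations_at_risk(open_connections, limit=SNAT_PORTS_PER_VM, warn_ratio=0.8):
    """Return destinations whose concurrent connection count nears the limit.

    open_connections: iterable of (host, port) tuples, one per open connection.
    warn_ratio is an arbitrary early-warning margin chosen for this sketch.
    """
    counts = Counter(open_connections)
    return {dest: n for dest, n in counts.items() if n >= limit * warn_ratio}


# 1100 connections in total, but only one destination is close to the limit.
sample = [("backend-a", 443)] * 1000 + [("backend-b", 443)] * 100
print(destinations_at_risk(sample))  # {('backend-a', 443): 1000}
```

The example has more than 1024 connections overall, yet only the destination holding 1000 of them is flagged, which matches the note above.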
If it has already been verified that the issue was not caused by capacity issues or SNAT failures, then it could be occurring because the backend API was down, was unable to establish a connection with the APIM service, or was terminating the connection due to network latencies between the APIM service and the backend.
In order to confirm this, you would have to collect network traces from the underlying VMs/nodes hosting the APIM service while the issue is being reproduced and then analyze the traces for establishing the point of failure.
In most scenarios, you can observe from the diagnostic logs that the “BackendTime” was almost equal to or greater than 21 seconds for all the failed requests and accounted for most of the “totalTime”.
This points to a TCP connection failure to the backend (21 seconds is the usual TCP timeout). APIM tried to connect to the backend, but the backend did not respond. The connection therefore timed out after 21 seconds and an HTTP status code 500 was returned, which indicates that the backend server was down, was not responding to connection requests, or was unable to maintain the connection.
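The 21-second pattern described above can be checked mechanically when scanning failed requests. The Python sketch below is one possible way to express it; the 1-second tolerance (for "almost equal to") and the "more than half of totalTime" test are assumptions chosen for this example.

```python
TCP_CONNECT_TIMEOUT_MS = 21_000  # the usual TCP timeout mentioned above


def looks_like_connect_timeout(backend_time_ms, total_time_ms, tolerance_ms=1_000):
    """True when BackendTime is about 21 s or more and dominates totalTime."""
    near_or_over_timeout = backend_time_ms >= TCP_CONNECT_TIMEOUT_MS - tolerance_ms
    dominates_total = backend_time_ms >= 0.5 * total_time_ms
    return near_or_over_timeout and dominates_total


# Hypothetical failed request: 21.05 s of backend time out of 21.1 s in total.
print(looks_like_connect_timeout(21_050, 21_100))  # True
```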
Scenario 5: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
Symptom:
API requests fail with Backend Connection Failure with the below error message highlighted under the errorMessage section in the diagnostic logs
“The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.”
Cause:
This is usually caused by a known APIM issue.
APIM keeps connections to the backend open for as long as possible so that it can reuse them and does not have to perform TCP/SSL handshakes for every new connection, which would hurt performance. However, if a connection is not used for a certain period of time (4 minutes) due to low/no activity, the internal Azure Load Balancer silently drops the connection. When this happens and APIM next tries to use the dropped connection, the connection fails and the above error message is logged.
Resolution:
This can be avoided by using the retry logic in APIM.
Reference: APIM Retry Policy - https://docs.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#Retry
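The APIM retry policy itself is configured in the policy definition (see the reference above). The Python sketch below is not APIM policy syntax; it only illustrates the general idea of retrying an operation once a stale connection has failed, so that the next attempt runs on a fresh connection. The function name, the attempt count and the delay are assumptions for this example.

```python
import time


def call_with_retry(operation, attempts=3, delay_s=1.0, retry_on=(ConnectionError,)):
    """Run `operation`, retrying on connection errors up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            # Give up and surface the error once the last attempt has failed.
            if attempt == attempts:
                raise
            time.sleep(delay_s)
```

A first attempt that hits a dropped keep-alive connection raises an error, and the second attempt typically succeeds because a new connection is established.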
Scenario 6: The remote name could not be resolved
Symptom:
API requests fail with Backend Connection Failure with the below error message highlighted under the errorMessage section in the diagnostic logs:
“The remote name could not be resolved”
Cause:
When one machine has to connect to another machine, it has to perform DNS name resolution.
The above error indicates that APIM wasn't able to convert the hostname of the backend (e.g. contoso.azurewebsites.com) to an IP address and couldn't connect to it.
The most frequent cause of this error is an incorrect hostname in the API configuration. If the service is in a VNET and uses custom DNS, the error could mean that the custom DNS server was unavailable or did not contain a record for the backend that APIM is attempting to connect to.
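A quick way to check whether a hostname resolves is to run a lookup with Python's standard `socket.getaddrinfo`. The sketch below is a minimal example; the function name and the injectable `resolver` parameter are assumptions added so the logic can be exercised without a network. Note that the result reflects the DNS configuration of the machine running the script, so for a service in a VNET with custom DNS it is only meaningful when run from a machine that uses the same DNS servers.

```python
import socket


def resolve_backend(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if lookup fails."""
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Equivalent in spirit to "The remote name could not be resolved".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve_backend("contoso.azurewebsites.com")
    print(addresses or "hostname could not be resolved from this machine")
```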
Resolution:
Accordingly, the issue has to be troubleshot from a network perspective, depending on the scenario. The most reliable method of isolating the issue and zeroing in on the exact cause is analysis of network traces for sample failures.
Scenario 7: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel
Symptom:
API requests fail with Backend Connection Failure with the below error message highlighted under the errorMessage column in the diagnostic logs:
“The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel”
Cause:
This error is normally encountered when the backend has been configured to use a self-signed certificate instead of using a publicly trusted root CA certificate.
APIM services are hosted in the Azure infrastructure using PaaS VMs that run on Windows OS.
Hence, every APIM instance trusts the same default Root Certificate Authorities that all Windows machines trust.
The list of trusted Root CAs can be downloaded using the Microsoft Trusted Root Certificate Program Participants list - https://docs.microsoft.com/en-us/security/trusted-root/participants-list
Resolution:
There are 2 possible solutions for resolving this issue:
- Configure the backend with a certificate that chains to a trusted root CA on the Microsoft Trusted Root Program participants list.
- Disable certificate chain validation in order for APIM to communicate with the backend system. To configure this, you can use the New-AzApiManagementBackend (for new back end) or Set-AzApiManagementBackend (for existing back end) PowerShell cmdlets and set the -SkipCertificateChainValidation parameter to True.
Below is the sample PowerShell command:
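# Build a context that points at the target APIM instance
$context = New-AzApiManagementContext -ResourceGroupName 'ContosoResourceGroup' -ServiceName 'ContosoAPIMService'
# Create a backend entity that skips certificate chain validation
New-AzApiManagementBackend -Context $context -Url 'https://contoso.com/myapi' -Protocol http -SkipCertificateChainValidation $true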
References for creating/updating backend entity:
- https://docs.microsoft.com/en-us/powershell/module/az.apimanagement/new-azapimanagementbackend?view=...
- https://docs.microsoft.com/en-us/powershell/module/az.apimanagement/set-azapimanagementbackend?view=...
Scenario 8: Unable to read data from the transport connection: The connection was closed.
Symptom:
API requests fail with Backend Connection Failure with the below error message highlighted under the errorMessage column in the diagnostic logs:
“Unable to read data from the transport connection: The connection was closed.”
Cause:
This error occurs when the APIM service is still trying to read the response from the backend, but the connection was suddenly aborted.
The process by which an APIM service transfers a response to the client is highlighted below:
APIM reads the response status code and headers first. The payload stays in the network stream.
Once the headers and the status code are received, APIM streams the response body from the backend service to the client.
If any exception is encountered while the data stream is underway, the above error message is logged.
Resolution:
Users can implement retry logic in APIM to avoid this error:
Reference: APIM Retry Policy - https://docs.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#Retry
Scenario 9: The underlying connection was closed: The connection was closed unexpectedly
Symptom:
API requests fail with Backend Connection Failure with the below error message highlighted under the errorMessage column in the diagnostic logs:
“The underlying connection was closed: The connection was closed unexpectedly”
Cause:
This error occurs when either the APIM service or the backend service abruptly terminates the connection while the communication between the APIM service and the backend was still underway.
Resolution:
To isolate the source of the issue and resolve it, collect network traces from the underlying VMs/nodes hosting the APIM service while the issue is being reproduced, and then analyze the traces to establish the point of failure.
Implementing retry logic may help to some extent if the issue occurs only rarely.
Error Code: 501
Scenario 1: Not Implemented
Symptom:
Sometimes, API requests fail with HTTP 501 errors and one of the error messages below is logged under the errorMessage column in the diagnostic logs:
NOTE: This is not an exhaustive list and the error message would depend on the actual cause:
- “Header BPC was not found in the request. Access denied.”
- “Unable to match incoming request to an operation.”
- “Header RegionID was not found in the request. Access denied.”
Cause:
This error is not uncommon when using APIM services.
The above HTTP server error response code means that the server does not support the functionality required to fulfill the request.
In APIM terms, if the client makes a request that the server considers inappropriate because it does not support the feature/method needed to process it, the server can return a 501 response to the caller.
Reference: https://www.checkupdown.com/status/E501.html
The server returning the 501 response in this scenario is:
- The backend, if the BackendResponseCode in the logs is 501. APIM returns the same response to the client.
- The APIM service, if the ResponseCode is 501 and the BackendResponseCode is either blank or 0 in the diagnostic logs.
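The rule above can be written as a small helper for sorting log rows by where the response originated. This Python sketch is illustrative only; the function name is made up, and the last branch (backend answered with a different status than the client received) is an assumption that goes beyond the two cases listed above.

```python
from typing import Optional


def response_origin(response_code: int, backend_response_code: Optional[int]) -> str:
    """Return 'backend' or 'apim' depending on who produced the status code."""
    # A blank (None) or 0 BackendResponseCode means the backend never answered,
    # so the status code was produced by the APIM service itself.
    if not backend_response_code:
        return "apim"
    # The backend answered with the same status that the client received.
    if backend_response_code == response_code:
        return "backend"
    # Assumption: the codes differ, so something in APIM (for example a policy)
    # changed the status after the backend responded.
    return "apim"


print(response_origin(501, 501))   # backend
print(response_origin(501, None))  # apim
```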
Resolution:
When it is the APIM service that returns the 501 response and not the backend, a very common case is that APIM logs the error message “Unable to match incoming request to an operation”. In that case, both the API configuration within the APIM service and the way the client forms and invokes the request have to be reviewed.
It is also possible that the 501 error code is returned by a policy that is evaluated during request processing. If so, the corresponding policy name appears under the “ErrorSource” column in the diagnostic logs.
The best option in such scenarios is to collect an Ocp-Apim trace, which captures detailed request processing information and helps isolate the point of failure.
Error Code: 502
Scenario 1: Bad Gateway
Cause/Resolution:
The APIM service returns a 502 Bad Gateway response to the client in case of Backend Connection Failures.
Hence, the troubleshooting and debugging steps are the same as in the Backend Connection Failures section above and depend on the details observed under the “ErrorMessage” column in the diagnostic logs.
The most commonly found error message logged by APIM for a 502 response is “The remote name could not be resolved”.
Error Code: 503
Scenario 1: Service Unavailable
Symptom:
Sometimes, API requests fail with HTTP 503 errors and an error message indicating that the service is unavailable, for example when invoking the API from Postman.
Cause:
In most cases, 503 responses are returned by the backend servers.
However, the APIM service can also return a 503 response to the client before the request is forwarded to the backend, when an inbound policy applied to the incoming request terminates it.
Resolution:
Verify the “ErrorSource”, “ErrorReason” and “ErrorMessage” columns in such scenarios and proceed accordingly.
Error Code: 504
Scenario 1: Gateway Timeout
Cause/Resolution:
Below are some of the popular scenarios where APIM services return a 504 response to the client:
Scenario 1: The APIM service has waited too long to establish a connection with the backend server, but the backend is not available or not responding.
The troubleshooting performed remains the same as that of troubleshooting Backend Connection Failures highlighted above.
In the diagnostic logs, specifically look out for the sub-component time values and the columns “ErrorReason” and “ErrorMessage” in order to isolate the source of the issue.
Scenario 2: The backend service takes too long to process the request, leading the APIM service to terminate the connection. In such scenarios, the diagnostic logs show that the “BackendTime” is high and accounts for most of the total time taken for request processing.
There are 2 possible solutions for mitigating this issue:
- Increase the timeout value of the APIM service under the <forward-request> policy so that it is in line with the average time taken by the backend for request processing.
Reference: https://docs.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#ForwardReques...
- Improve backend performance by reducing the response time.
Scenario 3: The timeout value configured for the APIM service within the <forward-request> policy is low.
A popular mitigation step is to increase the timeout value of the APIM service under the <forward-request> policy section so that it is in line with the average time taken by the backend for request processing.
NOTE: For APIM API request processing, the default timeout value imposed by APIM services is 300 seconds/5 minutes.
The default timeout value can be increased using the forward-request APIM policy - https://docs.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#ForwardReques...
For “timeout”, the value can be set to any valid integer, but as the above documentation states, the practical maximum is around 240 seconds, since values greater than 240 seconds may not be honored: the underlying network infrastructure can drop idle connections after this time.
Reference: https://docs.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#attributes-1
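When facing problems in a complex environment, the way to investigate them is this: muddy water, left still, slowly becomes clear; what is at rest, set in motion, slowly comes to life. In the cloud, it is exactly so!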