HttpClient exception: System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.

Error message:

System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled. ---> System.Net.Sockets.SocketException: Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.ReadAsync(Memory`1 destination)
   at System.Net.Http.HttpConnection.RawConnectionStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStreamInternal.<FillBufferAsync>g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
   at System.Net.Security.SslStreamInternal.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---

 

 

Related bugs

1.

Most / all of these requests were using POST operations (or other methods that have method bodies).
We suspect, but aren't sure, that this might be related to connection pooling in some manner.

 

2.

That error message is for SocketError.OperationAborted, which likely means the socket was closed while the send or receive was in progress.

 

3.

Is there anything here that we can look at?
To reproduce this we need to schedule > 5 runs of 4,000+ tests that take an hour plus.
In such a scenario, we typically see 1-3 such failures using SocketsHttpHandler.

Note that we never use SocketsHttpHandler directly, so I don't think that we are using it improperly.
We go through HttpClient, a shared instance that is reused among threads, as recommended.

The code is actually doing something like:

try
{
    return await httpHandler.SendAsync(...);
}
catch (HttpRequestException e)
{
    // inspect state / log here
}

So we can catch the exception there and then inspect state, add logging, or something similar.
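
One way to act on that (a minimal sketch, not from the thread; the SendWithDiagnosticsAsync helper name is made up) is to log the whole InnerException chain when the catch fires, so it is clear whether the failure bottoms out in a SocketException, an IOException, or a plain cancellation:

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class HttpDiagnostics
{
    // Hypothetical helper: send a request and, on failure, log every exception
    // in the InnerException chain before rethrowing.
    public static async Task<HttpResponseMessage> SendWithDiagnosticsAsync(
        HttpClient client, HttpRequestMessage request)
    {
        try
        {
            return await client.SendAsync(request);
        }
        catch (Exception e)
        {
            for (var current = e; current != null; current = current.InnerException)
            {
                Console.WriteLine($"{current.GetType().FullName}: {current.Message}");
            }
            throw;
        }
    }
}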

Anything you can recommend that we should look at to get to the root cause?

 

4.

Is it possible you're hitting a timeout? It shouldn't manifest this way, but it would give us a place to start looking. For example, if you change your HttpClient's Timeout to Timeout.InfiniteTimeSpan, does this still happen?

Or if it's potentially related to connection pooling, what happens if you change the SocketsHttpHandler's PooledConnectionLifetime and PooledConnectionIdleTimeout to infinite... does it still repro?
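
A minimal sketch of that configuration, assuming the SocketsHttpHandler properties available in .NET Core 2.1 (the factory class and method name below are made up):

using System.Net.Http;
using System.Threading;

static class NoTimeoutClientFactory
{
    // Hypothetical factory: an HttpClient with every timeout-like knob disabled,
    // purely to check whether the failures still reproduce.
    public static HttpClient Create()
    {
        var handler = new SocketsHttpHandler
        {
            PooledConnectionLifetime = Timeout.InfiniteTimeSpan,    // never recycle pooled connections by age
            PooledConnectionIdleTimeout = Timeout.InfiniteTimeSpan  // never close pooled connections for being idle
        };

        return new HttpClient(handler)
        {
            Timeout = Timeout.InfiniteTimeSpan  // disable the 100-second default client timeout
        };
    }
}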

 

5.

And... I have a repro. On my machine, this reproduces the scenario instantly.
You were correct about the timeout causing this.

using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    class Program
    {
        static async Task Main(string[] args)
        {

            // http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk


            var client = new HttpClient(new SocketsHttpHandler
            {
            })
            {
                Timeout = TimeSpan.FromMilliseconds(250)
            };

            var tasks = new[]
            {
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
            };

            await Task.WhenAll(tasks);

            Console.WriteLine("Done");

        }

        public static async Task Run(HttpClient client)
        {
            for (int i = 0; i < 25; i++)
            {
                try
                {
                    // This endpoint delays 3000 ms, well past the 250 ms client timeout.
                    await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk");
                    Console.WriteLine("Should timeout");
                }
                catch (Exception e)
                {
                    // Expected: the 250 ms timeout fires before the response arrives.
                }
                await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/100/url/http://www.google.co.uk");
            }

        }
    }
}

  

6.

That's hitting a timeout and throwing an OperationCanceledException... that's expected.

What's not expected is getting an IOException (not wrapped in an OperationCanceledException) out of SendAsync when nothing on the request is being canceled or timing out.
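
As a hedged illustration of that distinction (the helper and parameter names below are made up), exception filters can separate "our token was canceled", "a cancellation we didn't ask for, e.g. HttpClient.Timeout", and "a raw transport failure":

using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class TimeoutVsTransportFailure
{
    // Hypothetical helper: classify why a request failed.
    public static async Task<HttpResponseMessage> SendClassifiedAsync(
        HttpClient client, HttpRequestMessage request, CancellationToken callerToken)
    {
        try
        {
            return await client.SendAsync(request, callerToken);
        }
        catch (OperationCanceledException) when (callerToken.IsCancellationRequested)
        {
            Console.WriteLine("Expected: the caller canceled the request.");
            throw;
        }
        catch (OperationCanceledException oce)
        {
            // On .NET Core 2.1 the HttpClient.Timeout also surfaces as
            // TaskCanceledException; check the inner exception to see whether
            // a transport IOException is hiding underneath.
            Console.WriteLine($"Canceled without caller cancellation; inner: {oce.InnerException?.GetType().Name}");
            throw;
        }
        catch (IOException io)
        {
            // The unexpected case discussed above: a raw transport failure escaping SendAsync.
            Console.WriteLine($"Transport failure: {io.Message}");
            throw;
        }
    }
}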

 

7.

I captured System.Net logs and Wireshark traces of another failure. Here's where I see the failure in Wireshark:

1990370 214.439692    127.0.0.1 → 127.0.0.1    TCP 84 51686 → 51687 [FIN, ACK] Seq=1 Ack=10334254 Win=525568 Len=0
1990371 214.439736    127.0.0.1 → 127.0.0.1    TCP 84 51687 → 51686 [ACK] Seq=10334254 Ack=2 Win=525568 Len=0
1990408 274.618073    127.0.0.1 → 127.0.0.1    TCP 84 51687 → 51686 [RST, ACK] Seq=10334254 Ack=2 Win=0 Len=0

The RST looks unusual to me, especially since it's occurring after the connection should have been closed. The timestamp on the RST makes it look like it was sent 60 seconds after the connection was closed, but I'm still investigating whether that's accurate or just an issue with the logs I collected.

I'm going to try to capture another network trace to see whether the issue occurs after a FIN again, since that could be valuable info.

 

 

8. 

We are seeing the exact same thing when running our applications on Linux. A temporary "workaround" for us has been to set the timeout to 100 seconds (it's not an acceptable workaround).

When we set the timeout to something more reasonable, like 10 seconds, we get the error roughly every 500 requests.

The HTTP endpoint that we are calling is under our control, and when the error happens we can't even see that the request has reached the service. So it seems like the request doesn't leave the box.
We are using .NET Core 2.1.

I've started a stopwatch for each request, and I'm catching TaskCanceledException and logging the following (this is with the timeout set to 15 seconds):

Task was cancelled after 15000ms System.Threading.Tasks.TaskCanceledException:
The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled. ---> System.Net.Sockets.SocketException: Operation canceled
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
at System.Net.Security.SslStreamInternal.g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
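
A minimal sketch of that kind of measurement (the helper name and URL parameter are placeholders, not the code from the thread): time each request and include the elapsed milliseconds when the cancellation is caught, so it can be compared against the configured HttpClient.Timeout.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

static class TimedRequests
{
    // Hypothetical helper: time each request so a "canceled" failure can be
    // compared against the configured HttpClient.Timeout.
    public static async Task<string> GetStringTimedAsync(HttpClient client, string url)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            return await client.GetStringAsync(url);
        }
        catch (TaskCanceledException e)
        {
            Console.WriteLine($"Task was cancelled after {stopwatch.ElapsedMilliseconds}ms {e}");
            throw;
        }
    }
}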

 

 

9.

Thanks for the additional info. It's possible the behavior you're seeing is related, but it sounds more like an actual timeout. There are two differences from what I've observed in other cases: (1) in all other cases I've seen, the connection is actually sending the data, as observed via Wireshark, and (2) in the other cases the failure occurs well before a timeout should have occurred.

There's an open issue right now (#20296) tracking the fact that we throw TaskCanceledException on an actual request timeout. I think that's probably the actual root cause here.

If you think that your issue isn't actually being caused by timeouts though, feel free to open a new issue with all of the info you shared here.

 

 

10.

Interesting development. We encountered this issue after adding to our test suite a test that spawns a process with a PowerShell script (a very simple one that writes one line to a file). ~30 seconds after this test, there is a failure in 1-3 random subsequent tests.

Because of that we thought this might be related to processes inheriting handles from the parent one, and it appears that we are right.

We have made a PR that unskips those tests and switches the implementation of Process to one that sets 'inherit handles' to 'false'. After doing so everything seems to be fine, so it looks like the issue is with handle inheritance and the behavior of those handles after the child process ends.

The PR in question is here: ravendb/ravendb#8028

Our idea is that, at least as a workaround, it would be good to be able to set the handle inheritance flag in this call (currently hard-coded to true):

retVal = Interop.Kernel32.CreateProcess(
                                null,                // we don't need this since all the info is in commandLine
                                commandLine,         // pointer to the command line string
                                ref unused_SecAttrs, // address to process security attributes, we don't need to inherit the handle
                                ref unused_SecAttrs, // address to thread security attributes.
                                true,                // handle inheritance flag
                                creationFlags,       // creation flags
                                (IntPtr)environmentBlockPtr, // pointer to new environment block
                                workingDirectory,    // pointer to current directory name
                                ref startupInfo,     // pointer to STARTUPINFO
                                ref processInfo      // pointer to PROCESS_INFORMATION
                            );

  

11.

That's very interesting -- thanks for sharing that info. I briefly looked into the possibility that this was related to socket handles a few weeks ago after observing similar behavior with the h2spec process spawned by the Kestrel tests. It's probably worth looking into in more detail if you're seeing the same behavior.

 

 

12.

Note that as of 3.0, socket handles are no longer inherited: #32903
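
For sockets created by your own code (not the ones pooled inside SocketsHttpHandler, which aren't reachable this way), a rough Windows-only sketch of marking a handle non-inheritable via the Win32 SetHandleInformation call; this is offered only as an illustration of handle inheritance, not as the fix referenced above:

using System;
using System.ComponentModel;
using System.Net.Sockets;
using System.Runtime.InteropServices;

static class HandleInheritance
{
    private const uint HANDLE_FLAG_INHERIT = 0x1;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool SetHandleInformation(IntPtr hObject, uint dwMask, uint dwFlags);

    // Illustrative Windows-only sketch: clear the inherit flag on a socket you
    // own, so a child process spawned afterwards does not get a copy of it.
    public static void MakeNonInheritable(Socket socket)
    {
        if (!SetHandleInformation(socket.Handle, HANDLE_FLAG_INHERIT, 0))
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }
}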

 

 

13.

That would probably solve the issue, but:

  1. Can this be applied to 2.1?

or

  2. Can you add a private bool _inheritHandles = true field to Process.cs that we would at least be able to set to false via reflection?

 

 

14.

If that actually "fixes" it, I have trouble believing it's anything other than masking a symptom, and we should actually find the root cause and fix that.

Unless we believe the root cause really is that the socket handle is getting inherited and somehow that child process then closing is causing problems for the socket? In which case we should be able to repro that pretty easily, and if that's actually the problem, then I agree it would be a candidate for back porting.

 

 

15.

Fixing the root cause would be best, but until then could you consider applying point 2 so we can work around this?

 

 

16.

Applying it where... as a servicing fix to 2.1? That's unlikely.

 

 

17.

Would adding that private field in Process.cs (probably Process.Windows.cs) and using it instead of the hard-coded handle inheritance flag be a problem?

 

 

18.

The bar for servicing fixes is very high. You're suggesting adding unexposed, unvalidated, functionality-accessible-only-via-reflection-and-that-we-encourage-developers-to-use-via-reflection as a servicing fix. Hence "unlikely". If you need it immediately, you can of course make the change in your own local fork.

Let's get to the bottom of the problem and then figure out the right fix.

 

 

19.

I agree with @stephentoub.
I feel we are getting closer. @rmkerr has already invested weeks of his time trying to get to the bottom of this (incl. extensive discussions with the Winsock team in Windows). Let's see if this new information can help root cause it and address the root cause.
If it doesn't help much, we can think about either porting the change #32903 to 2.1, or doing something to enable people to work around the suspected root cause.

 

 

20.

I've been trying to build a repro that launches a process and then has several sockets sending and receiving data, but I have not been able to force the issue to occur. @ppekrol, would it be possible to set up a project with just the affected test and the code you're using to spawn the process?

If not I'll put some more time into the repro when I'm back in the office next week.
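
A hedged skeleton of that kind of repro attempt (not the actual code used here; the request URL, iteration counts, and the PowerShell script path are placeholders): keep several HTTP requests in flight over pooled connections while repeatedly spawning and reaping a short-lived child process.

using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class ProcessPlusHttpRepro
{
    static async Task Main()
    {
        var client = new HttpClient(new SocketsHttpHandler());

        // Several workers issuing requests concurrently over pooled connections.
        var workers = Enumerable.Range(0, 8)
                                .Select(_ => RunRequestsAsync(client))
                                .ToList();

        // Meanwhile, repeatedly spawn and reap a short-lived child process
        // (Windows-only as written; the script path is a placeholder).
        workers.Add(Task.Run(() =>
        {
            for (int i = 0; i < 100; i++)
            {
                Process.Start("powershell.exe", @"-File ""C:\Tools\HelloWorld.ps1""").WaitForExit();
            }
        }));

        await Task.WhenAll(workers);
        Console.WriteLine("Done");
    }

    static async Task RunRequestsAsync(HttpClient client)
    {
        for (int i = 0; i < 200; i++)
        {
            try
            {
                await client.GetStringAsync("https://example.com/");
            }
            catch (Exception e)
            {
                // Log the failure type so transport errors stand out from timeouts.
                Console.WriteLine(e.GetType().FullName + ": " + e.Message);
            }
        }
    }
}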

21.

Hi @rmkerr
Apologies for the delay, but I really doubt that this would be doable. We already spent a significant number of hours isolating the issue and creating a smaller subset of tests, but for an unknown reason it only appears for us on a larger set of tests, e.g. our full suite. I do not have an idea what else could be done to isolate this. We tried moving the test from the SlowTests project (takes around 20 minutes to complete) to FastTests (around 90 seconds), but it does not reproduce there.

22.

No problem @ppekrol. The info you shared is still valuable. I'm back in the office today, and will spend some time trying to cut down the size of the repro.

23.

I'm able to reproduce the behavior described by @ppekrol with a modified version of the Kestrel test suite.

To start, I disabled all of the tests except for the H2Spec tests (which are launched as a separate process) and ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders, which was the test that I observed failing the most frequently. In order to reproduce the issue more consistently, I set the ConnectionNotClosed... test up to run five times.

I was able to further minimize the repro by replacing the H2Spec tests with the following code:

[Fact]
public void LaunchProcess()
{
    for (int i = 0; i < 100; ++i)
    {
        //Thread.Sleep(100);
        Process.Start("Powershell.exe", @"""C:\Tools\HelloWorld.ps1""").WaitForExit();
    }
}

The targeted script just prints "hello world" to the console.

When running the test suites after this modification, I was able to reproduce the failure on 6 out of 10 runs. After replacing Process.Start with Thread.Sleep (as commented out above), the tests passed on 10 out of 10 runs.

Obviously this is pretty rough testing, and there isn't a lot of data. But I do think the behavior is interesting.

 
