So I had to spend some days and nights searching for the reason.
The internet was not much help: it mostly reports issues related to registration done in Application_Start. That was not the case here, because the application had been running for some time in production.
(In practice, I should have drawn the conclusion when I saw those pages, because the application class in "global.asax.cs" is just a special kind of module.)
OK, so I did some research. It included debugging during the failure, dump file analysis from top to bottom, and reverse-engineering of HttpApplication. I'll skip all those details and proceed to the results.
- In integrated mode, IIS manages the pipeline. The runtime just gets notifications from IIS. They contain information about which module/handler pair should be used to handle an event.
- During first-time app initialization, the runtime registers event handlers in IIS, specifically by calling webengine4.MgdRegisterEventSubscription.
- A registration has the form [moduleName, eventType, moduleIndex], where moduleIndex is the index of a PipelineModuleStepContainer instance in the HttpApplication._moduleContainers array.
- The runtime maintains a pool of HttpApplication instances (HttpApplicationFactory._freeList). If all app instances are busy processing requests, it creates a new instance, called a “normal” instance.
- During initialization of a normal instance, event handler subscriptions populate HttpApplication._moduleContainers, which is later used for event handling.
- When handling a request, HttpApplication relies on data integrity and doesn’t check whether a handler is missing. It basically evaluates this._moduleContainers[moduleIndex]._moduleSteps[eventIndex], which triggers a NullReferenceException when the container for that module was never populated.
In the diagram above, IIS contains registrations for the BeginRequest and AuthenticateRequest handlers with module index = 6. At the same time, the HttpApplication instance doesn't contain such handlers. So we get a bang.
This gives us a common rule of thumb for this class of errors:
A NullReferenceException occurs if a module subscribed to an event during first-time initialization and, for some reason, did not do so during initialization of another HttpApplication instance.
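The rule of thumb can be illustrated with a toy model. This is only a sketch in Python, not the actual System.Web code: ToyServer, ToyApplication and FlakyModule are hypothetical names standing in for IIS, a pooled HttpApplication instance and a misbehaving module. The server remembers the registrations made during first-time initialization, while each application instance keeps its own containers and looks them up without checking for a missing entry.

```python
class FlakyModule:
    """Hypothetical module that subscribes only the first time it is initialized."""

    _initialized_once = False

    def init(self, app, module_index):
        if not FlakyModule._initialized_once:
            app.subscribe(module_index, "BeginRequest", lambda: "handled")
            FlakyModule._initialized_once = True


class ToyServer:
    """Stands in for IIS: remembers (event, module_index) registrations."""

    def __init__(self):
        self.registrations = []


class ToyApplication:
    """Stands in for one pooled application instance."""

    def __init__(self, server, modules, first_time):
        self.server = server
        self.first_time = first_time
        # One slot per module; a slot stays None until the module subscribes.
        self.module_containers = [None] * len(modules)
        for index, module in enumerate(modules):
            module.init(self, index)

    def subscribe(self, module_index, event, handler):
        if self.module_containers[module_index] is None:
            self.module_containers[module_index] = {}
        self.module_containers[module_index][event] = handler
        # Only first-time initialization registers the subscription with the server.
        if self.first_time:
            self.server.registrations.append((event, module_index))

    def handle(self, event, module_index):
        # No check for a missing container, as in the behavior described above.
        return self.module_containers[module_index][event]()


server = ToyServer()
modules = [FlakyModule()]
first = ToyApplication(server, modules, first_time=True)
second = ToyApplication(server, modules, first_time=False)

results = []
for app in (first, second):
    for event, module_index in server.registrations:
        try:
            results.append(app.handle(event, module_index))
        except TypeError:
            # Python's counterpart of dereferencing a null container.
            results.append("null reference")

print(results)
```

The first instance handles the event, while the second one fails on the same registration, because the server still routes the event to a module index whose container was never populated in that instance.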
In my particular case, New Relic's module caused the issues.
Hi Mikalai,
We are running into the same issue on one of our production servers. The application runs fine for hours or days and then suddenly stops working. The event log shows the exact same message as in the exception report above.
We also use New Relic. What did you do to resolve the problem?
Thanks,
Peter
I have no clear evidence that the issue was caused by New Relic itself, and nobody had reported that issue before. But the fact is that after New Relic was disabled, the issue went away.
I think that the problem was the result of a conflict between two entirely different monitoring systems. This is the only explanation I can see. NR sets hooks for unmanaged functions. If someone else overwrites such a hook, the NR infrastructure becomes corrupted, causing the problems in question. There is no protection from low-level hooks.
So my proposal for the further research would be:
- disable NR and see what happens.
- ask NR support to examine the issue.
- try a different monitoring system.
- try to find a conflict between the different systems.
Hope that helps.
We had this issue today. It's related to version 3.12.140.0 of the agent.
https://docs.newrelic.com/docs/release-notes/agent-release-notes/net-release-notes/net-agent-3121400
Version 4.0.146.0 solved the following bug:
"Fixed a bug that would sometimes cause IIS to get into a bad state under certain load conditions where it would stop serving requests. This could be identified by an exception from IIS around System.Web.HttpApplication.PipelineStepManager.ResumeSteps."
https://docs.newrelic.com/docs/release-notes/agent-release-notes/net-release-notes/net-agent-401460