I just received the official case summary from Microsoft.  I thought I would publish it here for the benefit of anyone using old man Google to solve this problem.

—————-
Symptoms
=-=-=-=-=-=-=
When a large number of items is put into the inbox, calendar, contacts, etc., Exchange Active Sync (EAS) or Entourage Client Sync will fail.

Errors
=-=-=-=-=-=-=
The errors can vary from the standard “TIMEOUT” errors to 0x85010014.  In a DAV trace, to summarize, you will see I/O consistently “PENDING” and 0 bytes read from file.  This ‘file’ is the streaming file.

More Information
=-=-=-=-=-=-=-=-=-=-=
EAS and Entourage request mail from IIS/DAV.  DAV requests the mail from the STORE.  If the store needs to package a large amount of items, it will package them using a STREAM file.  IIS/DAV reads from this stream file.  IIS/DAV returns the information to EAS/Entourage.

Problem
=-=-=-=-=-=
IIS/DAV uses Kernel32::ReadFile() to read from the stream.  A 3rd party kernel driver (FAMv4.sys) intercepts these calls to ReadFile() and returns bad data.  This causes our ‘read’ thread to go into a perpetual ‘PENDING’ state.  For every “PENDING” returned, a POLLING thread is spawned, causing performance problems for the W3WP.exe ExchangeAppPool as well.

Resolution
=-=-=-=-=-=-=
Open REGEDIT.exe
Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services
Look for FAMv4 under the Services Key.
Set the “Start” value to 4 so that the FAMv4 service is disabled.
Open a cmd prompt.
Type NET STOP FAMV4.  This stops the FAMv4 service.
Sync your EAS or Entourage clients without issue.
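
Before changing anything, it can help to confirm whether the driver is registered at all. The following is a minimal Python sketch, not part of Microsoft's summary. It assumes the driver is registered under the service name FAMv4 and that its startup type is stored in the standard `Start` DWORD, where 4 means disabled. The helper names are made up for illustration.

```python
import sys

# Service start types as stored in the "Start" DWORD under
# HKLM\SYSTEM\CurrentControlSet\Services\<name>.
START_TYPES = {0: "boot", 1: "system", 2: "automatic", 3: "manual", 4: "disabled"}

# Assumption: the driver is registered under this service name.
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\FAMv4"


def describe_start(value):
    """Translate a Start DWORD into a readable name."""
    return START_TYPES.get(value, "unknown ({})".format(value))


def read_famv4_start():
    """Return the FAMv4 Start value, or None if the key is absent.

    Only meaningful on Windows; returns None on any other platform.
    """
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "Start")
            return value
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    start = read_famv4_start()
    if start is None:
        print("FAMv4 service key not found (or not running on Windows)")
    else:
        print("FAMv4 start type:", describe_start(start))
```

The script only reads the registry, so it is safe to run on a production server. The actual change is still made with the steps above.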

More Information about FAMv4
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
FAMv4 stands for File Access Manager and it is made by Vision Works Solutions Incorporated.  Their website is http://www.vwsolutions.com/.  It is essentially an open file utility that allows backup programs to backup open files.  You can find out more about how it works from http://www.vwsolutions.com/FAM/howitworks.aspx. It works with several backup programs and since it is ‘hidden’ in the registry and not listed in services.msc, it is probably licensed by other backup companies as their open file backup solution.

We ask that our customers contact Vision Works Solutions at 1.888.310.6706 or the customer’s individual backup solution to obtain a fix for FAMv4.sys.

About 3-4 weeks ago we suddenly lost the ability to sync our Entourage clients to the Solerant internal Exchange 2007 server.  This happened out of the blue with no recent changes to the server that we could see.  The most recent change was the application of Exchange 2007 SP1 Hotfix Roll-up 3 about a week prior to the issue.  We opened a case with Microsoft and immediately began getting the run-around.  Exchange support didn’t want to touch it since, as best we could tell, it only affected Entourage users, and Mac Office support couldn’t help us on the server.  We stuck through it and began working with the Entourage team to find the key symptoms.

  • Entourage was unable to sync folders that had more than 140-150 items in them.  This included Contacts, Calendars, etc.
  • Using tcpflow to analyze the stream to the server you could see it literally die mid-stream while receiving the list of items in the folder.
  • Synchronization was extremely slow, even in the folders that were working properly.
  • OWA and ActiveSync were operating properly as far as we could tell.  (We later realized they were slow as well.)

Unfortunately due to the politics between the Mac Office team and the Exchange team this issue was open for over 3 weeks.  I can’t tell you for sure why it was not escalated to the proper people sooner, but after about 3 weeks of constant battling with the Entourage team I was scheduled for another late night call with an Exchange team member who did some server-side captures from diagnostic tools on the Exchange server.  After multiple traces were completed and uploaded to Microsoft the call was terminated with the expectation that I would receive an update as soon as they had something.  Apparently IIS processes were hanging, crashing, and restarting and we were not the only people having this issue.  The data was being forwarded to the Exchange product team for further analysis.

The following day I received a call from a US-Based (Thank God!) Exchange support engineer who had pulled my case.  Apparently he had isolated an issue which was causing serious performance issues with OWA and Activesync and noticed the similarities to a growing list of Entourage tickets.  At his instruction we looked in the registry and found exactly what he was expecting.

The engineer had identified a driver on all of these systems that was creating the performance problems.  (This part is out of my area of expertise, but I’ll try.)  This driver is a low-level driver that hooks into the file system driver stack, intercepting all of the reads/writes.  Its purpose is to be an open file backup agent.  We stopped this driver from the command line with “net stop famv4” and the server started operating completely as before and all Entourage clients synchronized immediately.

The most interesting part of this discovery was that there was no backup software of any kind installed on the Exchange server at the time of this discovery.  Microsoft had no idea where this driver was coming from which left it up to me to figure it out.  The only candidate in my mind was an online backup service we had tried out for a while and had issues with.  I contacted the vendor and confirmed that the driver was in fact their Open File backup driver.  Additionally, they were aware that it had been causing problems with OWA and Activesync but had heard nothing about Entourage issues.  They also confirmed that they had issues in the past with it not properly uninstalling the driver but were certain it had been fixed.  (Clearly it had not been!)  This software apparently updated itself while it was still installed causing our issue.  We removed the application during troubleshooting but it did not remove the offending driver.

I provided both the vendor and Microsoft with the appropriate information so they could contact each other and resolve the matter properly.  Unfortunately this vendor cost me over 35 hours of my personal life (family time, lost sleep, etc.) and probably another 15 hours on the clock, and it’s unlikely they’ll do anything to make up for this.

Lessons Learned
  • Microsoft Exchange and Entourage support teams do not work together in a collaborative way.  This may cause significant delay in getting an issue resolved
  • Business Critical issues taken to the Mac Office team are not treated the same way that they are when they go to Exchange.  We’ve stayed on the phone for 24+ hours with Exchange support in order to resolve an issue.  The Mac Office team is not willing or able to do this.
  • Be careful with backup software, especially those that handle open files.  I have a long history of issues with Open File Agents and this is a perfect example why.

I debated over the weekend whether or not to post/link the company here.  It would not be proper to do so and therefore I will not.  If you are concerned and need to know, please email me directly.

We’ve awoken this morning to most of our servers requesting to install the update which disables the Scalable Networking Pack. Surprisingly Microsoft has decided to push this update out globally. Since MS has decided it was a bad idea to turn on SNP by default, this is probably the end of the issue. Capable drivers or not, with it disabled everything should return to normal. It will, however, be very interesting to see what the fallout from this update is. ;)

Automatic Updates - KB 948496

I did a little digging tonight and it looks like HP has updated the drivers for their on-board Broadcom NICs. The updates were posted on February 28th, 2008 and can be found here. The link is for the DL380 G4 but should apply to multiple servers based upon the release notes.

Based upon the driver version they are deploying this may or may not still leave you with Vista issues. This remains to be seen.

So after nearly 8 months of battling this ongoing issue Microsoft has decided to speak up and acknowledge it. Today they published KB 948496 titled “An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers” which addresses this very issue. I’m certain that the work of the Exchange team had a lot to do with the final recognition of this issue so much thanks to them!

To summarize, Microsoft has acknowledged that computers which have TOE capable network cards and have SP2 installed may face issues due to the Scalable Network Pack being enabled by default. This is due, primarily, to faulty drivers but as I stated in my previous blog entry even the most recent drivers direct from Broadcom do not address all of the issues. To address this problem they have released an update EXE which disables the Scalable Network Pack. Additionally, they provide methods to manually disable these features in the article.
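
For anyone who wants to script the manual method, here is a small Python sketch that only prints the `reg add` commands and changes nothing itself. The three value names (EnableTCPChimney, EnableRSS, EnableTCPA) and the Tcpip\Parameters key are my recollection of what the KB article describes, so treat them as assumptions and check them against KB 948496 before running anything.

```python
# Assumed location of the TCP/IP parameters (verify against KB 948496).
TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Assumed names of the Scalable Networking Pack switches; 0 turns each one off.
SNP_VALUES = ("EnableTCPChimney", "EnableRSS", "EnableTCPA")


def snp_disable_commands():
    """Build one reg.exe command line per SNP value, setting it to 0."""
    return [
        'reg add "{}" /v {} /t REG_DWORD /d 0 /f'.format(TCPIP_PARAMS, name)
        for name in SNP_VALUES
    ]


if __name__ == "__main__":
    for command in snp_disable_commands():
        print(command)
```

Printing the commands instead of executing them leaves room to review them first. Registry changes of this kind generally need a restart before they take effect.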

While this is a major battle won in the war on this issue we cannot consider it completely won until Broadcom releases a driver that fully fixes the issue and Dell/HP/Etc. certify that driver for their customers.

In late August to early September of 2007 we had a sudden breakout of issues with brand new servers. We found that we were installing brand new Dell PowerEdge 2900/2950/1900/1950 units that were having significant Network Performance issues and Exchange performance problems.

Network Throughput

The first symptom that was noticed was network throughput. All of these servers were plugged into 1000BaseTX networks with clients running at 1000BaseTX or 100BaseT. Throughput varied from test to test, but overall speeds for CIFS/SMB file copy operations were at speeds that would be slow on a 10Mbit network. Various troubleshooting was attempted with Dell and various switch manufacturers and we found that putting the servers on a 100Mbit port resulted in full 100Mbit throughput. Various combinations of Flow Control on switch ports seemed to help but this was not a usable permanent solution.
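
To put a number on a file copy test, the arithmetic is simply bits transferred divided by elapsed seconds. The Python sketch below shows the conversion. The file size and timing in it are made-up illustration values, not measurements from our servers.

```python
def throughput_mbps(bytes_copied, seconds):
    """Return effective throughput in megabits per second (10**6 bits)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_copied * 8 / seconds / 1_000_000


if __name__ == "__main__":
    # Hypothetical example: a 100 MiB file that takes two minutes to copy.
    size = 100 * 1024 * 1024
    rate = throughput_mbps(size, 120)
    print("{:.2f} Mbit/s".format(rate))
```

A result below 10 Mbit/s on a gigabit link is the kind of figure that should raise suspicion about the NIC driver or offload settings rather than the switch.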

Exchange MAPI Errors

A second symptom cropped up almost immediately after this one where certain users were unable to open their Exchange 2003/2007 mailbox through Outlook yet were able to open it through OWA or sync with their mobile devices. Profiles were re-created and attempts made on other desktops but this did not fix the issue. The problem appeared to be tied to the user and MAPI. A scan of the event logs on the server showed multiple errors related to MAPI. (9646 Errors)

Stumbling onto a Clue

In late October of 2007 Dell shipped out their October edition of the Server Updates disc in their OpenManage updates pack. Luckily a few of my clients subscribed to this and I noticed it come in. Out of desperation we popped the update disc into one of the servers that was giving us problems and found that nearly everything was in need of updating. (This was a bit of a surprise as this particular server was only 6 weeks old at this point.) Immediately after updating all of the drivers/firmware the first problem appeared to be completely resolved. We were now able to max out throughput over the Broadcom gigabit cards in the servers, even running a bonded pair. Something definitely changed.

Upon further testing we found that the fix came from the updated network drivers dated 10/20/2007 on Dell’s support site. The actual driver installed by this update was version 3.5.8.0 dated 7/27/2007, way earlier than the 10/20 date of release from Dell. (One can only assume Dell takes a long time to QA their drivers.)

Dell 10/20/2007 Driver

Exchange Issues Continue

Unfortunately, the updated driver did not resolve our Exchange issues. Being forced to dig further into the issue we stumbled across a post on the Exchange Team Blog detailing problems caused by network drivers (particularly Broadcom) and the Windows Scalable Networking Pack which is installed and enabled by default in Windows 2003 Service Pack 2.

The Fix

The blog post goes fairly extensively into the problem, however to summarize it appears that certain network vendors have been extremely slow to properly implement the features required by the Scalable Network Pack in their drivers. The key word in that sentence is “properly” as the pack tries to turn it on anyway resulting in serious issues. At best this causes performance issues/limitations which may not be noticed at all. For us this problem was causing severe limitation on the TCP stack that presented itself in many ways. The following are a few of the issues that were affecting us that we have later found to be a direct result of the problem.

  • MAPI errors resulting in limitations on the number of MAPI clients that could connect
    • Once the limit was reached no further MAPI sessions could be created on the server
    • Users unable to open their mailbox through Outlook
  • Randomly unable to make RDP connections to the server or connections dropped immediately after connecting
  • Intermittent RPC communications failures
  • Networking throughput is severely decreased (This was partially fixed by the October Dell driver update)

Unfortunately, the driver provided by Dell is far too old to address these problems. Per the Microsoft blog posting the Broadcom driver needs to be at least 3.7.19 to fully support the Scalable Network Pack. This leaves us in a tight spot to address the performance issues. We are left with two options.

  • Install unsupported drivers directly from the chipset manufacturer
    • Addresses the source of the problem
    • Potentially introduces support issues with your server vendor
  • Disable the Scalable Network Pack
    • Disables key improvements brought by SP2 as a workaround
    • Should make note of this and re-enable these features when drivers are later updated
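
When weighing these options across many servers, it helps to check mechanically whether an installed driver meets the 3.7.19 minimum mentioned above. This is a minimal Python sketch that compares dotted version strings numerically. The function names are made up for illustration, and it assumes versions consist only of dot-separated integers.

```python
def parse_version(text):
    """Split a dotted version string such as '3.5.8.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum.

    Shorter versions are padded with zeros, so '3.7.19' equals '3.7.19.0'.
    """
    a = parse_version(installed)
    b = parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # The Dell-supplied driver from October versus the stated minimum.
    print(meets_minimum("3.5.8.0", "3.7.19"))
```

Comparing integer tuples avoids the trap of string comparison, where "3.10" would sort before "3.7".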

Obviously it is in the best interest of our customers to stay supportable so we have chosen to address this by disabling the features of the Scalable Network Pack that are affecting performance on our servers. The easiest way to do this is by issuing the following command at a command prompt.

Netsh int ip set chimney DISABLED

Further details can be found on the Exchange Team Blog posting and I highly recommend you read it. This fix has addressed the key problems we experienced with our customers’ file servers and Exchange servers but we are keeping a close watch on this hoping for a permanent resolution. We have tested the latest drivers directly from Broadcom on test servers and have found that while they address these immediate issues we continue to have network throughput issues with Vista client systems under certain circumstances.

Conclusion

While this incident has severely shaken my trust in Broadcom I have to admit that they are not the only vendor having problems. While Broadcom seems to be the biggest offender here we have seen throughput issues (particularly with Vista clients) on servers using Intel NICs. Most of these are Dell and HP servers which haven’t received vendor driver updates in several years. A driver update from Intel instead of the Dell/HP/etc. site appears to resolve the problem immediately.

I see this problem as one so severe that vendors should be calling their customers to address it, yet if you call Dell/HP support they don’t even have a clue what you are talking about. Hopefully this post has provided you some insight into the problem. We are still working on this internally and hope to find a permanent solution. I will update with another post when we do.


© 2007 My Technical Life | Powered by Wordpress