In late August and early September of 2007 we had a sudden outbreak of issues with brand new servers. The Dell PowerEdge 2900/2950/1900/1950 units we were installing had significant network performance problems and Exchange performance problems.
The first symptom we noticed was poor network throughput. All of these servers were plugged into 1000BaseTX networks with clients running at 1000BaseTX or 100BaseT. Throughput varied from test to test, but CIFS/SMB file copies ran at speeds that would be slow even on a 10Mbit network. We worked through troubleshooting with Dell and several switch manufacturers and found that putting the servers on a 100Mbit port resulted in full 100Mbit throughput. Various combinations of Flow Control settings on the switch ports seemed to help, but this was not a usable permanent solution.
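For anyone wanting to put numbers on a symptom like this, here is a minimal Python sketch that times a single file copy and reports the result in Mbit/s. It is not the tooling we used; the function names are made up for illustration, and it assumes the destination is a path on the server's share that the client can write to.

```python
import shutil
import time
from pathlib import Path


def mbit_per_sec(num_bytes, seconds):
    """Convert a byte count and an elapsed time into megabits per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes * 8 / 1_000_000 / seconds


def timed_copy(src, dst):
    """Copy src to dst and return the observed throughput in Mbit/s.

    dst is assumed to be a path on the server being tested, for example
    a mapped drive or UNC path.
    """
    size = Path(src).stat().st_size
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return mbit_per_sec(size, elapsed)
```

Use a test file large enough that the copy takes several seconds, and repeat the run a few times, since our results varied from test to test.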
Exchange MAPI Errors
A second symptom cropped up almost immediately after the first: certain users were unable to open their Exchange 2003/2007 mailbox through Outlook, yet were able to open it through OWA or sync with their mobile devices. We re-created profiles and tried other desktops, but this did not fix the issue. The problem appeared to be tied to the user and MAPI. A scan of the event logs on the server showed multiple MAPI-related errors (9646 errors).
Stumbling onto a Clue
In late October of 2007 Dell shipped out their October edition of the Server Updates disc in their OpenManage updates pack. Luckily a few of my clients subscribed to this and I noticed it come in. Out of desperation we popped the update disc into one of the servers that was giving us problems and found that nearly everything was in need of updating. (This was a bit of a surprise, as this particular server was only 6 weeks old at this point.) Immediately after updating all of the drivers and firmware, the first problem appeared to be completely resolved. We were now able to max out throughput over the Broadcom gigabit cards in the servers, even running a bonded pair. Something had definitely changed.
Upon further testing we found that the fix came from the updated network drivers dated 10/20/2007 on Dell’s support site. The actual driver installed by this update was version 126.96.36.199 dated 7/27/2007, way earlier than the 10/20 date of release from Dell. (One can only assume Dell takes a long time to QA their drivers.)
Exchange Issues Continue
Unfortunately, the updated driver did not resolve our Exchange issues. Forced to dig further, we stumbled across a post on the Exchange Team Blog detailing problems caused by network drivers (particularly Broadcom) and the Windows Scalable Networking Pack, which is installed and enabled by default in Windows 2003 Service Pack 2.
The blog post covers the problem fairly extensively. To summarize, it appears that certain network vendors have been extremely slow to properly implement the features required by the Scalable Networking Pack in their drivers. The key word in that sentence is “properly”: the pack tries to turn the features on anyway, resulting in serious issues. At best this causes performance issues or limitations which may not be noticed at all. For us the problem caused severe limitations in the TCP stack that presented themselves in many ways. The following are a few of the issues affecting us that we later found to be a direct result of the problem.
- MAPI errors resulting in limitations on the number of MAPI clients that could connect
  - Once the limit was reached no further MAPI sessions could be created on the server
- Users unable to open their mailbox through Outlook
- RDP connections to the server randomly failing or dropping immediately after connecting
- Intermittent RPC communications failures
- Severely decreased network throughput (this was partially fixed by the October Dell driver update)
Unfortunately, the driver provided by Dell is far too old to address these problems. Per the Microsoft blog posting, the Broadcom driver needs to be at least version 3.7.19 to fully support the Scalable Networking Pack. This leaves us in a tight spot when it comes to addressing the performance issues. We are left with two options.
- Install unsupported drivers directly from the chipset manufacturer
  - Addresses the source of the problem
  - Potentially introduces support issues with your server vendor
- Disable the Scalable Networking Pack
  - Disables key improvements brought by SP2 as a workaround
  - Should make note of this and re-enable these features when drivers are later updated
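For anyone weighing the first option, the 3.7.19 minimum quoted from the blog post can be checked in a script. This is a minimal Python sketch, assuming driver versions are plain dotted numbers. The function names are made up for illustration, and it does not read the installed version from the system; you would still look that up in Device Manager and pass it in.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.7.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum Broadcom driver version quoted in the Exchange Team Blog post.
MINIMUM_BROADCOM = parse_version("3.7.19")


def meets_minimum(installed, minimum=MINIMUM_BROADCOM):
    """Return True if the installed version is at least the minimum.

    Versions are compared number by number, so '3.10.0' counts as newer
    than '3.7.19'. A plain string comparison would get that case wrong.
    Shorter versions are padded with zeros before comparing.
    """
    have = parse_version(installed)
    width = max(len(have), len(minimum))
    have = have + (0,) * (width - len(have))
    need = minimum + (0,) * (width - len(minimum))
    return have >= need
```

For example, `meets_minimum("3.5.8")` returns `False` and `meets_minimum("3.7.19.0")` returns `True`.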
Obviously it is in the best interest of our customers to stay supportable, so we have chosen to address this by disabling the features of the Scalable Networking Pack that are affecting performance on our servers. The easiest way to do this is to issue the following command at a command prompt.
Netsh int ip set chimney DISABLED
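If you have more than a handful of servers to touch, the command above can be wrapped in a small script. This is a minimal Python sketch, not something we deployed; it assumes a Python interpreter is installed on the server and that the script runs with administrator rights. The function name and the dry-run default are my own choices for illustration.

```python
import subprocess
import sys

# The exact command from this post, split into arguments.
CHIMNEY_DISABLE = ["netsh", "int", "ip", "set", "chimney", "DISABLED"]


def disable_chimney(dry_run=True):
    """Disable TCP Chimney offload using the netsh command from the post.

    With dry_run=True (the default) nothing is executed and the command
    line is only returned, so it can be reviewed or logged first.
    """
    line = " ".join(CHIMNEY_DISABLE)
    if dry_run:
        return line
    if sys.platform != "win32":
        raise RuntimeError("netsh is only available on Windows")
    # check_call raises CalledProcessError if netsh exits with an error.
    subprocess.check_call(CHIMNEY_DISABLE)
    return line
```

Calling `disable_chimney()` just returns the command line for review; pass `dry_run=False` to actually run it, and record the change somewhere so the feature can be re-enabled once the drivers catch up.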
Further details can be found on the Exchange Team Blog posting and I highly recommend you read it. This fix has addressed the key problems we experienced with our customers’ file servers and Exchange servers, but we are keeping a close watch on this and hoping for a permanent resolution. We have tested the latest drivers directly from Broadcom on test servers and have found that while they address these immediate issues, we continue to have network throughput issues with Vista client systems under certain circumstances.
While this incident has severely shaken my trust in Broadcom, I have to admit that they are not the only vendor having problems. Broadcom seems to be the biggest offender here, but we have also seen throughput issues (particularly with Vista clients) on servers using Intel NICs. Most of these are Dell and HP servers which haven’t received vendor driver updates in several years. Installing a driver directly from Intel instead of from the Dell/HP/etc. site appears to resolve the problem immediately.
I see this problem as one so severe that vendors should be calling their customers to address it, yet if you call Dell/HP support they don’t even have a clue what you are talking about. Hopefully this post has provided you some insight into the problem. We are still working on this internally and hope to find a permanent solution. I will update with another post when we do.