Tuesday, January 30, 2007

Fast Excel Reporting

A recurring topic where I work is how to improve performance of code that exports data to Excel for reporting. Lately I've seen a lot of interesting advice on different options, so I thought I'd post some of it here.

Approach #1: C# with interop for Excel COM API
The fundamental problem with this approach is that the COM API is inherently slow, and all of the marshalling (especially with variant types!) has a lot of overhead. The net effect is that a lot of people who started out with this approach have seen performance degrade. Now, I should mention that in our core reporting engine we still primarily do things this way, and we produce reports with literally 100,000's of rows, ending up with several-hundred-megabyte size files - and yes some of them take a while, but nothing that seems that extremely bad, given what they are doing. From what I recall, the worst is something like a half hour, for almost 1M rows by 50+ columns. And obviously any user who really wants us to put that much data in one spreadsheet is not really going to be able to use it very effectively.

The first approach most people try is cell-by-cell, which is pretty slow, because in addition to marshalling you are dealing with lots of IDispatch-type calls over the interop barrier. An improvement on this, to eliminate the "COM-chattiness" is to pass a large chunk of data to Excel at once, as an array, and then to apply post formatting afterwards. The formatting can be done within a VBA macro, or it can be done via COM - in this case it's best to try to optimize formatting by grabbing ranges of similar cells and formatting them together at once.

Approach #2: Directly calling via COM vtable interfaces
This is a technique that can be used to eliminate a lot of the overhead of calling through a generic COM interop layer. The problem with the typical interop layer is that you are essentially going through the same IDispatch interface that was used back in the VB days. By encoding the vtable interface and linking it in, typically in C++ code, the maximum performance with a COM call can be achieved. This is also theoretically possible in C#, but you must be a master of P/Invoke to set it up properly, and since the Excel API is fairly extensive, to be able to use most of it you would spend a ridiculous amount of time trying to write all of the call signatures, and then debug all of the calls to get marshalling and types right. It sounds like a nightmare to me, but I've heard of people having done it.

Approach #3: Use clipboard as a protocol
Since Excel is integrated with both the text clipboard and also much more extensive clipboard formatting including DDE/OLE formats, you can pretty much pass anything to Excel via the clipboard. This would be particulary useful if you need to put random things like graphics into a spreadsheet. A common approach to sending a large set of data to Excel is to format your data as a big text csv string, copy that to the clipboard, and then paste into Excel, or even use some of the paste special to get formatting and types to work properly. This technique can provide big gains over API calls in some scenarios. One of the problems with this approach is that it is using the shared clipboard, so it's hard to run something like this in the background if there are user apps running as well. Additionally, it presents serious challenges for any kind of multi-threading application.

Approach #4: Generate Excel XML document format
More recent versions of Excel, and particularly the most recent 2007 version, have extensive support for XML. By reverse-engineering the schema from an Excel-generated XML document, you can understand how to control pretty much everything. Then the problem becomes a much simpler one of how to generate XML in a quick way - something that has numerous solutions readily available on the web. I recently came across this article, where someone has taken the time to document this approach step-by-step.

Approach #5: Use a pre-packaged solution
The last time this topic came up, someone recommended Spreadsheet Gear. This is a (not free, not open) component that can be used to generate Excel documents without even having Excel available on the server. It looks like they have basically reverse-engineered the entire document format (the legacy proprietary one, not the new open XML one). Then they've put a much more friendly and performant .NET API on top of that, and added more value around it. I haven't spent a lot of time looking at this yet, but anecdotal evidence on the site claims impressive performance. I'm looking forward to getting a license through our company (it's pretty expensive for what it does!), and then I'll be able to say something more concrete about this - but it definitely looks promising.

Approach #6: XML and Scalability
Assuming the future direction of these document formats is XML, that should provide a new range of possiblities. One idea that comes to mind is to leverage the open format to create a scalable reporting platform. Imagine you really do need to generate a ridiculously large XML-based Excel report. And imagine that the time involved in generating this report is not only document generation, but also back-end number crunching and data processing. With the normal current approaches, you would probably break these two aspects apart, so that the processing bit could be scaled and optimized - while the document generation could be isolated in a non-scalable way.

But, if you can use an XML format, perhaps you could divide the work up in such a way that you could have a scalable system, with many "workers" producing individual pieces of the report, perhaps across many virtual servers. You would end up with many individual XML fragments, and then you could have a final component that pieces everything back together and streams out the final report. This approach would have the benefit of simplifying the architecture of the workers - you wouldn't need to break up the processing bit from the document generation bit. And the scalability should only be limited to the power of your XML parser.

No comments: