
Data reporting seems very simple on the surface. You build a three-tiered application, with a data layer (SQL Server, for example), an object layer (a class library), and a presentation layer (a website). With a small data set, the reporting application works great without much issue. You have a finite set of filters, up to the number of columns in the data set being reported on, and you present the report based on those filters. Easy, right?

Well, what about when the data set is getting bigger? How can you manage that and still present a reasonable user experience?

Many times I will be given a requirement like “the page renders in 5 seconds after each filter”, even when the data set reaches millions or even billions of rows. For a while, you can get away with creative indexing. When the requirement is not “filter on every column”, you can index the columns that will actually be filtered, and that gets you some impressive speed gains even into the billions of rows of data. One day I will do a detailed post about how indexes work. It’s pretty neat stuff.

But even indexes can’t help forever. They work best when the data is static between re-indexing, and things get worse if the data keeps moving around on you. So now you have to come up with a strategy built on pre-compiled reporting data sets.

Pre-compiling involves knowing the filters that are going to be used, and taking a snapshot of the data at every permutation of the filters. I will leave the storage of this data up to you. Is it a copy of the data in multiple new tables? Is it a table of recIDs with a filter ID column? You get to architect that one. Good luck.
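As a rough illustration, here is a minimal Python sketch of the “table of recIDs keyed by permutation” idea. The field names, the record shape (dicts with a rec_id key), and the in-memory storage are all assumptions made for the sketch; in a real system this would live in SQL Server tables rather than a dictionary.

```python
from itertools import product

# Hypothetical filter fields; in the real system these would be columns in SQL Server.
FILTER_FIELDS = ["field_a", "field_b"]

# Maps a permutation key (one value per filter field, None = "NOT filtered")
# to the set of record IDs captured in that snapshot.
snapshots = {}

def precompile_permutations(records):
    for rec in records:
        # A record belongs to every permutation that either matches its value
        # on a field or leaves that field unfiltered (None).
        for key in product(*[(rec[f], None) for f in FILTER_FIELDS]):
            snapshots.setdefault(key, set()).add(rec["rec_id"])

# Tiny usage example with made-up records:
precompile_permutations([
    {"rec_id": 1, "field_a": "x", "field_b": 7},
    {"rec_id": 2, "field_a": "y", "field_b": 7},
])
print(snapshots[("x", 7)])    # {1}
print(snapshots[(None, 7)])   # {1, 2} -> "field_a NOT filtered, field_b = 7"
```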

But, how many permutations are there? It can be crazy, of course. Let’s use an example:

I am a data analyst at a government agency that has the following data on individual citizens of the US:

  • Income – let’s say 10 bucketized ranges, like 0-25K, 25-50K, etc. The exact ranges don’t matter to this example.
  • Family Size – 10 categories, let’s say 1 through 10+
  • Home Owner? – 2 categories
  • Race Code – Let’s say 5 categories
  • Zip Code – lots and lots…
  • State – 50 plus “other”.

Now, the application is to report on data filtered on all of these factors, so we need a pre-compiled data set for each permutation of the filters. Let’s skip zip code for a minute and count the data sets we need. Remember, each count gets a plus one, since a filter may also be “NOT filtered” for that field. So that’s 11 x 11 x 3 x 6 x 52 = 113,256 data sets. If we skip pre-compiling on State, we are down to a more reasonable 2,178 data sets, and we can index each sub-data set on the state field to get really great performance. Either way, that seems like a lot of stuff, and we have not even added in zip code.
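To make the arithmetic concrete, here is the same count in a few lines of Python. The category counts come straight from the example above; the extra +1 on each is the “NOT filtered” option.

```python
# Each filter gets one extra option for "NOT filtered".
income     = 10 + 1
family     = 10 + 1
home_owner =  2 + 1
race       =  5 + 1
state      = 51 + 1   # 50 states plus "other", plus "NOT filtered"

print(income * family * home_owner * race * state)  # 113256 pre-compiled data sets
print(income * family * home_owner * race)          # 2178 if we leave State out
```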

So, pre-compiling all possible data sets here is not a reasonable option. How can we use the concept of pre-compiling without ending up with thousands of subsets? One good way is to create a pre-compiled subset for each single filter value. There would be 10 income subsets, 10 family-size subsets, 2 home-owner subsets, and 5 race-code subsets, which is only 10 + 10 + 2 + 5 = 27 in total, and each of these would be indexed on zip code and state. Then we can just inner join the data (a Venn-diagram intersection, if you are doing this in the object layer with lists or sets) to get the final filtered data.
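Here is a minimal Python sketch of that single-filter approach, done the object-layer way with sets. The record shape, column names, and function names are assumptions made for the sketch; in SQL Server the same idea would be 27 pre-compiled subsets (or one keyed table) combined with INNER JOIN.

```python
# Hypothetical column names from the example above.
FILTER_FIELDS = ["income_bucket", "family_size", "home_owner", "race_code"]

# One pre-compiled subset per (field, value) pair: 10 + 10 + 2 + 5 = 27 of them.
subsets = {}  # (field, value) -> set of record IDs

def precompile_subsets(records):
    for rec in records:
        for field in FILTER_FIELDS:
            subsets.setdefault((field, rec[field]), set()).add(rec["rec_id"])

def run_report(selected):
    """selected: {field: value} for only the filters the user actually applied."""
    ids = None
    for field, value in selected.items():
        matching = subsets.get((field, value), set())
        # Set intersection here plays the same role as SQL's INNER JOIN
        # across the pre-compiled subsets.
        ids = matching if ids is None else ids & matching
    return ids  # None means no filters were applied, i.e. the full data set
```

A call like `run_report({"home_owner": True, "race_code": 3})` only touches two of the 27 subsets, and zip code or state can then be applied to the surviving record IDs as an ordinary indexed lookup.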

This way, pre-compiled data, indexes, and SQL’s inner join (or the object layer’s set functionality) all work together to provide the best possible user experience at the lowest computing cost.
