I created a great little class library for data management a while ago that provides the backbone for any data reading and writing to any database. Basically, it takes data specifications and abstracts them into data objects that I can manage from a website. Then it defines in a second layer the method by which the data is delivered (FTP, XMLHttp, Json, already defined data tables, whatever) and how that data gets loaded into that data specification (delimited – tab, comma, etc. – file, fixed field width file, XML node, data node, whatever).
I am pretty proud of this little library, and sometime when I have time I will post it as a series with some explanations. I have been using it now for six or seven years, and there has been very little change to it. Still passes the unit tests, still hums along.
It is easy to forget to look inside the guts of something when you have been using it that long. It becomes the black box, and if something is slow you don’t really feel like opening the hood and messing around.
Recently, we opted to receive a very large file monthly and import that file into our database. The provider would be sending us a fixed-width text file. Perfect. I have just the library for that one. I set up a quick data template, a quick data transfer template, created a table, and mapped my columns to it. So cool, so easy.
Problem: it was 165 million records.
And it turns out my library loads the entire file into memory in this iteration of the library, line by line. Never been a problem before, the most I have ever read in as about a million. That’s only like a gigabyte file at the most, therefore a gig of memory. This one will require 104 GB of memory. That’s a lot of zeroes and ones.
So, I rewrite the class to handle a finite number of rows at a time. Problem now is: how do I get to the starting row? My first pass was to skip the first N rows to get to my starting row. Now, though, I skip 10000 rows the first time, 20000 the second time, 30000 the third, and so on. By the time I reach the end, I am skipping 164,990,000 rows to get to the last 10000. I have created a big-O n-squared function Boo me.
Google… help me out brother.
I search “c# random access text file” (go ahead… I’ll wait) and EVERY post I see in the first few pages (that was SFW) recommends loading the file into memory and indexing it. Wow. They call out large files in the hundreds of megabytes. Where have they been?
Okay, no text reader allowed I guess. I finally settled this with a stream reader, set the position, and read a block. I know each line is a fixed length, since it is a fixed width file. So, here is what I came up with:
How well does it work? It previously took 40 seconds to get to the millionth line and read in 10000 lines of the text file on my local computer. Now it takes only 9 seconds. Imagine how much worse that would be at the 140 millionth record, and my new code will still only take 9 seconds. Problem solved.