Tags

,

Have you ever had an unindexed SQL table and inserted a bunch of rows to it? Say, from the Import Export wizard in SQL Server Management Studio? Then, you select it back out with some where clause and it takes FOREVER for the results to come back? Then you know why indexes are so important.

Quick aside – looking around for some info on this post, I found a great post from a couple of years ago about why the order of these results can be in no particular order. I really like the title of this post about the results, “No Seatbelt – Expecting Order without Order By” by Connor Cunningham at Microsoft: https://blogs.msdn.microsoft.com/conor_cunningham_msft/2008/08/27/no-seatbelt-expecting-order-without-order-by/

Anyway, back to indexing…

So, indexes can really speed up your select statement. But, you may notice if you have a lot of indexes, it can really slow down your inserts, updates, and deletes.

Why? What is an index and how does it work?

I am going to go full computer nerd in this post. This is one I actually enjoy explaining to those that have the patience to listen to me on it. You’ve been warned.

Let’s take the example of the big table above – let’s say it’s a table called “People” with 1 million rows (big-ish…) – with no indexes. Let’s keep it simple. It has first name, last name, phone, address, email, and favorite color. The table is divided up into data pages for storage, and the rows in that page are able to be referenced by row number or row offset. For major simplicity, let’s say pages have 20 rows in them. That would mean the 72nd row would be on page 4, row 12. That is the POINTER to the row. If I know the row’s POINTER, I can get to it pretty much instantly.

Now, let’s say in my table with 1 million rows, row 623,845 has the phone number “555-1212”. I am doing a query “SELECT * FROM People WHERE phone = ‘555-1212’” to find that row. Where does SQL start? Well, since it has NO CLUE where that row is, it has to start from the beginning. So, it looks at row 1. Nope. Record 2. Nope. It has to check 623,845 rows before it gets a “Yep”. And then, are there other rows where the phone is 555-1212? It has to CONTINUE to look, since this is not a unique row. The search time for this is equal to the number of rows in the table. Even if I constrain the column to only have unique values, SQL will have to look, on average, at HALF the rows EVERY time it queries the data. This is called a Big-Oh-of-N operation (written as O(N)), meaning the operation is linear to the number of rows in the table.

So, let’s say I know that the phone number field is something I will often be searching. The smart thing to do here would be to add an index to the field. When we do that, a B-Tree is built on the data in the field. B-Trees are self-balancing trees, similar to binary trees. For this explanation, let’s say it is a binary tree and assume it is always balanced. The node object of this binary tree will be a structure with the data being indexed, and the POINTER to the actual record. So, this tree contains an actual copy of the data in the row, but structured such that the middle value (in order) is the root of the tree, and if the value I am looking for is LESS than the value at the root, I traverse the tree to the left, and if it is GREATER I traverse the tree to the right.

So, how long does it take to find the value in the binary tree? In a table with a million rows, it would take a maximum of 20 comparisons to find the value I am looking for. Each search removes half of the remaining values from contention, so you quickly find your value. This is called a Big-Oh-of-Log-N operation (written as O(log(N)). The node you find contains the value for comparison, and the pointer to the actual record.

So, that is how you found the record so quickly. The select is much faster, and the speed difference increases asymptotically as the table gets bigger. 20 searches get you a million rows, but 30 searches get you a BILLION rows. It truly boggles the mind.

So why are updates, deletes, and inserts now slower? Well, simple. When you insert into the table (or update, or delete), you now have to also insert to (or update, or delete from) your index as well. Not to mention that this is not really a binary tree, and the B-Tree re-balancing itself has a cost.

Why not index all of the columns? Well, one reason is that (as just now mentioned), updates, deletes, and inserts are now slower, and it gets even worse when you add more indexes. Also, each one of these indexes takes up space equal to the column’s contents plus the (very small) overhead of the tree nodes. So, ignoring the tree nodes, if you indexed every column, the table would be twice as big to store as the actual data. Stick to indexing the columns on which you most often need to filter.

So, that is how indexing works in a nutshell. I hope this improves your understanding of SQL indexing.

Advertisements