IT Training Courses
Microsoft Gold IT Training Partner
800-326-1044
IT Training Newsletters SQL Server 2005 Data Mining

SQL Server 2005 Data Mining

by Saskia Schott
Principal Instructor for SQLSoft+
MOF, MSF, MCSA, MCSE, MCDBA, MCT

 

You ask:
"What is data mining? Do I need a pick and shovel? Or are there other tools I can use?"

...and you add: "When I Googled 'data mining,' I got 21,500,000 entries!"

I reply:
"I'm glad you asked."

Let's start with a definition of data mining. According to the Web site, www.twocrows.com, data mining is:

An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

And according to Wikipedia ( http://en.wikipedia.org/wiki/Data_mining), data mining is:

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

So, you have a lot of data and you want to know about patterns in that data and about relationships that you hadn't suspected. The problem with that is, as the Wikipedia site points out, you can fool yourself by "data dredging." Data dredging implies scanning the data for any relationships, and then when one is found, coming up with an interesting explanation. (This is also referred to as "overfitting the model.") The problem is that large data sets invariably happen to have some exciting relationships peculiar to that data. Therefore any conclusions reached are likely to be highly suspect.

Having said that, sometimes you know what to ask about your business, and sometimes you don't. Let's say you're in the beverage sales industry. You started with coffee, but now you sell other drinks. You can create a Sales data warehouse that stores your aggregated sales data, including day of the week sold, how much, the weather on that day, the location, type of location, whether there was a promotion or not, size of drink ordered, and maybe if you have a "frequent buyer" program, you have information about what a particular customer orders whenever she goes into one of the stores.

But suppose that every person she brings in becomes a "frequent buyer"? Or that she tries every new drink you offer, and while lots of customers try your new drinks, she seems to be a market leader or indicator? What question would you ask to get that information? If a customer likes a new drink and reorders, it's a hit; if they don't, it fails. How do you find that information? Aha! That's what data mining is about. And there are different mining models or algorithms for different relationship discovery.

In SQL 2000 the example used was, if there is a group of high school students, and we know their grades, whether they want to go to college, whether their parents want to go them to go to college, and their parents' income level, what factor correlates most highly with them actually going? As it happened, it was parental encouragement. Those with high grades, without parental encouragement, were very unlikely to go to college. This kind of model is a Decision Tree model. The other type of model SQL 2000 had was called Clustering.

What About SQL 2005?

But SQL 2005 is better and has more, doesn't it? Yes, there are five new models. And they are:

  • Association Rules
  • Naive Bayes
  • Sequence Clustering
  • Microsoft Time Series
  • Neural Networks

Association Rules

Association Rules find rules in your data in the form A and B -> C. This algorithm is especially useful for large catalogs—current testing has created models on catalogs containing over half a million items. The predictive capability of Association Rules makes it particularly ideal for cross-sales applications, particularly in Web scenarios. In the example below, given a customer has a computer, what movies are they likely to have watched, and are they likely to want to watch?

Figure 1: Association Rules Viewer

Naive Bayes

Naive Bayes is a probabilistic model. For instance, it can show the difference between churners and continuing customers, or the difference between a gold-card customer and a platinum-card customer.

Figure 2: Naive Bayes Viewer

Sequence Clustering

The Sequence Clustering algorithm combines expectation maximization clustering with sequence technology. The sequence can be any series of events tied to a user. You can create clusters, for example, describing how customers use your Web site and even predict the next page they may visit.

Microsoft Time Series

The Microsoft Time Series algorithm allows you to analyze and forecast any time-based data, such as sales or inventory. Using regression trees technology, this algorithm can detect patterns that exist across multiple series so you can see how different events are related. For example, you can find out how sell-in and sell-through numbers are related, or which product sales lead to other sales.

Figure 3: Time Series Model

Neural Networks

Neural Networks is an artificial intelligence technique that excels in finding complex relationships in data that would be missed by other algorithms. While the patterns it finds can be difficult to explain, Neural Nets is currently the best algorithm for finding non-linearities that exist in your data. Due to its thoroughness, it can be slower than other algorithms like Naive Bayes.

When Can I Dig?

So, how soon can you get to the point where you or your business analyst can mine your data warehouse? Can you learn about creating data warehouses, and data mining right away? You can, of course, go to the SQL Server 2005 Web site at: http://www.microsoft.com/technet/prodtechnol/sql/2005/default.mspx. You can download SQL 2005 and play with it. And you can watch the Webcasts at: http://www.microsoft.com/events/series/technetsqlserver2005.mspx.

But to get hands-on experience, and learn from an instructor that has been working with SQL 2005 for a year or more? SQLSoft+ has been partnering with Aspirity for a number of years. Aspirity wrote the Microsoft course on SQL 2000 Data Warehousing and is about to release a non-Microsoft five-day course on Data Warehousing and Data Mining.

Recommended Classes