There are many approaches for assessing the quality and characteristics of a data mining model. First, you can apply several measures of statistical validity to determine the correctness and accuracy of a model. One way to do this is to separate the data into training and testing sets to test the accuracy of predictions. Another common approach is to ask business experts or analysts to review the results of the data mining model to determine whether the discovered patterns have meaning in the targeted business scenario.
"All of these methods are useful in data mining methodology and are used iteratively as you create, test, and refine models to answer a specific problem. No single comprehensive rule can tell you when a model is good enough, or when you have enough data." ("Testing and Validation (Data Mining).") However, this blog post introduces empirical research on how to judge how well a model will work.
One of the hardest tasks in data mining is building a model that can accurately predict the correct output (or, in some cases, close to the correct output). A major problem arises when the same data used to train a data mining model are also used to validate it. Data mining professionals usually frown upon this approach because it produces unrealistic, overoptimistic estimates of predictive accuracy: the model is scored on examples it has already seen.
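To make the overoptimism concrete, here is a minimal sketch in Python. It is not taken from the paper or the Microsoft article; the synthetic data, the 70/30 split, the 20% label noise, and the 1-nearest-neighbor model are all assumptions chosen for illustration. The model memorizes its training points, so scoring it on those same points looks perfect, while the held-out test set gives a more realistic figure.

```python
import random

def make_data(n, noise, rng):
    """Synthetic rows (x, label): label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # inject label noise so memorization cannot generalize
        rows.append((x, label))
    return rows

def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-NN prediction from `train`."""
    hits = sum(predict_1nn(train, x) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(400, noise=0.2, rng=rng)
rng.shuffle(data)

# Hold out 30% of the rows; the model never sees them during "training".
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)  # validated on the data it was built from
test_acc = accuracy(train, test)    # validated on unseen data

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

The gap between the two printed numbers is the overoptimism described above. In practice you would likely use a library routine for the split and a real model, but the principle is the same: score the model only on data it has not seen.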
After going through several research papers to find a good rule of thumb for selecting an approach to assessing the quality of data mining models, I compiled the process described in this blog.
Nathalie Japkowicz, Jerffeson Souza, and Stan Matwin, the Director of the Institute for Big Data Analytics, describe patterns that identify recurring solutions to the problem of evaluating data mining models. Their model evaluation pattern language guides a designer in choosing the best technique for a particular context. I have written this blog to introduce these patterns and their solutions while connecting the theory to the industry's current, practical problems. In addition, I have translated some of the paper's terms into those used by industry professionals.
The pattern language presented in the paper has five patterns. A particular pattern is selected according to the solution it applies.
The above table was retrieved from "Evaluating Data Mining Models: A Pattern Language." My next blog post will describe the patterns and approaches in detail, along with their application and relevance to industry professionals, so that they can better understand the practicality of these patterns. In addition, I will apply them to particular datasets.
Souza, Jerffeson, Stan Matwin, and Nathalie Japkowicz. "Evaluating Data Mining Models: A Pattern Language." University of Ottawa, n.d. Web. 9 Sept. 2014.
"Testing and Validation (Data Mining)." Microsoft Developer Network, n.d. Web. 9 Sept. 2014.