Data Mining Using SAS Enterprise Miner
By: Randall Matignon
An Overview of SAS Enterprise Miner
The following article concerns Enterprise Miner v.4.3, which is available in SAS v9.1.3. Enterprise Miner is an awesome product that SAS first introduced in version 8. It consists of a variety of analytical tools to support data mining analysis. Data mining is an analytical process used to support critical business decisions by analyzing large amounts of data in order to discover relationships and unknown patterns in the data. The Enterprise Miner SEMMA data mining methodology is specifically designed for handling enormous data sets in preparation for subsequent data analysis. In SAS Enterprise Miner, the SEMMA acronym stands for Sampling, Exploring, Modifying, Modeling, and Assessing large amounts of data.
The acronym reflects the usual order of the steps in a data mining analysis. The first step is usually to sample the data in order to acquire a representative sample. The next step is usually to explore the distribution or the range of values of each variable in the selected data set. This might be followed by modifying the data set by replacing missing values or transforming the data in order to achieve normality, since many of the analytical tools depend on the variables having a normal distribution. The reason is that many of the nodes in Enterprise Miner calculate the squared distances between the variables that are selected for the analysis. The next step might be to model the data. In other words, there might be interest in predicting certain variables in the data. The final step might be to determine which models are best by comparing the accuracy of the different models that have been created.
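The effect of a normalizing transformation in the Modify step can be illustrated outside Enterprise Miner. The following is a minimal Python sketch, not Enterprise Miner code: the income values are hypothetical, and the log transformation is only one common choice that is assumed here for illustration. It compares the skewness of a right-skewed variable before and after the transformation.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed variable (for example, incomes); not from the article.
incomes = [18_000, 22_000, 25_000, 27_000, 30_000,
           34_000, 41_000, 55_000, 90_000, 400_000]

# Assumed transformation: natural log, which pulls in the long right tail.
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

print(f"skewness before transform: {raw_skew:.2f}")
print(f"skewness after transform:  {log_skew:.2f}")
```

A single extreme value dominates any calculation based on squared distances, which is why reducing the skew before modeling can matter for the distance-based nodes.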
The Ease of Use of Enterprise Miner
SAS Enterprise Miner is a powerful module introduced in version 8. More importantly, SAS Enterprise Miner is a very easy application to learn and to use. It is visual programming with a graphical user interface. The power of the product is that you do not even need to know SAS programming, and need very little statistical expertise, to develop an Enterprise Miner project, since it is as simple as selecting icons, or nodes, from the EM tools palette or menu bar and dragging and dropping them onto the EM diagram workspace or desktop. The nodes are then connected to one another in the graphical diagram workspace. Yet an expert statistician can adjust and fine-tune the default settings and run the SEMMA process flow diagram to their own specifications. The SAS Enterprise Miner diagram workspace environment looks similar to the desktop in Windows 95, 98, XP, and Vista.
Enterprise Miner can save a tremendous amount of time that would otherwise be spent programming in SAS. However, SAS Enterprise Miner also has a powerful SAS Code node that brings the capability of SAS programming into the SEMMA data mining process, using a SAS data step to access a wide range of the powerful SAS procedures from within the process flow diagram. Enterprise Miner produces a wide variety of output, from descriptive, univariate, and goodness-of-fit statistics and numerous types of charts and plots to traditional regression modeling, decision tree analysis, principal component analysis, cluster analysis, association analysis, and link analysis, along with automatically generated graphs that can be directed to the SAS output window.
The Purpose of the Enterprise Miner Nodes
Data mining is a sequential process of Sampling, Exploring, Modifying, Modeling, and Assessing large amounts of data to discover trends, relationships, and unknown patterns in the data. SAS Enterprise Miner is designed for SEMMA data mining. SEMMA stands for the following.
Sample: Identify an analysis data set that is large enough to make significant findings, yet small enough to compile the code in a reasonable amount of time. The nodes create the analysis data set, randomly sample the source data set, or partition the source data set into training, validation, and test data sets.
Explore: Explore the data sets to look for unexpected trends, relationships, patterns, or unusual observations while at the same time getting familiar with the data. The nodes plot the data, generate a wide variety of analyses, identify important variables, or perform association analysis.
Modify: Prepare the data for analysis. The nodes can create additional variables or transform existing variables by modifying the way in which the variables are used in the analysis, filter the data, replace missing values, condense and collapse the data in preparation for time series modeling, or perform cluster analysis.
Model: Fit the statistical model. The nodes predict the target variable from the input variables by using either least-squares or logistic regression, decision tree, neural network, dmneural network, user-defined, ensemble, nearest neighbor, or two-stage modeling.
Assess: Compare the accuracy of the statistical models. The nodes compare the performance of the various classification models by viewing the competing probability estimates from the lift charts, ROC charts, and threshold charts. For predictive modeling designs, the performance of each model and the modeling assumptions can be verified from the prediction plots and diagnostic charts.
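The lift charts mentioned under Assess plot how much better a model finds the target event among its highest-scoring cases than the overall event rate would. The following is a minimal Python sketch of that calculation, not Enterprise Miner output: the function name `cumulative_lift`, the scores, and the outcomes are all hypothetical values chosen for illustration.

```python
def cumulative_lift(scores, actuals, fraction):
    """Event rate in the top `fraction` of cases ranked by score,
    divided by the event rate in the whole data set."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate


# Hypothetical predicted probabilities and observed 0/1 outcomes.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = cumulative_lift(scores, actuals, 0.20)
lift_top_50 = cumulative_lift(scores, actuals, 0.50)

print(f"lift in top 20% of cases: {lift_top_20:.2f}")
print(f"lift in top 50% of cases: {lift_top_50:.2f}")
```

A lift above 1 means the model concentrates events in its top-ranked cases. Comparing these values across competing models at the same fraction is one way to read a lift chart.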
Note: Although the Utility nodes are not a part of the SEMMA acronym, these nodes allow you to perform group processing, create a data mining data set to view various descriptive statistics from the entire data set, and organize the process flow more efficiently by reducing the number of connections or condensing the process flow into smaller, more manageable subdiagrams.
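The partitioning described in the Sample step, in which the source data set is split into training, validation, and test data sets, can be sketched in a few lines of Python. This is an illustrative sketch only, not the way Enterprise Miner implements it: the function name `partition`, the 40/30/30 proportions, and the seed are assumptions made for the example.

```python
import random


def partition(rows, train=0.4, validation=0.3, seed=12345):
    """Randomly split rows into training, validation, and test lists.

    Whatever is left after the training and validation shares goes to test.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical source data set of 100 observation identifiers.
source = list(range(100))
training, validating, testing = partition(source)

print(len(training), len(validating), len(testing))
```

Each observation lands in exactly one of the three data sets, so the model can be fitted on one, tuned on another, and evaluated on data it has never seen.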
The Enterprise Miner Nodes
Sample Nodes
The purpose of the Input Data Source node is to read in a SAS data set or to import and export other types of data through the SAS Import Wizard. The Input Data Source node reads the data source and creates a data set called a metadata sample that automatically defines the variable attributes for later processing within the process flow. In the metadata sample, each variable is automatically assigned a level of measurement and a variable role for the analysis. For example, categorical variables with more than two class levels and fewer than ten class levels are automatically assigned a nominal measurement level with an input