Simplify Data Mining Solutions Development using Oracle JDM 11g and Java 5 annotations (Part-1)
In this article I use the Oracle implementation of the JDM 1.1 standard API (OJDM API) for the examples, including Oracle-specific JDM extension features.
Java 5 (the Tiger release) introduced many new language features such as generics, enums, and annotations. In this article we focus on using the annotations feature to develop data mining (DM) annotations that greatly simplify writing data mining scripts from an analyst's perspective. Java annotations provide the ability to attach metadata to Java program elements such as packages, classes, constructors, methods, and fields. For example, to specify Data Mining Engine (DME) connection details, a custom DM annotation called DME is used instead of the JDM Connection API. This hides the details of the API used to obtain the DME connection; the Data Mining (DM) Annotation Processor creates the DME connection from the annotation details. The rest of the article introduces the DM annotations and shows how to use them to develop data mining solutions.
Listing-1
@DME(url = "myHost:1521:orcl", user = "dmuser")
public class CustomerAttrition { ... }
Java annotations are specified with an @ prefix followed by the annotation name and the values of its attributes. In Listing-1, DME is the annotation name; it specifies the DME location (url) and user authentication information. In this example the DME annotation is specified at the class level so that all program elements in the class, such as constructors and methods, can access the DME connection details.
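This article does not cover creating these annotation types (see the references at the end), but for orientation, a class-level annotation like DME could be declared roughly as follows. This is only a sketch: the element names mirror the DME syntax described later in this article, while the retention/target choices and defaults are assumptions.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of a possible DME annotation type; element names follow the DME syntax
// in this article, meta-annotations and defaults are assumptions.
@Retention(RetentionPolicy.RUNTIME) // visible to the DM Annotation Processor at run time
@Target(ElementType.TYPE)           // class-level annotation, as in Listing-1
public @interface DME {
    String url();                   // DME location, e.g. "<hostname>:<port>:<sid>"
    String user();                  // DME user name
    String locale() default "";     // optional "<language>[,<country>[,variant]]"
    String password() default "";   // optional; discouraged for security reasons
}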
Before describing the list of data mining annotations and their usage, let us briefly look at a simple but complete example that uses them. This example gives a high-level understanding of how a data mining annotated Java program is structured. Listing-2 shows a Java class called CustomerAttrition that is used to model, test, and predict customers who are likely to attrite. Line 6 specifies the class-level DME annotation for the connection details, and lines 9-16 specify the DM annotations for the method predictCustomerResponses. When you execute this class by calling the predictCustomerResponses method (line 30) with the given DM annotations, it builds a classification model called AttritionModel using the decision tree algorithm and the training data in the KNOWN_CUSTOMERS table. After a successful model build, AttritionModel is tested using the test data in the TEST_CUSTOMERS table, and the model is applied to predict the attrition value for the new customers in the NEW_CUSTOMERS table, creating the output table CUST_ATTRITION_PREDICTIONS with the predictions. Both the test and apply tasks run in parallel after the successful model build. Getting the same functionality with the regular OJDM API would take a significant amount of coding and a good understanding of that API. As you can see, DM annotations greatly simplify developing a data mining script from an analyst's perspective. Given a code template with simple instructions, a DM analyst can write and execute these scripts. At the end of this article I will provide a code template and brief instructions from a non-Java programmer's perspective.
Listing-2
01 package ojdm.annotation.sample;
02
03 import javax.datamining.JDMException;
04 import ojdm.annotation.*;
05
06 @DME(url = "myHost:1521:orcl", user = "dmuser")
07 public class CustomerAttrition {
08
09 @Function ( "Classification" )
10 @Algorithm ( "Tree" )
11 @BuildInputData ( value="KNOWN_CUSTOMERS", caseId={"CUST_ID"} )
12 @TargetAttribute ( value="ATTRITION", dataType="NUMBER" )
13 @Model( "AttritionModel" )
14 @Test ( input="TEST_CUSTOMERS", output="AttritionModelMetrics" )
15 @Apply ( input="NEW_CUSTOMERS", output="CUST_ATTRITION_PREDICTIONS")
16 @Execute( tasks= {"BUILD", "TEST", "APPLY"}, hint="parallel" )
17
18   public void predictCustomerResponses(String dmePassword)
19       throws JDMException, NoSuchMethodException {
20     AnnotationProcessor.execute(
21       this.getClass().getMethod(
22         "predictCustomerResponses", new Class[] { String.class } ),
23       dmePassword);
24   }
25
26 public static void main(String[] args) {
27
28 CustomerAttrition custAttrition = new CustomerAttrition();
29 try {
30 custAttrition.predictCustomerResponses("dmuser");
31 } catch (NoSuchMethodException e) {
32 e.printStackTrace();
33 } catch (JDMException e) {
34 e.printStackTrace();
35 }
36 }
37 }
This article does not describe how to create the DM Java annotations themselves; curious Java programmers can refer to the links provided at the end of this article for more details on Java annotations. The next section describes the list of DM annotations and how to use them.
Data Mining Annotations
The following table lists the data mining annotations defined in this article.

| DM Annotation | Description |
| --- | --- |
| DME | Specifies the DME connection details: the location of the DME as a URL and authentication details. Optionally the user can specify locale information and a password. Specifying the password as part of the annotation is not recommended for security reasons. |
| Function | Specifies the function to be used for the model; optionally the user can specify settings associated with the function. |
| Algorithm | Specifies the algorithm associated with the function. When this annotation is not specified, the function's default algorithm is used. |
| BuildInputData | Specifies the input data to build the model. Optionally the user can specify the case id and attribute-level details. |
| Test | Specifies the input and output tables and settings related to testing the model. |
| Apply | Specifies the input and output tables and settings related to applying the model. |
| Execute | Specifies the list of tasks to be executed and settings related to task execution. |
| CostBenefits | Specifies the costs of false predictions and the benefits of true predictions. Used only by the classification function. |
| TargetPriors | Specifies the target priors associated with the original data. Used only by the classification function. |
DME
Purpose
This annotation is used to specify the DME connection details: the location of the DME as a URL and the user name. Optionally the user can specify locale information and a password. Specifying the password as part of the annotation is not recommended for security reasons.
Syntax
@DME ( url="<url>",
       user="<username>"
       [, locale="<locale>"]
       [, password="<password>"] )
Semantics
url
Use this to specify the location of the Data Mining Engine (DME). In the case of Oracle Data Mining it is the location of the database, and the URL follows the JDBC URL syntax <hostname>:<port number>:<sid>.
user
Use this to specify the DME user name. In the case of ODM it is the database user name.
locale
Use this to specify the locale to which the DME must be set. The locale is specified as a single string that can include language, country, and variant using the syntax "<language>[,<country>[,variant]]".
password
Use this to specify the password to connect to the DME. Specifying the password here is not recommended for security reasons; instead, supply the password as an argument when obtaining the connection.
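For illustration, a connection specification that also sets the locale might look like the following; the host, SID, and locale values are placeholders, the CampaignResponse class is hypothetical, and the password is deliberately omitted so it can be supplied at run time.

// Hypothetical values; the password is supplied at run time instead of being annotated.
@DME ( url="dbhost:1521:orcl", user="dmuser", locale="en,US" )
public class CampaignResponse { ... }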
Function
Purpose
This annotation is used to specify the function used to build the model; optionally the user can specify settings associated with the function.
Syntax
@Function ( value="<function>"
            [, settings={ @Setting( name="<settingName1>", value="<settingValue1>" ),
                          @Setting( name="<settingName2>", value="<settingValue2>" ) ... } ]
            [, name="<function settings name>"] )
Semantics
value
Use this to specify the function used to build the model. The following functions are supported by ODM:
- classification
- regression
- clustering
- association
- attributeImportance
- featureExtraction
settings
Use this to specify settings associated with the function. The following table lists the valid settings for each function.
| Function | Setting Name | Description | Valid Values |
| --- | --- | --- | --- |
| classification | costBenefits | Name of the cost benefits table. The cost benefits table specifies the cost of false predictions and the benefits of correct predictions. Only Decision Tree models can use cost benefits at model build time; all other classification algorithms can use cost benefits at apply time. Cost benefits can be specified using the @CostBenefits annotation; this setting refers to an already saved cost benefits table. | A cost benefits table with the specified name must exist. If a @CostBenefits annotation is specified on the method, it overrides this setting. |
| classification | priors | Name of the priors table. It is primarily used by the NaiveBayes algorithm to specify prior probabilities that offset differences in distribution between the build data and the scoring data. SVM classification uses the priors table for weights. Priors can be specified using the @Priors annotation; this setting refers to an already saved priors table. | A priors table with the specified name must exist. If a @Priors annotation is specified on the method, it overrides this setting. |
| classification | weights | Name of the weights table. It is used only by the GLM classification algorithm. The weights table stores weighting information for individual target values in a GLM classification model. The weights are used by the algorithm to bias the model in favor of higher-weighted classes. | A weights table with the specified name must exist. If a @Weights annotation is specified on the method, it overrides this setting. |
| clustering | numberOfClusters | Number of clusters generated by a clustering algorithm. Default is 10. | Must be a positive integer value (>=1). |
| association | maxRuleLength | Maximum rule length for association rules. Default is 4. | Must be an integer value between 2 and 20 (>=2 and <=20). |
| association | minConfidence | Minimum confidence for association rules. Default is 0.1. | Must be a value between 0 and 1 (>=0 and <=1). |
| association | minSupport | Minimum support for association rules. Default is 0.1. | |
| featureExtraction | numberOfFeatures | Number of features to be extracted by a feature extraction model. The default is estimated from the data by the algorithm. | Must be a positive integer value (>=1). |
| classification and regression (GLM-specific) | missingValues | Used to override the default missing value treatment for GLM algorithm model builds. By default missing values are treated by the algorithm; the user can override this behavior by specifying the "deleteRow" value. | Must be one of the supported values, e.g. "deleteRow". |
| classification and regression (GLM-specific) | rowWeightsColumn | Name of a column in the training data that contains a weighting factor for the rows. Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. Row weights can also be used to emphasize certain rows during model construction, for example to bias the model towards rows that are more recent and away from potentially obsolete data. | Must be a valid column name in the model build input table/view. |
| All functions | autoPrep | Flag to turn automatic data preparation (ADP) on or off. By default ADP is set to on for the GLM and DT algorithms and off for the others; use this setting to override the default. | on, off |
The settings attribute accepts an array of Setting annotations. For example, the following shows the association function annotation. If no settings are specified, the function's default values are used.
@Function ( value="association", settings= { @Setting(name="maxRuleLength", value="5"),
                                             @Setting(name="minConfidence", value="0.2"),
                                             @Setting(name="minSupport", value="0.4") } )
name
Use this to specify the name of a function settings table in which to save the specified function and algorithm settings used to build the model. These settings can be reused to build future models by referring to the function settings table by name.
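As a sketch of the name attribute, the following hypothetical annotation refers to an existing cost benefits table and saves the function settings under a reusable name; both the settings name and the ATTRITION_COSTS table name are placeholders.

// Hypothetical names; ATTRITION_COSTS is assumed to be an existing cost benefits table.
@Function ( value="classification",
            settings= { @Setting( name="costBenefits", value="ATTRITION_COSTS" ) },
            name="attritionClassificationSettings" )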
Algorithm
Purpose
This annotation is used to specify the algorithm used to build the model. When this annotation is not specified, the default algorithm associated with the function is used.
Syntax
@Algorithm ( value="<algorithm>"
[, settings={ @Setting( name="<settingName1>", value="<settingValue1>" ),
@Setting( name="<settingName2>", value="<settingValue2>" ) ... } ])
Semantics
value
Use this to specify the algorithm used to build the model. The following are the valid algorithm values associated with each function; the user can specify either the abbreviated algorithm name or the full name. A short example follows the list below.
- classification
- NB (or) naiveBayes
- DT (or) decisionTree
- SVM (or) supportVectorMachine
- GLM (or) generalizedLinearModels
- ABN (or) adaptiveBayesianNetworks (deprecated in 11.1)
- regression
- SVM (or) supportVectorMachine
- GLM (or) generalizedLinearModels
- clustering
- KM (or) kMeans
- OC (or) oCluster
- association
- AP (or) apriori
- attributeImportance
- MDL (or) minimumDescriptionLength
- featureExtraction
- NMF (or) nonNegativeMatrixFactorization
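For example, a clustering model built with the k-Means algorithm and a function-level numberOfClusters setting could be annotated as follows; the cluster count is a placeholder value.

// Hypothetical cluster count; KM is the abbreviated name for k-Means.
@Function ( value="clustering", settings= { @Setting( name="numberOfClusters", value="5" ) } )
@Algorithm ( value="KM" )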
settings
Use this to specify settings associated with the algorithm; each algorithm has its own set of valid settings.
The Algorithm annotation's settings attribute accepts an array of Setting annotations. For example, the following shows the Naive Bayes algorithm annotation. If no settings are specified, the algorithm's default values are used.
@Algorithm ( value="NB", settings= { @Setting(name="maxRuleLength", value="5"),
                                     @Setting(name="minConfidence", value="0.2"),
                                     @Setting(name="minSupport", value="0.4") } )
BuildInputData
Purpose
This annotation is used to specify the input data details for model building.
Syntax
@BuildInputData ( value="<input table name or SQL select statement>",
[,caseId="<case id column names>" ]
[,target="<target attribute column names>"]
[,include="<list of columns to be included>"]
[,exclude="<list of columns to be excluded>"]
[,expressions="<list of derived attribute expressions>"]
[,target="<target attribute column names>"]
[,include="<list of columns to be included>"]
[,exclude="<list of columns to be excluded>"]
[,expressions="<list of derived attribute expressions>"]
Semantics
value
Specifies the name of the table/view or SQL select statement used to create the input dataset.
caseId
Specifies the name of the column that uniquely identifies each case of the input data.
target
Specifies the name of the column that contains the target attribute values.
include
Specifies the list of included column names to build the model.
exclude
Specifies the list of column names to exclude when building the model. Note that either the include or the exclude attribute should be specified; if both are specified, only the include attribute is taken into account and the exclude attribute is ignored.
expressions
Specifies the list of SQL expressions to be embedded with the model. This attribute can take an array of Expression annotations to specify per-column SQL expressions. For example, to derive the logarithm of the revenue column and name the new column log_revenue, you can specify an expression associated with the revenue column/attribute as @Expression( value="LOG(10, revenue)", inverse="EXP(log_revenue)", outputAttrName="log_revenue" ).
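Putting these attributes together, a hypothetical build input specification with an excluded column list and a derived attribute might look like the following; the table and column names are placeholders, not values from the article.

// Hypothetical table and column names.
@BuildInputData ( value="KNOWN_CUSTOMERS",
                  caseId="CUST_ID",
                  exclude="CUST_NAME, CUST_PHONE",
                  expressions= { @Expression( value="LOG(10, revenue)",
                                              inverse="EXP(log_revenue)",
                                              outputAttrName="log_revenue" ) } )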
Test
Purpose
This annotation is used to specify the test settings to evaluate the model performance. ODM supports testing of the classification and regression models.
Syntax
@Test ( input="<input test table>",
        output="<output test metrics table>"
        [, positiveTargetValue="<String representation of the positive target value>"]
        [, numberOfLiftQuantiles="<Number of lift quantiles>"]
        [, residualOutput="<regression residual output>"] )
Semantics
input
Specifies input data for testing the model.
output
Specifies the output test metrics table.
positiveTargetValue
Specifies the positive target value for which lift and ROC will be computed.
numberOfLiftQuantiles
Specifies the number of quantiles to use when computing lift.
residualOutput
Specifies the name of the residual output table.
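As a usage sketch, testing a binary classification model and computing lift and ROC for the positive class might be annotated as follows; the table names come from Listing-2, while the positive target value "1" is an assumption.

// Table names from Listing-2; "1" is assumed to be the positive target value.
@Test ( input="TEST_CUSTOMERS",
        output="AttritionModelMetrics",
        positiveTargetValue="1" )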
Apply
Purpose
This annotation is used to specify the settings to apply the model.
Syntax
@Apply ( input="<apply input table name>",
         output="<apply output table name>"
         [, applyType="<prediction/topN/value>"]
         [, topN="<topN value>"]
         [, values="<list of values>"]
         [, sql="<Oracle SQL query using apply functions>"] )
Semantics
input
Specifies the apply input table name.
output
Specifies the apply output table name.
applyType
Specifies the type of apply. There are three types of apply: prediction, topN, and value. The following table illustrates these options for the different mining functions.
| Function | prediction | topN | value |
| --- | --- | --- | --- |
| classification | Outputs the most probable target value and its associated probability for the given case. | Outputs the top N most probable target values and their associated probabilities for the given case. | Outputs the probability associated with the specified target value(s) for the given case. |
| regression | Outputs the predicted value of the target attribute for the given case. | N/A | N/A |
| clustering | Outputs the most likely cluster that the case belongs to. | Outputs the top N most likely clusters for the given case. | Outputs the probability associated with the specified cluster(s) for the given case. |
topN
Specifies the N value for the apply type topN.
values
Specifies the values for the apply type value.
sql
Specifies an Oracle SQL query that uses the apply SQL functions. When the SQL query is specified, it overrides the applyType option and the SQL is used to create the apply output table.
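As a sketch, a top-2 apply and an SQL-based apply using Oracle's PREDICTION scoring function might look like the following; the CUST_ATTRITION_TOP2 output name is a placeholder and the model is assumed to be the AttritionModel from Listing-2.

// Return the two most probable attrition values per case (output name is hypothetical).
@Apply ( input="NEW_CUSTOMERS", output="CUST_ATTRITION_TOP2",
         applyType="topN", topN="2" )

// Explicit SQL overrides applyType and is used to create the output table.
@Apply ( input="NEW_CUSTOMERS", output="CUST_ATTRITION_PREDICTIONS",
         sql="SELECT cust_id, PREDICTION(AttritionModel USING *) attrition FROM NEW_CUSTOMERS" )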
Execute
Purpose
This annotation is used to specify execution commands and execute options.
Syntax
@Execute ( tasks="<list of tasks to be executed>"
           [, type="<execution type>"]
           [, jobClass="<DBMS Scheduler job class name>"]
           [, schedule="<time to run the tasks>"] )
Semantics
tasks
Specifies the list of tasks to be executed. Three types of tasks are used in this article: BUILD, TEST, and APPLY. This list can be extended with more tasks.
type
Specifies the execution type. There are two options for this attribute: serial and parallel. By default it is set to serial. With serial execution, all tasks are executed in the sequence specified in this annotation; with parallel execution, tasks that can run in parallel are executed in parallel.
jobClass
Specifies the name of the job class that the task belongs to. OJDM uses the DBMS Scheduler to execute the mining tasks; each task is created as a DBMS Scheduler job. The DBMS Scheduler provides the ability to classify jobs into a specific category and allocate resources accordingly, and the user can specify that job class here. By default the systemDefault job class is used.
schedule
Specifies the time at which execution of the tasks should start. The time format is fixed to MM:DD:YYYY HH24:MI.
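For example, a scheduled serial run of all three tasks under a dedicated job class could be annotated along these lines; the job class name and schedule value are placeholders.

// Hypothetical job class and schedule values.
@Execute ( tasks= { "BUILD", "TEST", "APPLY" },
           type="serial",
           jobClass="DM_JOB_CLASS",
           schedule="12:31:2011 23:30" )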
CostBenefits
Purpose
This annotation is used to specify the cost associated with false predictions and the benefit associated with correct predictions. It is applicable only to the classification function, where the target attribute has a discrete set of target values.
Syntax
@CostBenefits ( value="<list of cost/benefit values associated with each target value combination>"
                [, name="<name of the cost benefits table>"] )
Semantics
value
Specifies the cost/benefit associated with each combination of target values. For example, suppose a false prediction of a responding customer as not-responds costs $200, while a true prediction of a responding customer yields a benefit of $150; a false prediction of a not-responding customer as responds costs $10, and a true prediction of a not-responds customer yields a benefit of $10. This cost/benefit scenario is specified using the annotation as follows:
@CostBenefits ( value= { "responds,responds : -150",
                         "responds,not-responds : 200",
                         "not-responds,responds : 10",
                         "not-responds,not-responds : -10" } )
Note that benefits are represented as negative values to indicate negative cost. By default false prediction costs are set to 1 and true prediction benefits are set to 0.
name
Specifies the name of the cost matrix table in which to save these settings for reuse.
TargetPriors
Purpose
This annotation is used to specify the prior values associated with the target attribute values.
Syntax
@TargetPriors ( value="<list of target priors>" )
Semantics
value
Specifies the prior probability associated with each target value. The sum of all target value prior probabilities must equal one. Here is an example showing a prior probability of 0.20 for the response value and 0.80 for the not-response value:
@TargetPriors ( value={ "response : 0.20",
"not-response : 0.80" } )
Summary
This part of the article described Java 5 annotations and the OJDM 11g approach to simplifying data mining solution development. In the next part I will describe the Data Mining Annotation Processor in more detail and show how to use it to develop more complex data mining solutions with minimal development effort.
References
Java Annotation References:
- Making the Most of Java's Metadata, Part 1, by Jason Hunter
- Making the Most of Java's Metadata, Part 2: Custom Annotations, by Jason Hunter
- Making the Most of Java's Metadata, Part 3: Advanced Processing, by Jason Hunter
- Annotations in Tiger, Part 1: Add metadata to Java code
- Annotations in Tiger, Part 2: Custom annotations
- Using Annotations to add Validity Constraints to JavaBeans Properties
Oracle Data Mining and Java Data Mining links: