Wednesday, July 11, 2007

Simplify Data Mining Solutions Development using Oracle JDM 11g and Java 5 annotations (Part-1)


Data mining analysts who use the Java API for Data Mining (JDM) often tell me: "I need to have good Java programming skills to do data mining using the JDM API". It is true that the JDM API is designed for Java programmers who develop data mining and predictive analytics applications. In practice, however, not many data mining analysts are familiar with Java programming. The objective of this article is to provide an approach for data mining analysts who are not Java programmers but want to write data mining scripts that can be easily integrated with applications. The article is divided into three parts. Part-1 details the proposed data mining annotations, with a simple example that illustrates data mining solution development using them. In Part-2, I plan to post the details of the data mining annotation processor and a guide to developing solutions with the DM annotations, which are built on OJDM. In Part-3, I plan to provide code templates and the environment setup needed to start using this annotations approach.

The examples in this article use the Oracle implementation of the JDM 1.1 standard API (the OJDM API), including Oracle-specific JDM extension features.

Java 5 (the Tiger release) introduces many new language features, such as generics, enums and annotations. In this article we focus on using the annotations feature to develop data mining (DM) annotations that greatly simplify developing data mining scripts from an analyst's perspective. Java annotations provide the ability to specify metadata on Java program elements such as packages, classes, constructors, methods and fields. For example, to specify the Data Mining Engine (DME) connection details, a custom DM annotation called DME is used instead of the JDM Connection API. This hides the details of the API used to obtain the DME connection; the Data Mining (DM) Annotation Processor creates the DME connection from the annotation details. The rest of the article introduces the DM annotations and how to use them in developing data mining solutions.
@DME(url = "myHost:1521:orcl", user = "dmuser")
public class CustomerAttrition { ... }

Listing-1

Java annotations are specified using the @ prefix, followed by the name of the annotation and then the values of the annotation-specific attributes. In Listing-1, DME is the annotation name, and its attributes specify the DME location (url) and the user name used for authentication. In this example the DME annotation is specified at the class level so that all program elements in that class, such as constructors and methods, can access the DME connection details.
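For readers who want to see the mechanics, here is a minimal sketch of how an annotation like DME could be declared and read back with reflection. The declaration below is hypothetical (the attribute defaults, the retention policy and the describeConnection helper are my assumptions for illustration); the actual ojdm.annotation.DME type may be defined differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class DmeAnnotationSketch {

    // Hypothetical declaration; the real ojdm.annotation.DME may differ.
    @Retention(RetentionPolicy.RUNTIME)   // keep the values visible to reflection
    @Target(ElementType.TYPE)             // class-level use, as in Listing-1
    @interface DME {
        String url();
        String user();
        String locale() default "";       // optional attribute
        String password() default "";     // optional attribute, not recommended
    }

    @DME(url = "myHost:1521:orcl", user = "dmuser")
    static class CustomerAttrition { }

    // Returns "user@url" for a class that carries @DME, or null if it has none.
    static String describeConnection(Class<?> type) {
        DME dme = type.getAnnotation(DME.class);
        return dme == null ? null : dme.user() + "@" + dme.url();
    }

    public static void main(String[] args) {
        // prints "dmuser@myHost:1521:orcl"
        System.out.println(describeConnection(CustomerAttrition.class));
    }
}
```

The RUNTIME retention policy is what lets a processor see the values while the program runs; with the default retention the annotation would not be visible to reflection.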

Before describing the data mining annotations and their usage, let us briefly look at a simple, complete example. It gives a high-level view of how a Java program with data mining annotations is structured. Listing-2 shows a Java class called CustomerAttrition that builds and tests a model and predicts which customers are likely to attrite. Line 6 specifies the class-level DME annotation with the connection details, and Lines 9-16 specify the DM annotations for the method predictCustomerResponses. When you execute this class, main calls predictCustomerResponses (Line 30), and the given DM annotations cause a classification model called AttritionModel to be built using the decision tree algorithm and the training data in the KNOWN_CUSTOMERS table. After a successful model build, AttritionModel is tested using the test data in the TEST_CUSTOMERS table and applied to the new customers in the NEW_CUSTOMERS table, which creates the output table CUST_ATTRITION_PREDICTIONS with the attrition predictions. The test and apply tasks run in parallel once the model build succeeds. Achieving the same functionality with the regular OJDM API takes a significant amount of coding and requires a good understanding of that API. As you can see, DM annotations greatly simplify developing a data mining script from an analyst's perspective. Given a code template with simple instructions, a DM analyst can write and execute these scripts. Part-3 of this article will provide a code template and brief instructions from a non-Java programmer's perspective.
CustomerAttrition.java
01 package ojdm.annotation.sample;
02
03 import javax.datamining.JDMException;
04 import ojdm.annotation.*;
05
06 @DME(url = "myHost:1521:orcl", user = "dmuser")
07 public class CustomerAttrition {
08
09 @Function ( "Classification" )
10 @Algorithm ( "Tree" )
11 @BuildInputData ( value="KNOWN_CUSTOMERS", caseId={"CUST_ID"} )
12 @TargetAttribute ( value="ATTRITION", dataType="NUMBER" )
13 @Model( "AttritionModel" )
14 @Test ( input="TEST_CUSTOMERS", output="AttritionModelMetrics" )
15 @Apply ( input="NEW_CUSTOMERS", output="CUST_ATTRITION_PREDICTIONS")
16 @Execute( tasks= {"BUILD", "TEST", "APPLY"}, hint="parallel" )
17
18 public void predictCustomerResponses(String dmePassword)
19 throws JDMException, NoSuchMethodException {
20 AnnotationProcessor.execute(
21 this.getClass().getMethod(
22 "predictCustomerResponses", new Class[] { String.class } ),
23 dmePassword);
24 }
25
26 public static void main(String[] args) {
27
28 CustomerAttrition custAttrition = new CustomerAttrition();
29 try {
30 custAttrition.predictCustomerResponses("dmuser");
31 } catch (NoSuchMethodException e) {
32 e.printStackTrace();
33 } catch (JDMException e) {
34 e.printStackTrace();
35 }
36 }
37 }
Listing-2

This article does not describe how to create the DM Java annotations themselves; curious Java programmers can refer to the links provided at the end of this article for more details on Java annotations. The next section describes the list of DM annotations and how to use them.
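As a rough illustration of what happens behind AnnotationProcessor.execute, the sketch below shows how method-level annotations like those in Listing-2 can be read with reflection. The four annotation types here are hypothetical stand-ins for the ojdm.annotation types, and the plan and planFor helpers are names I made up; the real processor (covered in Part-2) would go on to call the OJDM API with these values.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.Arrays;

class ProcessorSketch {

    // Hypothetical stand-ins for some of the annotations used in Listing-2.
    @Retention(RetentionPolicy.RUNTIME) @interface Function  { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Algorithm { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Model     { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Execute   { String[] tasks(); }

    static class CustomerAttrition {
        @Function("Classification")
        @Algorithm("Tree")
        @Model("AttritionModel")
        @Execute(tasks = {"BUILD", "TEST", "APPLY"})
        public void predictCustomerResponses(String dmePassword) { }
    }

    // Collects the annotation values a processor would need before calling OJDM.
    static String plan(Method m) {
        Function f = m.getAnnotation(Function.class);
        Algorithm a = m.getAnnotation(Algorithm.class);
        Model model = m.getAnnotation(Model.class);
        Execute e = m.getAnnotation(Execute.class);
        return model.value() + ": " + f.value() + "/" + a.value()
                + " " + Arrays.toString(e.tasks());
    }

    // Looks the method up by name and parameter types, as Listing-2 does.
    static String planFor(Class<?> type, String methodName, Class<?>... params) {
        try {
            return plan(type.getMethod(methodName, params));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        // prints "AttritionModel: Classification/Tree [BUILD, TEST, APPLY]"
        System.out.println(planFor(CustomerAttrition.class,
                "predictCustomerResponses", String.class));
    }
}
```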



Data Mining Annotations

The following Data Mining Annotations are defined in this article.

DM Annotation
Description
DME Specifies the DME connection details, such as the location of the DME (as a URL) and the authentication details. Optionally, the user can specify locale information and a password. Specifying the password as part of the annotation is not recommended for security reasons.
Function Specifies the function to be used for the model. Optionally, the user can specify settings associated with the function.
Algorithm Specifies the algorithm associated with the function. When this annotation is not specified, the function's default algorithm is used.
BuildInputData Specifies the input data to build the model. Optionally, the user can specify the case id and attribute-level details.
Test Specifies the input, output tables and settings related to testing the model.
Apply Specifies the input, output tables and settings related to applying the model.
Execute Specifies the list of tasks to be executed and settings related to the task execution.
CostBenefits Specifies the costs of false predictions and benefits of true predictions. Used only by the classification function.
TargetPriors Specifies the target priors associated with the original data. Used only by the classification function.


DME

Purpose
This annotation is used to specify the DME connection details: the location (as a URL) and the user name. Optionally, the user can specify locale information and a password. Specifying the password as part of the annotation is not recommended for security reasons.


Syntax
@DME ( url="<url>",
user="<username>"
[,locale="<locale>"]
[, password="<password>"] )

Semantics

url
Use this to specify the location of the Data Mining Engine (DME). In the case of Oracle Data Mining (ODM), it is the location of the database, and the URL follows the JDBC URL syntax <hostname>:<port number>:<sid>.

user
Use this to specify the user name of the DME. In case of ODM it is the user name of the database.

locale
Use this to specify the locale that the DME must be set to. The locale is specified as a single string that can contain the language, country and variant, using the following syntax: "<language>[,<country>[,<variant>]]".

password
Use this to specify the password to connect to the DME. It is not recommended to specify the password for security reasons. Instead supply the password as an argument to get the connection.
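The sketch below shows one way the url and locale strings described above could be interpreted. It assumes that the processor simply prefixes the url with the Oracle thin-driver scheme (jdbc:oracle:thin:@) and that the locale parts are passed to java.util.Locale; both are my assumptions, and the helper names toJdbcUrl and toLocale are hypothetical.

```java
import java.util.Locale;

class DmeValueParsing {

    // Turns "<hostname>:<port number>:<sid>" into an Oracle thin-driver JDBC URL.
    static String toJdbcUrl(String dmeUrl) {
        String[] parts = dmeUrl.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected <hostname>:<port number>:<sid> but got " + dmeUrl);
        }
        Integer.parseInt(parts[1]);   // fails fast on a non-numeric port
        return "jdbc:oracle:thin:@" + dmeUrl;
    }

    // Parses "<language>[,<country>[,<variant>]]" into a java.util.Locale.
    static Locale toLocale(String spec) {
        String[] p = spec.split(",");
        String language = p[0].trim();
        String country = p.length > 1 ? p[1].trim() : "";
        String variant = p.length > 2 ? p[2].trim() : "";
        return new Locale(language, country, variant);
    }

    public static void main(String[] args) {
        // prints "jdbc:oracle:thin:@myHost:1521:orcl"
        System.out.println(toJdbcUrl("myHost:1521:orcl"));
        Locale locale = toLocale("en,US");
        System.out.println(locale.getLanguage() + "/" + locale.getCountry());
    }
}
```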

Function

Purpose
This annotation is used to specify the function used to build the model. Optionally, the user can specify the settings associated with the function.


Syntax
@Function ( value="<function>"
[, settings={ @Setting( name="<settingName1>", value="<settingValue1>" ),
@Setting( name="<settingName2>", value="<settingValue2>" ) ... } ]
[, name="<function settings name>"] )

Semantics

value
Use this to specify the function used to build the model. The following functions are supported by ODM.

  • classification
  • regression
  • clustering
  • association
  • attributeImportance
  • featureExtraction

settings
Use this to specify settings associated with the function. The following table lists the valid settings for each function.


Function
Setting Name
Description
Valid Values
classification
costBenefits
Name of the cost benefits table. The cost benefits table specifies the costs of false predictions and the benefits of right predictions. Only Decision Tree models can use cost benefits at model build time. All other classification algorithms can use cost benefits at apply time.
Cost benefits can be specified using the @CostBenefits annotation. This setting is used to specify an already saved cost benefits table.
A cost benefits table with the specified name must exist. If a @CostBenefits annotation is specified on the method, it overrides this setting.

priors
Name of the priors table. It is primarily used by the NaiveBayes algorithm to specify prior probabilities that offset differences in distribution between the build data and the scoring data. SVM classification uses the priors table for weights.
Priors can be specified using the @TargetPriors annotation. This setting is used to specify an already saved priors table.
A priors table with the specified name must exist. If a @TargetPriors annotation is specified on the method, it overrides this setting.

weights
Name of the weights table. It is used only by GLM classification algorithm. Weights table stores weighting information for individual target values in a GLM classification model. The weights are used by the algorithm to bias the model in favor of higher weighted classes.
Must have a table with the specified weights. If there is a @Weights annotation specified in the method then it will override this setting.
clustering
numberOfClusters
Number of clusters generated by a clustering algorithm. Default is 10.
Must be a positive integer value. >=1
association
maxRuleLength
Maximum rule length for association rules. Default is 4.
Must be a positive integer value between 2 and 20. (>=2 and <=20)

minConfidence
Minimum confidence for association rules. Default is 0.1.
Must be value between 0 and 1. (>=0 and <=1)

minSupport
Minimum support for association rules. Default is 0.1.

featureExtraction
numberOfFeatures
Number of features to be extracted by a feature extraction model. The default is estimated from the data by the algorithm.
Must be a positive integer value. >=1
classification and regression
GLM specific settings
missingValues
Used to override the default missing value treatment for GLM model builds. By default, missing values are treated by the algorithm. The user can override this behavior by specifying the "deleteRows" value.
Must be one of the following values:
  • deleteRows
  • meanMode

rowWeightsColumn
Name of a column in the training data that contains a weighting factor for the rows. Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. Row weights can also be used to emphasize certain rows during model construction. For example, to bias the model towards rows that are more recent and away from potentially obsolete data.
Must be a valid column name in the model build input table/view.
All Functions
autoPrep
Flag that turns automatic data preparation (ADP) on or off. By default, ADP is ON for the GLM and DT algorithms and OFF for the others. Use this setting to override the default ADP behavior.
Valid values are
  • ON
  • OFF

The settings attribute accepts an array of Setting annotations. For example, the following shows the Function annotation for the association function. If no settings are specified, the function's default values are used.

@Function ( value="association", settings= { @Setting(name="maxRuleLength", value="5"),
@Setting(name="minConfidence", value="0.2"),
@Setting(name="minSupport", value="0.4") } )

name
Use this to specify the name of the function settings table to save the specified function and algorithm settings to build the model. These settings can be reused to build future models by referring to the function settings table name.
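To make the nested-annotation syntax concrete, here is a small sketch that declares hypothetical Setting and Function annotation types matching the syntax above and checks the association settings against the ranges in the table. The declarations, the MarketBasket class and the isValid/allValid helpers are illustrative assumptions, not the OJDM annotation sources. The table gives no range for minSupport, so the sketch assumes it shares the 0 to 1 range of minConfidence.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

class FunctionSettingsSketch {

    // Hypothetical declarations that mirror the @Function syntax above.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Setting { String name(); String value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @interface Function {
        String value();
        Setting[] settings() default {};
        String name() default "";
    }

    static class MarketBasket {
        @Function(value = "association", settings = {
            @Setting(name = "maxRuleLength", value = "5"),
            @Setting(name = "minConfidence", value = "0.2"),
            @Setting(name = "minSupport", value = "0.4") })
        void findRules() { }
    }

    // Checks one association setting against the ranges listed in the table.
    static boolean isValid(String name, String value) {
        try {
            switch (name) {
                case "maxRuleLength": {
                    int length = Integer.parseInt(value);
                    return length >= 2 && length <= 20;
                }
                case "minConfidence":
                case "minSupport": {   // assumed to share minConfidence's range
                    double fraction = Double.parseDouble(value);
                    return fraction >= 0.0 && fraction <= 1.0;
                }
                default:
                    return false;      // not an association setting
            }
        } catch (NumberFormatException e) {
            return false;              // value is not a number
        }
    }

    // Validates every @Setting on the named method's @Function annotation.
    static boolean allValid(Class<?> type, String methodName) {
        for (Method m : type.getDeclaredMethods()) {
            Function f = m.getAnnotation(Function.class);
            if (m.getName().equals(methodName) && f != null) {
                for (Setting s : f.settings()) {
                    if (!isValid(s.name(), s.value())) return false;
                }
                return true;
            }
        }
        return false;                  // method or annotation not found
    }

    public static void main(String[] args) {
        System.out.println(allValid(MarketBasket.class, "findRules")); // true
        System.out.println(isValid("maxRuleLength", "25"));            // false
    }
}
```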

Algorithm

Purpose
This annotation is used to specify the algorithm used to build the model. When this annotation is not specified then the default algorithm associated with the function will be used.


Syntax
@Algorithm ( value="<algorithm>"
[, settings={ @Setting( name="<settingName1>", value="<settingValue1>" ),
@Setting( name="<settingName2>", value="<settingValue2>" ) ... } ])

Semantics

value
Use this to specify the algorithm used to build the model. The following are the valid algorithm values for each function. The user can specify either the abbreviated algorithm name or the full name.

  • classification
    • NB (or) naiveBayes
    • DT (or) decisionTree
    • SVM (or) supportVectorMachine
    • GLM (or) generalizedLinearModels
    • ABN (or) adaptiveBayesianNetworks (deprecated in 11.1)
  • regression
    • SVM (or) supportVectorMachine
    • GLM (or) generalizedLinearModels
  • clustering
    • KM (or) kMeans
    • OC (or) oCluster
  • association
    • AP (or) apriori
  • attributeImportance
    • MDL (or) minimumDescriptionLength
  • featureExtraction
    • NMF (or) nonNegativeMatrixFactorization
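A processor has to accept either spelling of an algorithm and reject algorithms that do not belong to the chosen function. The following sketch shows one possible lookup built from the list above. The class name, the resolve helper and the case-insensitive matching are my assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class AlgorithmNames {

    // function (lower-cased) -> (abbreviation -> full name), copied from the list above
    private static final Map<String, Map<String, String>> VALID = new LinkedHashMap<>();

    static {
        put("classification", "NB", "naiveBayes");
        put("classification", "DT", "decisionTree");
        put("classification", "SVM", "supportVectorMachine");
        put("classification", "GLM", "generalizedLinearModels");
        put("classification", "ABN", "adaptiveBayesianNetworks");
        put("regression", "SVM", "supportVectorMachine");
        put("regression", "GLM", "generalizedLinearModels");
        put("clustering", "KM", "kMeans");
        put("clustering", "OC", "oCluster");
        put("association", "AP", "apriori");
        put("attributeImportance", "MDL", "minimumDescriptionLength");
        put("featureExtraction", "NMF", "nonNegativeMatrixFactorization");
    }

    private static void put(String function, String abbreviation, String fullName) {
        VALID.computeIfAbsent(function.toLowerCase(Locale.ROOT), k -> new LinkedHashMap<>())
             .put(abbreviation, fullName);
    }

    // Returns the full algorithm name, or throws if it is not valid for the function.
    // Matching ignores case, which is an assumption of this sketch.
    static String resolve(String function, String algorithm) {
        Map<String, String> algorithms = VALID.get(function.toLowerCase(Locale.ROOT));
        if (algorithms == null) {
            throw new IllegalArgumentException("Unknown function: " + function);
        }
        for (Map.Entry<String, String> e : algorithms.entrySet()) {
            if (e.getKey().equalsIgnoreCase(algorithm)
                    || e.getValue().equalsIgnoreCase(algorithm)) {
                return e.getValue();
            }
        }
        throw new IllegalArgumentException(
                algorithm + " is not a valid algorithm for " + function);
    }

    public static void main(String[] args) {
        System.out.println(resolve("classification", "DT"));   // decisionTree
        System.out.println(resolve("clustering", "kMeans"));   // kMeans
    }
}
```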

settings
Use this to specify settings associated with the algorithm. The following table lists the valid settings for each algorithm.


Algorithm
Setting Name
Description
Valid Values
Naive Bayes (NB)
singletonThreshold
The minimum percentage of singleton occurrences required for including a predictor in the model.
Valid value must be between 0 and 1.
(>=0 and <=1)

pairwiseThreshold
The minimum percentage of pairwise occurrences required for including a predictor in the model.
Valid value must be between 0 and 1.
(>=0 and <=1)
Decision Tree
(DT)
homogeneityMetric (or) impurityMetric
Tree impurity metric for Decision Tree.
Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini or entropy as the purity metric. By default, the algorithm uses gini.
Valid values:
  • gini (default)
  • entropy

maxDepth
Specifies the maximum depth of the tree, from root to leaf inclusive. The default is 7.
Must be >=2 and <=20. Default is 7.

minNodeSize
No child shall have fewer records than this number, which is expressed as a percentage of the training rows.
Default is 0.05, indicating 0.05%.
Must be >=0% and <=10%. Default is 0.05%

minNodeCaseCount
Specifies the minimum number of cases required in a child node. Default is 10.
Must be a positive integer greater than zero. Default is 10.

minSplitSize
Specifies the minimum number of cases required in a node for a further split to be possible, expressed as a percentage of all the rows in the training data. The default is 0.1%.
Must be >=0% and <=20%. Default is 0.1%

minSplitCaseCount
Specifies minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value. Default is 20.
Must be positive integer greater than zero. Default is 20.
SupportVectorMachine (SVM)
activeLearning
When active learning is enabled, the SVM algorithm uses active learning to build a reduced size model. When active learning is disabled, the SVM algorithm builds a standard model.

By default it is enabled.

Valid values:
  • enable
  • disable

complexityFactor
Value of complexity factor for SVM algorithm.
Default value estimated from the data by the algorithm.

>0

convergenceTolerance
Convergence tolerance for SVM algorithm.
Default is 0.001.
>0

kernelFunction
Kernel for Support Vector Machine. The default is determined by the algorithm based on the number of attributes in the training data.
When there are many attributes, the algorithm uses a linear kernel, otherwise it uses a nonlinear (Gaussian) kernel.
Valid values:
  • linear
  • gaussian

kernelCacheSize
(Gaussian Kernel only)
Value of kernel cache size for SVM algorithm. Applies to Gaussian kernel only.
Default is 50000000 bytes.
>0

standardDeviation
Value of standard deviation for SVM algorithm. This is applicable only for Gaussian kernel.
Default value estimated from the data by the algorithm.
>0

epsilon
(regression specific)
Value of epsilon factor for SVM regression.
Default value estimated from the data by the algorithm.
>0

outlierRate
(one-class specific)
The desired rate of outliers in the training data. Valid for One-Class SVM models only (anomaly detection).
>0 and <1
GLM
confidenceLevel
The confidence level for coefficient confidence intervals.
The default confidence level is 0.95.
>0 and <1

diagnosticsTable
The name of a table to contain row-level diagnostic information for a GLM model. The table is created during model build.
If you want to create a diagnostics table, you must specify a case ID when you build the model.

valid table name

referenceClass
The target value to be used as the reference value in a logistic regression model. Probabilities will be produced for the other (non-reference) class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.



ridgeRegression
Specifies whether or not ridge regression will be enabled.
By default, the algorithm determines whether or not to use ridge. Use this setting to explicitly enable or disable ridge.

Valid values:
  • enable
  • disable

ridgeValue
The value for the ridge parameter used by the algorithm. This setting is only used when you explicitly enable ridge regression.
If ridge regression is enabled internally by the algorithm, the ridge parameter is determined by the algorithm.
>0

VIFforRidge
(Linear regression specific)
Variance Inflation Factor (VIF) statistics when ridge is being used.
By default, VIF is not produced when ridge is enabled.
When you explicitly enable ridge regression, you can request VIF statistics by setting this to enable; the algorithm will produce VIF if enough system resources are available.
Valid values:
  • enable
  • disable
ABN (Deprecated in 11.1)
maxBuildDuration
Maximum time to complete an ABN model build.
Default is 0, which implies no time limit.
>0

modelType
Type of ABN model
Valid values:
  • multiFeature
  • naiveBayes
  • singleFeature


maxNBPredictors
Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type "naiveBayes".

Default is 10.

>0

maxPredictors
Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type single/multi feature.
Default is 25.


k-Means
distanceFunction
Distance function for k-Means clustering. The default is euclidean.
Valid values:
  • cosine
  • euclidean (default)
  • fastCosine

splitCriterion
Split criterion for k-Means clustering. The default criterion is the variance.
Valid values:
  • size
  • variance (default)

numberOfIterations
Number of iterations for k-Means algorithm
Default is 3
>0 and <=20

minPercentAttrSupport
The fraction of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.
Setting the parameter value too high in data with missing values can result in very short or even empty rules.
Default is 0.1.
>=0 and <=1

blockGrowth
Growth factor for memory allocated to hold cluster data.
Default value is 2
>1 and <=5
O-Cluster
sensitivity
A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.
Default is 0.5.
>=0 and <=1

bufferSize
Buffer size for O-Cluster.
Default is 50,000.
>0
NMF
convergenceTolerance
Convergence tolerance for NMF algorithm
Default is 0.05
>0 and <=0.5

numberOfIterations
Number of iterations for NMF algorithm
Default is 50
>=1 and <=500

randomSeed
Random seed for NMF algorithm.
Default is –1.


The settings attribute of the Algorithm annotation accepts an array of Setting annotations. For example, the following shows the Algorithm annotation for the naive Bayes algorithm. If no settings are specified, the algorithm's default values are used.

@Algorithm ( value="NB", settings= {
@Setting(name="singletonThreshold", value="0.01"),
@Setting(name="pairwiseThreshold", value="0.01") } )
BuildInputData

Purpose
This annotation is used to specify the input data details for model building.


Syntax
@BuildInputData ( value="<input table name or SQL select statement>"
[,caseId="<case id column names>" ]
[,target="<target attribute column names>"]
[,include="<list of columns to be included>"]
[,exclude="<list of columns to be excluded>"]
[,expressions="<list of derived attribute expressions>"] )


Semantics

value
Specifies the name of the table/view or SQL select statement used to create the input dataset.

caseId
Specifies the name of the column that uniquely identifies each case of the input data.

target
Specifies the name of the column that holds the target attribute.

include
Specifies the list of included column names to build the model.

exclude
Specifies the list of column names to exclude when building the model. Note that either the include or the exclude attribute can be specified. If both are specified, only the include attribute is taken into account and the exclude attribute is ignored.

expressions
Specifies the list of SQL expressions that are to be embedded with the model. This attribute can hold an array of Expression annotations that specify per-column SQL expressions. For example, to define the base-10 logarithm of the revenue column and call the new column log_revenue, you can specify an expression associated with the revenue column/attribute as @Expression( value="LOG(10, revenue)", inverse="POWER(10, log_revenue)", outputAttrName="log_revenue" ).
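The precedence rule between include and exclude described above can be sketched as follows. The select helper and the column names AGE and INCOME are hypothetical; only CUST_ID and ATTRITION appear in Listing-2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ColumnSelection {

    // Applies the rule stated above: a non-empty include list wins and the
    // exclude list is ignored; otherwise the excluded columns are dropped.
    static List<String> select(List<String> all, List<String> include, List<String> exclude) {
        List<String> result = new ArrayList<>();
        for (String column : all) {
            boolean keep = !include.isEmpty()
                    ? include.contains(column)
                    : !exclude.contains(column);
            if (keep) {
                result.add(column);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("CUST_ID", "AGE", "INCOME", "ATTRITION");
        List<String> none = Collections.emptyList();

        // Both lists given: include wins, so only AGE is kept. Prints [AGE]
        System.out.println(select(all, Arrays.asList("AGE"), Arrays.asList("AGE", "INCOME")));

        // Only exclude given. Prints [CUST_ID, AGE, ATTRITION]
        System.out.println(select(all, none, Arrays.asList("INCOME")));
    }
}
```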

Test

Purpose
This annotation is used to specify the test settings to evaluate the model performance. ODM supports testing of the classification and regression models.


Syntax
@Test ( input="<input test table>",
output="<output test metrics table>"
[,positiveTargetValue="<String representation of the positive target value>"]
[, numberOfLiftQuantiles="<Number of lift quantiles>"]
[, residualOutput="<regression residual output>"] )

Semantics

input
Specifies input data for testing the model.

output
Specifies the output test metrics table.

positiveTargetValue
Specifies the positive target value for which lift and ROC will be computed.

numberOfLiftQuantiles
Specifies the number of quantiles for computing lift value.

residualOutput
Specifies the name of the residual output table.


Apply

Purpose
This annotation is used to specify the settings to apply the model.


Syntax
@Apply ( input="<apply input table name>",
output="<apply output table name>"
[,applyType="<prediction/topN/value>"]
[, topN="<topN value>"]
[, values="<list of values>"]
[, sql="<Oracle sql query using apply functions>"] )

Semantics

input
Specifies apply input table name.

output
Specifies apply output table name.

applyType
Specifies the type of apply. There are three types of apply: prediction, topN and value. The following summary shows what each type outputs for the different mining functions.

classification
  • prediction: Outputs the most probable target value and its associated probability for the given case.
  • topN: Outputs the N most probable target values and their associated probabilities for the given case.
  • value: Outputs the probability associated with the specified target value(s) for the given case.
regression
  • prediction: Outputs the predicted value of the target attribute for the given case.
  • topN: N/A
  • value: N/A
clustering
  • prediction: Outputs the most likely cluster that the case belongs to.
  • topN: Outputs the N most likely clusters for the given case.
  • value: Outputs the probability associated with the specified cluster(s) for the given case.

topN
Specifies the N value for the apply type topN.

values
Specifies the values for the apply type value.

sql
Specifies the Oracle SQL using apply SQL functions. When the SQL query is specified, it overrides the applyType option and uses the SQL to create the apply output table.
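To illustrate the difference between the three apply types for a classification model, the sketch below derives each output from one case's per-class probabilities. It is plain Java rather than OJDM or SQL, and the class labels and probabilities are made-up illustration values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ApplyTypes {

    // topN: the N most probable target values, most probable first.
    static List<String> topN(Map<String, Double> probabilities, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(probabilities.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // prediction: the single most probable target value (needs a non-empty map).
    static String prediction(Map<String, Double> probabilities) {
        return topN(probabilities, 1).get(0);
    }

    // value: the probability of one specified target value (0.0 if it was not scored).
    static double value(Map<String, Double> probabilities, String target) {
        return probabilities.getOrDefault(target, 0.0);
    }

    public static void main(String[] args) {
        // Hypothetical probabilities for one case.
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("low", 0.2);
        probabilities.put("medium", 0.5);
        probabilities.put("high", 0.3);

        System.out.println(prediction(probabilities));     // medium
        System.out.println(topN(probabilities, 2));        // [medium, high]
        System.out.println(value(probabilities, "high"));  // 0.3
    }
}
```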

Execute

Purpose
This annotation is used to specify execution commands and execute options.


Syntax
@Execute ( tasks="<list of tasks to be executed>"
[,type="<execution type>"]
[,jobClass="<DBMS Scheduler job class name>"]
[,schedule="<time to run the tasks>"] )

Semantics

tasks
Specifies the list of tasks to be executed. Three types of tasks are specified in this article: BUILD, TEST and APPLY. This can be extended with more tasks.

type
Specifies the execution type. There are two options for this attribute: serial and parallel. The default is serial. When it is serial, all tasks are executed in the sequence specified in this annotation. When it is parallel, tasks that can run in parallel are executed in parallel.

jobClass
Specifies the name of the job class that the task belongs to. OJDM uses the DBMS Scheduler to execute mining tasks; each task is created as a DBMS Scheduler job. The DBMS Scheduler can classify jobs into categories and allocate resources to them. Here the user can specify the job class. By default the systemDefault job class is used.

schedule
Specifies the time at which execution of the tasks is to start. The time format is fixed to (MM:DD:YYYY HH24:MI).
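The following sketch illustrates two of the attributes above in plain Java. It is not how OJDM runs tasks (that goes through the DBMS Scheduler); it only shows the ordering implied by serial versus parallel for BUILD, TEST and APPLY, and how a schedule string in the stated format maps onto java.time. The pattern MM:dd:yyyy HH:mm is my Java translation of MM:DD:YYYY HH24:MI, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

class ExecuteSketch {

    // Runs BUILD first. With parallel=true, TEST and APPLY both start once
    // BUILD completes and may finish in either order; otherwise they run
    // one after the other in the listed sequence.
    static List<String> run(boolean parallel) {
        List<String> log = new CopyOnWriteArrayList<>();
        CompletableFuture<Void> build = CompletableFuture.runAsync(() -> log.add("BUILD"));
        if (parallel) {
            CompletableFuture<Void> test = build.thenRunAsync(() -> log.add("TEST"));
            CompletableFuture<Void> apply = build.thenRunAsync(() -> log.add("APPLY"));
            CompletableFuture.allOf(test, apply).join();
        } else {
            build.thenRun(() -> log.add("TEST"))
                 .thenRun(() -> log.add("APPLY"))
                 .join();
        }
        return log;
    }

    // Parses a schedule string such as "07:11:2007 14:30".
    static LocalDateTime parseSchedule(String schedule) {
        return LocalDateTime.parse(schedule, DateTimeFormatter.ofPattern("MM:dd:yyyy HH:mm"));
    }

    public static void main(String[] args) {
        System.out.println(run(false));                        // [BUILD, TEST, APPLY]
        System.out.println(run(true));                         // BUILD first, then TEST and APPLY in either order
        System.out.println(parseSchedule("07:11:2007 14:30")); // 2007-07-11T14:30
    }
}
```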

CostBenefits

Purpose
This annotation is used to specify the costs associated with false predictions and the benefits associated with right predictions. It applies only to the classification function, where the target attribute has a discrete number of target values.


Syntax
@CostBenefits ( value="<list of cost/benefit associated with each target value combination>"
[,name="<name of the cost benefits table>"] )

Semantics

value
Specifies the cost/benefit associated with each combination of target values. For example, suppose that falsely predicting a responding customer as not-responds costs $200, and correctly predicting a responding customer yields a benefit of $150. Falsely predicting a not-responding customer as responds costs $10, and correctly predicting a not-responding customer yields a benefit of $10. This cost/benefit scenario is specified using the annotation as follows:

@CostBenefits ( value= {
"responds,responds : -150",
"responds,not-responds : 200",
"not-responds,responds : 10",
"not-responds,not-responds : -10" } )

Note that benefits are represented as negative values to indicate negative cost. By default false prediction costs are set to 1 and true prediction benefits are set to 0.

name
Specifies the name of the cost matrix table in which the settings are saved for reuse.
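To show how such cost/benefit entries influence a prediction, here is a small sketch that parses the entries from the example above and computes the expected cost of each possible prediction for one case. The parsing code, the helper names and the 20%/80% probabilities are my assumptions for illustration; this is not how OJDM stores or applies the cost matrix.

```java
import java.util.HashMap;
import java.util.Map;

class CostBenefitSketch {

    // Parses entries such as "responds,not-responds : 200" into a map keyed
    // by "actual,predicted". Negative numbers are benefits.
    static Map<String, Double> parse(String... entries) {
        Map<String, Double> costs = new HashMap<>();
        for (String entry : entries) {
            int colon = entry.lastIndexOf(':');
            String[] pair = entry.substring(0, colon).split(",");
            String key = pair[0].trim() + "," + pair[1].trim();
            costs.put(key, Double.parseDouble(entry.substring(colon + 1).trim()));
        }
        return costs;
    }

    // Expected cost of predicting `predicted`, given the probability of each
    // actual value. Assumes every actual/predicted combination has an entry.
    static double expectedCost(Map<String, Double> costs,
                               Map<String, Double> actualProbabilities,
                               String predicted) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : actualProbabilities.entrySet()) {
            total += e.getValue() * costs.get(e.getKey() + "," + predicted);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> costs = parse(
                "responds,responds : -150",
                "responds,not-responds : 200",
                "not-responds,responds : 10",
                "not-responds,not-responds : -10");

        // Hypothetical case: 20% likely to respond, 80% likely not to.
        Map<String, Double> probabilities = new HashMap<>();
        probabilities.put("responds", 0.2);
        probabilities.put("not-responds", 0.8);

        System.out.println(expectedCost(costs, probabilities, "responds"));
        System.out.println(expectedCost(costs, probabilities, "not-responds"));
    }
}
```

With these numbers, predicting responds has an expected cost of 0.2 × (-150) + 0.8 × 10 = -22, while predicting not-responds costs 0.2 × 200 + 0.8 × (-10) = 32. So the cost matrix favors the responds prediction even though not-responds is the more probable value.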

TargetPriors

Purpose
This annotation is used to specify the prior values associated with the target attribute values.


Syntax
@TargetPriors ( value="<list of target priors>" )


Semantics

value
Specifies the prior probability associated with each target value. The prior probabilities of all target values must sum to one. Here is an example in which the prior probability of the response value is 0.20 and that of the not-response value is 0.80.
@TargetPriors ( value={ "response : 0.20",
"not-response : 0.80" } )

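A processor would need to check the sum-to-one rule before passing priors on. The sketch below parses entries in the "<target value> : <prior>" form used in the example and rejects sets that do not sum to one. The helper name and the 1e-9 tolerance are my assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TargetPriorsSketch {

    // Parses entries such as "response : 0.20" and checks that the priors
    // sum to one within a small tolerance for floating-point rounding.
    static Map<String, Double> parse(String... entries) {
        Map<String, Double> priors = new LinkedHashMap<>();
        double sum = 0.0;
        for (String entry : entries) {
            int colon = entry.lastIndexOf(':');
            double prior = Double.parseDouble(entry.substring(colon + 1).trim());
            priors.put(entry.substring(0, colon).trim(), prior);
            sum += prior;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("Priors must sum to 1 but sum to " + sum);
        }
        return priors;
    }

    public static void main(String[] args) {
        // prints {response=0.2, not-response=0.8}
        System.out.println(parse("response : 0.20", "not-response : 0.80"));
    }
}
```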

Summary
This part of the article described an approach that uses Java 5 annotations and OJDM 11g to simplify data mining solution development. In the next part I will describe the Data Mining Annotation Processor in more detail and show how to use it to develop more complex data mining solutions with minimal development effort.

References


Java Annotation References:

Oracle Data Mining and Java Data Mining links:







