Clustering and Plotting All Observations in BI Office 6
Tim Andrews , email@example.com
I recently watched John Hormaechea’s webinar where he taught us how to use R functionality within BI Office. I had not found the time to fully explore the BI Office/R integration yet, but sometimes it just takes a well put together tutorial such as John’s to prompt me to start using new functionality. Beyond the core functionality, I also found a couple of use cases to customize the R integration.
BI Office 6 includes three menu options for implementing R routines. These routines can be found in the Analytics ribbon tab: Forecasting, Clustering and Prediction.
The clustering functionality allows an analyst to create custom aggregate members representing data clusters (one member per cluster), in addition to one named set containing all members per cluster. These two outputs from the clustering functionality produce great analytical value. However, it’s not possible to plot every observation on the same chart, and color-code them by cluster. I can speculate that this would be due to a limitation within SSAS/MDX that does not allow new member attribute hierarchies (a “cluster” property) to be dynamically created at runtime. In addition, it’s not possible to customize the names of the clusters.
I decided to extend the default functionality to allow me to plot all members from all clusters on the same chart, while graphically visualizing the cluster for each member. Moreover, I decided to change the names of the clusters to contain more descriptive information.
For this example, I have created a data model based on the well-known Iris data set. As depicted below, this data set contains the petal and sepal dimensions, along with species (not shown). I have also added a row-identifier (the left most column).
Beginning Your Cluster Analysis
Step 1: Modifying the BI Office-Generated Clustering Script
I used the Clustering menu option to extract the BI Office-generated R code, with the following selections:
- Algorithm: KMEANS
- # Clusters: 3
- Items to Group By: Iris/Number (my leaf-level attribute)
- Numbers to Group By: Sum Petal Length, Sum Petal Width, Sum Sepal Length, Sum Sepal Width
This resulted in the following R code:
BI Office Generates R Script
I modified the code to the following:
#Defafult Pyramid code with column selection tweak km <- kmeans(input[,c('Sum_Petal_Length', 'Sum_Petal_Width', 'Sum_Sepal_Length', 'Sum_Sepal_Width')], 3); output <-km$cluster;
#Get the count of members in each cluster counts <- aggregate(output, by=list(output), FUN=length);
#Merge the elements with their counts. Be careful to keep the elements sorted properly output <- data.frame(cluster=output, RowNum = as.integer(rownames(input))) counts <- data.frame(counts); colnames(output) <- c("cluster", "RowNum"); colnames(counts) <- c("cluster", "count"); combined <- merge(output, counts, by="cluster") combined <- combined[order(combined$RowNum),] paste("cluster ", combined$cluster, " (", combined$count, " elements)", sep="");
#Resolve back to output vector output <- paste(combined$cluster, " (", combined$count, " elements)", sep="");
As the default, the R clustering routine reads a data frame called “input” and writes a vector called “output”. I altered the code to change the class of “output” from integer to string. I set it so that each string result contains the cluster number plus the count of elements contained within each cluster, producing results as below:
Results of Modified Code
Step 2: Using the Prediction Menu for Custom Clustering
After trial and error, I learned from Bar Amit, data scientist at Pyramid Analytics, that the Clustering menu functionality does not work with string output. He said that it would be possible to use the Prediction menu in its place.
I used the Prediction menu with the following settings:
- Algorithm: leave default
- What to Predict: Species (could be any unused attribute)
- Items to Predict For: Number (my leaf-level attribute)
- Categories to Predict By: None
- Nubmers to Predict By: Sum Petal Length, Sum Petal Width, Sum Sepal Length, Sum Sepal Width
- R Script: My custom script from above
Step 3: Visualize in a Scatterplot
Using the following settings, I rendered a scatterplot with observations colored by cluster:
- Two measures selected under Columns
- Custom prediction under Rows at top, leaf-level attribute (Number) under Rows at bottom
- Data Points -> Point Grouping selected
My result is as below: a scatterplot with all observations colored by cluster, with each cluster named after the number of observations contained within.
Thank you, Tim, for this great article! If you would like to be featured on the community, message me for more information!