Clustering with Mixed Data Types

by David Novick, Technical Writer, Pyramid Analytics
and Bar Amit, Data Scientist, Pyramid Analytics


Clustering algorithms typically create groups from homogeneous data: either numerical (quantitative) data with continuous values, or categorical (qualitative) data with discrete text values. Mixed-type clustering creates groups that combine both numerical and categorical data.

The EMMD (Expectation Maximization for Mixed Data) algorithm in the Clustering tool supports numerical clustering, categorical clustering and any combination of the two.

Numerical vs. Categorical Clustering

To better understand numerical vs. categorical clustering, let’s consider how a travel agency might group airplanes when evaluating its annual sales figures.

One simple numerical approach would be to create groups based on the maximum number of passenger seats in a plane (for example, 10-99, 100-199, 200-299). Another numerical approach would be to group airplanes by their maximum range in nautical miles (for example, 200-999, 1000-1999, 2000-2999).

The simplest categorical approach would be to group by plane model (for example, B747, B757, B787). A different categorical approach would be to group by plane type (for example, single engine, twin engine, four engine).

A mixed data type approach would create cluster groups by combining several of the quantitative and qualitative features, as shown below.

  • 10-99 seats, 1000-1999 nautical miles, single engine
  • 10-99 seats, 2000-2999 nautical miles, twin engine
  • 300-599 seats, 6000-13999 nautical miles, four engine
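The grouping idea above can be sketched in a few lines of Python. This is only a hand-built illustration of combining binned numeric features with a categorical feature; it is not a clustering algorithm and not how EMMD works. The plane names and figures are hypothetical, and `bin_label`, `SEAT_BINS` and `RANGE_BINS` are names invented for this sketch.

```python
from collections import defaultdict

# Hypothetical aircraft records: (name, seats, range in nautical miles, engine type).
planes = [
    ("Plane A", 12, 1200, "single engine"),
    ("Plane B", 80, 2500, "twin engine"),
    ("Plane C", 90, 2100, "twin engine"),
    ("Plane D", 410, 7600, "four engine"),
]

# Bin edges taken from the example ranges in the text (inclusive on both ends).
SEAT_BINS = [(10, 99), (100, 199), (200, 299), (300, 599)]
RANGE_BINS = [(200, 999), (1000, 1999), (2000, 2999), (6000, 13999)]


def bin_label(value, bins):
    """Return the 'low-high' label of the first bin that contains value."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}-{high}"
    return "other"


# A mixed group key combines two binned numeric features with one category.
groups = defaultdict(list)
for name, seats, range_nm, engine in planes:
    key = (bin_label(seats, SEAT_BINS), bin_label(range_nm, RANGE_BINS), engine)
    groups[key].append(name)

for key, members in groups.items():
    print(key, members)
```

Here the bins are fixed in advance by the analyst. A clustering algorithm differs in that it derives the groups from the data rather than from predefined ranges.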

Example of Mixed Data Clusters for Mexican Food

Promoting the “Burreen Burger”

In this example, Emma is the marketing manager of a food supplier named Mexic-Bar which sells raw foodstuffs to Mexican restaurants. Emma is working on promotion for a new veggie burger called the Burreen Burger and is focusing her promotional activities on Mexican restaurants in the greater San Diego area. Emma will utilize an existing food database to estimate which Mexican restaurants will be most likely to succeed in selling her new Burreen Burger.

Importing Data into BI Office

First, Emma imports the following dataset into BI Office:

https://www.kaggle.com/srcole/burritos-in-san-diego

For more information on uploading data, see the BI Office help and related blogs.

The uploaded data can be used to display a simple map of Mexican restaurants in the San Diego area, as shown below. The blue circles show the location of the restaurants, but do not convey any useful marketing information.

 
Figure 1. Simple map of Mexican restaurants.

Using EMMD to Improve the Map

Emma wishes to use the Clustering wizard to build two cluster groups that show graphically whether each Mexican restaurant tends to host more Meat Lovers or Veggie Lovers. Once she has a map showing these two cluster groups, she intends to post ads for the Burreen Burger, targeting neighborhoods with higher levels of Veggie Lovers. Her ads will be posted strategically on building walls, bus stops and highway billboards.

Emma’s challenge is not to identify vegetarian Mexican restaurants per se, because they are very few in number. Rather, she intends to look at restaurants that serve both meat and vegetarian dishes, and to evaluate the relative popularity of meat meals vs. veggie meals at each one. She assumes that restaurants serving a higher overall percentage of veggie meals will provide the best ROI for her advertising campaign.

For the purposes of her study, Emma considers the "meat grouping" to include beef, pork, chicken, fish, shellfish, etc.

Emma concludes that the most effective statistical method will take into account both numeric and categorical data for the restaurants. Hence, she will need to employ the EMMD algorithm for mixed data.

Choosing the EMMD Algorithm

In the Algorithm dialog, Emma selects the EMMD algorithm and enters 2 as the number of clusters to be created.

Figure 2. Choosing EMMD algorithm and number of clusters.

Items to Group

The Items to Group dialog specifies the overall “framework” items for which the EMMD algorithm will calculate its results. Emma chooses the following parameters:

  • The “BurritodataBar2” dimension contains assorted data on Mexican restaurants in the San Diego area.
  • The “Address” hierarchy is a flat data structure containing the street address of each Mexican restaurant.

Figure 3. Choosing dimension and hierarchy.

Categories to Group by

In the Categories to Group by dialog, Emma makes five selections: Beef, Bacon, Chicken, Lobster, and Pork. These are YES/NO data elements indicating whether a given restaurant includes a specific meat type on its menu. For example, a restaurant that serves beef dishes will have a YES value for the Beef data element.

Emma chooses five meat categories and no vegetarian categories because she wants categories that will act as ideal “splitters” for creating two distinct cluster groups. With clustering algorithms such as EMMD, selecting both meat and vegetarian categories would often produce weaker statistical results than selecting meat categories only.

Emma's strategy is based on the assumption that the fewer meat options a restaurant offers, the more likely it is to be frequented by “Veggie Lovers”.
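As a rough illustration of this assumption, the sketch below converts the five YES/NO elements into 0/1 flags and counts the meat options per restaurant. The rows and addresses are hypothetical and do not come from the dataset, and `meat_flags` is a helper invented for this example; the Clustering wizard performs its own handling of these categories.

```python
# The five categories Emma selects in the Categories to Group by dialog.
MEATS = ["Beef", "Bacon", "Chicken", "Lobster", "Pork"]

# Hypothetical rows for illustration only.
restaurants = [
    {"Address": "Address A", "Beef": "YES", "Bacon": "YES",
     "Chicken": "YES", "Lobster": "NO", "Pork": "YES"},
    {"Address": "Address B", "Beef": "YES", "Bacon": "NO",
     "Chicken": "NO", "Lobster": "NO", "Pork": "NO"},
    {"Address": "Address C", "Beef": "NO", "Bacon": "NO",
     "Chicken": "YES", "Lobster": "NO", "Pork": "NO"},
]


def meat_flags(row):
    """Map each YES/NO element to 1/0; a missing value is treated as NO."""
    return [1 if str(row.get(m, "NO")).strip().upper() == "YES" else 0
            for m in MEATS]


# Fewer meat options suggests (under Emma's assumption) more Veggie Lovers.
meat_counts = {row["Address"]: sum(meat_flags(row)) for row in restaurants}
print(meat_counts)
```

Under Emma's assumption, the restaurants with the lowest counts would be the more promising candidates, although the clustering itself also uses the numeric measures described in the next section.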

Figure 4. Selecting five attribute hierarchies to form the "meat grouping".

Numbers to Group by

At this stage, Emma selects two numeric measures to drive the creation of the cluster groups.

  • Sum Meat Filling – Contains the total weight of meat in a single burrito serving.
  • Sum Meat – Contains the total combined weight of the raw meats purchased each week by each Mexican restaurant.

Figure 5. Choosing the measures that will drive the cluster groups.
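This article does not describe how EMMD works internally. To give a feel for how an expectation-maximization approach can combine numeric and YES/NO inputs, here is a minimal generic sketch: a mixture model in which each cluster has a Gaussian per numeric feature and a Bernoulli probability per YES/NO feature, all assumed independent within a cluster. It is an assumption-laden toy, not Pyramid's implementation. The function `em_mixed`, its parameters, and all data values are hypothetical.

```python
import math


def em_mixed(numeric, binary, k=2, iters=30):
    """Toy EM for mixed data: Gaussian numeric features + Bernoulli flags."""
    n = len(numeric)
    # Deterministic start: hard-assign rows by rank of the first numeric feature.
    order = sorted(range(n), key=lambda i: numeric[i][0])
    resp = [[0.0] * k for _ in range(n)]
    for rank, i in enumerate(order):
        resp[i][rank * k // n] = 1.0

    for _ in range(iters):
        # M-step: re-estimate each cluster's weight, means, variances, flag rates.
        params = []
        for c in range(k):
            nc = sum(r[c] for r in resp) + 1e-9  # guard against empty clusters
            means = [sum(r[c] * x[j] for r, x in zip(resp, numeric)) / nc
                     for j in range(len(numeric[0]))]
            variances = [
                sum(r[c] * (x[j] - means[j]) ** 2
                    for r, x in zip(resp, numeric)) / nc + 1e-6  # variance floor
                for j in range(len(numeric[0]))]
            # Laplace smoothing keeps probabilities strictly between 0 and 1.
            probs = [(sum(r[c] * b[j] for r, b in zip(resp, binary)) + 1.0)
                     / (nc + 2.0) for j in range(len(binary[0]))]
            params.append((nc / n, means, variances, probs))

        # E-step: recompute each row's membership probability per cluster.
        for i in range(n):
            logp = []
            for weight, means, variances, probs in params:
                lp = math.log(weight)
                for x, m, var in zip(numeric[i], means, variances):
                    lp += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for b, p in zip(binary[i], probs):
                    lp += math.log(p) if b else math.log(1.0 - p)
                logp.append(lp)
            top = max(logp)  # log-sum-exp trick for numerical stability
            total = sum(math.exp(v - top) for v in logp)
            resp[i] = [math.exp(v - top) / total for v in logp]

    # Final hard label: the most probable cluster for each row.
    return [max(range(k), key=lambda c: r[c]) for r in resp]


# Hypothetical values: [Sum Meat Filling, Sum Meat] in arbitrary units.
numeric = [[5.0, 120.0], [5.5, 110.0], [6.0, 130.0],
           [1.0, 30.0], [1.5, 25.0], [0.8, 40.0]]
# Hypothetical flags: [Beef, Bacon, Chicken, Lobster, Pork], 1 = YES.
binary = [[1, 1, 1, 0, 1], [1, 1, 1, 1, 1], [1, 0, 1, 0, 1],
          [1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]

labels = em_mixed(numeric, binary, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; what matters is which rows end up together. In this toy data, the three rows with heavy meat usage and many meat options are separated from the three rows with light usage and few options.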

Cluster-Driven Promotion

Emma now runs the Clustering wizard based on her selections in the wizard dialogs. The following map highlights in green those restaurants with relatively high numbers of Veggie Lovers, and highlights in red those restaurants with relatively high numbers of Meat Lovers.

This cluster map will help Emma to decide on the best locations for her Burreen Burger ads, since advertisements close to one or more green restaurants are most likely to draw in the Veggie Lovers.

Figure 6. Clustered map of Mexican restaurants (green=Veggie Lovers, red=Meat Lovers).
