A Test for Identifying Categorical Data

This post describes a method for identifying whether a data set is composed of categorical values. Automatically identifying whether a data set contains categorical values enables applications to make use of such data without requiring users to supply this information.

Categorical data is composed of a limited number of possible values. For example, a table column representing gender would contain the text values male and female. This column contains categorical data because it holds only two values regardless of the number of rows. In contrast, a table column containing the text of a set of tweets does not contain categorical data: although some of the values might be repeated (as in the case of retweets), most of the values will be unique. Since repeated values can occur in non-categorical data, a test is required to differentiate between a data set composed of categorical values and a data set composed of non-categorical values that may contain repeats.

The following algorithm provides a useful test for identifying categorical data:

  1. Calculate the number of unique values in the data set.

  2. Calculate the difference by subtracting the number of unique values from the total number of values in the data set.

  3. Express the difference as a percentage of the total number of values in the data set.

  4. If the percentage is 90% or more, then the data set is composed of categorical values.

When the data set has fewer than about 50 rows, a lower threshold of 70% works well in practice.
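
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `is_categorical` and its parameter names are hypothetical, the cutoff of 50 rows stands in for "around 50", values are assumed to be hashable, and treating an empty data set as non-categorical is an assumption the algorithm itself does not cover.

```python
def is_categorical(values, threshold_percent=90,
                   small_threshold_percent=70, small_size=50):
    """Apply the unique-value test to a sequence of hashable values.

    All names and the empty-input behaviour are illustrative assumptions.
    """
    values = list(values)
    total = len(values)
    if total == 0:
        # Assumption: an empty data set is not classified as categorical.
        return False

    unique = len(set(values))      # step 1: number of unique values
    difference = total - unique    # step 2: total minus unique

    # Use the lower threshold for small data sets (fewer than ~50 rows).
    cutoff = small_threshold_percent if total < small_size else threshold_percent

    # Steps 3 and 4: compare the difference, as a percentage of the total,
    # with the cutoff. Integer arithmetic avoids floating-point rounding.
    return difference * 100 >= cutoff * total


# A gender column with 100 rows: 2 unique values, difference 98, i.e. 98%.
gender = ["male", "female"] * 50
print(is_categorical(gender))

# 100 distinct tweets: 100 unique values, difference 0, i.e. 0%.
tweets = ["tweet number %d" % i for i in range(100)]
print(is_categorical(tweets))
```

Under these assumptions the gender column is classified as categorical and the tweets are not. The thresholds are exposed as parameters so they can be tuned for a particular application.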
