Solving the Count Distinct Problem using HyperLogLog in Tableau

Vaishnavi Kousalyanandan
5 min readJan 27, 2024
Image from Unsplash

In the world of data analysis, especially for large-scale businesses like a Global E-commerce Platform, counting unique elements in massive datasets is a significant challenge. Consider the task at hand for such a platform: they want to better understand their customer base by tracking unique customer visits daily, weekly, and monthly.

The Challenge:

With millions of users and transactions, traditional counting methods like SQL queries struggle to keep up. They face two main issues:

  1. Time Constraints: Running detailed count queries on huge datasets takes a lot of time and in some cases becomes impossible. This slow process can delay important business decisions.
  2. Resource Intensity: Precise counting demands a lot of computational power and memory, which can be expensive and inefficient.

Solution: HyperLogLog (HLL) Algorithm

The HyperLogLog (HLL) algorithm steps in as a game-changer. It is a probabilistic technique used in data analysis to estimate the approximate count of distinct elements in a large dataset.

The HyperLogLog algorithm is able to estimate cardinalities of > 10 billion with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

It works by observing unique patterns within a subset of the data, allowing for a quick and memory-efficient estimation of the overall number of distinct elements, such as unique visitors to a website or different items in a database.

The key advantage of HLL is that it doesn’t need a lot of memory or time to give you a good enough estimate of these unique counts. It’s especially useful when you want a quick answer and don’t need the exact number down to the last digit.

How does HyperLogLog work — An Intuition

Alright, imagine you have a big box of different-colored candies, and you want to know how many unique colors are in there. Now, instead of painstakingly going through each candy, you decide to take a handful of them, maybe 20 or 30, and look at their colors.

Here’s the clever part: you’re not interested in each candy individually; you’re just paying attention to whether you’ve seen a new color or one you’ve already noticed. For example, if you find a candy that’s a different color from what you’ve seen so far, you make a mental note of it.

Now, you repeat this process with another handful of candies, and again, you’re only caring about new colors. You keep doing this until you’ve gone through the entire box.

Once you’re done, you don’t need to remember every single candy; you just remember the unique colors you saw. And here’s the magic: based on the patterns of new colors you observed in those small groups, you can make a pretty good guess about the total number of unique colors in the entire box.

This is essentially how the HyperLogLog (HLL) algorithm works, but instead of candies, it deals with data. It doesn’t go through every single piece of information one by one; it looks at smaller groups, finds unique patterns, and uses those patterns to estimate the total number of unique things in the entire dataset.

So, HLL is like a smart way of quickly guessing how many different things you have in a big dataset without having to check every single piece. It’s a bit like a magic trick for data that helps us get a good enough answer without spending forever counting everything.

For a deep dive into the technical intricacies of HLL, check out this link: https://chengweihu.com/hyperloglog/

Implementing HyperLogLog in Tableau: A Step-by-Step Approach

When integrating HyperLogLog (HLL) for efficient unique count estimations, it’s important to note that I’ll be demonstrating the implementation specifically for Google BigQuery (BQ) and Tableau.

Keep in mind, if you’re using different data warehouses, the syntax might vary slightly. It’s essential to check the specific syntax requirements for HLL in your chosen platform. Here’s how to go about it:

Step 1: Creating the HLL Sketch in BigQuery:

Begin by generating an HLL Sketch column in your backend table. This is done using the HLL_COUNT.INIT() function in BQ. For instance, if you’re counting unique visitors using the ‘visitor_id’ column, utilize HLL_COUNT.INIT(visitor_id), ensuring the ‘visitor_id’ column is omitted in the select statement. The resulting HLL sketch is added as another column, aligning with the selected data levels. In the provided example, the goal is to calculate the unique count of ‘visitor_id’ at the event_date, country, and device type levels.

Data shown is illustrative and not derived from any actual employer data

Step 2: Data source connection to Tableau Dashboard:

For accessing the HLL sketch in Tableau, ensure that your BQ table is directly (live connection) connected to the dashboard rather than using any published data sources from Tableau server.

Step 3: Extracting Unique Values in Tableau:

To pull out unique values from the HLL Sketches in Tableau, you’ll need to use the below calculation in your dashboard. You can later incorporate this calculation into Tableau views, treating it like a regular count distinct field, and the data will be computed accordingly.

Keep in mind that the ‘hll_sketch’ column, while created in the backend, won’t be displayed as a database column in Tableau. However, you can still use this column in your calculations, and it will function correctly.

By following these steps, you’re setting up a robust system for handling and visualizing unique counts in large datasets, ensuring that your data insights are both fast and reliable.

Feel free to share your thoughts or questions or even your own experiences with HLL in the comments below. If you found this content valuable, consider hitting the clap button. Thank you for reading till the end. Until next time, stay connected and keep innovating 🚀

--

--

Vaishnavi Kousalyanandan

Data Engineer @ PayPal | Data Analytics 🔍 | Data Visualization 📈 | Tableau 📊