K-means Clustering and its real use-case in the Security Domain

Rishabh Arya
Jul 18, 2021

What is K-means clustering?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K… Data points are clustered based on feature similarity.

Where is the k-means clustering algorithm used?

The k-means clustering algorithm is used in machine learning when we have to rely on unsupervised learning because the historical data is unlabeled or poorly labeled, so a supervised model cannot be trained on it.

What are the basic steps for K-means clustering?

  • Step 1: Choose the number of clusters k.
  • Step 2: Select k random points from the data as centroids.
  • Step 3: Assign all the points to the closest cluster centroid.
  • Step 4: Re-compute the centroids of newly formed clusters.
  • Step 5: Repeat steps 3 and 4 until the centroids stop changing.
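
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function names (k_means, squared_distance, mean_point), the toy data, the iteration cap, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k data points (at distinct list positions) as initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (made up): two visibly separate groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.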

How does k-means clustering work?

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.

Initially, k so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. In Praat, each centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids ci and cj with i ≠ j, ci ≠ cj). These centroids are used to train a KNN classifier. The resulting classifier is used to classify the data (using k = 1, that is, each point goes to its single nearest centroid) and thereby produce an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of classification and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids are used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.
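
The two alternating operations in that description can be written compactly. The notation here is an assumption added for illustration: x_i denotes a data point, c_j the j-th centroid, and S_j the set of points currently assigned to c_j.

```latex
\begin{align}
S_j &= \{\, x_i : \lVert x_i - c_j \rVert^2 \le \lVert x_i - c_l \rVert^2
       \ \text{for all } l = 1, \dots, k \,\}
       && \text{(classification, nearest centroid)} \\
c_j &= \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i
       && \text{(centroid set to the cluster mean)}
\end{align}
```

The process stops when applying both steps once more leaves every c_j unchanged.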

Use-Cases in the Security Domain

Here is a list of some interesting use cases of K-means in the security domain:

1. Identifying crime localities

When crime data is available for specific localities in a city, clustering on the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

2. Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
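
As a rough sketch of what "proximity to clusters" could look like in code, the snippet below measures how far a new claim is from each cluster centroid. Everything specific in it is an assumption for illustration: the two features, their scaled values, the centroid coordinates, and the cluster names are made up, and in practice an analyst would decide which clusters count as fraud-like.

```python
import math

# Hypothetical centroids from an earlier k-means run over past claims, using
# two made-up, pre-scaled features: (claim amount, days since policy start).
centroids = {
    "typical": (0.2, 0.8),
    "fraud_like": (0.9, 0.1),
}

def nearest_cluster(claim, centroids):
    """Return (cluster name, Euclidean distance) for the closest centroid."""
    return min(
        ((name, math.dist(claim, center)) for name, center in centroids.items()),
        key=lambda pair: pair[1],
    )

# A made-up new claim: high amount, filed soon after the policy started.
new_claim = (0.85, 0.15)
name, distance = nearest_cluster(new_claim, centroids)
print(name, round(distance, 3))
```

A claim that lands close to a fraud-like centroid would then be queued for manual review rather than rejected automatically.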

3. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which give the investigation division information for classifying the types of criminals who were at the crime scene.

4. Call detail record analysis

A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information provides greater insight into the customer's needs when used together with customer demographics. We can cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm, which helps to identify segments of customers according to their usage by hour.
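
Before clustering usage by hour, each customer needs to be turned into a numeric vector. One possible way to do that, shown here as a sketch, is a 24-element profile holding the share of activity in each hour. The record layout (customer id, hour of day) and the sample values are made up for this example; real CDRs carry many more fields.

```python
from collections import defaultdict

# Made-up call detail records: (customer id, hour of day 0-23).
records = [
    ("alice", 9), ("alice", 9), ("alice", 10), ("alice", 22),
    ("bob", 22), ("bob", 23), ("bob", 23), ("bob", 0),
]

def hourly_profiles(records):
    """Map each customer to a 24-element vector of activity shares per hour."""
    counts = defaultdict(lambda: [0] * 24)
    for customer, hour in records:
        counts[customer][hour] += 1
    profiles = {}
    for customer, per_hour in counts.items():
        total = sum(per_hour)
        # Normalize so heavy and light users with the same daily rhythm look alike.
        profiles[customer] = [c / total for c in per_hour]
    return profiles

profiles = hourly_profiles(records)
print(profiles["alice"][9])
```

The resulting vectors can be passed to any k-means implementation, so that daytime-heavy and night-heavy customers end up in different clusters.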

5. Automatic clustering of IT alerts

In large enterprises, IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering this data can provide insight into categories of alerts and mean time to repair, and can help in failure prediction.

6. Rideshare data analysis

The publicly available Uber ride information dataset provides a large amount of valuable data about traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.

7. Crime document classification

Documents can be clustered into multiple categories based on their tags, topics, and content. This is a very standard document-grouping problem, and k-means is a highly suitable algorithm for it. The documents first need some initial processing so that each one is represented as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
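
The step of representing each document as a term-frequency vector might look like the sketch below. The sample documents and the function name term_frequency_vectors are made up for illustration, and the whitespace tokenization is deliberately naive: real text would need proper tokenization and stop-word removal, and TF-IDF weighting is often used in place of raw frequency.

```python
from collections import Counter

# Made-up miniature "documents" standing in for crime reports.
documents = [
    "burglary reported at night burglary suspect fled",
    "phishing email stole bank password",
    "night burglary at warehouse",
]

def term_frequency_vectors(documents):
    """Return (vocabulary, vectors) with one term-frequency vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Relative frequency keeps long and short documents comparable.
        vectors.append([counts[term] / len(tokens) for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = term_frequency_vectors(documents)
print(len(vocabulary), len(vectors))
```

Every vector has one entry per vocabulary term, so all documents live in the same space and can be handed to k-means directly.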

These are only a few use cases, and the list goes on. Whether in the security domain or any other, K-means is an effective and easy way to do clustering in machine learning.

THANK YOU!
