For Log 4, I dived into clustering, and especially into the different models,
or methods, for carrying out this specific part of machine learning and data
processing. There are two main models of clustering: one is partition-based
clustering; the other is hierarchical clustering.
I mainly focused on partition-based clustering algorithms, but I would still
like to explain the insights I got on the difference between these two models.

Hierarchical clustering comes in two forms: agglomerative clustering and divisive
clustering. Agglomerative clustering builds up from the bottom, starting with
individual data points and gradually merging them, while divisive clustering
starts from the whole data set and gradually splits it down to the bottom.
The two work in opposite directions; however, both are hierarchical, and both
produce layers. As for partition-based clustering, the main difference is that
it has no layers, hierarchy, or tree-diagram structure. After pondering for a
while, I would say that they are different ways of showing data. But as
clustering models, they all have areas where they can be applied, for instance,
differentiating human voices from noise to improve quality when making phone
calls, or what we call “noise cancellation.”
Separating noise from human voices is a good example of a clustering model
working on unlabeled data, since you cannot predetermine the noise as a certain
kind of labeled data. Clustering has also been widely used in the medical field
to distinguish diagnosis-related groups, and in marketing to segment customers.

Now I would like to
talk about the partition-based clustering algorithms I spent time
understanding and learning. The first one is K-means clustering. It’s one of
the simplest unsupervised learning algorithms, meaning it works on unlabeled
data. The main idea is to group the unlabeled data so as to minimize the
distance between each data point and the center of the cluster it is assigned
to. Interestingly, we don’t actually know how many clusters there are in the
data we are going to process; as a result, we predetermine the number of
clusters (the “K” in K-means) so that we can group our data efficiently. How
does this work? After determining the number of clusters, we pick an initial
centroid for each one. The centroid is the most crucial part of K-means
clustering because it is the center of the cluster, and each data point uses it
to determine which cluster it belongs to. I found the
wording used by the author here perplexing. He stated, “each learning
data point becomes the reference point for determining the cluster to which it
belongs.” After thinking carefully, I understood it to mean that each data point
uses its own position to determine its cluster: it joins the cluster whose
centroid is closest, and being close means being similar. In other words, the
closer two data points are, the more similar they are in terms of features,
which is how market segmentation can identify people with the same interest in
a certain product. After the data points find their clusters based on the
nearest centroids, each centroid is recalculated as the average of the data
points assigned to it. Then, with the new centroids in place, each data point
again finds its nearest centroid, and we repeat this until the centroids no
longer move. When nothing moves, it signifies that every data point is in the
cluster to which it belongs; it means the training is now completed. My
impression of this is simply WOW! What a sophisticated process it is; it feels
hard to comprehend at first, but after understanding it, I find it a very
logical, useful, and intriguing process! The cleverest part, in my opinion, is
that it repeats the checking process over and over, and when nothing changes
anymore, when everything is fixed, it means the data points are all in their
right places and the training process is finished. I feel it’s genius because
it’s so intuitive!

Another algorithm I learned is DBSCAN, short for Density-Based Spatial
Clustering of Applications with Noise. Just as the name suggests, it groups
data by density. It’s unsupervised learning, just like the K-means algorithm;
however, it treats data points outside the dense areas as noise, and points on
the edge of a dense area as border points. Unlike K-means, which calculates the
distance from every data point to every centroid to decide which cluster each
point belongs to, DBSCAN needs no centroids and no preset number of clusters.
It simply determines the groups from density, that is, from how many data
points sit close together within a given radius.
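
To make the K-means loop from earlier (assign each point to its nearest centroid, recalculate the centroids, repeat until nothing moves) more concrete, here is a minimal sketch in plain Python. It is my own illustration, not the author's code: the function name `k_means`, the random choice of starting centroids, the `max_iters` safety cap, and the toy 2-D points are all assumptions made up for this example.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Group points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when no point changed cluster, i.e. nothing "moves" any more.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the average of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

On these toy points the two obvious groups should end up in separate clusters; which group gets label 0 depends on the random start, so the exact numbers in `labels` are not the important part.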

