Wednesday, October 9, 2024

Log 4

 


For Log 4, I dived into clustering, and especially into the different models, or methods, for carrying out this part of machine learning and data processing. There are two main models of clustering: partition-based clustering and hierarchical clustering. I mainly focused on partition-based clustering algorithms, but I would still like to explain the insights I got on the difference between the two models. Hierarchical clustering is divided into agglomerative clustering and divisive clustering. Agglomerative clustering builds up gradually from the bottom, merging individual data points into larger and larger groups, while divisive clustering starts from the whole data set and gradually splits it down toward the bottom. The two differ in direction; however, both are hierarchical, and both produce layers. Partition-based clustering, by contrast, has no layers, hierarchy, or tree diagram structure. After pondering for a while, I would say that they are different ways of organizing and showing data.
As clustering models, though, they all have areas where they can be applied, for instance, separating human voices from noise to improve quality during phone calls, or what we call “noise cancellation.” Separating noise from human voices is a good example of a clustering model distinguishing unlabeled data, since you cannot predetermine the noise as a certain kind of labeled data. Clustering is also widely used in the medical field to distinguish diagnosis-related groups, and in marketing to segment customers.
Now I would like to talk about the partition-based clustering algorithms I spent time understanding and learning. The first one is K-means clustering. It’s one of the simplest unsupervised learning algorithms, meaning it works on unlabeled data. Its main concept is to group the unlabeled data so that the distance between each data point and the center of its own cluster is as small as possible.
Interestingly, we don’t actually know how many clusters there are in the data we are going to process. As a result, we predetermine the number of clusters so that we can group our data efficiently. How does this work? After determining the number of clusters, we set our initial centroids, one for each cluster. The centroid is the most crucial part of K-means clustering because it is the center of the cluster, and each data point uses the centroids to determine which cluster it belongs to. I found the author’s wording here perplexing. He stated, “each learning data point becomes the reference point for determining the cluster to which it belongs.” After thinking carefully, I understood it to mean that each data point uses its own position to determine its cluster: it joins the cluster whose centroid is nearest, because a short distance means the two are similar. In other words, the closer they are, the more similar they are in terms of features, which is how market segmentation can identify people with the same interest in a certain product. After the data points have found their clusters based on the nearest centroids, each centroid is recalculated as the average of the training data assigned to it. Then, with the new centroids in place, each data point checks again which centroid is nearest and moves to that cluster if needed, and we repeat this until nothing moves. When nothing moves, it signifies that every data point has settled into the cluster to which it belongs; the training is now complete.
My impression of this is simply WOW! What a sophisticated process it is; it feels hard to comprehend at first, but after understanding it, I find it a very logical, useful, and intriguing process! The cleverest part, in my opinion, is that it runs the checking process over and over again, and when nothing changes anymore, the data are all in their right places and the training process is finished. I feel it’s genius because it’s so intuitive!
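To make the assign-then-recalculate loop concrete, here is a minimal sketch in plain Python. It is only an illustration under a few assumptions of mine, not the code from the material I studied: the function names (`assign`, `update`, `k_means`), the toy data, and the choice to use the first k points as the initial centroids are all made up for this example.

```python
import math

def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Recalculate each centroid as the average of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its old centroid
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids

def k_means(points, k, max_iter=100):
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # nothing moved, so training is finished
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

With this toy data the first assignment is wrong for several points (both starting centroids sit in the same group), and it takes a couple of rounds of recalculating before the assignments stop changing, which is exactly the repeated checking described above.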
Another algorithm I learned is DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise. Just as the name suggests, it groups data by density. It’s unsupervised learning, just like the K-means algorithm; yet it treats data points outside the dense areas as noise, and points on the edge of a dense area as border points. The K-means model calculates the distance from every data point to the centroids to decide which cluster each point belongs to. The DBSCAN model has no centroids: it determines the groups from density, that is, from how many neighboring data points lie within a given radius of each point, regardless of the distance to any cluster center.
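The density idea can also be sketched in a few lines of plain Python. Again, this is a simplified illustration under my own assumptions rather than a reference implementation: the names `region_query`, `dbscan`, `eps` (the neighborhood radius), and `min_pts` (the minimum number of neighbors, counting the point itself) are chosen for this example, and the data is made up.

```python
import math

NOISE = -1  # label used for points that belong to no dense area

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not dense enough; may later become a border point
            continue
        cluster += 1  # points[i] is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not dense itself
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding the cluster
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3),
          (5.0, 5.0), (5.1, 5.2), (4.9, 5.1),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.6, min_pts=3)
print(labels)
```

Notice that the number of clusters is never passed in. The two groups emerge from the density alone, and the isolated point is labeled as noise instead of being forced into a cluster, which is the main contrast with K-means.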
