ML: Terminology & K-Nearest Neighbors
Machine learning is a large subject to learn, so before we start diving into it I thought it would be best to understand some key terminology first.
Key Terminology
Gathering data is an important part of ML. Data is usually made up of multiple measurements, for example weight, height, length, color, etc. These measurements are what we call features or attributes. Combining these features together we get a meaningful piece of data, which we refer to as an instance. A feature can be numeric (10 kg, 25 cm), binary (is it black?), or an enumeration (black, white, green). After we have finished collecting data, the next step is to learn how to identify the data. This process is called classification.
There are many ML algorithms that are good at classification, and after we've decided which algorithm to use, the next step is to train it, or allow it to learn. To train an algorithm we need to feed it quality data known as a training set. A training set is like the data we collected before, except that each instance has a label, or target variable, attached to it. The target variable is what we are trying to predict with our algorithm. The machine learns by finding relationships between the features and the target variable. Last but not least we test our algorithm; to do this we feed it a separate set of data called a test set. Classification is used to predict which class an instance of data should fall into; besides this there is another task, used to predict a numeric value, which is called regression. Using labeled data to help a machine learn is called supervised learning, and the opposite of it is unsupervised learning. In unsupervised learning there is no labeled data to feed into the ML algorithm; it learns by grouping similar items together, a process called clustering.
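To make these terms concrete, here is a minimal sketch, assuming Python; the features (weight, height, color) and labels are made up purely for illustration:

```python
# A made-up training set: each instance is a list of features,
# and the target variable holds the label we want to predict.
# Features: [weight (kg), height (cm), is_black (1 = yes, 0 = no)]
training_set = [
    [10.0, 25.0, 1],   # instance 1
    [30.0, 60.0, 0],   # instance 2
    [ 9.5, 24.0, 1],   # instance 3
]
training_labels = ["cat", "dog", "cat"]   # target variable

# A separate test set (with its labels held back) is used later
# to check how well the trained classifier predicts unseen data.
test_set    = [[28.0, 55.0, 0]]
test_labels = ["dog"]
```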
Classifying with K-Nearest Neighbors
K-NN is really easy to grasp and very effective. It works by taking unlabeled data and comparing it to an existing set of labeled examples, the training set, then taking the most similar examples and looking at their labels. We look at the top k most similar neighbors and take a majority vote among them. The majority class becomes the class of the new piece of data we were asked to classify. For example, to classify whether a movie falls into the romance or action genre, we could count the number of kiss and kick scenes in that movie. A movie with a majority of kiss scenes is likely to be a romance movie; likewise, if a movie has a majority of kick scenes it is most likely an action movie. In the example below we don't know what type of movie ? is, but we have a way of figuring that out.
Title | # of kiss scenes | # of kick scenes | Movie type |
---|---|---|---|
California Man | 111 | 5 | Romance |
Beautiful Woman | 75 | 3 | Romance |
Amped II | 2 | 92 | Action |
Kevin Longblade | 10 | 101 | Action |
Robo Slayer 3000 | 5 | 99 | Action |
He’s Not Really into Dudes | 100 | 6 | Romance |
? | 87 | 14 | Unknown |
Distance Measurements
To find the type of the ? movie, we first plot these movies on a two-dimensional plane and calculate the distance from the ? movie to all the other movies using the Euclidean distance, where the distance between two points A and B with two elements each is given by: d = ((A0 - B0) ** 2 + (A1 - B1) ** 2) ** 0.5.
For example, the distance between the points (0, 0) and (1, 2) is ((0 - 1) ** 2 + (0 - 2) ** 2) ** 0.5 = 2.2360. After that we sort the distances in ascending order, pick, let's say, the 3 nearest movies from the top, and take a majority vote on the labels of those neighbors. Here California Man, He's Not Really into Dudes, and Beautiful Woman turn out to be the closest neighbors, so in the end the ? movie is of type Romance.
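Putting the pieces together, here is a minimal kNN sketch in Python using the kiss/kick counts from the table above; the function names (`euclidean_distance`, `classify`) are my own, not from any particular library:

```python
from collections import Counter

# Training set from the table: (kiss scenes, kick scenes) -> movie type
training_set = [
    ((111,   5), "Romance"),  # California Man
    (( 75,   3), "Romance"),  # Beautiful Woman
    ((  2,  92), "Action"),   # Amped II
    (( 10, 101), "Action"),   # Kevin Longblade
    ((  5,  99), "Action"),   # Robo Slayer 3000
    ((100,   6), "Romance"),  # He's Not Really into Dudes
]

def euclidean_distance(a, b):
    """Distance between two 2-element points, as in the formula above."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def classify(point, training_set, k=3):
    """Label `point` by a majority vote among its k nearest neighbors."""
    # Distance from the unknown point to every labeled example
    distances = [(euclidean_distance(point, features), label)
                 for features, label in training_set]
    # Sort ascending and keep only the k nearest neighbors
    nearest = sorted(distances)[:k]
    # Majority vote on the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((87, 14), training_set))  # -> Romance
```

Running this on the ? movie (87 kiss scenes, 14 kick scenes) reproduces the Romance result from the majority vote described above.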
K-NN Characteristics
- Pros: High accuracy, insensitive to outliers, no assumptions about data
- Cons: Computationally expensive, requires a lot of memory
- Works with: Numeric values, nominal values
Approach
- Collect: Any method
- Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
- Analyze: Any method.
- Train: Not needed; kNN does not require an explicit training step, it simply stores the training set.
- Test: Calculate the error rate.
- Use: This application needs to get some input data and output structured numeric values. Next, the application runs the kNN algorithm on this input data and determines which class the input data should belong to. The application then takes some action on the calculated class.
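For the "Test" step above, a rough sketch of an error-rate calculation might look like this, reusing the `classify` function and the movie `training_set` from the earlier kNN example; the test movies and their labels here are invented for illustration:

```python
# Hypothetical test set: (kiss scenes, kick scenes) with known labels held out
test_set    = [(90, 10), (8, 95), (70, 12)]
test_labels = ["Romance", "Action", "Romance"]

# Error rate = fraction of test instances whose predicted class
# differs from the true label
errors = sum(1 for point, label in zip(test_set, test_labels)
             if classify(point, training_set) != label)
error_rate = errors / len(test_set)
print(f"error rate: {error_rate:.2f}")  # 0.00 means every test movie was classified correctly
```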