Course: Linguistic data: quantitative analysis and visualization
School of Linguistics, National Research University Higher School of Economics, MA programs: Computational linguistics, Theory of Language, Spring 2018
Course materials | Source code on Github
License: CC BY-SA 4.0. Please cite the authors of the course, their affiliation and provide the link to the course page.
In the example we have seen before, there were only two variables. However, PCA is usually used when there are three or more variables in the dataset. Let's look at another example with three variables.
This is what some fictitious data might look like:
## x y z
## 1 1.0260509 2.316172 1.5304640
## 2 1.6672554 -1.448155 0.5606579
## 3 -2.5949498 -1.328355 0.5180702
## 4 4.9366130 2.384744 1.5730692
## 5 -0.3543403 -2.030523 -3.6689673
Each data point is described by three values; see columns x, y, and z. The 3D scatterplot looks like the following.
You can rotate the plot left/right and up/down and zoom in or out using your mouse or touchpad. To scroll the page down, place the cursor below the graph.
If you rotate the 3D scatterplot you will see that the points form an elongated cloud. By analogy with the 2D case, it is possible to draw an ellipsoid (a three-dimensional analog of an ellipse) in which almost all points lie.
As the first principal component, we choose the coordinate axis along which the scatter of the points is the largest, that is, the axis directed along the longest axis of the ellipsoid.
So we have introduced the first coordinate, PC1. The two other coordinates, PC2 and PC3, have to be perpendicular to PC1 (see the 2D case above). Their axes lie in the plane that is perpendicular to PC1 and passes through a fixed point, the center of the ellipsoid. To show how to draw the second and third principal components in this plane, let's look at the plane along PC1, so that PC1 points straight at the viewer (please rotate the graph). What we see then is the projection of our points onto this plane.
Now we see a projection similar to the one we have seen in the 2D case. Again, most of the points lie within an elongated ellipse (which is the projection of our ellipsoid). This ellipse has two perpendicular axes, a long one and a short one.
We choose PC2 in such a way that the corresponding axis goes in the direction in which the scatter of the points is the largest. In other words, PC2 is to be directed along the long axis of the projected ellipse.
The third principal component must be perpendicular to both PC1 and PC2 and therefore must coincide with the short axis of the ellipse. The dispersion along PC3 is minimal.
The resulting picture will look like the following.
Let's go back to the original axes x, y, and z, and see what happens.
Three new axes correspond to the three principal components: green is the first (PC1), orange is the second (PC2), and magenta is the third (PC3). If you rotate the picture, you will see that the points lie very close to the plane passing through the PC1 and PC2 axes: their deviation from this plane (which is exactly their PC3 coordinate) is as small as possible.
This means that the first two principal components carry most of the information about the location of the points. If we "forget" the values of PC3, that is, project the cloud on the plane formed by PC1 and PC2, the loss of information will be minimal.
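The construction described above can be tried out numerically. The following is only a minimal sketch, not the course's own code: it uses just the five rows printed earlier, computes the sample covariance matrix, and finds the principal directions by power iteration with deflation (a simple technique chosen here so that the example needs nothing beyond the Python standard library). The function names `covariance_matrix`, `leading_eigenpair`, and `principal_components` are made up for this illustration.

```python
import math

# The five rows printed above (columns x, y, z).
data = [
    [1.0260509, 2.316172, 1.5304640],
    [1.6672554, -1.448155, 0.5606579],
    [-2.5949498, -1.328355, 0.5180702],
    [4.9366130, 2.384744, 1.5730692],
    [-0.3543403, -2.030523, -3.6689673],
]


def covariance_matrix(rows):
    """Return the sample covariance matrix (dividing by n - 1) and the centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def leading_eigenpair(m, iterations=1000):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(m)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to explain
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
    return value, v


def principal_components(rows):
    """Return [(variance, direction), ...] from largest to smallest variance."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    residual = [row[:] for row in cov]
    components = []
    for _ in range(d):
        value, vec = leading_eigenpair(residual)
        components.append((value, vec))
        # Deflation: subtract the variance already explained by this direction,
        # so the next round finds the best direction perpendicular to it.
        for i in range(d):
            for j in range(d):
                residual[i][j] -= value * vec[i] * vec[j]
    return components, centered


components, centered = principal_components(data)
total = sum(value for value, _ in components)
for k, (value, vec) in enumerate(components, start=1):
    direction = [round(x, 3) for x in vec]
    print(f"PC{k}: variance {value:.3f} ({value / total:.1%} of total), direction {direction}")

# "Forgetting" PC3: coordinates of each point in the plane formed by PC1 and PC2.
pc1, pc2 = components[0][1], components[1][1]
scores_2d = [(sum(c * a for c, a in zip(row, pc1)),
              sum(c * a for c, a in zip(row, pc2))) for row in centered]
```

The share of the total variance printed for PC3 is one way to quantify the "loss of information" when only PC1 and PC2 are kept. With only five points the numbers are purely illustrative.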
Let's look at the plane formed by the first two principal components.
That is exactly what PCA plots look like when we plot them using the first two principal components.
PCA is sensitive to the measurement units of the variables and to the dispersion of each variable. Variables with larger dispersion may contribute more to PC1 than others simply because of their scale. For example, one type of score may be coded on a 100-point scale and another on a five-point scale. In this case the values of the first variable will almost certainly vary more than the values of the second, so the contribution of the first variable will look more significant. This is unlikely to be a reasonable result: it is better to bring the scores to the same scale first and then apply PCA.
The picture gets even more complicated if the variables in the dataset are measured in different units (say, seconds, meters, and pounds). Obviously, these units cannot be compared directly. The standard approach in this case is to normalize the data, namely, to divide the values of each variable by its standard deviation. All standard deviations then become equal to one, and there are no "distortions". However, we have to keep in mind that some information (the original differences in dispersion between the variables) is lost in the process.
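To make the effect of normalization concrete, here is a small sketch using made-up scores (they are not part of the course data). The helper `standardize` is a hypothetical name; besides dividing by the standard deviation as described above, it also subtracts the mean, which does not change the variance.

```python
import statistics

# Made-up illustration data (not from the course): the same five students
# graded on a 100-point scale and on a five-point scale.
scores_100 = [55.0, 72.0, 90.0, 64.0, 81.0]
scores_5 = [3.0, 4.0, 5.0, 3.0, 4.0]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


for name, values in [("100-point", scores_100), ("five-point", scores_5)]:
    z = standardize(values)
    print(f"{name}: variance before {statistics.variance(values):.2f}, "
          f"after {statistics.variance(z):.2f}")
```

Before standardization the 100-point variable has a far larger variance and would dominate PC1. Afterwards both variables have variance one, so neither is favored merely because of its scale.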
When we discussed how to draw the axes corresponding to the principal components, we formulated it as "let's draw it along the long axis of the ellipsoid". In practice, however, we usually have no well-shaped cloud (ellipsoid) but rather a set of points located arbitrarily in space. How do we choose the optimal line in this case?
To do this, for each candidate straight line we calculate the sum of squared distances from all points to this line and then choose, among all possible lines, the one for which this sum is minimal. This is similar to the search for a regression line. The difference is that for a regression line we measure the distance along the vertical axis, whereas in PCA we measure the distance from a point to the line along the perpendicular to that line. This means going through a lot of possible straight lines, but computers nowadays are very efficient even when we work with high-dimensional spaces.
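The search described above can be sketched literally in two dimensions. The example below is an illustration under simplifying assumptions: it uses only the x and y columns of the five rows printed earlier, considers only lines through the center of the cloud, and tries directions in steps of 0.1 degree. The names `perpendicular_ss` and `projected_ss` are invented for this sketch.

```python
import math

# The x and y columns of the five rows printed earlier.
points = [
    (1.0260509, 2.316172),
    (1.6672554, -1.448155),
    (-2.5949498, -1.328355),
    (4.9366130, 2.384744),
    (-0.3543403, -2.030523),
]

# Move the origin to the center of the cloud.
n = len(points)
cx = sum(p[0] for p in points) / n
cy = sum(p[1] for p in points) / n
centered = [(x - cx, y - cy) for x, y in points]


def perpendicular_ss(angle):
    """Sum of squared perpendicular distances to the line through the center at this angle."""
    ux, uy = math.cos(angle), math.sin(angle)
    # For a unit direction (ux, uy), the distance from (x, y) to the line is |x*uy - y*ux|.
    return sum((x * uy - y * ux) ** 2 for x, y in centered)


def projected_ss(angle):
    """Sum of squared coordinates of the points along the line (their scatter along it)."""
    ux, uy = math.cos(angle), math.sin(angle)
    return sum((x * ux + y * uy) ** 2 for x, y in centered)


# Brute force: try every direction from 0 to 180 degrees in steps of 0.1 degree.
angles = [math.radians(k / 10) for k in range(1800)]
best = min(angles, key=perpendicular_ss)
widest = max(angles, key=projected_ss)
total_ss = sum(x * x + y * y for x, y in centered)

print(f"closest line:     {math.degrees(best):.1f} degrees")
print(f"largest scatter:  {math.degrees(widest):.1f} degrees")
```

By the Pythagorean theorem, the squared distance of each point from the center splits into a part along the line and a part perpendicular to it. So the line with the smallest sum of squared perpendicular distances is also the line along which the scatter is largest, which is how PC1 was defined earlier.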