Vapnik–Chervonenkis Dimension and Polynomial Classifiers

In this post we will explore a rather unusual family of binary classifiers: generalized polynomial classifiers (GPCs). Our primary goal is to determine analytical bounds on the VC-dimension of this family, giving some indication of its potential expressive power.

Generalized Polynomial Classifiers

A generalized polynomial classifier is a binary classifier of the form

    \[H(x) = \begin{cases}+1 &\text{if } p(x) > 0 \\-1 &\text{if } p(x) \le 0\end{cases}\]


for multi-variate polynomials p(x) of degree \le d. While polynomials are often used for regression tasks, here we investigate their use in classification. For example, the polynomial p(x, y) = x^3 + x^2 - 2y^2 + 2 gives a highly non-linear decision boundary: the zero level set of p carves the plane into curved positive and negative regions.
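
As a quick way to visualize this boundary, here is a minimal sketch using numpy and matplotlib (the plot range and grid resolution are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# The example polynomial p(x, y) = x^3 + x^2 - 2y^2 + 2.
def p(x, y):
    return x**3 + x**2 - 2 * y**2 + 2

# Evaluate p on a grid and draw its zero level set, which is the
# decision boundary of the corresponding classifier H.
xs, ys = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))
plt.contourf(xs, ys, np.sign(p(xs, ys)), levels=[-1, 0, 1])  # +/- regions
plt.contour(xs, ys, p(xs, ys), levels=[0], colors="black")   # boundary p = 0
plt.show()
```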

Depending on the dimension of the feature space and the value of d, these classifiers resemble some more familiar models:

  • When our feature space is \mathbb{R} and d is arbitrary, the polynomial p(x) is a conventional polynomial of a single real variable:

        \[p(x) = a_0 + a_1x + \cdots + a_{d}x^d\]

    In this case, finding p is a polynomial interpolation task, and H partitions the real line into positive and negative regions delineated by the roots of p.
  • When our feature space is \mathbb{R}^2 and d is arbitrary, the polynomial p(x) is a polynomial of two variables:

        \[p(x, y) = a_0 + a_1x + a_2y + a_3x^2 + a_4y^2 + a_5xy + a_6x^3 + a_7x^2y + \cdots\]

    In this case, finding p is a surface-fitting task, and the resulting H partitions the plane into regions delineated by the iso-contours p(x, y) = 0.
  • When our feature space is \mathbb{R}^n and d=2, we recover the familiar quadratic form:

        \[p(x) = A_0 + A_1x + x^\top A_2 x\]

    expressed in terms of conventional matrix-vector products.
  • When our feature space is \mathbb{R}^n and d is arbitrary, we have the most general form

        \[p(x) = A_0 + A_1x + A_2x^2 + \cdots + A_dx^d\]

    where A_k is a tensor of rank k and A_kx^k denotes the k-mode product:

        \[A_kx^k = \sum_{i_1, \cdots, i_k} A_k[i_1, \cdots, i_k]x[i_1] \cdots x[i_k]\]

    Hence, A_0 is a scalar, A_1 a vector, A_2 a matrix, etc.

For example, over \mathbb{R}^3 with d=2, we have the general form:

    \[p(x) = A_0 + \begin{pmatrix}A_1[0] & A_1[1] & A_1[2]\end{pmatrix}\begin{pmatrix}x \\ y \\ z\end{pmatrix} + \begin{pmatrix}x & y & z\end{pmatrix}\begin{pmatrix}        A_2[0, 0] & A_2[0, 1] & A_2[0, 2] \\        A_2[1, 0] & A_2[1, 1] & A_2[1, 2] \\        A_2[2, 0] & A_2[2, 1] & A_2[2, 2] \\    \end{pmatrix}\begin{pmatrix}x \\ y \\ z\end{pmatrix}\]
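
To make the k-mode product concrete, here is a minimal sketch of evaluating p(x) = A_0 + A_1x + \cdots + A_dx^d with numpy's einsum; the coefficient tensors below are random placeholders:

```python
import numpy as np
import string

def k_mode_product(A_k, x):
    """Contract a rank-k coefficient tensor against k copies of x:
    sum over i_1..i_k of A_k[i_1, ..., i_k] * x[i_1] * ... * x[i_k]."""
    A_k = np.asarray(A_k, dtype=float)
    k = A_k.ndim
    if k == 0:
        return float(A_k)                    # A_0 is a scalar
    idx = string.ascii_lowercase[:k]         # e.g. "ab" for a matrix
    return float(np.einsum(idx + "," + ",".join(idx), A_k, *([x] * k)))

def p(x, A_list):
    """Evaluate p(x) = A_0 + A_1 x + A_2 x^2 + ... + A_d x^d."""
    return sum(k_mode_product(A_k, x) for A_k in A_list)

# Example: n = 3, d = 2, with arbitrary (placeholder) coefficient tensors.
rng = np.random.default_rng(0)
n = 3
A_list = [rng.standard_normal(), rng.standard_normal(n), rng.standard_normal((n, n))]
x = np.array([1.0, -2.0, 0.5])
print(p(x, A_list), "->", +1 if p(x, A_list) > 0 else -1)
```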

A GPC can be viewed as a generalization of an SVM with a polynomial kernel. The number of coefficients (and hence the number of potentially trainable parameters) is given by:

    \[\sum_{k=0}^d \binom{n+k-1}{k} = \binom{d+n}{d} = O\left(n^d\right)\]

Note that our chosen representation has some constant redundancy factor, which we can eliminate by stipulating WLOG that each A_k be "upper-triangular" (i.e. supported only on non-decreasing index tuples). This rapid growth in the number of parameters, exponential in the degree d, indicates that the expressive power of a GPC is likely quite high. However, it also suggests that training may be impractical and that overfitting may be a serious issue.
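
For a sense of scale, the coefficient count \binom{d+n}{d} can be tabulated directly (the (n, d) pairs below are arbitrary choices):

```python
from math import comb

def num_params(n, d):
    """Number of coefficients of an n-variate polynomial of degree <= d."""
    return comb(d + n, d)

for n, d in [(2, 3), (3, 5), (10, 4), (100, 3)]:
    print(f"n={n:>3}, d={d}: {num_params(n, d):,} coefficients")
```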

We will use the notation \mathrm{PC}(n, d) to denote the family of polynomial classifiers over \mathbb{R}^n of degree at most d.

Note that since we are using a regression tool for classification, we need to develop a continuous encoding for discrete labels. We use the obvious choice:

    \[(x, +) \mapsto (x, +1)\]

    \[(x, -) \mapsto (x, -1)\]

VC Dimension

VC-dimension, short for Vapnik–Chervonenkis dimension, is a measure of the expressive power of a family of classifiers. It is defined as the size of the largest set of points that can be correctly classified (by some member of the family in question) for every assignment of binary labels. We say that such a point set is “shattered” by the family.

For example, consider a 2-dimensional feature space and the family of linear classifiers (i.e. classifiers whose decision boundaries are lines). This conventional family of classifiers is effectively just \mathrm{PC}(2, 1). For any set of three points in general position (i.e. not collinear) and any assignment of +/- labels, the data is linearly separable. However, no arrangement of four points can be shattered. There are two cases to consider:

  1. All four points are vertices of the convex hull. Assign labels such that only diagonally opposite points share a label (as in Figure 1). The resulting arrangement is not linearly separable.
  2. Only three of the points are vertices of the convex hull. In this case, the fourth point is necessarily contained in the triangle formed by the other three. Assign the three hull points + and the one internal point -. The resulting arrangement is not linearly separable.

This argument shows that the VC dimension of the family of 2D linear classifiers is < 4.

Visual proof that the VC-dimension of 2D linear classifiers is \ge 3 and < 4 (case (2) omitted).
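
The positive half of this argument is easy to check by machine: for three non-collinear points, the 3 \times 3 system with rows (1, x, y) is invertible, so any \pm 1 labelling can be interpolated exactly. Below is a minimal sketch (the three points are an arbitrary choice):

```python
import numpy as np
from itertools import product

# Three non-collinear points in the plane.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = np.hstack([np.ones((3, 1)), pts])   # rows: (1, x, y)

# For every labelling, solve for a linear p(x, y) = a0 + a1 x + a2 y
# with p(x_i) = label_i, and check the induced classifier is correct.
for labels in product([-1.0, 1.0], repeat=3):
    a = np.linalg.solve(X, np.array(labels))
    preds = np.where(X @ a > 0, 1.0, -1.0)
    assert np.array_equal(preds, np.array(labels))
print("all 8 labellings of 3 non-collinear points are linearly separable")
```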

As another point of reference, the VC-dimension of a neural network using the sigmoid non-linearity is known to be polynomial in the size of the network: specifically, it is O(|E|^2 \cdot |V|^2), where E and V are the edges and vertices of the network’s computation graph.

We will use the notation \mathcal{VC}(\cdot) to denote the VC-dimension of a family of classifiers.

Theoretical Analysis of GPCs

First, an easy result on the monotonicity of the VC-dimension.

Theorem 1. Let n_1 \le n_2 and d_1 \le d_2. Then, \mathcal{VC}(\mathrm{PC}(n_2, d_2)) \ge \mathcal{VC}(\mathrm{PC}(n_1, d_1)).
Proof. The result follows immediately from the fact that \mathrm{PC}(n_1, d_1) \subseteq \mathrm{PC}(n_2, d_2): any classifier over \mathbb{R}^{n_1} can be viewed as a classifier over \mathbb{R}^{n_2} that ignores the extra coordinates, and a polynomial of degree \le d_1 also has degree \le d_2.

In the special case of polynomial classifiers over \R, the VC dimension is relatively straightforward to compute.

Theorem 2. \mathcal{VC}(\mathrm{PC}(1, d)) < d+2

Proof. Since a polynomial of degree d can have at most d roots, it partitions the real line into at most d+1 intervals of constant sign. Consequently, a set of d+2 points, sorted in increasing order and given alternating labels:

    \[\{(x_i, (-1)^{i+1})\}_{i=1}^{d+2}, \qquad x_1 < x_2 < \cdots < x_{d+2}\]

cannot be correctly classified by any member of \mathrm{PC}(1, d). Thus, \mathcal{VC}(\mathrm{PC}(1, d)) < d+2.

Theorem 3. \mathcal{VC}(\mathrm{PC}(1, d)) \ge d + 1

Proof. For any set of d+1 distinct points, there exists a polynomial (in fact, exactly one) of degree at most d that interpolates any prescribed values at those points. Thus, any labelled set \{(x_i, \pm 1)\}_{i=1}^{d+1} can be classified exactly, regardless of labelling, by constructing the interpolating polynomial (by Lagrange interpolation, for example). Thus, \mathcal{VC}(\mathrm{PC}(1, d)) \ge d+1.
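
This construction is straightforward to carry out numerically. The sketch below uses numpy's polynomial fitting in place of explicit Lagrange interpolation and checks every labelling of d+1 points (the value of d and the sample points are arbitrary choices):

```python
import numpy as np
from itertools import product

d = 4
xs = np.arange(d + 1, dtype=float)       # any d+1 distinct points will do

# For every labelling, fit the (unique) degree-<=d interpolating polynomial
# and check that sign(p(x_i)) reproduces the labels.
for labels in product([-1.0, 1.0], repeat=d + 1):
    coeffs = np.polynomial.polynomial.polyfit(xs, np.array(labels), deg=d)
    values = np.polynomial.polynomial.polyval(xs, coeffs)
    preds = np.where(values > 0, 1.0, -1.0)
    assert np.array_equal(preds, np.array(labels))
print(f"PC(1, {d}) shatters {d + 1} points")
```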

The prior two results show that \mathcal{VC}(\mathrm{PC}(1, d)) = d+1. We can also derive some explicit values for linear polynomials (i.e. hyperplane decision boundaries).

Theorem 4. \mathcal{VC}(\mathrm{PC}(n, 1)) \ge n+1

Proof. Take n+1 points from \mathbb{R}^n as the vertices of a simplex as follows:

    \begin{equation*}\begin{gathered}     x_0 = (0, 0, 0, \cdots, 0) \\     x_1 = (1, 0, 0, \cdots, 0) \\     x_2 = (0, 1, 0, \cdots, 0) \\     x_3 = (0, 0, 1, \cdots, 0) \\     \vdots \\     x_n = (0, 0, 0, \cdots, 1) \\ \end{gathered}\end{equation*}

We claim that these points can be correctly classified for any label assignment. Let I = \{i_1, i_2, \cdots, i_k\} denote the indices of the points that are assigned a + label. We construct a linear polynomial p such that

    \[p(x_i) = \begin{cases}1 &\text{if } i \in I \\0 &\text{if } i \notin I \\\end{cases}\]

giving us exactly the desired classification. There are two cases:

  1. If 0 \in I, then we take p(x) = 1 - \sum_{j \notin I} x[j]
  2. If 0 \notin I, then we take p(x) = \sum_{i \in I} x[i]

It is easy to see that p(x) gives the correct activation in both cases: p(x_i) = 1 > 0 for i \in I and p(x_i) = 0 \le 0 for i \notin I. Thus, we have exhibited a set of n+1 points in \mathbb{R}^n that is shattered by \mathrm{PC}(n, 1).
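
The construction in the proof translates directly into code. The sketch below verifies it for a small n by enumerating all 2^{n+1} labellings of the simplex vertices (the value of n is an arbitrary choice):

```python
import numpy as np
from itertools import product

n = 4
# Simplex vertices: x_0 = 0 and x_i = e_i for i = 1..n.
points = np.vstack([np.zeros(n), np.eye(n)])

for labels in product([False, True], repeat=n + 1):      # True means '+'
    I = {i for i, pos in enumerate(labels) if pos}
    if 0 in I:
        # Case 1: p(x) = 1 - sum_{j not in I} x[j]
        a0 = 1.0
        a = np.array([0.0 if j + 1 in I else -1.0 for j in range(n)])
    else:
        # Case 2: p(x) = sum_{i in I} x[i]
        a0 = 0.0
        a = np.array([1.0 if i + 1 in I else 0.0 for i in range(n)])
    values = a0 + points @ a
    preds = values > 0                                    # the classifier H
    assert all(preds[i] == (i in I) for i in range(n + 1))
print(f"the {n + 1} simplex vertices in R^{n} are shattered by PC({n}, 1)")
```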

We now present the main result, an asymptotic lower bound for arbitrary dimension and degree. We begin with the special case of n=2 and then show that the approach generalizes.

Theorem 5. \mathcal{VC}(\mathrm{PC}(2, d)) \ge \left( \frac{d}{2} + 1 \right)^2

Proof. Consider a set of N^2 points in the plane arranged as the vertices of an N \times N integer lattice (with N to be determined later); in particular, take \{0, \cdots, N-1\} \times \{0, \cdots, N-1\}. Assign +1/-1 labels to the points arbitrarily. Now, consider each row of the lattice independently, and fit a 1D polynomial p_i(x) of degree N-1 that interpolates the labels of row i. We wish to construct a 2D polynomial out of the p_i's by somehow interpolating between them. To do this, define the following family of “selector” polynomials. Let \phi_i(x) be the 1D polynomial of degree N-1 that interpolates:

    \[(0, 0), (1, 0), \cdots, (i, 1), \cdots, (N-1, 0)\]

Informally, \phi_i(x) “selects” point i while rejecting all others. We can now define our 2D polynomial as follows.

    \[p(x, y) = \sum_{i=0}^{N-1} p_i(x) \cdot \phi_i(y)\]

We can see that \phi_i(y) selects the appropriate row and p_i(x) supplies the corresponding value from that row, thus correctly interpolating the full dataset. Since p_i and \phi_i are both of degree N-1, the resulting p is of degree 2(N-1). We require this quantity to be at most d, so we take N = \frac{d}{2} + 1 (taking d even for simplicity). We have thus demonstrated a way to correctly classify \left(\frac{d}{2} + 1\right)^2 2D points using a 2D polynomial of degree d regardless of labelling. Therefore, \mathcal{VC}(\mathrm{PC}(2, d)) \ge \left( \frac{d}{2} + 1 \right)^2.
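
The construction behind this proof can also be checked numerically for small N. The sketch below builds the row interpolants p_i and the selectors \phi_i with numpy's polynomial fitting and verifies the combined p(x, y) on one random labelling of the grid (the grid size and labelling are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
N = 4                                           # grid is N x N; p has degree 2(N - 1)
grid = np.arange(N, dtype=float)
labels = rng.choice([-1.0, 1.0], size=(N, N))   # labels[i, j] for the point (x=j, y=i)

# Row interpolants p_i: degree N-1 polynomials matching row i's labels.
row_polys = [P.polyfit(grid, labels[i], deg=N - 1) for i in range(N)]

# Selector polynomials phi_i: 1 at y = i, 0 at the other grid values.
selectors = [P.polyfit(grid, (grid == i).astype(float), deg=N - 1) for i in range(N)]

def p(x, y):
    """p(x, y) = sum_i p_i(x) * phi_i(y), a polynomial of degree 2(N - 1)."""
    return sum(P.polyval(x, row_polys[i]) * P.polyval(y, selectors[i]) for i in range(N))

for i in range(N):
    for j in range(N):
        pred = 1.0 if p(float(j), float(i)) > 0 else -1.0
        assert pred == labels[i, j]
print(f"shattered one random labelling of the {N}x{N} grid with degree {2 * (N - 1)}")
```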

Theorem 6. \mathcal{VC}(\mathrm{PC}(n, d)) \ge \Omega\left( \frac{d^n}{2^{n^2}} \right)

Proof. The proof of Theorem 5 generalizes nicely. We will construct an arrangement of points in \R^n that can be shattered. First, consider the largest arrangement of points in \R^{n-1} that can be shattered by a polynomial of degree N-1; that is, an arrangement of \mathcal{VC}(\mathrm{PC}(n-1, N-1)) many points. Take N copies of this arrangement and stack them along the nth dimension (in 3D, this looks like slices stacked vertically). Assign +1/-1 values to the points arbitrarily. For each slice i, take p_i as the (n-1)-D polynomial that correctly classifies the points. We now define our full n-D polynomial:

    \[p(x) = \sum_{i=0}^{N-1} p_i(x[0], x[1], \cdots, x[n-2]) \cdot \phi_i(x[n-1])\]

We can see that \phi_i selects the appropriate slice and p_i supplies the correct sign within that slice, thus correctly classifying the full dataset. Since p_i and \phi_i are both of degree N-1, the resulting p is of degree 2(N-1). We require this quantity to be at most d, so we take N = \frac{d}{2} + 1 as before. We have thus demonstrated a way to correctly classify \mathcal{VC}(\mathrm{PC}(n-1, N-1)) \cdot N many n-dimensional points using an n-variate polynomial of degree d, regardless of labelling. Therefore,

    \begin{align*}         \mathcal{VC}(\mathrm{PC}(n, d)) &\ge \mathcal{VC}(\mathrm{PC}(n-1, N-1)) \cdot N \\         &\ge \mathcal{VC}(\mathrm{PC}(n-1, d/2)) \cdot d/2     \end{align*}

We can solve this recurrence with the base case established by Theorem 5 to obtain the main result:

    \begin{align*}         \mathcal{VC}(\mathrm{PC}(n, d)) &\ge d^n \Big/ 2^{\left(\frac{n(n-1)}{2} + n - 1 \right)} \\         &\ge \Omega\left( \frac{d^n}{2^{n^2}} \right)     \end{align*}

It is exciting to see that the expressive power of polynomial classifiers exhibits (at least) exponential growth in the degree. However, the curse of dimensionality is strong, with a constant factor on the order of 2^{n^2} in the feature dimension n.

The results are summarized in the following table:

    Family               VC-dimension                              Source
    \mathrm{PC}(1, d)    = d + 1                                   Theorems 2 and 3
    \mathrm{PC}(n, 1)    \ge n + 1                                 Theorem 4
    \mathrm{PC}(2, d)    \ge (d/2 + 1)^2                           Theorem 5
    \mathrm{PC}(n, d)    \ge \Omega\left(d^n / 2^{n^2}\right)      Theorem 6

Training GPCs

One immediate way to train a multivariate polynomial classifier is to treat the higher-order terms as the output of a lifting kernel and solve the resulting linear system. Writing \phi(x) for the vector of all monomials of x of degree at most d and a for the corresponding vector of coefficients, the system is

    \[\begin{pmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top \end{pmatrix} a = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}\]

where y_i = \pm 1 are the encoded labels. This matrix is the multivariate analog of the Vandermonde matrix for ordinary polynomial interpolation and is of size N \times \binom{d+n}{d}. Solving this system (in the least-squares sense) via SVD requires at least O\left(N^2n^d\right) operations. If the number of parameters is fixed for a given classifier to-be-trained, this approach is effectively quadratic in the number of training points.
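
As a minimal sketch of this approach (the toy dataset and the use of np.linalg.lstsq, which is SVD-based, are assumptions for illustration): enumerate all monomials of degree at most d, build the generalized Vandermonde matrix, and solve for the coefficients in the least-squares sense.

```python
import numpy as np
from itertools import combinations_with_replacement

def lift(X, d):
    """Map each row of X (shape N x n) to all monomials of degree <= d,
    giving the N x C(d + n, d) generalized Vandermonde matrix."""
    N, n = X.shape
    columns = []
    for k in range(d + 1):
        for idx in combinations_with_replacement(range(n), k):
            col = np.ones(N)
            for i in idx:
                col = col * X[:, i]        # product of the chosen coordinates
            columns.append(col)
    return np.column_stack(columns)

def fit_gpc(X, y, d):
    """Least-squares fit of the polynomial coefficients to +/-1 targets."""
    V = lift(X, d)
    coeffs, *_ = np.linalg.lstsq(V, y, rcond=None)
    return coeffs

def predict(coeffs, X, d):
    return np.where(lift(X, d) @ coeffs > 0, 1, -1)

# Toy example: label points by whether they lie inside the unit circle.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1, 1.0, -1.0)
coeffs = fit_gpc(X, y, d=2)
print("training accuracy:", (predict(coeffs, X, 2) == y).mean())
```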

Another approach is to learn the weights gradually via gradient descent. We can pass the output of the polynomial through a sigmoid function \sigma: (-\infty, \infty) \to [0, 1] and then define a cross-entropy loss. However, this approach may incentivize learning large coefficients, so some regularization terms may be needed in the loss function, or possibly some other kind of non-linear transform on top of the polynomial. Implementing this approach is an avenue for future work.
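
Continuing the previous sketch (it reuses lift(), X, and y from there), here is one possible gradient-descent variant. It minimizes the logistic (cross-entropy) loss with an L2 regularization term on the coefficients; the learning rate, regularization strength, and step count are arbitrary choices:

```python
import numpy as np

def train_gpc_gd(X, y, d, lr=0.1, reg=1e-3, steps=2000):
    """Gradient descent on the logistic loss for a degree-d polynomial classifier.
    y is +/-1; sigma(p(x)) is interpreted as P(label = +1)."""
    V = lift(X, d)                        # reuses lift() from the previous sketch
    w = np.zeros(V.shape[1])
    for _ in range(steps):
        z = V @ w
        # Gradient of mean log(1 + exp(-y * p(x))) plus the L2 penalty.
        grad = -(V.T @ (y / (1.0 + np.exp(y * z)))) / len(y) + reg * w
        w -= lr * grad
    return w

w = train_gpc_gd(X, y, d=2)
print("training accuracy:", (np.where(lift(X, 2) @ w > 0, 1, -1) == y).mean())
```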

Conclusion

In this post we investigated the VC-dimension of the family of generalized polynomial classifiers for binary classification over multi-dimensional feature spaces. Exact values were computed for low dimension and low degree, and asymptotic bounds were derived for the general form. The 2^{n^2} factor, whose exponent grows quadratically in the dimension, indicates that the approach may only be suitable for lower-dimensional feature spaces in which the degree of the polynomial is much larger than the number of features. However, in these settings, polynomial classifiers may prove comparable to neural networks as universal function approximators.

Methods for training GPCs were briefly discussed, but more work is required to see whether they are actually practical to use (my guess is probably not 😛). Cheers!