“Large p small n” describes a scenario where the number of features ($p$) is much greater than the number of observations ($n$) for model training. While it is not a new problem, it continues to pose significant challenges in real-world applications of machine learning, especially for domains lacking rich data or fast and cheap data generation processes. In this blog post, I’ll document my recent thoughts on the “large p small n” problem.

1. Toy Problem Setup

For simplicity and easy illustration, let’s look at a binary classification problem with linearly separable data, where at least one hyperplane can perfectly distinguish between the two classes.

The equation of a hyperplane in a $p$-dimensional space is given by:

\begin{equation} \begin{aligned} w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b = 0. \end{aligned} \end{equation}

Here, $x_1$, $x_2$, …, $x_p$ are the coordinates of a point (called features) in the $p$-dimensional space, and $w_1$, $w_2$, …, $w_p$, and $b$ are special numbers (called weights and bias) that determine the exact position and the normal direction of the hyperplane. Now, if we have a new point ($x'$) and want to know which group it belongs to, we can do two things:

  • Geometrically: We can look at where the point is located in relation to the hyperplane. If it’s on one side, it belongs to one group; if it’s on the other side, it belongs to the other group.

  • Algebraically: We can plug the coordinates of the point into the left-hand side of Eq. 1 and calculate the result. If the result is positive, the point belongs to one group; otherwise, it belongs to the other group.

These weights and bias constitute $p+1$ unknown parameters that govern predictions. Our goal is to find the “true” values of these parameters that define the class assignments in the underlying data generation process, so that we can reliably predict the class of any future unseen point in the $p$-dimensional space. To achieve this, we typically cast the task as an optimization problem and estimate the parameters from the observed samples (e.g., via maximum likelihood). This involves finding the parameter values that minimize a loss function such as binary cross-entropy $\mathcal{L_b}$ or hinge loss $\mathcal{L_h}$:

\begin{equation} \begin{aligned} \mathcal{L_b} (y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)], \end{aligned} \end{equation}

\begin{equation} \begin{aligned} \mathcal{L_h}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i), \end{aligned} \end{equation}

where $y_i$ is the true label of the $i^{th}$ observation and $\hat{y}_i$ is the corresponding prediction. For binary cross-entropy, $\hat{y}_i$ is the predicted probability of the positive class, obtained by applying an activation function $\sigma$, such as the sigmoid (a smoothed step function), to the weighted sum; for hinge loss, the raw weighted sum is typically used directly, with labels encoded as $\pm 1$:

\begin{equation} \begin{aligned} \hat{y}_i = \sigma(w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} + b). \end{aligned} \end{equation}
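
As a concrete reference, here is a minimal NumPy sketch of the two losses above; note that binary cross-entropy assumes labels in $\{0, 1\}$ and probabilities from the sigmoid, while hinge loss assumes labels in $\{-1, +1\}$ and the raw weighted sum (the function names and toy inputs are just for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # y in {0, 1}; y_hat = sigmoid(w·x + b), a probability in (0, 1)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def hinge_loss(y_pm, score):
    # y_pm in {-1, +1}; score is the raw weighted sum w·x + b
    return np.mean(np.maximum(0.0, 1.0 - y_pm * score))

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ≈ 0.164
print(hinge_loss(np.array([1, -1]), np.array([0.8, -1.5])))          # = 0.1
```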

When finding a minimum of a function, in math and physics we often check for points where the gradient is zero and the Hessian matrix is positive definite 1. That means we may want to first solve $\nabla_{w, b} \mathcal{L} = 0$ to find the critical points in the parameter space. However, there appears to be no closed-form solution here due to the nonlinearity involved in deriving $\hat{y}_i$ (Lipovetsky2015, or see discussions in link1 and link2). Consequently, gradient descent-based algorithms are typically employed to find values of $w$ and $b$ that minimize the loss function $\mathcal{L}$.
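
For illustration, here is a minimal sketch of fitting the toy logistic model with plain gradient descent on the binary cross-entropy loss (synthetic data and arbitrary hyperparameters, just one of many possible implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_steps=2000):
    """Minimize binary cross-entropy by gradient descent (no closed-form solution)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_steps):
        y_hat = sigmoid(X @ w + b)          # predicted probabilities
        grad_w = X.T @ (y_hat - y) / n      # gradient of the loss w.r.t. w
        grad_b = np.mean(y_hat - y)         # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))              # n=20 observations, p=100 features
y = (X[:, 0] > 0).astype(float)             # labels depend only on the first feature
w, b = fit_logistic_gd(X, y)
```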

2. Not All Solutions Are Equally Generalizable

While binary cross-entropy and hinge loss are both convex with respect to $\hat{y}$ 2, this convexity is not strict 3 because the second-order derivatives of the losses can be zero. So, for our linearly separable data, there exist infinitely many sets of weight and bias values capable of separating the data points. Moreover, as the feature dimension increases, the volume of the feature space grows exponentially, making it exponentially less likely to obtain an unbiased, well-representative sample for model training 4. As a result, the infinitely many separating hyperplanes are not equally generalizable to unseen data that lie far from the sampled region.
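
To get a rough feel for this, here is a purely illustrative back-of-the-envelope sketch: if each feature were discretized into 10 bins, a fixed budget of samples would cover a fraction of the feature space that shrinks exponentially with $p$.

```python
# Toy illustration of the curse of dimensionality: with 10 bins per feature,
# the grid has 10**p cells, so n samples cover at most n / 10**p of them.
n = 1_000
for p in (1, 2, 5, 10, 20):
    cells = 10 ** p
    print(f"p={p:>2}: {cells:.0e} cells, coverage <= {min(1.0, n / cells):.1e}")
```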

As seen, in large $p$ small $n$ scenarios, even a simple linear model with perfect performance on the training set may not yield parameters that predict labels accurately on unseen test data. When non-linear components are involved, the convex optimization can turn into a non-convex one, which is harder still because the number of local minima can grow with the input dimension $p$. Any corresponding performance degradation on the unseen test set can be a result of overfitting to the limited training set.

3. Tackling the Problem of Large P Small N

To address this problem, we may often need to leverage additional information to guide feature engineering, model design, model training, and so on.

3.1 Feature selection

Feature selection is probably the simplest approach to the large $p$ small $n$ problem, as it aims to cut the number of features so that $p$ and $n$ become more comparable.

To know which features to keep, we typically rely on assumptions or intuition about which features are more useful for the prediction problem at hand. For example, we may assume that features with a high association with the target variable are more predictive than those with a low association. We may also (iteratively) try different feature sets and see which combination leads to the most favorable performance on the validation sets.
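
As a minimal sketch of association-based feature selection (using scikit-learn’s univariate ANOVA F-test on synthetic data; the dataset and the choice of $k$ are arbitrary assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))                              # n=40 observations, p=1000 features
y = (X[:, 0] + 0.1 * rng.normal(size=40) > 0).astype(int)    # label mostly driven by feature 0

# Keep the 20 features with the strongest univariate association with y.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)                    # shape (40, 20)
```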

While feature selection is simple and intuitive, the dropped features may actually be critical. In long-tailed problems, the dropped features may be precisely the ones that exhibit high variance in the training set. In this case, a model trained on a limited training set with a reduced feature set may not generalize well to the unseen test set.

3.2 Regularization

Regularization is another common approach to improving robustness and generalization in the large $p$ small $n$ setting. The most common way to regularize model training is to add a penalty term to the loss function that penalizes model complexity, thereby discouraging overfitting.

3.2.1 $L_p$ regularization

To penalize non-zero values of the model parameters, $L_p$ regularization adds the $L_p$ norm of the model’s parameters 5, raised to the power of $p$, to the loss function as below (note that $p$ here denotes the order of the norm, overloading the symbol used earlier for the feature dimension):

\begin{equation} \begin{aligned} \mathcal{L_{\text{total}}} = \mathcal{L_{\text{original}}} + \lambda_p \mathcal{L_{p}}, \end{aligned} \end{equation}

\begin{equation} \begin{aligned} \mathcal{L_p} = ||w||^p_p = \sum_{j=1}^{p} |w_j|^p, \end{aligned} \end{equation}

where $\lambda_p$ is a hyperparameter controlling the regularization strength and $p$ is the order of the norm, which typically takes a non-negative value 6.

Different values of $p$ have different effects on model training (Goodfellow, et al 2016). For example, adding $L_2$ regularization shrinks the weights toward 0 and, from a Bayesian viewpoint, is equivalent to multiplying the likelihood by a Gaussian prior in maximum a posteriori (MAP) estimation. $L_1$ regularization is equivalent to applying a Laplacian prior in MAP estimation and leads to sparser solutions than $L_2$ 7. A more extreme form is $L_0$ regularization, which adds the number of non-zero weights to the loss and forces the model to zero out more feature weights than $L_1$ does 8.

While $L_0$ and $L_1$ regularization can sometimes serve as gradient-based feature selection methods, they may introduce optimization difficulties: the $L_0$ norm is not convex, and the $L_1$ norm is not differentiable at the origin. Therefore, $L_2$ regularization tends to be more commonly used in practice.

Moreover, since the Hessian of $\mathcal{L_2}$ is positive definite 9, the loss landscape becomes strictly convex as long as the original loss $\mathcal{L_{\text{original}}}$ is convex in the parameter space. This property of $L_2$ regularization ensures a single globally optimal solution for simple but widely adopted linear models such as logistic regression. In our earlier toy problem, applying $L_2$ regularization means the optimization process will find a unique hyperplane.
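
As a small sanity check of the sparsity claims above, one could compare $L_1$- and $L_2$-regularized logistic regression on a synthetic large $p$ small $n$ dataset (a minimal scikit-learn sketch; the data and regularization strength are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))                   # n=30 observations, p=200 features
y = (X[:, 0] - X[:, 1] > 0).astype(int)          # only two features actually matter

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print((np.abs(l2.coef_) > 1e-6).sum())           # most weights small but non-zero
print((np.abs(l1.coef_) > 1e-6).sum())           # many weights exactly zero (sparser)
```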

3.2.2 Other regularization techniques

In addition to $L_p$ regularization, other forms of regularization allow us to incorporate domain knowledge into model training. For example, when there is a spurious correlation in the training data, a naively trained model may develop unwanted behaviors, such as a correlation between the model outputs and bias attributes that are spuriously correlated with the target variables (and features). Such a correlation indicates that the model relies on the spurious relationship to make predictions, leading to poor generalization on test data that lacks it. To discourage this, we can add a penalty term $\mathcal{L_{\text{corr}}}$ to the loss function that penalizes such correlations:

\begin{equation} \begin{aligned} \mathcal{L_{\text{corr}}} = \frac{1}{M} \sum_{i=1}^{M} |(\hat{y}_i -\overline{\hat{y}}) \cdot (a_i- \overline{a})|, \end{aligned} \end{equation}

where $|(\hat{y}_i -\overline{\hat{y}}) \cdot (a_i- \overline{a})|$ quantifies the magnitude of the correlation between the model output $\hat{y}_i$ and the bias attribute $a_i$ for the $i^{th}$ sample among $M$ selected samples in the training data. Beyond this simple example, we can also devise more complex regularization schemes, as is done in generative adversarial networks (GAN) (Goodfellow, et al 2014) and contrastive learning (Oord, et al 2018).
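
A minimal PyTorch sketch of such a decorrelation penalty (the function name and usage pattern are illustrative, not a standard API) could look like:

```python
import torch

def correlation_penalty(y_hat: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Mean absolute product of the centered predictions and the centered bias
    attribute over a batch of M samples, mirroring the penalty term above."""
    y_centered = y_hat - y_hat.mean()
    a_centered = a - a.mean()
    return (y_centered * a_centered).abs().mean()

# total_loss = original_loss + lambda_corr * correlation_penalty(y_hat, a)
```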

We may also apply controls on the training dynamics to regularize the model. For example, we can stop training early when the validation loss is no longer improving even though the training loss is still decreasing; this is called early stopping, and it prevents the model from overfitting the training data. We can also apply dropout, which randomly turns off some neurons during training to force the model to learn more robust features (Srivastava2014). Data augmentation, which adds synthetic training examples to effectively increase $n$, provides another form of regularization. We may also introduce additional relevant tasks to regularize the model, using different portions of the model for different tasks; the shared components will likely better capture the key structure in the data, making the resulting model more robust.
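
For concreteness, here is a rough PyTorch sketch combining dropout with early stopping; `train_one_epoch`, `evaluate`, and the data loaders are hypothetical placeholders, and the architecture and patience are arbitrary:

```python
import copy
import torch.nn as nn

# A small MLP whose Dropout layer randomly zeroes activations during training.
model = nn.Sequential(
    nn.Linear(1000, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

best_val, best_state, patience, wait = float("inf"), None, 5, 0
for epoch in range(100):
    train_one_epoch(model, train_loader)     # hypothetical training helper
    val_loss = evaluate(model, val_loader)   # hypothetical validation helper
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        wait += 1
        if wait >= patience:                 # early stopping: no improvement for 5 epochs
            break
model.load_state_dict(best_state)            # roll back to the best checkpoint
```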

3.3 Feature engineering

Feature engineering is another way to incorporate domain knowledge into the model training process. This involves using expert knowledge to transform the raw features into a lower-dimensional space of more relevant features. Dimensionality reduction techniques can also be used for this purpose. While feature engineering is widely used in conventional machine learning, its effectiveness depends on the reliability and completeness of the expert knowledge.
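
For example, a generic dimensionality reduction step with PCA might look like the following minimal scikit-learn sketch (synthetic data; the number of components is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))       # n=50 observations, p=500 raw features

pca = PCA(n_components=10)           # keep the 10 directions of largest variance
X_low = pca.fit_transform(X)         # shape (50, 10), used as engineered features
```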

3.4 Architectural Biases

A sister method to feature engineering is to design the model with appropriate architectural biases: the structure of the model itself can impose potentially useful biases. For example, convolutional neural networks (CNNs) can learn to extract local features from an image with hundreds to hundreds of thousands of pixels using weight-sharing convolution filters. These filters are applied across the input features and produce translation-equivariant feature maps 10, which is much more efficient and effective than using MLPs or hand-crafted feature engineering. The translation-equivariance property allows the model to recognize the same object in different locations in the image. Without such a bias, the model may require a lot more training data to learn this.
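
To illustrate the parameter efficiency of weight sharing, compare the parameter count of a small convolution with that of a dense layer producing the same number of output activations (a rough, illustrative comparison with arbitrary sizes):

```python
import torch.nn as nn

# A 3x3 convolution shared across all spatial locations of a 3-channel image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())   # 3*3*3*16 + 16 = 448

# A fully connected layer mapping a flattened 32x32 RGB image to the same
# number of output activations would need vastly more weights.
in_features, out_features = 3 * 32 * 32, 16 * 32 * 32
dense_params = in_features * out_features + out_features  # ~50 million

print(conv_params, dense_params)
```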

However, an architectural bias can also become the bottleneck that limits the model’s generalization capability. For example, a CNN may be biased toward leveraging local patterns and fail to correctly capture distant global structure the way a vision transformer can (Dosovitskiy2020). Moreover, due to its lack of rotation equivariance, a CNN may fail to recognize rotated objects. In that case, we may need data augmentation via rotation or a model design that is rotation-equivariant.
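
A rotation augmentation can be expressed, for instance, with torchvision transforms (a minimal sketch; the rotation range is an arbitrary choice):

```python
from torchvision import transforms

# Randomly rotate each training image by up to ±30 degrees before tensor conversion,
# exposing the model to rotated variants it would not otherwise see.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.ToTensor(),
])
```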

3.5 Pre-training and Foundation Models

In recent years, foundation models pre-trained on large-scale datasets have become increasingly popular for tackling the large $p$ small $n$ problem. These models are trained on broad datasets using self-supervised learning objectives, which allows them to learn general-purpose feature representations that can be fine-tuned for specific tasks with limited data. For example, a model pre-trained on ImageNet can be fine-tuned on a small dataset of medical images to achieve good performance on disease classification. Foundation models in natural language processing, such as BERT (Devlin2018) and GPT (Brown2020), have also shown impressive performance on a wide range of tasks with limited fine-tuning data. These models are trained on massive text corpora and learn to capture complex linguistic patterns and world knowledge. By leveraging the knowledge learned during pre-training, foundation models can achieve good performance on downstream tasks with few or no labeled examples (i.e., few-shot or zero-shot).

As seen, these pre-trained models provide not just reasonable architectural biases but also better weight initializations, which helps them integrate large-$p$ features more effectively than models trained from scratch on the limited small-$n$ training data. Pre-training can also significantly reduce training time and generalization error.
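
A typical fine-tuning recipe might look like this minimal torchvision sketch (assumes torchvision ≥ 0.13 for the `weights` argument; the two-class head mirrors the hypothetical medical-imaging example above):

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained weights so the small dataset only trains the new head.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 2-class downstream task.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
```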

Summary

All these methods, including feature selection, feature engineering, regularization, model design, and pre-training, are based on our prior understanding of the data generation process and the model training. They are sometimes called inductive biases, which refer to the assumptions, preferences, or prior knowledge that a learning algorithm incorporates to generalize beyond the training data and make predictions on unseen data. These biases guide the learning process and help the model prioritize certain solutions or hypotheses over others. Whether the leveraged inductive biases are sound largely determines whether the model generalizes to unseen data.

For the large $p$ small $n$ problem, it is often believed that the most relevant information for modeling lies on a low-dimensional manifold (Oord, et al 2018). Some methods use fewer parameters to learn this latent space but rely on stronger inductive biases, while others use more parameters with weaker inductive biases. In large $p$ small $n$ cases, models with more parameters are typically more prone to overfitting, requiring more data to estimate the optimal parameters.

Since the ImageNet moment, the combination of big data and deep learning with suitable inductive biases has led to numerous successes in various applications. For domains with limited data, successful examples like AlphaFold2 highlight the importance of inductive biases (Jumper2021). Developing appropriate inductive biases for large $p$ small $n$ problems remains an active area of research.

Although collecting more features or modalities of data may offer unique advantages, it is crucial to recognize that expanding the feature set can be costly and increase the risk of data biases and overfitting. Furthermore, acquiring additional samples for training and evaluation can be a challenging task sometimes. This issue is prevalent across various predictive modeling domains, including healthcare, business, and science. Consequently, investing resources to augment feature sets necessitates careful strategic planning and consideration to ensure that the benefits outweigh the potential drawbacks. Yep, what a large $p$ for small $n$ ;)

In summary, the large $p$ small $n$ problem presents unique challenges that require a balance between model complexity, inductive biases, and data availability. Addressing these challenges is crucial for developing effective predictive models in data-limited domains.

Citation

If you find this post helpful and are interested in referencing it in your write-up, you can cite it as

Xiao, Jiajie. (April 2024). What a large p for small n. JX’s log. Available at: https://jiajiexiao.github.io/posts/2024-04-29_large_p_small_n/.

or add the following to your BibTeX file.

@article{xiao2024largepsmalln,
  title   = "What a large p for small n",
  author  = "Xiao, Jiajie",
  journal = "JX's log",
  year    = "2024",
  month   = "April",
  url     = "https://jiajiexiao.github.io/posts/2024-04-29_large_p_small_n/"
}

References

  • Lipovetsky, S. (2015). Analytical closed-form solution for binary logit regression by categorical predictors. Journal of applied statistics, 42(1), 37-49.
  • https://stackoverflow.com/questions/37997253/can-we-use-normal-equation-for-logistic-regression
  • https://sebastianraschka.com/faq/docs/logistic-analytical.html#is-there-an-analytical-solution-to-logistic-regression-similar-t
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press. Chapter 7.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Theodoridis, S., & Koutroumbas, K. (2006). Pattern recognition. Elsevier. Chapter 5, Section 5.9.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
  • Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.

  1. To check whether the Hessian matrix is positive definite, we can examine whether all of its eigenvalues are positive. In the 1D case, we can simply use the second derivative test: if the second derivative is positive, the function is locally convex; if it is negative, the function is locally concave; if it is zero, the test is inconclusive. It’s worth noting that, at a critical point, the function’s derivative can also be undefined (such as $x=0$ for $f(x) = |x|$). ↩︎

  2. The convexity of the loss function in the parameter space depends on the choice of loss and activation function. If the loss composed with the activation is convex as a function of the weighted sum (as binary cross-entropy composed with the sigmoid is), then the loss is convex in the parameter space, since the weighted sum is an affine (and hence convexity-preserving) operation. On the other hand, for a typical neural network with at least two linear layers and some form of nonlinear activation function after each layer, the loss landscape in the parameter space is not convex, since the Hessian matrix $H$ is not positive semidefinite (i.e., not all of its eigenvalues are non-negative). The composition of the loss function with the non-linear activation functions and the complex architecture of neural networks introduces non-convexity. Training a neural network thereby involves navigating a non-convex optimization landscape, which can have multiple local optima and saddle points (where the Hessian is indefinite, i.e., its eigenvalues have both positive and negative values). ↩︎

  3. Strict convexity guarantees that any minimum is unique. The different types of convexity (for a twice-differentiable $f$) can be roughly distinguished as follows:

    • $f$ is convex if and only if $f''(x) \geq 0$ for all $x$
    • $f$ is strictly convex if $f''(x) > 0$ for all $x$ (this is sufficient but not necessary for strict convexity; when the function is not differentiable, the derivative is replaced by a sub-gradient at the non-smooth point $x$)
    • $f$ is strongly convex if and only if $f''(x) \geq m > 0$ for all $x$ (strong convexity can provide faster convergence and tighter error bounds to the minimum compared to a merely strictly convex function)
     ↩︎
  4. This is an example of the curse of dimensionality in sampling. For machine learning, a typical rule of thumb is to have at least 5 training examples per dimension, or 10 times the VC dimension, in the representation space (Theodoridis2006). More fundamentally, the number of training examples required to sample the feature space at a fixed density grows exponentially with the number of dimensions. ↩︎

  5. The calculation of $L_p$ regularization usually excludes the bias terms and the moving-average statistics in batch normalization, since penalizing them contributes little to controlling overfitting. However, for convenience, many implementations, such as the widely adopted library PyTorch, count all parameters updated by backpropagated gradients in the optimizer’s default weight decay calculation. Weight decay is a closely related form of regularization: for plain SGD (without momentum) it is equivalent to $L_2$ regularization, whereas AdamW applies a decoupled weight decay that is not simply an $L_2$ penalty added to the loss. ↩︎

  6. For $p=0$, the $L_0$ “norm” (not a true norm) simply counts the non-zero elements, using the convention $0^0\equiv0$. ↩︎

  7. A Laplacian prior has higher probability density near zero than a Gaussian prior, thus promoting sparsity. ↩︎

  8. $L_0$ regularization directly penalizes the number of non-zero weights, while $L_1$ regularization indirectly promotes sparsity by shrinking weights more aggressively than $L_2$. ↩︎

  9. The Hessian for $\mathcal{L_2}$ is a diagonal matrix with 2 as the diagonal elements. So the eigenvalues of the Hessian are equal to 2, which are always positive. Q.E.D. ↩︎

  10. Translational equivariance means that the model can recognize the same object in different locations in the image. In math, a function or transformation is equivariant if it maintains a certain relationship between its input and output when both are transformed. In other words, if $f$ is equivariant with respect to transformations $T$ and $S$, then $f(T(x))=S(f(x))$ for all $x$. A stronger version of equivariance is invariance. A function or transformation is invariant if its output remains unchanged when its input is transformed in a certain way. For example, if a function $f(x)$ is invariant under translation, it means that $f(x+a)=f(x)$ for any value of $a$. For CNN, convolutional filters are translation-equivariant while max-pooling is translation-invariant. ↩︎