Intro

In this post, we’ll explore a statistical problem: estimating population means and their uncertainties from hierarchically structured data, or so-called grouped measurements. While averaging measurements may seem straightforward, the presence of natural groupings in data introduces important statistical considerations that require careful treatment.

To illustrate the practical significance of this problem, let’s examine how hierarchically structured measurements - where individual observations are naturally clustered into groups - arise across diverse real-world applications. Multiple-Instance Learning (MIL) represents an important machine learning paradigm specifically designed for analyzing such grouped or clustered data structures. The following scenarios showcase a few situations where measurements naturally organize into clusters of varying sizes:

  1. Product Defect Rates Across Factories: A manufacturer tracks product quality across multiple factories. Each factory tests a different number of units daily, with each unit providing a pass/fail measurement. These measurements are naturally grouped by factory, with potential batch effects from different equipment or quality control processes.

  2. Survey Response Across Regions: A nationwide survey collects responses from different geographical regions. Each region contributes a different number of responses based on population size, creating natural groups where responses are clustered by region. Regional cultural or demographic factors may introduce biases in how people respond.

  3. Disease Prevalence Across Clinics: Medical researchers study disease occurrence patterns across multiple clinics. Each clinic reports data from their patient population, with varying sample sizes due to clinic size and local demographics.

  4. Digital Pathology: Pathologists analyze tissue samples by examining multiple regions within each sample. Each region provides specific measurements about cell characteristics or tissue features. These measurements naturally group together under the same patient sample, with the number of regions varying based on the tissue sample size.

  5. Liquid Biopsy Analysis: When analyzing blood samples from cancer patients, researchers examine millions of DNA molecules, including both normal cell-free DNA and tumor-derived DNA. Each molecule provides a distinct measurement, and these measurements are naturally grouped by patient sample, with molecule counts varying between samples.

These examples share a common pattern: the data consists of groups (referred to as samples), where each group contains a varying number of individual measurements (referred to as instances). This dual-level variation in the hierarchical structure presents an interesting statistical problem as seen in the following section. Our focus will be on developing robust methods for analyzing data where measurements naturally cluster into groups of different sizes, considering both the variations that occur within groups and those that emerge between different groups.

Problem Statement

Given a dataset consisting of $N$ samples $S = \{S_1, \dots, S_N\}$ with hierarchical structure:

  • Sample Level: Each sample $S_i$ represents a distinct group or cluster
  • Instance Level: Within each sample $S_i$, there are $n_i$ individual measurements $\{x_{i,1}, \dots, x_{i,n_i}\}$
  • Measurement Properties:
    • Each instance measurement $x_{i,j} ∈ ℝ$ follows some distribution $D_i$ specific to sample $i$
    • Key statistical properties include:
      • Population parameters:
        • Mean: $μ_i = \mathbb{E}_j[x_{i,j}]$
        • Variance: $\sigma_i^2 = \mathrm{Var}_j[x_{i,j}]$
      • Sample statistics:
        • Mean: $\overline{x}_i = \frac{1}{n_i}\sum_j x_{i,j}$
        • Variance: $s_i^2 = \frac{1}{n_i-1}\sum_j (x_{i,j}-\overline{x}_i)^2$
  1. How do we estimate a reliable overall mean $μ = \mathbb{E}[x_{i,j}]$ across all samples $S_i$ and instances $x_{i,j}$?
  2. How can we properly quantify the uncertainty (e.g. standard error $σ_{\overline{μ}}$) in this estimate?

This problem requires attention to how the data is generated and structured, particularly when dealing with varying group sizes and potential batch effects. We need to derive an unbiased formula for the mean that accounts for samples of different sizes, while providing uncertainty estimates (e.g. standard error, confidence intervals) that consider both within-sample and between-sample variability.
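
Before comparing estimators, it helps to have a concrete toy dataset. The following sketch (Python/NumPy; the group count, size range, and variance parameters are purely illustrative choices, not part of the problem statement) simulates grouped measurements with both within-sample and between-sample variation:

```python
import numpy as np

rng = np.random.default_rng(42)

N = 20                              # number of samples (groups); illustrative
mu, tau = 5.0, 1.0                  # assumed population mean and between-sample std
sigma = 2.0                         # assumed within-sample std

# Each sample has its own true mean mu_i and its own number of instances n_i
n_i = rng.integers(5, 200, size=N)                  # varying group sizes
mu_i = rng.normal(mu, tau, size=N)                  # sample-level true means
samples = [rng.normal(mu_i[k], sigma, size=n_i[k]) for k in range(N)]

print([len(s) for s in samples])                    # unequal group sizes
```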

Complete Pooling: Analyzing Identically Distributed Measurements as a Single Dataset

Consider the ideal scenario where all measurements are independently and identically distributed (i.i.d.) from a single underlying distribution. In this case, we can disregard group structure and estimate population parameters directly through simple aggregation.

For finite samples, we can estimate the population mean μ optimally through the pooled sample mean:

\begin{equation} \overline{x} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{n_i} x_{i,j}}{ \sum_{i=1}^{N}n_i } \end{equation}

This estimator corresponds to both the arithmetic mean of all observations and the maximum likelihood estimate under i.i.d. assumptions.

The standard error of this estimate can be derived from the Central Limit Theorem as:

\begin{equation} \mathrm{SE} = \frac{s}{\sqrt{\sum_{i=1}^{N} n_i}} = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{n_i} (x_{i,j} - \overline{x})^2} {\left( \sum_{i=1}^{N} n_i \right) \left( \sum_{i=1}^{N} n_i - 1 \right) }} \end{equation}

where $s$ represents the pooled standard deviation. Here we follow the usual Student-t style substitution, using the sample standard deviation $s$ in place of the unknown true standard deviation $\sigma$.
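
As a minimal illustration, the pooled mean and its standard error can be computed as follows (a sketch assuming the measurements are stored as a list of 1-D NumPy arrays, e.g. the `samples` list from the earlier simulation):

```python
import numpy as np

def complete_pooling(samples):
    """Micro-average: pool every instance, ignoring group structure."""
    x = np.concatenate(samples)               # all measurements as one i.i.d. pool
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(x.size)      # s / sqrt(total number of instances)
    return mean, se

pooled_mean, pooled_se = complete_pooling(samples)
```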

This approach, known as micro-averaging or sample size-weighted averaging, achieves statistical optimality under the i.i.d. assumption. When measurements are truly i.i.d., the group structure becomes irrelevant since each individual measurement contributes equally valid information about the underlying distribution. However, this method’s effectiveness critically depends on the i.i.d. assumption holding true.

In practice, the i.i.d. assumption often fails due to several interrelated factors. For example, sampling biases frequently emerge within individual groups, while systematic differences manifest between different groups. The data collection process itself can introduce batch effects that create artificial patterns or variations. Additionally, natural heterogeneity across different populations can lead to inherent variations that violate the assumption of identical distributions.

When the i.i.d. assumption doesn’t hold, complete pooling can produce misleading results. Large samples with systematic biases will dominate both the mean estimate and the uncertainty calculation, potentially masking important patterns in smaller samples. For example, if a large sample systematically oversamples a particular subpopulation or measurement range, this bias will disproportionately influence the overall estimates.

To address these limitations, researchers typically employ more sophisticated approaches that explicitly model the clustered nature of the data rather than treating all measurements as independent observations from a single distribution.

Macro-Averaging: Equal Weighting of Samples

The macro-averaging approach provides an alternative method for estimating the overall mean by first calculating individual sample means and then averaging them equally across all samples. This two-step process ensures each sample contributes equally to the final estimate, regardless of its size:

\begin{equation} \overline{x}_{macro} = \frac{1}{N} \sum_{i=1}^{N} \overline{x}_i, \quad \text{where} \quad \overline{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} {x_{i,j}} \end{equation}

This approach enables independent computation of sample-level means $\overline{x}_i$, ensuring that each sample’s statistics are calculated without interference from other samples, which preserves the integrity of individual cluster information. The approach implements a democratic weighting scheme, where each sample contributes exactly $1/N$ to the final estimate, regardless of its size or variance characteristics. The method prevents larger clusters from overwhelming smaller ones, maintains accuracy even in the presence of strong within-cluster correlations, and handles substantial variations in sample sizes with stability.

A basic approximation of the standard error uses the between-sample variance:

\begin{equation} \mathrm{SE}_{\text{macro}} = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} (\overline{x}_i - \overline{x}_{\text{macro}})^2} \end{equation}
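
A sketch of the macro-average and this between-sample standard error, under the same list-of-arrays representation assumed earlier:

```python
import numpy as np

def macro_average(samples):
    """Equal-weight each sample mean; SE from between-sample variance only."""
    xbar_i = np.array([s.mean() for s in samples])
    macro_mean = xbar_i.mean()
    se_macro = xbar_i.std(ddof=1) / np.sqrt(len(xbar_i))
    return macro_mean, se_macro
```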

However, this simple approximation neglects both within-sample variability and the effect of varying cluster sizes. If within-sample variation is minimal but cluster sizes differ substantially, we can derive a more comprehensive variance decomposition using the law of total variance:

\begin{equation} \mathrm{Var}(\overline{X}) = \underbrace{\mathbb{E}[N]\mathrm{Var}(X)}_{\text{between-sample}} + \underbrace{\mathrm{Var}(N)(\mathbb{E}[X])^2}_{\text{cluster-size effect}} \end{equation}

This leads to a more accurate standard error estimation:

\begin{equation} \mathrm{SE}_{\text{total}} = \sqrt{\frac{\mathbb{E}[N]s^2 + s_N^2(\overline{x}_{\text{macro}})^2}{N}} \end{equation}

where:

  • $\mathbb{E}[N] = \overline{n} = \frac{1}{N}\sum_{i=1}^N n_i$ is the average cluster size
  • $s^2 = \frac{1}{N-1}\sum_{i=1}^N (\overline{x}_i - \overline{x}_{\text{macro}})^2$ is the between-sample variance of the sample means
  • $s_N^2 = \frac{1}{N-1}\sum_{i=1}^N (n_i - \overline{n})^2$ captures the variance in cluster sizes
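
Under the same assumed data layout as before, the $\mathrm{SE}_{\text{total}}$ computation might look like this sketch:

```python
import numpy as np

def macro_se_total(samples):
    """SE of the macro-average including the cluster-size effect term."""
    xbar_i = np.array([s.mean() for s in samples])
    n_i = np.array([len(s) for s in samples])
    N = len(samples)
    macro_mean = xbar_i.mean()
    n_bar = n_i.mean()                # average cluster size, E[N]
    s2 = xbar_i.var(ddof=1)           # between-sample variance of the sample means
    s2_N = n_i.var(ddof=1)            # variance of the cluster sizes
    return np.sqrt((n_bar * s2 + s2_N * macro_mean**2) / N)
```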

While this approach provides a more comprehensive treatment of cluster-size effects and helps mitigate potential biases from grouped measurements, it becomes increasingly complex once within-sample variability must also be incorporated. This complexity motivates a more elegant analytical framework, which we’ll explore in the following section on meta-analytic approaches.

Meta-Analytic Approaches: A Framework for Combining Multiple Samples

Meta-analysis, first introduced by Gene Glass in 1976 for the “analysis of analyses” (Glass1976), has become an important methodology in evidence synthesis across many fields. Originally developed to combine results from multiple clinical trials, it has evolved into a sophisticated statistical framework used in diverse areas from social sciences to engineering. The method arose from the need to make sense of sometimes conflicting results across multiple studies and to increase statistical power by pooling data systematically.

In our context of multiple-instance learning, meta-analysis provides powerful frameworks for combining estimates from multiple samples while accounting for different variance components. While meta-analysis can handle various types of estimates (like correlations or odds ratios), we’ll focus specifically on combining mean estimates across samples, which is our key quantity of interest from each sample.

When combining mean estimates from multiple samples, two critical types of variance need consideration:

  • Within-sample variance: Captures the uncertainty in individual sample mean estimates, expressed through standard errors that depend on both the sample size and the spread of measurements within each sample
  • Between-sample variance: Reflects true heterogeneity in means across different samples, which could arise from batch effects or systematic differences between samples

As we will see in detail later, meta-analytic approaches are particularly valuable for analyzing hierarchical data structures because they:

  • Combine sample means while appropriately weighting by their precision (inverse variance)
  • Model both within-sample and between-sample variability simultaneously
  • Account for varying sample sizes and potential dependencies within groups
  • Provide frameworks for testing whether samples are homogeneous or heterogeneous

Fixed-Effects Model (FE): Combining Estimates Under Homogeneity

The fixed-effects model assumes:

  1. All samples share a single true mean μ
  2. Observed differences between sample means arise solely from random sampling variation within samples (quantified by $s_i^2/n_i$); there is no systematic between-sample heterogeneity

This is equivalent to assuming perfect transportability[^1] across samples - any apparent differences in their estimates would disappear given infinite within-sample measurements ($\lim_{n_i \to \infty} \overline{x}_i = \mu, \forall i $).

Under this model, we employ an inverse-variance weighted estimator that combines estimates by assigning greater weight to more precise measurements[^2]:

\begin{equation} \overline{x}_{\text{FE}} = \frac{\sum_{i=1}^{N} w_i \overline{x}_i}{\sum_{i=1}^{N} w_i}, \quad w_i = \frac{1}{\mathrm{Var}(\overline{x}_i)} = \frac{n_i}{s_i^2} \end{equation}

Where:

  • Variance of the sample mean (standard error squared): $\mathrm{Var}(\overline{x}_i) = \mathrm{SE}^2(\overline{x}_i) = {s_i^2}/{n_i}$
  • Sample variance of individual instances within sample $i$: $s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(x_{i,j}-\overline{x}_i)^2$

The uncertainty estimate on $\overline{x}_{\text{FE}}$ can be measured by standard error: \begin{equation} \mathrm{SE}_{\text{FE}} = \frac{1} {\sqrt{\sum_{i=1}^{N} w_i}} = \sqrt{\frac{1} {\sum_{i=1}^{N} \frac{n_i} {s_i^2} }} \end{equation}
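
A sketch of the fixed-effects estimator and its standard error (assuming every sample has at least two instances so that $s_i^2$ is defined):

```python
import numpy as np

def fixed_effects(samples):
    """Inverse-variance (fixed-effects) pooling of sample means."""
    xbar_i = np.array([s.mean() for s in samples])
    var_i = np.array([s.var(ddof=1) / len(s) for s in samples])   # s_i^2 / n_i
    w = 1.0 / var_i                                               # inverse-variance weights
    mean_fe = np.sum(w * xbar_i) / np.sum(w)
    se_fe = np.sqrt(1.0 / np.sum(w))
    return mean_fe, se_fe
```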

Random-Effects Model (RE): Accounting for Both Within-Sample and Between-Sample Heterogeneity

While the fixed-effects model efficiently combines estimates when samples are homogeneous, its fundamental assumption of a single true underlying mean (perfect transportability) often proves unrealistic in practice. Different samples frequently exhibit systematically different means due to various factors:

  • Data collection procedures and protocols
  • Population characteristics and selection biases
  • Environmental conditions and temporal variations
  • Batch effects and instrumental drift
  • Laboratory-specific practices
  • Geographic variations
  • Other unobserved confounding factors

To account for this between-sample heterogeneity, DerSimonian and Laird (1986) proposed the random-effects model (DerSimonian1986). This approach extends the fixed-effects framework by introducing an additional variance component $\tau^2$ that captures true differences between sample means. The model takes a hierarchical form:

\begin{equation} \overline{x}_i \sim \mathcal{N}(\mu_i, \sigma_i^2/n_i) \end{equation}

\begin{equation} \mu_i \sim \mathcal{N}(\mu, \tau^2) \end{equation}

This hierarchical structure acknowledges two distinct sources of variation:

  1. Within-sample variation ($\sigma_i^2/n_i$): Captures measurement uncertainty
  2. Between-sample variation ($\tau^2$): Models true heterogeneity between samples

The DerSimonian-Laird estimator combines sample means using modified weights $w^*$ that incorporate both variance components:

\begin{equation} \overline{x}_{\text{RE}} = \frac{\sum w_i^* \overline{x}_i}{\sum w_i^*}, \quad w_i^* = \frac{1}{s_i^2/n_i + \tau^2} \end{equation}

As in the FE model, the within-sample variance of each sample mean is estimated by $s_i^2/n_i = \frac{1}{n_i(n_i-1)}\sum_{j=1}^{n_i}(x_{i,j}-\overline{x}_i)^2$.

The between-sample variance term $\tau^2$ is estimated using the DerSimonian-Laird method:

\begin{equation} \tau^2 = \max\left(0, \frac{Q - (N-1)}{\sum_{i=1}^N w_i - \sum_{i=1}^N w_i^2/\sum_{i=1}^N w_i}\right) \end{equation}

where $w_i$ is the fixed-effects weight defined above and Cochran’s $Q$ statistic quantifies heterogeneity:

\begin{equation} Q = \sum_{i=1}^N w_i(\overline{x}_i - \overline{x}_{\text{FE}})^2 \end{equation}

Under the null hypothesis of homogeneity, $Q$ approximately follows a chi-square distribution with $N-1$ degrees of freedom, providing a formal test for between-sample heterogeneity. Large $Q$ values suggest significant heterogeneity and support using the random-effects model over the fixed-effects model.

Uncertainty for $\overline{x}_{\text{RE}}$ can be measured by standard error with the modified weights $w^*$. The standard variance estimator is:

\begin{equation} \mathrm{SE}_{\text{RE,naive}} = \frac{1} {\sqrt{\sum_{i=1}^{N} w_i^*}} = \sqrt{\frac{1} {\sum_{i=1}^{N} \frac{1}{s_i^2/n_i + \tau^2} }} \end{equation}
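
Putting the pieces together, a sketch of the DerSimonian-Laird procedure (Cochran’s $Q$, $\tau^2$, the pooled mean, and the naive standard error), under the same assumptions as the fixed-effects sketch:

```python
import numpy as np

def dersimonian_laird(samples):
    """DerSimonian-Laird random effects: Q, tau^2, pooled mean, naive SE."""
    xbar_i = np.array([s.mean() for s in samples])
    var_i = np.array([s.var(ddof=1) / len(s) for s in samples])   # s_i^2 / n_i
    N = len(samples)
    w = 1.0 / var_i                                               # fixed-effects weights
    mean_fe = np.sum(w * xbar_i) / np.sum(w)
    Q = np.sum(w * (xbar_i - mean_fe) ** 2)                       # Cochran's Q
    tau2 = max(0.0, (Q - (N - 1)) / (w.sum() - (w**2).sum() / w.sum()))
    w_star = 1.0 / (var_i + tau2)                                 # random-effects weights
    mean_re = np.sum(w_star * xbar_i) / np.sum(w_star)
    se_naive = np.sqrt(1.0 / np.sum(w_star))
    return mean_re, se_naive, tau2, Q, w_star
```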

However, this naive variance estimate tends to understate uncertainty when:

  1. the number of samples $N$ is small, which limits the precision of the heterogeneity estimate $\tau^2$
  2. substantial heterogeneity exists between samples, making the simple pooled variance inadequate
  3. sample sizes or weights are unbalanced across groups

These are common challenges in meta-analysis. For example, when data is sparse, both the mean and variance estimators can be inaccurate due to limited sampling, leading to biased estimates, inflated standard errors, or even ill-defined statistics. One way to address this is to regularize the estimates with prior distributions. For example, when the quantity of interest is a probability or rate estimated from counts, we can place a conjugate Beta prior on it; for more than two classes, the Beta generalizes to the Dirichlet distribution:

\begin{equation} x \sim \mathrm{Beta}(a,b) \quad \Leftrightarrow \quad x \sim \mathrm{Dirichlet}(\alpha), \quad \alpha = (a,b) \end{equation}

For a Dirichlet distribution with parameters $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$, the expectation and variance of each class probability $p_i$ are:

\begin{equation} \mathbb{E}[p_i] = \frac{\alpha_i}{\alpha_0} \end{equation}

\begin{equation} \mathrm{Var}[p_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)} \end{equation}

where $\alpha_0 = \sum_{j=1}^K \alpha_j$ is the concentration parameter. After observing counts $n_1, \dots, n_K$ for each class, conjugacy gives the posterior parameters $\boldsymbol{\alpha}' = (\alpha_1 + n_1, \alpha_2 + n_2, \dots, \alpha_K + n_K)$, and the same formulas evaluated at $\boldsymbol{\alpha}'$ yield the posterior mean and variance. The prior encodes our beliefs about the true parameter values, regularizing and smoothing the estimator and stabilizing estimates for individual samples with limited data. We can then apply the meta-analysis techniques above to combine these regularized sample-level estimates into robust population-level inferences.
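
A small sketch of this conjugate update for a two-class (Beta) example; the prior pseudo-counts and observed counts are illustrative:

```python
import numpy as np

def dirichlet_posterior(counts, alpha_prior):
    """Posterior mean and variance of class probabilities under a Dirichlet prior."""
    alpha_post = np.asarray(alpha_prior, float) + np.asarray(counts, float)
    a0 = alpha_post.sum()                         # posterior concentration parameter
    mean = alpha_post / a0
    var = alpha_post * (a0 - alpha_post) / (a0**2 * (a0 + 1))
    return mean, var

# Example: 3 "positive" and 2 "negative" observations with a uniform Beta(1, 1) prior
mean, var = dirichlet_posterior(counts=[3, 2], alpha_prior=[1.0, 1.0])
```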

In addition, there are other approaches. For example, the Hartung-Knapp-Sidik-Jonkman (HKSJ) method (see original references 4-6, 11, and 12 in this article) provides a more robust variance estimator that better accounts for uncertainty in the estimation of between-sample heterogeneity:

\begin{equation} \mathrm{SE}_{\text{RE,HKSJ}} = \sqrt{\frac{\sum_{i=1}^N w_i^{*}(\overline{x}_i - \overline{x}_{\text{RE}})^2}{(N-1)\sum_{i=1}^N w_i^{*}}} \end{equation}
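
A sketch of this estimator, reusing the sample means, random-effects weights, and pooled mean as computed in the DerSimonian-Laird sketch above:

```python
import numpy as np

def hksj_se(xbar_i, w_star, mean_re):
    """Hartung-Knapp-Sidik-Jonkman standard error for the random-effects mean."""
    N = len(xbar_i)
    q = np.sum(w_star * (xbar_i - mean_re) ** 2)   # weighted squared deviations
    return np.sqrt(q / ((N - 1) * np.sum(w_star)))
```

In the meta-analysis literature this standard error is typically paired with a $t$ distribution with $N-1$ degrees of freedom when forming confidence intervals.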

Overall, the random-effects model provides a more comprehensive characterization than the fixed-effects model. When between-sample heterogeneity is significant, it captures both within-sample and between-sample variability; when it is not, the random-effects model reduces to the fixed-effects model as $\tau^2 \rightarrow 0$.

Bayesian Hierarchical Modeling: A Probabilistic Framework

While the frequentist approaches discussed above provide practical solutions for combining estimates across samples, they rely on asymptotic approximations[^3] and point estimates that may not fully capture parameter uncertainty, especially with small sample sizes or complex hierarchical structures. A more comprehensive approach is offered by Bayesian hierarchical modeling, which naturally handles multi-level data structures by explicitly modeling the full data generation process. This Bayesian inference framework treats all unknown parameters as random variables with associated probability distributions, which enables us to:

  • Directly model the data generation process
  • Incorporate prior knowledge about parameters
  • Obtain full posterior distributions for uncertainty quantification
  • Account for different sources of variation simultaneously

The Hierarchical Structure

The beauty of Bayesian hierarchical modeling lies in how it breaks down complex data into interconnected, understandable layers - like a recipe where each step builds upon the previous ones. The Bayesian hierarchical model provides a natural framework for describing how information flows from raw measurements through connected levels to arrive at population-level insights. By examining these relationships between adjacent layers, we can better understand the complete data generation process:

  1. Population Level: An overall mean μ represents the true population-wide average we want to estimate
  2. Sample Level: Each sample i has its own true mean μᵢ that varies around μ
  3. Instance Level: Individual measurements within each sample vary around their sample-specific mean μᵢ

This naturally maps to an example hierarchical model below:

  • Instance Level (Likelihood):

\begin{equation} \overline{x}_i \sim \mathcal{N}(\mu_i, \sigma_i^2/n_i) \end{equation}

This level models how individual sample means $\overline{x}_i$ vary around their true sample-specific means $\mu_i$. The Central Limit Theorem justifies using a normal distribution when sample sizes are sufficiently large. Here, $\sigma_i^2$ represents the within-sample variance for sample i, capturing the spread of individual measurements within that sample. The term $n_i$ is the number of instances in sample i - dividing by $n_i$ reflects how the variance of a sample mean decreases with larger sample sizes, a key result from sampling theory. This variance structure $\sigma_i^2/n_i$ directly incorporates both the inherent variability of measurements ($\sigma_i^2$) and the precision gained from larger samples ($n_i$).

  • Sample Level (Prior): \begin{equation} \mu_i \sim \mathcal{N}(\mu, \tau^2) \end{equation}

This level captures how true sample means $\mu_i$ vary around the population mean $\mu$. The normal distribution assumption reflects our belief that sample-level effects are symmetric and unbounded. The parameter $\tau^2$ represents the between-sample variance, quantifying how much the true sample means tend to differ from each other. A larger $\tau^2$ indicates greater heterogeneity between samples, while a smaller $\tau^2$ suggests more homogeneous samples. When $\tau^2$ approaches zero, the model effectively reduces to a fixed-effects model where all samples share nearly identical true means.

  • Population Level (Hyperpriors):

\begin{equation} \mu \sim \mathcal{N}(\mu_0, \sigma_0) \end{equation}

This level specifies our prior beliefs about the overall population mean $\mu$. The hyperparameters $\mu_0$ and $\sigma_0$ should be chosen based on domain knowledge or standardization of the data. For illustration, we might use $\mu_0 = 0$ and $\sigma_0 = 10$ as a weakly informative prior that can accommodate a broad range of possible values. In practice, these values should be adjusted based on the scale of your measurements and any prior knowledge about plausible parameter ranges. The choice of these parameters also influences the degree of regularization in the model - smaller values of $\sigma_0$ lead to stronger regularization while larger values allow for more flexibility in the estimates. For example, if working with standardized data (mean 0, variance 1), $\mu_0 = 0, \sigma_0 = 1$ might be more appropriate. For proportion data bounded between 0 and 1, using $\mu_0 = 0.5, \sigma_0 = 0.25$ could better reflect the constrained parameter space.

  • Between-Sample Variation:

\begin{equation} \tau \sim \mathrm{HalfCauchy}(\mu_\tau, \sigma_\tau) \end{equation}

This parameter models the scale of variation between different samples. The HalfCauchy prior is chosen for its heavy tail that allows for potentially large between-sample variation while still maintaining some regularization near zero. The location parameter $\mu_\tau$ is typically set to 0 to ensure non-negative scale parameters, while the scale parameter $\sigma_\tau$ controls the spread of the distribution. Common choices include:

  • $\sigma_\tau = 2$ for moderate regularization (recommended by Gelman)
  • $\sigma_\tau = 1$ for stronger regularization when sample sizes are small
  • $\sigma_\tau = 5$ for weaker regularization when strong heterogeneity is expected

The choice of $\sigma_\tau$ should be guided by the scale of your data and prior knowledge about between-sample variation.

  • Within-Sample Variation: \begin{equation} \sigma_i \sim \mathrm{Exponential}(\lambda) \end{equation}

This parameter captures the scale of variation within each sample. The Exponential prior with rate parameter $\lambda$ encourages smaller values while still allowing for larger variations when supported by the data. The choice of $\lambda$ depends on the scale of your measurements:

  • $\lambda = 1$ is common for standardized data (variance ≈ 1) such as z-scores or normalized features
  • $\lambda = 0.1$ for data with larger variance (≈ 10) such as raw pixel intensities in medical imaging
  • $\lambda = 10$ for small-scale measurements (variance ≈ 0.1) such as proportions or probabilities near 0 or 1

The rate parameter $\lambda$ effectively sets the prior expectation of the standard deviation as $\mathbb{E}[\sigma_i] = 1/\lambda$. This hierarchical structure automatically implements “partial pooling” - estimates for individual samples are shrunk toward the overall mean, with the degree of shrinkage determined by the relative magnitudes of within-sample and between-sample variation. This helps prevent extreme estimates from samples with limited data while still preserving meaningful differences between samples.
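
To make this specification concrete, here is a hedged sketch of the hierarchy in PyMC (a v5-style API is assumed; the data arrays, prior scales, and variable names are illustrative). The inference step itself is shown at the end of this section:

```python
import numpy as np
import pymc as pm

# Per-sample summary statistics, e.g. from the simulated `samples` list above
xbar = np.array([s.mean() for s in samples])     # observed sample means
n = np.array([len(s) for s in samples])          # instances per sample
N = len(xbar)

with pm.Model() as hierarchical_model:
    # Population level (hyperpriors)
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)     # weakly informative
    tau = pm.HalfCauchy("tau", beta=2.0)         # between-sample scale

    # Sample level
    mu_i = pm.Normal("mu_i", mu=mu, sigma=tau, shape=N)
    sigma_i = pm.Exponential("sigma_i", lam=1.0, shape=N)

    # Instance level: sample means observed with standard error sigma_i / sqrt(n_i).
    # With only one observed mean per sample, sigma_i is informed mostly by its prior;
    # modeling the raw instances (or plugging in s_i) would identify it better.
    pm.Normal("xbar_obs", mu=mu_i, sigma=sigma_i / np.sqrt(n), observed=xbar)
```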

After specifying the model structure through likelihood, prior, and hyperprior distributions, we can estimate the joint posterior distribution of all parameters conditional on the observed data:

\begin{equation} p(\mu, \tau, \{\mu_i\}, \{\sigma_i\} \mid \{\overline{x}_i\}, \{n_i\}) \propto p(\{\overline{x}_i\} \mid \{\mu_i\},\{\sigma_i\},\{n_i\}) \cdot p(\{\mu_i\} \mid \mu,\tau) \cdot p(\mu) \cdot p(\tau) \cdot p(\{\sigma_i\}) \end{equation}

where each term corresponds to components defined in equations earlier. This posterior can be estimated using either:

  1. Markov Chain Monte Carlo (MCMC): Provides exact posterior sampling through various algorithms:

    • Metropolis-Hastings: Uses proposal distributions to explore the parameter space:

      \begin{equation} \alpha = \min\left(1, \frac{p(\theta'|\text{data})\,q(\theta|\theta')}{p(\theta|\text{data})\,q(\theta'|\theta)}\right) \end{equation}

      where $q(\theta'|\theta)$ is the proposal distribution and $\alpha$ is the acceptance probability; a toy random-walk sketch of this rule appears after this list

    • Gibbs Sampling: Samples each parameter conditional on others:

      \begin{equation} \theta_i^{(t+1)} \sim p(\theta_i|\theta_{-i}^{(t)}, \text{data}) \end{equation}

    • Hamiltonian Monte Carlo (HMC): Leverages gradient information for efficient exploration:

      \begin{equation} \theta^{(t+1)} \sim K(\theta^{(t)}, \cdot) \end{equation}

      where $K$ is a transition kernel based on Hamiltonian dynamics that leaves the posterior $p(\theta|\text{data})$ invariant

  2. Variational Bayes (VB): Approximates the posterior by optimizing a simpler distribution $q_\phi(\theta)$ to minimize KL-divergence:

    \begin{equation} \phi^* = \arg\min_\phi \text{KL}(q_\phi(\theta) || p(\theta|\text{data})) \end{equation}

    VB is computationally more efficient than MCMC, making it suitable for large datasets, though typically less accurate in estimating the true posterior.
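
To illustrate the Metropolis-Hastings acceptance rule referenced above, here is a toy random-walk sampler written from scratch; the target log-posterior, scales, and data are purely illustrative assumptions:

```python
import numpy as np

def metropolis(log_post, theta0, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis: accept a proposal with probability min(1, ratio).

    With a symmetric proposal q, the q terms in the acceptance ratio cancel,
    leaving alpha = min(1, p(theta'|data) / p(theta|data)).
    """
    rng = np.random.default_rng(seed)
    theta, chain = theta0, np.empty(n_steps)
    for t in range(n_steps):
        proposal = theta + rng.normal(0.0, step)          # symmetric random-walk proposal
        log_alpha = log_post(proposal) - log_post(theta)  # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:
            theta = proposal                              # accept; otherwise keep current theta
        chain[t] = theta
    return chain

# Toy target: posterior of mu given a few sample means with unit scales (illustrative)
xbar_toy = np.array([4.8, 5.3, 5.1, 4.6])
log_post = lambda m: -0.5 * np.sum((xbar_toy - m) ** 2) - 0.5 * (m / 10.0) ** 2
chain = metropolis(log_post, theta0=0.0)
```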

Modern probabilistic programming frameworks like Stan, PyMC, and TensorFlow Probability make these Bayesian inference methods accessible and computationally tractable. Although Bayesian methods are often computationally more demanding than the frequentist approaches described earlier, they offer several key advantages: (1) more accurate uncertainty quantification through full posterior distributions, (2) natural incorporation of prior knowledge and domain expertise, (3) better handling of small sample sizes through partial pooling, and (4) flexibility to extend models with additional structure or covariates.
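
Continuing the PyMC model sketch above (ArviZ is assumed for summaries; exact return types may vary across PyMC versions), posterior inference with either MCMC or variational Bayes might look roughly like this:

```python
import arviz as az
import pymc as pm

with hierarchical_model:
    # MCMC: PyMC's default NUTS sampler (a variant of HMC)
    idata = pm.sample(2000, tune=1000, target_accept=0.9)

    # Variational alternative: ADVI optimizes the ELBO; faster but typically less accurate
    approx = pm.fit(20000, method="advi")
    idata_vb = approx.sample(1000)               # draws from the fitted approximation

print(az.summary(idata, var_names=["mu", "tau"]))
print(az.summary(idata_vb, var_names=["mu", "tau"]))
```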

Summary

We’ve examined various approaches for estimating population means from hierarchically structured data, each with distinct assumptions and applications:

| Method | Assumptions | Weighting | Uncertainty Handling | Use Cases | Complexity |
|---|---|---|---|---|---|
| Complete Pooling (Micro-Averaging) | i.i.d. measurements | Sample size | Variance over all instances in all samples | Homogeneous groups | Low |
| Macro-Averaging | Equal sample importance | Equal per sample | Between-sample variance | Small N, balanced groups | Low |
| Fixed-Effects (FE) | Homogeneous true means | Inverse within-sample variance | Within-sample variance | Controlled experiments | Medium |
| Random-Effects (RE) | Heterogeneous true means | Inverse total variance | Within- + between-sample variance | Observational studies | High |
| Bayesian Hierarchical | Inductive biases for likelihood, prior, and hyperprior | Adaptive shrinkage | Full posterior distribution | Small samples, complex hierarchies | Very high |

The choice of method depends on the specific data distribution and your needs. In practice, Complete Pooling/Micro-Averaging works best when samples are homogeneous and efficiency is key, while Macro-Averaging is ideal when each sample should have equal influence. Random-Effects modeling offers a good balance of flexibility and interpretability, though it may be less efficient than an equivalent Fixed-Effects model when samples are truly homogeneous. Bayesian approaches provide the most thorough uncertainty quantification but require more inductive biases and computational resources. For very large datasets (e.g. >1M instances) with limited computing power, consider using Variational Bayes or simplified meta-analytic methods. When unsure, the DerSimonian-Laird random-effects model with the HKSJ variance adjustment can be a good default choice.

Citation

If you find this post helpful and are interested in referencing it in your write-up, you can cite it as

Xiao, Jiajie. (March 2025). Estimating Statistical Properties in Grouped Measurements. JX’s log. Available at: https://jiajiexiao.github.io/posts/2023-03-01_stats-for-mil/.

or add the following to your BibTeX file.

@article{xiao2025stats-for-mil,
  title   = "Estimating Statistical Properties in Grouped Measurements",
  author  = "Xiao, Jiajie",
  journal = "JX's log",
  year    = "2025",
  month   = "March",
  url     = "https://jiajiexiao.github.io/posts/2023-03-01_stats-for-mil/"
}

References

  • Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational researcher, 5(10), 3-8.
  • DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled clinical trials, 7(3), 177-188.
  • Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3).

[^1]: Perfect transportability describes a strong assumption that estimation results from any sample can be directly generalized to all other samples without adjustment: $$ \forall i,k\ \lim_{n_i,n_k \to \infty} (\overline{x}_i - \overline{x}_k) = 0. $$ This implies sample collection mechanisms and subpopulation characteristics do not systematically influence measurements. For example, perfect transportability holds if all clinics draw from identical patient populations and all factories use identical processes and materials during manufacturing.

[^2]: The inverse-variance weighting scheme ($w = 1/\mathrm{SE}^2$) naturally assigns higher weights to more precise estimates - those with smaller within-sample variance and/or larger sample sizes. This approach combines heterogeneous estimates by giving more influence to more reliable measurements while downweighting less precise ones. For example, in clinical trials, larger studies with tighter confidence intervals would receive proportionally more weight than smaller studies with wider intervals.

[^3]: Asymptotic approximations in this context refer to statistical properties that only hold as the sample size approaches infinity. For example, the Central Limit Theorem guarantees sample means are normally distributed only for large enough samples. With small samples, these approximations become less reliable - normal distributions may not adequately describe uncertainty, and confidence intervals based on standard errors may not achieve their nominal coverage rates (e.g., a “95%” confidence interval may contain the true parameter less than 95% of the time).