Rolf Sundberg
Mathematical Statistics, Stockholm University
Lelystad, May 2000
Situation:
Calibration is a process in which we establish a relation between
We can only use a sample, not the whole population.
The relationship is used to predict the true value for new pigs.
There is an interplay between prediction and estimation.
| Prediction | Estimation |
| (R)MSEP | (R)MSE |

(Imagined known) theoretical distribution for (x, y)
Predict future y0 for known x0 by
Prediction error = residual
Accuracy measure
Minimum MSEP for
Under some conditions straight line:
And
Independent of x0
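The formulas on this slide did not survive extraction. A standard textbook formulation that fits the surrounding text is sketched below. It is an assumption, not the recovered slide content, and the symbols f, alpha, beta and sigma are introduced here for illustration only.

```latex
\hat{y}_0 = f(x_0), \qquad
\text{prediction error} = y_0 - \hat{y}_0, \qquad
\mathrm{MSEP} = E\,(y_0 - \hat{y}_0)^2

% MSEP is minimised by the conditional mean:
f(x_0) = E(y \mid x_0)

% Under some conditions this is a straight line with constant residual variance:
E(y \mid x_0) = \alpha + \beta x_0, \qquad
\operatorname{Var}(y \mid x_0) = \sigma^2 \ \text{(independent of } x_0\text{)}
```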
Calibration data available instead of theoretical distribution
Random sample or controlled x

Estimated straight line regression (e.g. by OLS = Ordinary Least Squares)
Properties? How good?
If the sample size n is large, so that the line is precisely determined
The precision must be estimated, by
p is the dimension of x, in this example p = 1.
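The estimate itself was lost from the slide. A minimal sketch, assuming it is the usual residual mean square with n - p - 1 degrees of freedom (which matches the remark about p), is given below for the straight-line case p = 1. The function names and the data are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def residual_mse(x, y, p=1):
    """Residual sum of squares divided by n - p - 1 (assumed form of the estimate)."""
    a, b = fit_line(x, y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - p - 1)


# Made-up illustration data, not from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.0]
a, b = fit_line(x, y)
mse = residual_mse(x, y)
```

Dividing by n - p - 1 rather than n compensates for the p + 1 fitted parameters (intercept and slope).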
MSE is not (quite) fair as a prediction error measure, because yi has already influenced the fit
(1) Cross-validation, leave-one-out. Computational burden?
(2) Variation of (1): leave out larger subsets of data, not only one at a time.
(3) A sort of extreme of (2): calibration set and validation set (test set). Two MSEP values when the two sets exchange roles.
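Leave-one-out cross-validation, option (1), can be sketched for a straight-line OLS fit as follows. The helper names and the data are illustrative assumptions. Each observation is predicted from a line fitted without it, so its own y-value cannot influence its prediction.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def resubstitution_mse(x, y):
    """Mean squared residual when every point is used both to fit and to test."""
    a, b = fit_line(x, y)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


def loo_msep(x, y):
    """Leave-one-out cross-validated mean squared error of prediction."""
    n = len(x)
    total = 0.0
    for i in range(n):
        # Refit the line without observation i, then predict observation i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (a + b * x[i])) ** 2
    return total / n


# Made-up illustration data, not from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.0]
mse = resubstitution_mse(x, y)
msep_cv = loo_msep(x, y)
```

The loop refits the model n times, which is the computational burden the slide asks about. Leaving out larger subsets, option (2), reduces the number of refits.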


For large n and OLS regression, with p<<n
More precisely
Examples from Danish study (1996)
| Method | n | p | MSEP/MSE | 1+(p+1)/n |
| OLS (FOM/MK) | 202 | 4 | 1.03 | 1.025 |
| PCR (CC) | 344 | 11 | 1.07 | 1.03 |
| PLS (Autofom) | 344 | 127 | 1.09 !!!! | 1.4 !!!! |
General conclusion: MSE too optimistic, a little or much
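The last column of the table can be recomputed from n and p. The short check below uses only the values printed in the table. The function name is an illustrative choice.

```python
# (method, n, p) as listed in the table above.
rows = [
    ("OLS (FOM/MK)", 202, 4),
    ("PCR (CC)", 344, 11),
    ("PLS (Autofom)", 344, 127),
]


def approx_ratio(n, p):
    """Large-sample OLS approximation 1 + (p + 1)/n from the table header."""
    return 1 + (p + 1) / n


ratios = {name: approx_ratio(n, p) for name, n, p in rows}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

For the PLS row the OLS-type approximation (about 1.37) is far above the observed ratio 1.09, which is the contrast the exclamation marks point at.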
PLS is one of several shrinkage methods (regularisation methods)
Others = PCR, CR, RR, LSRR
Why shrink?
To compensate for (near-)collinearity


Obvious risk for an extreme slope of the OLS-fitted plane, just by chance.
For safety, reduce this slope.
Some linear combinations of x-variables are almost constant (over observations).
How to detect near-collinearity?
Corr(X) matrix near singular, some very small eigenvalue(s)
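The eigenvalue check can be sketched with NumPy as follows. The data are simulated for illustration only: x2 is built as a near-copy of x1, so one linear combination (x1 - x2) is almost constant.

```python
import numpy as np

# Simulated data (an assumption for illustration, not from the slides).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the x-variables and its eigenvalues (ascending order).
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)

# A very small eigenvalue, or a large ratio of largest to smallest eigenvalue,
# signals near-collinearity.
condition_number = eigvals[-1] / eigvals[0]
print(eigvals, condition_number)
```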
Statistical consequence for OLS:
==> b likely to have large coefficients (by chance)
This is unavoidable if p ≥ n
Near-collinearity is typical if p is large
Near-collinearity may occur for p small
Different approach:
What are the “principal properties” of the measurement system?
What is the natural (chemical/biological) rank of the system (the data)?
Variation typically takes place essentially only in a low-dimensional space (x-space)

| Estimator | Predictor |
| No systematic error in OLS, but can be far from truth/causality => misinterpretations | Works if shrunk (omitted) directions have little influence, or new data vary little in shrunk directions (like calibration data) |
Estimation for description & interpretation, MSE
Predictivity measures: internal (simulated), external (true)
Representativity of calibration/test set
What is to be predicted? True y
What can be achieved? Measured y
Pretreatment: Shrinkage methods not invariant under e.g. individual rescaling of x
1. Principal components analysis = PCA, principal components regression = PCR

(t1, t2) are equivalent to (x1, x2), but whereas x1, x2 vary equally much and are strongly correlated, t1 varies much more than t2 (≈ constant) and t1, t2 are uncorrelated.
Much more likely t1 can explain variation in y, than t2
PC1: t1 along direction that explains most variation
PC2: t2 in orthogonal direction that explains next most variation, etc., if there are more dimensions.
So:
PCA yields t1, ..., tp, which replace x1, ..., xp
t1 = c11 x1 + ... + c1p xp
The ti are called scores.
The cij are called loadings.
PCR:
Regress y on only t1, ..., tk, instead of the full regression on t1, ..., tp or, equivalently, y on x1, ..., xp.
Choose the number k by cross-validation.
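A minimal NumPy sketch of PCR with k chosen by leave-one-out cross-validation is given below. The function names (`pcr_fit`, `pcr_predict`, `loo_msep`) and the simulated data are assumptions for illustration. Loadings are taken from the singular value decomposition of the centered X.

```python
import numpy as np


def pcr_fit(X, y, k):
    """Regress y on the first k principal component scores of centered X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are loading vectors, ordered by decreasing score variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    C = Vt[:k].T                 # p x k loadings
    T = Xc @ C                   # n x k scores t1, ..., tk
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return x_mean, y_mean, C @ q  # coefficients expressed on the x scale


def pcr_predict(model, X_new):
    x_mean, y_mean, b = model
    return y_mean + (X_new - x_mean) @ b


def loo_msep(X, y, k):
    """Leave-one-out MSEP for PCR with k components."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        model = pcr_fit(X[keep], y[keep], k)
        errors.append((y[i] - pcr_predict(model, X[i:i + 1])[0]) ** 2)
    return float(np.mean(errors))


# Simulated data: four x-variables driven by one latent factor.
rng = np.random.default_rng(1)
n, p = 30, 4
latent = rng.normal(size=(n, 1))
X = latent @ np.ones((1, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.1 * rng.normal(size=n)

msep_by_k = {k: loo_msep(X, y, k) for k in range(1, p + 1)}
best_k = min(msep_by_k, key=msep_by_k.get)
```

With k = p the sketch reproduces the full OLS regression of y on x1, ..., xp, as the slide states.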
Possible inefficiency: there may be PCs which do not influence y.
Why include them in the regression? They only contribute uncertainty.
PLS is more efficient in this respect, but otherwise similar to PCR.
OLS maximises Corr(y, t(x)) over t and Corr(y, t1) over t1
PLS maximises Cov(y, t(x)) over t and Cov(y, t1) over t1
PCR maximises Var(t1)
The ci values form a direction vector where
PLS is a compromise between extremes
Wish to have highest possible correlation
Wish to have high(est possible) variance in t1
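The three criteria can be compared numerically. The sketch below assumes centered data and a unit-length direction vector c1 (the constraint itself was lost from the slide, so the unit-length normalisation is an assumption). It uses the standard closed forms: the OLS coefficient direction, the direction of X'y, and the leading eigenvector of X'X. Data and function names are illustrative.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def first_directions(X, y):
    """Unit-length c1 for OLS, PLS and PCR, with t1 = Xc @ c1."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    c_ols = unit(np.linalg.solve(Xc.T @ Xc, Xc.T @ yc))  # maximises Corr(y, t1)
    c_pls = unit(Xc.T @ yc)                              # maximises Cov(y, t1)
    _, eigvecs = np.linalg.eigh(Xc.T @ Xc)               # eigenvalues ascending
    c_pcr = eigvecs[:, -1]                               # maximises Var(t1)
    return c_ols, c_pls, c_pcr


# Simulated data for illustration.
rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.normal(size=(n, p)) * np.array([3.0, 1.0, 0.5])
y = X @ np.array([0.2, 1.0, 2.0]) + rng.normal(size=n)
c_ols, c_pls, c_pcr = first_directions(X, y)
```

In this simulated example the PLS direction typically lies between the other two, in line with PLS being a compromise between high correlation and high variance.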
More general approach:
1) Maximise some expression f(Corr(t1, y), Var(t1)) with respect to c1, where f is increasing in both Corr and Var. This yields some direction c1.
2) OLS regress y on t1 to form a predictor.
3) Calculate residuals from 2), and repeat the procedure on them, if desirable.
It can be shown then that
Note to step 2 above:
Upscaling of bRR by least squares, so-called LSRR.
In typical use, RR ≈ LSRR (in the sense bRR ≈ bLSRR for small δ).
Choose δ by cross-validation.
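The relation between RR and LSRR can be sketched as follows, assuming the usual ridge estimator b_RR = (X'X + δI)^(-1) X'y on centered data (this formula is an assumption; it is not printed on the slide). LSRR then rescales b_RR by an ordinary least squares regression of y on t = X b_RR. Function names and data are illustrative.

```python
import numpy as np


def ridge_coef(Xc, yc, delta):
    """Ridge coefficients (X'X + delta*I)^(-1) X'y for centered data."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + delta * np.eye(p), Xc.T @ yc)


def lsrr_coef(Xc, yc, delta):
    """Least squares ridge regression: b_RR rescaled by regressing y on t = X b_RR."""
    b_rr = ridge_coef(Xc, yc, delta)
    t = Xc @ b_rr
    scale = (t @ yc) / (t @ t)  # OLS slope of y on t; at least 1 for delta >= 0
    return scale * b_rr


# Simulated, moderately collinear data for illustration.
rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()

delta = 0.01
b_rr = ridge_coef(Xc, yc, delta)
b_lsrr = lsrr_coef(Xc, yc, delta)
```

For small δ the scale factor is close to 1, so the two coefficient vectors nearly coincide, consistent with RR ≈ LSRR in typical use.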
Now OLS, PLS, PCR satisfy criteria of this type ⇒ LSRR
OLS: δ = 0
PLS: δ → ∞ (first factor, first latent variable, first PC)
PCR: δ → −λmax (first factor, first latent variable, first PC)
λmax = maximal eigenvalue of X^T X
So all these methods are strongly related to one another.
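The three limits can be checked numerically under the assumption that the direction in question is the ridge-type one, c1 proportional to (X'X + δI)^(-1) X'y. That formula was lost from the slides and is supplied here as an assumption. Data and names are illustrative.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def ridge_direction(Xc, yc, delta):
    """Unit vector along (X'X + delta*I)^(-1) X'y (assumed form of c1)."""
    p = Xc.shape[1]
    return unit(np.linalg.solve(Xc.T @ Xc + delta * np.eye(p), Xc.T @ yc))


# Simulated data with well-separated eigenvalues of X'X.
rng = np.random.default_rng(4)
n = 40
X = rng.normal(size=(n, 3)) * np.array([3.0, 1.0, 0.5])
y = X @ np.ones(3) + rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # ascending order
lam_max, v_max = eigvals[-1], eigvecs[:, -1]

c_ols = ridge_direction(Xc, yc, 0.0)                        # delta = 0
c_pls = ridge_direction(Xc, yc, 1e8)                        # delta -> infinity
c_pcr = ridge_direction(Xc, yc, -lam_max * (1.0 - 1e-6))    # delta -> -lambda_max

# Absolute cosines with the PLS and PCR reference directions (close to 1).
cos_pls = abs(c_pls @ unit(Xc.T @ yc))
cos_pcr = abs(c_pcr @ v_max)
```

The sign of an eigenvector is arbitrary, which is why the comparison uses absolute cosines.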
One more such method:
Maximise Corr(t1, y)/Var(t1)^γ with respect to c1.
Choose the γ that yields the best cross-validation.
Repeat on y-residuals to form the next factor t2, etc.
This is continuum regression (Stone & Brooks).
All of these shrinkage methods are justifiable, and they typically yield quite similar predictors. Perhaps PCR and PLS are conceptually preferable, and PLS is slightly more efficient.
Schematic picture of PLS or PCR


Shrinkage methods are not invariant under transformations of x

Autoscaling (may be difficult): sometimes it is reasonable, often not, for instance with spectral measurements or a difference spectrum.
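The lack of invariance can be illustrated with the first principal component before and after autoscaling. The sketch below uses two simulated, positively correlated variables on very different scales. The variable names and numbers are hypothetical and are not taken from any data in the slides.

```python
import numpy as np


def autoscale(X):
    """Center each column and divide by its sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def first_pc(X):
    """Loading vector of the first principal component of centered X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


# Hypothetical data: two correlated variables with unequal spread.
rng = np.random.default_rng(6)
n = 100
z = rng.normal(size=n)
weight = 100 + 10 * (z + 0.5 * rng.normal(size=n))  # large variance
fat = 20 + 2 * (z + 0.5 * rng.normal(size=n))       # small variance
X = np.column_stack([weight, fat])

pc_raw = first_pc(X)                  # dominated by the high-variance variable
pc_scaled = first_pc(autoscale(X))    # both variables weighted equally
```

On the raw scale the first component is almost entirely the high-variance variable. After autoscaling the two variables enter with equal weight, so any PCR or PLS fit built on these components changes as well.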
How about weight and fat thickness for pigs?
Two different situations:
Calibration
Prediction
Calibrate for individuals, which might be in the population.


(with PCR or PLS) t1 and t2 from calibration.

Larger samples
Wider samples
More / other variables
Better model (transform variables, include interactions, include nonlinearities).
Better predictor.
Double regression fits well with PCR/PCA
Proposed procedure:
Use PCA to find the principal components describing the variation in (X, Z) jointly, from the total data set on (X, Z). Say the result is t = t(X, Z).
Use OLS regression of y on t to construct a predictor based on t = t(X, Z).
Use cross-validation to choose the number of PCs in (X, Z) that best predicts y.
When only x is available, predict y via PCs, see next page.
Both X and Z are selected to be able to describe y, so there will probably be a large extent of collinearity in (X, Z). Hence PCR or a similar method is probably motivated.
(Application to double regression)

Predict y0 by
Missing data more easily handled by PLS & PCR than by methods like ridge regression.
(Data from Brockhoff et al. 1993)
Concerns: Smell of apples after storage under n = 48 different conditions.
Y = preference of smell on a 0-5 point scale, averaged over a trained sensory panel of 10 assessors.
X = (x1, ..., x15) = intensities of p = 15 GC peaks, corresponding to 15 volatiles.
Questions:
Can y be predicted from x?
Can y be understood from x?
X data (GC peak areas), n=48 samples, p=15 variables.

y on x plots for some x-variables

MSEP

Regression coefficients when Ordinary Least Squares regression is used.


Star: PLS one LV
Box: PCR one PC
Regression of y = “preference” on GC



Karlsson, Karlberg & Olsson, KTH & SU, Anal. Chim. Acta 1995.

125 specimens, spectra at 316 wavelengths (UV-visible) and nitrate by reference method.
Centered data for each wavelength.

Minimum norm LS

Shows that we gain a little by restricting the data to the first 100 wavelengths (the remaining ones appear not to contain any information).
PLS, centered data.

Solid line: all wavelengths
Dashed line: first 100 wavelengths
See how similar the curves are! (of regression coefficients as function of wavelength)

PCR(20), PLS(20) and LSRR(0.003), chosen to have their highest peaks of about the same amplitude.
PCR: dotted line, PLS: dashed line, LSRR: solid line
Regressions of nitrate on absorbances at first 100 wavelengths. CV leave-one-out MSEP values.

PLS and PCR plotted against number of factors. PCR: dashed line, PLS: solid line
Calibration set and test set after random split ⇒ “leave one out” and test set yield about the same MSEP values.

Calibration set and test set separated in time ⇒ “leave one out” is too optimistic in its MSEP.

The same situation as on the previous page, but with PCR instead of PLS.

Brown P. J. (1993): Measurement, regression and calibration. Oxford University Press.
Martens H., Næs T. (1989): Multivariate calibration. Wiley.
Sundberg R. (1999): Multivariate calibration – direct and indirect regression methodology. Scand. J. Statist., Vol 26, pp 161-207 (with discussion). (Review paper; contains the spectroscopy example.)
Sundberg R. (2000): Aspects of statistical regression in sensometrics. Food Qual. & Preference, Vol 11, pp 17-26. (Contains the sensometrics example.)