Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

40 messages Options
Embed this post
Permalink
1 2
Markus Metz-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris

Nikos Alexandris wrote:
> Nikos:
>  
>>> Hi Agus. The difference is:
>>> Using SPOT (range up to max. 255):
>>>      
That implied to me that the max possible value of SPOT was used, not the
max observed (often the max observed is also the max possible). In my
experience, when comparing imagery or derived products from different
sensors, the least errors occur if everything is rescaled to the same
range, e.g. 0 - 255 for surface reflectance or -1 - 1 for NDVI, whatever
is convenient or appropriate.
Note that the scale factor for MODIS is 0.0001 which would give you a
new range of -0.01 - 1.6 (I hope I got my math right), not 0 - 255 as
for e.g. SPOT or Landsat.


_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink

Nikos:  
> >>> Hi Agus. The difference is:
> >>> Using SPOT (range up to max. 255):

Markus M:  
> That implied to me that the max possible value of SPOT was used, not the
> max observed (often the max observed is also the max possible).

Right. Sorry for not being precise.


> In my experience, when comparing imagery or derived products from different
> sensors, the least errors occur if everything is rescaled to the same
> range, e.g. 0 - 255 for surface reflectance or -1 - 1 for NDVI, whatever
> is convenient or appropriate.

> Note that the scale factor for MODIS is 0.0001 which would give you a
> new range of -0.01 - 1.6 (I hope I got my math right), not 0 - 255 as
> for e.g. SPOT or Landsat.

The thing is by multiplying by 0.0001 thing are worse concerning the
*eigenvalues* (the eigenvectors are the same):


# using the MODIS bands as they are
r.info -r mod_b2

min=0
max=5504


# use of i.pca gives
r.info -h pca_mod_b267.1

[...]
Data Description:
   generated by i.pca
Comments:
   Eigen values, (vectors), and [percent importance]:
   PC1 6307563.04 (-0.6353,-0.6485,-0.4192)[98.71%]
   PC2  78023.63 (-0.7124, 0.2828, 0.6422)[1.22%]
   PC3   4504.60 (-0.2979, 0.7067,-0.6417)[0.07%]
   

# using MODIS bands multiplied by 0.0001
r.info -r mod_b2_x

min=0
max=0.5504


# using i.pca gives
r.info -h pca.mod_x.1

[...]  
Data Description:
   generated by i.pca
Comments:
   Eigen values, (vectors), and [percent importance]:
   PC1      0.06 (-0.6353,-0.6485,-0.4192)[98.71%]
   PC2      0.00 (-0.7124, 0.2828, 0.6422)[1.22%]
   PC3      0.00 (-0.2979, 0.7067,-0.6417)[0.07%]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Markus Metz-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink

Nikos Alexandris wrote:

>
> The thing is by multiplying by 0.0001 thing are worse concerning the
> *eigenvalues* (the eigenvectors are the same):
>
>
> # using the MODIS bands as they are
> r.info -r mod_b2
>
> min=0
> max=5504
>
>
> # use of i.pca gives
> r.info -h pca_mod_b267.1
>
> [...]
> Data Description:
>    generated by i.pca
> Comments:
>    Eigen values, (vectors), and [percent importance]:
>    PC1 6307563.04 (-0.6353,-0.6485,-0.4192)[98.71%]
>    PC2  78023.63 (-0.7124, 0.2828, 0.6422)[1.22%]
>    PC3   4504.60 (-0.2979, 0.7067,-0.6417)[0.07%]
>    
>
> # using MODIS bands multiplied by 0.0001
> r.info -r mod_b2_x
>
> min=0
> max=0.5504
>
>
> # using i.pca gives
> r.info -h pca.mod_x.1
>
> [...]  
> Data Description:
>    generated by i.pca
> Comments:
>    Eigen values, (vectors), and [percent importance]:
>    PC1      0.06 (-0.6353,-0.6485,-0.4192)[98.71%]
>    PC2      0.00 (-0.7124, 0.2828, 0.6422)[1.22%]
>    PC3      0.00 (-0.2979, 0.7067,-0.6417)[0.07%]
>
>  
OK, I don't have the full discussion on i.pca in my head, so I don't
know how much sense my comments make. The loadings and percentages
explained variance are identical, that's good. The Eigenvalues are not,
it seems they were calculated from unstandardised (raw) values. For
imagery processing, that may be desired, for other applications AFAIK it
is required that input variables variables (here different bands) are
standardised first so they can be combined and principal components
extracted. I'm more familiar with non-spatial PCA, so it's high time I
read the manual of i.pca, and the new wiki page on it...

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink

Nikos:
> > The thing is by multiplying by 0.0001 thing are worse concerning the
> > *eigenvalues* (the eigenvectors are the same):

> > # use of i.pca gives
> > r.info -h pca_mod_b267.1
[...]
> >    Eigen values, (vectors), and [percent importance]:
> >    PC1 6307563.04 (-0.6353,-0.6485,-0.4192)[98.71%]
> >    PC2  78023.63 (-0.7124, 0.2828, 0.6422)[1.22%]
> >    PC3   4504.60 (-0.2979, 0.7067,-0.6417)[0.07%]

> > # using i.pca gives
> > r.info -h pca.mod_x.1
[...]
> >    Eigen values, (vectors), and [percent importance]:
> >    PC1      0.06 (-0.6353,-0.6485,-0.4192)[98.71%]
> >    PC2      0.00 (-0.7124, 0.2828, 0.6422)[1.22%]
> >    PC3      0.00 (-0.2979, 0.7067,-0.6417)[0.07%]


Markus M:  
> OK, I don't have the full discussion on i.pca in my head, so I don't
> know how much sense my comments make. The loadings and percentages
> explained variance are identical, that's good.

Yep.


> The Eigenvalues are not, it seems they were calculated from unstandardised (raw) values.

Note: the percent importance is nothing else than just transforemd
eigenvalues (that is: sum-up all eigenvalues and say the sum is the
100%, take then the percent of each eigenvalue).

The fact that the 2nd an d 3rd eigenvalues  in the above example are
0.00 is a (another) print-out/report issue I think.

However, the multiplication of the MODIS bands with the recommended
factor (0.0001) does nothing to the way i.pca treats the data.

##
I am convinced that i.pca wrongly _depends_ currently on the range of
the input data whether to apply data centering or not.

I just need to rescale the MODIS bands in to 0,255 and confirm my
skepsis. If I am wrong then it might be even more complicated!!
##


>  For imagery processing, that may be desired, for other applications AFAIK it
> is required that input variables variables (here different bands) are
> standardised first so they can be combined and principal components
> extracted.

As Augustin Lobo and others suggested, data centering should be
performed. And me, with my limited knowledge and experience think the
same.


> I'm more familiar with non-spatial PCA, so it's high time I read the
> manual of i.pca, and the new wiki page on it...

Markus, I might missed to update some important sentences. So "handle
with care" :-)

Cheers, Nikos

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Edzer Pebesma-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris
Nikos Alexandris wrote:

> Edzer:
>  
> I do worry about the wiki that states results are "similar" between R and
>> grass, when the differences appear e.g. in the third or second digit of a
>> loading (eigenvector value). Clearly something is is going on, here.
>>    
>
> So you think it's really important? Isn't it a matter of pixel-inclusion
> in the various algorithms/programs?
>  
Well, it's just much larger than I could imaging numerical dispersion
would give; if you mean by pixel-inclusion different methods how missing
values are treated, then I can understand it a bit better, given that
there are substantial amounts of them. The wiki might mention this
possibility. (Although I don't see why grass wouldn't ignore partially
missing valued pixels)

Note that I didn't run the examples, I was just surprised. When I
compare different software implementations of an algorithm, I usually
don't stop worrying until differences happen beyond digit 5.

--
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 [hidden email]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Edzer Pebesma-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris


Nikos Alexandris wrote:
>
>> I'm more familiar with non-spatial PCA, so it's high time I read the
>> manual of i.pca, and the new wiki page on it...
>>    
I think there's no such thing as spatial or non-spatial PCA. There's
just PCA.

--
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 [hidden email]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
hamish-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris

Markus Neteler wrote:

> > Remember that you have to rescale these data first:
> >
> > https://lpdaac.usgs.gov/lpdaac/products/modis_products_table
> > -> MOD09Q1      Terra
>     Surface Reflectance Bands 1–2
>     Tile      250m
>     8 Day
> >     https://lpdaac.usgs.gov/lpdaac/products/modis_products_table/surface_reflectance/8_day_l3_global_250m/v5/terra
> >    -> layers
> >
> > ... MULTIPLY BY SCALE FACTOR ...

Nikos wrote:
> Yep, I know that. But why is it wrong to just use them as they are?


FWIW, the MODIS/Aqua Chlorophyll-a product requires more than simple
linear rescaling:
  r.mapcalc "${map}.chlor_A = 10^(($Slope * $map) + $Intercept)"

http://grass.osgeo.org/wiki/MODIS#Processing_2


so you may need to take care if working directly with raw 0-65535 data.


Hamish





_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Markus Metz-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Edzer Pebesma-2

Edzer Pebesma wrote:
> Markus Metz wrote:
>  
>>> I'm more familiar with non-spatial PCA, so it's high time I read the
>>> manual of i.pca, and the new wiki page on it...
>>>    
>>>      
> I think there's no such thing as spatial or non-spatial PCA. There's
> just PCA.
>  
That was a feeble attempt to buy time to go through some statistics
literature ;-)

So it seems that this thread is about the different values for
eigenvalues. AFAIKT, the answer is in the very first post of this thread
[1]. It seems that i.pca output is supposed to be identical to
prcomp(center=FALSE, scale=FALSE) output in R, because a PCA is
scale-sensitive and the eigenvalue as reported by i.pca is the variance
of the raw, unstandardised data. If outputs are not identical, either R
or grass do some hidden modification or there is a bug in either grass
or R (all within limits, e.g. identical up to the 5th digit in
scientific format is fine?).

Some textbooks give a rule of thumb for further analysis to use only
components with an eigenvalue >=1 which obviously only works if the
eigenvalue is calculated from standardised values (center=TRUE,
scale=TRUE or e.g. r.mapcalc standardised_map = (map - mean) / stddev).
E.g., comparing the results of MODIS raw and MODIS scaled with 0.0001
should give <eigenvalue #x of MODIS scaled> = 1E-8 * <eigenvalue #x of
MODIS raw>.

BTW, the rescaling method of i.pca is not very convincing, as pointed
out by Augustin Lobo. IMHO, fool-proof would be normalization (x - mean)
/ stddev.

[1] http://lists.osgeo.org/pipermail/grass-user/2009-March/049306.html
_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Edzer Pebesma-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
Markus, a few notes:

- if you do PCA on uncentered data, by computing the eigenvalues of the
uncentered covariance matrix, this implies that bands with a larger mean
will get more influence on the final PCAs. I have sofar not managed
finding an argument why this would be desirable.
- if you do PCA on (band-mean)/sd(band), it means that you first
normalize (scale) each variable to mean zero and unit variance. This
procedure is identical to doing PCA on the correlation matrix. It means
that, unlike for unscaled variables, variables with larger variance will
not get more influence on the PCA than others. For image analysis I can
see a place for both; if bands with low variance indicate insignificant
and perhaps noisy information, you may downweight them. Or not, if they
contain (equally) important information. Scaling becomes urgent when you
compute PCAs from a mix of things with uncomparable units, such as image
bands and DTMs.
- Only in case of normalized variables, or equivalently PCA on
correlations, it makes sense to select PC's with an eigenvalue larger
than 1. The reasoning is fairly weak, but goes like this: if a PC has
eigenvalue > 1, it explains more variance than any of the original
variables, which all have variance 1.

Maybe I should Cc: this to the wiki.
--
Edzer

Markus Metz wrote:

>
> Edzer Pebesma wrote:
>> Markus Metz wrote:
>>  
>>>> I'm more familiar with non-spatial PCA, so it's high time I read the
>>>> manual of i.pca, and the new wiki page on it...
>>>>          
>> I think there's no such thing as spatial or non-spatial PCA. There's
>> just PCA.
>>  
> That was a feeble attempt to buy time to go through some statistics
> literature ;-)
>
> So it seems that this thread is about the different values for
> eigenvalues. AFAIKT, the answer is in the very first post of this
> thread [1]. It seems that i.pca output is supposed to be identical to
> prcomp(center=FALSE, scale=FALSE) output in R, because a PCA is
> scale-sensitive and the eigenvalue as reported by i.pca is the
> variance of the raw, unstandardised data. If outputs are not
> identical, either R or grass do some hidden modification or there is a
> bug in either grass or R (all within limits, e.g. identical up to the
> 5th digit in scientific format is fine?).
>
> Some textbooks give a rule of thumb for further analysis to use only
> components with an eigenvalue >=1 which obviously only works if the
> eigenvalue is calculated from standardised values (center=TRUE,
> scale=TRUE or e.g. r.mapcalc standardised_map = (map - mean) /
> stddev). E.g., comparing the results of MODIS raw and MODIS scaled
> with 0.0001 should give <eigenvalue #x of MODIS scaled> = 1E-8 *
> <eigenvalue #x of MODIS raw>.
>
> BTW, the rescaling method of i.pca is not very convincing, as pointed
> out by Augustin Lobo. IMHO, fool-proof would be normalization (x -
> mean) / stddev.
>
> [1] http://lists.osgeo.org/pipermail/grass-user/2009-March/049306.html

--
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 [hidden email]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Markus Metz-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink

Edzer Pebesma wrote:
> Markus, a few notes:
>
> - if you do PCA on uncentered data, by computing the eigenvalues of the
> uncentered covariance matrix, this implies that bands with a larger mean
> will get more influence on the final PCAs. I have sofar not managed
> finding an argument why this would be desirable.
>  
Add it to wiki? E.g. bands entered in a PCA should have the same mean,
but normalization is also an option.
> - if you do PCA on (band-mean)/sd(band), it means that you first
> normalize (scale)
I think scale and normalize are two different things.
> each variable to mean zero and unit variance. This
> procedure is identical to doing PCA on the correlation matrix. It means
> that, unlike for unscaled variables, variables with larger variance will
> not get more influence on the PCA than others. For image analysis I can
> see a place for both; if bands with low variance indicate insignificant
> and perhaps noisy information, you may downweight them.
Variance is dependent on range, I would rather use something like
coefficient of variation (stddev/mean) to get some scale-independent
indicator on the amount of information in a given band. A downscaled
band (e.g. MODIS scale of 0.0001) has still the same information but
lower variance.
> - Only in case of normalized variables, or equivalently PCA on
> correlations, it makes sense to select PC's with an eigenvalue larger
> than 1. The reasoning is fairly weak, but goes like this: if a PC has
> eigenvalue > 1, it explains more variance than any of the original
> variables, which all have variance 1.
>  
Sounds good to me, why should I use a component that explains less than
any of the original bands? And the whole purpose of a PCA is variable
reduction to get a new set of variables, each explaining the whole
dataset better than one of the original variables/bands. A PCA produces
as many components as input variables, so some selection is usually
necessary for further processing, could also be % explained variance.
OTOH, sometimes only the first component is of interest. There may be
exceptions for imagery processing, e.g. haze reduction (would have to
read up on imagery processing too to say anything more about where
components with eigenvalue < 1 could be useful).

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Edzer Pebesma-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
Markus Metz wrote:

>
> Edzer Pebesma wrote:
>> Markus, a few notes:
>>
>> - if you do PCA on uncentered data, by computing the eigenvalues of the
>> uncentered covariance matrix, this implies that bands with a larger mean
>> will get more influence on the final PCAs. I have sofar not managed
>> finding an argument why this would be desirable.
>>  
> Add it to wiki? E.g. bands entered in a PCA should have the same mean,
> but normalization is also an option.
>> - if you do PCA on (band-mean)/sd(band), it means that you first
>> normalize (scale)
> I think scale and normalize are two different things.
I believe that in statistics these two words don't have a generally
accepted definition. They're useful as long as you explain what you mean
by them.

>> each variable to mean zero and unit variance. This
>> procedure is identical to doing PCA on the correlation matrix. It means
>> that, unlike for unscaled variables, variables with larger variance will
>> not get more influence on the PCA than others. For image analysis I can
>> see a place for both; if bands with low variance indicate insignificant
>> and perhaps noisy information, you may downweight them.
> Variance is dependent on range, I would rather use something like
> coefficient of variation (stddev/mean) to get some scale-independent
> indicator on the amount of information in a given band. A downscaled
> band (e.g. MODIS scale of 0.0001) has still the same information but
> lower variance.
This may make sense for some (or many) cases, but not all.

>> - Only in case of normalized variables, or equivalently PCA on
>> correlations, it makes sense to select PC's with an eigenvalue larger
>> than 1. The reasoning is fairly weak, but goes like this: if a PC has
>> eigenvalue > 1, it explains more variance than any of the original
>> variables, which all have variance 1.
>>  
> Sounds good to me, why should I use a component that explains less
> than any of the original bands? And the whole purpose of a PCA is
> variable reduction to get a new set of variables, each explaining the
> whole dataset better than one of the original variables/bands. A PCA
> produces as many components as input variables, so some selection is
> usually necessary for further processing, could also be % explained
> variance. OTOH, sometimes only the first component is of interest.
> There may be exceptions for imagery processing, e.g. haze reduction
> (would have to read up on imagery processing too to say anything more
> about where components with eigenvalue < 1 could be useful).
>
Well, PCA only captures covariance or correlation, meaning linear
relationships, and it may be the case that the most interesting features
are non-linear. For instance, NDVI is the ratio of a sum over a
difference (or reversed?), which cannot be expressed as a linear
combination of bands. The first PCA(s?) usually express brightness, only
later ones give more interesting features resulting from more complex
interactions of bands (notably differences) -- loadings usually have the
same sign for the first PC, and mixed signs for later PC's. John C.
Davis in "statistics and data analysis for geologists" called this the
"size and shape effect". The most interesting PC's may have a EV smaller
than 1, when they come from correlation matrices. Geochemists don't shy
away from interpreting 7 or more factors.

Other interesting "features" of PCA are that it (a) ignores for two
bands whether they are adjacent in wavelength or not, and (b) ignores
for two pixels whether they are adjacent in space or not. At least for
"solving" problem (b), there's the max/min autocorrelation factors (by
Switzer et al) or MNF (max. noise fraction, by Green et al) -- are these
things available in grass, or in R?

--
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763 http://ifgi.uni-muenster.de/
http://www.springer.com/978-0-387-78170-9 [hidden email]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Markus Metz-2
Markus:
> It seems that i.pca output is supposed to be identical to
> prcomp(center=FALSE, scale=FALSE) output in R, because a PCA is
> scale-sensitive and the eigenvalue as reported by i.pca is the variance
> of the raw, unstandardised data.

The "thing" is that with the SPOT data all seems fine and "i.pca ==
prcomp(x, center=TRUE, scale=FALSE)" which is not the case for the MODIS
bands I work with.


> If outputs are not identical, either R or grass do some hidden
> modification or there is a bug in either grass or R (all within
> limits, e.g. identical up to the 5th digit in scientific format is
> fine?).
> Some textbooks give a rule of thumb for further analysis to use only
> components with an eigenvalue >=1

I think this depends on what you are trying to achieve. Of course,
components with small(-er) eigenvalues include more "noizzze". In my
change detection project I used *only* components with eigenvalues < 1.


> which obviously only works if the eigenvalue is calculated from
> standardised values (center=TRUE, scale=TRUE or e.g. r.mapcalc
> standardised_map = (map - mean) / stddev).

> E.g., comparing the results of MODIS raw and MODIS scaled with 0.0001
> should give <eigenvalue #x of MODIS scaled> = 1E-8 * <eigenvalue #x of
> MODIS raw>.

I didn't find the time to rescale and re-test. I will... at some
point :-)


> BTW, the rescaling method of i.pca is not very convincing, as pointed
> out by Augustin Lobo. IMHO, fool-proof would be normalization (x -
> mean) / stddev.

Kind regards, Nikos

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Edzer Pebesma-2
On Wed, 2009-04-01 at 18:21 +0200, Edzer Pebesma wrote:

> Markus, a few notes:
>
> - if you do PCA on uncentered data, by computing the eigenvalues of the
> uncentered covariance matrix, this implies that bands with a larger mean
> will get more influence on the final PCAs. I have sofar not managed
> finding an argument why this would be desirable.
> - if you do PCA on (band-mean)/sd(band), it means that you first
> normalize (scale) each variable to mean zero and unit variance. This
> procedure is identical to doing PCA on the correlation matrix. It means
> that, unlike for unscaled variables, variables with larger variance will
> not get more influence on the PCA than others. For image analysis I can
> see a place for both; if bands with low variance indicate insignificant
> and perhaps noisy information, you may downweight them. Or not, if they
> contain (equally) important information. Scaling becomes urgent when you
> compute PCAs from a mix of things with uncomparable units, such as image
> bands and DTMs.
> - Only in case of normalized variables, or equivalently PCA on
> correlations, it makes sense to select PC's with an eigenvalue larger
> than 1. The reasoning is fairly weak, but goes like this: if a PC has
> eigenvalue > 1, it explains more variance than any of the original
> variables, which all have variance 1.
>
> Maybe I should Cc: this to the wiki.
> --
> Edzer

Nice to see this expert-comments! Really helpful to understand the
process better.

Thanks, Nikos

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Markus Metz-2

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Edzer Pebesma-2

Edzer Pebesma wrote:
> Markus Metz wrote:
>  
>> I think scale and normalize are two different things.
>>    
> I believe that in statistics these two words don't have a generally
> accepted definition. They're useful as long as you explain what you mean
> by them.
>  
At least in the statistics literature I use, these two methods are
differently defined. Scaling is like r.rescale, and normalization
converts data to a mean of 0 and a stddev of 1, the data distribution is
changed to a standard normal distribution. But usually I wouldn't worry
too much about terms as long as it is explained what they mean.
> Well, PCA only captures covariance or correlation, meaning linear
> relationships, and it may be the case that the most interesting features
> are non-linear.
So if a PCA does not capture non-linear relationships, I don't see how
it could help to use PC's that explain nearly no variation in the
dataset. And you could do e.g. a log transform first, or whatever else
is appropriate to convert the suspected type of non-linear relation to a
linear relation and then feed the transformed variables to a PCA.
> For instance, NDVI is the ratio of a sum over a
> difference (or reversed?), which cannot be expressed as a linear
> combination of bands.
Not directly, but being a normalized difference (should be standardised
not normalized) it can be approximated with linear combinations, i.e.
there is at least some correlation between the raw bands and a
normalized difference calculated from them.
> The first PCA(s?) usually express brightness, only
> later ones give more interesting features resulting from more complex
> interactions of bands (notably differences) -- loadings usually have the
> same sign for the first PC, and mixed signs for later PC's. John C.
> Davis in "statistics and data analysis for geologists" called this the
> "size and shape effect". The most interesting PC's may have a EV smaller
> than 1, when they come from correlation matrices. Geochemists don't shy
> away from interpreting 7 or more factors.
>  
The question is not the number of factors, but what criteria to use to
select and interpret the resulting PCs. What makes a PC interesting can
be the amount of explained variance, but also the dominant variables in
it. BTW, some textbooks recommend to use only rotated PCs if a rotation
could be performed. In a mathematical sense, the sign of the loadings is
arbitrary because the absolute value as well as the result of a PCA will
stay the same after new_var = -old_var. The same sign for the first PC
and so on is not generally valid and with regard to imagery probably
only applies to surface reflectance or radiation measured at the sensor,
and I would guess is dependent on the number of bands and the wavelength
captured by each.
All this is however far from the i.pca eigenvalue problem, going towards
comments on the general use of PCAs for remote sensing and as such
probably only of interest to the grass-stats ml.

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris
Nikos:
> > If outputs are not identical, either R or grass do some hidden
> > modification or there is a bug in either grass or R (all within
> > limits, e.g. identical up to the 5th digit in scientific format is
> > fine?).
> > Some textbooks give a rule of thumb for further analysis to use only
> > components with an eigenvalue >=1

> I think this depends on what you are trying to achieve. Of course,
> components with small(-er) eigenvalues include more "noizzze". In my
> change detection project I used *only* components with eigenvalues <
> 1.

Hmm.. Markus, I was too quick yesterday. That's not true. All of the
PC's I've used have "eigenvalue > 1". Sorry :-p

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris
On Sun, 2009-03-22 at 23:53 +0100, Nikos Alexandris wrote:
> I have prepaired a grass-location
> containing the 3 modis bands I've used for all PCA related examples. I
> will eventually upload them in gregis [1]

If anyone would like to try what I tried here are the MODIS bands:

# grass location
https://git.berlios.de/cgi-bin/gitweb.cgi?p=gregis;a=blob_plain;f=peloponnesos/modis_peloponnese_postfire07.zip;hb=peloponnese

# geotiffs
https://git.berlios.de/cgi-bin/gitweb.cgi?p=gregis;a=blob_plain;f=peloponnesos/modis_peloponnese_postfire07_GeoTiff.zip;hb=peloponnese

Cheers, Nikos


_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Testing i.pca (continued...)

Reply Threaded More More options
Print post
Permalink
In reply to this post by Nikos Alexandris
Hi all! Another i.pca -vs- prcomp() round :-)

# 1. There *must* be something in i.pca's code that prevents data
centering when using MODIS data???

  I tested... tested... & tested... (almost all tests below) and I just
can't understand WHY i.pca does NOT perform data centering when using
MODIS data (raw, rescaled at 0~255 or 0~254). All I get is:

  i.pca == prcomp(x, center=FALSE, scale=FALSE)


# 2. Note: there is _NO_ problem when using the spot or landsat5 data
from the spearfish location!

# 3. Can some programmer have a look at i.pca to check whether there is
a switch to (de-)activate data centering?

I don't want to update the wiki-page before I understand what is going
on. Any help?

Kindest regards, Nikos
=======================================================================

Quickly:

* using MODIS data I get:
-------------------------

 ** i.pca == prcomp(x, center=FALSE, scale=FALSE)

 ** if I center the data manually (as suggested by Agus), i.e.:

r.mapcalc "mod_b2_centered = mod_b2 - mean(mod_b2)"

  then I get:

i.pca(on centered data) == prcomp(x, center=TRUE, scale=FALSE)


 ** it also works if I further scale manually the data like:

r.mapcalc "mod_b2_centered_scaled = mod_b2_centered/stdev"



* using spot or landsat5 data I get:
------------------------------------

 ** i.pca == prcomp(x, center=TRUE, scale=FALSE)



##########################################
##########################################

Principal Component Analysis:
comparison between grass' i.pca and R's prcomp()

##########################################
##########################################

Testing... (Wednesday, April 29, 11:58 AM)

# 1
i.pca on

# raw postfire modis bands 2, 6 and 7
i.pca input=mod07_b2,mod07_b6,mod07_b7 out=pca_mod_b267

# eigenvalues + eigenvectors
Eigen values, (vectors), and [percent importance]:
PC1 6307563.04 (-0.6353,-0.6485,-0.4192) [98.71%]
PC2   78023.63 (-0.7124, 0.2828, 0.6422) [ 1.22%]
PC3    4504.60 (-0.2979, 0.7067,-0.6417) [ 0.07%]


# 2
prcomp() on

# raw postfire modis bands 2, 6 and 7

# center=FALSE, scale=FALSE
prcomp(mod07_b267, center=FALSE, scale=FALSE) ## This is the same as
with i.pca on raw(=untouched) data

Standard deviations:
[1] 4279.0259  475.9080  114.2998

Rotation:
                PC1        PC2        PC3
mod07_b2 -0.6353496  0.7124426 -0.2979203
mod07_b6 -0.6485310 -0.2828396  0.7066890
mod07_b7 -0.4192117 -0.6422051 -0.6417431

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

# 3
i.pca on

# centered postfire modis bands 2, 6 and 7
i.pca input=mod07_b2_centered,mod07_b6_centered,mod07_b7_centered
out=pca_mod07_b267_centered

# output
Eigen values, (vectors), and [percent importance]:
PC1 270343.07 (-0.4403,-0.7222,-0.5335) [79.11%]
PC2  67140.50 (-0.8275, 0.0957, 0.5533) [19.65%]
PC3   4258.14 ( 0.3485,-0.6851, 0.6397) [ 1.25%]


# 4
prcomp() on

# raw postfire modis bands 2, 6 and 7 with center=TRUE, scale=FALSE
prcomp(mod07_b267, center=TRUE, scale=FALSE) ## This is the same as
using prcomp() on centered data

Standard deviations:
[1] 882.1814 438.7420 108.9791

Rotation:
               PC1         PC2        PC3
mod07_b2 0.4372107  0.83099407 -0.3439413
mod07_b6 0.7210155 -0.09527873  0.6863371
mod07_b7 0.5375718 -0.54806096 -0.6408165


# 5
princomp() on

# raw postfire modis bands 2, 6 and 7
print(loadings(princomp(mod07_b267)), digits=6, cutoff=0.001)

Loadings:
         Comp.1    Comp.2    Comp.3  
mod07_b2 -0.437211  0.830994  0.343941
mod07_b6 -0.721016 -0.095279 -0.686337
mod07_b7 -0.537572 -0.548061  0.640817

                 Comp.1   Comp.2   Comp.3
SS loadings    1.000000 1.000000 1.000000
Proportion Var 0.333333 0.333333 0.333333
Cumulative Var 0.333333 0.666667 1.000000

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

# 6
prcomp() on

# centered postfire modis bands 2, 6 and 7 with center=FALSE,
scale=FALSE
prcomp(mod07_b267_centered, center=FALSE, scale=FALSE) ## In this
example the data are already centered. This is the same as using
prcomp(x, center=TRUE, scale=FALSE) on raw data or i.pca on centered
data

Standard deviations:
[1] 882.1844 438.7430 108.9810

Rotation:
                        PC1         PC2        PC3
mod07_b2_centered 0.4372131  0.83099137 -0.3439448
mod07_b6_centered 0.7210164 -0.09527912  0.6863360
mod07_b7_centered 0.5375686 -0.54806499 -0.6408157


# 7
prcomp() on

# centered postfire modis bands 2, 6 and 7 with center=TRUE,
scale=FALSE
prcomp(mod07_b267_centered, center=TRUE, scale=FALSE) ## The parameter
"center=TRUE" has no effect in this example since the data used are
already centered

Standard deviations:
[1] 882.1814 438.7420 108.9792

Rotation:
                        PC1         PC2        PC3
mod07_b2_centered 0.4372107  0.83099407 -0.3439413
mod07_b6_centered 0.7210155 -0.09527872  0.6863371
mod07_b7_centered 0.5375718 -0.54806097 -0.6408165

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


### ### ###

Comments

Using already centered data does not make sense to use the parameter
"center=". Thus the only difference will be between _scaled_ and
_unscaled_ data prior to PCA.

### ### ###


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

# 8
i.pca on

# scaled postfire modis bands 2, 6 and 7
i.pca input=mod07_b2_scaled,mod07_b6_scaled,mod07_b7_scaled
out=pca_mod07_b267_scaled

# eigenvalues + eigenvectors
Eigen values, (vectors), and [percent importance]:
PC1 19.09 (-0.6824,-0.5773,-0.4484) [98.57%]
PC2  0.26 ( 0.6786,-0.2724,-0.6821) [ 1.36%]
PC3  0.01 ( 0.2716,-0.7698, 0.5777) [ 0.07%]


# 9
prcomp() on

# scaled postfire modis bands 2, 6 and 7 with center=FALSE, scale=TRUE
prcomp(mod07_b267, center=FALSE, scale=TRUE)

Standard deviations:
[1] 1.71888790 0.20789109 0.04696487

Rotation:
                PC1         PC2        PC3
mod07_b2 -0.5747633  0.74019177 -0.3489459
mod07_b6 -0.5812894 -0.06916484  0.8107520
mod07_b7 -0.5759772 -0.66882910 -0.4700191

# get eigenvalues
summary(prcomp(mod07_b267, center=FALSE, scale=TRUE))
Importance of components:
                         PC1    PC2     PC3
...
Proportion of Variance 0.985 0.0144 0.00074
...


# 10
prcomp() on
# scaled postfire modis bands 2, 6 and 7
prcomp(mod07_b267_scaled, center=FALSE) ## This is the same as with
i.pca on manually scaled data

Standard deviations:
[1] 7.4449990 0.8749921 0.1925904

Rotation:
                       PC1        PC2        PC3
mod07_b2_scaled -0.6824314  0.6786305 -0.2715660
mod07_b6_scaled -0.5772537 -0.2724487  0.7697726
mod07_b7_scaled -0.4484034 -0.6820795 -0.5776695

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

# 11
i.pca on

# re-scaled to 0 ~ 255 postfire modis bands 2, 6 and 7

###
# first r.rescale to=0,255
# then r.mapcalc to ensure that 0,255 values
###

i.pca
input=mod07_b2_rescaled255,mod07_b6_rescaled255,mod07_b7_rescaled255
out=pca_mod07_b267_rescaled255

# eigenvalues + eigenvectors
Eigen values, (vectors), and [percent importance]:
PC1 15236.52 (-0.5972,-0.6037,-0.5282) [98.52%]
PC2   216.16 ( 0.7340,-0.1457,-0.6634) [ 1.40%]
PC3    13.01 ( 0.3235,-0.7838, 0.5301) [ 0.08%]


# 12
# re-scaled to 0 ~ 254 postfire modis bands 2, 6 and 7
input=mod07_b2_rescaled254,mod07_b6_rescaled254,mod07_b7_rescaled254
out=pca_mod07_b267_rescaled254

# eigenvalues + eigenvectors
Eigen values, (vectors), and [percent importance]:
PC1 15117.39 (-0.5972,-0.6037,-0.5282) [98.52%]
PC2 214.43 ( 0.7341,-0.1458,-0.6632) [ 1.40%]
PC3 12.90 ( 0.3234,-0.7838, 0.5302) [ 0.08%]
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

# 13
i.pca on

# centered and scaled postfire modis bands 2, 6 and 7
i.pca
input=mod07_b2_centered_scaled,mod07_b6_centered_scaled,mod07_b7_centered_scaled out=pca_mod07_b267_centered_scaled
## This is the same as performing prcomp(x, center=TRUE, scaled=TRUE)

# eigenvalues + eigenvectors
Eigen values, (vectors), and [percent importance]:
PC1 0.79 (-0.4907,-0.6521,-0.5779) [76.30%]
PC2 0.23 (-0.8120, 0.1015, 0.5748) [22.52%]
PC3 0.01 ( 0.3162,-0.7513, 0.5793) [ 1.18%]


# 14
prcomp() on

# center=TRUE, scale=TRUE
prcomp(mod07_b267, center=TRUE, scale=TRUE) ## This is the same as
performing i.pca on centered and scaled data

Standard deviations:
[1] 1.5137992 0.8210167 0.1853207

Rotation:
               PC1        PC2        PC3
mod07_b2 0.4896414  0.8145817 -0.3109792
mod07_b6 0.6516766 -0.1049365  0.7512030
mod07_b7 0.5792831 -0.5704779 -0.5822250

# get eigenvalues
summary(prcomp(mod07_b267, center=TRUE, scale=TRUE))
Importance of components:
                         PC1   PC2    PC3
...
Proportion of Variance 0.764 0.225 0.0115
...

# 15
princomp() on

# raw postfire modis bands 2, 6 and 7 based on the covariance matrix
print(loadings(princomp(mod07_b267, cor=TRUE)), digits=6, cutoff=0.001)
## This is the same as performing prcomp(x, center=TRUE, scale=TRUE) or
i.pca on manually centered and scaled data

Loadings:
         Comp.1    Comp.2    Comp.3  
mod07_b2 -0.489641  0.814582  0.310979
mod07_b6 -0.651677 -0.104936 -0.751203
mod07_b7 -0.579283 -0.570478  0.582225

                 Comp.1   Comp.2   Comp.3
SS loadings    1.000000 1.000000 1.000000
Proportion Var 0.333333 0.333333 0.333333
Cumulative Var 0.333333 0.666667 1.000000

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Roger Bivand

Re: Re: [GRASS-user] Testing i.pca (continued...)

Reply Threaded More More options
Print post
Permalink
On Wed, 29 Apr 2009, Nikos Alexandris wrote:

>
> Nikos:
>>   I tested... tested... & tested... (almost all tests below) and I just
>> can't understand WHY i.pca does NOT perform data centering when using
>> MODIS data (raw, rescaled at 0~255 or 0~254).
>
> The attached html file (exported from Tomboy note-taker) is (I suppose)
> easier to read :-)

Not easier to read, please simply post things on a website!


In your previous mail which you deleted here, you asked whether the code
of i.pca shows if the input data is centred. Yes, it is centred, the
function calc_mu(), as its name suggests, does this. The mean values are
then passed to the function calculating the covariance matrix:
calc_covariance(), where the means are subtracted from the input values of
bands before accumulating the covariance matrix. Is is possible that where
you do not see any effect of center=TRUE in prcomp that the means are all
zero? You will find reading imagery/i.pca/main.c interesting and not hard.

Roger

>
> Cheers, Nikos
>

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: [hidden email]

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Roger Bivand
Economic Geography Section
Department of Economics
Norwegian School of Economics and Business Administration
Helleveien 30
N-5045 Bergen, Norway
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca (continued...)

Reply Threaded More More options
Print post
Permalink

Nikos:
> >>   I tested... tested... & tested... (almost all tests below) and I just
> >> can't understand WHY i.pca does NOT perform data centering when using
> >> MODIS data (raw, rescaled at 0~255 or 0~254).
> >
> > The attached html file (exported from Tomboy note-taker) is (I suppose)
> > easier to read :-)


Roger:
> Not easier to read, please simply post things on a website!
Hmmm!!? I thought opening this html file in a web-browser will look just
fine. Sorry.


> In your previous mail which you deleted here, you asked whether the code
> of i.pca shows if the input data is centred. Yes, it is centred, the
> function calc_mu(), as its name suggests, does this. The mean values are
> then passed to the function calculating the covariance matrix:
> calc_covariance(), where the means are subtracted from the input values of
> bands before accumulating the covariance matrix.

Good. That's data centering.


> Is is possible that where
> you do not see any effect of center=TRUE in prcomp that the means are all
> zero?

Sure, when I _manually_ center the data using r.mapcalc.

But what happens with the untouched modis bands [1] and i.pca delivers
different numbers is what makes no sense to me. BTW, the data are
available (the same as with the previous problem concerning the NULLs).


> You will find reading imagery/i.pca/main.c interesting and not hard.

:-). I tried once but there are many programming variables which do not
make sense on first sight. And this takes time... :-p

Is there any explanation why i.pca on mod_b2, mod_b6 and mod_b7 gives
the same results as prcomp(mod_b267, center=FALSE, scale=FALSE) when
their mean is _!=0_ ?

Thanks, Nikos
---

[1] r.univar of

# mod_b2
n: 88185
minimum: 0
maximum: 5504
range: 5504
mean: 2686.22
mean of absolute values: 2686.22
standard deviation: 535.612
variance: 286880
variation coefficient: 19.9392 %
sum: 236884367

# mod_b6
n: 88202
minimum: 93
maximum: 5488
range: 5395
mean: 2702.4
mean of absolute values: 2702.4
standard deviation: 645.377
variance: 416511
variation coefficient: 23.8816 %
sum: 238356766

# mod_b7
n: 88148
minimum: 0
maximum: 4133
range: 4133
mean: 1740.07
mean of absolute values: 1740.07
standard deviation: 536.718
variance: 288067
variation coefficient: 30.8447 %
sum: 153383351

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
Nikos Alexandris

Re: Re: [GRASS-user] Testing i.pca (continued...)

Reply Threaded More More options
Print post
Permalink

Roger:
> > You will find reading imagery/i.pca/main.c interesting and not hard.


Nikos:
> :-). I tried once but there are many programming variables which do not
> make sense on first sight. And this takes time... :-p

OK, it's sure not TOO hard. It takes time but if you stick on it you can
figure more or less out what's going-on.

Nikos

_______________________________________________
grass-stats mailing list
[hidden email]
http://lists.osgeo.org/mailman/listinfo/grass-stats
1 2