[Zipfr-users] lexical richness

Mon Mar 4 11:40:55 CET 2013

Hi,

I am not much of a statistics expert, and my question is more
methodological than about ZipfR itself, but maybe someone can give me
a hint?
I want to compare two corpora with respect to their lexical richness.
I know there are various ways to measure this. What I have done so far:
- I computed the standardised type-token ratio (STTR) on 1000-token
samples of the texts.
- I carried out a non-parametric test (Mann-Whitney) to test whether
the difference between the ratios of the two corpora is significant.
The difference between the two corpora does not seem very big
numerically, but since it is consistent over many data samples, the
test reports it as significant.
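
A minimal sketch of the two steps above, assuming each corpus is already available as a flat list of tokens. The function names are made up for illustration, the samples are consecutive (not randomized) chunks, and the test uses the normal approximation with tie correction, so an exact test from a statistics package may give slightly different p-values for small samples:

```python
import math

def sttr_samples(tokens, chunk_size=1000):
    """Type-token ratio of each consecutive full chunk of `chunk_size` tokens."""
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return ratios

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the mean of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # all values identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u1, p
```

Usage would be along the lines of `mann_whitney_u(sttr_samples(tokens_a), sttr_samples(tokens_b))`, where `tokens_a` and `tokens_b` are placeholders for the two tokenized corpora.
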
Since TTR and STTR are so widely debated and I didn't even randomize
my samples, I also used ZipfR to plot vocabulary growth curves (VGCs)
of my corpora, as a check against systematic errors in my measurement.
The result confused me. One of my corpora has a very irregular curve,
which is not really surprising because the corpus is composed of many
different texts. However, interpolation seems to smooth the curve so
much that the result largely overlaps with the empirical VGC of the
second corpus (although the two obviously don't converge to the same
value).
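
To make the comparison concrete, here is a small sketch of an empirical VGC next to an interpolated one. It assumes the interpolation is binomial interpolation from the frequency spectrum, E[V(n)] = sum over m of V_m * (1 - (1 - n/N)^m), which treats the corpus as a randomly ordered bag of tokens. The function names are illustrative and this is not ZipfR's own code:

```python
from collections import Counter

def empirical_vgc(tokens, step):
    """Observed number of types after every `step` tokens, in corpus order."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def interpolated_vgc(tokens, step):
    """Expected number of types at each sample size n under random token order.

    Uses the frequency spectrum V_m (number of types occurring m times)
    of the full corpus of N tokens.
    """
    N = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # maps m -> V_m
    curve = []
    for n in range(step, N + 1, step):
        expected = sum(v_m * (1 - (1 - n / N) ** m)
                       for m, v_m in spectrum.items())
        curve.append((n, expected))
    return curve
```

Because the interpolated curve depends only on the frequency spectrum and not on the order of the tokens, any bumps caused by the boundaries between texts disappear from it by construction.
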
So my question is: Is there any way to pin this down neatly, that is,
to decide whether the differences between the corpora are significant?
Do I need to be worried about the fact that interpolation changes my
curve so much? What I want is basically just a statement on whether
one of the corpora uses a reduced vocabulary compared to the other
(probably due to more terminological repetition). I want to use this
statement as a hint towards register specialisation in one of the
corpora.
Any hint welcome!

Regards,
anne