[Zipfr-users] lexical richness
Anne Schumann
anne.schumann at mx.uni-saarland.de
Mon Mar 4 11:40:55 CET 2013
Hi,
I am not much of a statistics expert, and my question is more
methodological than about ZipfR itself, but maybe someone can give me
a hint?
I want to compare two corpora with respect to their lexical richness.
I know there are various ways to measure this. What I did so far is:
- I computed the standardised type-token ratio (STTR) on 1000-token samples of the texts.
- I carried out a non-parametric test (Mann-Whitney) to test for
significance of the difference between the ratios of the corpora.
The difference between the two corpora does not seem very big
numerically, but since it is consistent over many data samples, the
test reports it as significant.
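To make the procedure above concrete, here is a minimal sketch in plain Python of what I mean. It is an illustration rather than the exact script I ran: the function names, the whitespace-free token lists and the normal approximation for the p-value are all assumptions made for this example.

```python
import math

def sttr_samples(tokens, chunk_size=1000):
    """Type-token ratio of each consecutive, non-overlapping chunk.

    A trailing chunk shorter than chunk_size is dropped, so every
    ratio is computed on the same number of tokens.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return ratios

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (hypothetical helper).

    Uses midranks for ties and a tie-corrected normal approximation
    without continuity correction, so it is only reasonable when both
    samples are not too small. Returns (U for x, p-value).
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2          # mean of the 1-based ranks i+1..j
        t = j - i                          # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                         # all values identical
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return u1, p
```

With two token lists `corpus_a` and `corpus_b` (hypothetical names), the comparison would be `mann_whitney_u(sttr_samples(corpus_a), sttr_samples(corpus_b))`. Note that the chunks are taken in text order here, which is exactly the lack of randomization mentioned below.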
Since there is so much discussion about TTR and STTR, and I did not
even randomize my samples, I used ZipfR to simply plot the VGCs of my
corpora to avoid systematic errors in my measurement. Now this
confused me. One of my corpora had a very irregular curve, which is not
really surprising because the corpus is composed of many different
texts. However, interpolation seems to overgeneralise to
the point that the resulting curve largely overlaps with the empirical
VGC of the second corpus (but they obviously don't converge to the
same value).
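For readers who want to see the two kinds of curve side by side, here is a rough Python sketch of an empirical VGC (types seen so far, in text order) and a binomially interpolated one (expected number of types in a random subsample of n tokens). The function names are made up for this example, and I am assuming the textbook binomial interpolation formula, E[V(n)] = sum over m of V_m * (1 - (1 - n/N)^m); I have not checked that this matches what ZipfR computes internally.

```python
from collections import Counter

def empirical_vgc(tokens, step=1000):
    """(n, V(n)) pairs: number of distinct types after the first n tokens."""
    seen = set()
    points = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0 or n == len(tokens):
            points.append((n, len(seen)))
    return points

def interpolated_vgc(tokens, step=1000):
    """(n, E[V(n)]) pairs under binomial interpolation.

    Treats the corpus as a bag of words: each token is kept
    independently with probability n/N, so text order (and with it
    any clustering of vocabulary within single texts) is ignored.
    """
    total = len(tokens)
    if total == 0:
        return []
    # Frequency spectrum: m -> number of types occurring exactly m times.
    spectrum = Counter(Counter(tokens).values())
    sizes = list(range(step, total, step)) + [total]
    points = []
    for n in sizes:
        miss = 1 - n / total               # chance that one token is left out
        expected = sum(v_m * (1 - miss ** m) for m, v_m in spectrum.items())
        points.append((n, expected))
    return points
```

Because the interpolated curve discards text order, it is smooth by construction, whereas the empirical curve of a corpus built from many different texts jumps whenever a new text introduces new vocabulary. Both curves end at the same point (N, V).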
So my question is: Is there any way to pin this down neatly, that is,
to decide whether the differences between the corpora are significant?
Do I need to be worried about the fact that interpolation changes my
curve so much? What I want is basically just a statement on whether
one of the corpora uses a reduced vocabulary compared to the other
(probably due to more terminological repetition). I
want to use this statement as a hint towards register specialisation of
one of the corpora.
Any hint welcome!
Regards,
anne