From anne.schumann at mx.uni-saarland.de Mon Mar 4 11:40:55 2013
From: anne.schumann at mx.uni-saarland.de (Anne Schumann)
Date: Mon, 04 Mar 2013 11:40:55 +0100
Subject: [Zipfr-users] lexical richness
Message-ID: <20130304114055.Horde.VBoaacajPV5RNHo3Qfy0JgA@webmail.uni-saarland.de>

Hi,

I am not much of a statistics expert, and my question is more methodological than about ZipfR itself, but maybe someone can give me a hint?

I want to compare two corpora with respect to their lexical richness. I know there are various ways to measure this. What I did so far is:

- I computed the standardised type-token ratio (STTR) on 1000-token samples of the texts.
- I carried out a non-parametric test (Mann-Whitney) to test whether the difference between the ratios of the two corpora is significant.

The difference between the two corpora does not seem very big numerically, but since it is consistent over many data samples, the test comes out significant.

Since there is so much discussion about TTR and STTR, and I didn't even randomize my samples, I used ZipfR to simply plot the vocabulary growth curves (VGCs) of my corpora, to rule out systematic errors in my measurement. Now this confused me. One of my corpora had a very unclean curve, which is not really surprising because the corpus is composed of many different texts. However, interpolation seems to smooth the curve so much that the result largely overlaps with the empirical VGC of the second corpus (although the two obviously don't converge to the same value).

So my questions are: Is there any way to neatly pin this down, that is, to decide whether the differences between the corpora are significant? Do I need to be worried about the fact that interpolation changes my curve so much?

What I want is basically just a statement on whether one of the corpora uses a reduced vocabulary compared to the other (probably due to a higher amount of terminological repetition). I want to use this statement as a hint towards register specialisation of one of the corpora.
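To make the two steps described above concrete, here is a minimal sketch in Python: STTR over consecutive fixed-size samples, followed by a two-sided Mann-Whitney U test. The function names `sttr` and `mann_whitney_u`, the choice to discard a trailing partial sample, and the normal approximation with tie correction are assumptions made for illustration, not necessarily what was actually run on the corpora.

```python
import math


def sttr(tokens, chunk_size=1000):
    """Type-token ratio of each consecutive, non-overlapping chunk.

    Assumption: a trailing chunk shorter than chunk_size is discarded.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return ratios


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test.

    Returns (U for sample x, p-value). The p-value uses the normal
    approximation with tie correction and no continuity correction,
    so it is only reasonable for moderately large samples.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p


# Hypothetical usage: tokens_a and tokens_b would be the token lists
# of the two corpora (toy data here, with a tiny chunk size).
tokens_a = "the valve opens the valve closes the valve opens again".split()
tokens_b = "a quick brown fox jumps over one lazy sleeping dog".split()
u, p = mann_whitney_u(sttr(tokens_a, 5), sttr(tokens_b, 5))
```

With real corpora, `chunk_size` would stay at 1000 and each corpus would contribute many ratios; with only a handful of samples per corpus the normal approximation is not trustworthy.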

Any hint welcome!

Regards,
anne