[datatable-help] Problem(s) finding p-values for numerous spearman correlations

Izzy_M izabellamurphy at aol.co.uk
Mon Nov 27 18:22:51 CET 2017




Hi everyone,

I am very, very new to R, and I'm trying to work out the p-values for
thousands of Spearman correlation coefficients.

I have imported a large dataset from a CSV file (366 obs. of 73775
variables) into RStudio. Each column is a word, each row is a date, and each
value is the relative frequency of that word on that date. I am trying to
see whether the frequency of any of the words increases significantly over
the course of a year.

After some trial and error (and a lot of Googling!), I have some code that
successfully stores the Spearman correlation values in a matrix:

x <- my_data[1:73775]
y <- my_data[1]
corrs3 <- round(cor(x, y, method = "spearman", use="complete.obs"), 3)

This code stores the words in one column of the matrix and their Spearman
value in the second column. However, what I need to do now is calculate the
corresponding p-value for each of the variables. I have been able to do this
for individual variables by running the following code (although I do get a
warning saying "Cannot compute exact p-value with ties", I've been told
that this isn't a major problem?):

cor.test(1:73775, my_data$romcom, method = "spearman")
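
For reference, the warning can be reproduced on a few made-up values. This is only a sketch: `day` and `freq` are hypothetical stand-ins, not columns of `my_data`. When the data contain ties, `cor.test` cannot use the exact Spearman distribution, so it warns and falls back to an asymptotic approximation; passing `exact = FALSE` asks for that approximation directly.

```r
# Hypothetical toy data: a time index and a frequency column with tied values
day  <- 1:10
freq <- c(0, 0, 1, 1, 2, 2, 3, 3, 4, 4)

# Default call: R attempts an exact p-value, finds ties, and raises a warning.
# tryCatch() is used here only to capture the warning text.
warn_msg <- tryCatch(
  cor.test(day, freq, method = "spearman"),
  warning = function(w) conditionMessage(w)
)
print(warn_msg)

# Requesting the asymptotic p-value up front avoids the warning
res_asym <- cor.test(day, freq, method = "spearman", exact = FALSE)
res_asym$estimate  # Spearman's rho
res_asym$p.value   # approximate p-value
```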


However, what I would ideally like to do is store the p-value next to the
Spearman value in the matrix (if that is possible).
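
To make the goal concrete, here is a minimal sketch of the kind of table described above, built on a small made-up data frame. The object `toy`, the time index `day`, and the column names other than `romcom` are assumptions for illustration; it runs one `cor.test` per column against the row order, which may or may not be the right approach for the full dataset.

```r
# Hypothetical stand-in for my_data: 20 dates (rows) by 3 word columns
set.seed(1)
toy <- data.frame(
  romcom = runif(20),
  sequel = runif(20) + (1:20) / 10,  # constructed to drift upwards over time
  reboot = runif(20)
)
day <- seq_len(nrow(toy))  # time index, one value per row

# One Spearman test per column against the time index
tests <- lapply(toy, function(col) {
  cor.test(day, col, method = "spearman", exact = FALSE)
})

# Collect rho and the p-value side by side, one row per word
result <- data.frame(
  word = names(toy),
  rho  = round(vapply(tests, function(ct) unname(ct$estimate), numeric(1)), 3),
  p    = vapply(tests, function(ct) ct$p.value, numeric(1)),
  row.names = NULL
)
print(result)
```

Because each column is tested on its own, this never builds a words-by-words matrix; it only needs one row of output per word.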

The consensus seems to be that Hmisc is the ideal tool for this kind of
thing, so I installed that library, and I've been attempting to run it as
follows:

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}
x <- my_data[1:73775]
y <- my_data[1]
library(Hmisc)
res2<-rcorr(as.matrix(my_data[x,y]))
flattenCorrMatrix(res2$r, res2$P)



However, I get an error message stating:

"Unsupported index type: tbl_df".

I'm unsure how to fix this.

I've also tried bypassing Hmisc and using the following:


x <- my_data[1:73775]
y <- my_data[1]
corrs3 <- round(cor.test(x, y, method = "spearman", use="complete.obs"), 3)

But this returns the error message:

Error in cor.test.default(x, y, method = "spearman", use = "complete.obs") : 
  'x' and 'y' must have the same length
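
A small sketch of what this message appears to be reacting to, using a made-up data frame (`toy` and its columns are hypothetical). `length()` of a data frame counts its columns, so a many-column `x` and a one-column `y` have different lengths, whereas `cor.test` expects two numeric vectors of equal length. Its result is also a test object rather than a number, so `round()` has to be applied to a component such as `$estimate`.

```r
# Hypothetical toy data: a time index plus two word columns
toy <- data.frame(
  day    = 1:10,
  romcom = c(1, 3, 2, 5, 4, 7, 6, 9, 8, 10),
  sequel = c(5, 1, 4, 2, 3, 9, 10, 6, 8, 7)
)

x <- toy[2:3]  # single-bracket indexing keeps a data frame (2 columns)
y <- toy[1]    # also a data frame (1 column)
length(x)      # 2: number of columns, not number of rows
length(y)      # 1

# Double brackets extract plain numeric vectors of equal length
ct <- cor.test(toy[["day"]], toy[["romcom"]], method = "spearman")
round(ct$estimate, 3)  # rounding applies to the estimate, not the whole object
ct$p.value
```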


More Googling suggested that the "corr.test" function from the psych library
would be better. However, when I use the following code:

x <- my_data[1:73775]
y <- my_data[1]
library("psych")
corr.test(x, y = NULL, use = "pairwise", method="spearman", ci=TRUE)


I get the following error message:

Error: cannot allocate vector of size 40.6 Gb
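
The size in that message is consistent with one dense 73775 x 73775 matrix of doubles, which is what correlating every word with every other word would need. A quick arithmetic check, assuming 8 bytes per value and 1024-based units:

```r
n_vars <- 73775          # number of columns in my_data, as stated above
bytes  <- n_vars^2 * 8   # one 8-byte double per cell of an n x n matrix
gib    <- bytes / 2^30   # convert bytes to units of 1024^3
round(gib, 1)            # 40.6
```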

I'm really out of options now, and I would really appreciate any
suggestions! 

Thanks!




--
Sent from: http://r.789695.n4.nabble.com/datatable-help-f2315188.html