[personal profile] eclectic_boy
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd that two words return only a single Google hit if each of them by itself appears on just a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.

So I'm defining a number I call the Google Correlation of two words:

GC(x,y) = (# hits for "x y") / ( (# hits for x) * (# hits for y) )

Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?

Any googlewhack will have a numerator of 1, but won't necessarily be that tiny, because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted Cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.

But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
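
(For anyone who wants to play along programmatically, here's a minimal Python sketch of the computation. The hit counts are the point-in-time Google counts quoted above, and the function name is just illustrative; fetching the counts is up to you.)

    def gc(hits_xy, hits_x, hits_y):
        # Google Correlation: joint hit count over the product of the
        # individual hit counts.
        return hits_xy / (hits_x * hits_y)

    print(gc(1, 795000, 2740))           # Jotted Cruddiness: ~4.59e-10
    print(gc(685, 14700000, 5570000))    # Autofocus Creole:  ~8.37e-12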

So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)

Date: 2007-01-13 04:49 am (UTC)
From: [identity profile] eclectic-boy.livejournal.com
[livejournal.com profile] superbacana suggests using logarithms to make the numbers less cumbersome. In that case,

GC(Jotted Cruddiness) = -21.5
GC(Autofocus Creole) = -25.5

How low can you go?
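
(A sketch of that log form in Python, using the natural log, which is what the numbers above work out to:)

    import math

    def log_gc(hits_xy, hits_x, hits_y):
        # natural log of the GC defined in the original post
        return math.log(hits_xy) - math.log(hits_x) - math.log(hits_y)

    print(log_gc(1, 795000, 2740))          # ~ -21.5
    print(log_gc(685, 14700000, 5570000))   # ~ -25.5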

Date: 2007-01-13 05:04 am (UTC)
From: [identity profile] carnap.livejournal.com
Now that "Autofocus Creole" seems to be the name of the game, you need to start tracking the google correlation of those two terms to see how popular the game is. . .

Date: 2007-01-13 05:38 am (UTC)
From: [identity profile] tirerim.livejournal.com
Hmm. Are you sure that multiplying the frequencies of the individual words is the way to go, rather than adding them? I ask because GC(of a) = ln(5.89 * 10^9 / (5.77 * 10^9 * 8.13 * 10^9)) = -22.8. This despite the fact that "of" and "a" occur more frequently together than "of" does by itself!

Date: 2007-01-13 05:59 am (UTC)
From: [identity profile] tirerim.livejournal.com
Alternatively, [livejournal.com profile] carpenter suggests that squaring the numerator might help.

Date: 2007-01-13 06:06 am (UTC)
From: [identity profile] superbacana.livejournal.com
Good point. Correlation is normally defined in terms of real-valued variables, so it's not technically correct to call GC a correlation. Nonetheless, modifying the GC to be more like a correlation would require taking the square root of the denominator, or squaring the numerator. This seems to make sense: the GC then becomes dimensionless, since it is a ratio of counts (or of squared counts).

If GC = ln( # hits of x,y / sqrt( (# hits x) * (# hits y)) ) then:

GC(jotted cruddiness) = -10.6
GC(autofocus creole) = -9.4887
GC(of a) = -0.1509

(Assuming I didn't screw up the math again.)

It makes sense to interpret the counts as unnormalized probabilities; if you don't take the square root, then you have a leftover normalization, and hence the GC measure is a function of the size of Google's index.

The GC formula is also similar to mutual information. Information-theoretic measures might be the most natural way of expressing how "independent" two words are.
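
(A sketch of that variant in Python, with the counts quoted earlier in the thread:)

    import math

    def gc_sqrt(hits_xy, hits_x, hits_y):
        # square-root the denominator so the ratio is dimensionless
        return math.log(hits_xy / math.sqrt(hits_x * hits_y))

    print(gc_sqrt(685, 14700000, 5570000))               # autofocus creole: ~ -9.49
    print(gc_sqrt(5890000000, 5770000000, 8130000000))   # "of a": ~ -0.15

Incidentally, if you divide every count by the index size N to turn it into a probability, the original log-GC is exactly the pointwise mutual information ln( P(x,y) / (P(x)P(y)) ) minus ln(N), which is the leftover normalization just mentioned.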

Date: 2007-01-13 06:12 am (UTC)
From: [identity profile] superbacana.livejournal.com
Another nice property of this formula is that the GC is 0 when two words _only_ occur together ("of a" is close): if hits(x) = hits(y) = hits(x,y) = n, then GC = ln( n / sqrt(n * n) ) = ln(1) = 0.

Date: 2007-01-13 06:30 am (UTC)
From: [identity profile] sildra.livejournal.com
Well, at least through this whole process I learned that quark is a type of cheese. It does not, however, seem to go well with turmeric (-24.6). Not as good as yours, but still, I thought I'd share.

Date: 2007-01-13 06:30 am (UTC)
From: [identity profile] q10.livejournal.com
but it looks like [personal profile] eclectic_boy's original formula has the nice property that all uncorrelated pairs will yield the same value, while your new formula doesn't.

Date: 2007-01-13 06:46 am (UTC)
From: [identity profile] q10.livejournal.com
well, keep in mind that, where A and B are independent, freq(A) = P(A)*sizeof(Web), freq(B) = P(B)*sizeof(Web), and freq(AB) = P(A)P(B)*sizeof(Web), so GC(A,B) = ln(1/sizeof(Web)). when that's the baseline, these numbers start to make sense.

a little back-of-the-envelope work suggests that, if we restrict attention to the English web, this puts the baseline someplace between -23 and -22.1.

this means that the example googlewhack is probably at significantly higher correlation than chance, which, when you look at the words in question, isn't that surprising.
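
(A quick sanity check on that range, as a sketch: inverting baseline = ln(1/N) gives the implied index size.)

    import math

    for b in (-23.0, -22.1):
        # a baseline of ln(1/N) = b implies N = e^(-b) indexed pages
        print(b, "->", "%.2e" % math.exp(-b))
    # -23.0 -> ~9.74e9 pages, -22.1 -> ~3.96e9 pages: a few billion,
    # which is plausible for the 2007 English-language Google index.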

Date: 2007-01-13 06:46 am (UTC)
From: [identity profile] superbacana.livejournal.com
What do you mean by uncorrelated? If you mean (# hits of x,y) = 0, then the GC is always the same (ln(0), i.e. negative infinity) for any x,y.

Date: 2007-01-13 06:48 am (UTC)
From: [identity profile] sildra.livejournal.com
dragon superfluid -26.2

Date: 2007-01-13 06:53 am (UTC)
From: [identity profile] q10.livejournal.com
i should have said ‘independent’, where two events X and Y are independent iff P(X|Y) = P(X), or, equivalently, iff P(X,Y) = P(X)P(Y). my intuition is that two different words that have no particular influence on each other's probability of occurrence should give you the same score whether they're both very common, both very rare, or of very different frequencies.

Date: 2007-01-13 07:08 am (UTC)
From: [identity profile] sonatanator.livejournal.com
My best so far:
Kodaly: 1,710,000
YouTube: 137,000,000
Together: 990

I don't have a fancy calculator on me, but the original would have been 4.22589 * 10^-12.

If anyone cares to do the other thing, go for it.

Date: 2007-01-13 07:35 am (UTC)
From: [identity profile] sildra.livejournal.com
-26.2 (i.e. ln(4.22589 * 10^-12))

Google's built-in calculator does ln and various other useful functions; it also does unit conversions and knows a number of useful constants.

Date: 2007-01-13 07:38 am (UTC)
From: [identity profile] q10.livejournal.com
are we allowed to exploit typos? if so, i've got:
strunk yotube 1
yotube 223,000
strunk 4,670,000

ln(1/(4670000 * 223000)) ~= -27.67

and of course, there needs to be a rule that using a pair with no cooccurrences isn't allowed.

Date: 2007-01-13 08:21 am (UTC)
From: [identity profile] superbacana.livejournal.com
I think it's a matter of taste. I would think that independent pairs with higher frequencies are more interesting, because one might think that the more a word occurs, the more likely it is to co-occur with any other given word. If one views every web page as just a bunch of random draws from a shared distribution of words, then this is definitely true; the metric finds the most egregious violations of this simple model.
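
(To make that model concrete, here's a toy simulation -- the page count and word probabilities are made up -- showing that independent per-page draws put the log-GC right at the ln(1/N) baseline mentioned upthread:)

    import math
    import random

    random.seed(0)
    N = 1000000              # toy web: a million pages (made-up size)
    p_x, p_y = 0.01, 0.003   # made-up per-page probabilities of each word

    hits_x = hits_y = hits_xy = 0
    for _ in range(N):
        has_x = random.random() < p_x
        has_y = random.random() < p_y   # drawn independently of has_x
        hits_x += has_x
        hits_y += has_y
        if has_x and has_y:
            hits_xy += 1

    print(math.log(hits_xy / (hits_x * hits_y)))   # close to ...
    print(math.log(1 / N))                         # ... ln(1/N) ~= -13.8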

Date: 2007-01-13 08:35 am (UTC)
From: [identity profile] q10.livejournal.com
i'm not sure i follow you. yes, of course, we should expect P(X,Y) to increase as P(X) does, but why should more common words be more likely to have nontrivial dependencies? in particular, in the hypothetical extreme case where a word is obligatory in every document (that is, P(X) = 1), it'll be trivially independent of everything else, since in that case P(X,Y) = P(Y) = P(X)P(Y).

if anything, if each webpage is the result of n independent draws from the universal word distribution, we should expect the absence of a high-frequency word to increase the probability of finding any other word, for the simple reason that we know the absent word isn't hogging any slots.

Date: 2007-01-13 10:26 am (UTC)
From: [identity profile] tirerim.livejournal.com
On the other hand, the new formula is always the same when two words occur only together, while [livejournal.com profile] eclectic_boy's formula will yield different results depending on whether the words are rare or common.

Date: 2007-01-13 08:25 pm (UTC)
From: [identity profile] q10.livejournal.com
but yielding the same result whenever two words only occur together probably isn't desirable. if you have two words that occur with P ~= 1 (‘sex’ and ‘the’, for example), then they can't help always occurring together, so that's not shocking, but if you have two one-in-a-million words that always occur together, that suggests they have some interesting connection.

fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.

Date: 2007-01-13 10:30 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
I object, because youtube is not a word!

Date: 2007-01-13 10:35 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
Youtube + escherichia

youtube 140,000,000
escherichia 23,500,000
youtube + escherichia 803

~29

Date: 2007-01-13 11:47 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
caenorhabditis + blu-ray = 28.6, caenorhabditis + tekken also around there. those are the only other ones i've found over 28.

it's really hard to generate very large numbers.

Date: 2007-01-16 08:49 am (UTC)
From: [identity profile] q10.livejournal.com
well done, sir.