sample survey: If there are 300 billion stars in the galaxy, how many do you have to sample to get a valid survey?

This is a SETI statistics question. If there are 300 billion solar systems in the galaxy, how many solar systems would you have to survey without finding intelligent life to be 95% sure there is no other intelligent life in the galaxy except humans? What is the formula to determine the needed sample size based on a given population if you want your survey to be 95% accurate? Also, in statistics does a 95% confidence level mean the same thing as a +/- 5% margin of error? This is not a homework assignment. I’m just curious.

This is not such a simple statistics question.

Suppose you want to find out how many people in the U.S. normally vote for Democrats. You can get a very good estimate (to within a few percent) if the following apply:

1) You ask about 1000 people.

2) These 1000 people are truly a random sample. There must not be a selection effect that biases the results.

3) The people answer honestly.

One of the reasons this works is that the number of people who normally vote for Democrats is somewhere around 40%, so a sample of 1000 people will include many Democrats and others (independents and Republicans). If you want to know how many people vote for the Flat Earth Party, which has 5 members, a sample of 1000 won't tell you anything except that the number of Flat Earthers is very small. (This is a hypothetical party, in case you haven't guessed.) In a sample of 1000 in a country of 300 million, it is overwhelmingly likely that you will find no Flat Earthers.
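To put a number on the Flat Earth example, here is a minimal Python sketch. It uses the figures from the text (5 members, a population of 300 million, a sample of 1000) and assumes that each respondent is an independent random draw, which is a fair approximation when the sample is tiny compared with the population. The function name is made up for this illustration.

```python
def prob_no_members_in_sample(members, population, sample_size):
    """Chance that a random sample contains no member of a rare group.

    Assumes each respondent is an independent draw (sample << population).
    """
    p_member = members / population
    return (1.0 - p_member) ** sample_size


# Figures from the text: 5 Flat Earthers among 300 million people, poll of 1000.
p_none = prob_no_members_in_sample(5, 300_000_000, 1000)
print(f"Chance the poll finds no Flat Earthers: {p_none:.6f}")
```

The result is roughly 0.99998, which is what "overwhelmingly likely" means here.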

When we search for life with SETI, first we have to be clear that we're searching for life that broadcasts its existence via radio waves. Nineteenth-century Earth could not be found, because people did not use radio waves then. Perhaps super-advanced civilizations have found other methods that are superior to radio waves (but this is pure speculation). In short, we are searching for intelligent life similar to our own, or possibly much more advanced. The question then becomes how many stars in the galaxy have this kind of life.

The issue of a random sample is also a problem. It is easiest for us to examine nearby stars. Perhaps other parts of the galaxy are more or less likely to have intelligent life. Let's ignore this issue, and assume that we do a good random sample.

Here's another problem: In searching for radio signals, are we really looking at the correct frequencies? Let's ignore this issue also, and assume that we are.

The next problem is the probability P, the ratio of stars in our galaxy with intelligent life now to the total number of stars in the galaxy. (The word "now" is a bit problematic because of the speed-of-light lookback time. On a cosmic scale, however, the galaxy is quite small. Perhaps we miss some civilizations that have developed recently and we see others that no longer exist, but this doesn't change the statistics significantly.)

Suppose that P is relatively small. In that case, if we search a large number of stars for life, the number in which we find life is governed by a Poisson distribution.

Here's an example: Suppose we hypothesize that P is 5% -- that is, one star out of 20 has the kind of intelligent life that we're seeking. If we look at 20 stars, we might find 1 with life, or maybe 2, or maybe 0. Suppose we search 1000 stars, and find N with life. If N is less than 36 or greater than 64, we can say with 95% certainty that P is *not* 5%. If we find 0, then we can say with almost complete certainty that P is not 5%.
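The 36-to-64 range can be checked with a short Python sketch. It takes the range to be the mean plus or minus two standard deviations and uses the binomial standard deviation sqrt(n*p*(1-p)); that is my assumption about how the figures were obtained, and the Poisson value sqrt(n*p) gives nearly the same answer. The function name is illustrative only.

```python
import math


def two_sigma_count_range(n_searched, p):
    """Approximate 95% range for the number of detections (mean +/- 2 sigma).

    Uses the binomial standard deviation sqrt(n*p*(1-p)).
    """
    mean = n_searched * p
    sigma = math.sqrt(n_searched * p * (1.0 - p))
    return mean - 2.0 * sigma, mean + 2.0 * sigma


# Hypothesis from the text: P = 5%, 1000 stars searched.
low, high = two_sigma_count_range(1000, 0.05)
print(f"Expected detections: about {low:.0f} to {high:.0f}")
```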

Suppose we search 1000 stars and find N=0. We can say with 95% certainty that P is less than 0.3%. With a total of 300 billion stars in the galaxy, however, there is still a possibility of over 900,000,000 stars with intelligent life.

Suppose we search a million stars and find N=0. Now we can say with 95% certainty that P is less than 0.0003%, but that still leaves the possibility of 900,000 stars with intelligent life.

If we search 200 billion stars and find N=0, we can say with 95% certainty that P is less than 1.5e-11, but that still leaves the possibility of 1 or 2 stars with intelligent life (out of the 100 billion we haven't searched yet).
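The three cases above follow one pattern, sketched here in Python. The sketch assumes the Poisson reasoning used in this answer: with zero detections, the 95% upper limit on P is the value at which the chance of seeing nothing, exp(-P*n), falls to 5%. That works out to about 3/n. The names are illustrative, and the 300 billion total is the figure from the question.

```python
import math

TOTAL_STARS = 300e9  # galaxy size assumed in the question


def upper_limit_on_p(n_searched, confidence=0.95):
    """Upper limit on P after finding nothing in n_searched stars.

    Solves exp(-P * n_searched) = 1 - confidence, the Poisson chance of
    zero detections. For 95% this is about 3 / n_searched.
    """
    return -math.log(1.0 - confidence) / n_searched


for n in (1_000, 1_000_000, 200e9):
    p_max = upper_limit_on_p(n)
    unsearched = TOTAL_STARS - n
    print(f"searched {n:,.0f}: P < {p_max:.3g}, "
          f"up to {p_max * unsearched:,.1f} stars among the unsearched")
```

The last column comes out at roughly 900 million, 900 thousand, and 1.5, matching the figures in the text.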

In other words, to rule out the likelihood of life in the galaxy, we would have to look for signals from the vast majority of stars (and we wouldn't be sure unless we looked at every last star).

The moral for SETI is this: If we look for intelligent civilizations by sampling stars randomly, the search will not succeed unless one of the following is true:

1) There are a substantial number of stars in the galaxy harboring life.

2) We happen to get very lucky.

-- edit

I initially made a mistake in millions vs. billions. That has now been corrected.

If we sample 260 billion stars and find no life, we're 95% certain that P is less than 1.15e-11, which means we are nearly certain that the number of life-bearing stars out of the remaining 40 billion is 0.46 or less. At this point, we might be willing to give up the search, but by now we've searched 87% of the stars in the galaxy! It's no longer a sample, but rather a study of nearly the complete population.

Maybe another important conclusion is this: We can never prove that there is not other intelligent life in the galaxy, but perhaps we can prove that there is; and that's the reason for SETI studies.

-- edit

To improve the chances of finding life, SETI studies might not necessarily look at stars randomly. If you want to pick out particular stars, you'd limit the search to those with a good chance of life (e.g., non-binary dwarf stars of a luminosity similar to or less than that of the sun). There are different kinds of SETI studies. Some target particular stars, while others are "piggybacked" onto a telescope, looking at whatever region of the sky a non-SETI researcher is studying.

-- edit

You asked about level of confidence vs. margin of error. This won't add anything significant to the discussion, but I'll describe it briefly because you asked.

Consider again the poll about Democratic voters. A proper statement of the results (depending on the sample size) might be something like this: The poll shows that 40% of Americans usually vote Democratic. We might say that we have 95% confidence that the actual number is between 38% and 42%, but 40% is our best guess. If we used a larger sample and measured 40%, perhaps we'd say that we have 95% confidence that the actual number is between 39% and 41%. There is a connection between the confidence level and the margin of error and the sample size, but it's not a trivial one.
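One way to see the connection is to turn it around and ask how many people you must poll to reach a given margin of error. The Python sketch below assumes the usual rule that the margin is about two standard deviations of the measured fraction, with p=0.4 as in the Democratic-voter example and a population much larger than the sample. The function name and default values are choices made for this illustration.

```python
import math


def sample_size_for_margin(margin, p=0.4, z=2.0):
    """Sample size so that z standard deviations equal the desired margin.

    margin and p are fractions (0.02 means +/- 2 points); z=2 is the
    rough 95% level used in this answer (1.96 is the more exact value).
    Assumes the population is much larger than the sample.
    """
    return math.ceil((z / margin) ** 2 * p * (1.0 - p))


for margin in (0.03, 0.02, 0.01):
    print(f"+/-{margin:.0%}: ask about {sample_size_for_margin(margin)} people")
```

Halving the margin of error takes about four times as many people: roughly 1100 for 3 points, 2400 for 2 points, and 9600 for 1 point.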

Now consider the SETI example after examining one thousand stars and finding none with life. We would say that our best guess for P is P=0.0%, and we are 95% confident that P is in the range from 0.0% to 0.3%. (In this case, the lower part of the range equals our best guess, because you can't have a probability less than zero.)

-- edit

I've answered this by sticking to the premise you mentioned -- namely, that we study each star individually. That's one approach, but we'll never study 300 billion stars individually. (We can't even see most of them, and don't know where they are.)

But there's another approach: The sky contains about 41,253 square degrees. Suppose we use a radio telescope whose beam area is A square degrees. Then we can study the entire sky by looking at about 41253/A points; this kind of work is called a "survey." (To survey the entire sky, we need two observatories, one in the northern hemisphere and one in the southern.) At each point, we measure the signal over some length of time, and try to determine whether part of the signal could result from an intelligent civilization. This method is not guaranteed to find every intelligent source; for example, it is possible that the signal from some planet is overwhelmed by a bright natural source that happens to lie in the same direction. My main point is that you don't have to aim your radio telescope at 300 billion points, but a much smaller number. This method gives us hope that we might be able to find intelligent life in the galaxy even if it is very rare.
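As a rough illustration of the 41253/A estimate, here is a small Python sketch. The beam areas are hypothetical values chosen only to show the scaling, and the count ignores beam shape and overlap, so a real survey would need somewhat more pointings.

```python
import math

# Whole sky in square degrees: 4*pi steradians times (180/pi)^2.
SKY_AREA_SQ_DEG = 4.0 * math.pi * (180.0 / math.pi) ** 2  # about 41,253


def pointings_for_full_sky(beam_area_sq_deg):
    """Rough number of telescope pointings needed to tile the whole sky.

    Ignores beam overlap and beam shape, so real surveys need somewhat more.
    """
    return math.ceil(SKY_AREA_SQ_DEG / beam_area_sq_deg)


# Hypothetical beam areas, chosen only for illustration.
for beam in (1.0, 0.1, 0.01):
    print(f"beam of {beam} sq deg -> {pointings_for_full_sky(beam):,} pointings")
```

Even a beam of 0.01 square degrees needs only a few million pointings, far fewer than 300 billion.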

To summarize, there are at least three different methods that can be used for SETI:

1) Target individual stars, particularly those that seem promising.

2) Use a "piggyback" program that does a SETI analysis of all signals detected for other research programs.

3) Do a partial-sky or whole-sky survey.

-- edit (regarding the recent additions to your question)

"If a sample of 1000 is enough for a good survey in a population of 300 million people in the U.S., does it follow that you can just multiply the numbers in this example by 1000 and say that a sample of 1 million is enough for a good survey in a population of 300 billion solar systems?"

No. For a political survey like the one I mentioned, a sample of 1000 gives you a good result, regardless of whether the population is a million or a billion. The standard deviation for a binomial distribution is sqrt(N*p*(1-p)), where N is the total sample size. Here's an example: Suppose you ask some question (e.g., will you vote for Obama?) for which the true fraction answering yes is 40%. Then p=0.4. If you conduct a poll that asks N=1000 people, you will get roughly 400 people who answer the question yes. The standard deviation in this number is sigma=sqrt(1000*0.4*0.6)=15.5. The distribution in this example is essentially normal, and for a normal distribution the 95% certainty level comes at plus or minus 2*sigma. Thus, if 400 people answer yes, we can say that the 95% confidence range is 369 to 431 yes answers; pollsters would express this by saying that the poll yielded a result of 40% with a margin of error of 3%. (It is also likely that the poll would instead get an answer of 41% or 38%, but the margin of error would still be 3%.) Note that the size of the entire population never entered into this discussion. (We have assumed, however, that the population is much larger than the number of people questioned for the poll.) People often don't understand this, and don't realize how a poll of such a small number of people can be accurate.
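The poll arithmetic can be checked with a short Python sketch. It takes N in sqrt(N*p*(1-p)) to be the full sample of 1000 people, as the author's follow-up correction states. The finite-population correction sqrt((M-n)/(M-1)) is a standard textbook factor that I have added to show how little the population size M matters; it does not appear in the original answer, and the function name is illustrative.

```python
import math


def margin_of_error(sample_size, p, population=None, z=2.0):
    """Margin of error (as a fraction) at roughly 95% confidence (z=2).

    If population is given, applies the standard finite-population
    correction sqrt((M - n) / (M - 1)).
    """
    sigma_count = math.sqrt(sample_size * p * (1.0 - p))
    margin = z * sigma_count / sample_size
    if population is not None:
        margin *= math.sqrt((population - sample_size) / (population - 1))
    return margin


for population in (1_000_000, 300_000_000, 1_000_000_000):
    m = margin_of_error(1000, 0.4, population)
    print(f"population {population:>13,}: margin = +/-{m:.2%}")
```

The margin stays at about 3% whether the population is a million or a billion.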

"Also you say a sample of 1000 out of 300 million will get an answer accurate to within a few percent. Do you know exactly what the margin of error is and how to calculate it?"

(This should be 300 billion, not million.) I mentioned that if we detect no life around 1000 stars, we say with 95% certainty that P is less than 0.3%. This is based on a Poisson distribution. You can find more about this on the web, but I'll describe it briefly.

You've probably seen discussions of the probability of tossing a coin or a die. Suppose we have a die with many faces, and we're looking for some particular number; so the probability of success in each toss is P. We toss the die some huge number of times T, so the expected number of successes is A=P*T. Perhaps A is 10, so there's a good chance of getting 10 successes, but maybe we'll get 9 or 11, or some other nearby number. Note that the number of successes is an integer. What is the probability of getting exactly N successes? The answer is this:

probability = e^(-A) * (A^N) / N!

For small A, this distribution is highly asymmetric. As A gets larger, the distribution becomes increasingly symmetric and begins to resemble a normal distribution.

Now suppose we look at T=1000 stars, and P is the probability of intelligent life on each star. If A=P*T were 3, then the probability of finding 0 stars with life is e^(-3), or about 5%; so if we find 0 stars with life, we can say that we're 95% certain the probability of life is less than 3/1000 = 0.3%. Our best estimate of P is 0% (because we didn't find any life), but our estimate of the probable range is 0% to 0.3%. You might say that we have a 0.3% margin of error, but this phrase is more often used with a normal distribution (which is symmetric) than a Poisson distribution (which is not symmetric).
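The Poisson formula given above, e^(-A) * (A^N) / N!, is easy to evaluate directly. This Python sketch computes the chance of zero detections when A=3, then lists the probabilities around A=10 from the die example. The function name is illustrative.

```python
import math


def poisson_probability(n_successes, expected):
    """Probability of exactly n_successes when the expected number is `expected`."""
    return math.exp(-expected) * expected ** n_successes / math.factorial(n_successes)


# With A = 3 expected detections, the chance of seeing none is about 5%.
p_zero = poisson_probability(0, 3.0)
print(f"P(0 detections | A=3) = {p_zero:.3f}")

# The spread of outcomes around A = 10, as in the die example.
for n in range(7, 14):
    print(n, f"{poisson_probability(n, 10.0):.3f}")
```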

"Also based on your “Flat Earth” political party example, if intelligent life in the galaxy is extremely rare, does this mean that a sample size of 1 million won’t be big enough to yield a valid survey?"

Let's consider the more typical survey size of 1000 people, and assume that we find no Flat Earthers in that sample. Again, using a Poisson distribution, we can say that the expected number of Flat Earthers in 1000 people is probably less than 3 (with 95% certainty), so we can say that the fraction of the entire population of the country in this party is less than 3/1000 = 0.3%. It doesn't mean that the survey isn't valid. If we're trying to determine whether the Flat Earth candidate for president is a strong contender, the survey has told us convincingly that he is not. If our goal is to find out whether there are *any* people in the country who consider themselves members of this party, then we haven't succeeded (much like a SETI survey of 1000 stars without success leaves the question of life in the galaxy unanswered).
Reply:In my description of the Poisson distribution, I mistakenly substituted N for A. It should read "For small A, this distribution is highly asymmetric. As A gets larger ..."

Reply:In the third paragraph from the end, I should have said that the "probability of finding 0 stars with life is 5%", not 0.05%. I hit the YA length limit, and was unable to proofread my answer after submitting it.

Reply:I said that the standard deviation for a normal distribution is sqrt(N*p*(1-p)). I should have said "binomial distribution", not "normal distribution" (although the distribution of data in this example is essentially normal). I really miss the opportunity to proofread!

Reply:One more goof! In the Obama example, I should have said that the standard deviation is sqrt(1000*0.4*0.6)=15.5, so the margin of error (with 95% confidence) is plus or minus 3% (not 2%, as I said above). The N in sqrt(Npq) is the total sample size.

Reply:These are your odds,

How many stars like our Sun.

How many rocky planets about the size of Earth.

How many planets within the star warm zone, with a molten core to produce a magnetic field.

How many that have survived a collision with a planet the size of Mars that would create a Moon and induce a spin in the planet.

I think the odds are really low. But even out of 300 B there should be a considerable number of prospects.

Now, how many of those are in the same evolutionary time we are now?
Reply:haha, as far as I can tell, there is no intelligent life anywhere.

people are pretty dumb.
Reply:Are you counting the suns as stars? Because you'd have to survey roughly as many sun stars as you do stars... oh, but there are different kinds of stars, so... I say at least survey 4 million.
Reply:Good question.

No real answer though.

Haha.
Reply:goooogle =]
Reply:I have no idea what you are asking, but I'm going to say that they found frozen germs on Mars, so that means they had to be born and evolved from something else. But where are you getting these numbers? P.S. I know it is a homework question.
Reply:Well, with something as unique and random as life, conventional wisdom says that you would have to examine 95% of them to be 95% sure there was no life on them. And no, a 5% margin of error simply means + or - 5% either way, regardless of the estimated % value.
Reply:300 billion

Monday, May 11, 2009
