I went to my grandnephew’s high school graduation tonight (congrats Brandon!) and, like I usually do at such events, I started analyzing the program booklet.
What caught my attention was the fact that there was only one student with the last name of Smith. That seemed unusual for a class of just about 300 students, so I decided to do some research.
When I got home I went to Wikipedia and found a listing of the most popular last names in the U.S., and how frequently those last names appear. If you are still reading this post, here is what I found:
Top 5 most popular last names frequency per 100,000
Based on these frequencies, I was able to calculate how many students with each of these last names should be in a class of 300 students. This is shown in the table below, along with the actual number of students with each last name:
Top 5 most popular last names | expected # per 300 | actual number
Smith | 2.64 | 1
Johnson | 2.06 | 0
Williams | 1.71 | 1
Brown | 1.54 | 0
Jones | 1.52 | 1
Total | 9.47 | 3
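The arithmetic behind the expected column is just a rescaling from "per 100,000" to "per 300". Here is a minimal sketch of it in Python. The function names are my own, and the 880 figure is not quoted from Wikipedia; it is back-calculated from the 2.64 in the table, so treat it as an illustration only.

```python
def expected_count(freq_per_100k, class_size=300):
    """Scale a per-100,000 frequency down to a class of the given size."""
    return freq_per_100k * class_size / 100_000


def implied_frequency(expected, class_size=300):
    """Go the other way: the per-100,000 frequency implied by an expected count."""
    return expected * 100_000 / class_size


# Assumption: back-calculated from the table's 2.64, not taken from the census data.
smith_freq = implied_frequency(2.64)
smith_expected = expected_count(smith_freq)
print(f"Implied Smith frequency per 100,000: {smith_freq:.0f}")
print(f"Expected Smiths in a class of 300: {smith_expected:.2f}")
```

So an expected 2.64 Smiths per 300 students corresponds to roughly 880 Smiths per 100,000 people.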
I’m not going to attempt any further analysis to estimate the probability of getting a sample that looks so different from what the expected numbers say it should be, but my guess is that the likelihood of the outcome shown in the table above is pretty small.
The more interesting question is “why is the actual sample so different from the expected one?”
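For anyone who does want a rough number, here is one way it could be sketched. This assumes the count of students with a given last name behaves like a Poisson random variable whose mean is the expected value from the table, and that the five names are independent of each other. Both are simplifying assumptions of mine, not something the data confirms (siblings and twins, for instance, share a last name).

```python
import math


def poisson_cdf(k, lam):
    """Probability of seeing k or fewer events when lam are expected (Poisson model)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


# Totals from the table: 9.47 expected across the top five names, 3 observed.
p_top_five = poisson_cdf(3, 9.47)

# Smith alone: 2.64 expected, 1 observed.
p_smith = poisson_cdf(1, 2.64)

print(f"P(3 or fewer students across the top five names) = {p_top_five:.3f}")
print(f"P(1 or fewer Smiths) = {p_smith:.3f}")
```

Under those assumptions, seeing three or fewer students across all five names comes out to roughly 1.5%, while seeing one or fewer Smiths is closer to one in four. So a single Smith on its own is not that strange; it is the shortfall across all five names together that looks unusual.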
The Wikipedia article is based on data from the 2000 census; could it be the population has changed dramatically in 16 years? The Wikipedia article does compare the ranking of names to the 1990 census, and it does not seem to be that different (the same five names are in the top five for each census, although in a slightly different order). So I can’t imagine it would be that different now.
Could it be that where I live is not representative of the general population of the U.S.? It seems like a normal suburb, about 20 minutes from Philadelphia, but then again, I grew up around here. So it’s hard for me to know if this area is similar to other parts of the U.S.
Or could it be that these are just the sorts of results you sometimes see when looking at a small subset of the overall population? Perhaps last year’s class was loaded with students with these last names, or maybe next year’s will be. If we were to include the past 10 years of graduates, the totals might come very close to the expected outcome. Or again, maybe not, if this high school is not representative of the overall U.S. population.
One other item that stood out to me – there were seven sets of twins in a class of 300. I’ll save that statistical analysis for another day.
So this is just a sample of the sort of thoughts that run through my mind when it has a chance to wander.
Maybe someday I’ll write a post about what I am thinking about while I am at church. That post could go for several pages…