Dive In!
Welcome to From the Blocks!
“Great things never come from comfort zones.” – Penny Oleksiak
Hey there! Welcome to “From the Blocks”, a newsletter dedicated to exploring swimming data and statistics. I am a competitive swimmer with an [unhealthy] obsession with NCAA and professional swimming. By day I am a data scientist, and I wanted to combine my love for coding and my love for swimming. Voila! Swim stats newsletter.
This is not my first rodeo. I have always loved looking into swimming data to see what’s what, and any excuse to google Michael Phelps in a LZR Racer is a good use of time in my book. Check out my Pittsburgh Nerd Nite presentation from 2022 for the true start to all of this!
What you can expect:
A quote (which I used to do every practice when I coached my college club team)
Detailed walkthroughs of overcoming data challenges (surprise, there is not a lot of USABLE swimming data out there)
My data analysis approach
Code snippets (and links to my GitHub if you are into that kind of thing)
Swim headlines of my own (fueled by data!)
Why you might be interested:
You are a friend or family member and I have already told you that you would be interested
You are a swimmer and/or love swimming
You love data
There is a lot of swim data out there to be analyzed. Is it ready, right now, to load into Python and tackle? Heck no. I have spent the better part of the last 2 months preparing a dataset of all long course meters individual World Records. The data is hard to come by, and the best place I could find for all historical WRs is Wikipedia. I could not find it all together anywhere else (I asked World Aquatics, SwimSwam, and the International Swimming Hall of Fame). Wikipedia is not always reliable, which I understand. I fact-checked where I could and made several changes myself based on my own knowledge and assumptions. They are detailed here.
Please remember, as you look at the summaries I present here and going forward, that swimming data is hard to come by. It exists in a few places (the ones I listed above), and those sources do not always agree with each other. You may see headlines claiming slightly different numbers than what I will show you, and this underlines one of my goals – making reliable swim data easier to come by! I want usable, reliable data, and to have some fun with analytics along the way.
Swimming was not always a profession, and the reason this data (USABLE data) is so hard to come by is that swimming is relatively niche. Do you want to know how many yards the Steelers offense ran last season, or how often the Knicks score three-pointers in away games, or which MLB team hit the most foul balls over the last 30 years? The big three sports have a plethora of available data to answer these questions. Swimming, not so much.
It was not an easy process to get this data into a usable format, but it was a joy (not sarcasm). Just like any self-respecting statistician or data scientist, I have spent my fair share of time over the years doing data entry and cleanup. It’s part of the job, and trust me, it builds character. Now comes the fun part.
Katie Ledecky broke her own WR in the 800 freestyle on May 3rd of this year at the Fort Lauderdale stop of the TYR Pro Swim Series. She had set the previous record almost 9 years ago at the Rio Olympics. To have a 28-year-old still making best times (which means new WRs for her highness the queen of the pool) is pretty wild. How often are WRs broken? How often does the same swimmer break their own WR? They just keep getting faster! To quote my mother, “they still have to get from one end of the pool to the other!!” That makes tracking WRs all the more exciting.
I am going to use this first post to tell you about the dataset itself before really diving into it. Feel free to visit my GitHub repo to see code associated with this first post/early exploration! I look forward to sharing more swim stats with you.
Ok… what is in this dataset?
Most of the analysis is done for men and women separately. Early in my code I split the dataset into F and M so I can do this without having to subset in every single code chunk. I also did a significant amount of cleanup to get the swimmer names, nationalities, event times, rankings, ties, etc. into a usable format. I use the word usable frequently. Data scientists, you get it. Data is not always ready to play with. You don’t swallow steak whole. You heat up the grill, season your steak, cook it to your liking, cut it up into appropriate-sized pieces, and chew it before swallowing. Data science is basically the same thing. I, personally, prefer Montreal steak seasoning. For steak and for data.
(That is actually a joke about Canada being good at swimming, because while writing this I had to update the dataset 3 times in the span of a couple days after Summer McIntosh had the best meet of her life at Canadian Nationals last week. I know, it’s a reach. But here you are.)
In the dataset there are 552 total swimmers, 288 men and 264 women. There are 1,515 rows and 14 columns, a few of which I added or calculated. Not all of these WRs are going to be used in the analysis - refer to my note on the dataset for further clarification.
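Here is a minimal sketch of that F/M split and the swimmer counts. It is not the code from the repo: the column names (`swimmer`, `sex`, `event`, `country`) are assumptions, and the rows are made-up placeholders rather than real records. It sticks to the Python standard library so it runs anywhere.

```python
from collections import defaultdict

# Made-up placeholder rows, NOT real records. Column names are assumptions.
records = [
    {"swimmer": "Swimmer A", "sex": "F", "event": "100 free", "country": "AAA"},
    {"swimmer": "Swimmer A", "sex": "F", "event": "200 free", "country": "AAA"},
    {"swimmer": "Swimmer B", "sex": "F", "event": "100 back", "country": "BBB"},
    {"swimmer": "Swimmer C", "sex": "M", "event": "100 free", "country": "AAA"},
]


def split_by_sex(rows):
    """Group rows by the value in the (assumed) 'sex' column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["sex"]].append(row)
    return dict(groups)


def count_swimmers(rows):
    """Count distinct swimmers, however many records each one set."""
    return len({row["swimmer"] for row in rows})


# Split once up front so later steps do not need to filter again.
by_sex = split_by_sex(records)
women, men = by_sex["F"], by_sex["M"]

print(f"{len(records)} rows, {len(records[0])} columns")
print(f"{count_swimmers(women)} women, {count_swimmers(men)} men")
```

Splitting once at the top means every later summary can take `women` or `men` directly instead of repeating the same filter.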
Which swimmers have the most WRs?
For men, clearly it is Michael Phelps - with our furry friend Mark Spitz not far behind (see below for why I call him furry).
Kornelia Ender (asterisk!!) takes the cake for the women. Unknowingly being given performance enhancing drugs will do that! Anyone remember what happened in East Germany in the 1970s?
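A ranking like this boils down to counting rows per swimmer. The sketch below shows one way to do it, assuming one row per World Record and hypothetical `swimmer` and `sex` columns. The rows are placeholders, not real records.

```python
from collections import Counter

# Placeholder rows (one row = one World Record). Names are NOT real data.
rows = [
    {"swimmer": "Swimmer A", "sex": "F"},
    {"swimmer": "Swimmer A", "sex": "F"},
    {"swimmer": "Swimmer A", "sex": "F"},
    {"swimmer": "Swimmer B", "sex": "F"},
    {"swimmer": "Swimmer B", "sex": "F"},
    {"swimmer": "Swimmer C", "sex": "M"},
]


def top_record_setters(rows, sex, n=10):
    """Return the n swimmers of the given sex with the most records."""
    counts = Counter(row["swimmer"] for row in rows if row["sex"] == sex)
    return counts.most_common(n)


for name, total in top_record_setters(rows, "F"):
    print(name, total)
```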
Which countries have the most WRs?
The US is obviously at the top of the world when it comes to the most total WRs, with everyone else much further behind. We can look at the famous relay in Beijing and pretend the French will catch up someday, but we know the US will still come out ahead.
For the women, East Germany has a number higher than expected. East Germany existed for… how long? 40 years, give or take? And the US has been participating in the modern Olympic Games for 100+ years? Strange.
In all honesty - what they did to those women was horrible. I highly recommend Shirley Babashoff’s book “Making Waves” to explain. Or go back and watch my Nerd Nite presentation (it was so much fun to do I had to plug it twice).
Those are some interesting numbers for the women, and I wonder how they look when we compare the top 4 countries to everyone else.
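One way to set up that top-4-versus-everyone-else comparison is to count records per country, keep the four largest, and lump the remainder into a single bucket. This is a sketch under assumptions: the country codes are placeholders, and the "Everyone else" label is my own choice for illustration.

```python
from collections import Counter

# Placeholder country codes, one entry per World Record. NOT real data.
countries = ["AAA"] * 5 + ["BBB"] * 4 + ["CCC"] * 3 + ["DDD"] * 2 + ["EEE", "FFF"]


def top_n_vs_rest(countries, n=4, rest_label="Everyone else"):
    """Keep the n countries with the most records and pool all the others."""
    counts = Counter(countries)
    top = counts.most_common(n)
    rest = sum(counts.values()) - sum(total for _, total in top)
    return top + [(rest_label, rest)]


summary = top_n_vs_rest(countries)
for country, total in summary:
    print(country, total)
```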
Which swimmers set WRs in the most different events?
The most impressive thing about Michael Phelps is not just that he set so many World Records, although of course that is impressive. It’s that he set WRs in multiple different events. And the thing is, so did Mark Spitz, in all of his hairy glory. They both set WRs across 5 different individual events (to say nothing of their relay accomplishments!).
Just look at that mustache!
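Counting distinct events is a little different from counting records, because a swimmer who breaks the same record several times should only get credit for that event once. Here is a small sketch of that idea, again with assumed column names (`swimmer`, `event`) and placeholder rows rather than real records.

```python
from collections import defaultdict

# Placeholder rows, NOT real records. Swimmer A repeats one event on purpose.
rows = [
    {"swimmer": "Swimmer A", "event": "100 fly"},
    {"swimmer": "Swimmer A", "event": "100 fly"},
    {"swimmer": "Swimmer A", "event": "200 fly"},
    {"swimmer": "Swimmer A", "event": "200 IM"},
    {"swimmer": "Swimmer B", "event": "100 free"},
]


def distinct_event_counts(rows):
    """Count how many different events each swimmer has a record in."""
    events = defaultdict(set)
    for row in rows:
        events[row["swimmer"]].add(row["event"])  # a set ignores repeats
    # Most events first, ties broken alphabetically by name.
    return sorted(
        ((name, len(evts)) for name, evts in events.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


for name, n_events in distinct_event_counts(rows):
    print(name, n_events)
```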
Enough for now. There will be more! Thank you for tuning in and I am excited to share more swim analytics with you in the future. It won’t all be WRs – but when Summer McIntosh inevitably sets more WRs at World Champs in a few weeks (no pressure), you can expect me to be jumping up and down, and then running to my computer to see just how great she (statistically) is. Until next time!
“The best swimmers are the ones who can swim with their mind.” – Bob Bowman
And with 'From the Blocks', you're going to achieve greatness by analyzing the best swimmers who do just that... achieving greatness!