I wanted to be able to apply some of the things I’m learning on the Doing Journalism with Data MOOC to a topical project that interests me and isn’t too heavy. So I decided to look at how international the Tour de France has become. For example, how has the proportion of French riders changed over the years? Which was the first non-European nation to take part? Which European countries have never had a rider in the Tour? With the race starting in Yorkshire this year, it seemed a good time to look at this topic.
But I needed a set of data with the names and nationalities of each of the participants from 1903 to the present day. That should be simple, shouldn’t it?
I started with the official Tour de France site’s historical archive compiled by the current race organisers, ASO. It looks good and has various search options.
It wasn’t easy to scrape because the nationalities appear as flag jpegs rather than the actual name of the country. So I had a look to see if there were other websites which had a simpler structure. BikeRaceInfo.com in the States had a high Google ranking. It doesn’t look as slick as the official Tour de France site but it was easy to grab the table of participants.
BUT its data didn’t match the official LeTour.com site so my initial reaction was to assume it wasn’t a reliable source. I persevered with scraping the Tour de France site.
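For anyone facing the same flag problem, here is a minimal sketch of one way round it: read the image file name out of each table row and look it up in a hand-made table of countries. Everything specific here is an assumption for illustration, not the real LeTour.com markup: the table layout, the `/img/flags/pol.jpg` style of file name, the `FLAG_TO_COUNTRY` mapping and the `parse_riders` helper are all hypothetical.

```python
from html.parser import HTMLParser
import posixpath

# Hypothetical mapping from flag file names to countries. The real file
# names would have to be checked by hand against the site.
FLAG_TO_COUNTRY = {"fra": "France", "bel": "Belgium", "pol": "Poland"}


class RiderTableParser(HTMLParser):
    """Collect (rider name, country) pairs from the rows of an HTML table."""

    def __init__(self):
        super().__init__()
        self.riders = []
        self._cells = None    # text cells of the current row; None outside a row
        self._country = None  # country decoded from the row's flag image
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._country = [], None
        elif tag == "td":
            self._in_cell = True
        elif tag == "img" and self._cells is not None:
            # Turn ".../pol.jpg" into "pol" and look it up.
            src = dict(attrs).get("src") or ""
            code = posixpath.splitext(posixpath.basename(src))[0].lower()
            # Keep unknown codes visible so they can be fixed by hand.
            self._country = FLAG_TO_COUNTRY.get(code, "UNKNOWN:" + code)

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Assumes the rider's name is the first text cell in the row.
            if self._cells and self._country:
                self.riders.append((self._cells[0], self._country))
            self._cells = None

    def handle_data(self, data):
        if self._in_cell and self._cells is not None and data.strip():
            self._cells.append(data.strip())


def parse_riders(html):
    """Return a list of (name, country) tuples found in the HTML."""
    parser = RiderTableParser()
    parser.feed(html)
    parser.close()
    return parser.riders
```

Unknown flags come back as `UNKNOWN:<code>` rather than being dropped, so a gap in the mapping shows up in the spreadsheet instead of silently losing a rider.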
At regular intervals, I would cross-reference the list of participants with Wikipedia (and the sources cited in Wikipedia), especially if I came across a non-European participant in those early days. Hmm, some things just weren’t matching up. I started to get concerned about the “official” data. My concerns reached crisis point when I checked the data for 1947. The Wikipedia entry says this was the first year a Polish rider took part – Edward Klabinski. There was absolutely no mention of him in the official Tour de France archives. Nor was he mentioned in 1948 or 1949 when he also, apparently, took part. And he wasn’t a nobody! He came 18th in 1947 and was also the first winner of the Dauphine so there are plenty of records for him out there. I checked to see if there was an alternative spelling of his name (sometimes he was known as Eduard but that didn’t appear either) or if he’d changed nationality. No, nothing. He did appear in BikeRaceInfo.com though…
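The kind of check that exposed the missing rider can be sketched in a few lines: reduce every name to a rough comparison key, then list the names that one source has and the other lacks. This is only an illustration under my own assumptions. The `name_key` and `missing_from` helpers are hypothetical, and the matching is deliberately crude.

```python
import unicodedata


def name_key(name):
    """Reduce a rider name to a rough comparison key: lower case, no accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse repeated whitespace as well.
    return " ".join(no_accents.lower().split())


def missing_from(reference, candidate):
    """Return the names in `reference` that have no match in `candidate`."""
    candidate_keys = {name_key(n) for n in candidate}
    return sorted(n for n in reference if name_key(n) not in candidate_keys)
```

This copes with differences in case and accents, but not with genuinely different spellings such as Edward and Eduard, and not with letters like "ł" that have no accent-free decomposition. Those still need a hand-made list of aliases, which is why a missing name is a prompt to go and look rather than proof of an error.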
So I sent an email to Bill McGann who runs that site to see if I could get to the bottom of this. He was kind enough to email straight back and said he too had been surprised about how unreliable a resource the official website was! He suggested a couple of other resources which he had used – the Tour Encyclopedie (out-of-print) and Memoire du Cyclisme which claims to be the most reliable source on the web.
I felt bad for having assumed Bill’s website wouldn’t be as good a source as the official site. It’s actually extremely good! But he didn’t have the list of participants for every year that I needed so I had to carry on looking.
Memoire du Cyclisme requires a small subscription to enter so I actually parted with cash – I was that desperate. I’ve not been sent the password yet so need to chase it up…
Other sites had data but I couldn’t scrape it (this was before I’d discovered OutWit Hub which may have helped).
Then I remembered we had a History of the Tour de France by Geoffrey Wheatcroft on the bookshelf. What did he have to say on the matter?
The sources for the history of the Tour de France are exiguous, inaccessible, and largely corrupt. Plenty of popular books on the subject are riddled with error; and when three different, officially sanctioned reference works, the Tour Encyclopedie, the press office’s Histoire, and the Tour’s website, can’t agree on the number of entrants in the field one year, or on the spelling of a rider’s name, then it’s tempting to echo Sir Francis Hinsley in ‘The Loved One’ – ‘I was always the most defatigable of hacks’ and give up.
(Wheatcroft, 2003, p329)
I too was on the point of giving up but I shared my frustrations with the MOOC community on one of the discussion boards and was really pleased to get a couple of replies. They spurred me on!
Harness the power of Wikipedia
I trawled through the sources mentioned in Wikipedia references – they can be a really useful resource. I found CyclingArchives.com which matched well with Bill’s data.
I started scraping it (them??) into my spreadsheet – still cross-referencing. By now I’d come across a couple of other amateur archives, in particular TourFacts.DK. There were still a few anomalies in the number of participants, but the counts only differed by a few riders either way.
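Comparing the number of participants per year is easy to automate once two sources are in the same shape. Here is a minimal sketch, assuming each source has been boiled down to rows of (year, rider name). The function names and the `tolerance` parameter are hypothetical and only there for illustration.

```python
from collections import Counter


def starters_per_year(rows):
    """Count riders per year from (year, rider_name) rows."""
    return Counter(year for year, _name in rows)


def count_discrepancies(source_a, source_b, tolerance=0):
    """Return {year: (count_a, count_b)} for years where the counts differ
    by more than `tolerance`. A year missing from a source counts as 0."""
    a = starters_per_year(source_a)
    b = starters_per_year(source_b)
    return {
        year: (a[year], b[year])
        for year in sorted(set(a) | set(b))
        if abs(a[year] - b[year]) > tolerance
    }
```

Setting `tolerance` to a small number hides the years that are only out by a rider or two, so the badly mismatched years stand out first.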
I emailed CyclingArchives.com and TourFacts.DK to ask about their sources. Svend has been working on TourFacts.DK for seven years! He used mainly CyclingArchives.com plus a lot of googling. He said he’d emailed the Tour de France a long time ago about their errors but never got a reply. (Alarm bells rang for him when a rider listed as being born in 1956 rode the race in 1923!!)
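That kind of impossible birth year is the sort of thing a simple sanity check will catch. A minimal sketch, assuming rows of (rider, birth year, race year): the `implausible_ages` helper and its age limits of 16 and 50 are my own guesses at a sensible range, not an official rule.

```python
def implausible_ages(rows, min_age=16, max_age=50):
    """Return the rows whose age in the race year falls outside
    [min_age, max_age], with the computed age appended.

    rows: iterable of (rider, birth_year, race_year) tuples.
    """
    flagged = []
    for rider, birth_year, race_year in rows:
        age = race_year - birth_year
        if not min_age <= age <= max_age:
            flagged.append((rider, birth_year, race_year, age))
    return flagged
```

A rider born in 1956 and listed in the 1923 race comes out with an age of -33, which is flagged straight away. Anything flagged still needs checking by hand, as the mistake could be in the birth year, the race year or the rider’s identity.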
I got a reply from CyclingArchives today. They directed me to their page about the sources they use. It’s pretty comprehensive (including Memoire du Cyclisme) but very interestingly, their list does NOT include LeTour.com.
So, I may not have found the PERFECT data. I don’t think the perfect data exists for the Tour de France, which was so chaotic in its early years. But the community of archivists I’ve been able to speak to seem to agree that I’m probably using the least unreliable source. I think that’s the best I can hope for.
What have I learnt?
- Never underestimate the reliability of an amateur archivist. These people are passionate about what they do and very diligent. They’re also very well-connected and rely on a network of fellow experts who are equally enthusiastic about maintaining reliable, historical data. Chapeau!
- Never assume official data is the best. Just because a site looks fancy doesn’t mean it’s using reliable data.
- Cross-reference all the time.
- Talk to people. In my experience, they are extremely happy to help. Let them know how much you appreciate the work that’s gone into their archive and ask if they can tell you more about the sources they’ve used, how they’ve dealt with discrepancies etc. Become part of the community!
If you’ve had similar experiences trying to get hold of seemingly straightforward data, let me know. I’d love to hear your stories and share your frustrations.