Monthly Archives: May 2014

Tour de France – Doing Journalism with Data – Beware “Official” sources

Bike [Die, France] by Biphop. Licensed under Creative Commons (CC BY-NC-SA 2.0)

I wanted to be able to apply some of the things I’m learning on the Doing Journalism with Data MOOC to a topical project that interests me and isn’t too heavy. So I decided to look at how international the Tour de France has become. For example, how has the proportion of French riders changed over the years? Which was the first non-European nation to take part? Which European countries have never had a rider in the Tour? With the race starting in Yorkshire this year, it seemed a good time to look at this topic.

But I needed a set of data with the names and nationalities of each of the participants from 1903 to the present day. That should be simple, shouldn’t it?

I started with the official Tour de France site’s historical archive, compiled by the current race organisers, ASO. It looks good and has various search options.

Screenshot - Historique de Tour de France

It wasn’t easy to scrape because the nationalities appear as flag jpegs rather than the actual name of the country. So I had a look to see if there were other websites which had a simpler structure. BikeRaceInfo.com in the States had a high Google ranking. It doesn’t look as slick as the official Tour de France site but it was easy to grab the table of participants.
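For anyone facing the same flag-image problem, here is a minimal sketch of one way round it, not what I actually did at the time. It assumes (hypothetically) that each table row holds a rider name in one cell and a flag image in the next, and that the image file name is a country code such as `fra.jpg`. The markup, the file names and the `FLAG_TO_COUNTRY` lookup are all made up for illustration; the real site may be structured quite differently.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath

# Hypothetical lookup from flag file name (without extension) to country.
FLAG_TO_COUNTRY = {"fra": "France", "bel": "Belgium", "pol": "Poland"}


class FlagRowParser(HTMLParser):
    """Collects (rider, country) pairs from rows shaped like
    <tr><td>Name</td><td><img src=".../fra.jpg"></td></tr> (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False
        self._name = None
        self._country = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget the previous rider.
            self._name, self._country = None, None
        elif tag == "td":
            self._in_cell = True
        elif tag == "img":
            # Turn ".../FRA.jpg" into "fra" and look it up.
            src = dict(attrs).get("src") or ""
            stem = PurePosixPath(src).stem.lower()
            # Unknown flags are kept visible rather than silently dropped.
            self._country = FLAG_TO_COUNTRY.get(stem, "UNKNOWN:" + stem)

    def handle_data(self, data):
        # The first non-empty text inside a cell is taken as the rider name.
        if self._in_cell and data.strip() and self._name is None:
            self._name = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._name is not None:
            self.rows.append((self._name, self._country))


def riders_from_html(html):
    """Return a list of (rider, country) tuples found in the HTML."""
    parser = FlagRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample row, only to show the call.
sample = '<tr><td>Example Rider</td><td><img src="/flags/fra.jpg"></td></tr>'
print(riders_from_html(sample))
```

Marking unknown flags as `UNKNOWN:...` instead of skipping them means gaps in the lookup table show up in the spreadsheet, where they can be checked by hand.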

Screenshot - BikeRaceInfo.com

BUT its data didn’t match the official LeTour.com site so my initial reaction was to assume it wasn’t a reliable source. I persevered with scraping the Tour de France site.

Cross-reference

At regular intervals, I would cross-reference the list of participants with Wikipedia (and the sources cited in Wikipedia), especially if I came across a non-European participant in those early days. Hmm, some things just weren’t matching up. I started to get concerned about the “official” data. My concerns reached crisis point when I checked the data for 1947. The Wikipedia entry says this was the first year a Polish rider took part – Edward Klabinski. There was absolutely no mention of him in the official Tour de France archives. Nor was he mentioned in 1948 or 1949 when he also, apparently, took part. And he wasn’t a nobody! He came 18th in 1947 and was also the first winner of the Dauphine so there are plenty of records for him out there. I checked to see if there was an alternative spelling of his name (sometimes he was known as Eduard but that didn’t appear either) or if he’d changed nationality. No, nothing. He did appear in BikeRaceInfo.com though…
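I did this cross-referencing by eye, but the same check can be sketched in a few lines of code. The example below is an assumption about how it might be done, not a record of my process: it compares two lists of rider names after stripping accents, case and extra spaces, and uses a small hand-made `aliases` table for alternative spellings such as Eduard/Edward. The function names and the sample lists are hypothetical.

```python
import unicodedata


def normalise(name):
    """Lower-case a name, strip accents and collapse repeated spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def cross_reference(source_a, source_b, aliases=None):
    """Return (only_in_a, only_in_b) as sorted lists of normalised names.

    `aliases` maps a normalised alternative spelling to the preferred one.
    """
    aliases = aliases or {}

    def key(name):
        cleaned = normalise(name)
        return aliases.get(cleaned, cleaned)

    names_a = {key(n) for n in source_a}
    names_b = {key(n) for n in source_b}
    return sorted(names_a - names_b), sorted(names_b - names_a)


# Made-up lists: one source is missing a rider the other one has.
official = ["René EXAMPLE"]
other = ["Rene Example", "Eduard Klabinski"]
spellings = {"eduard klabinski": "edward klabinski"}

missing_from_other, missing_from_official = cross_reference(
    official, other, aliases=spellings
)
print(missing_from_official)
```

A name that appears in only one list is not proof of an error, just a prompt to go and check a third source.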

So I sent an email to Bill McGann who runs that site to see if I could get to the bottom of this. He was kind enough to email straight back and said he too had been surprised about how unreliable a resource the official website was! He suggested a couple of other resources which he had used – the Tour Encyclopedie (out-of-print) and Memoire du Cyclisme which claims to be the most reliable source on the web.

I felt bad for having assumed Bill’s website wouldn’t be as good a source as the official site. It’s actually extremely good! But he didn’t have the list of participants for every year that I needed so I had to carry on looking.

Memoire du Cyclisme requires a small subscription to enter so I actually parted with cash – I was that desperate. I’ve not been sent the password yet so need to chase it up…

Other sites had data but I couldn’t scrape it (this was before I’d discovered OutWit Hub which may have helped).

Then I remembered we had a History of the Tour de France by Geoffrey Wheatcroft on the bookshelf. What did he have to say on the matter?

The sources for the history of the Tour de France are exiguous, inaccessible, and largely corrupt. Plenty of popular books on the subject are riddled with error; and when three different, officially sanctioned reference works, the Tour Encyclopedie, the press office’s Histoire, and the Tour’s website, can’t agree on the number of entrants in the field one year, or on the spelling of a rider’s name, then it’s tempting to echo Sir Francis Hinsley in ‘The Loved One’ – ‘I was always the most defatigable of hacks’ – and give up.

(Wheatcroft, 2003, p329)

I too was on the point of giving up but I shared my frustrations with the MOOC community on one of the discussion boards and was really pleased to get a couple of replies. They spurred me on!

Harness the power of Wikipedia

I trawled through the sources mentioned in Wikipedia references – they can be a really useful resource. I found CyclingArchives.com which matched well with Bill’s data.

Screenshot - CyclingArchives.com

I started scraping it (them??) into my spreadsheet – still cross-referencing. By now I’d come across a couple of other amateur archives, in particular TourFacts.DK. There were still a few anomalies with the number of participants but only a few either way.

I emailed Cycling Archives and TourFacts.DK to ask about their sources. Svend has been working on TourFacts.DK for seven years! He used mainly CyclingArchives.com plus a lot of googling. He said he’d emailed the Tour de France a long time ago about their errors but never got a reply. (Alarm bells rang for him when a rider listed as being born in 1956 rode the race in 1923!!)
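The kind of error Svend spotted is easy to test for automatically. Here is a minimal sketch, assuming each record has a rider name, a birth year and a race year; the age limits of 16 and 50 are my own guess at what counts as plausible, not a rule from any of the archives, and the sample riders are invented.

```python
def implausible_entries(entries, min_age=16, max_age=50):
    """Return the entries whose implied age falls outside the plausible range.

    Each entry is (rider, birth_year, race_year); the implied age is
    appended to every flagged entry.
    """
    flagged = []
    for rider, birth_year, race_year in entries:
        age = race_year - birth_year
        if not (min_age <= age <= max_age):
            flagged.append((rider, birth_year, race_year, age))
    return flagged


# Made-up records: the first repeats the 1956/1923 mismatch described above.
records = [
    ("Rider A", 1956, 1923),
    ("Rider B", 1895, 1923),
]
print(implausible_entries(records))
```

Anything flagged still needs a human to decide whether the birth year, the race year or the rider’s identity is the part that’s wrong.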

I got a reply from CyclingArchives today. They directed me to their page about the sources they use. It’s pretty comprehensive (including Memoire du Cyclisme) but very interestingly, their list does NOT include LeTour.com.

So, I may not have found the PERFECT data. I don’t think the perfect data exists for the Tour de France which was so chaotic in its early years. But the community of archivists I’ve been able to speak to seem to agree that I’m probably using the least unreliable one. I think that’s the best I can hope for.

 What have I learnt?

  • Never underestimate the reliability of an amateur archivist. These people are passionate about what they do and very diligent. They’re also very well-connected and rely on a network of fellow experts who are equally enthusiastic about maintaining reliable, historical data. Chapeau!
  • Never assume official data is the best. Just because a site looks fancy doesn’t mean it’s using reliable data.
  • Cross-reference all the time. 
  • Talk to people. In my experience, they are extremely happy to help. Let them know how much you appreciate the work that’s gone into their archive and ask if they can tell you more about the sources they’ve used, how they’ve dealt with discrepancies etc. Become part of the community!

If you’ve had similar experiences trying to get hold of seemingly straightforward data, let me know. I’d love to hear your stories and share your frustrations.

 

Get ready for Doing Journalism with Data MOOC

Exit Festival 2012 by Bernard Bodo. Creative Commons (CC BY-NC-SA 2.0)

We’re excited! This MOOC from the European Journalism Centre – “A free online data journalism course with 5 leading experts” – starts on Monday 19th May.

…and it’s not too late to join the party! I’m doing it because I want to keep building my Data Journalism skills and find out how data journalism is developing around the world. But, as an educator studying for a Masters in Blended/Online Learning, I’m also interested in the whole MOOC phenomenon.

So what’s the best way to get ready for this MOOC? Here are a few ideas.

Understand how Massive Online Learning works

Watch this for a quick familiarisation from Dave Cormier.

There are different platforms for MOOCs. Coursera is probably the best known, but there’s also Udacity and a platform created by the UK’s Open University called FutureLearn. The Doing Journalism with Data MOOC will be using Canvas. I recommend adding the Canvas bookmarklet for this DJ course to your browser so it’s really easy to get to and a constant reminder you have work to do!

What kind of MOOC course are you doing?

Screenshot of Dave Cormier's MOOC video

  • xMOOCs follow the Coursera-type model where a teacher-expert transmits knowledge through carefully packaged videos and checks that knowledge has been acquired through computer-graded quizzes. Support comes from occasional tutor participation in discussion forums. But support is also encouraged through students organising face-to-face meet-ups in their locality. There’s probably a Coursera meet-up group in your area!
  • cMOOCs rely on a more connectivist approach, making use of the networked web 2.0 technology. There can be an emphasis on content creation, for example, as a way of building knowledge. They are more student-centred in that there is no set route through the course and a limited structure so there’s more learner autonomy. Webinars with guest speakers, blogs and online facilities for students to connect with each other are a strong feature. Support comes from peers and is facilitated by networked technology.
  • quasi-MOOCs are not much more than Open Educational Resources tutorials such as the Khan Academy and, more recently, Codecademy. The learning resources are asynchronous and don’t really offer social interaction unless students generate it themselves. quasi-MOOCs are not packaged as a course but as a series of standalone tutorials. So support would be minimal here.
  • Dead MOOCs: OK, so this is my own category. I use it to describe archived MOOCs. So the actual MOOC is no longer running – the tutors aren’t around and the submission deadlines have passed – but you can still watch the video lectures and do the online quizzes. You won’t get a badge or certificate at the end but you can still learn.

I don’t know which model Doing Journalism with Data will follow so it’ll be interesting to find out.

You can read this article by George Siemens to learn more about the MOOC phenomenon.

Statistics: Making Sense of Data

By Ainali. Creative Commons CC-BY-SA 3.0

Without a basic understanding of statistics, data doesn’t mean much. I’ve been taking this Statistics MOOC from the University of Toronto for the last couple of months. It uses the Coursera platform and, thanks to the video lectures from Jeffrey Rosenthal and Alison Gibbs, I can now only talk about data in a Canadian accent. It’s a great example of a dead xMOOC! Even though the discussion forums are a year old, they’re still really useful when I get stuck (often). The submission deadlines are long gone so I’m not going to get any badges or certificates, but I still get marks for the multiple-choice quizzes I do and I can even do my assignments because the lecturers have supplied a “model answers” sheet for me to check against. It’s been a great way of brushing up my A-Level Statistics and putting it into a more practical context, and I highly recommend adding some statistics to your Data Journalism skill set.

Let’s kill the myth that journalists can’t do maths here and now!

Explore examples of Data Journalism

This is a great way to get in the zone. Once you start looking, you’ll find loads of examples. Think about the kind of data that was used and where it came from. What journalistic processes were added to the raw data to make it journalism – e.g. context, visualisation, interactivity?

And a note of caution: just because it’s data doesn’t mean it’s journalism. You still have to check your facts and the data source, and make sure you’re not asking your data to do more than it’s capable of. Here’s a cautionary tale you should read before embarking on your Data Journalism MOOC. It’s about this map of kidnappings in Nigeria produced by FiveThirtyEight.

See you there…

So, if you’re one of the 20,393 people already registered for the MOOC, I hope to see you online and share some learning. Do drop by and say hello!

Next job – I’ll be putting together a list of Top Tips for learning with MOOCs in the next day or so.