World War II Visualization

I have always been interested in how history can talk to the quantitative…so as part of a small data science project I thought it would be a good idea to put together a visualization of World War II using R. After some searching, I found several interesting data sources:

  1. Wikipedia: Most of the WWII battles on Wikipedia have a conflict infobox with the date and the coordinates of the battle. Naval loss tables can also be scraped from various pages. It would be rather nice to visualize them in time and space using Leaflet.
  2. Wikidata & DBpedia: These two databases have structured data that saves us from scraping and cleaning raw data from HTML Wikipedia pages.
  3. Naval History and Heritage Command: Almost all of the losses of the Imperial Japanese Navy during World War II can be found here, including
  4. Data.World: Data of all the bombs dropped by the Allies since WWI.

We can then put together four main datasets from these sources: (1) land warfare with date and location, (2) naval loss data, (3) aerial bombing data, and (4) command network data.

Land Warfare

Fortunately, there is a list that links to most of the WWII battles on Wikipedia, which saves us from crawling. We scrape and clean the links, then put them into a data frame.
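A minimal sketch of this step, assuming the rvest package. The helper name extract_battle_links is hypothetical, and the inline HTML is a made-up stand-in for the real list page so that the example runs offline. For the live page you would pass read_html() the URL of the list instead.

```r
library(rvest)

# Hypothetical helper: collect article links from a parsed list page
# into a data frame with one row per battle.
extract_battle_links <- function(page, base_url = "https://en.wikipedia.org") {
  # Keep only list items that link to another article.
  anchors <- html_elements(page, xpath = "//li/a[starts-with(@href, '/wiki/')]")
  links <- data.frame(
    name = html_text2(anchors),
    url  = paste0(base_url, html_attr(anchors, "href")),
    stringsAsFactors = FALSE
  )
  # Clean: drop namespace pages such as "/wiki/File:..." and duplicate links.
  links <- links[!grepl("/wiki/[^/]*:", links$url), ]
  links <- links[!duplicated(links$url), ]
  rownames(links) <- NULL
  links
}

# Made-up stand-in for the list page (assumption: the real markup is similar).
demo_page <- read_html('<ul>
  <li><a href="/wiki/Battle_of_Kursk">Battle of Kursk</a></li>
  <li><a href="/wiki/Battle_of_Midway">Battle of Midway</a></li>
  <li><a href="/wiki/File:Map.png">map</a></li>
</ul>')
battle_links <- extract_battle_links(demo_page)
```

The result has a name column and a url column, which is all the next step needs in order to visit each battle page.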

We can now follow the link to each battle and scrape the data from the military infobox on the right-hand side of each page. An XPath locator is particularly handy here for extracting particular elements. We can write a function that does all the work for us. Some of the pages do not have any data and will crash the function. To keep it running, we can wrap the call in try() so that faulty pages are skipped and the loop moves on to the next one.
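The idea can be sketched roughly as follows, again assuming rvest. The function names scrape_infobox and scrape_all are hypothetical, and the XPath expressions assume an infobox table with a "Date" header row and a span of class "geo" holding "lat; lon". Real pages vary, so treat these locators as a starting point. The two demo pages are made-up HTML, so nothing is fetched from the network.

```r
library(rvest)

# Hypothetical helper: read one page (URL or literal HTML) and pull the
# date and coordinates out of its infobox using XPath locators.
scrape_infobox <- function(src) {
  page <- read_html(src)
  date <- html_text(html_element(
    page,
    xpath = "//table[contains(@class, 'infobox')]//th[normalize-space()='Date']/following-sibling::td"
  ), trim = TRUE)
  geo <- html_text(html_element(page, xpath = "//span[@class='geo']"), trim = TRUE)
  # Pages without an infobox yield NA here; fail loudly so the caller can skip them.
  if (is.na(date) || is.na(geo)) stop("no usable infobox data")
  coords <- as.numeric(strsplit(geo, ";")[[1]])
  data.frame(date = date, lat = coords[1], lon = coords[2], stringsAsFactors = FALSE)
}

# Hypothetical wrapper: try() turns an error into a "try-error" object,
# so one faulty page does not stop the whole run.
scrape_all <- function(sources) {
  results <- lapply(sources, function(s) try(scrape_infobox(s), silent = TRUE))
  ok <- !vapply(results, inherits, logical(1), what = "try-error")
  do.call(rbind, results[ok])
}

# Made-up pages: one with placeholder infobox values, one with no data at all.
good_page <- '<table class="infobox vevent"><tr><th>Date</th><td>1 January 1940</td></tr></table><span class="geo">10.5; 20.25</span>'
bad_page <- '<p>No infobox here</p>'
battles <- scrape_all(c(good_page, bad_page))
```

Only the first page ends up in battles; the second one raises an error inside scrape_infobox and is silently dropped. tryCatch() would work equally well if you want to log which pages failed.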

However, the list is not very comprehensive and is missing battles here and there. If we used SPARQL to query structured data from Wikidata and DBpedia, we would be able to get a more comprehensive list of all the battles…
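As a rough sketch of what such a query might look like for Wikidata only (DBpedia uses a different vocabulary and is not covered here): the helper names build_battle_query and run_sparql are hypothetical, and the sketch assumes the httr and jsonlite packages for the network call. The identifiers are Wikidata's own: Q362 is World War II, Q178561 is battle, P31 is "instance of", P279 is "subclass of", P361 is "part of", P585 is "point in time" and P625 is "coordinate location".

```r
# Hypothetical helper: build a SPARQL query for battles that are part of a war.
build_battle_query <- function(war_qid = "Q362") {  # Q362 = World War II
  sprintf('
SELECT ?battle ?battleLabel ?date ?coord WHERE {
  ?battle wdt:P31/wdt:P279* wd:Q178561 ;   # instance of (a subclass of) battle
          wdt:P361+ wd:%s .                # part of the war, directly or via a campaign
  OPTIONAL { ?battle wdt:P585 ?date . }    # point in time
  OPTIONAL { ?battle wdt:P625 ?coord . }   # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}', war_qid)
}

# Hypothetical helper: send the query to the public Wikidata endpoint and
# return the result bindings. Needs network access plus httr and jsonlite.
run_sparql <- function(query, endpoint = "https://query.wikidata.org/sparql") {
  resp <- httr::GET(
    endpoint,
    query = list(query = query, format = "json"),
    httr::user_agent("ww2-visualization-example")
  )
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))$results$bindings
}

battle_query <- build_battle_query()
# battles <- run_sparql(battle_query)  # uncomment to run against the live endpoint
```

The transitive "part of" path (P361+) is meant to pick up battles that are attached to a campaign rather than to the war itself. It can be slow on the public endpoint, in which case a plain wdt:P361 is a cheaper, less complete alternative.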

Naval Loss

Aerial Bombing

Command Network

Visualization

…to be continued