A Cheap World War II Visualization

I have always been interested in how history can speak to the quantitative, so as part of a small data science project, I thought it would be a good idea to put together a visualization of World War II using R. It would be a sorry state of affairs if I had to salvage WWII data from library archives and digitize it myself, but fortunately, there are already enough interesting WWII data sources floating around the Internet to make good use of:

  1. Wikidata & DBpedia: These two queryable databases hold structured data on WWII battles compiled from Wikipedia, mostly from infoboxes.
  2. Naval History and Heritage Command: Almost all of the losses of the Imperial Japanese Navy during World War II can be found here, including both commercial shipping and naval vessels.
  3. Wikipedia: Wikipedia itself also has many WWII lists that can be used.
  4. Data.World: Data on all the bombs dropped by the Allies since WWI.

Out of these data, we can make a very simple (i.e. cheap) visualization of WWII. 

Data Cleaning

SPARQL Land Warfare Data

Using SPARQL, we can easily query all WWII battles along with their dates and coordinates. For some reason, DBpedia has some data that Wikidata does not, and vice versa, so we query both. The rest is some ugly wrangling work to clean the data and join the two datasets together.
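The exact queries aren't reproduced here, but a minimal sketch of the Wikidata side, assuming the WikidataQueryServiceR package, looks something like this:

```r
# Minimal sketch: pull WWII battles with a date and coordinates from Wikidata.
# Battles nested inside campaigns need the transitive P361+ path.
library(WikidataQueryServiceR)

battles <- query_wikidata('
  SELECT ?battle ?battleLabel ?date ?coord WHERE {
    ?battle wdt:P31/wdt:P279* wd:Q178561 ;  # instance of: battle
            wdt:P361+ wd:Q362 ;             # part of: World War II
            wdt:P585 ?date ;                # point in time
            wdt:P625 ?coord .               # coordinate location
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
')
head(battles)
```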

The final dataset looks something like this:

Making a Command Network

The more interesting thing we can do with DBpedia is to query for all the commanders in WWII and then use the data to create pairs of commanders who took part in the same battles, whether as allies or enemies.

Creating a pairwise combination of all commanders who shared a battle is trickier. We first join the dataframe to itself on the battle column to get all combinations of commanders based on shared battles.

We then want to remove pairs that are reversed duplicates of each other. For instance, the pair (Erwin Rommel, Gerd von Rundstedt) is the same as (Gerd von Rundstedt, Erwin Rommel) and needs to be filtered out. To do this, we sort the elements in each pair alphabetically and combine them (separated by a comma), so that both pairs above become the string “Erwin Rommel, Gerd von Rundstedt”. Then we remove the duplicates and separate the remaining strings back into pairs.
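A sketch of both steps with dplyr and tidyr, assuming a `commanders` data frame with `battle` and `commander` columns (hypothetical names), might look like this:

```r
library(dplyr)
library(tidyr)

pairs <- commanders %>%
  # Self-join on battle to get every combination of co-occurring commanders
  inner_join(commanders, by = "battle", suffix = c("_a", "_b")) %>%
  filter(commander_a != commander_b) %>%            # drop self-pairings
  # Build an order-independent key so (A, B) and (B, A) collapse together
  rowwise() %>%
  mutate(pair = paste(sort(c(commander_a, commander_b)), collapse = ", ")) %>%
  ungroup() %>%
  distinct(pair) %>%                                # remove reversed duplicates
  separate(pair, into = c("commander_a", "commander_b"), sep = ", ")
```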

Compiling Naval & Aerial Data

For naval data, I was a bit lazy and only scraped naval loss data for the US Navy, the Royal Navy, and the IJN. Here is a good tutorial on how to scrape online tables using rvest. To be fair, these data are neither very comprehensive nor very accurate, as I had to geocode many of the locations into rough coordinates using Google Maps after some ugly wrangling. For the aerial bombing data, all fields are filtered out except the date, the coordinates, and the tonnage dropped. All of the datasets can be found here.
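For illustration, the table-scraping step with rvest boils down to something like the sketch below; the URL is a placeholder, not one of the pages actually used:

```r
library(rvest)

# Read the page and pull the first HTML table into a data frame
page <- read_html("https://example.com/naval-losses")  # placeholder URL
losses <- page %>%
  html_element("table") %>%
  html_table()
```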

Visualization

Leaflet

For the land, naval, and aerial data, since we have the coordinates, making a Shiny Leaflet app out of them is pretty simple.
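As a rough sketch, with a hypothetical `naval` data frame holding `lat`, `lng`, and `ship` columns, the Leaflet layer amounts to something like:

```r
library(leaflet)

leaflet(naval) %>%
  addTiles() %>%                      # base map tiles
  addCircleMarkers(lng = ~lng, lat = ~lat,
                   radius = 3,
                   popup = ~ship)     # click a marker to see the ship
```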

It appears from the data that the Ruhr Valley was the main target of Allied bombing towards the end of the war, which is not surprising, since the Ruhr region was where most of the German war industry was based.

The Americans appear to have been pretty busy in China as well, supporting their allies by providing air reinforcements. Most of the tonnage seems to have been dropped on Hunan, the most heavily contested area in the Chinese theatre from 1941 onwards. For some reason, the data also show Americans dropping bombs in Kham, Tibet. Starting in 1944, P-40s and B-25s would regularly come to Garzê County (31°33’58.6″N 100°07’59.8″E) and drop several tons of payload. Since there is no reason why the US air force would drop bombs in a region not occupied by enemies, where the bombs could at most upset a few yaks, I am inclined to believe that the coordinates were miscoded in the data.

Command Network Analysis

A Shiny reactive generates the command network for the date range specified by the user input. This is certainly not the most efficient way of generating a network, since a new one has to be built every time the input dates change, and because of the large amount of memory required to compute the network, Shiny often explodes.
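A minimal sketch of that reactive, assuming the `pairs` data frame from earlier (with an added `date` column) plus igraph and visNetwork, might look like the following; the app's actual code is not reproduced here:

```r
library(shiny)
library(igraph)
library(visNetwork)

server <- function(input, output) {
  output$network <- renderVisNetwork({
    # Rebuild the graph from scratch whenever the date range changes
    edges <- subset(pairs, date >= input$dates[1] & date <= input$dates[2])
    g <- graph_from_data_frame(edges[, c("commander_a", "commander_b")],
                               directed = FALSE)
    dat <- toVisNetworkData(g)
    dat$nodes$value <- degree(g)                            # size: degree
    dat$nodes$group <- as.character(cut(betweenness(g), 5)) # color: betweenness
    visNetwork(dat$nodes, dat$edges)
  })
}
```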

The final network visualization looks something like the one below. Each node is a commander and each edge an engagement. The size of each node represents its degree, while the color represents its betweenness centrality score. I have also inserted two extra side plots, one displaying betweenness centrality and the other eigenvector centrality. Because a SharedData class is used to store the network data, each time we click a node on the main graph, the corresponding node is highlighted in the side plots. Interestingly, the commander with the highest betweenness score over the entire scope of WWII is not who we would expect, such as Rommel or Zhukov, but a certain Alexander Patch.
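As a rough illustration of the linked-highlighting idea (not the app's actual code), crosstalk's SharedData ties the two plotly side plots to the same underlying rows; the `nodes` data frame and its column names here are hypothetical:

```r
library(crosstalk)
library(plotly)

# Both plots share the same SharedData, so selecting a commander in one
# highlights the corresponding point in the other
shared_nodes <- SharedData$new(nodes, key = ~commander)
bscols(
  plot_ly(shared_nodes, x = ~degree, y = ~betweenness,
          type = "scatter", mode = "markers"),
  plot_ly(shared_nodes, x = ~degree, y = ~eigen,
          type = "scatter", mode = "markers")
)
```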

It seems that Patch fought in multiple theatres, including the Philippines, the Mediterranean, and later Normandy, so he is a central connecting point for the different theatres of WWII. Playing around with the graph also reveals that MacArthur has the highest degree (151) but a terribly low eigenvector centrality score (0.127). This indicates that he was connected to many other commanders, but that most of these commanders were not well-connected themselves. In other words, he was very busy dealing with non-celebs.

The full visualization is hosted on shinyapps.io and can be found here. Note that shinyapps.io limits the amount of memory used to 1GB, so selecting large timeframes will crash the app (I have yet to find a solution for this) 🙁