Return to site

6 Degrees Of Wikipedia

broken image


There is a popular hypothesis, known as six degrees of separation, holding that any two people are separated by a chain of no more than six acquaintances. Studying the characters of Wikipedia to see whether something similar obtains among its articles, Six Degrees of Wikipedia aims to be a compendium of the following things. Six Degrees of Wikipedia is a side project I've been sporadically hacking on over the past few years. It was an interesting technical challenge and it's fun to play with the end result. The six degrees of freedom: forward/back, up/down, left/right, yaw, pitch, roll Six degrees of freedom(6DoF) refers to the freedom of movement of a rigid bodyin three-dimensional space. Six degrees of separation is the idea that all people on average are six, or fewer, social connections away from each other. Also known as the 6 Handshakes rule. As a result, a chain of 'a friend of a friend' statements can be made to connect any two people in a maximum of six steps.

  1. 6 Degrees Of Wikipedia
  2. Longest 6 Degrees Of Wikipedia
  3. 6 Degrees Of Wikipedia Game

This may or may not have come up recently due to my circumstances; I don't think me blogging about my efforts is going to significantly improve anyone else's chances when it comes to the moment.

The idea in principal is dead simple: Given a page on Wikipedia how many steps/links/pages does it take to get to a given second page?

We don't care how we get there, there is no need to record the route, just count how many steps. The only other rule is you have to go from A -> B, you can't start at both ends and converge somewhere in the middle, i.e. A -> C <- B.

I really thought I'd have this nailed. I've done so much with Ruby and Mechanize in the past and this was an ideal fit. But I found this quite a humbling experience. For whatever reason (could have perhaps been the cat relentlessly pawing at the door) I lost my cool in the last thirty minutes of the ninety minutes allotted time, shut down a bit and stopped thinking/communicating out loud, and just couldn't get my brain to work: I knew what I was trying to do, but couldn't even communicate that to myself, let alone anyone else.

Here's exactly what I had when time was up: Bible verses about plumb line.

Not pretty is it? Doesn't work, at all. It's more like a collection of random thoughts. Like I said: a humbling experience. But how it nagged me. Handwriting help teach to be happy birthday card. I couldn't leave it like that when I knew I had the right idea somewhere in my brain.

This is what I had a few hours later:

6 Degrees Of Wikipedia

Which worked in principal, but was extremely slow and memory inefficient. I had to kill it before I got a result as it was using too much of the 1GB total I had on Linode.

You can see (maybe) what I was trying to with my initial attempt, i.e:

  1. Keeping track of the previous levels urls
  2. Keeping track of the current levels urls
  3. Keeping track of urls checked so time isn't wasted checking them again

The current level becomes the next previous level when we reach the end, etc.

You can also see bits that I knew were a problem and that were inefficient, but that I didn't really care about because I was sure I could figure them out later if need be (I.e. I didn't see them as the main problem):

  • Filtering out irrelevant links. I.e. all the header and footer links on a Wikipedia page which aren't anything to do with the article itself.
  • Checking all the links at the end of a step - slower than checking on the fly.

I decided to use Redis as a more memory efficient alternative to Ruby arrays for my final attempt for that day (by that point it was purely an effort for my own self-esteem). Redis is really awesome, by the way:

This was good enough to get it to run reasonably ok on the resources I had, but I knew improvements could still be made. That would have to wait for another day though as it was bedtime; I hadn't worked on this constantly that evening, I'd had other stuff to do.

Next, I figured out what I'd said to myself in a comment and managed to used Redis's built in databases for both the previous and current levels meaning I could easily switch between them. I also made it command line callable to make it a bit easier to use.

The next attempt was to use threading. I realised that, since I would have concurrent threads going on, I couldn't use multiple Redis databases as each thread could override the database selection made in another thread and things would get very messy. I decided to use Redis Sets instead in just one database. The mechanize bit became a separate function to make threading it easier - I.e. I could extend this to three instances/threads easily. I also added in better feedback (counting down urls remaining).

In making the threaded version I discovered the need for better error handling, e.g. checks for images and checks for a successful page load: On a single thread I never had failure of the program to work; However, after adding threads (before I put in the page load check) it would blitz through, but not always find the result. I.e. some page loads must have failed, but been lost in the begin/rescue. This approach made sure to re-add the page to the list to be checked if the page failed to load.

Longest 6 Degrees Of Wikipedia

I also realised I had lied to myself when I'd commented 'No way not to avoid an array at some point'.

However, GIL is still a thing so threading was not providing any performance benefits which prompted a fairly trivial switch from threads to processes instead. This represents the final state of the program; Obviously further improvements are possible, but I was happy enough with everything in principal.

6 Degrees Of Wikipedia

I decided to chuck some resources at the processes version and see if I could speed things up. The original task I had been set was to go from the Wikipedia Ruby page to the David Bowie page. But whilst developing this program I checked for a known single step of Ruby to Metastable (The Ruby page links to Diamond, which links to Metastable).

On a two-core system:

On a four-core system:

Longest 6 degrees of wikipedia

I decided to chuck some resources at the processes version and see if I could speed things up. The original task I had been set was to go from the Wikipedia Ruby page to the David Bowie page. But whilst developing this program I checked for a known single step of Ruby to Metastable (The Ruby page links to Diamond, which links to Metastable).

On a two-core system:

On a four-core system:

6 Degrees Of Wikipedia Game

So the additional processes are definitely helping. Mind you, doing six levels deep would be crazy on a single machine as it turns out David Bowie is only two levels deep and on a four-core with four processes this still took: 108m6.189/173m5.299/7m24.900.





broken image