/ wordpress

Lost your WordPress blog? Fear not.

My girlfriend asked me a while ago to try and resurrect her old WordPress-based website, Bioinformatyk (which is Polish for Bioinformatician).

The tricky parts are:

  • the domain is not available anymore
  • the original server is no longer there
  • database backups are lost

Tough luck? Not exactly.

Necromancy

What happens on the Internet stays on the Internet. There is a giant archive of old pages, the Internet Archive's Wayback Machine, where you can browse old versions of basically everything.

And Bioinformatician is also there.

A lot of blogs have HTML sitemaps which list all their articles. Picking the newest working snapshot, I was able to find this:

https://web.archive.org/web/20161024080036/http://www.bioinformatyk.eu/index.php/mapa-strony

Bioinformatician's archive

Armed with pup (a command-line HTML parser) I was able to extract all the links. How?

By sprinkling a little command-line magic.

curl https://web.archive.org/web/20161024080036/http://www.bioinformatyk.eu/index.php/mapa-strony | pup 'ul > li > ul > li > a attr{href}'

First I want to download the page and pipe it into the next command:

curl https://web.archive.org/web/20161024080036/http://www.bioinformatyk.eu/index.php/mapa-strony |

Then I want to match the particular nesting structure that I found in Chrome Developer Tools. This might be very different in your case. Your CSS selector skills will be put to the test here. ;)

Getting your CSS selector

pup 'ul > li > ul > li > a attr{href}' 

The last part, attr{href}, prints the value of the href attribute of each matched link.
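
If the selector doesn't match on the first try, a handy way to iterate is to drop the attr{href} part; pup then prints the matched elements themselves, so you can see exactly what the selector is grabbing:

curl -s https://web.archive.org/web/20161024080036/http://www.bioinformatyk.eu/index.php/mapa-strony | pup 'ul > li > ul > li > a'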

The result looks like this:

List of urls

I had already gone through a couple of those URLs and I knew that inside each and every one of them there is a <div class="post"> which contains the article content.

So I've added a few things to my one-liner:

curl https://web.archive.org/web/20161024080036/http://www.bioinformatyk.eu/index.php/mapa-strony | pup 'ul > li > ul > li > a attr{href}' | xargs -I {} sh -c 'curl -L {} | pup "div.post" >> posts.html'

The new part being:

xargs -I {} sh -c 'curl -L {} | pup "div.post" >> posts.html'

xargs takes our list of pages and runs curl -L {} | pup "div.post" >> posts.html for each of them. {} is the placeholder for the URL and the -L flag takes care of following redirects. Convoluted? Yes. Works? You bet.

So the result is a file with all of Bioinformatyk's posts, including some metadata about authors, categories and tags. It's a mess and all links are proxied through archive.org. All the images point there too.

posts.html

We got our data for necromancy, and it's something.

A new body

I've installed a fresh copy of WordPress on a new domain, bioinformatyk.pl.

I pulled the password for the admin user, grepped through my posts.html file to get the authors, and created their accounts manually.

The basic HTML structure for an author looks like this:

HTML fragment with author

cat posts.html |         # print the entire file
  pup '.p-who text{}' |  # select the relevant text from the tag
  grep Zamieszczony |    # keep only lines containing "Zamieszczony" ("Posted by")
  cut -d: -f2 |          # cut on ":" to drop the "Zamieszczony przez:" prefix
  sed 's/.\{2\}$//' |    # drop the last two characters (the trailing " w")
  sort | uniq            # sort and remove duplicates

Result:

List of names

I also took a look at which ID is assigned to which author. I'll use the XML-RPC API to put the articles back into WordPress and I want to keep the authors.

This is also the place where I have to use something more powerful than Bash. My tool of choice here is Ruby with a few gems:

  • nokogiri - which will help me parse posts.html
  • rubypress - which will let me talk to the WordPress XML-RPC API
  • mime-types - which will help me with the media upload

Install them with gem install nokogiri rubypress mime-types.

Uploading media

I extracted all the images with Bash:

cat posts.html | pup 'img attr{src}' | xargs wget -P images

All the images are now safely stored in images/.

I've created a short Ruby program:

require 'nokogiri'
require 'date'
require 'json'
require 'rubypress'
require 'mime/types'

wp = Rubypress::Client.new(
  host:     "bioinformatyk.pl",
  use_ssl:  true,
  username: "Username here",
  password: "Password here")

# Hardcoded path, not my proudest moment ;)
path = '/Users/hasik/Projects/bioinformatyk-uploader/images'
Dir.foreach(path) do |fname|
  # Skip the . and .. directory entries, we need just the files
  next if ['.','..'].include? fname
  name_with_path = File.realdirpath(fname, path)

  # Send file to WordPress
  wp.uploadFile(data: {
    name: fname,
    type: MIME::Types.type_for(name_with_path).first.to_s,
    bits: XMLRPC::Base64.new(File.binread(name_with_path)) # read the image as binary
  })
end

After a few minutes, I had all the images uploaded into WordPress.

Media are back

So far so good. A few images are missing because archive.org doesn't save everything, but it's much better than nothing. And you have to remember - we didn't have the database or any backup.

Note also that all the files keep their original names, so fixing the links is mostly a prefix swap: I used a RegExp to replace the https://web.archive.org/something-something prefix in the URLs in posts.html with the new https://bioinformatyk.pl/wp-content/images/ one.
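
As a rough sketch (not the exact expression I used, and the wp-content/uploads part of the old image path is an assumption), the swap boils down to something like:

sed -i.bak -E 's#https?://web\.archive\.org/web/[0-9a-z_]*/https?://(www\.)?bioinformatyk\.eu/wp-content/uploads/#https://bioinformatyk.pl/wp-content/images/#g' posts.html

The -i.bak flag keeps a backup copy of posts.html around, just in case.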

Now for the articles.

Grande finale - Articles

I did some RegExp-supported find & replace on posts.html to remove old JavaScript and a few broken articles. I won't list those regular expressions here, but rest assured the basic structure remained intact.

I've written a slightly longer Ruby script this time. It also tries to parse our Polish date format and extract the tags. I've skipped categories since they weren't required.

require 'nokogiri'
require 'date'
require 'json'
require 'rubypress'

# Custom date parser
def parse_date(polish_date)
  months = {
    'sty' => 1,
    'lut' => 2,
    'mar' => 3,
    'kwi' => 4,
    'maj' => 5,
    'cze' => 6,
    'lip' => 7,
    'sie' => 8,
    'wrz' => 9,
    'paź' => 10,
    'lis' => 11,
    'gru' => 12,
  }
  arr    = polish_date.split(' ')
  day    = arr[0].to_i
  month  = months[arr[1].delete(',')].to_i
  year   = arr[2].to_i
  DateTime.new(year, month, day)
end
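
# For the record: the archived .p-date strings look roughly like "24 paź, 2016"
# (day, abbreviated Polish month name, year). The sample value is illustrative,
# but that's the shape parse_date expects; for the string above it returns
# DateTime.new(2016, 10, 24).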

# Name lookup for new ids
def author_id(name)
  {
    'Justi'              => 1,
    'Konrad Dębski'      => 2,
    'dr Krystian Rother' => 3,
    'Fryderyk'           => 4,
    'Bless'              => 5,
    'Jakub'              => 6,
    'Jarek'              => 7,
    'Wojciech Czarnecki' => 8,
    'Mateusz Koryciński' => 9,
    'Piotr'              => 10,
  }.fetch(name)
end

posts = []

document = Nokogiri::HTML(File.open('posts.html', 'rb').read)

# This actually extracts data from HTML
document.search('.post').each do |post|
  title   = post.search('.p-head h2').text.strip
  date    = parse_date(post.search('.p-date').text.split("\n")[1].strip)
  author  = post.search('.p-who').text.split("\n")[1]
            .gsub('Zamieszczony przez:', '')
            .gsub('Zamieścił:', '')
            .gsub(' w:', '')
            .strip
  tags    = post.search('.p-tag a').map { |tag| tag.text.gsub("\n", '').strip }
  content = post.search('.p-con').to_s

  # And puts it into posts array
  posts << {
    title:     title,
    date:      date,
    author_id: author_id(author),
    tags:      tags,
    content:   content,
  }
end

wp = Rubypress::Client.new(
  host:     "bioinformatyk.pl",
  use_ssl:  true,
  username: "Username here",
  password: "Password here")

# Uniq here removes duplicate articles
posts.uniq{ |post| post[:title] }.each do |post|
  params = {
    blog_id: '0',
    content: {
      post_status: 'draft', # we don't want to publish everything straight away
      post_date:    post[:date],
      post_content: post[:content],
      post_title:   post[:title],
      post_author:  post[:author_id],
      terms_names: {
        post_tag: post[:tags]
      }
    }
  }
  wp.newPost(params)
end

I've run this script and voilà!

Articles

Justyna now has a lot of work to do fixing styles and external links, but her content is back from the dead.

Next time you remember an old website of yours, think about the possibility of getting it back. It takes a bit of hacking, but it is both doable and fun. :)