V7: Beginning data migration

Prepping hundreds of tiny blog posts for republishing

January 24, 2021

Apropos of nothing, I decided that the first of the old entries I’d bring over to V7 would be granular ones:

Daily Haiku: A section of the fourth version of my site, beginning back in 2005. As the name suggests, I wrote a haiku every weekday based on the Dictionary.com Word of the Day. Each haiku was originally its own entry, but when I brought them over to V6 a few years ago, I consolidated them into monthly digests. I’m breaking them back out into individual entries for V7. There are 238 total.
Letterboxd film diary: I’ve been using Letterboxd for nearly 10 years to log every single movie I watch. Some of the reviews I’ve written have appeared on previous versions of my site, but I’m bringing the entire diary over to V7, including tiny entries that merely consist of the name of the movie and a star rating (which is most of them). There are currently 1,348 Letterboxd entries.

Daily Haiku hearkens back to the days when my site ran on Movable Type, but the CMS and database are long gone. All that remains is the static site they generated, which means there’s no especially good way to export the site’s content in another format. I did a little research on web scrapers, which would surely be up to the task, but before I got too far down that path, I decided to just spend a few hours clicking through the entries and manually copying and pasting them into Markdown files (the format Jekyll and other static site generators typically use for blog posts). There can be a certain zen to data entry once you hit a good rhythm, and the absorbing mindlessness of the work was a welcome reprieve from the feverish news cycle of the past few weeks.

As for Letterboxd, with nearly six times as many entries as Daily Haiku, copying and pasting was a non-starter. Luckily, Letterboxd’s data export is outstanding, and it only took a minute for me to get all my data formatted into a handful of tidy CSV files. I opened the diary file in Google Sheets, deleted the columns I didn’t need, renamed the headers, rearranged the columns for my purposes, and exported an updated CSV. From there, I made a few modifications to a handy Python script by Evan Lovely that takes a CSV file and turns each of its rows into a Markdown file with YAML front matter. Running the script gave me the 1,348 Markdown files I needed, and after some batch find/replace cleanup (removing non-ASCII characters from filenames, putting quotes around title strings, and so on), they were done! Here’s an example:

---
layout: post
date: 2020-02-13 23:59:00
title: "Portrait of a Lady on Fire"
year: 2019
rating: 0.9
tags_letterboxd: narrative, theater, angelika film center, nyc, film club
tags: film
category: Letterboxd
canonical: https://boxd.it/ZkLA5
---

This movie is so goddamned good, I’m not even mad that it neglects to incorporate Van Halen’s “On Fire.”
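
For anyone curious what the CSV-to-Markdown step looks like in code, here is a minimal sketch. To be clear, this is not Evan Lovely’s script, just a rough illustration of the same idea. The column names (`date`, `title`, `year`, `rating`, `tags_letterboxd`, `review`), the filename pattern, and the helper names `slugify`, `row_to_markdown`, and `convert` are all assumptions made up for the example. It also folds in two of the cleanup steps, ASCII-only filenames and quoted titles, rather than handling them with find/replace afterward.

```python
import csv
import io
import re
import unicodedata

# Hypothetical column names; a real export would need its headers renamed to match.
FRONT_MATTER_KEYS = ["date", "title", "year", "rating", "tags_letterboxd"]


def slugify(title):
    """Build an ASCII-only, hyphenated slug for use in a filename."""
    # Decompose accented letters, then drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def row_to_markdown(row):
    """Turn one CSV row (a dict) into a (filename, file contents) pair."""
    lines = ["---", "layout: post"]
    for key in FRONT_MATTER_KEYS:
        value = row.get(key, "")
        if key == "title":
            # Quote titles so colons and other YAML syntax cannot break parsing.
            value = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append(f"{key}: {value}")
    lines += ["tags: film", "category: Letterboxd", "---", "", row.get("review", ""), ""]
    # Assumed filename pattern: the date portion of the timestamp plus the slug.
    filename = f"{row['date'][:10]}-{slugify(row['title'])}.md"
    return filename, "\n".join(lines)


def convert(csv_text):
    """Yield a (filename, contents) pair for every row of the CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row_to_markdown(row)
```

Each pair that `convert` yields could then be written to disk with something like `pathlib.Path(name).write_text(text, encoding="utf-8")`. Keeping the conversion separate from the file writing makes it easy to spot-check a few rows before generating a thousand-plus files.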

Neither the Daily Haiku nor the Letterboxd posts appear on V7 yet, but once I have some template architecture set up to support them, they should be ready to go.

Wrangling my Letterboxd data was easier than I expected, which is encouraging. I should be able to take a similar approach with my Twitter data, which is by far the largest single data source I’ll be dealing with (6,000+ entries). But that data is also more complex, and my batch find/replace tricks will probably only get me so far. It may be time for me to start properly learning regular expressions.
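
As a small taste of what regular expressions offer over plain find/replace, here is a hedged sketch of one of the cleanup steps mentioned earlier: putting quotes around title strings. The function name `quote_titles` and the assumption that each title sits on a single `title:` line in the front matter are mine, for illustration only; it is not the process actually used here.

```python
import re

# Match a "title:" line whose value does not already start with a double quote.
# The lookahead checks past any spaces so quoted titles are skipped entirely.
UNQUOTED_TITLE = re.compile(r'^title:(?![ \t]*")[ \t]*(\S.*?)[ \t]*$', re.MULTILINE)


def quote_titles(text):
    """Wrap bare title values in double quotes, escaping any quotes inside."""
    def fix(match):
        value = match.group(1).replace("\\", "\\\\").replace('"', '\\"')
        return f'title: "{value}"'
    return UNQUOTED_TITLE.sub(fix, text)
```

A plain find/replace can’t tell a quoted title from an unquoted one, whereas the pattern above leaves already-quoted lines alone, so running it twice over the same files does no harm.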

21 posts in this series

January 1, 2020

V7: Introduction

January 4, 2020

V7: The “viewport” meta tag

January 8, 2020

V7: Content priorities

January 14, 2020

V7: Structural challenges

February 9, 2020

V7: Timeline section inventory

March 3, 2020

V7: The timeline is taking shape

June 24, 2020

V7: On dependency

December 5, 2020

V7: Choosing a CMS

January 24, 2021

V7: Beginning data migration

November 25, 2022

V7: Renewed purpose

May 4, 2023

V7: The Procrastination Destination

May 24, 2023

V7: Eleventy it is

June 1, 2023

V7: Expanding scope

August 8, 2023

V7: Metadata structure and sitemap

July 26, 2024

V7: The Great Data Migration

September 10, 2024

V7: The Great Data Migration, Part 2

September 9, 2025

V7: Launch day

October 13, 2025

V7: Video Killed the Web Browser Star

January 4, 2026

V7: Typographic scales and technical pens

June 22, 2026

V7: Backfilling metadata

July 1, 2026

V7: Say hello to my listening diary