The Great Data Migration
Bringing it all home
I’ve done a lot of work on the site in the last two months, and a launch date, while still a ways off, is finally coming into focus. I’ve been working on this redesign very intermittently for over four years now, but at this point I expect to keep at it until it’s done, with as little interruption as possible.
Among other recent advances, I’ve moved the site from Jekyll to Eleventy, chosen a font family, and designed and built out the front end for several core components and templates, all of which I’ll hopefully discuss in more depth soon. But today’s milestone is about content: At long last, I’ve finished reformatting all remaining external data! Decades of social media posts and other assorted falderal are all neatly packaged into thousands of Markdown files, ready to be published all in one place for the first time. It wasn’t so long ago that I had no idea how I was going to accomplish this, so I’m pretty stoked.
All the reformatting was done locally with Node.js, using my established metadata structure and site map as a guide for the finished files. I did my best to avoid replicated content, so for anything that was cross-posted between, say, Twitter and Instagram, the primary post was kept and the dupes removed. The structure and quality of the various platforms’ exported data varied quite a bit, so I had to create unique processes for reformatting each one (although there was a lot of code they shared). Some notes about how it went down:
- Dribbble: Downloading your Dribbble data gets you a single JSON file and no images, which is disappointing to say the least. The data does include the URLs for the images, and if I had been a heavier Dribbble user, I would have used this opportunity to finally become proficient at web scraping. But since there were only a few dozen images, I downloaded them manually. The data export also didn’t include anything about rebounds (which are essentially threads), but that was easy enough to clean up by hand too. Otherwise this one was pretty straightforward.
- Flickr: Rich data, easy to work with, and included everything I needed except the image dimensions, which I was able to get easily with the image-size Node module. Flickr’s export gives you the original high-res media you uploaded, and while I expect eleventy-img to handle processing for my static images, I’ll probably have to compress the videos myself using Adobe Media Encoder or some such. There are only about 50 of them and they’re all short, so that shouldn’t be a big deal.
- Goodreads: Lots of data I didn’t really need (around preferences, followers, etc.), and the stuff I did need was missing some core things, like authors and publication dates. As with Dribbble, I didn’t use Goodreads that much, so I was able to handle that stuff (and download book cover images) manually.
- Google Reader: I stumbled on this data last year, a decade after I downloaded it just before Reader shut down, and realized I could turn it into link posts on my site. Of the close to 1,000 things I shared on Reader, I decided to migrate only the 224 I added notes to, and the data was easy to work with. My notes abruptly stop in October of 2011, nearly two years before Reader’s demise, so I have to wonder if a bunch of stuff is missing, but if so, I don’t suppose there’s much to be done about it at this point. Thanks to some sites’ use of feed proxies and redirects, a lot of the links I shared with Reader are now broken, which is a bummer.
- Instagram: Ugh. Death by a thousand paper cuts with this one. I had the option to get my data in JSON and/or HTML, and each one contained information the other didn’t, so I needed both of them to get it all (which still wasn’t everything). Inconsistent formatting between IGTV videos, reels, posts, and stories; convoluted Unicode entities I couldn’t decode; tagged users omitted from posts with multiple images/videos; NO FUCKING PERMALINKS?! I had to jump through so many hoops to get everything to a decent place. The only data source I spent more time with was Twitter, and that was only because it was my first Node project. At least this experience was consistent with my extremely low opinion of Meta and Instagram.
- iTunes: The desktop app formerly known as iTunes has suffered greatly in many ways in the nine years since Apple Music started up, especially for people like me who still maintain a local music library, but luckily you can still easily export a very detailed XML file of all your data. For me, that data goes back to the very beginning of iTunes in 2001, and in 2004 I finally ripped all my CDs to MP3. This means I have reliable data about when albums were originally added to my music library from 2005 on, which I’m happy to be able to put on my site. The process got a little messy (especially when non-ASCII characters were for some reason encoded differently in directory names than they were in the MP3s’ ID3 tags), but went fairly quickly, and generating JPGs from Base64 album cover data embedded in the MP3s (using jsmediatags) was especially satisfying.
- Letterboxd: I took an initial stab at reformatting my Letterboxd film diary with Python a while back, but that was before I decided to include directors and film posters. Letterboxd’s data export is great, but it doesn’t include directors or posters, so I had my friend Jon help me use the TMDB API to get them, which we were able to do over a weekend.
- Twitter: Twitter’s data export is pretty fantastic (or at least it used to be—Elon has probably ruined it by now), with two exceptions: it doesn’t give you the highest quality versions of your media files, and it doesn’t include alt text. Some blessed soul wrote a Python script I’ve since lost track of that gets the media files for you, which I used right after I quit Twitter, so that was great. As for getting the alt text, there’s surely a better method than mine, but I filtered my timeline by media, went through every post since the platform added alt text capabilities in March of 2016, and copied and pasted. (All told, I had 164 images with alt text.) My final files only include tweets that aren’t retweets and aren’t part of a conversation.
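For the curious, here's roughly what the filter-and-reformat step looks like for something like the Twitter data. This is a minimal sketch and not the actual scripts: the sample entries, the field names (`full_text`, `in_reply_to_status_id_str`, and so on), the front matter keys, and the `isStandalone`, `toIsoDate`, and `toMarkdown` helpers are all assumptions made for illustration. It treats anything starting with "RT @" as a retweet and anything with a reply-to ID as part of a conversation, which is a simplification.

```javascript
// Hypothetical sample shaped like entries in a Twitter archive's tweets.js.
// The field names are assumptions about that export, not details from the post.
const sampleTweets = [
  { tweet: { id_str: "1", created_at: "Wed Mar 30 17:05:00 +0000 2016", full_text: "Hello, world" } },
  { tweet: { id_str: "2", created_at: "Thu Mar 31 09:00:00 +0000 2016", full_text: "RT @someone: A retweet" } },
  { tweet: { id_str: "3", created_at: "Fri Apr 01 12:30:00 +0000 2016", full_text: "@someone A reply", in_reply_to_status_id_str: "99" } },
];

const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Convert "Wed Mar 30 17:05:00 +0000 2016" to "2016-03-30T17:05:00Z".
// Assumes the offset is always +0000 and the day is already zero-padded.
function toIsoDate(createdAt) {
  const [, monthName, day, time, , year] = createdAt.split(" ");
  const month = String(MONTHS.indexOf(monthName) + 1).padStart(2, "0");
  return `${year}-${month}-${day}T${time}Z`;
}

// Keep only tweets that are neither retweets nor replies (assumed heuristics).
function isStandalone(tweet) {
  const isRetweet = tweet.full_text.startsWith("RT @");
  const isReply = Boolean(tweet.in_reply_to_status_id_str);
  return !isRetweet && !isReply;
}

// Render one tweet as a Markdown file body with hypothetical front matter keys.
function toMarkdown(tweet) {
  return [
    "---",
    `date: ${toIsoDate(tweet.created_at)}`,
    "source: twitter",
    `sourceId: "${tweet.id_str}"`,
    "---",
    "",
    tweet.full_text,
    "",
  ].join("\n");
}

// Build an in-memory list of files; writing them to disk is left out here.
const files = sampleTweets
  .map((entry) => entry.tweet)
  .filter(isStandalone)
  .map((tweet) => ({ name: `${tweet.id_str}.md`, body: toMarkdown(tweet) }));

console.log(files.map((file) => file.name));
```

Each platform would need its own version of the filter and the field mapping, but the Markdown-rendering part is the kind of code that can be shared between them.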