Downloading AO3 bookmarks

Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.

Blaise Pascal, Lettres Provinciales, 1657

I have made this longer than usual because I have not had time to make it shorter.

Background

I gave a talk in 2017 called “Digital Packratting of Tumblr”, about my system for saving Tumblr posts for my old age.

This is a brief look at a system I'm using to save works from Archive Of Our Own (often called “AO3”).

The Archive Of Our Own

The Archive Of Our Own is a project of the Organization for Transformative Works, a “nonprofit organization established by fans to serve the interests of fans by providing access to and preserving the history of fanworks and fan culture in its myriad forms.”

The purpose of AO3 is to provide a location where fans can archive fanfiction and other fan-created works, which will (hopefully) not be subject to the kinds of drama and legal issues that have affected other fandom sites (see the Fanlore.org article on Archive Of Our Own for some more history and context).

(The Archive Of Our Own won a Hugo award in 2019!)

A (hopefully) brief digression

Yes, this is about a system for downloading (a lot of) fanfiction. For much of the last 70-odd years, fanfiction has had a complicated reputation in popular culture (especially in relation to the “slash” phenomenon).

Although I admit that some of my interest in fanfiction is related to slash romance fiction, there are other draws as well:

There are many other kinds of fanfiction as well; the Wikipedia article discusses the major varieties and terminology.

* Interestingly, if you like Leverage, your crossovers are also likely to be fix-its, according to this Tumblr article.

Bookmarking

I re-read. A lot. There are books I've read at least 10 times, and will probably read again (and again). I've been asked before why I would re-read something, and there are several reasons. Partly, if something is really well written, it is a joy to read again — it isn't only poetry that can capture you with word choice. Partly, I don't remember all the details, so I can still discover new things. Partly, when I do remember, there is less stress about the ending being unsatisfactory or overly depressing. And, partly, in the words of Jason Todd per shoalsea: “Well, yeah. But you’re different,” he told her. “So you read it differently.” (from Into the Brighter Night)

If I read a work on AO3 that I like, I may want to read it again later. So I use the bookmark feature, possibly to excess. I currently have 2570 items bookmarked on AO3. Some of these are series; expanded, I have about 4300 stories bookmarked.

The problems with bookmarks

I started using the “bookmark” feature on AO3 in 2014. Since then, I've had a few issues:

All of these are related to the general issue (also the impetus behind my Tumblr download program) — if you don't have a local copy of something, it can go away!

A local copy

The original issue was that when a work was deleted from AO3, the bookmark just showed as a “deleted work” — no information about what was lost. So I started to write a program to download my list of bookmarks, so I would know in the future what works I lost access to. But then, I realized, I might as well just download everything.

There are some existing AO3 archivers out there (such as ao3downloader, AO3 bulk downloader, FanFicFare (a Calibre plug-in), and several other tools I've seen referenced over the years). However, by writing my own, I am in control!

In particular:

Overview

This is more of a “system” than a program. Everything lives in a parent folder (~/Documents/ao3/ on my computer).

There are four programs:

In addition, there are two helper files:

Finally, there are four important folders:

Confession

When I did my talk on downloading Tumblr content, I wound up rewriting the program as a part of creating the talk. I had intended to do something similar for this talk: to use the process of reviewing the code to improve it. Sadly, work (and some other things) conspired against me. The programs discussed here work (I've been using them for over a year), but they definitely need improvement.

(There are also known omissions and desired improvements, which will be discussed later on.)

download.pl

download.pl code

download.pl is the heart of the system. At the same time, it's not that complicated:

Each type of file saved (bookmarks, series, works) is also cleaned up after it is downloaded. Works in particular contain parts that change frequently; these make comparisons between versions more complicated, so they are removed.
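The cleanup step can be sketched as follows. This is a minimal illustration in Python, not the Perl used by download.pl, and the two patterns are hypothetical stand-ins: the actual markup that download.pl strips is not shown in this write-up.

```python
import re

# Hypothetical patterns for fragments that change between downloads
# (for example a bookmark count or a list of collections).
# These are assumptions for illustration, not taken from download.pl.
VOLATILE_PATTERNS = [
    re.compile(r'<dd class="bookmarks">.*?</dd>', re.DOTALL),
    re.compile(r'<dd class="collections">.*?</dd>', re.DOTALL),
]


def clean_work_html(html: str) -> str:
    """Return the page with every volatile fragment removed."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    return html
```

With the volatile fragments gone, two downloads of an unchanged work should compare as equal even if, say, the bookmark count moved in between.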

download.pl — things to note

download.pl — known issues and improvements

There are some things I know I need to fix with download.pl.

compare_bookmarks.pl

compare_bookmarks.pl code

compare_bookmarks.pl could be considered the reason I created this whole system. It takes the two most recent bookmarks files and compares them, listing any new or, crucially, deleted bookmarks.

However, for all that it's the raison d'être of the project, there's not much to say about it that the code doesn't cover.

It has one flaw: right now, it is hard-coded to only compare the two most recent bookmark downloads. It should be enhanced to allow comparing any two bookmark files.
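The comparison itself is a pair of set differences, and taking the two files as arguments is one way the hard-coding could be lifted. The sketch below is in Python, not the Perl of compare_bookmarks.pl, and it assumes a simplified file format of one bookmark identifier per line; the real bookmark files are almost certainly structured differently. The function names are hypothetical.

```python
from pathlib import Path


def load_ids(path):
    """Read one bookmark identifier per line (an assumed file format)."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def compare_bookmarks(old_ids, new_ids):
    """Return (added, deleted) as sorted lists of identifiers."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    added = sorted(new_ids - old_ids)
    deleted = sorted(old_ids - new_ids)
    return added, deleted
```

Calling compare_bookmarks(load_ids(older_file), load_ids(newer_file)) with any two files would then report what appeared and what vanished between them.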

find_double_bookmarked.pl — the triple-file problem

Occasionally, I have bookmarked both a particular work, and a series it belongs to.

This usually happens when I bookmarked the work first, and then it turned into a series.

If such a work is updated, things get a little wonky. When download.pl first finds that the work has changed, it renames the existing file and saves the new version, naming both so that they are easy to find and compare later on. However, this means that the “plain” file ([work_id].html) has been moved and no longer exists. So when the work is processed a second time in the same run, it looks like a new work and is simply saved. This causes the “triple-file” problem, where [work_id]-[previous runtime].html, [work_id]-[current runtime].html, and [work_id].html all exist. Usually, [work_id]-[current runtime].html and [work_id].html are identical, but it did once happen that a work updated during the run, which made them differ. (If something updates several times during the run and is included in multiple series, I might have a quadruple-file problem, which will cause something to be lost; I have no good ideas here...)

The manage_duplicates.pl script cannot handle the triple-file problem, so I need a tool to help me find these double-bookmarked works.
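Spotting the triple-file situation from a directory listing can be sketched like this. It is a Python illustration, not part of the actual tooling, and it assumes that work IDs are numeric and that the runtime suffix is whatever follows the first hyphen; the exact runtime format is not given here.

```python
import re
from collections import defaultdict

# Assumed naming: "<work_id>.html" is the plain file, and
# "<work_id>-<runtime>.html" is a renamed or newly saved version.
NAME = re.compile(r"^(\d+)(?:-(.+))?\.html$")


def find_triples(filenames):
    """Map work_id -> stamped files, for works that also have a plain file."""
    plain = set()
    stamped = defaultdict(list)
    for name in filenames:
        match = NAME.match(name)
        if not match:
            continue  # not a work file
        work_id, runtime = match.groups()
        if runtime is None:
            plain.add(work_id)
        else:
            stamped[work_id].append(name)
    # A triple needs the plain file plus at least two stamped versions.
    return {
        work_id: sorted(names)
        for work_id, names in stamped.items()
        if work_id in plain and len(names) >= 2
    }
```

A work with only the two stamped files is the ordinary "changed work" case and is deliberately not reported.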

find_double_bookmarked.pl

find_double_bookmarked.pl code

In order to delete a bookmark, you have to manage it from your bookmarks page — there is no way to delete a bookmark directly from a work (or series) that I have found. (This is fine when you have 10 bookmarks; less so when you have 1000.) So this tool also provides the necessary URL to take you to the bookmark management screen that has the delete button.

Much like compare_bookmarks.pl, this program is fairly straightforward.

It does not capture the case where a work is part of two bookmarked series, only the case where a work and an enclosing series are both bookmarked.
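Both checks, the one the program performs and the one it does not yet cover, reduce to an index from work to the bookmarked series that contain it. The following Python sketch is hypothetical throughout: the function names and the input shapes (a list of bookmarked work IDs and a dict mapping each bookmarked series to its work IDs) are assumptions, not the data structures in find_double_bookmarked.pl.

```python
from collections import defaultdict


def series_by_work(series_contents):
    """Invert {series_id: [work_id, ...]} into {work_id: {series_id, ...}}."""
    index = defaultdict(set)
    for series_id, work_ids in series_contents.items():
        for work_id in work_ids:
            index[work_id].add(series_id)
    return index


def find_double_bookmarked(bookmarked_works, series_contents):
    """Works bookmarked directly that also sit in a bookmarked series."""
    index = series_by_work(series_contents)
    return {w: sorted(index[w]) for w in bookmarked_works if w in index}


def find_multi_series(series_contents):
    """Works that belong to more than one bookmarked series."""
    index = series_by_work(series_contents)
    return {w: sorted(s) for w, s in index.items() if len(s) > 1}
```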

manage_duplicates.pl

manage_duplicates.pl code

Having downloaded the works, I now need to deal with the history aspect. I want to keep older versions where there are changes that matter. The definition of “changes that matter”, though, is somewhat vague. Some of the ones that I want to preserve are:

Changes I don't particularly care about include:

(Some changes I have already removed, such as the set of collections including this work, or the number of bookmarks, etc.)

Initially, the process was to run download.pl, manually find the changed works, and compare them using BBEdit's “find differences” functionality. I would then either get rid of the old version or move it to the old/ folder, and rename the new version.

This started to get old, though, so I wrote manage_duplicates.pl to manage the process for me.
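The per-work file shuffle described above can be sketched as one small function. This is a Python illustration under stated assumptions, not the logic of manage_duplicates.pl: the function name, its parameters, and the idea that the caller has already decided whether to keep the old version are all hypothetical.

```python
from pathlib import Path


def resolve_duplicate(old_version: Path, new_version: Path,
                      plain: Path, old_dir: Path, keep_old: bool) -> None:
    """Archive or delete the old version, then make the new one the plain file."""
    if keep_old:
        # Keep the earlier text by moving it into the archive folder.
        old_dir.mkdir(parents=True, exist_ok=True)
        old_version.rename(old_dir / old_version.name)
    else:
        old_version.unlink()
    # The new version takes over the plain "[work_id].html" name.
    new_version.rename(plain)
```

The interesting part, deciding keep_old, is exactly the judgment call about "changes that matter", which is why a human still looks at the diff.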

manage_duplicates.pl — pure Perl?

The reason I was using BBEdit to compare the files (other than it being handy) is that BBEdit displays the character-level changes, not just the line-level changes. In my roundup of the 2018 Perl Mongers Advent Calendar, I noted day 15, App::ccdiff, which does character-level diffs.
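To show what “character-level” means in practice, here is a minimal sketch using Python's standard difflib module. It is a swapped-in technique for illustration only; it is not App::ccdiff and makes no claim about how ccdiff computes or displays its output. The function name char_diff is hypothetical.

```python
import difflib


def char_diff(old: str, new: str):
    """Return (tag, old_fragment, new_fragment) for each changed run of characters."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only insertions, deletions and replacements
    ]
```

A line-level diff would flag the whole paragraph when an author fixes a single typo; a character-level result such as a one-letter replacement makes it obvious that the change is cosmetic.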

Now that I can calculate the diff, I need a way to tell the program what to do with each work. After reading about various prompt modules, I chose IO::Prompt::Simple, in large part because it allowed me to easily specify the color of the prompt: after the display from ccdiff, I needed something of a different color to see the prompt.

Then I discovered that many of the diffs were long, and I needed to use a pager to let me see the start of the diff. Jarred has talked about IO::Pager. I also had to play around with the LESS environment variable to make things work the way I wanted.

manage_duplicates.pl — or not…

If you've looked at the code, though, you'll have noticed that we're not using the pure Perl solution. There are two reasons for this:

So, in the end, I went back to using Perl to glue together some external programs.

Links

The code:

Some fun fanfiction:

Questions?

fin