Downloading AO3 bookmarks

Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.
Blaise Pascal, Lettres Provinciales, 1657
I have made this longer than usual because I have not had time to make it shorter.

Background

I gave a talk in 2017 called “Digital Packratting of Tumblr”, about my system for saving Tumblr posts for my old age.

This is a brief look at a system I'm using to save works from Archive Of Our Own (often called “AO3”).

The Archive Of Our Own

The Archive Of Our Own is a project of the Organization for Transformative Works, a “nonprofit organization established by fans to serve the interests of fans by providing access to and preserving the history of fanworks and fan culture in its myriad forms.”

The purpose of AO3 is to provide a location where fans can archive fanfiction and other fan-created works, which will (hopefully) not be subject to the kinds of drama and legal issues that have affected other fandom sites (see the Fanlore.org article on Archive Of Our Own for some more history and context).

(The Archive Of Our Own won a Hugo award in 2019!)

A (hopefully) brief digression

Yes, this is about a system for downloading (a lot of) fanfiction. Fanfiction in the popular culture for much of the last 70-odd years has had a complicated reputation (especially in relation to the “slash” phenomenon).

Although I admit that some of my interest in fanfiction is related to slash romance fiction, there are other draws as well:

Fix-it fiction: maybe you don't like the “serious fiction” ending where everyone is unhappy?
Diverging from canon: or maybe a different decision earlier on could make things better?
Alternate universes: perhaps you just want to see your favorite characters interact in a coffee shop? Or bakery?
Cross-over: we can mash two of your favorite (or least favorite) canons together for fun! *

There are many other kinds of fanfiction as well; the Wikipedia article discusses the major varieties and terminology.

* Interestingly, if you like Leverage, your crossovers are also likely to be fix-its, according to this Tumblr article.

Bookmarking

I re-read. A lot. There are books I've read at least 10 times, and will probably read again (and again). I've been asked before why I would re-read something, and there are several reasons. Partly, if something is really well written, it is a joy to read again — it isn't only poetry that can capture you with word choice. Partly, I don't remember all the details, so I can still discover new things. Partly, when I do remember, there is less stress about the ending being unsatisfactory or overly depressing. And, partly, in the words of Jason Todd per shoalsea: “Well, yeah. But you’re different,” he told her. “So you read it differently.” (from Into the Brighter Night)

If I read a work on AO3 that I like, I may want to read it again later. So I use the bookmark feature; possibly to excess. I currently have 2570 items bookmarked in my AO3. Some of these are series — expanded, I have about 4300 stories bookmarked.

The problems with bookmarks

I started using the “bookmark” feature on AO3 in 2014. Since then, I've had a few issues:

authors decide to delete works due to harassment, fits of pique, fandom drama, or just not knowing about (or predating) the “orphan_account” feature
at least once, someone decided they didn't like a story they had written and replaced the content of the work with a completely different story

Both of these are related to the general issue (also the impetus behind my Tumblr download program): if you don't have a local copy of something, it can go away!

A local copy

The original issue was that when a work was deleted from AO3, the bookmark just showed as a “deleted work” — no information about what was lost. So I started to write a program to download my list of bookmarks, so I would know in the future what works I lost access to. But then, I realized, I might as well just download everything.

There are some existing AO3 archivers out there (such as ao3downloader, AO3 bulk downloader, FanFicFare (a Calibre plug-in), and several other tools I've seen referenced over the years). However, by writing my own, I am in control!

In particular:

my program keeps the old bookmark lists, so I can compare and find deleted works
my program doesn't overwrite changed works, so I can keep the history of the work

Overview

This is more of a “system” than a program. Everything lives in a parent folder (~/Documents/ao3/ on my computer).

There are four programs:

download.pl — the program that interacts with AO3 to download bookmarks, series, and works
compare_bookmarks.pl — the program that compares the two most recent bookmarks listings to show the differences
find_double_bookmarked.pl — looks for works that I have bookmarked both the work itself and the containing series (often because I bookmarked it before it was part of a series)
manage_duplicates.pl — helps me manage changed works

In addition, there are two helper files:

rundates.txt — a list of each time the program has been run
style.css — a (very) basic CSS file

Finally, there are four important folders:

bookmarks/ — a listing of all bookmarked works from each time the program was run
series/ — series listing pages
works/ — each individual work
works/old/ — older versions of works

Confession

When I did my talk on downloading Tumblr content, I wound up rewriting the program as a part of creating the talk. I had intended to do something similar for this talk; to use the process of reviewing the code to improve it. Sadly, work (and some other things) have conspired against me. The programs discussed here work (I've been using them for over a year), but they definitely need improvement.

(There are also known omissions and desired improvements, which will be discussed later on.)

`download.pl`

download.pl code

download.pl is the heart of the system. At the same time, it's not that complicated:

log into AO3
download the user's bookmarks
save the bookmarks to a file
for each bookmark:
- if it's a series, save the series page, and add the constituent works to the download queue
- if it's a work, save the work
when saving a work, if the work has already been saved, compare the downloaded work with the saved work; if they are the same, do nothing; if they differ, save the existing work and save the new work for later comparison
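
The last step above can be sketched roughly as follows. This is a minimal illustration, not the code from download.pl: the names `save_work`, `read_file`, and `write_file` are hypothetical, and the run dates are simply passed in as strings. It only shows the "new / unchanged / changed" decision and the renaming of the existing file.

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Hypothetical helper: store $html for $work_id under $dir.
# Returns 'new', 'unchanged', or 'changed'.
sub save_work {
    my ($dir, $work_id, $html, $prev_run, $this_run) = @_;
    my $plain = "$dir/${work_id}.html";

    # Never seen before: just save it under the plain name.
    if (!-e $plain) {
        write_file($plain, $html);
        return 'new';
    }

    # Already saved and identical: nothing to do.
    return 'unchanged' if read_file($plain) eq $html;

    # Different: keep both versions side by side for later comparison.
    move($plain, "$dir/${work_id}-${prev_run}.html")
        or die "Cannot rename $plain: $!";
    write_file("$dir/${work_id}-${this_run}.html", $html);
    return 'changed';
}

sub read_file {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
    local $/;    # slurp the whole file
    return scalar <$fh>;
}

sub write_file {
    my ($path, $content) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
}
```

Note that once a work has been reported as 'changed', the plain file no longer exists, so a second call for the same work in the same run reports it as 'new'.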

For each type of file saved (bookmarks, series, works), the program also processes them to clean them up. In particular, for works, there are parts that change a lot, which make comparisons more complicated, so they need to be removed.

`download.pl` — things to note

AO3 doesn't provide an API, so the program has to resort to “screen scraping”. In addition, because some bookmarks point to “locked” works, the program needs to be logged in.
Thus, in order to manage the session handling, and some issues with the forms, I wound up using WWW::Mechanize instead of HTTP::Tiny as I originally intended to.
AO3 rate limits access, especially to expensive pages like bookmarks. So download.pl has sleep() statements in several places to avoid the rate limiter. This means that the program takes a long time to run — currently 6–8 hours for me.
In order to make sure that we download the entire work, I append ?view_full_work=true to the URL; otherwise it would randomly flip back and forth.
In order to handle changed works, every time the program runs, it appends the current run time to rundates.txt. It also extracts the previous rundate. When a changed work is found, the old file is moved to [work_id]-[previous rundate].html, and the new work is saved as [work_id]-[current rundate].html. This does introduce a potential issue; see the discussion at find_double_bookmarked.pl.
We had a talk in the last year about Mojo::DOM, but even with that discussion, I'm still using Mojo::DOM58.
Using Mojo::DOM58 is different from most Perl programming; it feels very much like using jQuery in JavaScript.
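
The rundates.txt bookkeeping and the full-work URL from the notes above could look something like this. It is a sketch under assumptions: the function names `record_run` and `full_work_url` are hypothetical, the file is assumed to hold one run date per line, and the date format is up to the caller.

```perl
use strict;
use warnings;

# Append the current run date to the log file and return
# (previous run date, current run date). The previous run date is
# undef on the very first run.
sub record_run {
    my ($path, $now) = @_;
    my $previous;
    if (open my $in, '<', $path) {
        while (my $line = <$in>) {
            chomp $line;
            $previous = $line if length $line;    # last non-empty line wins
        }
        close $in;
    }
    open my $out, '>>', $path or die "Cannot append to $path: $!";
    print {$out} "$now\n";
    close $out or die "Cannot close $path: $!";
    return ($previous, $now);
}

# Build the URL that asks AO3 for the entire work on one page.
sub full_work_url {
    my ($work_id) = @_;
    return "https://archiveofourown.org/works/${work_id}?view_full_work=true";
}

# Possible use (the date format here is an assumption):
#   use POSIX qw(strftime);
#   my ($previous, $current) =
#       record_run('rundates.txt', strftime('%Y-%m-%d-%H%M%S', localtime));
```

Choosing a date format that sorts as plain text keeps the dated file names in chronological order.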

Some useful Mojo::DOM/58 tricks:

Select a particular node:
```
$dom->at('<selector>')
```

Create a new DOM tree from a node:
```
my $dom2 = Mojo::DOM58->new($dom->at('<selector>'));
```

Remove a bunch of items from the dom:
```
$dom->find('<selector>')->map('remove');
```

Modify a node:
```
# Use one of the following:
$dom->find('<selector>')->map( sub { ...; } );
$dom->find('<selector>')->each( sub { ...; } );
$dom->at('<selector>')->tap( sub { ...; } );
```

Sadly, you cannot use XPath selectors, but it does support some fairly modern CSS selectors.
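
Put together, a cleanup pass over a downloaded page might look like the sketch below. The HTML and the selectors (`dl.stats`, `h2.title`, `div.chapter`, `#work`) are made up for illustration and are not AO3's actual markup; only the Mojo::DOM58 calls are the point.

```perl
use strict;
use warnings;
use Mojo::DOM58;

# Hypothetical markup: the ids and class names are illustrative only.
my $html = <<'HTML';
<div id="work">
  <dl class="stats"><dt>Bookmarks:</dt><dd>42</dd></dl>
  <h2 class="title">An Example Work</h2>
  <div class="chapter"><p>First paragraph.</p></div>
  <div class="chapter"><p>Second paragraph.</p></div>
</div>
HTML

my $dom = Mojo::DOM58->new($html);

# Remove the parts that change from one download to the next.
$dom->find('dl.stats')->map('remove');

# Select one node and read its text.
my $title = $dom->at('h2.title')->text;

# Modify every matching node: number the chapters.
my $n = 0;
$dom->find('div.chapter')->each(sub { $_->attr('data-index' => ++$n) });

# Build a new, smaller DOM tree from a single node
# (to_string makes the stringification explicit).
my $work = Mojo::DOM58->new($dom->at('#work')->to_string);

print "$title\n";
print $work->find('div.chapter')->size, " chapters\n";
```

Removing the volatile nodes before saving means that a later string comparison of two downloads only trips on changes that matter.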

`download.pl` — known issues and improvements

There are some things I know I need to fix with download.pl.

Downloading images — several works either include external images, or are entirely external images. The complications here involve handling changes to works.
Author styles — authors can add their own stylesheets to works; right now, my program doesn't handle these at all.
Reformatting HTML — I would like to “pretty-print” the HTML, which I'm having difficulty doing well.
Fix links to other works, so that they work.
Support series history — right now, although changes to works can be saved, changes to series get lost. It would be nice to add this, although for various reasons I don't worry as much about this.
Enhance the style.css file — it's very bare-bones (actually, it's less than bare-bones), and at some point I should make things look nicer.

`compare_bookmarks.pl`

compare_bookmarks.pl code

compare_bookmarks.pl could be considered the reason I created this whole system. It takes the two most recent bookmarks files and compares them, listing any new or, crucially, deleted bookmarks.

However, for all that it's the raison d'être of the project, there's not much to say about it that the code doesn't cover.

It has one flaw: right now, it is hard-coded to only compare the two most recent bookmark downloads. It should be enhanced to allow comparing any two bookmark files.
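
The core of such a comparison is a pair of hash lookups. The sketch below is not the actual compare_bookmarks.pl: `compare_bookmarks` and `pick_listings` are hypothetical names, the bookmark identifiers are invented, and `pick_listings` assumes that the file names in bookmarks/ sort chronologically. It also shows one possible way to allow comparing any two listings: explicit arguments take priority over the two most recent files.

```perl
use strict;
use warnings;

# Given two array refs of bookmark ids (older run, newer run), return
# the ids that were added and the ids that disappeared.
sub compare_bookmarks {
    my ($old, $new) = @_;
    my %in_old = map { $_ => 1 } @$old;
    my %in_new = map { $_ => 1 } @$new;
    my @added   = grep { !$in_old{$_} } @$new;
    my @removed = grep { !$in_new{$_} } @$old;
    return (\@added, \@removed);
}

# Choose which two listings to compare: two explicit arguments win,
# otherwise fall back to the two most recent files in $dir
# (assuming the file names sort chronologically).
sub pick_listings {
    my ($dir, @args) = @_;
    return @args if @args == 2;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = map { "$dir/$_" } sort grep { !/^\./ } readdir $dh;
    closedir $dh;
    die "Need at least two bookmark listings in $dir\n" if @files < 2;
    return @files[-2, -1];
}

my ($added, $removed) = compare_bookmarks(
    [qw(works/101 works/102 series/7)],
    [qw(works/102 series/7 works/205)],
);
print "new: @$added\n";
print "gone: @$removed\n";
```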

`find_double_bookmarked.pl` — the triple-file problem

Occasionally, I have bookmarked both a particular work, and a series it belongs to.

This usually happens when I bookmarked the work first, and then it turned into a series.

If such a work is updated, things get a little wonky. When download.pl first finds that it has changed, it renames the extant work and saves the new work in such a way that they are easy to find and compare later on. However, this means that the “plain” file ([work_id].html) has been moved, and thus no longer exists. So when the work is processed the second time (once as a bookmarked work, once as part of the bookmarked series), it looks like a new work, and is just saved. This causes the “triple-file” problem, where [work_id]-[previous rundate].html, [work_id]-[current rundate].html, and [work_id].html all exist. Usually, [work_id]-[current rundate].html and [work_id].html are identical, but it did once happen that a work updated during the run, which caused them to differ. (If something updates several times during the run, and is included in multiple series, I might have a quadruple-file problem, which will cause something to be lost; I have no good ideas here...)

The manage_duplicates.pl script cannot handle the triple-file problem, so I need a tool to help me find these double-bookmarked works.
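
Independently of the bookmark listings, the triple-file situation can also be spotted directly from the file names in works/. The following is a minimal sketch, assuming numeric work ids and the [work_id]-[rundate].html naming described earlier; `find_triples` is a hypothetical name.

```perl
use strict;
use warnings;

# Given a list of file names, return the work ids that have a plain
# file ([work_id].html) as well as two or more dated files
# ([work_id]-[rundate].html).
sub find_triples {
    my (@names) = @_;
    my (%plain, %dated);
    for my $name (@names) {
        if ($name =~ /\A(\d+)\.html\z/) {
            $plain{$1} = 1;
        }
        elsif ($name =~ /\A(\d+)-.+\.html\z/) {
            $dated{$1}++;
        }
    }
    return sort { $a <=> $b }
           grep { $plain{$_} && $dated{$_} >= 2 } keys %dated;
}

my @triples = find_triples(qw(
    100.html
    200.html 200-2020-01-01.html 200-2020-02-01.html
    300-2020-01-01.html 300-2020-02-01.html
));
print "triple-file works: @triples\n";
```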

`find_double_bookmarked.pl`

find_double_bookmarked.pl code

In order to delete a bookmark, you have to manage it from your bookmarks page — there is no way to delete a bookmark directly from a work (or series) that I have found. (This is fine when you have 10 bookmarks; less so when you have 1000.) So this tool also provides the necessary URL to take you to the bookmark management screen that has the delete button.

Much like compare_bookmarks.pl, this program is fairly straightforward.

It does not capture the case where a work is part of two bookmarked series, only the case where a work and an enclosing series are both bookmarked.
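
The check itself boils down to a set intersection. Here is a sketch with invented ids, assuming the bookmarked works and the contents of each bookmarked series have already been extracted from the saved pages; `find_double_bookmarked` is a hypothetical function, not the script's real interface.

```perl
use strict;
use warnings;

# $bookmarked_works: array ref of directly bookmarked work ids.
# $series_works: hash ref mapping a bookmarked series id to an
#                array ref of the work ids it contains.
# Returns a hash ref: work id => array ref of enclosing series ids.
sub find_double_bookmarked {
    my ($bookmarked_works, $series_works) = @_;
    my %direct = map { $_ => 1 } @$bookmarked_works;
    my %found;
    for my $series (sort keys %$series_works) {
        for my $work (@{ $series_works->{$series} }) {
            push @{ $found{$work} }, $series if $direct{$work};
        }
    }
    return \%found;
}

my $doubles = find_double_bookmarked(
    [ 1, 2, 3 ],
    { 10 => [ 2, 5 ], 11 => [ 3, 2 ] },
);
for my $work (sort { $a <=> $b } keys %$doubles) {
    print "work $work is also in series @{ $doubles->{$work} }\n";
}
```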

`manage_duplicates.pl`

manage_duplicates.pl code

Having downloaded the works, I now need to deal with the history aspect. I want to keep older versions where there are changes that matter. The definition, though, of changes that matter, is somewhat vague. Some of the ones that I want to preserve are:

new text, changed text, or deleted text
changes to the author's account name
changes to the series order (although this might better be preserved by keeping the series history)

Changes I don't particularly care about include:

new translations
new works inspired by this one
changes in whitespace

(Some changes I have already removed, such as the set of collections including this work, or the number of bookmarks, etc.)
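
As an illustration of filtering out one kind of uninteresting change, whitespace differences can be ignored by normalizing both versions before comparing them. This is a deliberately crude sketch with hypothetical function names; it treats the HTML as plain text, so it would also hide whitespace changes inside `<pre>` blocks.

```perl
use strict;
use warnings;

# Collapse every run of whitespace to a single space and trim the ends.
sub normalize_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/\A //;
    $text =~ s/ \z//;
    return $text;
}

# True if the two versions differ in more than whitespace.
sub differs_meaningfully {
    my ($old, $new) = @_;
    return normalize_whitespace($old) ne normalize_whitespace($new);
}

print "whitespace only\n"
    unless differs_meaningfully("<p>a  b</p>\n", "<p>a b</p>");
print "real change\n"
    if differs_meaningfully("<p>a b</p>", "<p>a c</p>");
```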

Initially, the process was manual: run download.pl, find the changed works, compare each pair using BBEdit's “find differences” functionality, then either get rid of the old version or move it to the old/ folder, and finally rename the new version.

This started to get old, though, so I wrote manage_duplicates.pl to manage the process for me.

`manage_duplicates.pl` — pure Perl?

The reason I was using BBEdit to compare the files (other than it being handy) is that BBEdit displays the character-level changes, not just the line-level changes. In my roundup of the 2018 Perl Mongers Advent Calendar, I noted day 15, App::ccdiff, which does character-level diffs.

Now that I can calculate the diff, I need a way to tell the program what to do with each work. After reading about various prompt modules, I chose IO::Prompt::Simple; in large part because it allowed me to easily specify the color of the prompt — after the display from ccdiff, I needed something of a different color to see the prompt.

Then I discovered that many of the diffs were long, and I needed to use a pager to let me see the start of the diff. Jarred has talked about IO::Pager. I also had to play around with the LESS environment variable to make things work the way I wanted.

`manage_duplicates.pl` — or not…

If you've looked at the code though, you'll have noticed that we're not using the pure Perl solution. There are two reasons for this:

first, the terminal window interface just wasn't as nice as the windowed interface, with side-by-side comparison, that I had in BBEdit. Crucially, there is a BBEdit command-line tool that lets me use its diff interface as a part of the program (bbdiff --wait --resume);
and second, something is broken. When the diff is large (especially when the lines are very long), ccdiff would sometimes crash. And if it didn't crash, and you didn't go page-by-page all the way to the end of the diff in less, sometimes IO::Pager would crash. (I could never fully reproduce the problem, so I don't think I ever filed a bug report — sometimes it worked, sometimes it didn't…)

So, in the end, I went back to using Perl to glue together some external programs.
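
Gluing in an external diff tool can be as small as the sketch below. The `bbdiff --wait --resume` flags come from the description above; the function names and the order of the two file arguments (old first, new second) are assumptions, and the exit status is deliberately not interpreted.

```perl
use strict;
use warnings;

# Build the external command as a list, so that file names with
# spaces need no shell quoting.
sub diff_command {
    my ($old, $new) = @_;
    return ('bbdiff', '--wait', '--resume', $old, $new);
}

# Launch the diff and block until the tool returns. Only a failure to
# start the program is reported; the meaning of the exit status is
# left to the tool's documentation.
sub show_diff {
    my ($old, $new) = @_;
    my @cmd = diff_command($old, $new);
    system(@cmd);
    warn "Could not run $cmd[0]: $!\n" if $? == -1;
    return;
}
```

Passing a list to system() bypasses the shell entirely, which matters when the work files live under a path containing spaces.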

Links

The code:

Some fun fanfiction:

DC Comics/Batman:
- Able to Succeed by Betty — Batman gives Tim Drake a written test.
- Into the Brighter Night by shoalsea — Tim Drake is kidnapped off planet and comes back.
Harry Potter:
- Petrification Proliferation by White_Squirrel — Wizarding Britain reacts appropriately to a basilisk in school.
- There May Be Some Collateral Damage by metisket — A crossover with the manga “Bleach”: a grim reaper is sent to act as Harry Potter's bodyguard.
Star Wars:
- Remedial Jedi Theology by MarbleGlove — a different path for Obi-Wan Kenobi to teach Anakin Skywalker.
Avengers/MCU:
- Infinite Coffee and Protection Detail by owlet — “The mission resets abruptly, from objective: kill to objective: protect”; I can't do much better than that…
Valdemar:
- Friends Across Borders by MueraRashaye — “Two long-time enemy nations can't become meaningful allies overnight”, but maybe there are people helping…
A host of Lord Peter Wimsey:
- Green Ice by Adina
- Amidst a Tumultuous Sea by Adina
- Dramatis personae by butterflymind
- The Frivolous Fable of a Spinster's Suspicions by Nineveh_uk
- The Elevated Investigation of the Empty Gondola by Nineveh_uk
- Oh, It's a Lovely War! by Azdak
- The Conscience of the Queen by Ione
- The Royal Society by copperbadge
