Digital pack-ratting for Tumblr

Introduction

Ephemeral

I like to re-read books. And re-watch movies. It should be no surprise, then, that I also like to re-view things on the internet. Jokes, memes, inspiring pictures — these are all things that I like to go back to from time to time. However, things on the internet are temporary, because I don’t control the server they are on.

A common source of things that I want to save, especially images, is Tumblr. See, for example, Ruined Childhood, Star Trek Ships, Colour my world, and 99 % Invisible.

This talk describes a program I wrote to help me save some of these Tumblr posts for when I’m bored, depressed, or 90.

My goal

I want to keep some Tumblr posts for looking back on
As much as possible, I want everything to be in one file
- data URIs (Wikipedia)
I want to simplify and unify the presentation of posts, because some Tumblrs are impossible to read
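
As a rough illustration of the data URI idea, here is a minimal sketch that turns raw bytes into something usable as an img src. The sub name make_data_uri and the three-byte payload are made up for the example; a real image would be read from disk or fetched over HTTP.

```perl
use strict;
use warnings;
use MIME::Base64 'encode_base64';

# build a data URI from raw bytes and a MIME type
sub make_data_uri {
    my ( $bytes, $mime_type ) = @_;

    # the second argument to encode_base64 is the line ending;
    # an empty string keeps the output on a single line
    return 'data:' . $mime_type . ';base64,' . encode_base64( $bytes, '' );
}

# hypothetical payload: three bytes standing in for real image data
my $uri = make_data_uri( "abc", 'image/png' );
print '<img src="' . $uri . '" />' . "\n";
```

The whole image travels inside the HTML file, which is what makes the "everything in one file" goal possible.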

But how?

I tried to write something to grab the data from HTML
This sucked
Tumblr has an API
It requires an account, and an API key
I accidentally discovered that you can append ?format=json to any Tumblr post
Examples:
- One photo: this shot of the USS Enterprise (D), as JSON(-ish)
- Multiple photos: really cool photos of weather, as JSON(-ish)
- Video: this really amazing flautist, as JSON(-ish)

Perl 5 Code

A rewrite

I started to write this talk, but then I realized that the program was poorly written
So, what I'm presenting here is a complete rewrite, almost from scratch
Here is my actual original program
Here is the rewrite in toto

Prolog

This is straightforward
I use /usr/bin/env so that I can use a locally-installed (i.e., Perlbrew-based) perl
(But see the note when used with my RSS reader)
Because I have embedded a template further down, and it has UTF-8 characters, I need to tell Perl that the source file is UTF-8

#!/usr/bin/env perl
use Modern::Perl '2012';
use experimental 'switch', 'postderef';
use utf8;

Modules

# handle command line parameters
use Getopt::Long;

# retrieve network data
use HTTP::Tiny;
use JSON::Tiny 'decode_json';

# template
use Data::Section -setup;
use Template::Simple;
use Encode;

# transcode images into data urls
use MIME::Base64 'encode_base64';
use Mojo::DOM58;

(I'm really annoyed by the Mojo::DOM58 dependency)

Command-line

This is very basic command-line support
I'm the primary, and only, customer, so I know what's expected

# get command line options
my $url        = '';
my $basedir    = '.';
my $only_photo = 0;
GetOptions( "url=s" => \$url, "basedir=s" => \$basedir, "photo" => \$only_photo )
  or die("Could not parse options.\n");

# make sure we have a url
if ( $url eq '' ) { die("You must provide a url.\n"); }

# normalize the basedir
if ( $basedir !~ m[/$] ) { $basedir .= '/'; }
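
To see how the options behave without touching @ARGV, the same parsing can be wrapped in a sub and fed a list. This is only a sketch: parse_args and the example.tumblr.com URL are invented for illustration, while GetOptionsFromArray is the Getopt::Long variant that reads from an array instead of @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long 'GetOptionsFromArray';

# parse a list of arguments the same way the script parses @ARGV
sub parse_args {
    my (@args) = @_;

    my $url        = '';
    my $basedir    = '.';
    my $only_photo = 0;
    GetOptionsFromArray( \@args, "url=s" => \$url, "basedir=s" => \$basedir, "photo" => \$only_photo )
      or die("Could not parse options.\n");

    # same checks as the script: a url is required, basedir ends in a slash
    if ( $url eq '' ) { die("You must provide a url.\n"); }
    if ( $basedir !~ m[/$] ) { $basedir .= '/'; }

    return ( $url, $basedir, $only_photo );
}

# hypothetical invocation, using the single-dash form
my ( $url, $basedir, $only_photo ) = parse_args( '-url=http://example.tumblr.com/post/1', '--photo' );
print "$url $basedir $only_photo\n";
```

With its default configuration, Getopt::Long accepts long option names after either one dash or two, so -url=... and --url ... both work.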

RSS

This can be tied into my RSS reader (Vienna) with a plug-in:

tell application "Vienna"
    try
        set article_url to link of current article
        do shell script "/Users/pukku/Pictures/tumblr/download_tumblr.pl -url=" & article_url
    on error errMsg
        display dialog "An error occurred: " & errMsg
    end try
end tell

(On my computer, I have preset the $basedir variable to the correct path)
(Also, you can't just use /usr/bin/env, but must actually have the path to the correct perl, because the environment on Mac OS X for GUI applications is complicated to correctly configure)

Main flow

# get the tumblr data and the common elements
my $tumblr_data = get_tumblr_data($url);
my $post_data = get_common_data( $tumblr_data->{'tumblelog'}, $tumblr_data->{'posts'}[ 0 ] );

# if we can, or want to, download only the photo, short circuit the rest of the program
if ( ( $post_data->{'_type'} eq 'photo' ) and ( $only_photo or should_download_only_photo($post_data) ) ) {
    download_only_photo($post_data);
}

# specialize on the tumblr type
given ( $post_data->{'_type'} ) {
    ...;
}

# render the data
my $output = create_output_html($post_data);

# make all images data urls
convert_imgs($output);

# save the data
save_post($output);

I'll expand on that given statement later
Also, the “short circuit” bit…

`get_tumblr_data`

You can append ?format=json to a Tumblr post URL to get back a JSON-ish version
- (if you prefer, you can append ?format=xml, but I'm using JSON)
I originally did this using LWP::Simple, but later needed to use LWP::UserAgent as well, so I scrapped both for HTTP::Tiny
The “JSON” returned by Tumblr isn't really JSON; it's actually JavaScript, so I need to massage it
I originally used Carp (well, croak()) instead of die(), but it doesn't respect the "\n" at the end of the error message, which was ugly

# given a url, presume it is a tumblr-compliant url.
# get the data in json format and decode it
sub get_tumblr_data {
    my ($url) = @_;
    $url .= '?format=json';    # get data as json

    my $data = HTTP::Tiny->new->get($url);
    if ( !$data->{'success'} ) {
        die( "Error retrieving post: " . $data->{'status'} . ' ' . $data->{'reason'} . "\n" );
    }
    if ( $data->{'content'} !~ m/^var tumblr_api_read = / ) {
        die("Error retrieving data: doesn't appear to be a tumblr.\n");
    }

    # tumblr returns Javascript, not really JSON
    my $content = ( $data->{'content'} =~ s/^var tumblr_api_read = //r );
    $content =~ s/;$//;
    my $json = decode_json($content);

    # make sure it looks right, in case we got some other sort of json
    # or for some reason retrieved too many posts
    if ( !defined( $json->{'tumblelog'} ) ) {
        die("Malformed JSON data received.\n");
    }
    if ( !defined( $json->{'posts'} ) ) {
        die("Not enough posts received.\n");
    }
    if ( scalar( $json->{'posts'}->@* ) != 1 ) {
        die( "Too many posts received (" . scalar( $json->{'posts'}->@* ) . ").\n" );
    }

    return $json;
}

The s///r construct is easier to read, in terms of parentheses, than the old version:

( my $content = $data->{'content'} ) =~ s/^var tumblr_api_read = //;
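
The unwrapping step can be tried offline. The sketch below assumes a heavily trimmed, made-up response string and uses the core JSON::PP module in place of JSON::Tiny; unwrap_tumblr_js is a hypothetical name, not part of the program.

```perl
use strict;
use warnings;
use JSON::PP 'decode_json';

# strip the JavaScript assignment around the payload and decode what is left
sub unwrap_tumblr_js {
    my ($js) = @_;

    if ( $js !~ m/^var tumblr_api_read = / ) {
        die("Doesn't appear to be a tumblr.\n");
    }

    # s///r returns the modified copy and leaves $js untouched
    my $content = ( $js =~ s/^var tumblr_api_read = //r );
    $content =~ s/;\s*$//;
    return decode_json($content);
}

# hypothetical, heavily trimmed response
my $sample = 'var tumblr_api_read = {"tumblelog":{"name":"example"},"posts":[{"id":"1","type":"photo"}]};' . "\n";
my $json   = unwrap_tumblr_js($sample);
print $json->{'tumblelog'}{'name'}, ' ', scalar( @{ $json->{'posts'} } ), "\n";
```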

`get_common_data`

“does what it says on the tin”

# take the tumblr json and extract the elements we want to use
sub get_common_data {
    my ( $t, $p ) = @_;

    my $post = {
        '_tumblr' => $t,
        '_post'   => $p,
        '_type'   => $p->{'type'},

        'tumblr_key'   => $t->{'name'}  || 'unknown',
        'tumblr_title' => $t->{'title'} || 'Unknown',

        'post_id'   => $p->{'id'},
        'post_slug' => $p->{'slug'} || $p->{'type'},
        'post_url'  => $p->{'url-with-slug'} || $p->{'url'},
        'post_date' => $p->{'date-gmt'},
        'post_tags' => '',

        'post_body' => '',
    };

    # fix tags
    if ( defined( $p->{'tags'} ) and scalar( $p->{'tags'}->@* ) ) {
        $post->{'post_tags'} = { 'tags' => [ map { { 'tag' => $_ }; } $p->{'tags'}->@* ] };
    }

    return $post;
}

the extra wrapper object on the tags is required to make Template::Simple work properly
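
Here is the tag reshaping on its own, as a sketch with invented tag names and a hypothetical wrap_tags sub. The outer hash corresponds to the post_tags block in the template and the inner array to the repeated tags block, so a post without tags can skip the whole thing.

```perl
use strict;
use warnings;

# wrap a flat list of tags in the nested shape used by the template
sub wrap_tags {
    my (@tags) = @_;

    # no tags: keep the empty-string default so the outer block is skipped
    return '' unless @tags;

    # the leading + tells perl the braces are a hash ref, not a block
    return { 'tags' => [ map { +{ 'tag' => $_ } } @tags ] };
}

my $wrapped = wrap_tags( 'star trek', 'ships' );
print join( ', ', map { $_->{'tag'} } @{ $wrapped->{'tags'} } ), "\n";
```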

specialize on the type

back to the given statement from before

# specialize on the tumblr type
given ( $post_data->{'_type'} ) {
    when ('regular') { add_regular_data($post_data); }
    when ('answer')  { add_answer_data($post_data); }
    when ('quote')   { add_quote_data($post_data); }
    when ('photo')   { add_photo_data($post_data); }
    when ('video')   { add_video_data($post_data); }

    default {
        die( "Unknown post type: " . $post_data->{'_type'} . "\n" );
    }
}
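
Since given/when sits behind the experimental 'switch' feature, the same dispatch can also be written with a plain hash of code references. This is a sketch of that alternative, not what the program does; the two stand-in handlers and the specialize sub are invented for the example.

```perl
use strict;
use warnings;

# stand-in handlers; the real ones are the add_*_data subs in the talk
sub add_regular_data { $_[0]->{'post_body'} = 'regular'; }
sub add_photo_data   { $_[0]->{'post_body'} = 'photo'; }

# map each post type to the sub that handles it
my %handler_for = (
    'regular' => \&add_regular_data,
    'photo'   => \&add_photo_data,
);

# look up the handler for this post's type, or complain like the default block
sub specialize {
    my ($post_data) = @_;
    my $handler = $handler_for{ $post_data->{'_type'} }
      or die( "Unknown post type: " . $post_data->{'_type'} . "\n" );
    $handler->($post_data);
    return $post_data;
}

my $post = specialize( { '_type' => 'photo', 'post_body' => '' } );
print $post->{'post_body'}, "\n";
```

Adding a new post type then means adding one line to the hash.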

the straight-forward ones

“regular”, “answer”, and “quote” post types are fairly similar and straightforward

sub add_regular_data {
    my ($post) = @_;
    my $source = $post->{'_post'};

    if ( defined( $source->{'regular-title'} ) and ( $source->{'regular-title'} ne '' ) ) {
        $post->{'post_body'} = '<h2>' . $source->{'regular-title'} . '</h2>' . "\n\n";
    }

    $post->{'post_body'} .= $source->{'regular-body'};
}

sub add_answer_data {
    my ($post) = @_;
    my $source = $post->{'_post'};

    $post->{'post_body'} = ''
      . '<div class="answer-q">' . "\n" . $source->{'question'} . "\n" . '</div>' . "\n"
      . '<div class="answer-a">' . "\n" . $source->{'answer'} . "\n" . '</div>';
}

sub add_quote_data {
    my ($post) = @_;
    my $source = $post->{'_post'};

    $post->{'post_body'} = ''
      . '<div class="quote-text">' . "\n" . $source->{'quote-text'} . "\n" . '</div>' . "\n"
      . '<div class="quote-source">' . "\n" . $source->{'quote-source'} . "\n" . '</div>';
}

`add_photo_data`

sub add_photo_data {
    my ($post) = @_;
    my $source = $post->{'_post'};

    $post->{'post_photos'} = [];

    # if there is only one photo, tumblr provides the photo data at the top level
    # however, if there are multiple photos, it uses the photos array.
    if ( !scalar( $source->{'photos'}->@* ) ) {
        push $post->{'post_photos'}->@*, {
            'caption' => $source->{'photo-caption'},
            'url'     => $source->{'photo-url-1280'},
        };
    }
    else {
        foreach my $ph ( $source->{'photos'}->@* ) {
            push $post->{'post_photos'}->@*, {
                'caption' => $ph->{'caption'},
                'url'     => $ph->{'photo-url-1280'},
            };
        }
    }
}

`add_video_data`

I hate this sub
It's a hack on top of a hack on top of an atrocity
Mostly, it's because the Tumblr data sucks; they don't make the URL of the video easily accessible
Because even I don't want to handle putting a video into a data URI, I had to introduce downloading the videos
I've never seen a post with more than one video though, so naming isn't (yet) an issue
I needed to be able to parse a DOM, so I use Mojo::DOM58; I wish I didn't need to use it though

sub add_video_data {
    my ($post) = @_;
    my $source = $post->{'_post'};

    $post->{'post_videos'} = [];

    # although this could be false, I assume that there is only one video
    # and only one source
    my $dom           = Mojo::DOM58->new( $source->{'video-player'} );
    my $video_wrapper = $dom->find('video[data-crt-options]')->[ 0 ];
    my $video_source  = $dom->find('video source')->[ 0 ];

    if ( $video_wrapper and $video_source ) {
        my $video_options = decode_json( $video_wrapper->{'data-crt-options'} );
        if ( $video_options->{'hdUrl'} ) {
            push $post->{'post_videos'}->@*, {
                'url'     => $video_options->{'hdUrl'},
                'name'    => $post->{'post_id'} . '.mp4',
                'source'  => $source->{'video-source'} . "\n" . $source->{'video-player'},
                'caption' => $source->{'video-caption'},
            };
        }
        elsif ( ( $video_source->{'type'} eq 'video/mp4' ) and ( $video_source->{'src'} =~ m/video_file/ ) ) {
            push $post->{'post_videos'}->@*, {
                'url'     => $video_source->{'src'},
                'name'    => $post->{'post_id'} . '.mp4',
                'source'  => $source->{'video-source'} . "\n" . $source->{'video-player'},
                'caption' => $source->{'video-caption'},
            };
        }
        else {
            warn( "Unsupported video type.\n" . $source->{'video-player'} . "\n" );
            push $post->{'post_videos'}->@*, {
                'url'     => '',
                'name'    => $post->{'post_id'},
                'source'  => $source->{'video-source'} . "\n" . $source->{'video-player'},
                'caption' => $source->{'video-caption'},
            };
        }
    }
    else {
        warn( "Video data does not seem to make sense:\n" . $source->{'video-player'} . "\n" );
        push $post->{'post_videos'}->@*, {
            'url'     => '',
            'name'    => $post->{'post_id'},
            'source'  => $source->{'video-source'} . "\n" . $source->{'video-player'},
            'caption' => $source->{'video-caption'},
        };
    }

    foreach my $v ( $post->{'post_videos'}->@* ) {
        if ( $v->{'url'} ) {
            push $post->{'_download'}->@*, {
                'url'  => $v->{'url'},
                'file' => $v->{'name'},
            };
        }
        $v->{'controller'} = '<video src="' . $v->{'name'} . '" width="" height="" controls preload allowfullscreen></video>';
    }
}

`create_output_html`

I use Template::Simple here, mostly because I know it fairly well
I've looked at Mustache, but I can't find a module that seems right
The template itself lives in the script's __DATA__ section, and Data::Section reads it back by name

# given post data, create an "output" collection with the rendered HTML
sub create_output_html {
    my ($post) = @_;

    my $tmpl     = Template::Simple->new();
    my $template = ${ main->section_data('wrapper') };
    my $rendered = ${ $tmpl->render( \$template, $post ) };

    # Template::Simple tends to leave behind whitespace
    $rendered =~ s/\n\t\n/\n/g;
    $rendered =~ s/\n\n\n+/\n\n/g;

    my $output = {
        'type' => 'html',
        'dir'  => $basedir . $post->{'tumblr_key'},
        'file' => $post->{'post_id'} . '--' . $post->{'post_slug'} . '.html',
        'data' => $rendered,
    };

    # copy over information on downloads, if they exist
    if ( $post->{'_download'} ) {
        push $output->{'download'}->@*, $post->{'_download'}->@*;
    }

    return $output;
}
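
The two whitespace substitutions are easy to check in isolation. A small sketch, with a made-up fragment of rendered output and a hypothetical tidy_whitespace sub:

```perl
use strict;
use warnings;

# collapse the blank-ish lines a template engine can leave behind
sub tidy_whitespace {
    my ($rendered) = @_;
    $rendered =~ s/\n\t\n/\n/g;       # drop a line holding only a tab
    $rendered =~ s/\n\n\n+/\n\n/g;    # three or more newlines become two
    return $rendered;
}

# hypothetical rendered fragment with leftover whitespace
my $messy = "<head>\n\t\n\t<title>t</title>\n\n\n\n</head>\n";
print tidy_whitespace($messy);
```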

the template

The #bbinclude should, in theory, allow me to change the CSS after the fact (using BBEdit), but I've never tested this
The extra <style> section is so that I can manually add rules if required

__DATA__

__[wrapper]__
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1.0" />
	<meta name="origin" content="[% post_url %]" />
	<meta name="tumblr-type" content="[% _type %]" />
	[% START post_tags %]<meta name="keywords" content="tumblr[% START tags %], [% tag %][% END tags %]" />[% END post_tags %]
	<title>[% tumblr_title %] — [% post_id %] — [% post_slug %]</title>
<!-- #bbinclude "../tumblr.css" -->
	<style type="text/css">
		body {
			margin: 0;
			padding: 0;
			font-family: "Hoefler Text", serif;
		}
		article {
			margin: 0;
			padding: 0;
			padding: 0.5rem;
			background: #EEEEEE;
			border: 0.25rem solid #DDDDDD;
		}
		article header {
			margin: 0;
			margin-bottom: 1rem;
			padding: 0;
			border-bottom: thin solid #BBBBBB;
		}
		article header h1 {
			margin: 0;
			padding: 0;
			margin-top: 1rem;
			margin-bottom: 0.25rem;
			font-size: 1.2rem;
		}
		article header p.date {
			margin: 0;
			padding: 0;
			margin-top: 0.5rem;
			margin-bottom: 0.5rem;
			font-size: 0.8rem;
		}
		article header p.tags {
			margin: 0;
			padding: 0;
			margin-top: 0.5rem;
			margin-bottom: 0.5rem;
			font-size: 0.8rem;
			font-style: italic;
		}
		article header p.tags span.tag {
			padding-right: 1rem;
		}
		article header p.tags span.tag:before {
			content: '# ';
			color: gray;
		}
		article figure img {
			max-width: 100%;
		}
		article blockquote {
			margin-left: 1rem;
			border-left: thin dashed #CCCCCC;
			padding-left: 0.5rem;
			margin-right: 0;
		}
		video {
			max-width: 100%;
		}
	</style>
<!-- end bbinclude -->
	<style type="text/css">
	</style>
</head>
<body>
<article>
<header>
	<h1><a href="[% post_url %]">[% tumblr_title %] — [% post_id %] — [% post_slug %]</a></h1>
	<p class="date">[% post_date %]</p>
	[% START post_tags %]<p class="tags">[% START tags %]<span class="tag">[% tag %]</span>[% END tags %]</p>[% END post_tags %]
</header>
[% START post_photos %]
<figure>
	<a href="[% url %]"><img src="[% url %]" /></a>
	<figcaption>[% caption %]</figcaption>
</figure>
[% END post_photos %]
[% START post_videos %]
<!-- [% source %] -->
[% controller %]
[% caption %]
[% END post_videos %]
[% post_body %]
</article>
</body>
</html>

`convert_imgs` (version 1)

Basically, take the rendered HTML, and change all <img>s to use data URIs
Since I already had Mojo::DOM58 as a dependency, I figured this would be easy…
(This sub is why I swapped LWP::Simple for HTTP::Tiny — to get the Content-Type header required adding LWP::UserAgent)

# go through the rendered HTML and find any img tags, and change
# them into data urls
sub convert_imgs {
    my ($output) = @_;

    my $html = $output->{'data'};
    my $dom  = Mojo::DOM58->new($html);
    my $ht   = HTTP::Tiny->new();
    $dom->find('img')->each(
        sub {
            my $src  = $_->{'src'};
            my $data = $ht->get($src);
            if ( $data->{'success'} ) {
                my $ct = $data->{'headers'}->{'content-type'};
                $_->{'src'} = 'data:' . $ct . ';base64,' . encode_base64( $data->{'content'}, '' );
            }
            else {
                $_->{'src'} = 'error: ' . $data->{'status'} . ' ' . $data->{'reason'};
            }
        }
    );
    $output->{'data'} = $dom->to_string();
}

Sadly, that last line has a hidden problem: when it renders back to a string, the attributes on the tags are alphabetized
That is, you wind up with things like <meta content="photo" name="tumblr-type" />
Although legal as HTML, this really bugs me, so…

`convert_imgs` (version 2)

Although I'm not Tom Christiansen, my task isn't to parse all HTML
I just want to find the src attribute of an <img> tag
Regular expressions?

# go through the rendered HTML and find any img tags, and change
# them into data urls
sub convert_imgs {
    my ($output) = @_;

    my $ht = HTTP::Tiny->new();

    my $_convert_imgs_helper = sub {
        my ($url) = @_;
        my $data = $ht->get($url);
        if ( $data->{'success'} ) {
            my $ct = $data->{'headers'}->{'content-type'};
            return 'data:' . $ct . ';base64,' . encode_base64( $data->{'content'}, '' );
        }
        else {
            return 'error: ' . $data->{'status'} . ' ' . $data->{'reason'};
        }
    };

    $output->{'data'} =~ s/( <img [^>]* src=") ([^"]+) (" [^>]* > )/$1 . $_convert_imgs_helper->($2) . $3/gex;
}

I use an anonymous internal sub, and then evaluate the replacement using the s///e operator
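
To watch the s///gex substitution work without any network access, the helper can be swapped for a stand-in that just marks the URL. In this sketch $fake_convert, convert_imgs_offline and the example.com URLs are all invented; the pattern is the one from the sub above.

```perl
use strict;
use warnings;

# stand-in for the real helper: no network, just a recognisable marker
my $fake_convert = sub {
    my ($url) = @_;
    return 'data:FAKE(' . $url . ')';
};

sub convert_imgs_offline {
    my ($html) = @_;

    # /x lets the pattern be spaced out, /e evaluates the replacement as code,
    # /g repeats the substitution for every <img> in the document
    $html =~ s/( <img [^>]* src=") ([^"]+) (" [^>]* > )/$1 . $fake_convert->($2) . $3/gex;
    return $html;
}

my $html = '<meta name="tumblr-type" content="photo" />' . "\n"
  . '<a href="http://example.com/a.jpg"><img src="http://example.com/a.jpg" /></a>' . "\n";
print convert_imgs_offline($html);
```

Only the src of the img changes; the href on the link and the attribute order on the meta tag are left exactly as they were.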

`save_post`

Write the data out
Create a directory if necessary
Download any extra content (basically, videos)

# write the data out to the file system. if there are any downloads (ie, videos)
# download them as well
sub save_post {
    my ($output) = @_;

    if ( !-d $output->{'dir'} ) {
        mkdir $output->{'dir'} or die("Could not create output directory '$output->{'dir'}': $!\n");
    }
    my $out;
    if ( $output->{'type'} eq 'html' ) {
        open( $out, '>:encoding(utf-8)', $output->{'dir'} . '/' . $output->{'file'} )
          or die("Could not create output file '$output->{'dir'}/$output->{'file'}': $!\n");
    }
    else {
        open( $out, '>', $output->{'dir'} . '/' . $output->{'file'} )
          or die("Could not create output file '$output->{'dir'}/$output->{'file'}': $!\n");
    }
    print $out $output->{'data'};
    close($out) or die("Could not close output file '$output->{'dir'}/$output->{'file'}': $!\n");

    if ( $output->{'download'} ) {
        my $ht = HTTP::Tiny->new();
        foreach my $dl ( $output->{'download'}->@* ) {
            my $result = $ht->mirror( $dl->{'url'}, $output->{'dir'} . '/' . $dl->{'file'} );
            if ( !$result->{'success'} ) {
                warn("download failed for '$dl->{'url'}' to $dl->{'file'}: $result->{'status'} $result->{'reason'}\n");
            }
        }
    }
}

The reason for the "if" statement around the encoding of the file is …

Downloading only the photos…

All of this is great, but data URIs are not space-efficient
If there are no tags, no caption, and just one photo, why not just download the photo?
So, let's bring back that if statement from the “main flow”

# if we can, or want to, download only the photo, short circuit the rest of the program
if ( ( $post_data->{'_type'} eq 'photo' ) and ( $only_photo or should_download_only_photo($post_data) ) ) {
    download_only_photo($post_data);
}
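
To put a number on the space cost of data URIs: Base64 writes every 3 bytes as 4 characters, so an embedded image is about a third larger than the file on disk. A quick sketch, using a made-up 30,000 byte buffer in place of a real image and a hypothetical base64_overhead sub:

```perl
use strict;
use warnings;
use MIME::Base64 'encode_base64';

# ratio of encoded length to raw length
sub base64_overhead {
    my ($bytes) = @_;
    return length( encode_base64( $bytes, '' ) ) / length($bytes);
}

# hypothetical stand-in for a 30,000 byte image
my $fake_image = "\0" x 30_000;
printf "%.2f\n", base64_overhead($fake_image);    # prints 1.33
```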

`should_download_only_photo`

# check to see if this post is sufficiently unencumbered
# that we can download just the photo data
sub should_download_only_photo {
    my ($post_data) = @_;
    my $p = $post_data->{'_post'};
    if (1
        and ( !defined( $p->{'photo-caption'} ) or ( $p->{'photo-caption'} eq '' ) )    # there is no caption
        and ( !scalar( $p->{'photos'}->@* ) )                                           # there is only one photo
        and ( !defined( $p->{'tags'} ) or !scalar( $p->{'tags'}->@* ) )                # there are no tags
      )
    {
        return 1;
    }
    else {
        return 0;
    }
}

`download_only_photo`

# download the photo directly to disk, skipping the whole HTML template stuff
sub download_only_photo {
    my ($p) = @_;

    # make sure there is an image to download
    if ( !defined( $p->{'_post'}->{'photo-url-1280'} ) ) {
        die("Could not find an image URL to download.\n");
    }

    # retrieve the image
    my $ht   = HTTP::Tiny->new();
    my $data = $ht->get( $p->{'_post'}->{'photo-url-1280'} );
    unless ( $data->{'success'} ) {
        die( "Error retrieving photo only: " . $data->{'status'} . ' ' . $data->{'reason'} . "\n" );
    }

    # get the file name from the url
    my $filename = ( $data->{'url'} =~ s{^.+/([^/]+)$}{$1}r );

    # create an "output" object
    my $output = {
        'type' => 'img',
        'dir'  => $basedir . $p->{'tumblr_key'},
        'file' => $p->{'post_id'} . '--' . $filename,
        'data' => $data->{'content'},
    };

    save_post($output);
    exit(0);
}
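
The file name extraction is worth trying on its own. This sketch uses an invented URL and a hypothetical filename_from_url sub; note that a URL ending in a slash does not match the pattern, so s///r hands it back unchanged.

```perl
use strict;
use warnings;

# keep only what follows the last slash of a URL
sub filename_from_url {
    my ($url) = @_;
    return ( $url =~ s{^.+/([^/]+)$}{$1}r );
}

# hypothetical image URL
print filename_from_url('http://example.com/media/abc123/tumblr_xyz_1280.jpg'), "\n";
```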

Perl 6 code

A challenge

Since I had an extra month, I decided to try re-writing the program in Perl 6
(I then procrastinated for three weeks…)

Great community!

But I couldn't get everything working
However, the #perl6 IRC channel (see here for information) is really helpful!
I asked them many questions, and got a lot of help
I also asked them to look at the whole program, and make suggestions for ways to be more idiomatic
The two biggest things they suggested were changing {' '} to < >, and removing unnecessary parentheses
Also, using “kebab-case” instead of “snake_case”

Prolog

Perl 6's die doesn't support the trick with "\n"
So I had to create my own…

#!/usr/bin/env perl6
use v6;
sub croak { note $^msg; exit(1); } # because Perl 6 doesn't have the Perl 5 "\n" magic for die
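
For comparison, this is the Perl 5 behaviour that the croak sub stands in for, as a small sketch. The die_message sub is invented for the example: a message ending in a newline is reported as-is, and one without gets the file name and line number appended.

```perl
use strict;
use warnings;

# capture the message die() produces, with and without a trailing newline
sub die_message {
    my ($msg) = @_;
    eval { die($msg); };
    return $@;
}

print die_message("You must provide a url.\n");    # printed exactly as given
print die_message("You must provide a url.");      # " at FILE line N.\n" is appended
```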

Modules

# retrieve network data
use HTTP::UserAgent;
use JSON::Tiny;

# template
use Template::Mustache;

# transcode images into data urls
use MIME::Base64;

Command-line & Main flow

Perl 6 has support for command-line arguments built in
There are two special functions: MAIN and USAGE
Even if you don't implement USAGE, you get a basic usage message

sub MAIN (Str :$url!, Str :$base-directory = './', Bool :$photo = False) {

    # change directory to the appropriate base directory
    (try chdir $base-directory) orelse croak("Couldn't change to the base directory '$base-directory': $!");

    # retrieve the data from the web and extract the common elements
    my %tdata = get-tumblr-data($url);
    my %pdata = get-common-data(%tdata);

    # @NOTE not implemented: handling download only photo

    # add the special elements for the particular post type
    add-tumblr-data(%pdata<_type>, %pdata);

    # create an "output" object(ish)
    my %output = create-output-html(%pdata);

    # turn all imgs into data URIs
    convert-imgs(%output);

    # save the output
    save-post(%output);
}

(another error handling syntax)

can specialize on types of errors, or just use default
different variable ($_ vs. $!)

try {
    chdir $base-directory;
    CATCH {
        when X::IO { croak("Couldn't change to the base directory '$base-directory': $_") }
    }
}

`get-tumblr-data`

:s (or :sigspace) regex adverb
:v.elems needed to get a count due to itemization

# retrieve the JSON(ish) data for the post
sub get-tumblr-data (Str $url) {
    my $data = HTTP::UserAgent.new.get($url ~ '?format=json');
    $data.is-success                               or croak("HTTP error retrieving post: {$data.status-line}.");
    $data.content ~~ m:s/^var tumblr_api_read \= / or croak("Error retrieving data: doesn't appear to be a tumblr.");

    # Tumblr actually returns a JavaScript snippet, which we want to turn into JSON
    my $content = $data.content.chomp;
    $content ~~ s:s/^var tumblr_api_read \= //;
    $content ~~ s/\;$//;

    my %json = from-json($content);

    %json<tumblelog>          or croak("Malformed JSON data received.");
    %json<posts>              or croak("Not enough posts received.");
    %json<posts>:v.elems == 1 or croak("Too many posts received ({%json<posts>:v.elems}).");

    return %json;
}

`get-common-data`

# extract the data pieces that we will want for all post types
sub get-common-data (%tdata) {
    my %t = %tdata<tumblelog>;
    my %p = %tdata<posts>[0];

    my %post = (
        _tumblr => %t,
        _post   => %p,
        _type   => %p<type>,

        tumblr_key   => %t<name>  || 'unknown',
        tumblr_title => %t<title> || 'unknown',

        post_id   => %p<id>,
        post_slug => %p<slug>          || %p<type>,
        post_url  => %p<url-with-slug> || %p<url>,
        post_date => %p<date-gmt>,
        post_tags => {},

        post_body => '',
    );

    # fix tags
    %p<tags> and %post<post_tags><tags> = [ %p<tags>.list.map: { tag => $_ } ];

    return %post;
}

specializing on the type

Instead of having a given statement, we can use multi subs (multiple dispatch)
We set up a generic handler for any unknown (or unimplemented) post types

# generic handler for unknown post types
multi add-tumblr-data ($type, %pdata) {
    croak("Unimplemented post type: {%pdata<_type>}.");
}

the straight-forward ones

Because we don't actually care about the type as a variable, we don't need to give it a name
But we can use constraints to get the effect of the given statement
Note how we can use a “pointy block” (->) in the if statement to make variable names a bit shorter!

multi add-tumblr-data ($ where 'regular', %pdata) {
    my %s = %pdata<_post>;

    if %s<regular-title> -> $title {
        %pdata<post_body> = "<h2>$title</h2>\n\n";
    }

    %pdata<post_body> ~= %s<regular-body>;
}

(The rest of the regular post types are left as an exercise for the reader)

Photos

multi add-tumblr-data ($ where 'photo', %pdata) {
    my %s = %pdata<_post>;

    %pdata<post_photos> = [];

    if !%s<photos> {
        %pdata<post_photos>.push: {
            caption => %s<photo-caption>,
            url     => %s<photo-url-1280>,
        };
    }
    else {
        for |%s<photos> -> %ph {
            %pdata<post_photos>.push: {
                caption => %ph<caption>,
                url     => %ph<photo-url-1280>,
            };
        }
    }
}

Video

multi add-tumblr-data ($ where 'video', %pdata) {
    croak("I'm not implementing video for this talk.");
}

`create-output-html`

# create a spec for saving the data to disk, rendering off the template
sub create-output-html (%pdata) {
    my %output = (
        type => 'html',
        dir  => %pdata<tumblr_key> ~ '/',
        file => %pdata<post_id> ~ '--' ~ %pdata<post_slug> ~ '.pl6.html',
        data => Template::Mustache.render($=finish, %pdata),
    );

    # @NOTE not implemented download handling

    return %output;
}

The `mustache` template

=finish
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1.0" />
	<meta name="origin" content="{{ post_url }}" />
	<meta name="tumblr-type" content="{{ _type }}" />
	{{# post_tags }}<meta name="keywords" content="tumblr{{# tags }}, {{ tag }}{{/ tags }}" />{{/ post_tags }}
	<title>{{ tumblr_title }} — {{ post_id }} — {{ post_slug }}</title>
<!-- #bbinclude "../tumblr.css" -->
	<style type="text/css">
		body {
			margin: 0;
			padding: 0;
			font-family: "Hoefler Text", serif;
		}
		article {
			margin: 0;
			padding: 0;
			padding: 0.5rem;
			background: #EEEEEE;
			border: 0.25rem solid #DDDDDD;
		}
		article header {
			margin: 0;
			margin-bottom: 1rem;
			padding: 0;
			border-bottom: thin solid #BBBBBB;
		}
		article header h1 {
			margin: 0;
			padding: 0;
			margin-top: 1rem;
			margin-bottom: 0.25rem;
			font-size: 1.2rem;
		}
		article header p.date {
			margin: 0;
			padding: 0;
			margin-top: 0.5rem;
			margin-bottom: 0.5rem;
			font-size: 0.8rem;
		}
		article header p.tags {
			margin: 0;
			padding: 0;
			margin-top: 0.5rem;
			margin-bottom: 0.5rem;
			font-size: 0.8rem;
			font-style: italic;
		}
		article header p.tags span.tag {
			padding-right: 1rem;
		}
		article header p.tags span.tag:before {
			content: '# ';
			color: gray;
		}
		article figure img {
			max-width: 100%;
		}
		article blockquote {
			margin-left: 1rem;
			border-left: thin dashed #CCCCCC;
			padding-left: 0.5rem;
			margin-right: 0;
		}
		video {
			max-width: 100%;
		}
	</style>
<!-- end bbinclude -->
	<style type="text/css">
	</style>
</head>
<body>
<article>
<header>
	<h1><a href="{{ post_url }}">{{ tumblr_title }} — {{ post_id }} — {{ post_slug }}</a></h1>
	<p class="date">{{ post_date }}</p>
	{{# post_tags }}<p class="tags">{{# tags }}<span class="tag">{{ tag }}</span>{{/ tags }}</p>{{/ post_tags }}
</header>
{{# post_photos }}
<figure>
	<a href="{{ url }}"><img src="{{ url }}" /></a>
	<figcaption>{{& caption }}</figcaption>
</figure>
{{/ post_photos }}
{{# post_videos }}
<!-- {{& source }} -->
{{ controller }}
{{& caption }}
{{/ post_videos }}
{{& post_body }}
</article>
</body>
</html>

`convert-imgs`

subs are lexical, so don't need a my variable
need to “stringify” the $url variable because of type system
using lookbehind assertions
matches start with $0

# change all img tag sources to use data URIs
sub convert-imgs (%output) {
    my $ua = HTTP::UserAgent.new();

    sub convert-imgs-helper ($url) {
        my $r = $ua.get: ~$url;
        if $r.is-success {
            return "data:{$r.content-type};base64," ~ MIME::Base64.encode($r.content, :oneline);
        }
        else {
            return "error: {$r.response-line}";
        }
    }

    %output<data> ~~ s:g/ <?after \<img .+? src\=\"> (<-["]>+) /{ convert-imgs-helper($0) }/;
}

or going whole-hog on Perl 6

As I mentioned, I asked the #perl6 IRC channel to review the code
ZoffixZnet sent me a rewritten (but untested) version of the convert-imgs function
It really does a lot of Perl 6 stuff…

sub convert-imgs (% (:$data! is rw)) {
    $data .= subst: :g,
        / <?after \<img .+? src\=\"> (<-["]>+) /,
        -> $ ( Str() $url ) {
            with HTTP::UserAgent.new.get: $url {
                when .is-success {
                    "data:{.content-type};base64,"
                        ~  MIME::Base64.encode: .content, :oneline
                }
                "error: {.response-line}"
            }
        }
}

`save-post`

# write the file to the disk, in the appropriate directory
sub save-post (%output) {
    if !%output<dir>.IO.d {
        (try mkdir %output<dir>) orelse croak("Could not create output directory '%output<dir>': $!");
    }
    with %output<dir file>.join -> $f {
        (try spurt $f, %output<data>) orelse croak("Could not write output file '$f': $!");
    }

    # @NOTE not implemented download handling
}

Questions?

Code highlighting

Perltidy can create HTML output
Vim can highlight code (and supports Perl 6), and then export it to HTML (with :TOhtml)
Text::VimColor is a Perl module to automate this
hilite.me is another web-based highlighter, from which I borrowed some styling ideas

BBEdit inline Perl to HTML

Create this as a file in the ~/Library/Application Support/BBEdit/Text Filters/ folder

#!/usr/bin/env perl
use Modern::Perl '2014';
use Perl::Tidy;

# grab the input data
my $src = do { local $/; <> };

# beautify it
my $beauty = undef;
Perl::Tidy::perltidy(
    source      => \$src,
    destination => \$beauty,
    argv        => [
        '--noprofile',                      # ignore a .perltidyrc; all settings come from this file
        '--quiet',                          # don't mess things up if something goes wrong
        '--maximum-line-length=0',          # I will take care of splitting lines where I want them...
        '--ignore-side-comment-lengths',    # don't include the length of the comments in the line length
        '--indent-columns=4',               # use 4 columns for a tab stop
        '--nooutdent-long-quotes',          # long quotes should not be outdented, even if they exceed line length
        '--nooutdent-long-comments',        # leave long comment lines alone
        '--add-semicolons',                 # make sure all blocks are terminated by a semi-colon
        '--keep-interior-semicolons',       # there's usually a really good reason for multiple commands on one line
        '--stack-opening-tokens',           # I prefer opening tokens to be inline
        '--square-bracket-tightness=0',     # I like single items in square brackets to have spaces
        '--nodelete-old-newlines',
        '--keep-old-blank-lines=1',
    ],
);

# html format it
my $html = undef;
Perl::Tidy::perltidy(
    source      => \$beauty,
    destination => \$html,
    argv        => [
        '-html',                            # produce html output
        '-pre',                             # but only the actual code, not the entire page
    ],
);

# print the result
print $html;

The same, for Perl 6

The equivalent filter for Perl 6 source is much simpler (the filter itself is still Perl 5, using Text::VimColor)
There isn't yet a Perl::Tidy equivalent for Perl 6, so the code doesn't get reformatted
Also, Vim's parsing of Perl 6 isn't perfect, so some things get mis-colored…

#!/usr/bin/env perl
use Modern::Perl '2015';
use Text::VimColor;

# grab the input data
my $src = do { local $/; <> };

my $syntax = Text::VimColor->new(
    string            => $src,
    filetype          => 'perl6',
    all_syntax_groups => 1,
);

say '<pre>' . $syntax->html . '</pre>';

The presentation code

I tried a variety of web-presentation projects, but all of them want the slides to be a fixed height. For this presentation, I wanted code snippets of varying length, without having to either scroll a tiny window or shrink the font to fit.

Instead, I spent a tiny bit of time developing a presentation “framework” in JavaScript.

Because this is not Perl, and not the focus of the talk, I'm not going to go through it, but here is the HTML required for the framework (minus some of the unimportant styles).

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <title>{title}</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />

    <style type="text/css">
        /* set some sensible defaults */
        * { box-sizing: border-box; }
        html {
            font-size: 20px;
        }
        body {
            margin: 0; padding: 0;
            font-size: 1rem; line-height: 1.5;
        }

        /* all headings should have the same height, even as the font sizes get smaller */
        h1, h2, h3, h4, h5, h6 {
            margin: 0; padding: 0;
            line-height: 3rem;
        }
        h1 { font-size: 2rem; }
        h2 { font-size: 1.7rem; }
        h3 { font-size: 1.5rem; }
        /* ... */

        /* fix the buttons based on the viewport, and make them almost invisible */
        button {
            padding: 0;
            border: thin solid gray;
            background-color: white;
            font-size: 2rem;
            line-height: 1;
            opacity: 0.1;
            text-align: center;
            -webkit-appearance: none;
        }
        button#dtoc {
            position: fixed;
            bottom: 0;
            left: 0;
            width: 2rem;
            height: 2rem;
        }
        button#next {
            position: fixed;
            bottom: 0;
            right: 0;
            width: 6rem;
            height: 2rem;
        }
        button#prev {
            position: fixed;
            bottom: 0;
            right: 6.5rem;
            width: 6rem;
            height: 2rem;
        }

        /* the nav needs to display items slightly differently, and make the headings indent */
        nav {
            padding: 0.5rem;
            position: fixed;
            top: 0;
            bottom: 2rem;
            left: 0;
            width: 25%;
            overflow-x: hidden;
            overflow-y: auto;
            background-color: #EEEEEE;
            border-right: thin solid black;
            border-bottom: thin solid black;
            white-space: nowrap;
        }
        nav { display: none; }
        nav.visible { display: block; }
        nav h1, nav h2, nav h3, nav h4, nav h5, nav h6 {
            font-size: 0.6rem;
            line-height: 1.5;
            font-weight: normal;
            font-style: normal;
        }
        nav h2 { padding-left: 1rem; }
        nav h3 { padding-left: 2rem; }
        /* ... */

        /* styles for the slides */
        main {
            margin: 0;
            padding: 0;
        }
        main section {
            margin: 1rem;
            margin-bottom: 2.5rem;
        }

        code {
            padding: 0.1rem 0.5rem;
        }

        div.code {
            margin-left: 1.5rem;
            max-width: 100%;
            overflow: auto;
            border: thin solid #DDDDDD;
        }
        div.code pre {
            margin: 0;
            font-family: monospace;
        }
    </style>

    <style type="text/css">
        /* the styles that make this a presentation vs just a webpage */
        main.slides section.slide {
            display: none;
        }
        main.slides section.slide.active {
            display: block;
        }
    </style>

    <script type="text/javascript">
        /* show or hide the table of contents */
        function toggle_toc () {
            document.getElementById("toc").classList.toggle("visible");
        }

        /* go forward one slide */
        function go_next () {
            var currelt = document.querySelector(".active");
            var nextelt = currelt.nextElementSibling;
            if (nextelt != null) {
                currelt.classList.remove("active");
                nextelt.classList.add("active");
                nextelt.scrollIntoView();
            }
        }

        /* go backward one slide */
        function go_prev () {
            var currelt = document.querySelector(".active");
            var prevelt = currelt.previousElementSibling;
            if (prevelt != null) {
                currelt.classList.remove("active");
                prevelt.classList.add("active");
                prevelt.scrollIntoView();
            }
        }

        /* make a particular slide (specified by id) the active one */
        function go_slide (sid) {
            var currelt = document.querySelector(".active");
            var destelt = document.getElementById(sid);
            if ((currelt != null) && (destelt != null)) {
                currelt.classList.remove("active");
                destelt.classList.add("active");
                destelt.scrollIntoView();
                document.getElementById("toc").classList.remove("visible");
            }
        }

        /* set up everything */
        function initialize_presentation () {
            // allow to display as one page, for easier development
            if (window.location.search == '?showall') {
                document.getElementById("presentation").classList.remove("slides");
            }

            // set the first slide to be the active one
            var first_slide = document.querySelector("main#presentation section.slide");
            first_slide.classList.add("active");

            // ad-hoc class to generate unique ids
            var idgen = {
                "curr": 0,
                "id": function (prefix) {
                    this.curr += 1;
                    return prefix.toString() + this.curr.toString();
                }
            };

            // go through the slides and assign each a unique id
            // also, create the table of contents
            var nav = document.getElementById("toc");
            var slides = document.querySelectorAll("main#presentation section.slide");
            slides.forEach(function (val, idx, list, t) {
                // assign a unique id to each slide
                if (val.id == "") { val.id = idgen.id("s"); }

                // get the first heading in each slide and add it to the table of contents
                var heading = val.querySelector("h1, h2, h3, h4, h5, h6");
                if (heading != null) {
                    var node = heading.cloneNode(true);
                    node.dataset.target = val.id;
                    node.addEventListener("click", function () { go_slide(val.id); });
                    nav.appendChild(node);
                }
            });
        }
    </script>

</head>
<body>
<button id="dtoc" onclick="toggle_toc();">❖</button>
<button id="next" onclick="go_next();">➡︎</button>
<button id="prev" onclick="go_prev();">⬅︎</button>

<nav id="toc"></nav>

<main id="presentation" class="slides">

    <section class="slide">
        <h1>{title}</h1>
    </section>

    <section class="slide">
        <h2>{another title}</h2>

        <p>{your text here}</p>
    </section>

    <!-- etc... -->

</main>

<script type="text/javascript">
initialize_presentation();
</script>

</body>
</html>
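The id-assignment and table-of-contents logic in initialize_presentation can be tried outside a browser. Here is a minimal sketch in plain JavaScript, assuming each slide is reduced to a hypothetical `{ id, heading }` object instead of a DOM element. `makeIdGenerator` and `buildToc` are illustrative names, not part of the framework above.

```javascript
// Return a function that hands out prefix1, prefix2, ... on each call.
function makeIdGenerator(prefix) {
    var curr = 0;
    return function () {
        curr += 1;
        return String(prefix) + String(curr);
    };
}

// Give every slide without an id a generated one, and collect a
// table-of-contents entry for every slide that has a heading.
// `slides` is an array of { id: string, heading: string or null }.
function buildToc(slides, nextId) {
    var toc = [];
    slides.forEach(function (slide) {
        if (slide.id === "") {
            slide.id = nextId();
        }
        if (slide.heading != null) {
            toc.push({ target: slide.id, text: slide.heading });
        }
    });
    return toc;
}

// usage: the second slide keeps its own id and has no heading
var slides = [
    { id: "", heading: "Introduction" },
    { id: "intro-picture", heading: null },
    { id: "", heading: "My goal" }
];
console.log(buildToc(slides, makeIdGenerator("s")));
```

As in the framework, slides that already carry an id keep it, and slides without a heading are reachable only with the next and previous buttons.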

Fin
