Pywb migration notes

2020-02-25

We recently migrated the Australian Web Archive (AWA) from OpenWayback to Pywb to take advantage of Pywb’s better replay fidelity, particularly for JavaScript-heavy websites. The Trove-branded user interface to the web archive is a separate web application which displays Wayback in an iframe.

The old architecture

We originally chose to write a separate application rather than customising OpenWayback’s banner templates for a couple of reasons:

We thought it would make it easier to update Wayback and the UI independently of each other. This is particularly important as the web archive backend and the Trove user interface are managed and developed by different teams with different development processes and release cycles.
It allows replayed content to live on a different origin (domain) to the UI. This is an important security measure to prevent archived content from being able to interfere with the UI. While we didn’t take advantage of this in the original release, it’s something I’ve long wanted to implement.
We had in mind from the beginning that we may eventually want to swap OpenWayback out for another replay tool and it’d be nice not to have to rewrite our UI in order to do it.

While it has caused a few problems in the past (redirect loops, PDF plugins), this architecture made the transition to Pywb straightforward. Pywb out of the box renders its own UI with an iframe, so it was close to a drop-in replacement for us. We encountered a few small problems along the way, though most were of our own making rather than Pywb’s. ;-)

In the AWA’s initial release, archived content and the UI were both served from the same domain name. This meant that the browser allowed the UI’s JavaScript to reach inside the iframe and access the archived page. The Trove-Web UI was therefore able to listen to the iframe’s load event and even intercept click events. When the iframe loaded we could inspect the page’s title to update it in the UI and extract the current URL and timestamp from the iframe’s URL.

While this is convenient and would have worked with Pywb if we kept it on the same domain, it also means archived content could do the same to our UI! We never encountered anyone doing this in the wild, but we’ve always been a little worried that the web archive could be abused for attacks like phishing.

This meant we needed a replacement way for the UI to get information about what was happening inside the replay iframe. Fortunately Pywb has already solved this: it uses Window.postMessage() to send a message like this when the archived page loads.

{
    "wb_type":"load",
    "url":"http://www.example.com/",
    "ts":"20060821035730",
    "title":"Example Web Page",
    "icons":[],
    "request_ts":"20060821035730",
    "is_live":false,
    "readyState":"interactive"
}
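
On the UI side, a listener for this message might look like the following minimal sketch. The origin https://replay.example, the function name parseReplayMessage and the shape of the returned object are assumptions made for illustration, not Trove-Web's actual code. Only the field names come from the message above.

```javascript
// Hypothetical origin serving the Pywb replay frame (an assumption for this sketch).
const REPLAY_ORIGIN = 'https://replay.example';

// Returns the details the UI needs from a Pywb load message,
// or null if the event should be ignored.
function parseReplayMessage(event, expectedOrigin = REPLAY_ORIGIN) {
    // Ignore messages that do not come from the replay origin.
    if (event.origin !== expectedOrigin) return null;
    const data = event.data;
    if (!data || typeof data !== 'object' || data.wb_type !== 'load') return null;
    return {
        url: String(data.url),
        timestamp: String(data.ts),
        title: typeof data.title === 'string' ? data.title : '',
    };
}

// In a browser the UI would register the handler along these lines.
if (typeof window !== 'undefined') {
    window.addEventListener('message', (event) => {
        const info = parseReplayMessage(event);
        if (info) document.title = info.title;
    });
}
```

Checking the origin matters here because the whole point of the separate domain is that the UI no longer trusts whatever is running inside the frame.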

One gotcha I encountered though was that Pywb doesn’t send a postMessage when displaying error pages. Trove-Web intercepts OpenWayback’s not found errors in order to display a more detailed message suggesting alternative ways to find the content or an explanatory message about restricted content.

I worked around that by including a custom templates/not_found.html which sends the load message:

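<script>
    // Mimic the message Pywb sends on a normal page load so the parent UI
    // is told about the 404.
    parent.postMessage({
        'wb_type': 'load',
        'url': '{{ url }}',
        // Timestamp segment of the replay URL, minus the mp_ modifier.
        'ts': location.href.split('/')[4].replace('mp_',''),
        'title': 'Webpage snapshot not found',
        'status': 404
    }, '*');
</script>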
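
The split('/')[4] lookup relies on the replay URL having exactly one path segment (the collection name) before the timestamp. A slightly more defensive variant is sketched below. The helper name extractTimestamp and the example host and collection name are hypothetical, and it assumes replay URLs of the form /<collection>/<timestamp><modifier>/<original URL>.

```javascript
// Hypothetical helper: pull the capture timestamp out of a replay URL such as
// https://replay.example/webarchive/20060821035730mp_/http://www.example.com/
// Returns null when the segment after the collection name is not a timestamp.
function extractTimestamp(href) {
    const path = new URL(href).pathname;
    // One collection segment, then up to 14 digits and an optional
    // two-letter modifier such as mp_.
    const match = /^\/[^/]+\/(\d{1,14})(?:[a-z]{2}_)?(?:\/|$)/.exec(path);
    return match ? match[1] : null;
}
```

Returning null rather than an arbitrary path segment lets the caller decide what to put in the message when the requested URL carried no timestamp.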

Problem 2: Accessing HTML meta tags

Trove’s user interface (and this is not specific to the web archive) has a tab which shows how to cite the item you’re looking at when referencing it in an academic context or on Wikipedia. The original implementation inspected the contents of the iframe on the client side for HTML meta tags to pull out information such as an author or publisher’s name. This of course also broke when we moved Wayback to a separate security origin, and unlike the URL and page title, Pywb doesn’t include the page’s meta tags in its load message.

Rather than trying to provide JavaScript access to the page content and potentially undoing some of the isolation we were trying to introduce, I moved the meta tag extraction server side, translating the code from JavaScript/DOM to Java/JSoup.

Problem 3: Multiple access points

The library has a take-down procedure for restricting access to content under certain circumstances. Restricted content can have several different policies applied to it. Content can be fully public, accessible only to staff, accessible on-site in the reading room or fully restricted.

OpenWayback also allows the URL used for routing incoming requests (accessPointPath) to differ from the one generated when rewriting links (replayPrefix). So our original implementation simply configured three access points under paths like /public, /onsite and /staff but with the generated links all at /wayback.

<beans>
    <bean name="publicaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/public/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="publiccollection" />
    </bean>

    <bean name="onsiteaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/onsite/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="onsitecollection" />
    </bean>

    <bean name="staffaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/staff/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="staffcollection" />
    </bean>
</beans>

Our frontend webserver (nginx) has a set of IP range rules which map the different access locations to an access point. It used to rewrite the path accordingly:

rewrite /wayback/(.*) /$webarchive_access_point/$1 break;
proxy_pass http://openwayback;

I couldn’t find a way to replicate this configuration as Pywb appears to only have one parameter - the collection name - used for both routing and link generation. (Aside: perhaps it’s possible by using uwsgi and overriding SCRIPT_NAME but I couldn’t figure it out.) Therefore rather than running a single instance of Pywb with multiple collections configured we ended up with a separate instance of Pywb for each access point and have nginx route requests to the appropriate port. It’s a little more complex to deploy than I’d like but works well enough.

upstream pywb-public { server backend.example:8080; }
upstream pywb-staff  { server backend.example:8081; }
upstream pywb-onsite { server backend.example:8082; }

proxy_pass http://pywb-$webarchive_access_point;

Problem 4: Hiding the UI for thumbnails

Trove’s collection browse path displays thumbnails of the archived sites. These are generated by a web service that wraps Chromium’s headless mode. Obviously we don’t want the UI of the archive to be visible in the thumbnails. But if Chromium loads the URL of Pywb directly there’s a JavaScript redirect back to the Trove UI. This exists so that if a user opens an archived link in a new tab they get the web archive’s UI in that new tab too rather than just the contents of the replay frame.

In our original implementation we kind of hacked around this for screenshots by passing a magic flag as a URL fragment that the JavaScript redirect looked for as an indicator not to redirect to the UI. This time, though, that redirect was being done by Pywb itself rather than our template customisations. Plus we don’t really want to expose a way of hiding the UI entirely, again due to the risk of the archive being abused for phishing.

I hoped to set up a second Pywb collection with framed_replay: false and a blank banner, but was thwarted as that can only be specified at the top level. So yes, you guessed it, we’re now up to four instances of Pywb. ¯\_(ツ)_/¯

Problem 5: The PDF workaround that broke

In the original iframe implementation we encountered problems with displaying PDFs in an iframe in some browsers. The developers ended up working around this by embedding PDF.js rather than relying on the browser’s rendering. This broke when switching to Pywb on an isolated domain, as the interception was based on an onClick handler injected into the iframe and the PDF.js viewer can’t load documents from a different origin without special configuration. Fortunately it seems either Pywb does something differently or, more likely, browsers now handle PDFs in iframes better, so I was able to just disable the whole PDF interception thing.

Funnily enough we do still use PDF.js for generating thumbnails though. Chromium’s PDF viewer doesn’t work in headless mode. While it probably would be more efficient to use some native PDF viewer, just using PDF.js lets us reuse the same thumbnail generation logic and also piggyback on the browser’s security sandbox.

Problem 6: Reusing OpenWayback’s manhattan graph

I discovered to my surprise that Trove’s visualisation of captures over time was actually using OpenWayback’s server-side manhattan graph renderer. There’s no exact equivalent of this in Pywb and I don’t want to keep a running instance of OpenWayback for that alone. Fortunately the graph rendering code is standalone and could be incorporated directly into Trove-Web.

Screenshot of manhattan graph