Full-Text RSS 3.4

The new version of our Full-Text RSS application is now available. Full-Text RSS is our article extraction and partial-to-full-text-feed conversion application. You can try it out now, or read on to find out what’s new.

Native-ad blocking

The site configuration files that Full-Text RSS uses to decide how to extract content from web pages can now also tell it how to recognise a native ad. If you don’t know what a native ad is, John Oliver has a nice explanation.

Articles identified as native ads will now get flagged with <dc:type>Native Ad</dc:type> or, depending on your configuration, can be stripped from the RSS output altogether.¹
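If you consume the generated feed in your own code, the flag can be used to separate native ads from regular articles. The following is a minimal sketch using Python's standard library. The sample feed and the split_items helper are made up for illustration, and we assume dc is bound to the usual Dublin Core namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: dc is bound to the standard Dublin Core elements namespace.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical feed output, trimmed to the parts that matter here.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item><title>Regular story</title></item>
    <item><title>Sponsored story</title><dc:type>Native Ad</dc:type></item>
  </channel>
</rss>"""


def split_items(feed_xml):
    """Return (article_titles, native_ad_titles) for an RSS document."""
    root = ET.fromstring(feed_xml)
    articles, native_ads = [], []
    for item in root.iterfind("./channel/item"):
        title = item.findtext("title", default="")
        # Items flagged by Full-Text RSS carry <dc:type>Native Ad</dc:type>.
        if item.findtext("dc:type", default="", namespaces=DC_NS) == "Native Ad":
            native_ads.append(title)
        else:
            articles.append(title)
    return articles, native_ads


articles, native_ads = split_items(SAMPLE_FEED)
print(articles, native_ads)
```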

We will soon start adding rules from Ian Webster’s excellent AdDetector project to our site configuration files to increase our coverage of sites carrying native ads. If you’d like to add such rules yourself, you will need to create or edit a site configuration file and use the following new directive:

native_ad_clue: [XPath]

Let’s use The Guardian as an example. Their native ads contain a meta tag with the content “Partner zone”. Here’s an example:

<meta property="article:tag" content="Partner zone Unilever" />

To make Full-Text RSS check for this tag, we can edit our site configuration file theguardian.com.txt and add the following rule:

native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]

Now when Full-Text RSS processes an article from The Guardian, it will also look for an element matching our XPath. If it finds one, the article will be marked as a native ad and, depending on your configuration, removed from the feed.
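To get a feel for what this rule matches, here is a rough sketch of the same check in Python. It is not how Full-Text RSS evaluates the rule: the standard library's ElementTree only supports a subset of XPath without contains(), so the substring test is done in Python instead, and the sample page is a made-up, well-formed snippet.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so the XML parser accepts it.
SAMPLE_PAGE = """<html><head>
<meta property="article:tag" content="Partner zone Unilever" />
</head><body><p>Sponsored story</p></body></html>"""


def looks_like_native_ad(markup):
    """Approximate the native_ad_clue rule for The Guardian."""
    root = ET.fromstring(markup)
    # Equivalent of //meta[@property="article:tag"] ...
    for meta in root.iterfind(".//meta[@property='article:tag']"):
        # ... and contains(@content, "Partner zone")
        if "Partner zone" in meta.get("content", ""):
            return True
    return False


print(looks_like_native_ad(SAMPLE_PAGE))
```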

Per-request extraction rules

It’s now possible to submit extraction rules directly in your request to Full-Text RSS. We’ve added a new request parameter called siteconfig which can take whatever you’d normally put inside a site config file. We’ll then merge this with other site config files matching the page being processed.²
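As an illustration, a request carrying its own rules could be built along these lines. This is a sketch under assumptions: the endpoint address and the url parameter name are placeholders for your own installation and are not taken from this post. Only the siteconfig parameter and the native_ad_clue rule come from the text above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical install location; replace with your own Full-Text RSS address.
ENDPOINT = "http://example.org/full-text-rss/makefulltextfeed.php"


def build_request_url(input_url, rules):
    """Build a request URL that submits site config rules with the request."""
    # A site config file holds one rule per line, so join rules with newlines.
    siteconfig = "\n".join(rules)
    # Assumption: the input URL is passed in a parameter named "url".
    query = urlencode({"url": input_url, "siteconfig": siteconfig})
    return ENDPOINT + "?" + query


rules = [
    'native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]',
]
request_url = build_request_url("http://example.org/feed/", rules)
print(request_url)
```

urlencode takes care of escaping the quotes, brackets and newlines in the rules, so the site config text arrives intact on the other side.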

Accept parameter

Full-Text RSS is typically used with two types of input:

Web feeds (usually partial feeds)
Web pages (usually a single news story or blog post)

We don’t ask you to specify what the input type is because we can guess from the response. Having said that, it’s sometimes a good idea to tell Full-Text RSS what it should expect. Doing so explicitly ensures that if the response type changes in the future, Full-Text RSS will reject the new response, rather than treating it differently.

To be explicit about the response type, you can use the new accept request parameter. accept=feed tells Full-Text RSS the response should always be a feed. accept=html tells Full-Text RSS that the response should always be treated as a regular web page.³

For example, let’s say you generate a full-text feed from a partial feed found at http://example.org/feed/ and subscribe to it in your feed reading application. A month or two later, the site changes and the URL no longer returns a feed, but redirects to some other page on the site. Without accept=feed, Full-Text RSS will pull in the web page and assume that it was a web page all along, so it will try to extract its contents and return a 1-item feed. With accept=feed, Full-Text RSS will reject the new response instead.
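The decision described here can be modelled in a few lines. This is a toy model of the behaviour as we've described it, not the application's actual code. The function name is made up, and we assume accept=html simply forces web page handling rather than rejecting anything.

```python
def resolve_input_type(accept, detected):
    """Return how a response should be treated, or raise if it is rejected.

    accept   -- value of the accept request parameter: auto, feed or html
    detected -- what the response looks like: feed or html
    """
    if accept not in ("auto", "feed", "html"):
        raise ValueError("unknown accept value: %r" % (accept,))
    if detected not in ("feed", "html"):
        raise ValueError("unknown detected type: %r" % (detected,))
    if accept == "auto":
        # No expectation given: go with whatever the response looks like.
        return detected
    if accept == "feed" and detected != "feed":
        # The URL used to return a feed but no longer does: reject.
        raise ValueError("expected a feed but got a web page")
    # accept=feed with a feed, or accept=html (assumed to force page handling).
    return accept


# The scenario above: the feed URL now redirects to a regular web page.
print(resolve_input_type("auto", "html"))
try:
    resolve_input_type("feed", "html")
except ValueError as error:
    print("rejected:", error)
```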

Subscription link (+self, +original)

We are now including 3 additional pieces of information in our generated feed output:

A subscription URL using subtome.com – if you open this URL in your browser, you’ll be given a list of feed reading services you can use to subscribe to the generated feed. This appears as <atom:link rel="related" ...>. We use this URL in the feed preview to offer a subscription link you can click.⁴
The URL of the generated feed. Although you can see this in your address bar, if you save the feed on disk somewhere, you’ll now have a reference to the URL that generated the file. This appears as <atom:link rel="self" ...>.
The URL of the original feed or web page used as input to Full-Text RSS. This URL is actually also embedded in the URL of the generated feed, but because it’s URL-encoded, it can be hard to get to. This appears as <atom:link rel="alternate" ...>.
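If you want to read these links back from a saved feed, something like the following should work. It's a sketch using Python's standard library. The sample feed and all URLs in it are placeholders (including the subscription link, which in real output points to subtome.com), and we assume the links sit directly under the channel element.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical generated feed, trimmed to the three links discussed above.
SAMPLE_FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example feed</title>
    <atom:link rel="related" href="http://example.org/subscribe" />
    <atom:link rel="self" href="http://example.org/full-text?url=example.org%2Ffeed%2F" />
    <atom:link rel="alternate" href="http://example.org/feed/" />
  </channel>
</rss>"""


def links_by_rel(feed_xml):
    """Map each atom:link rel value in the channel to its href."""
    root = ET.fromstring(feed_xml)
    return {
        link.get("rel"): link.get("href")
        for link in root.iterfind("./channel/atom:link", ATOM_NS)
    }


links = links_by_rel(SAMPLE_FEED)
print(links["alternate"])  # the original input URL, no URL-decoding needed
```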

Full changelog

New request parameter: siteconfig lets you submit extraction rules directly in request
New request parameter: accept=(auto|feed|html) determines what we’ll accept as a response (deprecates html=1 parameter)
New request parameter: key_redirect=0 prevents the HTTP redirect used to hide the API key
Site config files can now contain native_ad_clue: [xpath] to check for elements which signify that the article is a native ad
New config option: remove_native_ads – when set to true, native ads we detect (see above) are removed from the output (only when processing feeds; output is unaffected when the input URL points to an HTML page).
Feed output will include <dc:type>Native Ad</dc:type> for articles which appear to be native ads.
New config option: user_submitted_config to determine whether the siteconfig parameter is enabled or not
Feed output now includes <atom:link rel="self"...> with URL of the generated feed
Feed output now includes <atom:link rel="alternate"...> with URL of the original (input) URL
Feed output now includes <atom:link rel="related"...> with URL to subscribe to the generated feed (using subtome.com)
Feed preview stylesheet (feed.xsl) now presents a subscribe to feed link
Fixed character encoding issue for certain texts
Fixed character encoding issue for certain characters in HTML5 parsing mode
Use base element, if present in HTML, when rewriting URLs
HTML5-PHP library updated
Other minor fixes/improvements

Available to buy

Full-Text RSS 3.4 is now available to buy. If you’re an existing customer, please wait for an email from us with an upgrade link.

¹ We’ve added a new option to config.php called $options->remove_native_ads which you can enable to have native ads removed from feeds. We’ve enabled this on our hosted service, so as we start adding rules identifying native ads, they’ll stop appearing in your feeds.
² To avoid your request-specific site config rules being merged with those Full-Text RSS has on file, add the rule autodetect_on_failure: no to the site config you’re submitting in the request.
³ accept=html is now the preferred way to indicate that the response should be HTML. Previously we used html=1, but this has now been deprecated in favour of the new parameter (we’ll still treat html=1 as accept=html, so there’s no need for existing users to change URLs just yet).
⁴ Not all browsers use our XSL stylesheet for the feed preview, so in those that don’t, you won’t see our clickable subscription link. Firefox does its own feed rendering, so you won’t see it there. But Chrome will use our stylesheet.