Oct 16, 2014
Extraction tests for Full-Text RSS

As many of our users know, Full-Text RSS, our article extraction tool, relies on a number of site-specific extraction rules which we maintain in our GitHub repository.

These extraction rules were initially imported from Instapaper (before it was sold, when they were still publicly available). And since then we’ve done our best to update them and have received many contributions from our users (thank you!).

But while the repository has grown over time, we’ve had no system in place to check whether the rules it contains are still effective, or whether the sites they were created for are still alive.

Well, now we have such a system. It’s fairly basic and still experimental, but we think it’s a good start. It runs a number of checks periodically, using the site configuration files in our repository, and produces a report listing the ones which need attention. Here’s what it checks for:

Presence of at least one test URL per site

To determine if Full-Text RSS can extract content, site config files should contain at least one test URL. So if one is not present, we’ll include it in the report.
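A minimal sketch of this check in Python (assumed logic and names; the actual test system isn’t shown in this post). Site config files are plain text with one "directive: value" per line:

```python
# Sketch (assumed implementation): flag site config files with no test_url.
def has_test_url(siteconfig_text):
    """Return True if the site config contains at least one test_url line."""
    for line in siteconfig_text.splitlines():
        if line.strip().startswith("test_url:"):
            return True
    return False

print(has_test_url("title: //h1\nbody: //div[@id='article']"))  # False
```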

Match between test URL hostname and site config filename

A site config file called example.org.txt is only loaded when we process URLs from example.org. So this test is to make sure that the test URL inside the site config file is one which will actually load the site config file in which it appears.1
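Roughly, the check looks like this (a sketch with assumed logic; we strip a leading ‘www.’ before comparing, and the real test has extra handling for feed hostnames, as noted in the footnote):

```python
# Sketch: does the test URL's hostname match the site config filename?
from urllib.parse import urlparse

def hostname_matches(config_filename, test_url):
    expected = config_filename[:-len(".txt")] if config_filename.endswith(".txt") else config_filename
    host = urlparse(test_url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host == expected

print(hostname_matches("example.org.txt", "http://www.example.org/article"))  # True
```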

Valid response from test URLs (200 status code)

This test is used to make sure the test URL is not dead. If it is, it could mean one of a few things:

  • The site has been redesigned, breaking old URLs. If so, the extraction rules might need to be updated, and a new test URL entered.
  • The site is the same as before, but the page in question is no longer available at that URL. If that’s the case, a new working test URL should be entered.
  • The site is no longer alive, in which case the site config file should be deleted.
  • The site is not dead, but it’s temporarily unavailable.
Expected content

Let’s say we got a valid response from the test URL in the test above. This doesn’t actually tell us if Full-Text RSS can successfully extract content from that URL, only that the request has succeeded (the server has returned a 200 OK status code). So now we look for a new site config directive called test_contains. This should contain a small chunk of text that we expect to find in the article. If this directive is present in the site config file, we will pass the test URL to Full-Text RSS for it to extract the article content from the page. We’ll then check the extracted content to see if the chunk of text contained in our test_contains directive appears. If it does not, it will be flagged in the report.

As this is a new directive, most of the site config files do not contain it. But we’ve started adding them to a few sites. Here’s what it looks like (example from theguardian.com.txt):

test_url: http://www.theguardian.com/world/2013/oct/04/nsa-gchq-attack-tor-network-encryption
test_contains: The National Security Agency has made repeated attempts to develop
test_contains: The agency did not directly address those questions, instead providing a statement.

test_url: http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester
test_contains: In August, the editor of the Guardian rang me up and asked if I would spend a week in New York
test_contains: As the second most senior judge in the country, Lord Hoffmann, said in 2004 about a previous version of our anti-terrorism laws

The test_contains directive should appear after a test_url directive. It can appear multiple times. If it does, we’ll associate each one with the test URL that appears above it. So you can, for example, take a sentence from the beginning of the article and one from the end. Our tests will look for both of these in the extracted content and warn you if one does not appear.
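The pairing logic can be sketched like this (assumed names; not the code our test system actually runs):

```python
# Sketch: each test_contains line is associated with the test_url above it.
def parse_tests(siteconfig_text):
    """Return a list of (test_url, [expected text chunks]) pairs."""
    tests = []
    for line in siteconfig_text.splitlines():
        line = line.strip()
        if line.startswith("test_url:"):
            tests.append((line[len("test_url:"):].strip(), []))
        elif line.startswith("test_contains:") and tests:
            tests[-1][1].append(line[len("test_contains:"):].strip())
    return tests

def missing_chunks(extracted_content, expected):
    """Expected chunks not found in the extracted content (to be flagged)."""
    return [chunk for chunk in expected if chunk not in extracted_content]

config = """\
test_url: http://example.org/a
test_contains: first sentence
test_contains: last sentence
"""
print(parse_tests(config))
```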

How does this work?

The tests are automated. We grab the latest set of site configuration files from our GitHub repository roughly every 48 hours and then start looking through the files in batches to identify problems.

Problems are reported in 3 categories:

  • Content fail: extracted content does not contain expected text
  • HTTP fail: the test URLs could not be retrieved
  • Warnings: no test URL present for site, or a possible mismatch between test URLs and site config hostname
In each category you will see a list of site configuration files. Clicking on one will give you more information on what exactly failed. Next to that link you will also find a ‘Fix on GitHub’ link which will take you directly to the GitHub page in our repository associated with that site configuration file. This allows you to edit the file and suggest a fix using GitHub’s web interface.

Try it out

You will find test results here: Full-Text RSS site config tests.

We’d very much appreciate fixes to these site config files. And of course the whole repository of extraction rules is in the public domain - free for anyone to use.

If you have any feedback, we’d love to hear it. Thanks!



  1. For feed URLs, the hostname test might produce false positives. For example, the feed for example.org might be hosted at feed.example.org or feedburner.com. These are still valid test URLs, even though there’s a mismatch between the test URL hostname and the site config filename. For that reason, we currently do not run this test on test URLs which contain ‘rss’ or ‘feed’ in the hostname. In the future, we’ll try to separate the web page URLs from feed URLs in the site config files so we can handle these cases appropriately.

Full-Text RSS 3.4

Sep 8, 2014

The new version of our Full-Text RSS application is now available. Full-Text RSS is our article extraction and partial-to-full-text-feed conversion application. You can try it out now, or read on to find out what’s new.

Native-ad blocking

The site configuration files which Full-Text RSS relies on to determine how to extract content from web pages can now also be used to tell it how to identify whether an article is in fact a native ad. If you don’t know what a native ad is, John Oliver has a nice explanation.

Articles identified as native ads will now get flagged with <dc:type>Native Ad</dc:type> or, depending on your configuration, can be stripped from the RSS output altogether.1

We will soon start adding rules from Ian Webster’s excellent AdDetector project to our site configuration files to increase our coverage of sites carrying native ads. If you’d like to add such rules yourself, you will need to create or edit a site configuration file and use the following new directive:

native_ad_clue: [XPath]


Let’s use The Guardian as an example. Their native ads contain a meta tag with the content “Partner zone”. Here’s an example:

<meta property="article:tag" content="Partner zone Unilever" />


To make Full-Text RSS check for this tag, we can edit our site configuration file theguardian.com.txt and add the following rule:

native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]


Now when Full-Text RSS processes an article from The Guardian, it will also look for an element matching our XPath. If it finds one, the article will be marked as a native ad and, depending on your configuration, removed from the feed.
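To illustrate the idea (Full-Text RSS itself is PHP and evaluates the XPath directly; this is just the same Guardian rule sketched with Python’s standard-library HTML parser):

```python
# Sketch: flag a page as a native ad if a <meta property="article:tag">
# element has "Partner zone" in its content attribute.
from html.parser import HTMLParser

class NativeAdDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_native_ad = False

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "meta"
                and a.get("property") == "article:tag"
                and "Partner zone" in a.get("content", "")):
            self.is_native_ad = True

def looks_like_native_ad(html):
    detector = NativeAdDetector()
    detector.feed(html)
    return detector.is_native_ad

print(looks_like_native_ad(
    '<meta property="article:tag" content="Partner zone Unilever" />'))  # True
```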

Per-request extraction rules

It’s now possible to submit extraction rules directly in your request to Full-Text RSS. We’ve added a new request parameter called siteconfig which can take whatever you’d normally put inside a site config file. We’ll then merge this with other site config files matching the page being processed.2
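For example, a request carrying its own rules might be built like this (the base URL is a placeholder; the body: directive and the autodetect_on_failure rule from note 2 are taken as examples):

```python
# Sketch: passing per-request extraction rules via the siteconfig parameter.
from urllib.parse import urlencode

siteconfig = "\n".join([
    'body: //div[@id="content"]',
    "autodetect_on_failure: no",  # don't merge with rules we have on file
])
query = urlencode({
    "url": "http://example.org/article",
    "siteconfig": siteconfig,
})
request_url = "http://localhost/full-text-rss/makefulltextfeed.php?" + query
print(request_url)
```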

Accept parameter

Full-Text RSS is typically used with two types of input:

  1. Web feeds (usually partial feeds)
  2. Web pages (usually a single news story or blog post)

We don’t ask you to specify what the input type is because we can guess from the response. Having said that, it’s sometimes a good idea to tell Full-Text RSS what it should expect. Doing so explicitly ensures that if the response type changes in the future, Full-Text RSS will reject the new response, rather than treating it differently.

To be explicit about the response type, you can use the new accept request parameter. accept=feed tells Full-Text RSS the response should always be a feed. accept=html tells Full-Text RSS that the response should always be treated as a regular web page.3

For example, let’s say you generate a full-text feed from a partial feed found at http://example.org/feed/ and subscribe to it in your feed reading application. A month or two later, the site changes and the URL no longer returns a feed, but redirects to some other page on the site. Without accept=feed, Full-Text RSS will pull in the web page and assume that it was a web page all along, so it will try to extract its contents and return a 1-item feed.
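In querystring terms, the difference is just one extra parameter (endpoint name from this post, base URL omitted):

```python
# Sketch: the same full-text feed request with and without accept=feed.
from urllib.parse import urlencode

lenient = "makefulltextfeed.php?" + urlencode({"url": "http://example.org/feed/"})
strict = "makefulltextfeed.php?" + urlencode(
    {"url": "http://example.org/feed/", "accept": "feed"})
print(strict)  # makefulltextfeed.php?url=http%3A%2F%2Fexample.org%2Ffeed%2F&accept=feed
```

With the strict variant, a non-feed response is rejected instead of being converted into a 1-item feed.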

Subscription link (+self, +original)

We are now including 3 additional pieces of information in our generated feed output:

  1. A subscription URL using subtome.com - if you open this URL in your browser, you’ll be given a list of feed reading services you can use to subscribe to the generated feed. This appears as <atom:link rel="related" ...>. We use this URL in the feed preview to offer a subscription link you can click.4

  2. The URL of the generated feed. Although you can see this in your address bar, if you save the feed on disk somewhere, you’ll now have a reference to the URL that generated the file. This appears as <atom:link rel="self" ...>.

  3. The URL of the original feed or web page used as input to Full-Text RSS. This URL is actually also embedded in the URL of the generated feed, but because it’s URL-encoded, it can be hard to get to. This appears as <atom:link rel="alternate" ...>.
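Put together, the three links appear as atom:link elements in the channel, roughly like this (the href values here are illustrative placeholders):

```xml
<channel>
  <!-- subscribe via subtome.com -->
  <atom:link rel="related" href="http://www.subtome.com/#/subscribe?feeds=..." />
  <!-- URL of this generated feed -->
  <atom:link rel="self" href="http://example.org/makefulltextfeed.php?url=..." />
  <!-- original feed or page used as input -->
  <atom:link rel="alternate" href="http://example.org/feed/" />
</channel>
```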

Full changelog

  • New request parameter: siteconfig lets you submit extraction rules directly in request
  • New request parameter: accept=(auto|feed|html) determines what we’ll accept as a response (deprecates the html=1 parameter)
  • New request parameter: key_redirect=0 to prevent HTTP redirect to hide API key
  • Site config files can now contain native_ad_clue: [xpath] to check for elements which signify that the article is a native ad
  • New config option: remove_native_ads - set to true to have articles identified as native ads (see above) removed from the output (only when processing feeds; doesn’t affect output when the input URL points to an HTML page).
  • Feed output will include <dc:type>Native Ad</dc:type> for articles which appear to be native ads.
  • New config option: user_submitted_config to determine whether the siteconfig parameter is enabled or not
  • Feed output now includes <atom:link rel="self"...> with URL of the generated feed
  • Feed output now includes <atom:link rel="alternate"...> with URL of the original (input) URL
  • Feed output now includes <atom:link rel="related"...> with URL to subscribe to the generated feed (using subtome.com)
  • Feed preview stylesheet (feed.xsl) now presents a ‘subscribe to feed’ link
  • Fixed character encoding issue for certain texts
  • Fixed character encoding issue for certain characters in HTML5 parsing mode
  • Use base element, if present in HTML, when rewriting URLs
  • HTML5-PHP library updated
  • Other minor fixes/improvements

Available to buy

Full-Text RSS 3.4 is now available to buy. If you’re an existing customer, please wait for an email from us with an upgrade link.


  1. We’ve added a new option to config.php called $options->remove_native_ads which you can enable to have native ads removed from feeds. We’ve enabled this on our hosted service, so as we start adding rules identifying native ads, they’ll stop appearing in your feeds. 

  2. To avoid your request-specific site config rules being merged with those Full-Text RSS has on file, add the rule autodetect_on_failure: no to the site config you’re submitting in the request. 

  3. accept=html is now the preferred way to indicate that the response should be HTML. Previously we used html=1, but this has now been deprecated in favour of the new parameter (we’ll still treat html=1 as accept=html, so there’s no need for existing users to change URLs just yet). 

  4. Not all browsers use our XSL stylesheet for the feed preview, so for those that don’t, you won’t see our clickable subscription link. Firefox does its own feed rendering, so you won’t see it there. But Chrome will use our stylesheet. 

Jun 23, 2014
Visual content block selector for Full-Text RSS

Our Full-Text RSS application can extract article content in web pages through a combination of automatic content detection and site-specific extraction rules.

When a user encounters a problem extracting content from a particular site, we usually point them to our help page for writing custom extraction rules. Not only can that be quite time consuming, but to do it, we assume users have an understanding of HTML and XPath. That’s a big assumption. If users can’t create extraction rules themselves, they’ll have to rely on us to do it. We’d much rather make the process easier and let users contribute their extraction rules to our repository so all users of Full-Text RSS benefit.

To that end, we’ve started work on a tool, still in the early phases, to let users create extraction rules through a simple point-and-click interface. At the moment the focus is on letting the user visually select the main content block that Full-Text RSS should extract. The tool will then generate the required XPath selector and offer a download link so the result can be saved to the appropriate Full-Text RSS folder.

Try it out

If you’re a Full-Text RSS user, please try it out and let us know what you think.

A few things to bear in mind:

  • Try this on a recent version of Firefox, Chrome, or Opera.
  • Content generated by Javascript will not be shown or accessible because Javascript from the source site is disabled (this is the same in Full-Text RSS).
  • We are not yet testing the generated XPath with Full-Text RSS, so there is a small chance that it will not actually match the desired element due to differences in parsing. We plan on including such a test, and also testing the generated selector against the site’s feed to see if it finds matches on different articles.
Compared to browser developer tools

Many browsers already offer similar element selectors in their developer tools. The closest we came to finding what we were looking for is Firefox Developer Tools’ “Copy Unique Selector”. There are a few differences, however, in how we produce the CSS selector string. The aim of our tool is to create a unique selector which matches the content element not only on the current page, but also similar pages from the same site (e.g. a different article from the same news site). So, compared to Firefox Developer Tools’ Unique Selector:

  • We do not use :nth-child selectors as we think they can be quite brittle for the use we have in mind. If we can’t create a unique selector using the element name, class attribute, id attribute, or ancestor elements, we’ll simply give up and alert the user.
  • We ignore id or class attribute values which contain a sequence of 2 or more numbers, or a hyphen followed by a number. These often indicate article-specific ID numbers. For example, #post-1117 is a unique selector, but the 1117 is most likely an ID number associated with the article currently displayed on the page. A different article from the same site will likely have a different ID number, which means our selector will not match it.
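The second rule can be sketched as a simple pattern test (an assumed implementation of the heuristic described above, not the tool’s actual code):

```python
# Sketch: reject id/class values that look article-specific - a run of
# 2+ digits, or a hyphen followed by a digit.
import re

ARTICLE_SPECIFIC = re.compile(r"\d{2,}|-\d")

def usable_for_selector(attr_value):
    return not ARTICLE_SPECIFIC.search(attr_value)

print(usable_for_selector("post-1117"))    # False: looks article-specific
print(usable_for_selector("article-body"))  # True
```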
Future

We hope to eventually reach a point where this can be integrated not only with Full-Text RSS, but with applications which rely on Full-Text RSS. Applications such as Push to Kindle, PDF Newspaper, and Wallabag. At that point we should be able to hide the CSS/XPath selectors from users and simply ask them to select the content block and click a button to have the extraction rules generated and saved.

Credits

We used a number of free software tools to create this. Most of the credit goes to the following projects:

  • Andrew Childs’ DOM Outline
  • Joss Crowcroft’s Simple JavaScript DOM Inspector
  • Andrea Giammarchi’s CSS to XPath

Full-Text RSS 3.3

May 13, 2014

The new version of Full-Text RSS, our tool to extract article content from web pages and transform partial web feeds into full-text feeds, is now available to try and buy. A few of the main changes are described below.

New endpoint for simpler JSON results

If you’re a developer, it’s now even easier to use Full-Text RSS for content extraction. We’ve added a new endpoint called extract.php. It returns results in a much simpler JSON representation. Although we’ve supported JSON output for a while, it has always been a JSON representation of our regular RSS output, which can be quite cumbersome to navigate if you have no need for the other RSS elements.

To give you an example, here’s the JSON produced using the new endpoint (the input URL was a Chomsky article):

{
    "title": "De-Americanizing the World",
    "excerpt": "During the latest episode of the Washington farce that has astonish…",
    "date": null,
    "author": "Noam Chomsky",
    "language": "en",
    "url": "http://chomsky.info/articles/20131105.htm",
    "effective_url": "http://chomsky.info/articles/20131105.htm",
    "content": "<p>During the latest episode of the Washington farce that has aston…"
}

Note: For brevity, the output above is truncated.

Input HTML

Another feature of this new endpoint is the ability to submit your own HTML in the request, rather than have Full-Text RSS fetch it for you from the URL. We’ve had requests from users who already have large collections of HTML documents and simply want to use Full-Text RSS for article extraction, not retrieval. That’s now possible using the new inputhtml parameter.1
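A request using it might be built like this (endpoint and parameter names are from this post; the base URL is a placeholder):

```python
# Sketch: supplying your own HTML via the inputhtml parameter instead of
# having Full-Text RSS fetch the page.
from urllib.parse import urlencode

html = "<html><body><article><p>Already-retrieved article text.</p></article></body></html>"
post_body = urlencode({
    # url is still required: it may select site-specific extraction rules
    "url": "http://example.org/article",
    "inputhtml": html,  # should be UTF-8 encoded
})
# POST post_body to extract.php, e.g. http://localhost/full-text-rss/extract.php
print(post_body[:40])
```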

New HTML5 parser

We’ve replaced HTML5Lib with HTML5-PHP. The old one had problems parsing a lot of pages. If you relied on HTML5Lib in the last version, we’ll automatically use HTML5-PHP in this version. But you should note that HTML5-PHP requires at least PHP 5.3. So if you’re still running Full-Text RSS on PHP 5.2, the new parser won’t be available (we’ll use libxml for everything).

It’s also now possible to request HTML5-PHP parsing using a new request parameter: &parser=html5php.

Schema.org articleBody extraction

When we’re figuring out where the article lies, we now also look for HTML elements marked with Schema.org’s articleBody property.
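For example, microdata markup like the following (illustrative, not from a real site) identifies the article body for us:

```html
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="headline">Example headline</h1>
  <div itemprop="articleBody">
    <p>The article text lives here.</p>
  </div>
</div>
```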

Proxy support

Full-Text RSS now supports routing requests through proxy servers. There are a number of reasons why you might want to do this:

  1. On some company networks you might have to use a proxy server to be able to reach external sites.
  2. While we can cache results produced by Full-Text RSS, we do not yet cache HTTP responses we receive from servers. If this is something you need, you can set up a caching proxy and route requests through that.
  3. Bypassing IP blocks. This might sound contentious, but today there are sites which pre-emptively block IP ranges belonging to popular web hosts (Linode, AWS, etc.) to avoid potential abuse. So even if your own application is not behaving abusively, you might still find that requests to certain sites are blocked simply based on where you’ve chosen to host Full-Text RSS. If this is something that affects you, you might want to consider paying for private proxies and routing your requests through those proxies.

You can specify one or more proxy servers in the config file. You can then tell Full-Text RSS to randomly pick a proxy in each request, or specify a proxy server via a request parameter: &proxy=my-proxy.

Mashape API listing

We’ve now made the service available on Mashape for developers. You’ll find API documentation on there, sample code in a number of programming languages, and an easy way to test the service. More information in our previous blog post.

Request parameters for article extraction (new endpoint)

You can pass the following parameters to extract.php in a GET or POST HTTP request. Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter Value Description
url string (URL) This is the only required parameter. It should be the URL to a standard HTML page. You can omit the ‘http://’ prefix if you like.
inputhtml string (HTML) If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded. And you will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).
content 0, 1 (default) If set to 0, the extracted content will not be included in the output.
links preserve (default), footnotes, remove Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.
xss 0, 1 (default)

Use this to enable/disable XSS filtering. It is enabled by default, but if your application/framework/CMS already filters HTML for XSS vulnerabilities, you can disable XSS filtering here.

If enabled, we’ll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: when enabled this will remove certain elements you may want to preserve, such as iframes.

lang 0, 1 (default), 2, 3

Language detection. If you’d like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

0
Ignore language
1
Use article metadata (e.g. HTML lang attribute) (Default value)
2
As above, but guess the language if it’s not specified.
3
Always guess the language, whether it’s specified or not.
debug [no value], rawhtml, parsedhtml

If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

Note: Full-Text RSS will stop execution after HTML output if one of the last two parameter values are passed. Otherwise it will continue showing debug output until the end.

parser html5php, libxml The default parser is libxml as it’s the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It’s slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.
proxy 0, 1, string (proxy name) This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection). 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).
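To show how the parameters above fit together, here’s a minimal client-side sketch in Python (standard library only) that builds a POST request body for the extract.php endpoint. The endpoint host is a made-up placeholder; the inputhtml, url, links and content parameter names come from the table above.

```python
from urllib.parse import urlencode

# Placeholder host - replace with your own Full-Text RSS installation.
ENDPOINT = "http://example.org/full-text-rss/extract.php"

def build_extract_request(url, inputhtml=None, links="preserve", content=1):
    """Build the endpoint URL and POST body for an extract.php request.

    If inputhtml is given (it must be UTF-8 encoded HTML), the service
    makes no HTTP request for the content, but `url` is still required
    because it selects any site-specific extraction rules.
    """
    params = {"url": url, "links": links, "content": content}
    if inputhtml is not None:
        params["inputhtml"] = inputhtml
    return ENDPOINT, urlencode(params)

endpoint, body = build_extract_request(
    "http://chomsky.info/articles/example.htm",
    inputhtml="<html><body><p>Hello</p></body></html>",
)
print(body)
```

Sending the body with an HTTP client of your choice (and a Content-Type of application/x-www-form-urlencoded) is all that’s left.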

Request parameters for feed conversion

You can pass the following parameters to makefulltextfeed.php in a GET HTTP request. Most of these parameters have default values suitable for news enthusiasts who simply want to subscribe to a full-text feed in their news reading application. If that’s what you’re doing, you can safely ignore the details here. Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter Value Description
url string (URL) This is the only required parameter. It should be the URL to a partial feed or a standard HTML page. You can omit the ‘http://’ prefix if you like.
format rss (default), json The default Full-Text RSS output is RSS. The only other valid output format is JSON. To get JSON output, pass format=json in the querystring. Exclude it from the URL (or set it to ‘rss’) if you’d like RSS.
summary 0 (default), 1 If set to 1, an excerpt will be included for each item in the output.
content 0, 1 (default) If set to 0, the extracted content will not be included in the output.
links preserve (default), footnotes, remove Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.
exc 0 (default), 1 If Full-Text RSS fails to extract the article body, the generated feed item will include a message saying extraction failed, followed by the original item description (if present in the original feed). You can ask Full-Text RSS to remove such items from the generated feed completely by passing 1 in this parameter.
html 0 (default), 1

Treat input source as HTML (or parse-as-html-first mode). To enable, pass html=1 in the querystring. If enabled, Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

Note: If excluded, or set to 0, Full-Text RSS first tries to parse the server’s response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html-first mode, Full-Text RSS will identify itself as a browser from the very first request.

xss 0 (default), 1

Use this to enable XSS filtering. We have not enabled this by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it’s good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMSs which display feed content - the content should be treated like any other user-submitted content.

If you are writing an application yourself which processes feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side - although there’s client-side XSS filtering available too.

If enabled, we’ll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

callback string This is for JSONP use. If you’re requesting JSON output, you can also specify a callback function (Javascript client-side function) to receive the Full-Text RSS JSON output.
lang 0, 1 (default), 2, 3

Language detection. If you’d like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

0
Ignore language
1
Use article metadata (e.g. HTML lang attribute) or feed metadata. (Default value)
2
As above, but guess the language if it’s not specified.
3
Always guess the language, whether it’s specified or not.

If language detection is enabled and a match is found, the language code will be returned in the <dc:language> element inside the <item> element.

debug [no value], rawhtml, parsedhtml

If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

Note: Full-Text RSS will stop execution after HTML output if either of the last two parameter values is passed. Otherwise it will continue showing debug output until the end.

parser html5php, libxml The default parser is libxml as it’s the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It’s slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.
proxy 0, 1, string (proxy name) This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection). 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).

Feed-only parameters — These parameters only apply to web feeds. They have no effect when the input URL points to a web page.

Parameter Value Description
use_extracted_title [no value] By default, if the input URL points to a feed, item titles in the generated feed will not be changed - we assume item titles in feeds are not truncated. If you’d like them to be replaced with the titles Full-Text RSS extracts, use this parameter in the request (the value does not matter). To enable/disable this for all feeds, see the config file - specifically $options->favour_feed_titles.
max number The maximum number of feed items to process. (The default and upper limit will be found in the configuration file.)
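The feed-conversion parameters above can all be passed in the querystring of a GET request. As an illustration, here’s a small Python (standard library) sketch that assembles such a URL; the base URL is a made-up placeholder for your own installation, while url, format, links and max are the parameters documented above.

```python
from urllib.parse import urlencode

# Placeholder base URL - point this at your own copy of Full-Text RSS.
BASE = "http://example.org/full-text-rss/makefulltextfeed.php"

def full_text_feed_url(feed_url, **options):
    """Build a GET URL for makefulltextfeed.php.

    `url` is the only required parameter; everything else (format,
    summary, content, links, exc, html, xss, max, ...) is optional
    and subject to limits set in the configuration file.
    """
    params = {"url": feed_url}   # the http:// prefix may be omitted
    params.update(options)
    return BASE + "?" + urlencode(params)

u = full_text_feed_url("example.org/feed", format="json", links="remove", max=5)
print(u)
```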

Full changelog

  • Content extractor now looks for Schema.org articleBody elements
  • New endpoint extract.php for developers looking for simpler JSON results (no RSS as input/output)
  • New endpoint extract.php accepts POST requests and HTML as input (inputhtml request parameter)
  • Proxy support added (proxy servers can now be added to the config file, see $options->proxy_servers, ->proxy and ->allow_proxy_override)
  • New HTML5 parser: HTML5Lib has been replaced by HTML5-PHP (the old one had too many problems)
  • New config option: cache time ($options->cache_time)
  • New config option: enable/disable single-page retrieval ($options->singlepage)
  • New config option: allow HTML parser override through querystring ($options->allow_parser_override)
  • New request parameter: parser - use it to force new HTML5 parser to be used, &parser=html5php (it will be slower)
  • Expanded debug request parameter: &debug=rawhtml (shows original response headers and body), &debug=parsedhtml (shows response body after parsing)
  • APC stats page now expects APCu (older version of APC still supported, but stats within admin area won’t be viewable)
  • Auto update of site-specific extraction rules fixed
  • Content security HTTP headers now used for the feed preview
  • Request parameters and response examples now listed in a table on the index page (new Request Parameters tab)
  • Compatibility test file updated to show if HTML5-PHP parser is supported (PHP 5.3 dependency), and to test for HHVM (not yet supported)
  • Config option removed: $options->registration_key
  • Preserve TTL element in RSS 2.0 feeds
  • Other minor fixes/improvements

Available to buy

Full-Text RSS 3.3 is now available to buy. If you’re an existing customer, please wait for an email from us with an upgrade link.


  1. When submitting HTML content directly in the request, make sure it’s UTF-8 encoded (we won’t try to detect encoding for you). You will also still need to supply a URL in the url parameter as that will determine how we extract the article (if we have site-specific rules associated with the URL). If you don’t have a URL, you can either submit the base URL where the document came from (e.g. chomsky.info) or simply make up a URL (e.g. example.org). 

New version of Full-Text RSS on Mashape

May 10, 2014

We’re getting ready to release the new version of Full-Text RSS, our tool to extract article content from web pages and transform partial web feeds into full-text feeds.

One of the big changes in this release is a new endpoint to make article extraction from regular web pages easier for developers. Although we’ve supported JSON output for a while, it has always been a JSON representation of our regular RSS output, which can be quite cumbersome. If you’re a developer, using our new endpoint for article extraction should be a lot easier.

To give you an example, here’s the JSON produced using the new endpoint (the input URL was a Chomsky article):

Try it out

If you’re curious, have a look at our new Mashape API listing. You’ll find the request parameters you can use with the new endpoint, including the ability to pass in HTML directly. We’ve also included sample output, so you’ll get an idea of what kind of responses you’ll receive.

In addition to the nice API documentation, you can also test the new version within Mashape without having to write any code. Simply fill in the parameter fields and click ‘Test Endpoint’. (You’ll have to sign up for the free plan on Mashape to do this.)

If you’re happy with the test results, you can start writing code. You’ll find that Mashape has already generated code to get you started, pre-filled with our endpoint URL and request parameters. Not only that, they’ve got the same code in Java, Node, PHP, Python, Objective-C, Ruby and .NET.

The Mashape code relies on their own open source HTTP client, so it’s not tied to their service. If you ever decide to host your own copy of Full-Text RSS (it’s free software), you’ll be able to change the base URL to point to your own copy.

We’ll have more information about the new version of Full-Text RSS in another blog post.

Reading List will be discontinued

Mar 11, 2014

Later in the year we will discontinue our Reading List service in favour of Wallabag.

If you haven’t heard of Wallabag, it is a self-hostable, free software read later alternative to Instapaper, Readability, Pocket1, and of course our own Reading List.2 We are happy to have been able to contribute Full-Text RSS 3.1 to handle Wallabag’s content extraction.

You can download Wallabag and install it on your own server, or try their hosted service. It also comes with a number of browser extensions and mobile apps.

Our Reading List service will continue to work for existing users, but sometime around August we will take it down. Before then we plan to release the source code for anyone interested.


  1. Wallabag used to be called Poche (the French word for pocket), until Pocket got nasty and threatened legal action. 

  2. We developed Reading List before we’d heard about Wallabag. Now that it’s available, and appears to be actively maintained, we have decided to focus on our existing tools instead of continuing work on Reading List. 

PDF Newspaper 2.5

Feb 4, 2014

We’ve just released a new version of PDF Newspaper, our tool to create printable versions of web articles and feeds. Here’s what’s new.

HTML output

The biggest change in this release is the ability to request HTML output instead of PDF. It may seem odd to offer HTML output in an application called PDF Newspaper, but improved browser support for recent CSS specifications means browsers are now quite adept at producing print layouts very similar to the PDF files we were generating before.

The HTML view we generate contains a print stylesheet to produce a somewhat similar result to our PDF output when printing. There are a number of differences though:

  • You can edit the text before printing.
  • Unlike our PDF output, columns are balanced (roughly same height) in this view.
  • You can convert to PDF using your own PDF creator (e.g. Acrobat) provided it has a PDF print driver.1
  • Our print stylesheet will use 3 columns if you print A4 landscape or larger, 2 columns for A4 portrait, and 1 column for A5 or smaller.2
  • Right-to-left languages can now be displayed.3
  • If you know HTML/CSS, you can make changes to the template.4

Video

Here’s a short video demonstrating some of the features of HTML output — mainly the multi-column CSS rules being applied in print view, based on paper size and orientation.

Live Examples

To load the feeds we used in the video with PDF Newspaper, use the links below:

PDF screenshots (A4)

[Screenshots: PDF Newspaper output in multi-story mode and single-story mode]

More screenshots

PDF changes

We’ve made changes to our PDF output to address problems with PDF.js rendering (used by recent versions of Firefox to render PDFs in the browser) and iOS rendering. If you used a previous version of PDF Newspaper and had trouble seeing your PDFs on Firefox or iOS devices, this new version should work better.

We also now offer a Letter template in addition to A4 and A5.

Combining stories from different sources

If you’d like to generate a newspaper by selecting articles from different sources, you’ll have to first create a feed from those articles. The easiest way to do that is to use our Feed Creator. Here you can paste URLs, one per line, and create a static feed containing only the URLs you’ve entered.

Let’s say we want to create a newspaper from the following articles:

We would paste the URLs to these articles in the Feed Creator field provided and click Create Simple RSS. This will generate a feed containing the 3 items above:

Now if we give this feed URL to PDF Newspaper, here’s what it can produce:

Try it!

PDF Newspaper 2.5 is live now at FiveFilters.org.

Request parameters

Using the form we provide is the easiest way to get started, but if you want to call PDF Newspaper programmatically, the following table of request parameters will tell you what PDF Newspaper can accept. These parameters should be used in an HTTP GET request to makepdf.php.

Parameter Value Description
url string URL of a feed or a single web article.
mode multi-story (default), single-story
multi-story
Use for feeds. You can customise the newspaper title.
single-story
Use to process a single article (even with feeds). Produces a more compact layout, omitting newspaper title.
template A4 (default), Letter, A5 Sets PDF paper size. The A4 and Letter templates produce a larger, two-column PDF. The A5 template produces a smaller, single-column PDF.
output pdf (default), pdf-download, html
pdf
Your browser decides what to do with the generated PDF - either it will load the PDF within the browser, download it, or prompt you to choose an action.
pdf-download
Tells the browser that you want to download the PDF rather than view it inside the browser. It will either download automatically or prompt you to choose an action.
html
Outputs the generated HTML without producing a PDF. Produces a result faster than the pdf options, but uses a print stylesheet to achieve a somewhat similar result when you print. Note: the template parameter currently has no effect with HTML output - if printing, you will set the paper size in the print dialog that appears. For best results printing or creating a PDF from this view, please use Firefox.
dir auto (default), ltr, rtl Sets text direction: auto = browser decides, ltr = left-to-right, rtl = right-to-left. This parameter currently only works for HTML output.
images 1 or 0 (default) Include images. Pass 1 to enable.
date 1 or 0 (default) Include date and time for each article (if available). Pass 1 to enable.
sub string This can be a tagline, slogan, or the main title if using single story mode. If omitted, the default one set in the config file will be used.

Multi-story parameters — These parameters only apply when multi-story mode is enabled (see mode parameter above).

Parameter Value Description
title string Newspaper title. If you’d like to use the default title image instead, delete the title.
order desc (default) or asc Determines how stories will be ordered by date. Pass asc for chronological ordering (oldest story in the feed appears first). Pass desc to have the latest stories shown first.
date_start string If the feed contains dates for feed items, you can restrict items returned by specifying a start date. Any items with a publish date earlier than date_start will be omitted from the output.
date formats
You can pass an absolute date using the YYYY-MM-DD format, e.g. 2014-01-24 or a relative one, e.g. last week or yesterday.
date_end string If the feed contains dates for feed items, you can restrict items returned by specifying an end date. Any items with a publish date later than date_end will be omitted from the output. See note above about using relative and absolute dates.

Full-Text RSS integration — When you give PDF Newspaper a URL to a web article, Full-Text RSS is automatically used to extract its content. For a partial feed, you will have to tell PDF Newspaper if you’d like it passed to Full-Text RSS for full content. (Full-Text RSS integration can be configured or disabled within the config file.)

Parameter Value Description
fulltext 1 or 0 (default) Use for partial feeds. Runs feed through Full-Text RSS before processing result. Pass 1 to enable.
use_extracted_title 1 or 0 (default) Normally feed titles take precedence over extracted titles. Pass 1 to tell Full-Text RSS to replace feed titles with those it extracts. (Requires Full-Text RSS version 3.2 or greater. Has no effect without the fulltext parameter.)

API keys — If you want to restrict access to PDF Newspaper you can specify API keys in the config file. URLs produced by PDF Newspaper can be used publicly, e.g. linked from a website, so the API key should not appear in the final URL.

Parameter Value Description
api_key string A key that you’ve entered in the config. If you’re calling PDF Newspaper programmatically, it’s better to use the key and hash parameters (see below) to hide the actual key in the HTTP request. If this parameter is used, PDF Newspaper will produce the key and hash values automatically and redirect to a new URL to hide the API key. If you’d like to link to a PDF publicly while protecting your API key, make sure you copy and paste the URL that results after the redirect. If you’ve configured PDF Newspaper to require a key, an invalid key will result in an error message.
key integer This should be the index number which identifies an API key without revealing it. It must be passed along with the hash parameter. See the config file.
hash string A SHA-1 hash value of the API key (actual key, not index number) and requested URL, concatenated. It must be passed along with the key parameter. In PHP, for example: $hash = sha1($api_key.$url);
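To illustrate the key/hash scheme, here’s a sketch of a client in Python (standard library only) that computes the hash and builds a signed makepdf.php URL without ever exposing the key itself. The base URL and the API key value are made-up assumptions; the url, key and hash parameter names come from the table above.

```python
import hashlib
from urllib.parse import urlencode

# Made-up key for illustration - real keys live in the PDF Newspaper config file.
API_KEYS = ["my-secret-key"]   # index 0 in the config

def signed_makepdf_url(url, key_index=0,
                       base="http://example.org/pdf-newspaper/makepdf.php"):
    """Build a makepdf.php request using the key/hash parameters so the
    actual API key never appears in the URL. This mirrors the PHP
    one-liner: $hash = sha1($api_key . $url);
    """
    api_key = API_KEYS[key_index]
    h = hashlib.sha1((api_key + url).encode("utf-8")).hexdigest()
    return base + "?" + urlencode({"url": url, "key": key_index, "hash": h})

u = signed_makepdf_url("http://example.org/article")
print(u)
```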

Required parameters: url must be supplied.

Changelog

  • New: HTML output with editable content and print stylesheet (Firefox recommended for printing)
  • New: Output parameter to choose between PDF, HTML, and PDF for download
  • New: Text direction parameter (only for HTML output for the time being)
  • New: PDF Letter template for US users
  • New: Form field to specify start date (if feed items include dates)
  • New: Config option to set PDF filename - see $options->filename
  • New: Config option to enable/disable output caching - see $options->caching
  • New: Config option for whitelisting/blacklisting hosts - see $options->allowed_hosts and $options->blocked_urls
  • Font subsetting disabled in PDF output to improve iOS and PDF.js rendering
  • PDF is no longer generated if there are no items to include (e.g. no articles published after start date)
  • Table showing available request parameters now shown in index.php
  • Full-Text RSS updated to version 3.1
  • HTML Purifier updated to version 4.6.0
  • SimplePie updated to version 1.3.1 
  • PHP Typography updated
  • Humble HTTP Agent updated
  • TCPDF fonts updated
  • TCPDF minor update (latest version not compatible with our modifications)
  • Plus other minor fixes/improvements

  1. If you have a PDF print driver, you’ll see it in your list of printers when you go to print. If you have Acrobat, you’ll probably see ‘Adobe PDF’ in the list. If you get an error creating a PDF using Adobe PDF, go into Properties and in Adobe PDF Settings, uncheck ‘Rely on system fonts only; do not use document fonts’. 

  2. Not all browsers currently support multi-column printing. Notably, Chrome, which supports multi-column layouts in its regular view, does not support it in its print view. In our tests, Firefox (tested with version 26) and IE (version 11) had the best support for multi-column printing. 

  3. By default we rely on the browser to decide (based on the content) whether it should show the result as right-to-left (we use the dir="auto" attribute). You can override this, however, by passing &dir=rtl in the querystring to makepdf.php. 

  4. To make changes to the template used in the HTML view, save a copy of html_template.html as custom_html_template.html and edit it. The custom file will be used if it exists. This only applies to users of our self-hosted package. 

Feed Creator: Our new tool to monitor web pages using custom feeds

Oct 19, 2013

If you’ve ever wanted to treat a set of links on a website as a web feed (so you can subscribe and be notified of updates), you might find our new feed creation tool useful.1

You can use the tool to create a feed from almost any publicly accessible webpage. That includes:

  • Twitter streams
  • Public Facebook timelines
  • Search results on a website

Here are a few cases and examples where you might want to use Feed Creator:

  • A webpage has no feed of its own
    For example: Twitter accounts or public Facebook timelines2
  • There is a feed, but not for the items that interest you
    For example: Search results on a website or a category on a news site

How does it work?

Our service sits in between your feed reader (e.g. NewsBlur, Feedly, IFTTT) and the publisher’s website.

If a website already offers a feed for the information you’re interested in, this is typically how your feed reader gets updates after you subscribe to the feed:

Feed Reader communicates directly with publisher's server

If there is no suitable feed, our service produces one based on the information you give it. You then subscribe to the feed we generate. Now when a feed reader requests the feed, here’s what happens behind the scenes:

Feed Reader communicates with publisher's server via the Feed Creator service

Getting started

To use the feed creator you will need:

  1. The URL of the source page which contains the items you’re interested in.
  2. Some knowledge of HTML (and CSS for advanced selection)

Before we go on, I’d like to stress that if there’s already a feed associated with the webpage, you should use it instead of relying on this tool.3 Feed Creator extracts links from the webpage by looking at its HTML.4 If a website gets redesigned, HTML changes could break our generated feeds.

Extracting links: 3 different ways

If you supply only the page URL, Feed Creator will return the first set of links it encounters in the HTML. This will include things like navigation elements - which usually appear at the top of the page. That’s probably not what you want. Below we’re going to look at 3 different ways to extract the items you’re interested in.

1. Selecting links using URL segments

The simplest way to narrow results to the set of links you are interested in is to see if you can find a URL segment that’s exclusive to that set of links.

Example

The official Noam Chomsky website has a page listing Chomsky’s articles. There is no RSS feed linked on the page nor in the HTML header.5 The first set of links point to other pages on the site (recent updates, books, audio and video) - these are navigation elements which we’re not interested in. Below those, the article links appear. These are the links we want to use in our RSS feed and monitor for updates.

If you hover your cursor over a few of the article links you’ll find that they all contain the segment ‘articles/’ - e.g. ‘articles/20130902.htm’, ‘articles/20130604.htm’. So it looks like ‘articles/’ is common to all these links. If you now hover over the navigation links, you’ll find: ‘books.htm’, ‘audionvideo.html’, ‘articles.htm’ (note, no forward slash here).

So, our brief examination suggests that the URL segment ‘articles/’ is exclusive to the article links we’re interested in. Let’s go ahead and try creating a feed from this information:

  1. Visit the Feed Creator site
  2. In the page URL field enter: http://chomsky.info/articles.htm
  3. In the ‘keep links if link URL contains’ field: articles/
  4. Click ‘Preview’ and wait for results.
  5. If results look okay, you can subscribe to our generated feed using the button provided.

It’s important to note that what we’re trying to do is to identify patterns within the page that will not only return items that are currently on the page, but also pick up future entries. That’s why we don’t want to select links using identifiers that only apply to existing links (e.g. ‘20130902.htm’ or ‘20130604.htm’).

2. Selecting links using class and id attributes

Sometimes you’ll need more than a URL segment to select the links you want. If you know some HTML you can check the source of the page and see if there are class or id attributes associated with the links, their parent elements, or their ancestors.6 If you find some, you can use those values to restrict your search to those elements.

Example

John Pilger’s website already offers an RSS feed for his articles, so this is one of those cases where you shouldn’t really use this tool. But I’ll use it as an example.

If you visit the articles page and click ‘Expand all articles’, you’ll see his latest articles at the top. If you examine the HTML, you’ll find the entries are marked up as follows:

<span class="entry">
    <a href="{{ article url }}" class="entry-link">{{ article title }}</a>
    <span class="entry-date" title="1 day ago">{{ article date }}</span>
    <a href="#" rel="nofollow" class="show-intro" id="showintro-815">Show intro...</a>
    <span class="intro" id="article-intro-815">{{ article description }}</span>
</span>


Each article entry is contained in a <span> element with the class attribute value “entry”. This element holds two link (<a>) elements. The actual article title and URL appear in the <a> element with class “entry-link”.

So let’s try creating a feed from this information:

  1. Visit the Feed Creator site
  2. In the page URL field enter: http://johnpilger.com/articles
  3. In the ‘look for links inside…’ field enter: entry-link
  4. Click ‘Preview’ and wait for results.

Here’s a direct link to results.

3. Selecting links using CSS selectors

For more advanced selection, you can use CSS selectors. Note that this selection method cannot be used in combination with the previous one, and we don’t yet offer form fields for entering selectors, so you’ll have to create the URL by hand. You can refer to the information in the request parameters table to see how these should be used.

Example

First, let’s see what the previous example looks like using a CSS selector:

  1. Page URL: http://johnpilger.com/articles
  2. item parameter: .entry-link or a.entry-link

Here’s a direct link. It should produce the same results as in the previous example.

Now let’s look at a more complicated example: a Twitter timeline. If you view a Twitter timeline in your browser, this is how tweets are currently marked up in HTML:7

<div class="tweet original-tweet js-stream-tweet ...">
    <span class="icon dogear"></span>
    <div class="content">
        <div class="stream-item-header">
            <a class="account-group ..." href="..." data-user-id="...">
                <img class="avatar js-action-profile-avatar" src="..." alt="">
                <strong class="fullname js-action-profile-name ...">...</strong>
                <span>&rlm;</span>
                <span class="username ..."><s>@</s><b>...</b></span>
            </a>
            <small class="time">
                <a href="{{ tweet URL }}" class="tweet-timestamp ..." title="{{ date }}">
                    <span class="_timestamp js-short-timestamp ...">1h</span>
                </a>
            </small>
        </div>
        <p class="js-tweet-text tweet-text">{{ tweet text }}</p>
        ...
    </div>
</div>


Notice that the tweet URL appears in the element which holds the date, and there is no suitable title (unless you consider the tweet text to be the title) to use for feed items. So here we’re going to tell Feed Creator to omit item titles, and to use the tweet URL as the item URL. We could tell it to use the tweet text (p.tweet-text) as the description, but then we wouldn’t know who tweeted it (could be a retweet), so we’ll tell it to use the parent element (div.content). Here’s what our parameters will look like:

  1. Page URL: twitter.com/fivefilters/ (use any Twitter account you like)
  2. item: .original-tweet
  3. item_url: a.tweet-timestamp
  4. item_desc: .content
  5. item_title: 0

If we stop here, we’ll find that the description will contain text from elements within div.content which we’re not interested in. So let’s remove these elements using the strip parameter: .stream-item-footer,.username,.js-short-timestamp

Here’s a direct link.
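Since there are no form fields for these advanced parameters yet, the URL has to be assembled by hand. Here’s a sketch in Python (standard library) of how that assembly might look; the item, item_url, item_desc, item_title and strip parameter names come from the steps above, while the exact script path on createfeed.fivefilters.org and the name of the page-URL parameter ("url" here) are assumptions for this sketch.

```python
from urllib.parse import urlencode

# Assumed base URL of the hosted Feed Creator service.
BASE = "http://createfeed.fivefilters.org/"

def twitter_feed_url(account):
    """Build a Feed Creator URL for a Twitter timeline, using the
    selectors worked out in the example above."""
    params = {
        "url": "twitter.com/%s/" % account,   # the source page
        "item": ".original-tweet",            # CSS selector matching each tweet
        "item_url": "a.tweet-timestamp",      # tweet permalink lives on the date link
        "item_desc": ".content",              # whole content block as the description
        "item_title": "0",                    # omit item titles
        "strip": ".stream-item-footer,.username,.js-short-timestamp",
    }
    return BASE + "?" + urlencode(params)

print(twitter_feed_url("fivefilters"))
```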

Hosted service with self hosting option

Our hosted service (the one accessible on createfeed.fivefilters.org) is free to use. It is intended for personal use and to show you what the feed creator application can do. We limit results to 10 items per feed and we’ll soon start caching webpages for around 30 minutes.

If our service turns out to be popular for subscribing to Twitter/Facebook feeds, there’s a possibility that these companies will block access. If you want to avoid that happening, you should consider running your own copy of the code. You’ll find information on the self-hosted package and a ‘buy’ button at the bottom of the Feed Creator page.

That’s all for now. Hope you found the guide useful. Feel free to comment or post a question on our help page if you like.


  1. We announced this service in May, but have since improved it and made it available for self-hosting. 

  2. Facebook and Twitter don’t like RSS. When you visit a public account on either site, you will not find RSS links on the page nor in the HTML header (which is how other services usually detect if there’s a feed associated with a page). Facebook can in fact produce a feed for public pages, but you’ll have to do some work to get the feed URL. Twitter no longer produces a feed, so you’ll have to use a service like ours. 

  3. If a site publishes a feed, they’re generally committed to maintaining it. People who subscribe to the feed depend on it to get notified of updates, so it’s not really in the publisher’s interest to remove the feed. 

  4. We don’t yet support pages which rely exclusively on JavaScript to load content. 

  5. Chomsky.info used to link to one of my earlier projects which produced an RSS feed for their latest news page. They then moved to Blogger, which provides feeds. Now they appear to be using Facebook for their latest news. 

  6. The easiest way to find class and id attributes to use as selectors is to use Firefox’s page inspector. Simply right-click a link or some other element on the page and select ‘Inspect Element’. 

  7. When viewing the HTML source of a webpage, or using the Firefox page inspector, you should bear in mind that it might not be seen the same way by our software. Servers may send back different responses depending on where the request originates, and what it contains. Another issue is how the HTML response is parsed. Even if the server sends back the same response to us as it does to your browser, your browser could parse it differently to our application (which uses PHP’s built-in HTML parser). 

Gothenburg Book Fair

Sep 3, 2013

We have a small stand at the Gothenburg Book Fair this year. If you’re going, you’ll find us at stand F02:41.

We’ll be demoing some of our current applications and introducing a new one. More information soon.

Update: Thanks to everyone who showed up. The new application we demoed is the Reading List—we’ll have more to say about this soon.

Google Reader closing down

Jun 6, 2013

As you may have heard, Google is shutting down Google Reader on July 1, 2013.

We’ve never been fond of Google Reader, mainly for these reasons:

  • It’s not free software
  • It’s a Google product
  • It doesn’t treat feeds equally—updating popular feeds (high subscriber counts) much more frequently than less popular feeds

So, how will this affect our users?

Some time ago, to deal with Google Reader’s selective treatment of feeds, we introduced monitored feeds for subscribers of our premium Full-Text RSS service. This was an experimental feature aimed specifically at Google Reader users. Feeds marked in this way are periodically checked, and updates are pushed to Google Reader using PubSubHubbub.
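The push step uses the standard PubSubHubbub publish notification: the publisher POSTs hub.mode=publish and the feed URL to a hub, which then fetches the feed and delivers updates to subscribers. Here’s a minimal sketch; the feed URL is hypothetical, and Google’s public hub is assumed for illustration:

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical feed URL; the hub shown is Google's public
# PubSubHubbub hub -- substitute whichever hub the feed declares.
hub = "https://pubsubhubbub.appspot.com/"
body = urlencode({
    "hub.mode": "publish",
    "hub.url": "http://example.org/monitored-feed.xml",
}).encode()

# A Request with a data payload is sent as a POST.
request = urllib.request.Request(hub, data=body)
# Uncomment to actually notify the hub:
# urllib.request.urlopen(request)
```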

Since Google Reader is shutting down, we will soon be removing this feature for new users. Existing users who have set up monitored feeds will receive an email with more information sometime after July 1, 2013.

Readable.cc - a very nice news reader

Finally, if you’re looking for a simple, clean news reader, Readable.cc by Elbert Alias is worth a look. It’s free software (GPL) written in PHP, and there’s a hosted version you can sign up for and try out.
