Full-Text RSS 3.2 released

May 14, 2013

Full-Text RSS 3.2 is now available for purchase.

What’s new?

There are quite a few improvements in this release (see below), but the main one is probably the ability to include excerpts in the output.

You can enable excerpts by passing ‘&summary=1’ in the query string. This will place a plain text excerpt from the extracted content in the description element.

Another addition is the ability to omit full-text content from the output. So if you want Full-Text RSS to return only excerpts from a feed or web page, you can pass &summary=1&content=0 in the query string.

Note: if both content and excerpts are requested, the excerpt will be placed in the description element and the full content inside content:encoded. If excerpts are not requested, the full content will go inside the description element — where it has always appeared in previous versions.

Why does the location of the content change when excerpts are requested? According to the RSS advisory: “Publishers who employ summaries should store the summary in description and the full content in content:encoded, ordering description first within the item. On items with no summary, the full content should be stored in description.”
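
The placement rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the code Full-Text RSS itself runs: the function name and arguments are hypothetical, and the post does not say what happens when neither excerpts nor content are requested, so the sketch simply refuses that case.

```python
def place_item_text(excerpt, full_content, summary=False, content=True):
    """Return a dict mapping RSS element names to the text each should hold.

    Hypothetical helper: it mirrors the rule described in this post and is
    not taken from the Full-Text RSS source code.
    """
    if summary and content:
        # Excerpt goes in description, full text in content:encoded,
        # with description ordered first, as the RSS advisory recommends.
        return {"description": excerpt, "content:encoded": full_content}
    if summary:
        # &summary=1&content=0: only the excerpt is returned.
        return {"description": excerpt}
    if content:
        # No excerpt requested: full content stays in description,
        # where it appeared in previous versions.
        return {"description": full_content}
    # Neither requested: not covered by the post, so this sketch rejects it.
    raise ValueError("neither excerpt nor full content requested")


if __name__ == "__main__":
    print(place_item_text("Short excerpt", "<p>Full article</p>", summary=True))
```

Python dictionaries keep insertion order, so description comes before content:encoded when both are present.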

The full changelog for this release:

  • A short excerpt from the first few lines of the extracted content can now be included in the output (pass &summary=1 in querystring, see $options->summary in config file for more info)
  • Full content can now be excluded from the output (pass &content=0 in querystring, see $options->content in config file for more info)
  • Site config files can now be automatically updated from our GitHub repository (URL to call visible in admin area)
  • Site config files updated for better extraction
  • PHP Readability updated to be more lenient when pruning HTML
  • Language detection library updated
  • HTML meta refresh redirects now also followed
  • APC stats (if APC is available on your server) now visible in admin area
  • Bug fix: Duplicate find_string and replace_string values in site config files no longer removed (thanks Fabrizio!)
  • Bug fix: MIME type actions now applied when following single page URLs
  • Other minor fixes/improvements

Customers who purchased earlier versions should shortly receive an email with either a free download link (if purchased in the last 12 months) or links to a discounted upgrade.

Hosting our applications

May 12, 2013

We’re often asked to recommend web hosts for our applications. We’ve now added a new help page for Full-Text RSS doing just that. The suggestions apply to our other PHP tools too.

Hope you find it useful.

Code repository updated

Apr 18, 2013

While we sell the latest versions of our software to sustain our project, our code repository contains older versions of some of our code along with code we’ve ported to PHP from other languages for use in our tools.

We recently updated the repository. Here’s what’s new:

PHP Readability

PHP Readability is a library which detects and extracts the content block (article text) from a given HTML document. It is a PHP port of the original Readability code developed by Arc90.

In Full-Text RSS 3.1, we updated PHP Readability to preserve more images and YouTube, Vimeo and Viddler embeds. This update is now available to download.

We’ve also made it compatible with Composer by adding a composer.json file to the package and listed it on Packagist. See below for an example of how you can use this in a new project with Composer.

Term Extractor

Term Extractor is a library used to extract terms from English language texts. It is a PHP port of Topia’s Term Extractor.

Term Extractor is now also compatible with Composer and listed in Packagist. See below for an example of how to use this with Composer.

To use Term Extractor as a web service, we sell a self-hosted package - useful if you’d like to switch away from relying on Yahoo’s Term Extractor.

Full-Text RSS 2.9.5

Version 2.9.5, which was released 2012-04-29, is now freely available. The previous version in our code repository was 2.8 (released 2011-05-30). Here’s the changelog.

This does not contain the site config files we include with purchased copies, but these are now all available online. If you’d like to keep yours up to date using Git, follow the steps below:

  1. Change into the site_config/standard/ folder
  2. Delete everything in there
  3. Using the command line, enter:
    git clone https://github.com/fivefilters/ftr-site-config.git .
  4. Git should now download the latest site config files for you.

To update the site config files again, you can simply run git pull from the directory.

If you find Full-Text RSS useful, there are some nice changes in 3.0 and 3.1 which you might want to consider - if you’re not already a customer. :)

Using PHP Readability and Term Extractor with Composer

  1. Install Composer if you haven’t already.
  2. At the root of your project, create a file called composer.json (or update it) with the following content:
    {
        "require": {
            "fivefilters/php-readability": "1.0.*",
            "fivefilters/term-extractor": "1.0.*"
        }
    }
  3. From the command line, run composer install (or composer update if your project already uses Composer).
  4. The new files should be downloaded into the vendor folder and ready to use.

Full-Text RSS 3.1 released

Mar 6, 2013

Full-Text RSS 3.1 is now available for purchase. The changelog entry for this release:

  • PHP Readability updated to preserve more images/videos
  • Site config files updated for better extraction
  • SimplePie updated
  • New site config option favour_feed_titles and request parameter use_extracted_title to allow extracted titles to be used in generated feed
  • Remove image lazy loading (looks for markup used by http://wordpress.org/extend/plugins/lazy-load/)
  • <category> elements appearing inside <item> elements are now preserved in generated feed
  • <media:thumbnail> elements now preserved
  • Allow multiple <media:content> elements (previously only one was preserved)
  • Bug fix: No more self-closing iframe elements
  • Bug fix: Fixed manifest.yml to prevent error message when deploying to AppFog
  • Other minor fixes/improvements

Customers who purchased in the last year should have received an email with a free download link.

Full-Text RSS site config files updated and now on GitHub

Feb 27, 2013

Full-Text RSS, our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, we check to see if there are extraction rules for the site being processed. If there are no site patterns, we try to detect the content block automatically.
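
As a rough sketch of that lookup-then-fallback flow, here is a small Python example. Everything in it is an assumption made for illustration: the rules are keyed by hostname in a plain dictionary, and apply_rules and auto_detect are placeholder callables standing in for the real extraction steps. Full-Text RSS itself is written in PHP and its actual matching logic may differ.

```python
from urllib.parse import urlparse


def find_site_config(url, site_configs):
    """Return the extraction rules for the URL's hostname, or None.

    site_configs is a hypothetical dict mapping hostnames to rules.
    """
    host = urlparse(url).hostname or ""
    return site_configs.get(host)


def extract(url, html, site_configs, apply_rules, auto_detect):
    """Use site-specific rules when they exist, otherwise auto-detect."""
    rules = find_site_config(url, site_configs)
    if rules is not None:
        # Site patterns exist for this host: use them.
        return apply_rules(html, rules)
    # No site patterns: try to detect the content block automatically.
    return auto_detect(html)


if __name__ == "__main__":
    configs = {"example.com": {"body": "//div[@id='article']"}}
    result = extract(
        "http://example.com/story",
        "<html>...</html>",
        configs,
        apply_rules=lambda html, rules: "extracted with site rules",
        auto_detect=lambda html: "extracted by auto-detection",
    )
    print(result)
```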

Today we’ve updated the site config files which contain these rules and also uploaded them to GitHub: Full-Text RSS site config files on GitHub.

We hope having them on GitHub will encourage users to contribute updates for the sites they like and to keep their own copies up to date.

This is also what powers our Push to Kindle and PDF Newspaper tools. So users of either of these tools are more than welcome to submit improvements.

Contributing changes

We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: file editing through the web interface.

You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:

The Fork & Pull Model lets anyone fork an existing repository and push changes to their personal fork without requiring access be granted to the source repository. The changes must then be pulled into the source repository by the project maintainer. This model reduces the amount of friction for new contributors and is popular with open source projects because it allows people to work independently without upfront coordination.

When we receive a pull request we’ll review the changes and if everything’s okay we’ll update our copy.

If a site is not in our set, you can create a file for it in the same way. See Creating files on GitHub.

How to write a site config file

See our help page for a brief guide. We hope to have some tutorials up soon.

Instapaper

When we introduced site patterns, we chose to adopt the same format used by Instapaper. This allows us to make use of the existing extraction rules contributed by Instapaper users.

Marco, Instapaper’s creator, graciously opened up the database of contributions to everyone:

And, recognizing that your efforts could be useful to a wide range of other tools and services, I’ll make the list of all of these site-specific configurations available to the public, free, with no strings attached.

Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at instapaper.com/bodytext/ (login required).

Testing site config files

Currently you will have to have a copy of Full-Text RSS to test changes to the site config files. In the future we will try to make this process easier.

New Term Extraction tool

Jan 18, 2013

We’re happy to announce a new version of our Term Extraction tool. You can try it now.

The older version was in Python, and many of our users struggled to get it set up. We’re not Python experts ourselves, so couldn’t do much to help. This new version, in line with the rest of our code, is now PHP. Setup should be very simple (upload what’s in the package to your server and you should be up and running).

Topia’s Term Extractor

To extract terms from a piece of content, we use Topia’s Term Extractor (thanks Stephan Richter, Russ Ferriday and the Zope Community), which describes the extraction process as follows:

This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.

Topia’s Term Extractor tries to produce results somewhere between a POS tagger like TreeTagger and Yahoo Keyword Extraction.

Since we are only interested in nouns, a very simple POS tagging algorithm can be deployed, which will provide good results most of the time. We then use some simple statistics and linguistics to produce a narrow but strong list of terms for the content.

The core component in our new version is a PHP port of Topia’s Term Extractor with some of Joseph Turian’s changes applied.

If you’re only interested in the PHP port, it’s free to download from our code repository.

Use as a web service: alternative to Yahoo’s Term Extraction

Our goal with this tool is to allow it to be run as a web service, similar to Yahoo’s Term Extractor, but one which you can control (no corporate APIs or restrictive Terms of Service).

In this version we’ve added support for multiple output formats (JSON, XML, HTML, plain text, serialised PHP) and a Yahoo compatibility mode. If you decide to switch over from Yahoo’s service, it’s as simple as updating the base URL for your requests.

For example, let’s say we want to extract terms from the following piece of text (the example used by Yahoo):

“Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.”

Here’s what the request might look like for Yahoo:

http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction?appid=YahooDemo&output=json&context=Italian%20sculptors%20and%20painters%20of%20the%20renaissance%20favored%20the%20Virgin%20Mary%20for%20inspiration.

To switch to Term Extraction from FiveFilters.org, you would simply change the base URL to point it to your own copy:

http://term-extraction.aws.af.cm/yahoo.php?appid=YahooDemo&output=json&context=Italian%20sculptors%20and%20painters%20of%20the%20renaissance%20favored%20the%20Virgin%20Mary%20for%20inspiration.

This would return the following response:

{"ResultSet":{"Result":["italian sculptors","virgin mary","painters","renaissance","inspiration"]}}

Note: in this case exactly the same terms are returned by both services, but Yahoo compatibility mode does not mean you’ll always get the same results as Yahoo’s service, only that the way the results are formatted should match Yahoo’s.
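
To illustrate the switch, here is a minimal Python sketch that builds both request URLs from the same parameters and reads the terms out of a response body. The function names are made up for this example. The base URLs, parameters and sample response are the ones shown above, and the sketch does not make any network request.

```python
import json
from urllib.parse import quote, urlencode

YAHOO_BASE = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"
OWN_BASE = "http://term-extraction.aws.af.cm/yahoo.php"  # point this at your own copy


def term_extraction_url(base_url, text, appid="YahooDemo", output="json"):
    """Build a Yahoo-style term extraction request URL for the given base URL."""
    query = urlencode(
        {"appid": appid, "output": output, "context": text},
        quote_via=quote,  # encode spaces as %20, as in the examples above
    )
    return base_url + "?" + query


def parse_terms(body):
    """Return the list of terms from a Yahoo-format JSON response body."""
    return json.loads(body)["ResultSet"]["Result"]


if __name__ == "__main__":
    text = ("Italian sculptors and painters of the renaissance "
            "favored the Virgin Mary for inspiration.")
    # Only the base URL differs between the two requests.
    print(term_extraction_url(YAHOO_BASE, text))
    print(term_extraction_url(OWN_BASE, text))
    sample = '{"ResultSet":{"Result":["italian sculptors","virgin mary"]}}'
    print(parse_terms(sample))
```

Because only the base URL changes, code that already talks to Yahoo's service should need nothing more than a different constant.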

Cloud ready

If you don’t have your own hosting, you can host this for free on AppFog. AppFog offer users free hosting with 2GB RAM. That’s more than enough to run Term Extraction for most users.

To install:

  1. Buy Term Extraction from FiveFilters.org
  2. Create a free account on AppFog
  3. Install the AppFog command-line client (af)
  4. Unzip the Term Extraction package and change into the term-extraction folder
  5. Type af login to log in to your AppFog account
  6. Type af push to upload the package
  7. Follow the prompts and you’re done
  8. Access the URL shown by the af program in your browser to get started

Push to Kindle for Opera

Dec 20, 2012

Opera users can now add a Push to Kindle button to the browser’s toolbar.

Although not technically an Opera extension, it works in almost exactly the same way as our Chrome and Firefox extensions. Unlike our bookmarklet, this can be added to the toolbar itself and will show up with an icon.

To install, simply drag the Opera link shown on the Push to Kindle extensions section to your toolbar.

Many thanks to Alexander Shchadilov for letting us know about this technique.

Push to Kindle delivery issues

Nov 18, 2012

Last updated: 16 December 2012

Amazon appears to have fixed the issue affecting their servers. We’re seeing no more bounces now.

Amazon is experiencing delivery issues which are affecting our Push to Kindle service. Many emails are bouncing with the following message:

The following message to <xxx@free.kindle.com> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 550-'5.1.1 <xxx@free.kindle.com>... User unknown'

Reporting-MTA: dns; smtp-border-fw-31001.sea31.amazon.com

Final-Recipient: rfc822;xxx@free.kindle.com
Action: failed
Status: 5.0.0 (permanent failure)
...
Diagnostic-Code: smtp; 5.1.0 - Unknown address error 550-'5.1.1 <xxx@free.kindle.com>... User unknown' (delivery attempts: 0)

The error says ‘user unknown’ but the user clearly exists in Amazon’s system. This problem appears to have started around November 16. Here’s a chart from our email service provider showing a sudden increase in bounces from Amazon’s servers:

The current workaround is to send to your @kindle.com address — read update #2 to make sure you do not get charged by Amazon.

We’ll update again here when we know more. Sorry for the inconvenience.

If you’d like to help get this fixed sooner, please contact Amazon’s Kindle support (UK, US) and tell them you’re experiencing emails bouncing when sent to your @free.kindle.com email.

Update #1

We’ve been able to reproduce this problem by manually sending an article as an attachment to one of our own Kindle accounts (using the @free.kindle.com address). Sending the same article again after receiving the bounce message appears to work, at least temporarily. Note: sending again via our Push to Kindle service will not always work as our email provider places a temporary block on addresses which bounce.

We’re looking to see if @kindle.com addresses are also affected. If they’re not, we may enable @kindle.com delivery for all.

Update #2

We’ve enabled @kindle.com sending for our web app (including browser extensions). This is now available for everyone, not just sustainers. Please try selecting @kindle.com if you have delivery trouble. There are a few differences you should be aware of:

  • If you own a 3G Kindle device, Amazon might charge you for delivery. You can disable this by setting your ‘Maximum Charge Limit’ to zero from the Manage your Kindle page (UK, US). This will force Wi-Fi delivery, which is free.
  • Unlike @free.kindle.com sending, you will not receive a confirmation email from Amazon when your article is delivered.
  • See our explanation of the difference between free.kindle.com and kindle.com for more information.

Update #3 (27 November 2012)

Amazon is still experiencing problems with deliveries to @free.kindle.com addresses.

If you haven’t already, we recommend you follow the advice in Update #2.

Here’s a more recent chart showing the bounces we’re continuing to receive from Amazon’s servers:

We’ve received some questions from our users about this. Here’s what we know so far:

  • Bounces are all from servers processing @free.kindle.com emails. We enabled @kindle.com sending in our last update, but we’ve seen no bounces for these. If you’ve had trouble receiving articles sent to your @kindle.com address, please let us know.
  • This is not a problem restricted to our service. As we mentioned in an earlier update, we were able to reproduce this problem by sending manually to the @free.kindle.com address (i.e. not through Push to Kindle). There is also a thread on the Amazon Kindle Help forums where a few other users have reported this problem.
  • If you use our Android app, we are working on an update to allow sending to @kindle.com addresses.

Update #4 (16 December 2012)

This issue looks like it has been resolved. We are seeing no more bounces from Amazon’s servers.

Full-Text RSS behind the scenes

Nov 18, 2012

We’ve just updated our intro help page for Full-Text RSS with two sequence diagrams showing how Full-Text RSS works with your feed reader.

If you use a feed reader, you probably subscribe to a number of web feeds you have some interest in. Your feed reader then periodically checks those feeds for new items and pulls them in for you. Some of those feeds will contain the full content of each article, allowing you to read the entire entry in your news reading application. Other feeds will contain partial content, with the expectation that you will visit the original site to read the full entry. Here’s a sequence diagram showing what your feed reader will typically do when you subscribe to feeds from two web sites, Website 1 and Website 2:

In this example, Website 1 returns a partial feed. Rather than subscribe to the feed from Website 1 directly in your feed reader, you can update it to route the request through Full-Text RSS. What happens when you subscribe to a feed in this way is shown in the sequence diagram below:

We’ll have another post up soon showing you how to update partial feeds in your news reading application.
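
Until then, the basic idea of routing a feed through Full-Text RSS can be sketched in Python. This is a hedged example: the makefulltextfeed.php endpoint name, the url parameter and the example.org host are assumptions that do not appear in this post, so check them against your own installation before relying on them.

```python
from urllib.parse import urlencode


def full_text_feed_url(base_url, feed_url):
    """Wrap a partial feed URL so the request goes through Full-Text RSS.

    base_url is the address of your Full-Text RSS endpoint (assumed here),
    and the 'url' parameter name is also an assumption.
    """
    return base_url + "?" + urlencode({"url": feed_url})


if __name__ == "__main__":
    # Hypothetical installation address and feed, for illustration only.
    base = "http://example.org/full-text-rss/makefulltextfeed.php"
    print(full_text_feed_url(base, "http://example.com/feed"))
```

You would then subscribe to the resulting URL in your feed reader instead of the original feed from Website 1.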
