Full-Text RSS 3.3

The new version of Full-Text RSS, our tool to extract article content from web pages and transform partial web feeds into full-text feeds, is now available to try and buy. A few of the main changes are described below.

New endpoint for simpler JSON results

If you’re a developer, it’s now even easier to use Full-Text RSS for content extraction. We’ve added a new endpoint called extract.php. It returns results in a much simpler JSON representation. Although we’ve supported JSON output for a while, it has always been a JSON representation of our regular RSS output, which can be quite cumbersome to navigate if you have no need for the other RSS elements.

To give you an example, here’s the JSON produced using the new endpoint (the input URL was a Chomsky article):

{
    "title": "De-Americanizing the World",
    "excerpt": "During the latest episode of the Washington farce that has astonish…",
    "date": null,
    "author": "Noam Chomsky",
    "language": "en",
    "url": "http://chomsky.info/articles/20131105.htm",
    "effective_url": "http://chomsky.info/articles/20131105.htm",
    "content": "<p>During the latest episode of the Washington farce that has aston…"
}

Note: For brevity, the output above is truncated.
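
As a rough sketch of how a client might use the new endpoint, the Python below builds a GET request URL for extract.php and reads a response shaped like the sample above. The host name fulltextrss.example.org is a placeholder rather than a real installation, the helper name is made up for illustration, and the response string is an abbreviated copy of the example output.

```python
import json
from urllib.parse import urlencode

# Placeholder location of a Full-Text RSS install (an assumption; use your own).
BASE = "http://fulltextrss.example.org/"

def extract_url(article_url, **params):
    """Build a GET URL for extract.php; keyword args become request parameters."""
    query = {"url": article_url, **params}
    return BASE + "extract.php?" + urlencode(query)

# Abbreviated response, shaped like the sample output above.
sample = '''{
    "title": "De-Americanizing the World",
    "date": null,
    "author": "Noam Chomsky",
    "language": "en",
    "url": "http://chomsky.info/articles/20131105.htm"
}'''

request_url = extract_url("http://chomsky.info/articles/20131105.htm")
article = json.loads(sample)  # JSON null becomes Python None

print(request_url)
print(article["title"], "by", article["author"])
```

Because the result is a flat JSON object, each field can be read directly by name, with no RSS structure to walk through first.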

Input HTML

Another feature of this new endpoint is the ability to submit your own HTML in the request, rather than have Full-Text RSS fetch it for you from the URL. We’ve had requests from users who already have large collections of HTML documents and simply want to use Full-Text RSS for article extraction, not retrieval. That’s now possible using the new inputhtml parameter.¹
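
A minimal sketch of submitting your own HTML might look like the following. It only prepares a form-encoded POST request and does not send it. The endpoint host is a placeholder, the function name is hypothetical, and example.org stands in for a real source URL.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (an assumption; point this at your own install).
ENDPOINT = "http://fulltextrss.example.org/extract.php"

def build_inputhtml_request(html, source_url):
    """Prepare (but do not send) a POST request carrying our own HTML."""
    # urlencode() percent-encodes text as UTF-8, the encoding inputhtml expects.
    body = urlencode({"url": source_url, "inputhtml": html}).encode("ascii")
    return Request(ENDPOINT, data=body, method="POST")

html = "<html><body><article><p>Caf\u00e9 culture in Paris.</p></article></body></html>"
req = build_inputhtml_request(html, "example.org")

print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.loads() the response body.
```

POST is used here because HTML documents can easily exceed the length limits that apply to GET query strings.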

New HTML5 parser

We’ve replaced HTML5Lib with HTML5-PHP. The old one had problems parsing a lot of pages. If you relied on HTML5Lib in the last version, we’ll automatically use HTML5-PHP in this version. But you should note that HTML5-PHP requires at least PHP 5.3. So if you’re still running Full-Text RSS on PHP 5.2, the new parser won’t be available (we’ll use libxml for everything).

It’s also now possible to request HTML5-PHP parsing using a new request parameter: &parser=html5php.

Schema.org articleBody extraction

When we’re figuring out where the article lies, we now also look for HTML elements marked with Schema.org’s articleBody property.

Proxy support

Full-Text RSS now supports routing requests through proxy servers. There are a number of reasons why you might want to do this:

On some company networks you might have to use a proxy server to be able to reach external sites.
While we can cache results produced by Full-Text RSS, we do not yet cache HTTP responses we receive from servers. If this is something you need, you can set up a caching proxy and route requests through that.
Bypassing IP blocks. This might sound contentious, but today there are sites which pre-emptively block IP ranges belonging to popular web hosts (Linode, AWS, etc.) to avoid potential abuse. So even if your own application is not behaving abusively, you might still find that requests to certain sites are blocked simply based on where you’ve chosen to host Full-Text RSS. If this is something that affects you, you might want to consider paying for private proxies and routing your requests through those proxies.

You can specify one or more proxy servers in the config file. You can then tell Full-Text RSS to randomly pick a proxy in each request, or specify a proxy server via a request parameter: &proxy=my-proxy.
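
To illustrate the request side, here is a small sketch that adds the proxy parameter to an extract.php request. The endpoint host is a placeholder, the function name is made up, and my-proxy is assumed to be a name you have given to a proxy in the config file.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption; use your own install).
ENDPOINT = "http://fulltextrss.example.org/extract.php"

def extract_via_proxy(article_url, proxy):
    """proxy: 0 = direct connection, 1 = config default, or a proxy name from the config."""
    return ENDPOINT + "?" + urlencode({"url": article_url, "proxy": proxy})

# Print one request URL for each kind of proxy value.
for choice in (0, 1, "my-proxy"):
    print(extract_via_proxy("chomsky.info/articles/20131105.htm", choice))
```

Whether the parameter is honoured still depends on the config file, so treat this as a request rather than a guarantee.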

Mashape API listing

We’ve now made the service available on Mashape for developers. You’ll find API documentation there, sample code in a number of programming languages, and an easy way to test the service. There’s more information in our previous blog post.

Request parameters for article extraction (new endpoint)

You can pass the following parameters to extract.php in a GET or POST HTTP request. Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter	Value	Description
url	string (URL)	This is the only required parameter. It should be the URL to a standard HTML page. You can omit the ‘http://’ prefix if you like.
inputhtml	string (HTML)	If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded. And you will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).
content	`0`, `1` (default)	If set to 0, the extracted content will not be included in the output.
links	`preserve` (default), `footnotes`, `remove`	Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.
xss	`0`, `1` (default)	Use this to enable/disable XSS filtering. It is enabled by default, but if your application/framework/CMS already filters HTML for XSS vulnerabilities, you can disable XSS filtering here. If enabled, we’ll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: when enabled this will remove certain elements you may want to preserve, such as iframes.
lang	`0`, `1` (default), `2`, `3`	Language detection. If you’d like Full-Text RSS to find the language of the articles it processes, you can use one of the following values: `0` ignore language; `1` use article metadata, e.g. the HTML lang attribute (default value); `2` as above, but guess the language if it’s not specified; `3` always guess the language, whether it’s specified or not.
debug	[no value], `rawhtml`, `parsedhtml`	If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems. If the parameter value is `rawhtml`, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects. If the parameter value is `parsedhtml`, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (`rawhtml`) output. If your extraction rules are not picking out any elements, this will likely help identify the problem. Note: Full-Text RSS will stop execution after the HTML output if either `rawhtml` or `parsedhtml` is passed. Otherwise it will continue showing debug output until the end.
parser	`html5php`, `libxml`	The default parser is libxml as it’s the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It’s slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.
proxy	`0`, `1`, string (proxy name)	This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: `0` to disable proxy use (uses a direct connection), `1` for the default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (it has to match the name given to the proxy in the config file).
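
If you build requests programmatically, it can help to check values against the table before sending them. The sketch below does that on the client side. The value sets are copied from the table above, while the function name and the decision to raise ValueError are our own illustrative choices, not part of Full-Text RSS.

```python
from urllib.parse import urlencode

# Accepted values copied from the table above. Parameters not listed here
# (url, inputhtml, proxy) take free-form strings and are passed through.
ALLOWED = {
    "content": {"0", "1"},
    "links": {"preserve", "footnotes", "remove"},
    "xss": {"0", "1"},
    "lang": {"0", "1", "2", "3"},
    "debug": {"", "rawhtml", "parsedhtml"},
    "parser": {"html5php", "libxml"},
}

def extract_query(url, **params):
    """Return a query string for extract.php, rejecting values the table does not list."""
    for name, value in params.items():
        if name in ALLOWED and str(value) not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not a documented value")
    return urlencode({"url": url, **params})

print(extract_query("example.org", links="footnotes", lang=2, parser="html5php"))
```

A check like this only catches typos early. The configuration file on the server still decides if and how each parameter is applied.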

Request parameters for feed conversion

You can pass the following parameters to makefulltextfeed.php in a GET HTTP request. Most of these parameters have default values suitable for news enthusiasts who simply want to subscribe to a full-text feed in their news reading application. If that’s what you’re doing, you can safely ignore the details here. Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

Parameter	Value	Description
url	string (URL)	This is the only required parameter. It should be the URL to a partial feed or a standard HTML page. You can omit the ‘http://’ prefix if you like.
format	`rss` (default), `json`	The default Full-Text RSS output is RSS. The only other valid output format is JSON. To get JSON output, pass format=json in the querystring. Exclude it from the URL (or set it to ‘rss’) if you’d like RSS.
summary	`0` (default), `1`	If set to 1, an excerpt will be included for each item in the output.
content	`0`, `1` (default)	If set to 0, the extracted content will not be included in the output.
links	`preserve` (default), `footnotes`, `remove`	Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.
exc	`0` (default), `1`	If Full-Text RSS fails to extract the article body, the generated feed item will include a message saying extraction failed, followed by the original item description (if present in the original feed). You can ask Full-Text RSS to remove such items from the generated feed completely by passing 1 in this parameter.
html	`0` (default), `1`	Treat input source as HTML (or parse-as-html-first mode). To enable, pass html=1 in the querystring. If enabled, Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed. Note: If excluded, or set to 0, Full-Text RSS first tries to parse the server’s response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html-first mode, Full-Text RSS will identify itself as a browser from the very first request.
xss	`0` (default), `1`	Use this to enable XSS filtering. We have not enabled this by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it’s good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMSs which display feed content – the content should be treated like any other user-submitted content. If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side – although there’s client-side XSS filtering available too. If enabled, we’ll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.
callback	string	This is for JSONP use. If you’re requesting JSON output, you can also specify a callback function (a client-side JavaScript function) to receive the Full-Text RSS JSON output.
lang	`0`, `1` (default), `2`, `3`	Language detection. If you’d like Full-Text RSS to find the language of the articles it processes, you can use one of the following values: `0` ignore language; `1` use article metadata (e.g. the HTML lang attribute) or feed metadata (default value); `2` as above, but guess the language if it’s not specified; `3` always guess the language, whether it’s specified or not. If language detection is enabled and a match is found, the language code will be returned in the <dc:language> element inside the <item> element.
debug	[no value], `rawhtml`, `parsedhtml`	If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems. If the parameter value is `rawhtml`, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects. If the parameter value is `parsedhtml`, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (`rawhtml`) output. If your extraction rules are not picking out any elements, this will likely help identify the problem. Note: Full-Text RSS will stop execution after the HTML output if either `rawhtml` or `parsedhtml` is passed. Otherwise it will continue showing debug output until the end.
parser	`html5php`, `libxml`	The default parser is libxml as it’s the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It’s slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter.
proxy	`0`, `1`, string (proxy name)	This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: `0` to disable proxy use (uses a direct connection), `1` for the default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (it has to match the name given to the proxy in the config file).

Feed-only parameters — These parameters only apply to web feeds. They have no effect when the input URL points to a web page.

Parameter	Value	Description
use_extracted_title	[no value]	By default, if the input URL points to a feed, item titles in the generated feed will not be changed – we assume item titles in feeds are not truncated. If you’d like them to be replaced with titles Full-Text RSS extracts, use this parameter in the request (the value does not matter). To enable/disable this for all feeds, see the config file – specifically `$options->favour_feed_titles`.
max	number	The maximum number of feed items to process. (The default and upper limit will be found in the configuration file.)
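
For feed conversion, a request URL can be assembled in much the same way. This is a sketch only: the host and the feed address example.org/feed.xml are placeholders, and the helper name is hypothetical. It asks for JSON output, caps the number of items, and passes use_extracted_title with an empty value, since the value does not matter.

```python
from urllib.parse import urlencode

# Placeholder location of a Full-Text RSS install (an assumption; use your own).
BASE = "http://fulltextrss.example.org/"

def feed_url(feed, **params):
    """Build a GET URL for makefulltextfeed.php from a feed URL and request parameters."""
    return BASE + "makefulltextfeed.php?" + urlencode({"url": feed, **params})

# JSON output, at most 5 items, item titles replaced with extracted ones.
full_text_feed = feed_url(
    "example.org/feed.xml", format="json", max=5, use_extracted_title=""
)
print(full_text_feed)
```

Leaving out format (or setting it to rss) gives a URL you can subscribe to directly in a news reader.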

Full changelog

Content extractor now looks for Schema.org articleBody elements
New endpoint extract.php for developers looking for simpler JSON results (no RSS as input/output)
New endpoint extract.php accepts POST requests and HTML as input (inputhtml request parameter)
Proxy support added (proxy servers can now be added to the config file, see $options->proxy_servers, ->proxy and ->allow_proxy_override)
New HTML5 parser: HTML5Lib has been replaced by HTML5-PHP (the old one had too many problems)
New config option: cache time ($options->cache_time)
New config option: enable/disable single-page retrieval ($options->singlepage)
New config option: allow HTML parser override through querystring ($options->allow_parser_override)
New request parameter: parser – use it to force new HTML5 parser to be used, &parser=html5php (it will be slower)
Expanded debug request parameter: &debug=rawhtml (shows original response headers and body), &debug=parsedhtml (shows response body after parsing)
APC stats page now expects APCu (older version of APC still supported, but stats within admin area won’t be viewable)
Auto update of site-specific extraction rules fixed
Content security HTTP headers now used for the feed preview
Request parameters and response examples now listed in a table on the index page (new Request Parameters tab)
Compatibility test file updated to show if HTML5-PHP parser is supported (PHP 5.3 dependency), and to test for HHVM (not yet supported)
Config option removed: $options->registration_key
Preserve TTL element in RSS 2.0 feeds
Other minor fixes/improvements

Available to buy

Full-Text RSS 3.3 is now available to buy. If you’re an existing customer, please wait for an email from us with an upgrade link.

¹ When submitting HTML content directly in the request, make sure it’s UTF-8 encoded (we won’t try to detect the encoding for you). You will also still need to supply a URL in the url parameter, as that will determine how we extract the article (if we have site-specific rules associated with the URL). If you don’t have a URL, you can either submit the base URL where the document came from (e.g. chomsky.info), or simply make up a URL (e.g. example.org).