Ever wanted to extract images from web pages? Now you can with one simple API call.
Repustate’s clean-html call has been one of our most popular API calls since Day 1. Its performance was quite good from the get-go, so we hadn’t touched it much, but that changed recently: it now extracts images as well as text from any web page.
A customer asked us to add the ability to extract the main image from a web page, similar to how Instapaper or Mobile Safari’s “Reader” feature works.
By default, a call to clean-html now returns an image attribute containing the URL of the article’s main image, if one exists.
Let’s take a look at an example. You’ll need a Repustate API key to try this on your own, but getting one is free and easy. Let’s take this URL:
http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html
and pass it to our API call.
curl -d "url=http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html" http://api.repustate.com/v2/YOUR_API_KEY/clean-html.json
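For anyone calling the API from a script rather than the command line, here is a minimal Python sketch of the same request. The endpoint and the `url` form field come from the curl command above; the function name `build_clean_html_request` is made up for illustration, and the sketch only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL taken from the curl example above.
API_BASE = "http://api.repustate.com/v2"


def build_clean_html_request(api_key, page_url):
    """Build (but do not send) the POST request that the curl command makes.

    Hypothetical helper for illustration; it is not part of any Repustate
    client library.
    """
    endpoint = f"{API_BASE}/{api_key}/clean-html.json"
    # Form-encode the body, mirroring curl's -d "url=..." argument.
    body = urlencode({"url": page_url}).encode("ascii")
    return Request(endpoint, data=body, method="POST")


request = build_clean_html_request(
    "YOUR_API_KEY",
    "http://www.thestar.com/news/insight/2013/02/15/"
    "challenging_the_vatican_progressive_catholics_say_reform_must_begin_"
    "with_church_governance.html",
)
```

Passing `request` to `urllib.request.urlopen` would send it. Unlike the bare curl command, `urlencode` percent-encodes the page URL, which keeps the form body valid if the address contains characters such as `&` or `=`.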
And here’s the response:
{"status": "OK", "text": "To progressive Canadian Catholic ... (shortened for this example)", "image": "http://www.thestar.com/content/dam/thestar/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance/vatican_lightning.jpg.size.xxlarge.promo.jpg", "url": "http://www.thestar.com/news/insight/2013/02/15/challenging_the_vatican_progressive_catholics_say_reform_must_begin_with_church_governance.html"}
As you can see, the JSON response contains an ‘image’ key holding the URL of the main image for that article.
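Pulling the image out of that response takes only a few lines. The sketch below is one way to do it in Python, under two assumptions that the example above does not confirm: that any status other than "OK" should be treated as a failure, and that a page without a main image comes back with the ‘image’ key either missing or empty. The function name `extract_main_image` is hypothetical.

```python
import json


def extract_main_image(response_body):
    """Return the main image URL from a clean-html JSON response, or None.

    Assumption: a status other than "OK" signals a failed call.
    Assumption: when no main image exists, the 'image' key is absent or empty.
    """
    data = json.loads(response_body)
    if data.get("status") != "OK":
        raise ValueError(f"clean-html call failed: {data.get('status')!r}")
    # Normalise a missing key, null, or empty string to None.
    return data.get("image") or None


sample = '{"status": "OK", "text": "Some article text", "image": "http://example.com/main.jpg"}'
main_image = extract_main_image(sample)
```

Returning `None` rather than raising when the image is absent lets calling code fall back to a text-only view of the article.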
With this API call, you can create your own version of Instapaper or Readability for your own purposes.