what's the best way to "scrape" an HTML document with Xwiki

Niels Mayer

what's the best way to "scrape" an HTML document with Xwiki

Is there anything like the Xwiki-feed-plugin except that instead of fetching
a feed, it would fetch an HTML document via HTTP, returning a DOM structure
that can be scanned or filtered by API-calls, e.g.:

#set($fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com"))
#set($images = $fetchedDom.getImgList())
#set($media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".mv4", ".mp4"]))
#set($content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content']))

Since this would happen on the server, you'd probably need to "fake" being a
real browser (or just capture the user's browser configuration and pass it
via the call to the hypothetical "getDocumentDOM()") in order to capture an
accurate scraped representation of a modern site.
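
As a rough illustration of "faking" a browser on the server, here is a minimal Python sketch (not XWiki code). The names build_request, fetch_html and BROWSER_UA are hypothetical, and the User-Agent string is only a placeholder; a real setup might forward the visiting user's own header instead.

```python
from urllib.request import Request, urlopen

# Placeholder browser identity (an assumption for illustration only).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build an HTTP request that presents a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=BROWSER_UA, timeout=10):
    """Fetch the page and decode it with the charset the server declares."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch_html("http://nielsmayer.com") would return the page source as a string, ready to hand to a parser.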

The existing examples I've seen store an Xwiki document in the database
first. I was hoping there was an "in memory" option that would allow for the
document to be maintained in the app's context for long enough to process
the remaining stream of plugin calls such as "getDivListById()" or
"getAnchorHREFsByExtension()" and then appropriately dispose of the DOM when
it is no longer referenced, via garbage collection. Then again, compared to
the implementation headaches (retrieving a potentially large document into
memory incrementally, parsing it into a DOM incrementally, making that
available in the context, etc.), maybe I should just write the damn document
into the database, scrape it, and delete it.
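
To make the "in memory" idea concrete, here is a small Python sketch using only the standard library. It is not the hypothetical FetchPlugin; PageScraper and scrape are made-up names. It parses the page in one pass, keeps only the extracted lists, and leaves cleanup to ordinary garbage collection once the result is no longer referenced.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects image URLs, media links and matching <div> ids in one pass."""

    def __init__(self, media_exts=(".mp3", ".mp4"), div_ids=("content",)):
        super().__init__()
        self.media_exts = tuple(ext.lower() for ext in media_exts)
        self.div_ids = set(div_ids)
        self.images, self.media, self.divs = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(self.media_exts):
            self.media.append(attrs["href"])
        elif tag == "div" and attrs.get("id") in self.div_ids:
            # Only the id is recorded here, not the div's contents.
            self.divs.append(attrs["id"])


def scrape(html, **options):
    """Parse an HTML string in memory and return the populated scraper."""
    scraper = PageScraper(**options)
    scraper.feed(html)
    scraper.close()
    return scraper
```

For example, scrape(page_source, div_ids=("xwikicontent", "container", "content")) returns an object whose images, media and divs lists can be read and then simply dropped; nothing is written to a database.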

Since I would use Xwiki to store a JSON "scrape" of the document in the DB
(as an Xwiki doc), I could store it in XWiki.JavaScriptExtension[0] of the
retrieved document, and then just delete the wiki-contents after
scraping.... So actually, if anybody has any suggestions for "scraping" a
retrieved document that is stored as an Xwiki doc, please suggest those as
well! This seems like an area potentially fraught with peril that many people
have already dealt with, so I would appreciate advice.

Thanks,

Niels
http://nielsmayer.com
_______________________________________________
devs mailing list
[hidden email]
http://lists.xwiki.org/mailman/listinfo/devs
vmassol

Re: what's the best way to "scrape" an HTML document with Xwiki

Hi Niels,

You could easily call $xwiki.getURLContent(), which returns the
content at a URL.
Then you can use our XHTML parser to generate an XDOM and then do
whatever you want with it.

One small issue: the renderer is not available from xwiki content
right now. But if you're using Groovy it should be easy.

For large documents we could easily add a method to the Parser interface:
parse(Reader, Listener). All you'd need to do is implement Listener, in a
Groovy script for example, and you'd get called for each element in the page.
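
The listener idea can be sketched in Python with the standard library; this is only an analogy, not the XWiki Parser API, and the method proposed above does not exist yet. ListenerParser and this parse function are hypothetical names. The reader is consumed in small chunks, so the whole document never has to sit in memory, and the listener is called once per start tag.

```python
import io
from html.parser import HTMLParser


class ListenerParser(HTMLParser):
    """Forwards each start tag to a listener callback as it is parsed."""

    def __init__(self, listener):
        super().__init__()
        self.listener = listener

    def handle_starttag(self, tag, attrs):
        self.listener(tag, dict(attrs))


def parse(reader, listener, chunk_size=4096):
    """Read from a file-like object in chunks and notify the listener."""
    parser = ListenerParser(listener)
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        # feed() buffers incomplete tags until the next chunk arrives.
        parser.feed(chunk)
    parser.close()
```

A listener can be any callable taking a tag name and an attribute dict, for instance one that prints the src of every img element; any text reader such as io.StringIO or a decoded HTTP response body works as the source.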

Thanks
-Vincent

On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:

> [...]
Pascal Voitot

Re: what's the best way to "scrape" an HTML document with Xwiki

On Thu, Jun 18, 2009 at 9:04 PM, Vincent Massol <[hidden email]> wrote:

> [...]

I agree with Vincent... Groovy is the easiest solution...
In the past, I tried another "weird" solution: integrating a JavaScript
engine such as Rhino on the server side. Manipulating a DOM in JavaScript
was then quite natural, and I could use great APIs such as Prototype. It
worked quite well, though I'm not sure about the performance and memory
issues, and I found the idea amusing: JavaScript on the server side... It
might sound a bit "heretic" to say so, but there are some products on the
market that propose building websites with JavaScript on both the client
and the server side...



Niels Mayer

Re: what's the best way to "scrape" an HTML document with Xwiki

On Thu, Jun 18, 2009 at 11:50 PM, Pascal Voitot <[hidden email]> wrote:

> [...]


Belated thanks to Vincent and Pascal for their suggestions regarding "web
scraping" in Xwiki. It turned out jQuery() was the easiest solution for
what I needed, which is quick n' dirty. The suggestions to use Groovy are
appreciated: it is an implementation strategy I will need to look into in
the future. I just don't want a tiny sub-project turning into a project in
and of itself, lest I never finish ....

Regarding Pascal's comment  about the "heretic" notion of
server-side-javascript: Please note
http://freebaseapps.com/   example app: http://fmdb.freebaseapps.com/
"Introducing Acre: Freebase’s integrated app development and hosting
environment"

This all sounds very familiar, like Xwiki.org/Xwiki.com, combined with a
product version of Exhibit <http://simile-widgets.org/exhibit/>. Which is
exactly what I was alluding to here (
http://www.mail-archive.com/devs@.../msg09547.html ) as a very
powerful development and delivery environment, perhaps signalling a
direction-shift in web-development:

http://freebaseapps.com has lots of similarities to what Xwiki has provided
for years:

>    - A Browser-based JavaScript IDE: including syntax highlighting, file
>    organization and more
>    - Open Code for Open Data: all Acre files are stored in Freebase and
>    code sharing is encouraged in a variety of ways

Here's the part that looks exactly like Exhibit, right down to the syntax
(Exhibit uses ex:if=, Acre uses acre:if=). In fact, Acre source code, with
their templating language, ends up looking suspiciously close to my
Exhibit+Xwiki (Velocity) code, except they can't access all the rich Java
functionality with their template language like you can in Xwiki.

[image: an Acre script] <http://freebaseapps.com/wiki/index>

Here's where the "heresy" starts. Xwiki uses Java on the server-side.
Freebaseapps uses JavaScript:

>    - Server-side JavaScript: use the same language on the server that you
>    use in the client

This is basically a product version of Exhibit:

>    - Template Language: a built-in, simple yet powerful XHTML template
>    language

There's been a good amount of Exhibit/Freebase semantic-web work in the
Simile project. Again, looks like a productization of MIT semantic web work:

>    - Built-in support for Freebase: MQL query integration, plus helper
>    methods included for all Freebase APIs

And finally an application hosting solution. Similar to Xwiki SAS...

>    - Hosted on freebaseapps.com: no servers to maintain!

FYI, I  found out about this here (also relevant to the topic of
web-scraping in JavaScript)

> Freebase Hack Day II: Return of Hack Day
>
> You're invited to attend the Freebase Hack Day and Unconference on July
> 11th in San Francisco. This event is a great opportunity to learn about what
> Freebase is, find out about our developer platform, and chat to Freebase
> staff and experts. Some examples of what will be happening on the day:
>
>    - We'll be launching Acre 1.0 just a few days before Hack Day. Jason
>    Douglas will be showing off the features of our hosted app development
>    platform <http://freebaseapps.com/>, including the ability to share and
>    clone apps, connect to other APIs with our keystore and OAuth, and build
>    queries and templated web pages based on Freebase data more easily than ever
>    before. Acre's come a long way since our last Hack Day, so don't miss this.
>    (Read more about Acre <http://blog.freebase.com/category/developer/acre/>.)
>
>    - The MQL Boot Camp will be run this year by Bryan Culbertson. Learn
>    how to query against Freebase's structured data about almost 6 million
>    topics, and see the new features of our query editor, including
>    tab-completion for syntax and schema. (Read more about MQL
>    <http://blog.freebase.com/category/developers/mql/> and the query
>    editor <http://blog.freebase.com/2009/04/22/query-editor-20/>.)
>
>    - Learn how to use Freebase to enhance your website with structured
>    data, like the Wall Street Journal
>    <http://blog.freebase.com/2009/06/25/freebase-data-now-on-wsj-com/>,
>    or build entire apps and websites on Acre, like Tippify
>    <http://tippify.com/>.
>
>    - Hack on apps like our Games With A Purpose
>    <http://blog.freebase.com/tag/gwap/>, a TV program schedule mashup, and
>    more. If you have a project and you're looking for partners, technical
>    help, or ideas, bring it with you! We're also working on having a
>    handful of projects ready for people to hack on who haven't brought one
>    of their own.
>
>    - Find out about Freebase's part in the Linked Open Data
>    <http://linkeddata.org/> world, and how to use Semantic Web
>    <http://blog.freebase.com/category/semantic-web/> techniques and
>    tools to work with Freebase data.

Niels
http://nielsmayer.com