Niels Mayer
Is there anything like the Xwiki-feed-plugin, except that instead of fetching a feed it would fetch an HTML document via HTTP and return a DOM structure that can be scanned or filtered by API calls? For example:

    $fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
    $images = $fetchedDom.getImgList()
    $media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".mv4", ".mp4"])
    $content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])

Since this would happen on the server, you'd probably need to "fake" being a real browser (or just capture the user's browser configuration and pass it via the call to the hypothetical "getDocumentDOM()") in order to capture an accurate scraped representation of a modern site.

The existing examples I've seen store an Xwiki document in the database first. I was hoping there was an "in memory" option that would keep the document in the app's context long enough to process the remaining stream of plugin calls, such as "getDivListById()" or "getAnchorHREFsByExtension()", and then dispose of the DOM via garbage collection once it is no longer referenced.

Then again, compared to the implementation headaches (retrieving a potentially large document into memory incrementally, parsing it into a DOM incrementally, making that available in the context, etc.), maybe I should just write the damn document into the database, scrape it, and delete it.

Since I would use Xwiki to store a JSON "scrape" of the document in the DB (as an Xwiki doc), I could store it in XWiki.JavaScriptExtension[0] of the retrieved document, and then just delete the wiki contents after scraping.... So actually, if anybody has any suggestions for "scraping" a retrieved document that is stored as an Xwiki doc, please suggest those as well! This seems like an area potentially fraught with peril that many people have already dealt with, so I would appreciate advice.
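To make the wished-for API a little more concrete, here is a minimal in-memory sketch. It is written in Python with only the standard library, purely as an illustration; it is not XWiki code, and `FetchedDom`, its method names, `build_request` and `SAMPLE_HTML` are all hypothetical names that mirror the calls imagined above. The "DOM" here is only three flat lists rather than a real tree, and `get_div_list_by_id` returns the matching ids rather than the div contents.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import Request


class FetchedDom(HTMLParser):
    """Hypothetical stand-in for the imagined $fetchedDom object."""

    def __init__(self, html_text):
        super().__init__()
        self.img_srcs = []   # src attribute of every <img>
        self.hrefs = []      # href attribute of every <a>
        self.div_ids = []    # id attribute of every <div> that has one
        self.feed(html_text)
        self.close()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        elif tag == "div" and attrs.get("id"):
            self.div_ids.append(attrs["id"])

    def get_img_list(self):
        return list(self.img_srcs)

    def get_anchor_hrefs_by_extension(self, extensions):
        # Compare against the URL path only, so "?query" parts are ignored.
        wanted = tuple(ext.lower() for ext in extensions)
        return [h for h in self.hrefs
                if urlparse(h).path.lower().endswith(wanted)]

    def get_div_list_by_id(self, ids):
        # Simplification: reports which of the requested ids are present.
        wanted = set(ids)
        return [d for d in self.div_ids if d in wanted]


def build_request(url, user_agent):
    """Build (but do not send) a request that presents a browser User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


SAMPLE_HTML = (
    '<html><body><div id="content"><img src="/logo.png">'
    '<a href="/audio/track.mp3?x=1">song</a>'
    '<a href="/about.html">about</a></div>'
    '<div id="sidebar"></div></body></html>'
)

dom = FetchedDom(SAMPLE_HTML)
print(dom.get_img_list())
print(dom.get_anchor_hrefs_by_extension([".mp3", ".mp4"]))
print(dom.get_div_list_by_id(["xwikicontent", "container", "content"]))
```

Nothing is stored anywhere: the object lives only as long as something references it, which is the "in memory" behaviour asked about. Passing the result of `build_request` to `urllib.request.urlopen` would perform the actual fetch with the chosen User-Agent.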
Thanks,

Niels
http://nielsmayer.com
_______________________________________________
devs mailing list
[hidden email]
http://lists.xwiki.org/mailman/listinfo/devs
vmassol
Hi Niels,
You could easily call $xwiki.getExternalURL(), which returns the content at a URL. Then you can use our XHTML parser to generate an XDOM and do whatever you want with it.

One small issue: the renderer is not available in the xwiki content right now. But if you're using Groovy it should be easy.

For large documents we can easily add a method to the Parser interface: parser(Reader, Listener). All you'd need to do is implement Listener, in a Groovy script for example, and you'd get called for each element in the page.

Thanks
-Vincent

On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
> [...]
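The listener approach suggested for large documents can be sketched roughly as follows. This is an illustration in Python's standard library, not the XWiki Parser API: `ElementListener`, `StreamingScanner` and `scan` are hypothetical names, and the real Listener interface in XWiki rendering will look different. The point is only that the page is consumed chunk by chunk and the listener is called once per element, so no full DOM is ever held in memory.

```python
from html.parser import HTMLParser


class ElementListener:
    """Hypothetical listener: gets one callback per start tag."""

    def __init__(self):
        self.tag_counts = {}
        self.hrefs = []

    def on_element(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


class StreamingScanner(HTMLParser):
    """Forwards each start tag to the listener as soon as it is parsed."""

    def __init__(self, listener):
        super().__init__()
        self.listener = listener

    def handle_starttag(self, tag, attrs):
        self.listener.on_element(tag, dict(attrs))


def scan(chunks, listener):
    """Feed text chunks incrementally; tags split across chunks are buffered."""
    scanner = StreamingScanner(listener)
    for chunk in chunks:
        scanner.feed(chunk)
    scanner.close()
    return listener


# Chunks deliberately split in the middle of tags, as a network read might.
CHUNKS = [
    "<html><bo",
    'dy><a hr',
    'ef="x.mp3">a</a><img src',
    '="i.png"><a href="y.html">b</a></body></html>',
]

result = scan(CHUNKS, ElementListener())
print(result.tag_counts)
print(result.hrefs)
```

In practice the chunks would come from reading the HTTP response a few kilobytes at a time, for example in a loop over `response.read(8192)` decoded to text.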
Pascal Voitot
On Thu, Jun 18, 2009 at 9:04 PM, Vincent Massol <[hidden email]> wrote:
> Hi Niels,
>
> You could easily call $xwiki.getExternalURL() which returns the
> content at a URL.
> Then you can use our XHTML parser to generate a XDOM and then do
> whatever you want with it.
>
> Only little issue: the renderer is not available in the xwiki content
> right now. But if you're doing groovy it should be easy.
>
> For large document we can add a method easily in Parser interface:
> parser(Reader, Listener). All you'd need to do is implement Listener a
> groovy script for ex and you'd get called for each element in the page.
>
> Thanks
> -Vincent

I agree with Vincent... Groovy is the easiest solution...

In the past, I tried another "weird" solution: integrating a JavaScript engine such as Rhino on the server side. Manipulating a DOM in JavaScript was then quite natural, and I could use great APIs such as Prototype. It worked quite well, though I'm not sure about the performance and memory issues. Still, I found the idea funny: JavaScript on the server side... It might seem a bit "heretic" to say so, but there are some products on the market that propose building websites with JavaScript on both the client and the server side...

> On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
> > [...]
Niels Mayer
On Thu, Jun 18, 2009 at 11:50 PM, Pascal Voitot <[hidden email]> wrote:

> I agree with Vincent... Groovy is the easiest solution...
> In the past, I tried another "weird" solution consisting in integrating a
> JavaScript rendering engine on the serverside such as rhino... then
> manipulating a DOM in Javascript was quite natural and I could use great
> APIs such as prototype... It worked quite well but I'm not sure about the
> performance and memory issues but I found this idea funny: Javascript on
> serverside... This might seem a bit "heretic" to say that but there are
> some products on the market proposing to build websites with javascript on
> client and server side...

Belated thanks to Vincent and Pascal for their suggestions regarding "web scraping" in Xwiki. It turned out that jQuery() was the easiest solution for what I needed, which is quick n' dirty. The suggestions to use Groovy are appreciated: it is an implementation strategy I will need to look into in the future. I just don't want a tiny sub-project turning into a project in and of itself, lest I never finish...

Regarding Pascal's comment about the "heretic" notion of server-side JavaScript, please note http://freebaseapps.com/ (example app: http://fmdb.freebaseapps.com/):

"Introducing Acre: Freebase’s integrated app development and hosting environment"

This all sounds very familiar, like Xwiki.org/Xwiki.com combined with a product version of Exhibit <http://simile-widgets.org/exhibit/>.
Which is exactly what I was alluding to here ( http://www.mail-archive.com/devs@.../msg09547.html ) as a very powerful development & delivery environment, perhaps signalling a shift in direction for web development.

http://freebaseapps.com has lots of similarities to what Xwiki has provided for years:

> - A Browser-based JavaScript IDE: including syntax highlighting, file
>   organization and more
> - Open Code for Open Data: all Acre files are stored in Freebase and
>   code sharing is encouraged in a variety of ways

Here's the part that looks exactly like Exhibit, right down to the syntax (Exhibit uses ex:if=, Acre uses acre:if=). In fact, Acre source code, with their templating language, ends up looking suspiciously close to my Exhibit+Xwiki (Velocity) code, except that their template language can't access all the rich Java functionality that you can reach in Xwiki.

[image: an Acre script] <http://freebaseapps.com/wiki/index>

Here's where the "heresy" starts. Xwiki uses Java on the server side. Freebaseapps uses JavaScript:

> - Server-side JavaScript: use the same language on the server that you
>   use in the client

This is basically a product version of Exhibit:

> - Template Language: a built-in, simple yet powerful XHTML template
>   language

There's been a good amount of Exhibit/Freebase semantic-web work in the Simile project. Again, this looks like a productization of MIT semantic-web work:

> - Built-in support for Freebase: MQL query integration, plus helper
>   methods included for all Freebase APIs

And finally, an application hosting solution, similar to Xwiki SAS...

> - Hosted on freebaseapps.com: no servers to maintain!

FYI, I found out about this here (also relevant to the topic of web scraping in JavaScript):

> Freebase Hack Day II: Return of Hack Day
>
> You're invited to attend the Freebase Hack Day and Unconference on July
> 11th in San Francisco. This event is a great opportunity to learn about what
> Freebase is, find out about our developer platform, and chat to Freebase
> staff and experts. Some examples of what will be happening on the day:
>
> - We'll be launching Acre 1.0 just a few days before Hack Day. Jason
>   Douglas will be showing off the features of our hosted app development
>   platform <http://freebaseapps.com/>, including the ability to share and
>   clone apps, connect to other APIs with our keystore and OAuth, and build
>   queries and templated web pages based on Freebase data more easily than
>   ever before. Acre's come a long way since our last Hack Day, so don't
>   miss this. (Read more about Acre
>   <http://blog.freebase.com/category/developer/acre/>.)
> - The MQL Boot Camp will be run this year by Bryan Culbertson. Learn
>   how to query against Freebase's structured data about almost 6 million
>   topics, and see the new features of our query editor, including
>   tab-completion for syntax and schema. (Read more about MQL
>   <http://blog.freebase.com/category/developers/mql/> and the query
>   editor <http://blog.freebase.com/2009/04/22/query-editor-20/>.)
> - Learn how to use Freebase to enhance your website with structured
>   data, like the Wall Street Journal
>   <http://blog.freebase.com/2009/06/25/freebase-data-now-on-wsj-com/>,
>   or build entire apps and websites on Acre, like Tippify
>   <http://tippify.com/>.
> - Hack on apps like our Games With A Purpose
>   <http://blog.freebase.com/tag/gwap/>, a TV program schedule mashup, and
>   more. If you have a project and you're looking for partners, technical
>   help, or ideas, bring it with you! We're also working on having a
>   handful of projects ready for people to hack on who haven't brought one
>   of their own.
> - Find out about Freebase's part in the Linked Open Data
>   <http://linkeddata.org/> world, and how to use Semantic Web
>   <http://blog.freebase.com/category/semantic-web/> techniques and
>   tools to work with Freebase data.
Niels