chaining map reduce in hovercraft

Chris Anderson-3

chaining map reduce in hovercraft

I finally got around to writing my map reduce copier. It's still
basic, but what do you think?

I want to put it into trunk as an HTTP call, like:

POST /_snapshot_view

with JSON

{"src":"/srcdb/_design/app/_view/reduce_count", "group_level":2,
"target":"/targetdb"}

Chainable map reduce seems to be one of the most popular requests on
the survey we took, so hopefully this will make the heavy-data crew
happy.

There is an implementation here:

http://github.com/jchris/hovercraft/commit/34b44527b660a740858cc71aa2c8326747465e31#L0R290

What this does is take the results you'd get from querying your reduce
view with group=true, and copy them to a new database. Basically you
end up with a database full of docs that look like:

{
"key":[2009,2,14],
"value": 511
}

Since they are docs sitting in another CouchDB, you can use more
ordinary CouchDB Map Reduce views on that database to do things like
sort by value, so you can for instance sort tags by popularity, or
days by user activity, etc.
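
As a rough illustration of the sort-by-value step, the sketch below mimics
in plain Python what a map view over the snapshot database would do: emit
each doc's value as the view key, so the rows come back ordered by value.
The doc contents and the helper names (map_by_value, view_rows) are made up
for the example. In CouchDB itself the map function would be JavaScript in
a design document, presumably something like
function(doc) { emit(doc.value, doc.key); }.

```python
# Hypothetical snapshot docs, shaped like the example above.
snapshot_docs = [
    {"key": [2009, 2, 14], "value": 511},
    {"key": [2009, 2, 15], "value": 87},
    {"key": [2009, 2, 16], "value": 1204},
]


def map_by_value(doc):
    """Stand-in for a CouchDB map function: emit (value, key)."""
    return (doc["value"], doc["key"])


def view_rows(docs, descending=False):
    """Emulate querying the view: rows ordered by the emitted key."""
    rows = [map_by_value(doc) for doc in docs]
    return sorted(rows, key=lambda row: row[0], reverse=descending)


# Busiest days first, like querying the view with descending=true.
busiest_first = view_rows(snapshot_docs, descending=True)
for value, key in busiest_first:
    print(value, key)
```

This only shows the ordering idea. A real view would also keep the index
on disk and update it as snapshot docs change.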

Chris


--
Chris Anderson
http://jchrisa.net
http://couch.io
Seledkin Vyacheslav

Re: chaining map reduce in hovercraft

Will the snapshot db be updated incrementally?
Zachary Zolton

Re: chaining map reduce in hovercraft

So, Chris, it sounds like you're saying that POSTing to that URL will
place the entire results of querying the view with group=true into
another database. Sounds great!

Will it work with 0.9? Would you suggest automating this using _changes?

Cheers,
Zach

On Fri, Jun 5, 2009 at 6:17 AM, Viacheslav Seledkin
<[hidden email]> wrote:

> The process of updating of snapshot db will be incremental?
>
Chris Anderson-3

Re: chaining map reduce in hovercraft

On Fri, Jun 5, 2009 at 7:13 AM, Zachary Zolton <[hidden email]> wrote:
> So, Chris, it sounds like you're saying that POSTing to that URL will
> place the entire results of querying the view with group=true into
> another database. Sounds great!
>
> Will it work with 0.9? Would you suggest automating this using _changes?
>

I doubt this will get backported to the 0.9.x branch.

However, this is possible with 0.9 if you do it in a client. There are
examples in my CouchRest client of running a Ruby function over the
unique keys in a map view, but the pattern of just dumping the results
of a group reduce query into another DB is simple and effective.

What I'm adding is simply a shortcut so that people can more
effectively play around with chaining map reduce queries. For now the
snapshot dbs will not update incrementally. However, they are just
documents so you can do in-place transformations on them (if you
want).
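
A minimal sketch of that client-side pattern, using only the Python
standard library: query the reduce view with group=true (or a group_level),
turn each row into a {"key": ..., "value": ...} doc, and POST the batch to
the target database's _bulk_docs endpoint. The function names, the host and
port in the usage note, and the assumption that the target database already
exists are all illustrative, not part of Hovercraft or CouchRest.

```python
import json
import urllib.request


def rows_to_docs(rows):
    """Turn group-reduce rows into snapshot docs: {"key": ..., "value": ...}."""
    return [{"key": row["key"], "value": row["value"]} for row in rows]


def fetch_group_rows(view_url, group_level=None):
    """GET the reduce view with group=true (or a group_level); return its rows."""
    query = "group=true" if group_level is None else "group_level=%d" % group_level
    with urllib.request.urlopen(view_url + "?" + query) as resp:
        return json.load(resp)["rows"]


def bulk_save(target_db_url, docs):
    """POST the docs to the target database's _bulk_docs endpoint."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    req = urllib.request.Request(
        target_db_url + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def snapshot_view(view_url, target_db_url, group_level=None):
    """Copy the grouped reduce rows of one view into another database."""
    rows = fetch_group_rows(view_url, group_level)
    return bulk_save(target_db_url, rows_to_docs(rows))
```

Assuming a local node on port 5984, a call might look like
snapshot_view("http://localhost:5984/srcdb/_design/app/_view/reduce_count",
"http://localhost:5984/targetdb", group_level=2). Because the docs carry no
_id, the server assigns one to each, so running it twice writes a second
copy of every row rather than updating in place.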

Actually, I'm having second thoughts about putting this into
CouchDB. It's still a worthwhile technique, but I think we should
encourage you to use HTTP tools to run it. Here's why:

So, on a single node, this would be all well and good - you'd be able
to get a sorted list of tags by popularity, by running a simple
map-by-group-reduce-value view on the snapshot database.

On a clustered setup, like couchdb-lounge provides, you'd end up with
problems, as each snapshot db would only reflect reductions run
locally (on the single shard). This is because the Erlang API used by
Hovercraft is not a multi-node API. Eventually we could give CouchDB
an internal Erlang proxy - but for now, multi-node clusters must be
built on HTTP.

So, since these Hovercraft chain snapshots are built against a single
node, the fully merged sort-by-value map query across the cluster
could have incorrect ordering.

To guarantee correct ordering of tags by popularity in a clustered
deployment, you'd have to run the global reduce function not against
a single local node but against the entire cluster, via something
like couchdb-lounge's Twisted Python rereducing proxy.
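
To make the ordering problem concrete, here is a small simulation in plain
Python. The two shards and their tag counts are invented for the example.
It compares snapshotting each shard's local reductions and then merging the
sort-by-value rows against rereducing across shards first and sorting
afterwards.

```python
from collections import Counter

# Hypothetical per-shard group-reduce results: tag -> count.
shard_a = {"couchdb": 5, "erlang": 4}
shard_b = {"erlang": 3, "python": 6}


def local_snapshot_rows(*shards):
    """Each shard snapshots only its own reductions, so the merged
    sort-by-value view sees one row per (shard, tag) pair."""
    rows = [(count, tag) for shard in shards for tag, count in shard.items()]
    return sorted(rows, reverse=True)


def global_rows(*shards):
    """Rereduce across shards first (sum the counts per tag), then sort."""
    totals = Counter()
    for shard in shards:
        totals.update(shard)
    return sorted(((count, tag) for tag, count in totals.items()), reverse=True)


local = local_snapshot_rows(shard_a, shard_b)
merged = global_rows(shard_a, shard_b)
print(local)   # "erlang" appears twice, with partial counts 4 and 3
print(merged)  # "erlang" totals 7 and ranks first
```

In this made-up data the per-shard snapshots rank "python" first, while the
cluster-wide totals put "erlang" on top.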

Ergo, a group-reduce chaining library is better off not written via
Hovercraft, because it should use the HTTP API. Anyone have a Python
version of this?

Performance freaks, don't worry: in this application of HTTP there are
just a handful of long-running connections, and you should be able to
get disk IO bound even with the HTTP overhead.

Chris


--
Chris Anderson
http://jchrisa.net
http://couch.io