|
|
|
Asiri Rathnayake
|
Hello Devs,
After few discussions I have revised the new officeimporter API to take into account the use of DocumentName instead of plain strings for representing document names. I'll repeat the details of the previous proposal with the new changes applied: Currently we have the following officeimporter API: <code> OfficeImporter::importStream(InputStream is, String documentFormat, String targetDocumentName, Map params):void OfficeImporter::importAttachment(String documentName, String attachmentName, Map params):String </code> Problems with this API: * Loosely typed (params, document names) * Both of the above methods perform almost the same task. * Customizing the import process is implemented in a hackish way. (not visisble on the API) The new API proposed looks like below: <code> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XHTMLOfficeDocument OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument xhtmlOfficeDocument):XDOMOfficeDocument OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName referenceDocument, boolean filterStyles):XDOMOfficeDocument OfficeImporter::buildPresentation(byte[] officeFileData):XDOMOfficeDocument OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument> </code> As you can see, these methods are more granular and the responsibilities are well defined. Customizing the import process can be done from the client code. For an example: 1. Make the initial import from office to XHTMLOfficeDocument - OfficeImporter::officeToXHTML() 2. Perform customizations on the XHTMLOfficeDocument - w3c DOM manipulations. 3. Import the XHTMLOfficeDocument into XDOMOfficeDocument - OfficeImporter::xhtmlToXDOM() 4. Perform customizations on the XDOMOfficeDocument (XDOM) - XDOM manipulations. 5. Split the XDOMOfficeDocument into multiple XDOMOfficeDocument instances - OfficeImporter::splitImport() 6. Perform customizations on these child XDOMOfficeDocument instances - XDOM manipulations. 7. Render the XDOMOfficeDocument instances & save them into wiki pages - XWiki rendering operations. I think this interface will make it easy to extend & maintain officeimporter functionality in the future. Along with this, I would also like to refactor the xwiki-refactoring module a bit to get rid of string based document names from it. This whole refactoring operation would take approximately one day to complete. And since this operation is not adding any new features, I think it can be committed on both trunk and 2.0 branch. Here's my +1 to all of above. Thanks. - Asiri _______________________________________________ devs mailing list [hidden email] http://lists.xwiki.org/mailman/listinfo/devs |
||||||||||||||||
|
Marius Dumitru Florea
|
Hi Asiri,
Asiri Rathnayake wrote: > Hello Devs, > > After few discussions I have revised the new officeimporter API to take into > account the use of DocumentName instead of plain strings for representing > document names. I'll repeat the details of the previous proposal with the > new changes applied: > > Currently we have the following officeimporter API: > > <code> > OfficeImporter::importStream(InputStream is, String documentFormat, String > targetDocumentName, Map params):void > OfficeImporter::importAttachment(String documentName, String attachmentName, > Map params):String > </code> > > Problems with this API: > > * Loosely typed (params, document names) > > * Both of the above methods perform almost the same task. > > * Customizing the import process is implemented in a hackish way. (not > visisble on the API) > > > The new API proposed looks like below: > > <code> > OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName > referenceDocument, boolean filterStyles):XHTMLOfficeDocument Can you explain why you replaced InputStream with byte[]? Arrays require contiguous memory locations and loading large arrays into memory is bad IMO. Thanks, Marius > OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument > xhtmlOfficeDocument):XDOMOfficeDocument > OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName > referenceDocument, boolean filterStyles):XDOMOfficeDocument > OfficeImporter::buildPresentation(byte[] officeFileData):XDOMOfficeDocument > OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] > headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName > baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument> > </code> > > As you can see, these methods are more granular and the responsibilities are > well defined. Customizing the import process can be done from the client > code. For an example: > > 1. Make the initial import from office to XHTMLOfficeDocument - > OfficeImporter::officeToXHTML() > > 2. Perform customizations on the XHTMLOfficeDocument - w3c DOM > manipulations. > > 3. Import the XHTMLOfficeDocument into XDOMOfficeDocument - > OfficeImporter::xhtmlToXDOM() > > 4. Perform customizations on the XDOMOfficeDocument (XDOM) - XDOM > manipulations. > > 5. Split the XDOMOfficeDocument into multiple XDOMOfficeDocument instances - > OfficeImporter::splitImport() > > 6. Perform customizations on these child XDOMOfficeDocument instances - XDOM > manipulations. > > 7. Render the XDOMOfficeDocument instances & save them into wiki pages - > XWiki rendering operations. > > I think this interface will make it easy to extend & maintain officeimporter > functionality in the future. > > Along with this, I would also like to refactor the xwiki-refactoring module > a bit to get rid of string based document names from it. > > > This whole refactoring operation would take approximately one day to > complete. And since this operation is not adding any new features, I think > it can be committed on both trunk and 2.0 branch. > > Here's my +1 to all of above. > > Thanks. > > - Asiri > _______________________________________________ > devs mailing list > [hidden email] > http://lists.xwiki.org/mailman/listinfo/devs devs mailing list [hidden email] http://lists.xwiki.org/mailman/listinfo/devs |
||||||||||||||||
|
Asiri Rathnayake
|
Hi,
> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName > > referenceDocument, boolean filterStyles):XHTMLOfficeDocument > > Can you explain why you replaced InputStream with byte[]? Arrays require > contiguous memory locations and loading large arrays into memory is bad > IMO. > Yes, you are correct. I still have some doubts about this: 1. OpenOffice requires the input to be on a disk file -- so we need to write the stream / byte [] into the disk. Having a stream instead of a byte[] is good in this case because we won't be loading the whole file into memory at once. 2. However OpenOffice will load the whole file into memory at once when performing the conversion. Still, having a stream is kind of better because we are not consuming any additional memory. 3. Currently file upload plugin has no method to get a stream of input. So it's returning a byte[]. Having this byte[] converted into a stream is kind of useless. 4. However, having an API based on streams is better because in future we might anyway upgrade the fileupload plugin to return a stream. 5. Even after the conversion I need to load the results into memory (byte[]) so that I can attach them into wiki pages. 6. But we might change the DAB api in future so that attachments can be saved in an streaming fashion (this might be hard). I might be wrong here, but I don't think having streams oriented API for officeimporter will bring any performance enhancement in near future. All it will add is some glue code to convert between streams / byte[] so that it can work with rest of XE infrastructure (fileupload plugin, DAB, attachment saving). If we base our decision solely on "having a streamable API is always wise", I'd also go with the streams based API. I'd like to stay away from this decision and let you all decide :) Thanks. - Asiri > > Thanks, > Marius > > > OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument > > xhtmlOfficeDocument):XDOMOfficeDocument > > OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName > > referenceDocument, boolean filterStyles):XDOMOfficeDocument > > OfficeImporter::buildPresentation(byte[] > officeFileData):XDOMOfficeDocument > > OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] > > headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName > > baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument> > > </code> > > > > As you can see, these methods are more granular and the responsibilities > are > > well defined. Customizing the import process can be done from the client > > code. For an example: > > > > 1. Make the initial import from office to XHTMLOfficeDocument - > > OfficeImporter::officeToXHTML() > > > > 2. Perform customizations on the XHTMLOfficeDocument - w3c DOM > > manipulations. > > > > 3. Import the XHTMLOfficeDocument into XDOMOfficeDocument - > > OfficeImporter::xhtmlToXDOM() > > > > 4. Perform customizations on the XDOMOfficeDocument (XDOM) - XDOM > > manipulations. > > > > 5. Split the XDOMOfficeDocument into multiple XDOMOfficeDocument > instances - > > OfficeImporter::splitImport() > > > > 6. Perform customizations on these child XDOMOfficeDocument instances - > XDOM > > manipulations. > > > > 7. Render the XDOMOfficeDocument instances & save them into wiki pages - > > XWiki rendering operations. > > > > I think this interface will make it easy to extend & maintain > officeimporter > > functionality in the future. > > > > Along with this, I would also like to refactor the xwiki-refactoring > module > > a bit to get rid of string based document names from it. > > > > > > This whole refactoring operation would take approximately one day to > > complete. And since this operation is not adding any new features, I > think > > it can be committed on both trunk and 2.0 branch. > > > > Here's my +1 to all of above. > > > > Thanks. > > > > - Asiri > > _______________________________________________ > > devs mailing list > > [hidden email] > > http://lists.xwiki.org/mailman/listinfo/devs > _______________________________________________ > devs mailing list > [hidden email] > http://lists.xwiki.org/mailman/listinfo/devs > devs mailing list [hidden email] http://lists.xwiki.org/mailman/listinfo/devs |
||||||||||||||||
|
tmortagne
|
On Mon, Oct 26, 2009 at 09:00, Asiri Rathnayake
<[hidden email]> wrote: > Hi, > >> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName >> > referenceDocument, boolean filterStyles):XHTMLOfficeDocument >> >> Can you explain why you replaced InputStream with byte[]? Arrays require >> contiguous memory locations and loading large arrays into memory is bad >> IMO. >> > > Yes, you are correct. I still have some doubts about this: > > 1. OpenOffice requires the input to be on a disk file -- so we need to write > the stream / byte [] into the disk. Having a stream instead of a byte[] is > good in this case because we won't be loading the whole file into memory at > once. > > 2. However OpenOffice will load the whole file into memory at once when > performing the conversion. Still, having a stream is kind of better because > we are not consuming any additional memory. > > 3. Currently file upload plugin has no method to get a stream of input. So > it's returning a byte[]. Having this byte[] converted into a stream is kind > of useless. > > 4. However, having an API based on streams is better because in future we > might anyway upgrade the fileupload plugin to return a stream. > > 5. Even after the conversion I need to load the results into memory (byte[]) > so that I can attach them into wiki pages. > > 6. But we might change the DAB api in future so that attachments can be > saved in an streaming fashion (this might be hard). > > I might be wrong here, but I don't think having streams oriented API for > officeimporter will bring any performance enhancement in near future. All it > will add is some glue code to convert between streams / byte[] so that it > can work with rest of XE infrastructure (fileupload plugin, DAB, attachment > saving). > > If we base our decision solely on "having a streamable API is always wise", +1 for "having a streamable API is always wise" :) > I'd also go with the streams based API. > > I'd like to stay away from this decision and let you all decide :) > > Thanks. > > - Asiri > > >> >> Thanks, >> Marius >> >> > OfficeImporter::xhtmlToXDOM(XHTMLOfficeDocument >> > xhtmlOfficeDocument):XDOMOfficeDocument >> > OfficeImporter::officeToXDOM(byte[] officeFileData, DocumentName >> > referenceDocument, boolean filterStyles):XDOMOfficeDocument >> > OfficeImporter::buildPresentation(byte[] >> officeFileData):XDOMOfficeDocument >> > OfficeImporter::splitImport(XDOMOfficeDocument xdomOfficeDocument, int[] >> > headingLevelsToSplit, NamingCriterion namingCriterion, DocumentName >> > baseDocumentName):Map<TargetPageDescriptor, XDOMOfficeDocument> >> > </code> >> > >> > As you can see, these methods are more granular and the responsibilities >> are >> > well defined. Customizing the import process can be done from the client >> > code. For an example: >> > >> > 1. Make the initial import from office to XHTMLOfficeDocument - >> > OfficeImporter::officeToXHTML() >> > >> > 2. Perform customizations on the XHTMLOfficeDocument - w3c DOM >> > manipulations. >> > >> > 3. Import the XHTMLOfficeDocument into XDOMOfficeDocument - >> > OfficeImporter::xhtmlToXDOM() >> > >> > 4. Perform customizations on the XDOMOfficeDocument (XDOM) - XDOM >> > manipulations. >> > >> > 5. Split the XDOMOfficeDocument into multiple XDOMOfficeDocument >> instances - >> > OfficeImporter::splitImport() >> > >> > 6. Perform customizations on these child XDOMOfficeDocument instances - >> XDOM >> > manipulations. >> > >> > 7. Render the XDOMOfficeDocument instances & save them into wiki pages - >> > XWiki rendering operations. >> > >> > I think this interface will make it easy to extend & maintain >> officeimporter >> > functionality in the future. >> > >> > Along with this, I would also like to refactor the xwiki-refactoring >> module >> > a bit to get rid of string based document names from it. >> > >> > >> > This whole refactoring operation would take approximately one day to >> > complete. And since this operation is not adding any new features, I >> think >> > it can be committed on both trunk and 2.0 branch. >> > >> > Here's my +1 to all of above. >> > >> > Thanks. >> > >> > - Asiri >> > _______________________________________________ >> > devs mailing list >> > [hidden email] >> > http://lists.xwiki.org/mailman/listinfo/devs >> _______________________________________________ >> devs mailing list >> [hidden email] >> http://lists.xwiki.org/mailman/listinfo/devs >> > _______________________________________________ > devs mailing list > [hidden email] > http://lists.xwiki.org/mailman/listinfo/devs > -- Thomas Mortagne _______________________________________________ devs mailing list [hidden email] http://lists.xwiki.org/mailman/listinfo/devs |
||||||||||||||||
|
Sergiu Dumitriu-2
|
In reply to this post
by Asiri Rathnayake
On 10/26/2009 09:00 AM, Asiri Rathnayake wrote:
> Hi, > >> OfficeImporter::officeToXHTML(byte[] officeFileData, DocumentName >>> referenceDocument, boolean filterStyles):XHTMLOfficeDocument >> >> Can you explain why you replaced InputStream with byte[]? Arrays require >> contiguous memory locations and loading large arrays into memory is bad >> IMO. >> > > Yes, you are correct. I still have some doubts about this: > > 1. OpenOffice requires the input to be on a disk file -- so we need to write > the stream / byte [] into the disk. Having a stream instead of a byte[] is > good in this case because we won't be loading the whole file into memory at > once. > > 2. However OpenOffice will load the whole file into memory at once when > performing the conversion. Still, having a stream is kind of better because > we are not consuming any additional memory. > > 3. Currently file upload plugin has no method to get a stream of input. So > it's returning a byte[]. Having this byte[] converted into a stream is kind > of useless. > > 4. However, having an API based on streams is better because in future we > might anyway upgrade the fileupload plugin to return a stream. > > 5. Even after the conversion I need to load the results into memory (byte[]) > so that I can attach them into wiki pages. > > 6. But we might change the DAB api in future so that attachments can be > saved in an streaming fashion (this might be hard). > > I might be wrong here, but I don't think having streams oriented API for > officeimporter will bring any performance enhancement in near future. All it > will add is some glue code to convert between streams / byte[] so that it > can work with rest of XE infrastructure (fileupload plugin, DAB, attachment > saving). > > If we base our decision solely on "having a streamable API is always wise", > I'd also go with the streams based API. > > I'd like to stay away from this decision and let you all decide :) > +1 for streams. -- Sergiu Dumitriu http://purl.org/net/sergiu/ _______________________________________________ devs mailing list [hidden email] http://lists.xwiki.org/mailman/listinfo/devs |
||||||||||||||||
| Free Embeddable Forum Powered by Nabble | Help |