[help] Are there document manipulation tools that can provide an approximate size of components (text included)?

Red1C3@lemmy.world · edit-2 10 months ago

unmagical@lemmy.ml · 10 months ago

A docx is just a renamed zip archive containing XML data. You should be able to unzip it and use an XML parser to access that info directly. There are likely tools to do this in any relevant language. You can also find the official spec online with some more info.

Unfortunately, I can’t get into much more detail than that as my company actively develops similar tools and I’ve worked on their document renderers not too long ago.

No clue on the odt stuff. I worked on the MS fidelity part.

take6056@feddit.nl · 10 months ago

I would look into a library that does manipulation of odt (or docx). Code whatever algorithm you need to do the restructuring. Now you're left with an in-memory representation of the document, from which you can hopefully figure out how many pages it spans, or which you can save to a temporary file.

It all depends, really, on how feature-rich the odt libraries are and/or how deep you want to dive into the spec.

I feel like this is an XY problem. Is there an underlying issue you're trying to resolve?

brakenium@lemm.ee · 10 months ago

This is very different from docx or odt, but maybe it's worth looking into converting Markdown or LaTeX to PDF with something like pandoc. Maybe that, or some other more open and less complex format, might help with this?

JakenVeina@lemm.ee · edit-2 10 months ago

Ultimately, no, not really. These formats are built to be “render-agnostic”, and there’s really no way to pre-calculate aspects of what the render will be without actually running the document through a rendering engine. That is doable, in theory, without having to send the render output to an actual screen or printer, but the follow-up problem is that not all renderers are created equal; i.e., an engine for rendering a docx that you grab from NuGet or somewhere else is not guaranteed to produce exactly the same output as Microsoft Word.

If you need accuracy in predicting the rendered size of various things, you really need to run the documents through the same renderer that will be used to actually print/draw the documents for the user. If this is Microsoft Office, you can look into Office Interop (COM automation), which lets your program make programmatic calls into the actual Office programs installed on the system. There ought to be a way to kick off rendering from there.
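As a rough sketch of what that could look like from Python, the snippet below drives an installed copy of Word through COM and asks it for the page count. It assumes Windows, a local Microsoft Word installation, and the third-party `pywin32` package; the function name `count_pages_with_word` is made up for illustration, and the path should be absolute because Word resolves relative paths against its own working directory.

```python
# wdStatisticPages from Word's WdStatistic enumeration.
WD_STATISTIC_PAGES = 2


def count_pages_with_word(path):
    """Ask an installed Microsoft Word to paginate `path` and return the page count."""
    # Imported lazily: pywin32 is only available on Windows (assumption: it is installed).
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path, ReadOnly=True)
        try:
            # Word lays the document out itself, so this is its own page count.
            return doc.ComputeStatistics(WD_STATISTIC_PAGES)
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

A call would look like `count_pages_with_word(r"C:\docs\report.docx")`, where the path is a placeholder. Starting Word is slow, so for many documents it is probably worth keeping one Word instance open and reusing it rather than launching it per file.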