Under review

PDF issues: memory use and UTF encoding

Frédéric Sarter 6 years ago updated by Chip Andre 4 years ago 3

Hello,

and first of all so many thanks for a wonderful piece of software! I'm running Ubooquity as a systemd service on a Raspberry Pi with OSMC and it is beautiful - the most elegant way to access & enjoy the large collection of ebooks I was sitting on! 

I see on the forum that there have been longstanding issues with Ubooquity and PDFs, and I thought I might add my grain of salt (and a few questions...)

1) It seems memory leaks are still an issue with larger PDFs, and Java OOM errors do occur - monitoring my swap space I could sometimes see it grow to gigabytes' worth of data before the whole system froze completely (the RAM limit I set for Java was apparently not taken into account when it came to scanning PDFs). I'm pretty sure this is a pdfbox issue, not an Ubooquity one - so the ability to parse PDFs with external software (mupdf or poppler) that Tom mentioned as a possibility for future versions seems the way to go! :)

In the meantime I manage to avoid most OOM errors by setting up a large swap space, temporarily disabling other services on the machine and setting the GPU share of the RAM to the minimum each time I want to launch a library scan with Ubooquity - a bit cumbersome but it does the trick!
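
For anyone wanting to check whether the heap limit passed to Java is actually in effect, here is a minimal sketch. The class name HeapCheck is made up for illustration and is not part of Ubooquity; it only reports what the running JVM sees. Note that -Xmx bounds the Java heap only, so native memory used by the JVM can still push the system into swap.

```java
// Hypothetical helper, not part of Ubooquity: reports the heap limits of the running JVM.
class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx, or the JVM's own default when -Xmx is absent.
        System.out.println("Max heap (MiB):          " + toMiB(rt.maxMemory()));
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("Committed heap (MiB):    " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the committed heap.
        System.out.println("Free in committed (MiB): " + toMiB(rt.freeMemory()));
    }
}
```

Compile it with "javac HeapCheck.java" and run it with the same memory flag used for Ubooquity, for example "java -Xmx256m HeapCheck" (256m is only an example value). If the reported maximum does not match the flag, the option is probably not reaching the java command.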

2) My second PDF-related issue was a bit more tricky. At first Ubooquity would not scan files with accented characters - being French this was really a problem, although solving it was as simple as setting the system locale to fr_FR.UTF-8... *But* once this was done there was a fair number of PDFs that Ubooquity suddenly could no longer scan, with the following error showing up in logs: "java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font"

After fiddling with various parameters, I found out that flagging the java command I use to start Ubooquity with "-Dfile.encoding=UTF-16" does the trick: Ubooquity now scans all my PDF files flawlessly, provided they are not corrupt or malformed.

*But* - there must always be a "but" :( - when I start Ubooquity this way it doesn't handle CBR files anymore.

Not a biggie really - I use a simple bash script to convert all my CBR files to CBZ... Voilà!  :)

Hence my questions:

- Is there another way to get Ubooquity to scan all my files, irrespective of their encoding?

- Would there be a way to use the "java -Dfile.encoding=UTF-16" flag ONLY for PDF parsing and rendering?

- Or alternatively, would there be a way to flag whatever utility is used to decompress and parse CBR files with another encoding parameter so that Ubooquity can parse RAR/CBR files again?

No worries, batch converting CBRs to CBZ is an easy enough workaround... But I'm still curious to know whether there would be a less dirty/more elegant/simpler solution to the PDF parsing issues... :)


Hi Frédéric,

1) I still plan to add a way to "plug" external PDF renderers into Ubooquity, since I still haven't found a free Java library that is as efficient as tools like MuPDF or Poppler. But this feature is not at the top of my list yet (although it's not that far from the top), so it won't be implemented very soon.

2) I don't have a complete understanding of the encoding issue you have encountered (as you can gather from the different threads about encoding on this forum), but what I'm sure of is that the encoding setting impacts the entire Java process (meaning all parts of Ubooquity), so there is no way to use it only for PDF related tasks.

The problem with encoding issues is that they are hard to reproduce on my side as they depend on your OS, your OS settings, the way you launch your JVM (the Java runtime) and the way your book files were written.
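
Since the result depends on how the JVM is launched, it can help to print what the JVM itself reports when started with the same flags and environment as Ubooquity. The following is a minimal sketch; the class name EncodingReport is hypothetical, and "sun.jnu.encoding" is an internal property that may be missing (printed as null) on some JVMs.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical diagnostic, not part of Ubooquity: prints the process-wide encoding settings.
class EncodingReport {
    /** Builds a short report of the encoding-related settings of this JVM. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        // Value set by -Dfile.encoding=... or derived from the OS locale.
        sb.append("file.encoding    = ").append(System.getProperty("file.encoding")).append('\n');
        // Internal property used for file names on many JVMs; may be null.
        sb.append("sun.jnu.encoding = ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        // Charset the JVM uses when code does not specify one explicitly.
        sb.append("default charset  = ").append(Charset.defaultCharset().name()).append('\n');
        sb.append("default locale   = ").append(Locale.getDefault()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

These values are shared by every library loaded in the process, which is consistent with the flag affecting PDF and CBR handling at the same time. Running the sketch once with and once without "-Dfile.encoding=UTF-16" shows what actually changes on a given system.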

Thanks for your reply! 

As I said, setting the encoding to UTF-16 solves all my PDF issues for now, and negatively affects only CBR files - converting these to CBZ is no problem really. So while I would sure welcome such a "plug", I can live without it indefinitely, too... :)

As for the PDF memory leaks, I don't scan my library every day, so again no problem to work around the issue if and when needed...

After some fine-tuning of my system settings, Ubooquity runs very smoothly, and I am in love with it! Many many thanks again for your work! 


I just started using Ubooquity and it's still working its way through my collection.  Looking at the logs it appears that most (if not all) PDFs fail to be added.  Here's an example from the logs:

20191222 09:42:59 [Scanner thread] ERROR com.ubooquity.data.feeder.a - Failed to insert Z:\Vault\Books\Comics & Manga\Legend of Zelda\The Legend of Zelda - Majora's Mask & A Link to the Past [Eng TPB].pdf
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Unknown Source) ~[na:1.8.0_231]
   at java.io.ByteArrayOutputStream.grow(Unknown Source) ~[na:1.8.0_231]
   at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source) ~[na:1.8.0_231]
   at java.io.ByteArrayOutputStream.write(Unknown Source) ~[na:1.8.0_231]
   at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:68) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:50) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:449) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:785) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1000) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:956) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:904) ~[pdfbox-2.0.6.jar.6729965043262619269.tmp:2.0.6]
   at com.ubooquity.fileformat.pdf.b.a(SourceFile:34) ~[Ubooquity.jar:2.1.2]
   at com.ubooquity.b.d.a(SourceFile:62) ~[Ubooquity.jar:2.1.2]
   at com.ubooquity.data.feeder.a.b(SourceFile:531) ~[Ubooquity.jar:2.1.2]
   at com.ubooquity.data.feeder.a.b(SourceFile:468) ~[Ubooquity.jar:2.1.2]
   at com.ubooquity.data.feeder.a.b(SourceFile:112) ~[Ubooquity.jar:2.1.2]
   at com.ubooquity.data.feeder.a$$Lambda$5/16085064.run(Unknown Source) ~[na:na]
   at java.lang.Thread.run(Unknown Source) ~[na:1.8.0_231]

Just wanted to share the PDF OOM errors I'm seeing. I doubt it will help, but it can't hurt.
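
The top frames of the trace show the error being raised while a ByteArrayOutputStream grows its internal array (Arrays.copyOf). As a rough illustration of why buffering a whole stream this way is heap-hungry, here is a small sketch using only the JDK; the class GrowthDemo and its methods are made up for this example and say nothing about how pdfbox or Ubooquity are written internally.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo: counts how often a ByteArrayOutputStream reallocates its buffer.
class GrowthDemo {
    /** ByteArrayOutputStream that exposes the length of its backing array. */
    static final class Probe extends ByteArrayOutputStream {
        int capacity() {
            return buf.length; // 'buf' is a protected field of ByteArrayOutputStream
        }
    }

    /** Writes totalBytes in chunks and returns how many times the backing array changed size. */
    static int countReallocations(int totalBytes, int chunkSize) {
        Probe out = new Probe();
        byte[] chunk = new byte[chunkSize];
        int reallocations = 0;
        int lastCapacity = out.capacity();
        int written = 0;
        while (written < totalBytes) {
            int n = Math.min(chunkSize, totalBytes - written);
            out.write(chunk, 0, n);
            written += n;
            if (out.capacity() != lastCapacity) {
                // The old array was copied into a new, larger one.
                reallocations++;
                lastCapacity = out.capacity();
            }
        }
        return reallocations;
    }

    public static void main(String[] args) {
        int total = 8 * 1024 * 1024; // 8 MiB, an arbitrary example size
        System.out.println("Reallocations: " + countReallocations(total, 4096));
    }
}
```

During each reallocation the old and the new array exist on the heap at the same time, so the peak heap needed is noticeably larger than the data being buffered. That may be part of why a large stream inside a PDF can exhaust a small heap.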