I'm using Spring Boot 3.4.5 with Spring AI 1.0.0. On a particular PDF document the ParagraphPdfDocumentReader throws:
java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index out of bounds: -1
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.getTextBetweenParagraphs(ParagraphPdfDocumentReader.java:248)
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.toDocument(ParagraphPdfDocumentReader.java:161)
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.get(ParagraphPdfDocumentReader.java:147)
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.get(ParagraphPdfDocumentReader.java:50)
at org.springframework.ai.document.DocumentReader.read(DocumentReader.java:25)
at org.pdfsam.spec.agent.service.DefaultPdfLoader.loadPdf(DefaultPdfLoader.java:59)
at org.pdfsam.spec.agent.service.DefaultPdfLoader.loadPdfWithOutlineFrom(DefaultPdfLoader.java:54)
at org.pdfsam.spec.agent.service.DefaultLoadService.loadPDFFilesWithOutline(DefaultLoadService.java:82)
at org.pdfsam.spec.agent.service.DefaultLoadService.loadUnprocessed(DefaultLoadService.java:61)
at org.pdfsam.spec.agent.ETLApplication.lambda$commandLineRunner$0(ETLApplication.java:43)
at org.springframework.boot.SpringApplication.lambda$callRunner$5(SpringApplication.java:789)
at org.springframework.util.function.ThrowingConsumer$1.acceptWithException(ThrowingConsumer.java:82)
at org.springframework.util.function.ThrowingConsumer.accept(ThrowingConsumer.java:60)
at org.springframework.util.function.ThrowingConsumer$1.accept(ThrowingConsumer.java:86)
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:797)
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:788)
at org.springframework.boot.SpringApplication.lambda$callRunners$3(SpringApplication.java:773)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:186)
at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:357)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:571)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:153)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:176)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:636)
at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:773)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:325)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1362)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1351)
at org.pdfsam.spec.agent.ETLApplication.main(ETLApplication.java:34)
Caused by: java.lang.IndexOutOfBoundsException: Index out of bounds: -1
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:299)
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:263)
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1220)
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.getTextBetweenParagraphs(ParagraphPdfDocumentReader.java:196)
... 29 common frames omitted
The cause is an outline item that has no page destination (neither a Dest nor an A entry in its dictionary). The reader resolves its start page to -1, and the item is printed as:
Bla [-1,17], children = 0, pos = 0
I cannot share the PDF file but I guess I can create one if needed.
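To illustrate the kind of guard that would avoid the exception, here is a minimal sketch. It does not use Spring AI's or PDFBox's actual types: `OutlinePageGuard`, `OutlineEntry`, `hasValidPages` and `dropUnresolved` are hypothetical names, and it assumes page values are zero-based indexes where -1 means "no destination resolved", which matches the `[-1,17]` shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring AI or PDFBox.
final class OutlinePageGuard {

    // Hypothetical stand-in for an outline entry with a page range.
    // Assumption: pages are zero-based indexes, -1 means "unresolved".
    record OutlineEntry(String title, int startPage, int endPage) {
    }

    private OutlinePageGuard() {
    }

    // True when the start page can safely be used as a page index
    // and the range is not inverted.
    static boolean hasValidPages(OutlineEntry entry, int pageCount) {
        return entry.startPage() >= 0
                && entry.startPage() < pageCount
                && entry.endPage() >= entry.startPage();
    }

    // Returns only the entries whose page range is usable, so that a later
    // page lookup is never called with a negative index.
    static List<OutlineEntry> dropUnresolved(List<OutlineEntry> entries, int pageCount) {
        List<OutlineEntry> usable = new ArrayList<>();
        for (OutlineEntry entry : entries) {
            if (hasValidPages(entry, pageCount)) {
                usable.add(entry);
            }
        }
        return usable;
    }
}
```

With the item above, `dropUnresolved` would skip `Bla [-1,17]` and keep the entries that point at a real page. Skipping is only one option; falling back to the previous item's page might be preferable, since it would keep the text under that heading.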
Comment From: WOONBE
I'd like to contribute a fix for this issue!