Class PDFTextStripperByArea


public class PDFTextStripperByArea extends PDFTextStripper
Extracts text from one or more named rectangular regions of a PDF page.
Author:
Ben Litchfield
  • Constructor Details

    • PDFTextStripperByArea

      public PDFTextStripperByArea() throws IOException
      Creates a new text stripper with no regions defined.
      Throws:
      IOException - If there is an error loading properties.
  • Method Details

    • setShouldSeparateByBeads

      public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
      This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
      Overrides:
      setShouldSeparateByBeads in class PDFTextStripper
      Parameters:
      aShouldSeparateByBeads - The new grouping of beads.
    • addRegion

      public void addRegion(String regionName, Rectangle2D rect)
      Add a new region to group text by.
      Parameters:
      regionName - The name of the region.
      rect - The rectangular area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is the top of the page), not PDF coordinates (y == 0 is the bottom of the page).
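      The coordinate convention described for rect means that a rectangle measured in PDF coordinates has to be flipped vertically before it is passed to addRegion. The following is a minimal sketch of that conversion using only the JDK. The class RegionCoordinates and the method toTopLeftOrigin are hypothetical names introduced for illustration, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0).

```java
import java.awt.geom.Rectangle2D;

class RegionCoordinates {

    /**
     * Converts a rectangle given in PDF coordinates (origin at the bottom-left,
     * y growing upward) to the top-left-origin form described for addRegion.
     * Assumption: the page is unrotated and its lower-left corner is at (0, 0).
     *
     * @param pdfX       left edge of the rectangle
     * @param pdfLowerY  bottom edge of the rectangle, measured from the page bottom
     * @param width      rectangle width
     * @param height     rectangle height
     * @param pageHeight height of the page in the same units
     */
    static Rectangle2D.Double toTopLeftOrigin(
            double pdfX, double pdfLowerY, double width, double height, double pageHeight) {
        // The top edge in PDF coordinates is pdfLowerY + height; its distance
        // from the top of the page is the y value in Java coordinates.
        double javaY = pageHeight - (pdfLowerY + height);
        return new Rectangle2D.Double(pdfX, javaY, width, height);
    }

    public static void main(String[] args) {
        // Example values: a 792-unit-tall page and a 200 x 50 rectangle
        // whose bottom edge sits 692 units above the page bottom.
        Rectangle2D.Double rect = toTopLeftOrigin(72, 692, 200, 50, 792);
        System.out.println(rect.getX() + " " + rect.getY() + " "
                + rect.getWidth() + " " + rect.getHeight()); // prints 72.0 50.0 200.0 50.0
    }
}
```

      The resulting rectangle can then be passed as the rect argument. Only the y value changes; x, width, and height are the same in both conventions.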
    • removeRegion

      public void removeRegion(String regionName)
      Remove a region from the set used to group text. If the region does not exist, this method does nothing.
      Parameters:
      regionName - The name of the region to delete.
    • getRegions

      public List<String> getRegions()
      Get the list of regions that have been set up.
      Returns:
      A list of the region names, as java.lang.String objects.
    • getTextForRegion

      public String getTextForRegion(String regionName)
      Get the text for the region. Call this only after extractRegions().
      Parameters:
      regionName - The name of the region to get the text from.
      Returns:
      The text that was identified in that region.
    • extractRegions

      public void extractRegions(PDPage page) throws IOException
      Process the page to extract the region text.
      Parameters:
      page - The page to extract the regions from.
      Throws:
      IOException - If there is an error while extracting text.
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Overrides:
      processTextPosition in class PDFTextStripper
      Parameters:
      text - The text to process.
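      To make the idea of grouping text by region concrete, here is a simplified, JDK-only model. It is a sketch and not the library's implementation: it assumes a glyph belongs to a region when its reported position lies inside that region's rectangle. The Glyph class and the group method are hypothetical names that do not exist in PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RegionGroupingSketch {

    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox type. */
    static final class Glyph {
        final double x;
        final double y;
        final String text;

        Glyph(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Appends each glyph to every region whose rectangle contains its position.
     * Assumption: membership is a plain point-in-rectangle test.
     */
    static Map<String, String> group(Map<String, Rectangle2D> regions, List<Glyph> glyphs) {
        // One text buffer per region, kept in insertion order.
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (String name : regions.keySet()) {
            buffers.put(name, new StringBuilder());
        }
        for (Glyph glyph : glyphs) {
            for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
                if (region.getValue().contains(glyph.x, glyph.y)) {
                    buffers.get(region.getKey()).append(glyph.text);
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> buffer : buffers.entrySet()) {
            result.put(buffer.getKey(), buffer.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("header", new Rectangle2D.Double(0, 0, 100, 20));
        regions.put("body", new Rectangle2D.Double(0, 20, 100, 80));

        List<Glyph> glyphs = Arrays.asList(
                new Glyph(10, 5, "H"),
                new Glyph(15, 5, "i"),
                new Glyph(10, 50, "B"),
                new Glyph(150, 50, "X")); // outside every region, so dropped

        System.out.println(group(regions, glyphs)); // prints {header=Hi, body=B}
    }
}
```

      In this model a glyph that falls inside two overlapping rectangles is added to both regions, and a glyph outside every rectangle is discarded. Rectangle2D.contains treats the left and top edges as inside and the right and bottom edges as outside, so a point on a shared border between two adjacent rectangles is counted only once.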
    • writePage

      protected void writePage() throws IOException
      This will print the processed page text to the output stream.
      Overrides:
      writePage in class PDFTextStripper
      Throws:
      IOException - If there is an error writing the text.
    • showGlyph

      protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
      Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
      Overrides:
      showGlyph in class PDFStreamEngine
      Parameters:
      textRenderingMatrix - the current text rendering matrix, Trm
      font - the current font
      code - internal PDF character code for the glyph
      unicode - the Unicode text for this glyph, or null if the PDF does not provide it
      displacement - the displacement (i.e. advance) of the glyph in text space
      Throws:
      IOException - if the glyph cannot be processed
    • computeFontHeight

      protected float computeFontHeight(PDFont font) throws IOException
      Compute the font height. Override this if you want to use your own calculations.
      Parameters:
      font - the font.
      Returns:
      the font height.
      Throws:
      IOException - if there is an error while getting the font bounding box.