How to convert a scanned PDF to text and keep the original layout?

To extract content from a scanned PDF file, we sometimes need to convert the scanned PDF to text. To make it easy to find the corresponding words in the output text file, we also want the conversion to keep the original layout. VeryDOC Raster to Text OCR Converter Command Line has this function. With this software, you can also convert an encrypted PDF document to text by supplying the user password or owner password. For more information, please check the software's homepage. In the following part, I will show you how to use this software.

Step 1. Download Raster to Text OCR Converter Command Line

  • There are two versions of this software on the website: Server License and Developer License. Under one Server License, you can use the corresponding SOFTWARE on exactly one server computer that offers service to clients. Under one Developer License, you can integrate the corresponding SOFTWARE into your own software and redistribute it royalty-free. Under either license, if the SOFTWARE contains source code, you have the right to modify and reuse that code.
  • When the download finishes, you will have a zip file. Extract it to a folder, find the executable file in that folder, and then call it from an MS-DOS command prompt window.
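
The extraction step can also be scripted. The following is only a minimal Python sketch: the function name extract_and_find is hypothetical, and the executable name pdf2txtocr.exe is taken from the usage line in Step 2. The name of the downloaded zip file is not given here, so it is passed in as a parameter.

```python
import zipfile
from pathlib import Path


def extract_and_find(zip_path, dest_dir, exe_name="pdf2txtocr.exe"):
    """Extract the downloaded zip and return the path of the executable.

    Returns None when no file named `exe_name` is found, so the caller
    can decide how to report a missing executable.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)  # creates dest_dir if it does not exist

    # The executable may sit in a subfolder, so search recursively.
    matches = sorted(Path(dest_dir).rglob(exe_name))
    return matches[0] if matches else None
```

The returned path can then be used in place of the bare pdf2txtocr.exe name when calling the tool.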

Step 2. Convert scanned PDF to text and keep the original layout

  • When you use this software, please refer to the usage and examples.
  • Here is the usage for your reference:   pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting scanned PDF to text, please refer to the following command line templates.
    pdf2txtocr.exe -ocr -lang eng -layout C:\in.pdf C:\out.txt
    This command converts an English scanned PDF file to text and keeps the original layout.
    pdf2txtocr.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
    This command converts a scanned PDF to text and sets the color depth used when each PDF page is rendered to image data, here 1 bit.
    pdf2txtocr.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
    pdf2txtocr.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
    These two commands work the same way as the previous one, with a color depth of 8 bits and 24 bits.
    pdf2txtocr.exe -ocr -lang deu C:\in.pdf C:\out.txt
    This command converts a German scanned PDF to text.
    pdf2txtocr.exe -text "PageText %PageNumber% of %PageCount%" C:\in.pdf C:\out.txt
    This command converts the PDF to text and adds the page number and total page count at the end of each page of the output text file.
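
If you need to convert many files, the command line templates above can be assembled by a script. The following is a minimal Python sketch, not part of the product: the helper names build_command and convert_folder are hypothetical, and it assumes pdf2txtocr.exe can be found on the PATH or in the current folder. It only uses the options shown in the templates above.

```python
import subprocess
from pathlib import Path

# Assumption: the executable is on PATH or in the working directory.
EXE = "pdf2txtocr.exe"


def build_command(pdf_path, txt_path, lang=None, layout=False,
                  bitcount=None, exe=EXE):
    """Return the argument list for one OCR conversion."""
    if bitcount is not None and bitcount not in (1, 8, 24):
        raise ValueError("bitcount must be 1, 8, or 24")
    args = [exe, "-ocr"]
    if lang:
        args += ["-lang", lang]
    if layout:
        args.append("-layout")
    if bitcount is not None:
        args += ["-bitcount", str(bitcount)]
    # Input and output files come last, as in the usage line.
    args += [str(pdf_path), str(txt_path)]
    return args


def convert_folder(folder, **options):
    """Convert every PDF in `folder`, writing a .txt file next to each one."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = build_command(pdf, pdf.with_suffix(".txt"), **options)
        subprocess.run(cmd, check=True)  # raises if the tool reports an error
```

For example, convert_folder(r"C:\scans", lang="eng", layout=True) would run the first template once for each PDF in C:\scans. Passing the arguments as a list avoids quoting problems with paths that contain spaces.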

Now let us check the related parameters.
-layout             : maintain the original physical layout
-bitcount <int>     : set the color depth used when rendering a PDF page to image data; it can be set to 1, 8, or 24, and the default is 8
-text <string>      : add additional text at the end of each text page; this parameter supports the following variables:
    %PageNumber%: current page number
    %PageCount% : total page count of the PDF file
-ocr                : enable the OCR function for scanned PDF files
  -lang <string>      : choose the language for the OCR engine
  -ocrmode <int>      : set the OCR mode
    -ocrmode 0: output to a text file
    -ocrmode 1: OCR the PDF pages and insert a new text layer under the original PDF pages
    -ocrmode 2: output to a plain-text-based PDF file
    -ocrmode 3: output to an OCRed PDF file (BW) with a hidden text layer
    -ocrmode 4: output to an OCRed PDF file (Color) with a hidden text layer
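
To make the -text variables more concrete, here is a small Python illustration of the substitution described above. It is only a sketch of the expected result, not the tool's actual implementation, and the function name expand_page_text is hypothetical.

```python
def expand_page_text(template, page_number, page_count):
    """Replace the two documented variables in a -text template."""
    return (template
            .replace("%PageNumber%", str(page_number))
            .replace("%PageCount%", str(page_count)))


# Template taken from the -text example in Step 2.
footer = expand_page_text("PageText %PageNumber% of %PageCount%", 2, 5)
print(footer)  # PageText 2 of 5
```

If you save the -text command in a .bat file, the percent signs normally have to be doubled (%%PageNumber%%), because inside a batch file cmd.exe treats %name% as an environment variable.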

With these examples and parameters, you can convert scanned PDF to text easily. If you have any questions while using the software, please contact us.
