Easy way to extract Text from PDF with Laravel


PDF files have become a popular format for sharing and storing documents due to their platform independence and consistent layout. In Laravel, one of the most widely used PHP frameworks, extracting text from PDF files can be valuable for various applications, such as content indexing, data extraction, and document analysis.

This post explores two approaches to extracting text from PDF files in Laravel, and then builds a simple PDF-to-text extractor app.

Let's get started

To begin, make sure you have a Laravel project set up. If you don't have one, you can create it with Composer or the Laravel installer. Once your project is ready, install a package that can handle PDF extraction. This post covers two popular packages:


smalot/pdfparser

Installation:

To begin using smalot/pdfparser in your Laravel project, install the package first. Open your terminal and run the following command:

composer require smalot/pdfparser

*Note: This library requires PHP ^7.1. In case you can't use Composer, you can include [alt_autoload.php-dist](github.com/smalot/pdfparser/blob/master/alt..). It will include all required files automatically.*

Usage:

After successful installation, you can start extracting text from PDFs. The package provides a straightforward interface for this, and a few lines of code are enough:

// Parse the PDF file and build the necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/path/to/document.pdf');
// ... or parse a PDF that is already loaded into a string
$pdf = $parser->parseContent(file_get_contents('document.pdf'));

// Extract the text of the whole PDF
$text = $pdf->getText();
echo $text;

// Or extract the text of a specific page (in this case the first page)
$onlyFirstPage = $pdf->getPages()[0]->getText();

// You can also limit extraction to a number of pages. Here, only the first five pages are used.
$firstFivePages = $pdf->getText(5);
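
Text extracted from a PDF often comes back with uneven spacing and stray blank lines, which gets in the way of use cases such as content indexing. The following is a minimal sketch of a cleanup step, not something the package provides: `normalizeExtractedText()` is a hypothetical helper name, and the exact rules (collapse spaces, keep at most one blank line) are assumptions you may want to adjust.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of smalot/pdfparser):
 * tidies whitespace in text extracted from a PDF.
 */
function normalizeExtractedText(string $text): string
{
    // Unify Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Drop the space left directly before or after a line break.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;

    // Keep at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    // Remove leading and trailing whitespace from the whole string.
    return trim($text);
}
```

With the parser code above, it could be called as `normalizeExtractedText($pdf->getText())` before the text is stored or displayed.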

Advanced usage:

You can also configure the behavior of the parser. In the config, you can set DecodeMemoryLimit, FontSpaceLimit, HorizontalOffset, PdfWhitespaces, PdfWhitespacesRegex, and RetainImageContent.
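
As a rough sketch of how two of these options can be set, the snippet below builds a `Config` object and hands it to the parser. It assumes a recent 2.x release of smalot/pdfparser where these setters are available, and that Composer's autoloader is already loaded (Laravel does this for you). The memory limit value and the file path are illustrative placeholders, not recommendations.

```php
<?php

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();

// Discard raw image data while parsing to save memory on image-heavy PDFs.
$config->setRetainImageContent(false);

// Limit the memory (in bytes) used when decoding streams.
// 1000000 is an arbitrary example value.
$config->setDecodeMemoryLimit(1000000);

// The config object is passed as the second constructor argument.
$parser = new Parser([], $config);

// Placeholder path: replace with a real PDF file.
$pdf = $parser->parseFile('/path/to/document.pdf');
echo $pdf->getText();
```

Check the option names against the version you have installed before relying on them, since the available settings have grown over time.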

The package documentation covers the custom behavior and config of the parser in more detail. The library can also extract PDF metadata, page height and width, text width, and text position, and it can read base64-encoded PDFs.


spatie/pdf-to-text

Installation:

Behind the scenes this package leverages pdftotext. You can verify if the binary is installed on your system by using the following command:

which pdftotext

If it is installed it will return the path to the binary. For example:

/usr/bin/pdftotext

If not, you can install the binary with the command for your operating system:

# On Ubuntu or Debian
apt-get install poppler-utils

# On a Mac
brew install poppler

# On RedHat, CentOS, Rocky Linux or Fedora
yum install poppler-utils

*Note: on Windows, one way to install Poppler is through WSL2; see [this guide](towardsdatascience.com/poppler-on-windows-1..) for details.*

Once Poppler is installed, add the spatie/pdf-to-text package to your project using Composer:

composer require spatie/pdf-to-text

Usage:

After successful installation, you can start extracting text from PDF files. Let's look at the example code below:

use Spatie\PdfToText\Pdf;

// Easy way
$text = (new Pdf())
    ->setPdf('book.pdf')
    ->text();
echo $text;

// Or an even easier way
echo Pdf::getText('book.pdf');

Advanced usage:

pdftotext (Poppler) accepts options that customize its output to some extent. To pass these options, set them with the setOptions method as follows:

use Spatie\PdfToText\Pdf;

// First approach
$text = (new Pdf())
    ->setPdf('table.pdf')
    ->setOptions(['layout', 'r 96'])
    ->text();

// Or second approach
echo Pdf::getText('book.pdf', null, ['layout', 'opw myP1$$Word']);

Please note that successive calls to setOptions() will overwrite options passed in during previous calls.

If you need to make multiple calls to add options (for example if you need to pass in default options when creating the Pdf object from a container, and then add context-specific options elsewhere), you can use the addOptions() method.

$text = (new Pdf())
    ->setPdf('table.pdf')
    ->setOptions(['layout', 'r 96'])
    ->addOptions(['f 1'])
    ->text();

To read more about the spatie/pdf-to-text package, see its documentation.


Now, let's build a Laravel PDF-to-text extractor app

Enough theory. It's time to make our first Laravel app that can extract text from a PDF. Open your Laravel application (or create a new one) and add the following route in the routes/web.php file. The route references the controller by its class name, which is the form recent Laravel versions expect:

Route::get('/pdf-to-text', [\App\Http\Controllers\PdfToTextController::class, 'convert']);

Now, create a controller called PdfToTextController:

php artisan make:controller PdfToTextController

Then inside the controller class add the following lines of code:

For smalot/pdfparser:

smalot/pdfparser is a framework-agnostic library, so there is no service provider or alias to register in config/app.php. Composer's autoloader makes the class available. Import the parser at the top of the controller file and add a method called convert to PdfToTextController:

use Smalot\PdfParser\Parser;

public function convert()
{
    $pdfFile = public_path('files/sample.pdf'); // Replace with the path to your PDF file

    $parser = new Parser();
    $pdf = $parser->parseFile($pdfFile);
    $text = $pdf->getText();

    return view('pdf_to_text', compact('text'));
}

For spatie/pdf-to-text:

Our controller will look like this:

use Spatie\PdfToText\Pdf;

public function convert()
{
    $pdfFile = public_path('files/sample.pdf'); // Replace with the path to your PDF file

    $text = Pdf::getText($pdfFile);

    return view('pdf_to_text', compact('text'));
}

Now, create a Blade view to display the text extracted from the PDF. Create a new file inside the resources/views directory, name it pdf_to_text.blade.php, and add the following code:

<!DOCTYPE html>
<html>
<head>
    <title>PDF to Text Conversion</title>
</head>
<body>
    <h1>Converted Text from PDF:</h1>
    <pre>{{ $text }}</pre>
</body>
</html>

Now it's time to serve your project. Visit the URL `/pdf-to-text` and TADA 🎉🎉 you will see the text version of your PDF.



Conclusion

We explored how to extract text from PDF files using Laravel. With the "smalot/pdfparser" and "spatie/pdf-to-text" packages, you can easily integrate PDF text extraction into your Laravel applications. Behind the scenes, smalot/pdfparser parses the PDF structure in pure PHP, while spatie/pdf-to-text calls the pdftotext binary. Neither performs OCR (Optical Character Recognition), so scanned, image-only PDFs need a separate OCR step before any text can be extracted.

For more content, please make sure to follow us and leave a comment below for any topic you are curious about.

Keep Artisaning!!
