Search for text in PDF files and copy to folder

#1 16 Sep 2011 19:11
Janus
I am trying to perform the following task:

Search for a specific text in many PDF files in one location and copy the matching PDF files
to a different location, preferably with a log file recording where each matching PDF file was found and when.
This is further complicated by the fact that the name of the folder containing the PDF
files changes daily, following the date format YEAR-MONTH-DAY.

Example below:

Text to search for: Coffee cup

The folder containing the PDF files resides here:
\\SERVERNAME1\LONG PATH\Atlanta\2011-09-13

Copy the PDF files containing the text 'Coffee cup' to this location:
\\NEW_SERVERNAME3\LONG PATH\Chicago

**********************
The next day the folder containing the PDF files would be:
\\SERVERNAME1\LONG PATH\Atlanta\2011-09-14

Copy the found PDF files to this location:
\\NEW_SERVERNAME3\LONG PATH\Chicago

Is this possible in a batch task?

Any and all help is appreciated.

Thanks Janus
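
The dated source folder in the example can be derived from the current date. A minimal sketch, written in Python rather than batch, assuming the folder name always follows the zero-padded YEAR-MONTH-DAY pattern shown above (for example 2011-09-13). The function name `dated_folder` is hypothetical.

```python
from datetime import date
from pathlib import PureWindowsPath


def dated_folder(base, day=None):
    """Return base\\YYYY-MM-DD for the given day (default: today)."""
    day = day or date.today()
    # %Y-%m-%d yields zero-padded numeric fields regardless of regional settings
    return PureWindowsPath(base) / day.strftime("%Y-%m-%d")


print(dated_folder(r"\\SERVERNAME1\LONG PATH\Atlanta", date(2011, 9, 13)))
```

Calling `dated_folder` without a day uses today's date, which matches the "next day, next folder" behaviour described in the post.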

----------------------------

#2 17 Sep 2011 11:04
bluesxman

Do you mean search for text in the document name (which should be straightforward) or within the text of the document?

To do the latter, you'll need to find a command line tool that will extract the text from the PDF for you. Getting such a tool may prove more taxing than the script itself.

cmd | *sh | ruby | chef


----------------------------
#3 17 Sep 2011 13:17
Janus

Hi bluesxman, thanks for the reply. I mean text within the OCR scanned PDF file. I have tested and verified this little function:

Code:

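@echo off
REM /c: makes findstr match the whole phrase; without it, "Coffee" and "cup" are separate search terms
findstr /m /c:"Coffee cup" *.pdf > results.txt
if %errorlevel%==0 (
    echo Found! Logged matching files to results.txt
) else (
    echo No matches found
)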

However, I still need to combine this function with all the other requirements I mentioned. Any suggestions?

----------------------------
#4 17 Sep 2011 17:23
bluesxman

Based on some limited tests I've done, I'm not convinced your method will successfully find the PDF files you want (I can't see any plaintext strings from the document content in the PDF files I've looked at in a text editor.)

But assuming that your method does work for you, I'd go with something like this (untested!):

Code:

@echo off

set "source=\\SERVERNAME1\LONG PATH\Atlanta"
set "target=\\NEW_SERVERNAME3\LONG PATH\Chicago"
set "string=Coffee cup"

REM logs to the directory that the script sits in
set "logfile=%~dpn0.log"

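REM create the date string for today's folder (YYYY-MM-DD)
REM NOTE: the slicing assumes the DATE variable ends in DD/MM/YYYY; adjust the offsets for other regional formats
set "dt=%date:~-4%-%date:~-7,2%-%date:~-10,2%"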


call :main >> "%logfile%"

pause

goto :EOF

:main

REM inspect all files in the source directory structure
for /r "%source%\%dt%" %%a in ("*.pdf") do (
    find /c /i "%string%" "%%~a" 1>&2
    if not errorlevel 1 (
        set /p "out=%%~a / " <nul
        if exist "%target%\%%~nxa" (
            echo:Skipped - exists at target
        ) ELSE (
            copy "%%~a" "%target%"
            if errorlevel 1 (
                echo:Copy failed
            ) ELSE (
                echo:Copy succeeded
            )
        )
    )
)

goto :EOF

It displays on screen each file it inspects, but logs only the ones it tries to copy (with some basic error handling).
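
For comparison, the same flow (walk the dated folder, test each PDF, copy matches, log the outcome with a timestamp) can be sketched in Python rather than batch. This is only a sketch under stated assumptions: the function name `copy_matches` and the `matches` predicate are hypothetical, and how to decide whether a PDF actually contains the phrase is left to the caller.

```python
import shutil
from datetime import datetime
from pathlib import Path


def copy_matches(source_dir, target_dir, matches, log_path):
    """Copy every PDF under source_dir for which matches(path) is true.

    `matches` is a caller-supplied predicate (an assumption of this sketch).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as log:
        for pdf in sorted(Path(source_dir).rglob("*.pdf")):
            if not matches(pdf):
                continue
            dest = target / pdf.name
            if dest.exists():
                status = "Skipped - exists at target"
            else:
                try:
                    shutil.copy2(pdf, dest)
                    status = "Copy succeeded"
                except OSError:
                    status = "Copy failed"
            # One log line per match: time, source location, outcome
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {pdf} / {status}\n")
```

Like the batch version, it never overwrites a file that already exists at the target, and it appends to the log so that daily runs accumulate in one file.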

Last edited by bluesxman (20 Sep 2011 13:47)

----------------------------

#5 19 Sep 2011 21:20
Janus

Thanks for the function bluesxman. It works perfectly with a little tweaking here and there...

I must also admit that I was a bit hasty, as findstr doesn't find the text in .pdf files, only in .txt files. It only goes to show
that proper testing is vital :]
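
One likely reason a plain byte search misses: PDF page content is usually stored in compressed streams (commonly Flate, i.e. zlib/deflate), so the phrase never appears as literal bytes in the file. A small Python illustration, assuming a Flate-compressed stream. The sample content string is made up and does not come from a real file.

```python
import zlib

# Made-up PDF page content that draws the phrase, repeated to mimic many lines
content = b"BT /F1 12 Tf (Coffee cup) Tj ET\n" * 20
stream = zlib.compress(content)  # roughly what a FlateDecode stream stores

print(b"Coffee cup" in content)                  # phrase visible before compression
print(b"Coffee cup" in stream)                   # not visible in the compressed bytes
print(b"Coffee cup" in zlib.decompress(stream))  # visible again after decompression
```

Scanned and OCR-processed PDFs can add further obstacles, such as font-specific text encodings, so decompression alone may not be enough.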

I'm currently testing Xpdf and its pdftotext tool, and also looking at grep. If you have any other suggestions
about where to look, I will be happy to hear them.
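
A hedged sketch of how pdftotext could be used for the search, in Python rather than batch. It assumes a `pdftotext` executable (from Xpdf or Poppler) is on the PATH, and that passing `-` as the output file sends the text to standard output. The function names `extract_text` and `pdf_contains` are hypothetical.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pdf_contains(pdf_path, phrase, extract=extract_text):
    """Case-insensitive phrase search that ignores line wrapping."""
    # Collapse whitespace so "Coffee\ncup" still matches "Coffee cup"
    text = " ".join(extract(pdf_path).split()).casefold()
    needle = " ".join(phrase.split()).casefold()
    return needle in text
```

The `extract` parameter lets the search logic be tried without the tool installed, for example `pdf_contains("x.pdf", "Coffee cup", extract=lambda p: "hot COFFEE\ncup")`. How well this works on scanned files depends on the quality of the OCR text layer.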

Amazing how such a 'small' task can be so difficult. In Windows it's quite easy to do this manually by searching for a word or
phrase in a file, but apparently not in scripting. I also tried VB but ran into the same problem...

Anyhow, I will post a solution when, or should I say if, I find one for this annoying challenge.

----------------------------

#6 20 Sep 2011 14:02
bluesxman

Yeah, the difference is that Windows may have a built-in filter to extract the plaintext for searching, a filter that doesn't seem to be exposed to scripting languages. So in a script you'd need to replicate this functionality, most likely with a third-party tool.

----------------------------

#7 13 Jan 2021 11:00
HDP

Hi Janus, I'm not sure whether you are still around, as this is an old post.
Could you let me know what tweaks you made to bluesxman's code to get it working?

----------------------------

#8 20 Feb 2021 17:54
Vandalfoe

I do something similar daily.
To read the PDF file, I use pdftk with the uncompress option. The PDF file then triples in size, but it can be treated as a .txt file.
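
A rough sketch of that approach, in Python rather than batch. It assumes `pdftk` is on the PATH and uses its `uncompress` output option, then searches the raw bytes of the uncompressed copy much as findstr would. The helper names are hypothetical.

```python
import subprocess


def uncompress_command(src_pdf, dst_pdf):
    """Build the pdftk command line that writes an uncompressed copy."""
    return ["pdftk", str(src_pdf), "output", str(dst_pdf), "uncompress"]


def raw_contains(path, phrase):
    """Search the file's raw bytes for the phrase (case-sensitive)."""
    with open(path, "rb") as fh:
        return phrase.encode("latin-1") in fh.read()


def uncompressed_contains(src_pdf, tmp_pdf, phrase):
    """Uncompress src_pdf into tmp_pdf with pdftk, then search the copy."""
    subprocess.run(uncompress_command(src_pdf, tmp_pdf), check=True)
    return raw_contains(tmp_pdf, phrase)
```

Even in an uncompressed PDF, text can be split into fragments or stored in a font-specific encoding, so a literal search may still miss some files. It is worth checking against a few known matches first.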