Janus
I am trying to perform the following task:
Search for a specific text in many PDF files in one location and copy the matching PDF files
to a different location, preferably with a log file recording where each matching PDF file was found and when.
This is complicated by the fact that the folder containing the PDF
files changes every day and is named by date in the format YEAR-MONTH-DAY.
Example below:
Text to search for: Coffee cup
The folder containing the PDF files resides here:
\\SERVERNAME1\LONG PATH\Atlanta\2011-09-13
Copy the PDF files containing the text 'Coffee cup' to this location:
\\NEW_SERVERNAME3\LONG PATH\Chicago
**********************
The next day the folder containing the PDF files would be:
\\SERVERNAME1\LONG PATH\Atlanta\2011-09-14
Copy the found PDF files to this location:
\\NEW_SERVERNAME3\LONG PATH\Chicago
Is this possible in a batch script?
Any and all help is appreciated
Thanks Janus
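
The dated folder name described above can be derived from the calendar date rather than typed in each day. A minimal Python sketch, assuming the two server paths from the example; the names SOURCE_ROOT, TARGET_DIR and dated_source_dir are made up for illustration:

```python
from datetime import date
from pathlib import PureWindowsPath

# Both paths are copied from the example in the post; adjust to your servers.
SOURCE_ROOT = PureWindowsPath(r"\\SERVERNAME1\LONG PATH\Atlanta")
TARGET_DIR = PureWindowsPath(r"\\NEW_SERVERNAME3\LONG PATH\Chicago")


def dated_source_dir(day: date, root: PureWindowsPath = SOURCE_ROOT) -> PureWindowsPath:
    """Return the folder for one day, named YEAR-MONTH-DAY."""
    # date.isoformat() yields the YYYY-MM-DD form regardless of regional settings.
    return root / day.isoformat()
```

Calling dated_source_dir(date.today()) would give the folder for the current day, independent of how Windows displays dates.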
----------------------------
#2 17 Sep 2011 11:04
bluesxman
Do you mean search for text in the document name (which should be straightforward) or within the text of the document?
To do the latter, you'll need to find a command line tool that will extract the text from the PDF for you. Getting such a tool may prove more taxing than the script itself.
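
One candidate for such a tool is pdftotext (shipped with Xpdf and Poppler), which writes the extracted text to standard output when the output file is given as "-". A rough Python sketch of calling it, assuming pdftotext is installed and on PATH and that its output can be decoded as UTF-8; the function names are illustrative only:

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf: Path) -> list[str]:
    """Build the command line: -q suppresses messages, "-" sends text to stdout."""
    return ["pdftotext", "-q", str(pdf), "-"]


def extract_text(pdf: Path) -> str:
    """Run pdftotext and return the extracted text (assumes the tool is on PATH)."""
    result = subprocess.run(pdftotext_command(pdf), capture_output=True, check=True)
    # Decoding as UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

For scanned documents this only returns something useful if the PDF carries an OCR text layer.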
cmd | *sh | ruby | chef
----------------------------
#3 17 Sep 2011 13:17
Janus
Hi bluesxman, thanks for the reply. I mean text within the OCR scanned PDF file. I have tested and verified this little function:
Code:
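@echo off
REM /c: makes findstr treat "Coffee cup" as one literal phrase;
REM without it the two words are searched for separately.
findstr /m /c:"Coffee cup" *.pdf > results.txt
if %errorlevel%==0 (
    echo Found! Logged matching files to results.txt
) else (
    echo No matches found
)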
----------------------------
#4 17 Sep 2011 17:23
bluesxman
Based on some limited tests I've done, I'm not convinced your method will successfully find the PDF files you want (I can't see any plaintext strings from the document content in the PDF files I've looked at in a text editor.)
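
A likely reason is that PDF writers commonly store page content as deflate-compressed streams, so the phrase is not present as plain bytes in the file. The Python sketch below imitates that with zlib on a made-up stand-in for a page's text; it illustrates the effect and is not a PDF parser:

```python
import zlib

phrase = b"Coffee cup"
# Stand-in for the text of one page (invented sample data).
page_text = b"Invoice for one Coffee cup. " * 10

# Roughly what a PDF writer does before storing the page content.
stored = zlib.compress(page_text)

found_raw = phrase in stored                       # what a byte-level search sees
found_decoded = phrase in zlib.decompress(stored)  # what a text extractor sees
print(found_raw, found_decoded)
```

A tool like find or findstr only ever sees the equivalent of `stored`, which is why it comes up empty.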
But assuming that your method does work for you, I'd go with something like this (untested!):
Code:
@echo off
set "source=\\SERVERNAME1\LONG PATH\Atlanta"
set "target=\\NEW_SERVERNAME3\LONG PATH\Chicago"
set "string=Coffee cup"

REM logs to the directory that the script sits in
set "logfile=%~dpn0.log"

REM create the date string for today's files (YYYY-MM-DD);
REM assumes %date% ends in DD/MM/YYYY, which depends on regional settings
set "dt=%date:~-4%-%date:~-7,2%-%date:~-10,2%"

call :main >> "%logfile%"
pause
goto :EOF

:main
REM inspect all files in the source directory structure
for /r "%source%\%dt%" %%a in ("*.pdf") do (
    REM find's own output goes to stderr so it stays out of the log
    find /c /i "%string%" "%%~a" 1>&2
    if not errorlevel 1 (
        REM print the file name without a trailing newline
        set /p "out=%%~a / " <nul
        if exist "%target%\%%~nxa" (
            echo:Skipped - exists at target
        ) ELSE (
            copy "%%~a" "%target%"
            if errorlevel 1 (
                echo:Copy failed
            ) ELSE (
                echo:Copy succeeded
            )
        )
    )
)
goto :EOF
Last edited by bluesxman (20 Sep 2011 13:47)
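
For comparison, the same search, copy and log loop can be sketched in Python with the text extraction passed in as a function, so that whichever PDF-to-text tool ends up working can be plugged in. Everything here (copy_matches, its parameters, the log line layout) is hypothetical and not part of the script above; a timestamp is added to each log line because the original request asked for one:

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Callable


def copy_matches(source: Path, target: Path, phrase: str,
                 extract: Callable[[Path], str], log: Path) -> list[Path]:
    """Copy every PDF under source whose extracted text contains phrase."""
    copied = []
    with log.open("a", encoding="utf-8") as out:
        for pdf in sorted(source.rglob("*.pdf")):
            # Case-insensitive match, like find /i in the batch version.
            if phrase.lower() not in extract(pdf).lower():
                continue
            dest = target / pdf.name
            if dest.exists():
                status = "Skipped - exists at target"
            else:
                shutil.copy2(pdf, dest)  # raises on failure instead of logging it
                copied.append(dest)
                status = "Copy succeeded"
            stamp = datetime.now().isoformat(timespec="seconds")
            out.write(f"{stamp} {pdf} / {status}\n")
    return copied
```

The extract argument is any function that takes a PDF path and returns its text, which keeps the loop testable without a real PDF tool installed.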
----------------------------
#5 19 Sep 2011 21:20
Janus
Thanks for the script, bluesxman. It works perfectly with a little tweaking here and there...
I must also admit that I was a bit hasty: findstr finds the text in .txt files but not in .pdf files. It only goes to show
that proper testing is vital :]
I'm currently testing Xpdf and its pdftotext tool, and also looking at grep. If you have any other suggestions
where to look, I will be happy to hear them.
Amazing how such a 'small' task can be so difficult. In Windows it's quite easy to do this manually by searching for a word or
phrase in a file, but apparently not in scripting. I also tried VB but ran into the same problem...
Anyhow, I will post a solution when, or should I say if, I find one for this annoying challenge.
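
If a text extractor such as pdftotext is used, the extracted text may break a phrase across lines or differ in case, especially with OCR output. A small Python sketch of a more forgiving match; this is an assumption about what the output can look like, and contains_phrase is an illustrative name:

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive match that treats any run of whitespace as one space."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()

    return normalize(phrase) in normalize(text)
```

With this, "Coffee" at the end of one line and "cup" at the start of the next still count as a hit.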
----------------------------
#6 20 Sep 2011 14:02
bluesxman
Yeah, the difference is that Windows may have a built-in filter that extracts the plain text for searching, a filter that doesn't seem to be exposed to scripting languages. So in a script you'd need to replicate this functionality, most likely with a third-party tool.
----------------------------
#7 13 Jan 2021 11:00
HDP
Hi, I'm not sure whether you are still around, as this is an old post.
Could you let me know what tweaks you made to bluesxman's code to get it working?
----------------------------
#8 20 Feb 2021 17:54
Vandalfoe
I do something similar daily.
To read the PDF file, I use pdftk with the uncompress option. The PDF file then triples in size, but it can be searched like a .txt file.
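
A rough Python sketch of that approach, assuming pdftk is installed and on PATH; the function names and the choice of a separate output file are illustrative. Even after uncompressing, a PDF can encode or split its text in ways that hide a phrase, so a miss here does not prove the phrase is absent:

```python
import subprocess
from pathlib import Path


def pdftk_uncompress_command(src: Path, dst: Path) -> list[str]:
    """Build the pdftk call that writes an uncompressed copy of src to dst."""
    return ["pdftk", str(src), "output", str(dst), "uncompress"]


def uncompressed_contains(src: Path, dst: Path, phrase: str) -> bool:
    """Uncompress src into dst with pdftk, then search dst's raw bytes."""
    subprocess.run(pdftk_uncompress_command(src, dst), check=True)
    # Byte-level search; only reliable when the text is stored literally.
    return phrase.encode("latin-1", errors="replace") in dst.read_bytes()
```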