Automating image scans with OCR and importing to DS Note.

ismithers
Experienced
Posts: 114
Joined: Mon Jun 03, 2013 9:56 am

Automating image scans with OCR and importing to DS Note.

Unread post by ismithers » Sun Nov 30, 2014 3:45 am

Has anyone attempted to get PyPDFOCR running on their DS? I have a DS412+ and got as far as installing the dependencies, whereupon I could not get PIL, reportlab and poppler installed. They all failed: poppler with a Python error, and reportlab with a GCC error where it was looking for a compiler that was not part of the GCC package I installed from IPKG.

I could not find a package for PIL anywhere. Just wondering if anyone else has tried.
Last edited by ismithers on Sat Jan 24, 2015 7:23 am, edited 1 time in total.

ismithers
Experienced
Posts: 114
Joined: Mon Jun 03, 2013 9:56 am

Re: PyPDFOCR

Unread post by ismithers » Mon Dec 01, 2014 1:08 am

I gave up on this and wrote my own Python script that uses Tesseract and Enscript and then imports to Note Station.

deepc
I'm New!
Posts: 1
Joined: Fri Jan 23, 2015 2:54 pm

Re: PyPDFOCR

Unread post by deepc » Fri Jan 23, 2015 2:56 pm

Can you please share your script?

ismithers
Experienced
Posts: 114
Joined: Mon Jun 03, 2013 9:56 am

Re: PyPDFOCR

Unread post by ismithers » Sat Jan 24, 2015 4:36 am

Sure, with a disclaimer: I wasn't intending to release this, so it is pretty dry as far as code goes. Make sure you have Tesseract OCR and Enscript installed and have configured their paths (or added them to your environment variables and adjusted the script).

Chuck the script at the end of this post in a Python file; for ease of use I suggest grabbing Sublime Text, which has a built-in Python environment that can be triggered with CTRL+SHIFT+B.
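
Separately from the script at the end of this post, a quick way to confirm that both tools can be found is sketched below. It assumes Python 3.3 or later (for `shutil.which`) and that the executables are named `tesseract` and `enscript`; the `find_tools` helper is made up for illustration. Note that the script itself calls Enscript through a hard-coded path, so a "not found" result here only matters if you switch to relying on the PATH.

```python
import shutil


def find_tools(names=("tesseract", "enscript")):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_tools().items():
        print("{0}: {1}".format(name, location or "NOT FOUND on PATH"))
```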

My usage scenario is as follows:
I have an Epson XP 610 printer, which can do cloud scanning, so I set it up to scan and drop images into a folder in my Dropbox account. I then linked CloudStation to my Dropbox account and had it sync that folder to my PC. This script then just ran in a loop on my PC: as soon as files appeared in the directory I was monitoring, it ran OCR on them, created a new Evernote note, attached the image and text, and then moved the files to an archive directory.

After that it was just a matter of one evening sitting at my printer scanning in all my receipts, and another evening going through and tagging/titling them all correctly. Once done, I imported them to DS Note, and now I just do a manual scan/note create whenever I get a new receipt, so it's minimal work.

Things I considered doing but discarded:
  1) Getting it to parse the date. Unfortunately, due to the variety of date formats vendors use, I decided it was more work than reward to implement. It could still be viable, but I only had 500 receipts to scan. If you have, say, 10000, it could save you some time.
  2) Same thing for the vendor names. It is very difficult to recognise what text represents the name, due to different receipt layouts.
  3) Better OCR. Tesseract has a wealth of options that you can tinker with. Generally speaking I just left it at the defaults, since I manually went through and renamed my receipts with the store title; the OCR is mainly a backup for searching. For example, I purchased a hat stand, so searching for "hat" brings up that receipt, despite me forgetting where and when I bought it. The OCR provides an advantage in these cases and is cheap to get.
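
For anyone who does want to try the date parsing from point 1, a minimal sketch might look like the following. It is not part of the script below: the `DATE_PATTERNS` list and the `guess_date` name are made up for illustration, the three formats are only examples, and the slash format is assumed to be day-first (a US-style receipt would need `%m/%d/%Y` instead). The month-name format also depends on the locale.

```python
import re
from datetime import datetime

# Hypothetical list of formats; extend it to match the receipts you actually have.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),         # 2015-01-24
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%d/%m/%Y"),         # 24/01/2015 (day first, assumed)
    (re.compile(r"\b\d{1,2} [A-Za-z]{3} \d{4}\b"), "%d %b %Y"),  # 24 Jan 2015
]


def guess_date(ocr_text):
    """Return the first valid date found, trying the patterns in list order, or None."""
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.finditer(ocr_text):
            try:
                return datetime.strptime(match.group(0), fmt).date()
            except ValueError:
                continue  # Looked like a date but was not a valid one, e.g. 31/02/2014.
    return None
```

Calling `guess_date` on the contents of the `.txt` file that Tesseract writes would give a `datetime.date` to use in the note title, or `None` when nothing matches and the receipt needs dating by hand.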
One final note, my printer very nicely has options to auto-crop scanned images. Keep that in mind as you don't want A4 images for tiny card receipts or whatever. I suggest doing some test scans to see what scan settings yield the best OCR results.

If you have better processes, please share them. :) Ta!

Code:

def main():
	# Only the modules the script actually uses (msvcrt, string and sys were unused, and msvcrt is Windows-only).
	import time, os, subprocess, shutil

	print('Directory monitor started')

	# Choose whether to use:
	# A) The current working directory...
	#path = os.getcwd()
	# B) Custom path: I pointed this to a directory that my printer uploads scanned images to, and that CloudStation syncs to my local PC.
	path = 'D:\\CloudStation\\dropbox\\epson scanned files'
	
	# Flags:
	moveFile = False # Moves any processed files into a 'processed' directory for archiving.
	uploadToEvernote = False # Triggers enscript to create a new Evernote note and attach the contents.
	
	# If you wish this to run continually, uncomment the 'while True:' and then tab-indent everything below that line up to and including the 'raise' line.
	# It's worth noting you will need to manually kill the process if you choose to do this.
	
	# while True:
	time.sleep(0.25) # Polls every 1/4 of a second.

	files = []
	fileNames = os.listdir(path)
	for f in fileNames:
		if f.endswith('.jpg') or f.endswith('.tif'): # Check for valid image files.
			files.append(f)

	numFiles = len(files)

	if (numFiles > 0):
		print('Found {0} files: {1}'.format(numFiles, files))

		time.sleep(1.0) # Allow time for file IO to finish.

		processedDir = os.path.join(path, 'processed')
		if (not os.path.exists(processedDir)):
			print('Creating directory for processed images')
			os.mkdir(processedDir)

		print('Found {0} new files, processing...'.format(numFiles))
		for i in range(0, numFiles):
			try:
				# Processing the image file via Tesseract OCR. This generates a text file with the image text in it.
				src = os.path.join(path, files[i])
				outFileName = os.path.splitext(files[i])[0] # Extensionless as Tesseract adds on the .txt file extension.
				out = os.path.join(path, outFileName)
				tesseract = ['tesseract', src, out]
				subprocess.call(tesseract, stdin=None, stdout=None, stderr=None, shell=False)
				
				if uploadToEvernote:					
					# Add the contents to Evernote.
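					# Each switch and its value are passed as separate list items, so paths containing spaces are handed over intact.
					enscript = ['C:\\Software\\Evernote\\enscript', 'createNote', '/n', 'receipts', '/a', src, '/s', '{0}.txt'.format(out)]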
					subprocess.call(enscript, stdin=None, stdout=None, stderr=None, shell=False)

				if moveFile:
					# Moving the image file to the processed directory.
					dstSrc = os.path.join(path, 'processed', files[i])
					shutil.move(src, dstSrc)

					# Moving the text file to the processed directory.
					dstOut = os.path.join(path, 'processed', '{0}.txt'.format(outFileName))
					shutil.move('{0}.txt'.format(out), dstOut)
			except Exception:
				raise

	print('Directory monitor stopping')

if __name__ == "__main__":
	main()
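
The script above waits a fixed second for file IO to finish and has to be killed by hand when looping. As a rough alternative sketch (not the original author's code), the loop below waits until each file's size stops changing and exits cleanly on Ctrl+C. The names `list_images`, `wait_until_stable` and `watch` and the default intervals are assumptions for illustration, and `handle` is expected to move each file out of the watched directory, as `moveFile = True` does above; otherwise the same files are picked up again on every poll.

```python
import os
import time

IMAGE_EXTENSIONS = ('.jpg', '.tif')


def list_images(path):
    """Return image file names in path, matching the extensions case-insensitively."""
    return sorted(f for f in os.listdir(path) if f.lower().endswith(IMAGE_EXTENSIONS))


def wait_until_stable(file_path, interval=0.5, checks=2):
    """Block until the file size is unchanged for `checks` consecutive polls."""
    last_size = os.path.getsize(file_path)
    stable = 0
    while stable < checks:
        time.sleep(interval)
        size = os.path.getsize(file_path)
        stable = stable + 1 if size == last_size else 0
        last_size = size


def watch(path, handle, poll_seconds=0.25):
    """Call handle(full_path) for every image found until interrupted with Ctrl+C."""
    print('Directory monitor started')
    try:
        while True:
            for name in list_images(path):
                full_path = os.path.join(path, name)
                wait_until_stable(full_path)  # Sync or upload may still be writing the file.
                handle(full_path)             # handle must move or delete the file.
            time.sleep(poll_seconds)
    except KeyboardInterrupt:
        print('Directory monitor stopping')
```

Something like `watch('D:\\CloudStation\\dropbox\\epson scanned files', process_one)` would start it, where `process_one` is a hypothetical function wrapping the Tesseract, Enscript and move steps from the script above.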
