Connect a scanner and OCR system to LedgerSMB for document storage (invoices, receipts) and retrieval.
==============================================================
Draft in development - v. 0.1.2 - 2011
Editor: Havard Sorli, http://www.anix.no

Long term goal:
Scan all incoming snail mail (invoices, receipts), detect the invoices and run OCR on them to locate vendor and amount, tag each document, and make the transaction available to the accountant for assignment to the right account and approval by the right people.
(I think this could be part of a receiving system for other electronic invoices too. Lots of similar logic.)

Suggest 4 release cycles:

1. Step: Scan and store, connect to a transaction in LedgerSMB
   Scan on MFC / network scanner / photo from phone
   Deliver to e-mail ledgersmb+@domain.com [1] or shared folder
   Spool handler interface in LedgerSMB (general - with plug-in interface)
   Icons/links on transaction(s): [Show document] [Add document] [Source: File]
   Example: http://www.sicon.co.uk/sales_ledger_invoices_credits.html
   [1] (RFC 5233 "sub-addressing") [2]

2. Step: Make it easy to use
   Based on feedback from users

3. Step: OCR
   tesseract-ocr
   http://en.wikipedia.org/wiki/Tesseract_%28software%29
   http://code.google.com/p/tesseract-ocr/
   Make the scanned PDFs searchable
   http://ubuntuforums.org/showthread.php?t=1647350

4. Step: Locate vendor and amount in the OCR output and tag the document
   Complete the long term goal. [2]

Questions asked:
-------------------------------------------------------
Use an external document management system?
http://en.wikipedia.org/wiki/Document_management_system
http://en.wikipedia.org/wiki/Document_capture_software

A way to distinguish spam from the true queue items:
1) mailaddress+verylongcode@domain.com
2) Use SpamAssassin and make some spam rules based on your trading partners' details in the DB (mail addresses, phone numbers ...)
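
Option 1 above (a long code in the sub-address) could be sketched roughly as follows. This is only an illustration in Python, not LedgerSMB code: the secret, the function names, the "partner.code" layout of the sub-address and the use of an HMAC to derive the code are all assumptions made for the example. The idea is that the code is derived from a secret, so the receiving side can recompute it and drop mail whose sub-address does not match.

```python
import hashlib
import hmac

# Hypothetical secret known only to the receiving system (assumption).
SECRET = b"change-me"


def make_code(partner_id: str, secret: bytes = SECRET, length: int = 32) -> str:
    """Derive a long, hard-to-guess code for one trading partner."""
    digest = hmac.new(secret, partner_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def make_address(partner_id: str, local: str = "ledgersmb",
                 domain: str = "domain.com", secret: bytes = SECRET) -> str:
    """Build local+partner.code@domain for handing out to one partner."""
    return f"{local}+{partner_id}.{make_code(partner_id, secret)}@{domain}"


def is_queue_address(address: str, local: str = "ledgersmb",
                     domain: str = "domain.com", secret: bytes = SECRET) -> bool:
    """Return True only if the sub-address carries a code we issued."""
    mailbox, at, addr_domain = address.rpartition("@")
    if not at or addr_domain.lower() != domain:
        return False
    user, plus, detail = mailbox.partition("+")
    if not plus or user != local:
        return False
    partner_id, dot, code = detail.rpartition(".")
    if not dot or not partner_id:
        return False
    expected = make_code(partner_id, secret)
    # Compare as bytes, in constant time.
    return hmac.compare_digest(code.encode("utf-8"), expected.encode("utf-8"))
```

A mail filter in front of the queue could call is_queue_address() on the recipient and pass everything else on to normal spam handling (option 2).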
Possible tech used:
-------------------------------------------------------
aGiro - OCR scanner for Swedish bills (alpha version)
  Android application, Developer: https://github.com/pakerfeld
  Source code is gone from GitHub (License: Apache2)

Expense tracking - Android app - exports QIF file
  http://f-droid.org/forums/topic/expense-tracking/

Procmail - filtering of incoming mail (Unix ok / Windows ??)
  (OTRS uses procmail (.procmailrc) as filter, www.otrs.org)
  Look at OTRS for Perl modules/routines for mail handling from Windows (IMAP / POP3 retrieval)
  http://doc.otrs.org/3.0/en/html/

sftp/scp - uploading of scanner files
  (please do not use FTP with clear-text passwords for this task)

WebDAV over HTTPS - uploading of scanner files
  http://en.wikipedia.org/wiki/WebDAV

incron & inotify - on close-after-write, exec processing, OCR, import ...
  http://inotify.aiken.cz/?section=incron&page=doc&lang=en

Gmane tools - http://gmane.org/dist.php
  weft: a command line based program that takes a mail or a news message and formats it into HTML.
  twine: a command line based program that converts mail into RSS. It's based on libxml2 and libgmime.

Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl and other languages. If you're after a packaged search engine for your website, take a look at Omega, an application the Xapian project supplies, built upon Xapian.
  http://xapian.org/

http://itunes.apple.com/gb/app/business-card-reader/id328175747?mt=8

ZBar bar code reader
  http://sourceforge.net/projects/zbar/
  A comprehensive software suite for reading barcodes. Supports EAN/UPC, Code 128, Code 39, Interleaved 2 of 5 and QR Code. Includes libraries and applications for decoding captured barcode images and using a video device (e.g. webcam) as a barcode scanner.
  More current uses of ZBar:
  http://sourceforge.net/apps/mediawiki/zbar/index.php?title=Poll:_How_do_you_use_ZBar%3F
  Jörg and Paul each have an application that splits multi-page TIFF scans into smaller documents using Perl. The barcodes are used both to mark the split points and to name each new section.
  http://sourceforge.net/forum/forum.php?thread_id=2422910&forum_id=664596

Business Card Reader by SHAPE Services
  http://itunes.apple.com/gb/app/business-card-reader/id328175747?mt=8
  Good reviews.
  http://twitter.com/#!/miniforetak/statuses/172803669161553920
  "Downloaded the app BusinessCardReader. Scans and archives business cards with astonishing precision."
  Other alternatives:
  http://appadvice.com/appguides/show/business-card-scanning-apps

Open Document Management API
  http://en.wikipedia.org/wiki/ODMA
  http://odma.info/active/ (ActiveODMA Development)

Google API
  Apply accounting rules based on keywords in the document with the Google API

Dublin Core
  http://en.wikipedia.org/wiki/Open_Source_Metadata_Framework

Convert PDF to HTML (the demo looks good). Easier to look at in the browser.
  http://pdftohtml.sourceforge.net/

LedgerSMB::Scripts::import_trans
  This is a module that demonstrates how to set up scripts for importing bulk data
  http://ledger-smb.svn.sourceforge.net/viewvc/ledger-smb/addons/1.3/import_trans/trunk/scripts/import_trans.pl?revision=3093&view=markup

Projects to look at: (Have others done this?)
  Maybe: OpenPro (Check...!)
  http://en.wikipedia.org/wiki/OpenPro#History
  2009 - first to have OCR integration with Payables processing

Possible integration with external document management systems:
---------------------------------------------------------------
Article from 2009: Open Source Document Management and Fujitsu scanners with different platforms
  http://neteasy.us/news/2009/document-managment-on-different-platforms

OpenKM - http://en.wikipedia.org/wiki/OpenKM (GPL2, Java)

KnowledgeTree - http://knowledgetree.org/ (GPL3, PHP 5.2.x, MySQL, Java (Apache POI and Lucene))
  http://wiki.knowledgetree.org/Platform_Requirements
  http://forge.knowledgetree.com/gf/

Alfresco
  http://en.wikipedia.org/wiki/Alfresco_%28software%29

ScrollKeeper: Open Source Document Management
  http://www.xml.com/pub/a/2001/11/28/scrollkeeper.html
  http://en.wikipedia.org/wiki/ScrollKeeper

Rarian
  http://en.wikipedia.org/wiki/Rarian

EPIWare - http://www.epiware.com/ - Project and document management (GPL2, PHP)
  https://sourceforge.net/projects/epiware/files/
  Project dead? Last file from 2008

OCR service: Convert receipts and invoices to Xero data with the click of a button
  http://landing.shoeboxed.com/xero-shoeboxed-signup/

OCR and billing
  http://www.bill.com/about-us/

DocMGR is a complete, web-based Document Management System (DMS). It allows for the storage of any file type, and supports full-text indexing of the most popular document formats. DocMGR runs on PHP, the Apache webserver, and PostgreSQL. It uses tsearch2 for full-text indexing.
  Old project: on sourceforge.net since 2001-11-08
  License: GPLv2
  Uses WebDAV
  OCR: integrated Tesseract and gocr
  API: http://www.docmgr.org/api-documentation/ (workflow from API)
  DocMGR has integrated OCR with PDF-to-text conversion and indexing of the result.
  How does DocMGR handle searchable PDFs?
  https://sourceforge.net/projects/docmgr/forums/forum/125579/topic/4696532
  http://www.docmgr.org/about/
  https://sourceforge.net/projects/docmgr/

Document Management (long Drupal thread, from 2006 onward)
  http://drupal.org/node/57400
  The thread mentions: "Here are 2 serious DMS opensource products:" (2006)
  - KnowledgeTree http://www.ktdms.com/products/ktdmsfeatures (looks very powerful)
  - MyDMS http://dms.markuswestphal.de/about.html (already integrated in eGroupWare)
  - OWL (http://owl.sourceforge.net/)
  - DocMGR: http://drupal.org/node/57400#comment-192829 (for a description of DocMGR, see above)
  - Alfresco
  - LogicalDOC Community Edition - LGPL v.3 - Java - Webservice - not ?
  - CMIS API: http://drupal.org/project/cmis
    The CMIS API project aims to provide a generic API for integrating with CMIS compliant Enterprise CMS (ECM) systems. This is a joint effort between Optaros, Acquia, and Alfresco.
    This looks like a good way to do it for LedgerSMB.

New Tutorial: Getting Started with CMIS (standard) - well written, recommended reading
  http://www.optaros.com/blogs/new-tutorial-getting-started-with-cmis (2009)
  http://ecmarchitect.com/images/articles/cmis/cmis-article.pdf

Content Management Interoperability Services (CMIS)
  http://www.oasis-open.org/committees/cmis
  "founded by IBM, Microsoft, and EMC and then broadened in 2007 to include Alfresco, Open Text, SAP, and Oracle."
  “CMIS is SQL for content management,”

cmislib: A CMIS client library for Python
  http://chemistry.apache.org/python/cmislib.html

Apache Chemistry - open source implementations of the Content Management Interoperability Services (CMIS) specification
  http://chemistry.apache.org/
  - Java lib
  - Python, PHP, .NET

CMIS Navigator - http://www.open-t.nl/?webpages=navigator
  CMIS Navigator is a Python/GTK based desktop client for CMIS repositories.
  # Drag and drop files and folders from and to Nautilus and from and to itself
  # Check in, check out, edit
  # Download files for viewing
  # View and edit metadata

Assembled Solutions
  http://www.optaros.com/blogs/metrics-for-the-assembled-web
Assembly Oriented Architecture
  http://www.optaros.com/blogs/assembly-oriented-architecture

TwainX - (IE - .NET / win32 - TWAIN)
  Scanner interface is wrapped as a scriptable object that developers can download and host on their web sites, have a few lines of code embedded in their pages where they wish to integrate scanning and ..., well that is it. You should have a very decent set of scanner controlling features available to you.
  http://twainx.sourceforge.net/features.html

OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.
  - Python
  - Unicode, largely works
  http://code.google.com/p/ocropus/

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information.
  http://code.google.com/p/hocr-tools/

Decapod - scanner with Canon PowerShot G10 and a Python service
  http://wiki.fluidproject.org/display/fluid/Decapod+0.4+User+Guide#Decapod0.4UserGuide-Introduction

Sidenote: XML Advanced Electronic Signatures (XAdES)
  Used in Norwegian BankID
  http://uri.etsi.org/01903/v1.2.2/

Sidenote: eyeOS is an open source web desktop following the cloud computing concept that seeks to enable collaboration and communication among users. It is mainly written in PHP, XML, and JavaScript.
  http://en.wikipedia.org/wiki/EyeOS

Sidenote: Web based tender system to source, award and manage the total procurement process.
  Leverages supply and demand, through reverse auction, ensuring that goods are bought at the best possible price. Developed in PHP with MySQL/PostgreSQL database.
  https://sourceforge.net/projects/tendersystem/

SAP has WebDAV support
  http://help.sap.com/saphelp_nw70/helpdata/en/0c/e6504062939523e10000000a1550b0/frameset.htm

Comments:
--------------------------------------------------------
To look at: Wave's import of bank transactions - posting to accounts - http://waveaccounting.com/
  From the list: http://homeschoolent.com/2011/08/15-free-accounting-programs-for-small-businesses/

Virtual accounting: "we are moving away from manual data entry"
  http://openacct.wordpress.com/2011/09/29/virtual-accounting/

Electronic Invoices as PDF/A, According to Italian Law
  http://www.pdfa.org/2009/06/electronic-invoices-as-pdfa-according-to-italian-law/

Commercial alternatives:
------------------------------------------------
EMC Captiva - invoice capture solution.
  Paper with details on how Captiva interfaces with content repositories and Accounts Payable systems from vendors like SAP. (2010) (h4871-inputaccel-invoices-sap-wp.pdf)
  http://www.emc.com/collateral/software/white-papers/h4871-inputaccel-invoices-sap-wp.pdf

Upside billing - spool processing - nice feature list for our project
  http://www.upsidesoft.com/Upside+Software/PDF/UpsideBillingFAQ.pdf

http://www.e-conomic.co.uk/

Kofax Mobile Capture
  http://www.kofax.com/software/mobile-capture/features.php

Technical implementation - early proposal / idea
================================================
1. Step: Scan and store, connect to a transaction in LedgerSMB
   Deliver to e-mail ledgersmb+@domain.com [1] or shared folder
   Spool handler interface in LedgerSMB.
   (general - with plug-in interface)
   Icons/links on transaction(s): [Show document] [Add document] [Source: File]
   Example: http://www.sicon.co.uk/sales_ledger_invoices_credits.html
   [1] (RFC 5233 "sub-addressing") [2]

Transcript of IRC discussion / feedback:
-------------------------------------------------------
(slightly edited, removed irc noise)

Would you like to "Connect a scanner and OCR system to LedgerSMB for document storage (invoices, receipts) and retrieval."?
haso: in 1.3, invoices, orders, and ar/ap/gl transactions can take attachments. Receipts and checks currently can't, for reasons that are a little annoying, but will be able to in 1.4. I could add them but would probably require sponsorship....
anyway the preferred way to do this is to scan an image and attach that to the invoice or transaction. You could add OCR but that introduces possible errors, etc.
A second option is to use an external document management solution and attach URLs to these documents. whatever works best for you.
I have an outline: http://www.anix.no/p/OCR-2-ledgerSMB.txt
ah, you mean for data entry.
couple suggestions:
1) You really are going to do better not to create invoices in 1.3 in an automated process. You can do it with ar/ap/gl transactions as long as you don't set them to approved. but if you create orders instead of invoices, then someone can quickly review them for sanity and then generate the invoices.
2) The big difficulty here is likely to be the fact that vendor invoices (the ones inbound) are likely to be in different formats and layouts.
I know.
It isn't clear to me how the variable layout problem would be solved.
But that is step 4... Step 1 - 3 would be a huge step.
scan and store would not be too hard in 1.3 for existing invoices. or even non-existing invoices.
Also one could add text attachments via OCR, and add search routines to those using full text search.
This is easier if you require PostgreSQL 8.3
Yeah, if there was a module for this, we'd probably include it in 1.4 stock, and 1.3 addons.
also I think ehu has had some thoughts on something similar, and so our attachment system is built with that in mind.
If I am correct: the thing we are missing to implement this is the "Spool handler interface", and some logic under it...
I'm sorry, not a Perl programmer... How much time do you think it would take to make the interface?
metatrontech: "Receipts and checks currently can't... could add them.." How much sponsorship are we talking about?
haso, checking and adjusting my previous estimates.
also.... for a spooler interface, not very much effort there.
haso: the issue with the payments/receipts is that there isn't really a good db structure to hang them on, which means a lot of UI development. In the past I said $720, which is probably still a decent estimate (would be quoted maximum). Most other objects could have file attachments added for maybe $60-$120
for a spooler interface, if I was writing it, it wouldn't be more than $120 if we weren't figuring out automatically what to attach to.
anyway, off to eat. back in a bit
ehu: Do you have comments on OCR 2 ledgerSMB ? ( http://www.anix.no/p/OCR-2-ledgerSMB.txt )
haso: viewed your posted suggestion. I like it. The main question I would like to raise is: should we delegate that to an external CMS/DMS or rather: how far should we take our document storage? (scans seem to be an obvious use for the current system I admit)
btw, haso, if you want to do the handler for incoming scanner stuff.... I could provide an overview of how to do this too. (that's free help, of course)
my thinking is that these would be temporarily unlinked and then linked to the relevant transactions as needed.
Also I'd recommend doing the OCR stuff for 1.4 because full text searching of notes will be a larger priority.
I am also not 100% sure how effectively searching links would work for an external DMS. I mean searching the documents is straight-forward but searching documents linked to invoice 1234 may be more of a problem
anyway, haso: this would be worth discussing on the developer list as well. you might get more/better feedback there too
do you know of anyone who hooked up ledgersmb with paypal?
Hi. Psy-Q: I don't know of anyone but good to ask on the lists. I am working on hooking it up with Amazon FPS.
I believe I may have heard of others mentioning it. So not 100% sure how full the integration was.
okay! thanks
--
I also have some intentions to automate all the follow-up paperwork with flat rental (rent, dividing water and electricity ++) based on incoming invoices.
nice idea. as part of LSMB?
Yes. It's the next step in the OCR project.
after OCR, or before? seems like that one could be achieved much more easily than the OCR bit.
The OCR bit is more important to me. So I can get rid of the papers... And make it possible to work together on money issues in the company. Cooperation with the accountant.. and so on..
yea. that's a good reason.
I have been doing some thinking on this http://www.anix.no/p/OCR-2-ledgerSMB.txt and what we need is an incoming queue with some plug-in possibilities.
yup. you do. and you need a way to distinguish spam from the true queue items.
If the queue supports plug-ins, it would be easy to extend to different incoming formats (scan, photo, e-mail invoice ... orders)
with our current file attachments, you probably have the store available.
Yes, it looks good.
"distinguish spam": if you send to mailaddress+verylongcode@domain.com I think you would be quite safe (for a while..)
or you could use SpamAssassin and make some spam rules based on your trading partners' details in the DB (mail addresses, phone numbers ..)
that works too, of course.
----
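
The incoming queue with plug-ins discussed in the transcript could look roughly like the sketch below. It is written in Python purely for illustration; LedgerSMB itself is Perl, and none of these class or function names exist in LedgerSMB. The sketch assumes one handler per incoming source (scan, photo, e-mail ...), and that items stay unlinked until someone attaches them to a transaction, as suggested in the discussion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SpoolItem:
    """One incoming document waiting in the queue (hypothetical structure)."""
    source: str                            # e.g. "scan", "photo", "email"
    filename: str
    payload: bytes
    transaction_id: Optional[int] = None   # None means "not linked yet"
    notes: List[str] = field(default_factory=list)


Handler = Callable[[SpoolItem], None]


class SpoolQueue:
    """Incoming queue that dispatches each item to a registered plug-in."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.items: List[SpoolItem] = []

    def register(self, source: str, handler: Handler) -> None:
        """Plug in a handler for one kind of incoming source."""
        self._handlers[source] = handler

    def receive(self, item: SpoolItem) -> bool:
        """Run the matching handler and store the item; reject unknown sources."""
        handler = self._handlers.get(item.source)
        if handler is None:
            return False
        handler(item)
        self.items.append(item)
        return True

    def unlinked(self) -> List[SpoolItem]:
        """Items not yet attached to any transaction."""
        return [i for i in self.items if i.transaction_id is None]

    def link(self, item: SpoolItem, transaction_id: int) -> None:
        """Attach an item to a transaction once someone has reviewed it."""
        item.transaction_id = transaction_id


def scan_handler(item: SpoolItem) -> None:
    """Example plug-in: just records the size; OCR could be added here later."""
    item.notes.append(f"scan received: {len(item.payload)} bytes")
```

New formats would then only need a new handler passed to register(); the queue and the linking step stay the same.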