33 Commits

Author SHA1 Message Date
Roberto Rosario
deb09d3d83 Re-enabled tesseract language-specific OCR processing and added a one-time language-neutral retry for failed language-specific OCR 2011-11-22 17:46:18 -04:00
Roberto Rosario
f0c019f6fc Reduce severity of the messages displayed when no OCR backend is found for a language 2011-11-06 01:06:43 -04:00
Roberto Rosario
bcb61c3ca3 Enabled OCR queue transformation processing 2011-07-25 03:40:15 -04:00
Roberto Rosario
90e876ca93 Code cleanup 2011-07-21 11:46:15 -04:00
Roberto Rosario
89fc258a59 Adapted the OCR app to the new pre cache and preview generation methods 2011-07-21 03:49:27 -04:00
Roberto Rosario
8579c5081d Improved OCR file conversion 2011-07-19 20:56:21 -04:00
Roberto Rosario
648be556a6 Finished adapting the OCR app to the new transformations refactor 2011-07-19 04:21:36 -04:00
Roberto Rosario
5bfd607b31 Removed pdftotext from the requirements, move unpaper calling to the OCR app 2011-07-18 04:06:19 -04:00
Roberto Rosario
5829bbde4d Added per OCR queue transformation models and CRUD views to replace the CONVERTER_OCR_OPTIONS with the new refactored converter transformations systems 2011-07-17 01:32:46 -04:00
Roberto Rosario
0a5dfd6aa9 Plug file descriptor leak 2011-05-19 22:55:57 -04:00
Roberto Rosario
07e9b12e78 flake8 cleanups, unused imports and variables cleanup, changed register_diagnostics to use reverse_lazy instead of reverse 2011-05-06 10:39:54 -04:00
Roberto Rosario
ae35e89549 Unicode updates 2011-05-03 21:11:35 -04:00
Roberto Rosario
1e0d8d1f25 Added docstring description 2011-05-03 20:58:58 -04:00
Roberto Rosario
2a744cefea PEP8, pylint cleanups and removal of relative imports 2011-04-23 02:49:07 -04:00
Roberto Rosario
eaaaa5b645 Added support for the command line program pdftotext from the poppler-utils packages to extract text from PDF documents without doing OCR 2011-04-15 23:59:52 -04:00
Roberto Rosario
6b67cff5d7 Changed the way document page count is parsed from the graphics backend, fixing issue #7 2011-04-08 03:29:48 -04:00
Roberto Rosario
71a3c218f4 PEP8, pylint and django-lint cleanups 2011-04-08 02:09:39 -04:00
Roberto Rosario
d54fd98ec5 Finished adding language specific ocr cleanup code 2011-04-07 12:23:26 -04:00
Roberto Rosario
d1ff305a3f Initial commit for the ocr_cleanup branch 2011-04-07 04:07:59 -04:00
Roberto Rosario
f66c8ec6e2 Fixed error and some warning returned by pylint 2011-04-05 00:04:11 -04:00
Roberto Rosario
e4912a8d4d Close file descriptors to prevent memory leaks 2011-03-07 23:22:53 -04:00
Roberto Rosario
6a9e114acb Set all *.py files permissions to 644 2011-03-07 12:15:25 -04:00
Roberto Rosario
d0bea8ffeb Initial version of the GridFS storage driver 2011-03-04 01:08:20 -04:00
Roberto Rosario
c18cb099c6 Improved tesseract execution handling 2011-02-17 23:31:54 -04:00
Roberto Rosario
77b8a432a2 Added distributed OCR queue support 2011-02-17 04:37:35 -04:00
Roberto Rosario
478fb3502e Changed from python's multiprocessing to celery to handle concurrency 2011-02-17 03:45:30 -04:00
Roberto Rosario
409a52af95 First commit to support ocr subprocess 2011-02-17 01:57:14 -04:00
Roberto Rosario
dfd101c33b Cleanup file after ocr 2011-02-16 20:54:11 -04:00
Roberto Rosario
b1e2f64617 Apply transformation before doing OCR, added unpaper to the OCR pre processing pipe 2011-02-16 03:32:21 -04:00
Roberto Rosario
fbc8bc960a Decoupled page transformation interface, added default transformation support 2011-02-14 02:11:39 -04:00
Roberto Rosario
06d7e5a46a Added multipage document support and document page transformation 2011-02-14 00:18:16 -04:00
Roberto Rosario
d6afcc64bb Changed file permissions 2011-02-09 13:55:01 -04:00
Roberto Rosario
6569faad11 Added OCR capabilities 2011-02-09 02:12:14 -04:00