Steven Miles

JAN 2011
Digidx - Making Digitised Content Discoverable
Digidx is the next step in taking some of my R&D and prototyped ideas for using open source software to post-process digitised content into a more discoverable and accessible form. A lot of institutions digitise materials, OCR them, and produce searchable PDFs that are usually indexed by some sort of CMS or indexing engine. But this is where it stops. What if it didn't have to stop there? Why can't we turn this into a more organic, crowd-sourced solution that allows those who use these resources to also continually improve them?
Much of what I'm working towards is not new, and has been done in various forms and in singular projects. But what I feel is new is the combining of a lot of existing technologies into a single easy-to-use workflow that allows almost anybody to make their digitised content more discoverable.
So what does Digidx currently do?
- Ingest digitised images from various sources.
- Correct pages for rotation and crop
- Enhance each page for the purpose of OCR
- OCR each page, taking note of the position of each line on the page.
- Use Natural Language Processing or Named Entity Recognition to extract common elements
- Produce a Searchable PDF
- Produce web accessible copies of each page.
- Create indexable metadata files
- Add metadata files to an indexer (Zebra or Solr)
- Provide a discovery interface to search and explore multiple resources
- Provide an online document viewer with the ability to correct OCR errors
- Corrections update indexers and searchable PDFs
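
To make the last few steps more concrete, here is a rough Python sketch of how per-line OCR positions, a user correction, and a re-indexable metadata record could fit together. This is not Digidx's actual code: the names `OcrLine`, `Page`, `correct_line` and `to_index_document`, and the field layout of the record, are all hypothetical and only illustrate the idea. Pushing the record into Zebra or Solr is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OcrLine:
    """One recognised line of text and where it sits on the page image."""
    text: str
    x: int        # left edge in pixels (assumed coordinate system)
    y: int        # top edge in pixels
    width: int
    height: int


@dataclass
class Page:
    """A single digitised page with its OCR lines in reading order."""
    number: int
    lines: List[OcrLine] = field(default_factory=list)

    def full_text(self) -> str:
        return "\n".join(line.text for line in self.lines)


def correct_line(page: Page, line_index: int, new_text: str) -> str:
    """Apply a user's OCR correction to one line and return the old text."""
    old_text = page.lines[line_index].text
    page.lines[line_index].text = new_text
    return old_text


def to_index_document(doc_id: str, page: Page) -> Dict[str, object]:
    """Build a plain dict that could be handed to an indexer.

    The field names here are illustrative, not a real Zebra or Solr schema.
    """
    return {
        "id": f"{doc_id}-p{page.number}",
        "page": page.number,
        "text": page.full_text(),
        "lines": [
            {"text": l.text, "box": [l.x, l.y, l.width, l.height]}
            for l in page.lines
        ],
    }


# Example: a page with one OCR error that a reader fixes in the viewer.
page = Page(number=1, lines=[
    OcrLine("Making Digitised Contcnt", x=120, y=80, width=900, height=40),
    OcrLine("Discoverable", x=120, y=130, width=420, height=40),
])
correct_line(page, 0, "Making Digitised Content")
record = to_index_document("digidx-demo", page)
print(record["text"])
```

Keeping the position of each line alongside its text is what lets a viewer highlight the line being corrected, and rebuilding the whole record after a correction means the index and any regenerated searchable PDF can be driven from the same data.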