Ground Truth Introduction
MARS II has processed more than 200,000 citations. Buried somewhere in that volume is the perfect set of examples to use as ground truth data. While a perfect set is not realistic, our goal is to provide as thorough a sample of articles as possible. The MARS system currently categorizes journals into major layout types. Most of the journals in the NLM collection are represented in the ground truth data available here.
What format are the files in?
After identifying a number of journal titles to use for ground truth (GT) data, we needed a common format that could support the community's needs. We chose XML as the format for the GT data because it excels in:
· Intelligence - How well data knows itself
Because XML is growing in popularity and its strengths suit our structured data, it is our format of choice.
How are the files stored?
All the ground truth files are stored in subdirectories that match the layout used by the MARS system. Each journal issue is assigned a unique number, called an MRI, and a directory is created with that MRI as its name. Inside this directory are the individual scanned pages (TIFF files) and the corresponding ground truth files (XML files). We took a sample of one to five pages from specific journal issues to cover the different layout types.
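The directory layout described above can be walked with a short script. The following is a minimal Python sketch, not part of the MARS tooling. It assumes that a scanned page and its ground truth file share the same base name (for example page1.tif and page1.xml); that naming convention, the file extensions, and the function name pair_pages are assumptions made for illustration only.

```python
from pathlib import Path


def pair_pages(mri_dir):
    """Pair each scanned page in an MRI directory with its ground truth file.

    Assumption (not confirmed by the source): a TIFF page and its XML
    ground truth file share the same base name.
    Returns a list of (tiff_path, xml_path_or_None) tuples sorted by base name.
    """
    mri_dir = Path(mri_dir)
    files = [p for p in mri_dir.iterdir() if p.is_file()]

    # Index both file kinds by base name (the name without its extension).
    tiffs = {p.stem: p for p in files if p.suffix.lower() in (".tif", ".tiff")}
    xmls = {p.stem: p for p in files if p.suffix.lower() == ".xml"}

    # A page with no matching ground truth file is paired with None.
    return [(tiffs[stem], xmls.get(stem)) for stem in sorted(tiffs)]
```

Calling pair_pages on an unpacked MRI directory returns one tuple per scanned page, so pages that lack a ground truth file are easy to spot.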
For convenience, we provide links to zip files that contain either a sample of each type or the entire collection for a particular layout type. If you wish to download the entire ground truth collection, use the "download all" link below. To download a specific layout type, follow the links below or use the menu system on the left.
Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
The file TrueViz.DTD contains the document type definition, but many of the fields it lists are not used in our initial version of the GT data. If a field is required but unused, we fill it with the keyword Unknown. If a field is not required, it is simply omitted. You can download the DTD to create your own XML data to be run in Rover.
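Because required but unused fields carry the keyword Unknown, it can be useful to list them before relying on a file. The following is a minimal Python sketch under stated assumptions: the source does not say whether Unknown appears as element text or as an attribute value, so the sketch checks both. The function name find_unknown_fields is hypothetical and nothing here depends on the actual TrueViz element names. Note that Python's xml.etree.ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def find_unknown_fields(xml_text):
    """Report fields whose value is the placeholder keyword "Unknown".

    Assumption: the placeholder may appear as element text or as an
    attribute value, so both are checked.
    Returns a list of (tag, attribute_name_or_None) tuples in document order.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        # Element text that consists only of the placeholder.
        if (elem.text or "").strip() == "Unknown":
            hits.append((elem.tag, None))
        # Attribute values that equal the placeholder.
        for name, value in elem.attrib.items():
            if value == "Unknown":
                hits.append((elem.tag, name))
    return hits
```

To check a file on disk, read its contents and pass the text to the function; an empty result means no placeholder values were found.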