Layout types

The page layouts selected for this ground truth collection represent typical biomedical journals from which we automatically extracted bibliographic fields (e.g., article Title, Authors, Affiliation, Abstract). We have classified the page layouts into nine types. These types are graphically represented in the sidebar and described below.

Note that we have zoned and labeled only titles, authors, affiliations and abstracts; no other data has been zoned or labeled. We hope to provide other zoned data in the future. The extracted data from MARS includes only the fields that were created by our automated processes.

Type A

The article Title, Author(s), Affiliation and Abstract appear in that order and are located in the upper half of the page. The lower portion could include the abstract or the beginning of the article text. Variations include: single column (as shown in sidebar); two single columns (separated by a large white space); single column followed by double column (a large white space separating the single from the double); single, double, single column; and single, double, double column.

Type B

The Title, Author and Abstract are in the upper portion of the page. The Affiliation is located at the bottom.

Type C

The Title, Author and Abstract are in the upper half of the page. The Affiliation is single-columned and located in the left column of double-column text. Variations include a body of text below the double-column Affiliation area that is other data, that is, data that MEDLINE does not record and that is not classified.

Type D

The Title, Author, and Affiliation are in the upper half of the page. The Abstract usually is all or some of the first column. Variations include the Abstract continuing into a portion of column 2.

Type E

The Title, Author, and Affiliation are in the upper half of the page. The Abstract is double columned, but above the body text of the article in most cases.

Type F

The Title and Author are in the upper half of the page. The Affiliation is along the bottom. The Abstract usually is all or some of the first column. Variations include the abstract continuing into a portion of column 2.

Type G

The Title and Author are in the upper half of the page. The Affiliation is along the bottom-left. The Abstract usually is all or some of the first column. Variations include the Abstract continuing into a portion of column 2.

Type H

The Title and Author are in the upper half of the page. The Affiliation is along the bottom. The Abstract is double columned, but above the body text of the article in most cases.

Type Other

This category holds all unusual article layouts encountered. Quite a few layouts (about 25% of the NLM collection) fall into this category and have not yet been categorized. Little internal research has been done in this area, and it would be a good candidate for improvement.
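
The type descriptions above can be summarized as a small lookup table. The following is a minimal sketch, not the method used to build this collection: it assumes each page has already been reduced to two hypothetical attributes, the position of the Affiliation zone and the layout of the Abstract zone, and it ignores the variations listed under each type. The names `LAYOUT_TYPES` and `classify_layout`, and the attribute values, are illustrative assumptions.

```python
from typing import Dict, Tuple

# Simplified encoding of the type descriptions (an assumption, not an
# official schema). Key: (affiliation position, abstract layout).
# In every type, Title and Author are in the upper part of the page.
LAYOUT_TYPES: Dict[Tuple[str, str], str] = {
    ("upper", "upper"): "A",              # all four fields in the upper half
    ("bottom", "upper"): "B",             # affiliation at the bottom
    ("left_column", "upper"): "C",        # affiliation in left column of double-column text
    ("upper", "first_column"): "D",       # abstract is all or some of the first column
    ("upper", "double_column"): "E",      # abstract double columned above the body text
    ("bottom", "first_column"): "F",
    ("bottom_left", "first_column"): "G",
    ("bottom", "double_column"): "H",
}


def classify_layout(affiliation_position: str, abstract_layout: str) -> str:
    """Return the layout type letter, or "Other" for any unlisted combination."""
    return LAYOUT_TYPES.get((affiliation_position, abstract_layout), "Other")


if __name__ == "__main__":
    print(classify_layout("bottom_left", "first_column"))
    print(classify_layout("left_column", "double_column"))
```

Any combination not in the table falls through to "Other", mirroring how this category collects the layouts that do not match Types A through H.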

NOTE: All layout types have been groundtruthed at the zone level, including label tags. Line-, word-, and character-level information is still being worked on.
 

Chart with a distribution of the page layouts in 3,337 journals.

U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services

Visual Definitions

Legend: Type A, Type B, Type C, Type D, Type E, Type F, Type G, Type H, Type Other