ACL Special Workshop 2012
- Rediscovering 50 Years of Discoveries
10 July 2012, Jeju, Republic of Korea

Call for contributions

Contributed task

In addition to the technical program, the workshop will host a contributed-task, in the spirit of a crowd-sourcing activity, for augmenting and improving the current status of the ACL Anthology Reference Corpus (ACL ARC).

This contributed-task includes both the format conversion of the most recent papers from PDF to a rich-text format and the processing, post-editing, correction and segmentation of the available digital versions of very old proceedings.

The goal of the contributed-task is to provide a high-quality version of the textual content of the ACL Anthology as a corpus. Its rich-text XML markup will contain information on section headings, footnotes, table and figure captions, bibliographic references, italicized/emphasized text portions, non-Latin scripts, etc.
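To make the kind of markup described above more concrete, here is a minimal sketch in Python that builds a small section with a heading, an emphasized span and a footnote. The element and attribute names (div, head, p, hi, note, rend, place) are common TEI P5 conventions; whether the workshop's aclarc.tei schema uses exactly these is an assumption, and the function name build_section and the sample text are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def build_section(title, paragraph, emphasized, footnote):
    """Build a TEI-style section element (illustrative only).

    The element names follow general TEI P5 usage; the actual
    aclarc.tei schema may differ (assumption).
    """
    div = ET.Element("div", {"type": "section"})

    # Section heading
    head = ET.SubElement(div, "head")
    head.text = title

    # Running text with an emphasized span and a footnote
    p = ET.SubElement(div, "p")
    p.text = paragraph
    hi = ET.SubElement(p, "hi", {"rend": "italic"})
    hi.text = emphasized
    note = ET.SubElement(p, "note", {"place": "foot", "n": "1"})
    note.text = footnote

    return div


# Hypothetical sample content, not taken from any Anthology paper.
section = build_section(
    "1 Introduction",
    "We evaluate on the ",
    "ACL ARC",
    "See the workshop page.",
)
xml_text = ET.tostring(section, encoding="unicode")
print(xml_text)
```

Keeping headings, emphasis and footnotes as separate elements is what lets downstream tools select, for example, only section headings or only footnote text.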

Besides enabling more accurate text extraction, the rich markup can be an important additional source of information for corpus-based applications such as summarization (which can also draw on figure and table captions and section headings), detecting argumentative zones, analyzing scientific discourse, citation analysis, citation classification, question answering, textual entailment, taxonomy, ontology, information extraction, parsing, coreference resolution, semantic search and many more.

The input to the task consists of two XML formats (raw versions could be made available in addition on request):

  1. paperXML from the ACL Anthology Searchbench provided by DFKI Saarbrücken, approx. 22,500 papers (all papers currently in the Anthology except ROCLING). These were obtained by running a commercial "O"CR program and applying logical markup postprocessing and conversion to XML [Example from Paper N03-1026.pdf, N03-1026-paperxml.xml, paperxml.dtd]
  2. TEI P5 XML generated by PDFExtract. For papers from 2000-2011, an additional high-quality extraction step took place, implemented by IFI, University of Oslo. It applies state-of-the-art word boundary and layout recognition methods directly to the native, logical PDF structure. As no character recognition errors occur, this will form the master format for textual content (except for tables) where available. [Example from Paper N03-1026.pdf, N03-1026-pdfextract.xml]
  3. PDFExtract. Alternatively, you can request access to use and revise the PDFExtract source code to make enhancements to the PDF processing directly. If you are interested in this option, an individual branch can be created for you in the SVN repository hosted by the University of Oslo. In that case, you must request this option when registering as a participant in the contributed task.

Because neither version is perfect, a large, initial part of the contributed-task will consist of automatically adding missing markup or correcting wrong markup, using information from "O"CR where necessary (e.g. for tables).

Hence, for most papers from 2000-2011 (currently approx. 70% of the papers), the contributed-task can make use of both representations simultaneously. The other, older papers will be converted to a compatible TEI P5 XML subset.

The target format will be an enriched TEI P5 XML instance. [Example from Paper N03-1026.pdf, N03-1026-gold.xml, aclarc.tei.dtd, aclarc.tei.xsd, aclarc.tei_doc.html]

Call for Contributions

The idea is to provide code, and possibly to train statistical models, that perform the tasks automatically. Manual correction should be used only to create training data.

Contributions are invited for the following tasks:

  • Footnotes. Identify footnotes, assign footnote numbers and footnote text and generate markup for them in TEI style [See description]
  • Tables. Identify table references in running text, link them to their captions (currently also occurring in running text) and transform and insert HTML tables from "O"CR to fill missing table content in PDFExtract output, or implement this for PDFExtract. [See description]
  • Bibliographic Markup. Identify and link bibliographic references with ParsCit [See description]
  • De-hyphenation. Improve de-hyphenation [See description]
  • Garbage removal. Remove garbage such as leftovers from figures (occurs both in PDFExtract and "O"CR) [See description]
  • TEI P5 Markup. Generate poor man's TEI markup from "O"CR output/dfki-paperxml of scanned papers (because PDFExtract output is not available for these) [See description]
  • Sentence Splitting. Add sentence splitting and tokenization markup. Having a standard to which everyone can refer later is important. It can be based on JTok. [See description]
  • Math Formulae. Insert/link formula markup from an external tool (e.g. formula OCR) [See description]
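
As an illustration of the de-hyphenation task listed above, the following is a minimal sketch in Python of a lexicon-guided approach: a word split by a hyphen at a line break is rejoined, and the hyphen is kept only if the hyphenated form is a known word. This is one simple baseline, not the method used by PDFExtract or the "O"CR pipeline; the names LEXICON and dehyphenate and the tiny word list are hypothetical.

```python
import re

# Hypothetical toy lexicon; a real system would use a large word list
# or corpus frequencies instead (assumption).
LEXICON = {"recognition", "markup", "well-known", "state-of-the-art"}

# A word fragment, a hyphen at the end of a line, optional indentation,
# and the continuation of the word on the next line.
_HYPHEN_BREAK = re.compile(r"(\w[\w-]*)-\n\s*(\w+)")


def dehyphenate(text, lexicon=LEXICON):
    """Rejoin words that were split by a hyphen at a line break."""

    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        if joined.lower() in lexicon:
            return joined
        if hyphenated.lower() in lexicon:
            # The hyphen belongs to the word itself, so keep it.
            return hyphenated
        # Unknown word: assume the hyphen came from line breaking.
        return joined

    return _HYPHEN_BREAK.sub(join, text)


print(dehyphenate("layout recog-\nnition methods"))
print(dehyphenate("a well-\nknown corpus"))
```

The fallback for unknown words is a design choice: dropping the hyphen is usually right for ordinary running text, but a statistical model trained on manually corrected examples could make this decision more reliably.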

Interested contributors, please register.