<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.tei-c.org/ns/1.0 aclarc.tei.xsd" xml:lang="en">
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title>Task-oriented Evaluation of Syntactic Parsers and Their Representations</title>
				<author>
					Yusuke Miyao† Rune Sætre† Kenji Sagae† Takuya Matsuzaki† Jun’ichi Tsujii†‡∗
					†Department of Computer Science, University of Tokyo, Japan
					‡School of Computer Science, University of Manchester, UK
					∗National Center for Text Mining, UK
					{yusuke,rune.saetre,sagae,matuzaki,tsujii}@is.s.u-tokyo.ac.jp
				</author>
			</titleStmt>
			<publicationStmt>
				<publisher>Association for Computational Linguistics</publisher>
				<pubPlace>Columbus, Ohio, USA</pubPlace>
				<date>June 2008</date>
			</publicationStmt>
			<sourceDesc>
				<p>PDF from the ACL Anthology</p>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<projectDesc>
				<p>Evaluation exercise for the application of TEI in the WeSearch project, November 2010</p>
			</projectDesc>
			<samplingDecl>
				<p>Entire paper taken from the ACL Anthology.</p>
			</samplingDecl>
			<editorialDecl>
				<hyphenation eol="all">
					<p>End-of-line hyphenation removed.</p>
					<p>Authors do not consistently hyphenate 're-train'. These have been changed to 'retrain' (Googlefight indicates that this form is preferred).</p>
				</hyphenation>
			</editorialDecl>
			<refsDecl>
				<cRefPattern matchPattern="(BI[0-9]+)" replacementPattern="#xpath(/TEI/text/back/div/listBibl/bibl[@xml:id='$1'])" />
			</refsDecl>
			<refsDecl>
				<cRefPattern matchPattern="((FI|TA)[0-9]+)" replacementPattern="#xpath(//figure[@xml:id='$1'])" />
			</refsDecl>
			<refsDecl>
				<cRefPattern matchPattern="(SE[0-9]+)" replacementPattern="#xpath(/TEI/text/body/div[@xml:id='$1'])" />
			</refsDecl>
			<refsDecl>
				<cRefPattern matchPattern="(SE[0-9]+\.[0-9]+)" replacementPattern="#xpath(/TEI/text/body/div/div[@xml:id='$1'])" />
			</refsDecl>
			<refsDecl>
				<cRefPattern matchPattern="(SE[0-9]+\.[0-9]+\.[0-9]+)" replacementPattern="#xpath(/TEI/text/body/div/div/div[@xml:id='$1'])" />
			</refsDecl>
			<tagsDecl>
				<rendition xml:id="italic" scheme="css">font-style: italic</rendition>
				<rendition xml:id="small" scheme="css">font-size: small</rendition>
				<rendition xml:id="monospace" scheme="css">font-family: monospace</rendition>
				<rendition xml:id="bold" scheme="css">font-weight: bold</rendition>
			</tagsDecl>
		</encodingDesc>
	</teiHeader>
	<text>
		<front>
			<div type="abs">
				<head>Abstract</head>
				<p>
					This paper presents a comparative evalu<del type="lb">-</del>
					ation of several state-of-the-art English parsers
					based on different frameworks. Our approach
					is to measure the impact of each parser when it
					is used as a component of an information ex<del type="lb">-</del>
					traction system that performs protein-protein
					interaction (PPI) identification in biomedical
					papers. We evaluate eight parsers (based on
					dependency parsing, phrase structure parsing,
					or deep parsing) using five different parse rep<del type="lb">-</del>
					resentations. We run a PPI system with several
					combinations of parser and parse representa<del type="lb">-</del>
					tion, and examine their impact on PPI identi<del type="lb">-</del>
					fication accuracy. Our experiments show that
					the levels of accuracy obtained with these dif<del type="lb">-</del>
					ferent parsers are similar, but that accuracy
					improvements vary when the parsers are re<del type="lb">-</del>
					trained with domain-specific data.</p>
			</div>
		</front>
		<body>
			<div xml:id="SE1" n="1" type="section">
				<head>Introduction</head>
				<p>
					Parsing technologies have improved considerably in
					the past few years, and high-performance syntactic
					parsers are no longer limited to PCFG-based frame<del type="lb">-</del>
					works (<ref target="#BI6">Charniak, 2000</ref>; <ref target="#BI22">Klein and Manning, 2003</ref>;
					<ref target="#BI5">Charniak and Johnson, 2005</ref>; <ref target="#BI33">Petrov and Klein,
					2007</ref>), but also include dependency parsers (<ref target="#BI26">Mc<del type="lb">-</del>
					Donald and Pereira, 2006</ref>; <ref target="#BI32">Nivre and Nilsson, 2005</ref>;
					<ref target="#BI40">Sagae and Tsujii, 2007</ref>) and deep parsers (<ref target="#BI19">Kaplan
					et al., 2004</ref>; <ref target="#BI7">Clark and Curran, 2004</ref>; <ref target="#BI28">Miyao and
					Tsujii, 2008</ref>). However, efforts to perform extensive
					comparisons of syntactic parsers based on different
					frameworks have been limited. The most popular
					method for parser comparison involves the direct
					measurement of the parser output accuracy in terms
					of metrics such as bracketing precision and recall, or<cb />
					dependency accuracy. This assumes the existence of
					a gold-standard test corpus, such as the Penn Tree<del type="lb">-</del>
					bank (<ref target="#BI24">Marcus et al., 1994</ref>). It is difficult to apply
					this method to compare parsers based on different
					frameworks, because parse representations are often
					framework-specific and differ from parser to parser
					(<ref target="#BI38">Ringger et al., 2004</ref>). The lack of such comparisons
					is a serious obstacle for NLP researchers in choosing
					an appropriate parser for their purposes.
				</p>
				<p>
					In this paper, we present a comparative eval<del type="lb">-</del>
					uation of syntactic parsers and their output represen<del type="lb">-</del>
					tations based on different frameworks: dependency
					parsing, phrase structure parsing, and deep pars<del type="lb">-</del>
					ing. Our approach to parser evaluation is to mea<del type="lb">-</del>
					sure accuracy improvement in the task of identify<del type="lb">-</del>
					ing protein-protein interaction (PPI) information in
					biomedical papers, by incorporating the output of
					different parsers as statistical features in a machine
					learning classifier (<ref target="#BI44">Yakushiji et al., 2005</ref>; <ref target="#BI20">Katrenko
					and Adriaans, 2006</ref>; <ref target="#BI14">Erkan et al., 2007</ref>; <ref target="#BI39">Sætre et al.,
					2007</ref>). PPI identification is a reasonable task for
					parser evaluation, because it is a typical information
					extraction (IE) application, and because recent stud<del type="lb">-</del>
					ies have shown the effectiveness of syntactic parsing
					in this task. Since our evaluation method is applica<del type="lb">-</del>
					ble to any parser output, and is grounded in a real
					application, it allows for a fair comparison of syn<del type="lb">-</del>
					tactic parsers based on different frameworks.
				</p>
				<p>
					Parser evaluation in PPI extraction also illu<del type="lb">-</del>
					minates domain portability. Most state-of-the-art
					parsers for English were trained with the Wall Street
					Journal (WSJ) portion of the Penn Treebank, and
					high accuracy has been reported for WSJ text; how<del type="lb">-</del>
					ever, these parsers rely on lexical information to at<del type="lb">-</del>
					tain high accuracy, and they have been criticized
					for possibly overfitting to WSJ text (<ref target="#BI15">Gildea, 2001</ref>;
					<pb n="46" />
					<ref target="#BI22">Klein and Manning, 2003</ref>). Another issue for dis<del type="lb">-</del>
					cussion is the portability of training methods. When
					training data in the target domain is available, as
					is the case with the GENIA Treebank (<ref target="#BI21">Kim et al.,
					2003</ref>) for biomedical papers, a parser can be re<del type="lb">-</del>trained
					to adapt to the target domain, and larger ac<del type="lb">-</del>
					curacy improvements are expected, if the training
					method is sufficiently general. We will examine
					these two aspects of domain portability by compar<del type="lb">-</del>
					ing the original parsers with the retrained parsers.
				</p>
			</div>
			<div xml:id="SE2" n="2" type="section">
				<head>Syntactic Parsers and Their Representations</head>
				<p>
					This paper focuses on eight representative parsers
					that are classified into three parsing frameworks:
					<hi rend="italic">dependency parsing</hi>, <hi rend="italic">phrase structure parsing</hi>,
					and <hi rend="italic">deep parsing</hi>. In general, our evaluation methodol<del type="lb">-</del>
					ogy can be applied to English parsers based on any
					framework; however, in this paper, we chose parsers
					that were originally developed and trained with the
					Penn Treebank or its variants, since such parsers can
					be re<del type="ed">-</del>trained with GENIA, thus allowing us to
					investigate the effect of domain adaptation.
				</p>
				<div xml:id="SE2.1" n="2.1" type="subsection">
					<head>Dependency parsing</head>
					<p>
						Because the shared tasks of CoNLL-2006 and
						CoNLL-2007 focused on data-driven dependency
						parsing, it has recently been extensively studied in
						parsing research. The aim of dependency pars<del type="lb">-</del>
						ing is to compute a tree structure of a sentence
						where nodes are words, and edges re<del type="lb">-</del>
						present the relations among words. <ref target="#FI1">Figure 1</ref> shows a dependency
						tree for the sentence “IL-8 recognizes and activates
						CXCR1.” An advantage of dependency parsing is
						that dependency trees are a reasonable approxima<del type="lb">-</del>
						tion of the semantics of sentences, and are readily
						usable in NLP applications. Furthermore, the effi<del type="lb">-</del>
						ciency of popular approaches to dependency pars<del type="lb">-</del>
						ing compares favorably with that of phrase struc<del type="lb">-</del>
						ture parsing or deep parsing. While a number of ap<del type="lb">-</del>
						proaches have been proposed for dependency pars<del type="lb">-</del>
						ing, this paper focuses on two typical methods.
					</p>
					<figure xml:id="FI1">
						<graphic url="P08-1006/FI1.png" />
						<head>Figure 1: CoNLL-X dependency tree</head>
					</figure>
					<list type="gloss">
						<label>
							<hi rend="small">MST</hi>
						</label>
						<item>
							<ref target="#BI26">McDonald and Pereira (2006)</ref>’s dependency
							parser,<note place="foot" n="1"><hi rend="monospace">http://sourceforge.net/projects/mstparser</hi></note> based on the Eisner algorithm for projective
							dependency parsing (<ref target="#BI13">Eisner, 1996</ref>) with the second-
							order factorization.
						</item>
						<cb />
						<label>
							<hi rend="small">KSDEP</hi>
						</label>
						<item>
							<ref target="#BI40">Sagae and Tsujii (2007)</ref>’s dependency
							parser,<note place="foot" n="2"><hi rend="monospace">http://www.cs.cmu.edu/~sagae/parser/</hi></note> based on a probabilistic shift-reduce al<del type="lb">-</del>
							gorithm extended by the pseudo-projective parsing
							technique (<ref target="#BI32">Nivre and Nilsson, 2005</ref>).
						</item>
					</list>
				</div>
				<div xml:id="SE2.2" n="2.2" type="subsection">
					<head>Phrase structure parsing</head>
					<p>
						Owing largely to the Penn Treebank, the mainstream
						of data-driven parsing research has been dedicated
						to phrase structure parsing. These parsers output
						Penn Treebank-style phrase structure trees, although
						function tags and empty categories are stripped off
						(<ref target="#FI2">Figure 2</ref>). While most of the state-of-the-art parsers
						are based on probabilistic CFGs, the parameteriza<del type="lb">-</del>
						tion of the probabilistic model of each parser varies.
						In this work, we chose the following four parsers.
					</p>
					<figure xml:id="FI2">
						<graphic url="P08-1006/FI2.png" />
						<head>Figure 2: Penn Treebank-style phrase structure tree</head>
					</figure>
					<list type="gloss">
						<label>
							<hi rend="small">NO-RERANK</hi>
						</label>
						<item>
							<ref target="#BI6">Charniak (2000)</ref>’s parser, based on a
							lexicalized PCFG model of phrase structure trees.<note place="foot" n="3"><hi rend="monospace">http://bllip.cs.brown.edu/resources.shtml</hi></note>
							The probabilities of CFG rules are parameterized on
							carefully hand-tuned extensive information such as
							lexical heads and symbols of ancestor/sibling nodes.
						</item>
						<label>
							<hi rend="small">RERANK</hi>
						</label>
						<item>
							<ref target="#BI5">Charniak and Johnson (2005)</ref>’s rerank<del type="lb">-</del>
							ing parser. The reranker of this parser receives <formula>n</formula>-
							best<note place="foot" n="4">We set <formula>n=50</formula> in this paper.</note> parse results from <hi rend="small">NO-RERANK</hi>, and selects
							the most likely result by using a maximum entropy
							model with manually engineered features.</item>
						<label>
							<hi rend="small">BERKELEY</hi>
						</label>
						<item>
							Berkeley’s parser (<ref target="#BI33">Petrov and Klein,
							2007</ref>).<note place="foot" n="5"><hi rend="monospace">http://nlp.cs.berkeley.edu/Main.html#Parsing</hi></note> The parameterization of this parser is op<del type="lb">-</del><pb n="47" />
							timized automatically by assigning latent variables
							to each nonterminal node and estimating the param<del type="lb">-</del>
							eters of the latent variables by the EM algorithm
							(<ref target="#BI25">Matsuzaki et al., 2005</ref>).
						</item>
						<label>
							<hi rend="small">STANFORD</hi>
						</label>
						<item>
							Stanford’s unlexicalized parser (<ref target="#BI22">Klein
							and Manning, 2003</ref>).<note place="foot" n="6"><hi rend="monospace">http://nlp.stanford.edu/software/lex-parser.shtml</hi></note> Unlike <hi rend="small">NO-RERANK</hi>, proba<del type="lb">-</del>
							bilities are not parameterized on lexical heads.
						</item>
					</list>
				</div>
				<div xml:id="SE2.3" n="2.3" type="subsection">
					<head>Deep parsing</head>
					<p>
						Recent research developments have allowed for ef<del type="lb">-</del>
						ficient and robust deep parsing of real-world texts
						(<ref target="#BI19">Kaplan et al., 2004</ref>; <ref target="#BI7">Clark and Curran, 2004</ref>; <ref target="#BI28">Miyao
						and Tsujii, 2008</ref>). While deep parsers compute
						theory-specific syntactic/semantic structures, pred<del type="lb">-</del>
						icate argument structures (PAS) are often used in
						parser evaluation and applications. PAS is a graph
						structure that represents syntactic/semantic relations
						among words (<ref target="#FI3">Figure 3</ref>). The concept is therefore
						similar to CoNLL dependencies, though PAS ex<del type="lb">-</del>
						presses deeper relations, and may include reentrant
						structures. In this work, we chose the two versions
						of the Enju parser (<ref target="#BI28">Miyao and Tsujii, 2008</ref>).
					</p>
					<figure xml:id="FI3">
						<graphic url="P08-1006/FI3.png" />
						<head>Figure 3: Predicate argument structure</head>
					</figure>
					<list type="gloss">
						<label>
							<hi rend="small">ENJU</hi>
						</label>
						<item>
							The HPSG parser that consists of an HPSG
							grammar extracted from the Penn Treebank, and
							a maximum entropy model trained with an HPSG
							treebank derived from the Penn Treebank.<note place="foot" n="7"><hi rend="monospace">http://www-tsujii.is.s.u-tokyo.ac.jp/enju/</hi></note>
						</item>
						<label>
							<hi rend="small">ENJU-GENIA</hi>
						</label>
						<item>
							The HPSG parser adapted to
							biomedical texts, by the method of <ref target="#BI17">Hara et al.
							(2007)</ref>. Because this parser is trained with both
							WSJ and GENIA, we compare it with parsers that are
							retrained with GENIA (see <ref target="#SE3.3">section 3.3</ref>).
						</item>
					</list>
				</div>
			</div>
			<div xml:id="SE3" n="3" type="section">
				<head>Evaluation Methodology</head>
				<p>
					In our approach to parser evaluation, we measure
					the accuracy of a PPI extraction system, in which<cb />
					the parser output is embedded as statistical features
					of a machine learning classifier. We run a classi<del type="lb">-</del>
					fier with features of every possible combination of a
					parser and a parse representation, by applying con<del type="lb">-</del>
					versions between representations when necessary.
					We also measure the accuracy improvements ob<del type="lb">-</del>
					tained by parser retraining with GENIA, to examine
					the domain portability, and to evaluate the effective<del type="lb">-</del>
					ness of domain adaptation.</p>
				<div xml:id="SE3.1" n="3.1" type="subsection">
					<head>PPI extraction</head>
					<p>
						PPI extraction is an NLP task to identify protein
						pairs that are mentioned as interacting in biomedical
						papers. Because the number of biomedical papers is
						growing rapidly, it is impossible for biomedical re<del type="lb">-</del>
						searchers to read all papers relevant to their research;
						thus, there is an emerging need for reliable IE tech<del type="lb">-</del>
						nologies, such as PPI identification.
					</p>
					<p>
						<ref target="#FI4">Figure 4</ref> shows two sentences that include pro<del type="lb">-</del>
						tein names: the former sentence mentions a protein
						interaction, while the latter does not. Given a pro<del type="lb">-</del>
						tein pair, PPI extraction is a task of binary classi<del type="lb">-</del>
						fication; for example, &lt;IL-8, CXCR1&gt; is a positive
						example, and &lt;RBP, TTR&gt; is a negative example.
						Recent studies on PPI extraction demonstrated that
						dependency relations between target proteins are ef<del type="lb">-</del>
						fective features for machine learning classifiers (<ref target="#BI20">Ka<del type="lb">-</del>
						trenko and Adriaans, 2006</ref>; <ref target="#BI14">Erkan et al., 2007</ref>; <ref target="#BI39">Sætre
						et al., 2007</ref>). For the protein pair <hi rend="bold">IL-8</hi> and <hi rend="bold">CXCR1</hi>
						in <ref target="#FI4">Figure 4</ref>, a dependency parser outputs a depen<del type="lb">-</del>
						dency tree shown in <ref target="#FI1">Figure 1</ref>. From this dependency
						tree, we can extract a dependency path shown in <ref target="#FI5">Fig<del type="lb">-</del>
						ure 5</ref>, which appears to be a strong clue
						that these proteins are mentioned as interacting.
					</p>
					<figure xml:id="FI4">
						<graphic url="P08-1006/FI4.png" />
						<head>Figure 4: Sentences including protein names</head>
					</figure>
					<figure xml:id="FI5">
						<graphic url="P08-1006/FI5.png" />
						<head>Figure 5: Dependency path</head>
					</figure>
					<pb n="48" />
					<p>
						We follow the PPI extraction method of <ref target="#BI39">Sætre et
						al. (2007)</ref>, which is based on SVMs with SubSet
						Tree Kernels (<ref target="#BI10">Collins and Duffy, 2002</ref>; <ref target="#BI30">Moschitti,
						2006</ref>), while using different parsers and parse rep<del type="lb">-</del>
						resentations. Two types of features are incorporated
						in the classifier. The first is bag-of-words features,
						which are regarded as a strong baseline for IE sys<del type="lb">-</del>
						tems. Lemmas of words before, between and after
						the pair of target proteins are included, and the linear
						kernel is used for these features. These features are
						commonly included in all of the models. Filtering
						by a stop-word list is not applied because this setting
						made the scores higher than <ref target="#BI39">Sætre et al. (2007)</ref>’s set<del type="lb">-</del>
						ting. The other type of feature is syntactic features.
						For dependency-based parse representations, a de<del type="lb">-</del>
						pendency path is encoded as a flat tree as depicted in
						<ref target="#FI6">Figure 6</ref> (prefix “r” denotes reverse relations). Be<del type="lb">-</del>
						cause a tree kernel measures the similarity of trees
						by counting common subtrees, it is expected that the
						system finds effective subsequences of dependency
						paths. For the <hi rend="small">PTB</hi> representation, we directly en<del type="lb">-</del>
						code phrase structure trees.
					</p>
					<figure xml:id="FI6">
						<graphic url="P08-1006/FI6.png" />
						<head>Figure 6: Tree representation of a dependency path</head>
					</figure>
				</div>
				<div xml:id="SE3.2" n="3.2" type="subsection">
					<head>Conversion of parse representations</head>
					<p>
						It is widely believed that the choice of representa<del type="lb">-</del>
						tion format for parser output may greatly affect the
						performance of applications, although this has not
						been extensively investigated. We should therefore
						evaluate the parser performance in multiple parse
						representations. In this paper, we create multiple
						parse representations by converting each parser’s de<del type="lb">-</del>
						fault output into other representations when possi<del type="lb">-</del>
						ble. This experiment can also be considered to be
						a comparative evaluation of parse representations,
						thus providing an indication for selecting an appro<del type="lb">-</del>
						priate parse representation for similar IE tasks.
					</p>
					<p>
						<ref target="#FI7">Figure 7</ref> shows our scheme for representation
						conversion. This paper focuses on five representa<del type="lb">-</del>
						tions as described below.
					</p>
					<figure xml:id="FI7">
						<graphic url="P08-1006/FI7.png" />
						<head>Figure 7: Conversion of parse representations</head>
					</figure>
					<list type="gloss">
						<label>
							<hi rend="small">CoNLL</hi>
						</label>
						<item>
							The dependency tree format used in the
							2006 and 2007 CoNLL shared tasks on dependency
							parsing. This is a representation format supported by
							several data-driven dependency parsers. This repre<del type="lb">-</del><cb />
							sentation is also obtained from Penn Treebank-style
							trees by applying constituent-to-dependency conver<del type="lb">-</del>
							sion<note place="foot" n="8"><hi rend="monospace">http://nlp.cs.lth.se/pennconverter/</hi></note> (<ref target="#BI18">Johansson and Nugues, 2007</ref>). It should be
							noted, however, that this conversion cannot work
							perfectly with automatic parsing, because the con<del type="lb">-</del>
							version program relies on function tags and empty
							categories of the original Penn Treebank.
						</item>
						<label>
							<hi rend="small">PTB</hi>
						</label>
						<item>
							Penn Treebank-style phrase structure trees
							without function tags and empty nodes. This is the
							default output format for phrase structure parsers.
							We also create this representation by converting
							<hi rend="small">ENJU</hi>’s output by tree structure matching, although
							this conversion is not perfect because forms of <hi rend="small">PTB</hi>
							and <hi rend="small">ENJU</hi> output are not necessarily compatible.
						</item>
						<label>
							<hi rend="small">HD</hi>
						</label>
						<item>
							Dependency trees of syntactic heads (<ref target="#FI8">Fig<del type="lb">-</del>
							ure 8</ref>). This representation is obtained by convert<del type="lb">-</del>
							ing <hi rend="small">PTB</hi> trees. We first determine lexical heads of
							nonterminal nodes by using Bikel’s implementation
							of Collins’ head detection algorithm<note place="foot" n="9"><hi rend="monospace">http://www.cis.upenn.edu/~dbikel/software.html</hi></note> (<ref target="#BI1">Bikel, 2004</ref>;
							<ref target="#BI11">Collins, 1997</ref>). We then convert lexicalized trees
							into dependencies between lexical heads.
						</item>
						<label>
							<hi rend="small">SD</hi>
						</label>
						<item>
							The Stanford dependency format (<ref target="#FI9">Figure 9</ref>).
							This format was originally proposed for extracting
							dependency relations useful for practical applica<del type="lb">-</del>
							tions (<ref target="#BI12">de Marneffe et al., 2006</ref>). A program to con<del type="lb">-</del>
							vert <hi rend="small">PTB</hi> is attached to the Stanford parser. Although
							the concept looks similar to <hi rend="small">CoNLL</hi>, this representa<del type="lb">-</del><pb n="49" />
							tion does not necessarily form a tree structure, and is
							designed to express more fine-grained relations such
							as apposition. Research groups for biomedical NLP
							recently adopted this representation for corpus anno<del type="lb">-</del>
							tation (<ref target="#BI35">Pyysalo et al., 2007a</ref>) and parser evaluation
							(<ref target="#BI9">Clegg and Shepherd, 2007</ref>; <ref target="#BI36">Pyysalo et al., 2007b</ref>).
						</item>
						<label>
							<hi rend="small">PAS</hi>
						</label>
						<item>
							Predicate-argument structures. This is the de<del type="lb">-</del>
							fault output format for <hi rend="small">ENJU</hi> and <hi rend="small">ENJU-GENIA</hi>.
						</item>
					</list>
					<figure xml:id="FI8">
						<graphic url="P08-1006/FI8.png" />
						<head>Figure 8: Head dependencies</head>
					</figure>
					<figure xml:id="FI9">
						<graphic url="P08-1006/FI9.png" />
						<head>Figure 9: Stanford dependencies</head>
					</figure>
					<p>
						Although only <hi rend="small">CoNLL</hi> is available for depen<del type="lb">-</del>
						dency parsers, we can create four representations for
						the phrase structure parsers, and five for the deep
						parsers. Dotted arrows in <ref target="#FI7">Figure 7</ref> indicate imper<del type="lb">-</del>
						fect conversion, in which the conversion inherently
						introduces errors, and may decrease the accuracy.
						We should therefore exercise caution when comparing
						the results obtained by imperfect conversion. We
						also measure the accuracy obtained by the ensem<del type="lb">-</del>
						ble of two parsers/representations. This experiment
						indicates the differences and overlaps of information
						conveyed by a parser or a parse representation.
					</p>
				</div>
				<div xml:id="SE3.3" n="3.3" type="subsection">
					<head>Domain portability and parser retraining</head>
					<p>
						Since the domain of our target text is different from
						WSJ, our experiments also highlight the domain
						portability of parsers. We run two versions of each
						parser in order to investigate the two types of domain
						portability. First, we run the original parsers trained
						with WSJ<note place="foot" n="10">Some of the parser packages include parsing models trained with extended data, but we used the models trained with WSJ sections 2-21 of the Penn Treebank.</note> (39832 sentences). The results in this
						setting indicate the domain portability of the original
						parsers. Next, we run parsers re<del type="ed">-</del>trained with GE<del type="lb">-</del>
						NIA<note place="foot" n="11">The domains of GENIA and AImed are not exactly the same, because they were collected independently.</note> (8127 sentences), which is a Penn Treebank-
						style treebank of biomedical paper abstracts. Accu<del type="lb">-</del>racy
						improvements in this setting indicate the pos<del type="lb">-</del>
						sibility of domain adaptation, and the portability of
						the training methods of the parsers. Since the parsers
						listed in <ref target="#SE2">Section 2</ref> have programs for training<cb />
						with a Penn Treebank-style treebank, we use those
						programs as-is. Default parameter settings are used
						for this parser re<del type="ed">-</del>training.
					</p>
					<p>
						In preliminary experiments, we found that de<del type="lb">-</del>
						pendency parsers attain higher dependency accuracy
						when trained only with GENIA. We therefore only
						input GENIA as the training data for the retraining
						of dependency parsers. For the other parsers, we in<del type="lb">-</del>
						put the concatenation of WSJ and GENIA for the
						retraining, while the reranker of <hi rend="small">RERANK</hi> was not re<del type="lb">-</del>
						trained due to its cost. Since the parsers other than
						<hi rend="small">NO-RERANK</hi> and <hi rend="small">RERANK</hi> require an external POS
						tagger, a WSJ-trained POS tagger is used with WSJ-
						trained parsers, and <hi rend="monospace">geniatagger</hi> (<ref target="#BI43">Tsuruoka et al.,
						2005</ref>) is used with GENIA-retrained parsers.
					</p>
				</div>
			</div>
			<div xml:id="SE4" n="4" type="section">
				<head>Experiments</head>
				<div xml:id="SE4.1" n="4.1" type="subsection">
					<head>Experiment settings</head>
					<p>
						In the following experiments, we used AImed
						(<ref target="#BI3">Bunescu and Mooney, 2004</ref>), which is a popular
						corpus for the evaluation of PPI extraction systems.
						The corpus consists of 225 biomedical paper ab<del type="lb">-</del>
						stracts (1970 sentences), which are sentence-split,
						tokenized, and annotated with proteins and PPIs.
						We use gold protein annotations given in the cor<del type="lb">-</del>
						pus. Multi-word protein names are concatenated and treated as single words.
						The accuracy is mea<del type="lb">-</del>
						sured by abstract-wise 10-fold cross validation and
						the one-answer-per-occurrence criterion (<ref target="#BI16">Giuliano
						et al., 2006</ref>). A threshold for SVMs is moved to
						adjust the balance of precision and recall, and the
						maximum f-scores are reported for each setting.
					</p>
				</div>
				<div xml:id="SE4.2" n="4.2" type="subsection">
					<head>Comparison of accuracy improvements</head>
					<p>
						<ref target="#TA1">Tables 1</ref> and <ref target="#TA2">2</ref> show the accuracy obtained by using
						the output of each parser in each parse representa<del type="lb">-</del>
						tion. The row “baseline” indicates the accuracy ob<del type="lb">-</del>
						tained with bag-of-words features. <ref target="#TA3">Table 3</ref> shows
						the time for parsing the entire AImed corpus, and
						<ref target="#TA4">Table 4</ref> shows the time required for 10-fold cross
						validation with GENIA-retrained parsers.
					</p>
					<figure xml:id="TA1">
						<head>Table 1: Accuracy on the PPI task with WSJ-trained parsers (precision/recall/f-score)</head>
					</figure>
					<figure xml:id="TA2">
						<head>Table 2: Accuracy on the PPI task with GENIA-retrained parsers (precision/recall/f-score)</head>
					</figure>
					<figure xml:id="TA3">
						<head>Table 3: Parsing time (sec.)</head>
					</figure>
					<figure xml:id="TA4">
						<head>Table 4: Evaluation time (sec.)</head>
					</figure>
					<p>
						When using the original WSJ-trained parsers (<ref target="#TA1">Ta<del type="lb">-</del>
						ble 1</ref>), all parsers achieved almost the same level
						of accuracy — a significantly better result than the
						baseline. To the best of our knowledge, this is
						the first result that proves that dependency parsing,
						phrase structure parsing, and deep parsing perform<pb n="50" />
						equally well in a real application. Among these
						parsers, <hi rend="small">RERANK</hi> performed slightly better than the
						other parsers, although the difference in the f-score
						is small and its parsing cost is much higher.
					</p>
					<p>
						When the parsers are retrained with GENIA (<ref target="#TA2">Ta<del type="lb">-</del>
						ble 2</ref>), the accuracy increases significantly, demon<del type="lb">-</del>
						strating that the WSJ-trained parsers are not suffi<del type="lb">-</del>
						ciently domain-independent, and that domain adap<del type="lb">-</del>
						tation is effective. It is an important observation that
						the improvements by domain adaptation are larger
						than the differences among the parsers in the pre<del type="lb">-</del>
						vious experiment. Nevertheless, not all parsers had
						their performance improved upon retraining. Parser<cb />
						retraining yielded only slight improvements for
						<hi rend="small">RERANK</hi>, <hi rend="small">BERKELEY</hi>, and <hi rend="small">STANFORD</hi>, while larger
						improvements were observed for <hi rend="small">MST</hi>, <hi rend="small">KSDEP</hi>,
						<hi rend="small">NO-RERANK</hi>, and <hi rend="small">ENJU</hi>. Such results indicate the dif<del type="lb">-</del>
						ferences in the portability of training methods. A
						large improvement from <hi rend="small">ENJU</hi> to <hi rend="small">ENJU-GENIA</hi> shows
						the effectiveness of the specifically designed do<del type="lb">-</del>
						main adaptation method, suggesting that the other
						parsers might also benefit from more sophisticated
						approaches for domain adaptation.
					</p>
					<p>
						While the accuracy level of PPI extraction is
						similar for the different parsers, parsing speed<pb n="51" />
						differs significantly. The dependency parsers are
						much faster than the other parsers, while the phrase
						structure parsers are relatively slower, and the deep
						parsers are in between. It is noteworthy that the
						dependency parsers achieved accuracy comparable
						to the other parsers while being more efficient.
					</p>
					<p>
						The experimental results also demonstrate that
						<hi rend="small">PTB</hi> is significantly worse than the other representa<del type="lb">-</del>
						tions with respect to cost for training/testing and
						contributions to accuracy improvements. The con<del type="lb">-</del>
						version from <hi rend="small">PTB</hi> to dependency-based representa<del type="lb">-</del>
						tions is therefore desirable for this task, although it
						is possible that better results might be obtained with
						<hi rend="small">PTB</hi> if a different feature extraction mechanism is
						used. Dependency-based representations are com<del type="lb">-</del>
						petitive, while <hi rend="small">CoNLL</hi> seems superior to <hi rend="small">HD</hi> and <hi rend="small">SD</hi>
						in spite of the imperfect conversion from <hi rend="small">PTB</hi> to
						<hi rend="small">CoNLL</hi>. This might be a reason for the high per<del type="lb">-</del>
						formances of the dependency parsers that directly
						compute <hi rend="small">CoNLL</hi> dependencies. The results for
						<hi rend="small">ENJU</hi>-<hi rend="small">CoNLL</hi> and <hi rend="small">ENJU</hi>-<hi rend="small">PAS</hi> show that <hi rend="small">PAS</hi> contributes to a
						larger accuracy improvement, although this does not
						necessarily imply the superiority of <hi rend="small">PAS</hi>, because two
						imperfect conversions, i.e., <hi rend="small">PAS</hi>-to-<hi rend="small">PTB</hi> and
						<hi rend="small">PTB</hi>-to-<hi rend="small">CoNLL</hi>, are applied for creating <hi rend="small">CoNLL</hi>.
					</p>
				</div>
				<div xml:id="SE4.3" n="4.3" type="subsection">
					<head>Parser ensemble results</head>
					<p>
						<ref target="#TA5">Table 5</ref> shows the accuracy obtained with ensembles
						of two parsers/representations (except the <hi rend="small">PTB</hi> for<del type="lb">-</del>
						mat). Bracketed figures denote improvements from
						the accuracy with a single parser/representation.
						The results show that task accuracy improves
						significantly with parser/representation ensembles. Inter<del type="lb">-</del>
						estingly, the accuracy improvements are observed
						even for ensembles of different representations from
						the same parser. This indicates that a single parse
						representation is insufficient for expressing the true<cb />
						potential of a parser. Effectiveness of the parser en<del type="lb">-</del>
						semble is also attested by the fact that it resulted in
						larger improvements. Further investigation of the
						sources of these improvements will illustrate the ad<del type="lb">-</del>
						vantages and disadvantages of these parsers and rep<del type="lb">-</del>
						resentations, leading us to better parsing models and
						a better design for parse representations.
					</p>
					<figure xml:id="TA5">
						<head>Table 5: Results of parser/representation ensemble (f-score)</head>
					</figure>
				</div>
				<div xml:id="SE4.4" n="4.4" type="subsection">
					<head>Comparison with previous results on PPI extraction</head>
					<p>
						PPI extraction experiments on AImed have been re<del type="lb">-</del>
						ported repeatedly, although the figures cannot be
						compared directly because of the differences in data
						preprocessing and the number of target protein pairs
						(<ref target="#BI39">Sætre et al., 2007</ref>). <ref target="#TA6">Table 6</ref> compares our best re<del type="lb">-</del>
						sult with previously reported accuracy figures. <ref target="#BI16">Giu<del type="lb">-</del>
						liano et al. (2006)</ref> and <ref target="#BI27">Mitsumori et al. (2006)</ref> do
						not rely on syntactic parsing: the former ap<del type="lb">-</del>
						plied SVMs with kernels on surface strings, and the
						latter is similar to our baseline method. <ref target="#BI4">Bunescu
						and Mooney (2005)</ref> applied SVMs with subsequence
						kernels to the same task, although they provided only
						a precision-recall graph, and the f-score read from it is
						around 50. Since we did not run experiments on
						protein-pair-wise cross validation, our system can<del type="lb">-</del>
						not be compared directly to the results reported
						by <ref target="#BI14">Erkan et al. (2007)</ref> and <ref target="#BI20">Katrenko and Adriaans<pb n="52" />
						(2006)</ref>, while <ref target="#BI39">Sætre et al. (2007)</ref> presented better re<del type="lb">-</del>
						sults than theirs under the same evaluation criterion.
					</p>
				</div>
			</div>
			<div xml:id="SE5" n="5" type="section">
				<head>Related Work</head>
				<p>
					Though the evaluation of syntactic parsers has been
					a major concern in the parsing community, and a
					couple of works have recently presented the com<del type="lb">-</del>
					parison of parsers based on different frameworks,
					their methods were based on the comparison of the
					parsing accuracy in terms of a certain intermediate
					parse representation (<ref target="#BI38">Ringger et al., 2004</ref>; <ref target="#BI19">Kaplan
					et al., 2004</ref>; <ref target="#BI2">Briscoe and Carroll, 2006</ref>; <ref target="#BI8">Clark and
					Curran, 2007</ref>; <ref target="#BI29">Miyao et al., 2007</ref>; <ref target="#BI9">Clegg and Shep<del type="lb">-</del>
					herd, 2007</ref>; <ref target="#BI36">Pyysalo et al., 2007b</ref>; <ref target="#BI35">Pyysalo et al.,
					2007a</ref>; <ref target="#BI41">Sagae et al., 2008</ref>). Such evaluation requires
					gold standard data in an intermediate representation.
					However, it has been argued that the conversion of
					parsing results into an intermediate representation
					is difficult and far from perfect.
				</p>
				<p>
					The relationship between parsing accuracy and
					task accuracy has been obscure for many years.
					<ref target="#BI37">Quirk and Corston-Oliver (2006)</ref> investigated the
					impact of parsing accuracy on statistical MT. How<del type="lb">-</del>
					ever, this work was only concerned with a single de<del type="lb">-</del>
					pendency parser, and did not focus on parsers based
					on different frameworks.
				</p>
			</div>
			<div xml:id="SE6" n="6" type="section">
				<head>Conclusion and Future Work</head>
				<p>
					We have presented our attempts to evaluate syntac<del type="lb">-</del>
					tic parsers and their representations that are based on
					different frameworks: dependency parsing, phrase
					structure parsing, and deep parsing. The basic idea
					is to measure the accuracy improvements of the
					PPI extraction task by incorporating the parser out<del type="lb">-</del>
					put as statistical features of a machine learning
					classifier. Experiments showed that
					state-of-the-art parsers attain accuracy levels that are on par
					with each other, while parsing speed differs sig<del type="lb">-</del>
					nificantly. We also found that accuracy improve<del type="lb">-</del>
					ments vary when parsers are retrained with
					domain-specific data, indicating the importance of domain
					adaptation and the differences in the portability of
					parser training methods.
				</p>
				<p>
					Although we restricted ourselves to parsers
					trainable with Penn Treebank-style treebanks, our
					methodology can be applied to any English parser.
					Candidates include RASP (<ref target="#BI2">Briscoe and Carroll,<cb />
					2006</ref>), the C&amp;C parser (<ref target="#BI7">Clark and Curran, 2004</ref>),
					the XLE parser (<ref target="#BI19">Kaplan et al., 2004</ref>), MINIPAR
					(<ref target="#BI23">Lin, 1998</ref>), and Link Parser (<ref target="#BI42">Sleator and Temperley,
					1993</ref>; <ref target="#BI34">Pyysalo et al., 2006</ref>), but the domain adapt<del type="lb">-</del>
					ation of these parsers is not straightforward. It is also
					possible to evaluate unsupervised parsers, which is
					attractive since evaluation of such parsers with
					gold-standard data is extremely problematic.
				</p>
				<p>
					A major drawback of our methodology is that
					the evaluation is indirect and the results depend
					on a selected task and its settings. This indicates
					that different results might be obtained with other
					tasks. Hence, we cannot conclude that particular
					parsers/representations are superior from our results alone. In or<del type="lb">-</del>
					der to obtain general ideas on parser performance,
					experiments on other tasks are indispensable.
				</p>
			</div>
		</body>
		<back>
			<div type="ack">
				<head>Acknowledgements</head>
				<p>
					This work was partially supported by Grant-in-Aid
					for Specially Promoted Research (MEXT, Japan),
					Genome Network Project (MEXT, Japan), and
					Grant-in-Aid for Young Scientists (MEXT, Japan).
				</p>
			</div>
			<div type="bib">
				<head>References</head>
				<listBibl>
					<bibl xml:id="BI1">
						D. M. Bikel. 2004. Intricacies of Collins’ parsing model.
						<hi rend="italic">Computational Linguistics</hi>, 30(4):479–511.
					</bibl>
					<bibl xml:id="BI2">
						T. Briscoe and J. Carroll. 2006. Evaluating the accu<del type="lb">-</del>
						racy of an unlexicalized statistical parser on the PARC
						DepBank. In <hi rend="italic">COLING/ACL 2006 Poster Session</hi>.
					</bibl>
					<bibl xml:id="BI3">
						R. Bunescu and R. J. Mooney. 2004. Collective infor<del type="lb">-</del>
						mation extraction with relational Markov networks. In
						<hi rend="italic">ACL 2004</hi>, pages 439–446.
					</bibl>
					<bibl xml:id="BI4">
						R. C. Bunescu and R. J. Mooney. 2005. Subsequence
						kernels for relation extraction. In <hi rend="italic">NIPS 2005</hi>.
					</bibl>
					<bibl xml:id="BI5">
						E. Charniak and M. Johnson. 2005. Coarse-to-fine n-best
						parsing and MaxEnt discriminative reranking. In <hi rend="italic">ACL 2005</hi>.
					</bibl>
					<bibl xml:id="BI6">
						E. Charniak. 2000. A maximum-entropy-inspired parser.
						In <hi rend="italic">NAACL-2000</hi>, pages 132–139.
					</bibl>
					<bibl xml:id="BI7">
						S. Clark and J. R. Curran. 2004. Parsing the WSJ using
						CCG and log-linear models. In <hi rend="italic">42nd ACL</hi>.
					</bibl>
					<bibl xml:id="BI8">
						S. Clark and J. R. Curran. 2007. Formalism-independent
						parser evaluation with CCG and DepBank. In <hi rend="italic">ACL
						2007</hi>.
					</bibl>
					<bibl xml:id="BI9">
						A. B. Clegg and A. J. Shepherd. 2007. Benchmark<del type="lb">-</del>
						ing natural-language parsers for biological applica<del type="lb">-</del>
						tions using dependency graphs. <hi rend="italic">BMC Bioinformatics</hi>,
						8:24.
					</bibl>
					<pb n="53" />
					<bibl xml:id="BI10">
						M. Collins and N. Duffy. 2002. New ranking algorithms
						for parsing and tagging: Kernels over discrete struc<del type="lb">-</del>
						tures, and the voted perceptron. In <hi rend="italic">ACL 2002</hi>.
					</bibl>
					<bibl xml:id="BI11">
						M. Collins. 1997. Three generative, lexicalised models
						for statistical parsing. In <hi rend="italic">35th ACL</hi>.
					</bibl>
					<bibl xml:id="BI12">
						M.-C. de Marneffe, B. MacCartney, and C. D. Man<del type="lb">-</del>
						ning. 2006. Generating typed dependency parses from
						phrase structure parses. In <hi rend="italic">LREC 2006</hi>.
					</bibl>
					<bibl xml:id="BI13">
						J. M. Eisner. 1996. Three new probabilistic models
						for dependency parsing: An exploration. In <hi rend="italic">COLING
						1996</hi>.
					</bibl>
					<bibl xml:id="BI14">
						G. Erkan, A. Ozgur, and D. R. Radev. 2007.
						Semi-supervised classification for extracting protein interac<del type="lb">-</del>
						tion sentences using dependency parsing. In <hi rend="italic">EMNLP 2007</hi>.
					</bibl>
					<bibl xml:id="BI15">
						D. Gildea. 2001. Corpus variation and parser perfor<del type="lb">-</del>
						mance. In <hi rend="italic">EMNLP 2001</hi>, pages 167–202.
					</bibl>
					<bibl xml:id="BI16">
						C. Giuliano, A. Lavelli, and L. Romano. 2006. Exploit<del type="lb">-</del>
						ing shallow linguistic information for relation extrac<del type="lb">-</del>
						tion from biomedical literature. In <hi rend="italic">EACL 2006</hi>.
					</bibl>
					<bibl xml:id="BI17">
						T. Hara, Y. Miyao, and J. Tsujii. 2007. Evaluating im<del type="lb">-</del>
						pact of re<del type="ed">-</del>training a lexical disambiguation model on
						domain adaptation of an HPSG parser. In <hi rend="italic">IWPT 2007</hi>.
					</bibl>
					<bibl xml:id="BI18">
						R. Johansson and P. Nugues. 2007. Extended
						constituent-to-dependency conversion for English. In
						<hi rend="italic">NODALIDA 2007</hi>.
					</bibl>
					<bibl xml:id="BI19">
						R. M. Kaplan, S. Riezler, T. H. King, J. T. Maxwell, and
						A. Vasserman. 2004. Speed and accuracy in shallow
						and deep stochastic parsing. In <hi rend="italic">HLT/NAACL’04</hi>.
					</bibl>
					<bibl xml:id="BI20">
						S. Katrenko and P. Adriaans. 2006. Learning relations
						from biomedical corpora using dependency trees. In
						<hi rend="italic">KDECB</hi>, pages 61–80.
					</bibl>
					<bibl xml:id="BI21">
						J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GE<del type="lb">-</del>
						NIA corpus — a semantically annotated corpus for
						bio-textmining. <hi rend="italic">Bioinformatics</hi>, 19:i180–182.
					</bibl>
					<bibl xml:id="BI22">
						D. Klein and C. D. Manning. 2003. Accurate unlexical<del type="lb">-</del>
						ized parsing. In <hi rend="italic">ACL 2003</hi>.
					</bibl>
					<bibl xml:id="BI23">
						D. Lin. 1998. Dependency-based evaluation of MINI<del type="lb">-</del>PAR.
						In <hi rend="italic">LREC Workshop on the Evaluation of Parsing
						Systems</hi>.
					</bibl>
					<bibl xml:id="BI24">
						M. Marcus, B. Santorini, and M. A. Marcinkiewicz.
						1994. Building a large annotated corpus of En<del type="lb">-</del>
						glish: The Penn Treebank. <hi rend="italic">Computational Linguistics</hi>,
						19(2):313–330.
					</bibl>
					<bibl xml:id="BI25">
						T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilis<del type="lb">-</del>
						tic CFG with latent annotations. In <hi rend="italic">ACL 2005</hi>.
					</bibl>
					<bibl xml:id="BI26">
						R. McDonald and F. Pereira. 2006. Online learning of
						approximate dependency parsing algorithms. In <hi rend="italic">EACL 2006</hi>.
					</bibl>
					<bibl xml:id="BI27">
						T. Mitsumori, M. Murata, Y. Fukuda, K. Doi, and H. Doi.
						2006. Extracting protein-protein interaction informa<del type="lb">-</del>
						tion from biomedical text with SVM. <hi rend="italic">IEICE - Trans.
						Inf. Syst.</hi>, E89-D(8):2464–2466.
					</bibl>
					<cb />
					<bibl xml:id="BI28">
						Y. Miyao and J. Tsujii. 2008. Feature forest models for
						probabilistic HPSG parsing. <hi rend="italic">Computational Linguis<del type="lb">-</del>
						tics</hi>, 34(1):35–80.
					</bibl>
					<bibl xml:id="BI29">
						Y. Miyao, K. Sagae, and J. Tsujii. 2007. Towards
						framework-independent evaluation of deep linguistic
						parsers. In <hi rend="italic">Grammar Engineering across Frameworks
						2007</hi>, pages 238–258.
					</bibl>
					<bibl xml:id="BI30">
						A. Moschitti. 2006. Making tree kernels practical for
						natural language processing. In <hi rend="italic">EACL 2006</hi>.
					</bibl>
					<bibl xml:id="BI32">
						J. Nivre and J. Nilsson. 2005. Pseudo-projective depen<del type="lb">-</del>
						dency parsing. In <hi rend="italic">ACL 2005</hi>.
					</bibl>
					<bibl xml:id="BI33">
						S. Petrov and D. Klein. 2007. Improved inference for
						unlexicalized parsing. In <hi rend="italic">HLT-NAACL 2007</hi>.
					</bibl>
					<bibl xml:id="BI34">
						S. Pyysalo, T. Salakoski, S. Aubin, and A. Nazarenko.
						2006. Lexical adaptation of link grammar to the
						biomedical sublanguage: a comparative evaluation of
						three approaches. <hi rend="italic">BMC Bioinformatics</hi>, 7(Suppl. 3).
					</bibl>
					<bibl xml:id="BI35">
						S. Pyysalo, F. Ginter, J. Heimonen, J. Björne, J. Boberg,
						J. Järvinen, and T. Salakoski. 2007a. BioInfer: a cor<del type="lb">-</del>
						pus for information extraction in the biomedical domain.
						<hi rend="italic">BMC Bioinformatics</hi>, 8(50).
					</bibl>
					<bibl xml:id="BI36">
						S. Pyysalo, F. Ginter, V. Laippala, K. Haverinen, J. Hei<del type="lb">-</del>
						monen, and T. Salakoski. 2007b. On the unification of
						syntactic annotations under the Stanford dependency
						scheme: A case study on BioInfer and GENIA. In
						<hi rend="italic">BioNLP 2007</hi>, pages 25–32.
					</bibl>
					<bibl xml:id="BI37">
						C. Quirk and S. Corston-Oliver. 2006. The impact of
						parse quality on syntactically-informed statistical ma<del type="lb">-</del>
						chine translation. In <hi rend="italic">EMNLP 2006</hi>.
					</bibl>
					<bibl xml:id="BI38">
						E. K. Ringger, R. C. Moore, E. Charniak, L. Vander<del type="lb">-</del>
						wende, and H. Suzuki. 2004. Using the Penn Tree<del type="lb">-</del>
						bank to evaluate non-treebank parsers. In <hi rend="italic">LREC 2004</hi>.
					</bibl>
					<bibl xml:id="BI39">
						R. Sætre, K. Sagae, and J. Tsujii. 2007. Syntactic
						features for protein-protein interaction extraction.
						In <hi rend="italic">LBM 2007 short papers</hi>.
					</bibl>
					<bibl xml:id="BI40">
						K. Sagae and J. Tsujii. 2007. Dependency parsing and
						domain adaptation with LR models and parser ensem<del type="lb">-</del>
						bles. In <hi rend="italic">EMNLP-CoNLL 2007</hi>.
					</bibl>
					<bibl xml:id="BI41">
						K. Sagae, Y. Miyao, T. Matsuzaki, and J. Tsujii. 2008.
						Challenges in mapping of syntactic representations
						for framework-independent parser evaluation. In <hi rend="italic">the
						Workshop on Automated Syntactic Annotations for In<del type="lb">-</del>
						teroperable Language Resources</hi>.
					</bibl>
					<bibl xml:id="BI42">
						D. D. Sleator and D. Temperley. 1993. Parsing English
						with a Link Grammar. In <hi rend="italic">3rd IWPT</hi>.
					</bibl>
					<bibl xml:id="BI43">
						Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. Mc<del type="lb">-</del>
						Naught, S. Ananiadou, and J. Tsujii. 2005. Develop<del type="lb">-</del>
						ing a robust part-of-speech tagger for biomedical text.
						In <hi rend="italic">10th Panhellenic Conference on Informatics</hi>.
					</bibl>
					<bibl xml:id="BI44">
						A. Yakushiji, Y. Miyao, Y. Tateisi, and J. Tsujii. 2005.
						Biomedical information extraction with
						predicate-argument structure patterns. In <hi rend="italic">First International
						Symposium on Semantic Mining in Biomedicine</hi>.
					</bibl>
				</listBibl>
				<pb n="54" />
			</div>
		</back>
	</text>
</TEI>
