
Language Technology at LTH

Computer Science | Faculty of Engineering, LTH


The LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks

This is a tool to automatically convert the constituent format used in the Penn Treebank into dependency trees. The tool was used to prepare the English dependency treebanks in the 2007, 2008, and 2009 versions of the CoNLL Shared Task.

NOTE: The tool has been updated so that the default output (mostly) corresponds to the linguistic conventions used in the CoNLL-2008 Shared Task.

Prerequisites

  • You need to have a Java virtual machine, version 1.5 or above.
  • If you are using the WSJ part of the Penn Treebank, we strongly recommend that you patch it using the NP bracketing by David Vadas.

Installation

By downloading the tool, you agree to this license. There are no restrictions on usage beyond those stated in the license, but we would appreciate it if you cite our paper when you use the program in your academic work.

The tool is available as a JAR executable here (last updated on August 7, 2008). No installation is needed -- just download the file.

Overview of Usage

The program was written to convert the WSJ part of the Penn Treebank into a dependency format. It has since been extended to handle the Brown corpus as distributed in Penn Treebank 3. It cannot handle the Switchboard or ATIS parts of the Treebank, and it cannot handle treebanks in languages other than English. If you are looking for a conversion tool for the Chinese Treebank, you might want to consider Penn2Malt.

To apply the tool to a Penn-style treebank, do as follows:

java -jar pennconverter.jar < penn_treebank > dep_treebank

The resulting dependency treebank generally follows the linguistic conventions used in the CoNLL-2008 Shared Task. The main differences are that hyphenated words are not split and that some complex named entities are annotated differently. The output is encoded in the CoNLL-X format.
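Because the output is CoNLL-X, it can be read with a few lines of code. The sketch below is a minimal reader, assuming the standard CoNLL-X layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and a blank line after each sentence. The names `Token` and `read_conllx` are hypothetical, and the sample rows are hand-written for illustration, not actual output of the tool.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token row of a CoNLL-X file (only the columns used here)."""
    index: int    # ID: 1-based position in the sentence
    form: str     # FORM: the word itself
    postag: str   # POSTAG: fine-grained part-of-speech tag
    head: int     # HEAD: index of the head token, 0 for the root
    deprel: str   # DEPREL: label of the dependency to the head


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7])
        )
    # Emit the last sentence if the input lacks a trailing blank line.
    if sentence:
        yield sentence


# Hand-written sample rows (illustrative labels, not real tool output).
sample = (
    "1\tThe\t_\tDT\tDT\t_\t2\tNMOD\t_\t_\n"
    "2\tcat\t_\tNN\tNN\t_\t3\tSBJ\t_\t_\n"
    "3\tsleeps\t_\tVBZ\tVBZ\t_\t0\tROOT\t_\t_\n"
    "\n"
)
sentences = list(read_conllx(sample.splitlines()))
roots = [tok.form for tok in sentences[0] if tok.head == 0]
```

To read a converted file, pass an open file handle instead of the sample, for example `read_conllx(open("dep_treebank", encoding="utf-8"))`.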

By default, the tool assumes that the input has the full format of the Penn Treebank, i.e. that it is correctly annotated with function tags and traces, and that it has been patched using Vadas' NP bracketing.
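If it is unclear whether a file carries function tags and traces, a quick check of the bracket labels can help. The sketch below is a rough heuristic of our own, not part of the tool: it treats a `-NONE-` node as evidence of traces, and a label such as `NP-SBJ` or `NP-1` as evidence of function tags or coindexation. The function name `looks_fully_annotated` is hypothetical, and the check assumes bracket tokens are escaped as `-LRB-`/`-RRB-`, as in the Penn Treebank.

```python
import re

# Captures the label right after an opening bracket, e.g. "NP-SBJ" in "(NP-SBJ ".
LABEL = re.compile(r"\(([^\s()]+)")


def looks_fully_annotated(tree: str) -> bool:
    """Return True if the bracketed tree shows function tags or traces."""
    for label in LABEL.findall(tree):
        if label == "-NONE-":
            # Empty elements (traces) are tagged -NONE- in the Penn Treebank.
            return True
        # Skip tags such as -LRB- and -RRB-, which start with a hyphen.
        if label.startswith("-"):
            continue
        # A hyphen or equals sign inside a label marks a function tag
        # (NP-SBJ) or an index (NP-1, NP=2).
        if re.search(r"[-=][A-Z0-9]", label):
            return True
    return False
```

A tree for which this returns False probably lacks function tags and traces. The heuristic only inspects labels and cannot tell whether the NP bracketing patch has been applied.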

If you are applying the tool to a fully annotated treebank without the NP bracketing (such as the Brown corpus), then write:

java -jar pennconverter.jar -rightBranching=false < penn_treebank > dep_treebank

If you are applying the tool to parse trees without function tags and traces, such as the output of an automatic parser, then use the -raw option:

java -jar pennconverter.jar -raw < penn_treebank > dep_treebank
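The three invocations above can also be driven from a script. The following is a minimal sketch, assuming `java` is on the PATH and the JAR is saved as `pennconverter.jar`. The helper names `build_command` and `convert` and the mode names are hypothetical; only the flags themselves come from this page.

```python
import subprocess
from typing import List

# Extra flags for each of the three input types described above.
MODE_FLAGS = {
    "full": [],                                    # function tags, traces, NP bracketing
    "no_np_bracketing": ["-rightBranching=false"], # fully annotated, unpatched NPs
    "raw": ["-raw"],                               # no function tags or traces
}


def build_command(jar: str, mode: str = "full") -> List[str]:
    """Return the argument list for one of the documented invocations."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["java", "-jar", jar] + MODE_FLAGS[mode]


def convert(jar: str, penn_path: str, dep_path: str, mode: str = "full") -> None:
    """Run the converter, feeding penn_path on stdin and writing stdout to dep_path."""
    with open(penn_path, "rb") as src, open(dep_path, "wb") as dst:
        # check=True raises CalledProcessError if the process exits with a
        # non-zero status.
        subprocess.run(build_command(jar, mode), stdin=src, stdout=dst, check=True)
```

For example, `convert("pennconverter.jar", "penn_treebank", "dep_treebank", mode="raw")` corresponds to the last command line above.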

Command-line Options

The behavior and linguistic conventions used by the program can be configured using a wide range of command-line options.

The following options are available:

File options
  -f FILE                                read input from FILE (default: stdin)
  -t FILE                                output to FILE (default: stdout)
  -log FILE                              write log messages to FILE (default: no messages)
  -verbosity N                           set verbosity level in log file to N (0, 1, or 2; default: 0)
  -stopOnError[=true|false]              terminate if an error is encountered

Input format options
  -rightBranching[=true|false]           assume implicit right branching of NPs.
                                         Disable this option if you are NOT using the NP bracketing by Vadas.

Shorthand options
  -conll2007                             turns on options to emulate the conventions used in CoNLL Shared Task 2007
  -raw                                   turns on options for trees without function tags and secondary edges
  -oldLTH                                turns on options to emulate the old conventions from the NODALIDA article

Linguistic options
  -coordStructure=oldLTH|prague|melchuk  determines how to represent coordination
  -posAsHead[=true|false]                let possessive be head in possessive NPs
  -prepAsHead[=true|false]               let preposition be head in PPs
  -subAsHead[=true|false]                let subordinating conjunction (IN/DT) be head in SBARs
  -whAsHead[=true|false]                 let wh-phrase be head in relative clauses
  -imAsHead[=true|false]                 let infinitive marker (to) be head in VPs
  -splitSmallClauses[=true|false]        split small clauses into object/OPRD
  -advFuncs[=true|false]                 use adverbial tags such as LOC, TMP
  -rootLabels[=true|false]               use separate root labels such as ROOT-S, ROOT-FRAG
  -labelCoords[=true|false]              use separate coordination labels: SCOORD, VCOORD, COORD
  -splitSlash[=true|false]               rewrite A/B as A / B
  -ddtGapping[=true|false]               DDT-style encoding of gapping
  -conll2008clf[=true|false]             annotate cleft sentences as in CoNLL-2008
  -conll2008exp[=true|false]             annotate expletive constructions as in CoNLL-2008
  -iobj[=true|false]                     use the IOBJ label for indirect objects
  -relinkCyclicPRN[=true|false]          move cyclic parentheticals to top
  -name[=true|false]                     annotate dependencies inside atomic names using NAME
  -suffix[=true|false]                   use the SUFFIX label for possessive suffixes
  -title[=true|false]                    use the TITLE label for titles in names
  -posthon[=true|false]                  annotate posthonorifics using POSTHON
  -appo[=true|false]                     annotate appositions using APPO
  -clr[=true|false]                      use the CLR function tag
  -deepenQP[=true|false]                 add additional structure to numerical phrases
  -qmod[=true|false]                     annotate dependencies inside numerical phrases using QMOD
  -noPennTags[=true|false]               ignore function tags if present
  -noSecEdges[=true|false]               ignore secondary edges if present

Output format options
  -format[=conllx|conll2008|tab]         output format
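
When several options are combined in a script, it can help to keep them in a dictionary and turn them into arguments in one place. The sketch below is one possible way to do that. The helper `format_options` is hypothetical. It always writes boolean options in the explicit `-name=true` / `-name=false` form and covers only the `-name=value` style options, not the file options that take a separate argument (such as `-f FILE`). It does not check whether an option name or value is valid for the tool.

```python
from typing import Dict, List, Union

OptionValue = Union[bool, str]


def format_options(options: Dict[str, OptionValue]) -> List[str]:
    """Turn {"prepAsHead": True, "format": "conll2008"} into command-line arguments."""
    args: List[str] = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Boolean switches are written out explicitly as true or false.
            args.append(f"-{name}={'true' if value else 'false'}")
        else:
            # Enumerated options such as -coordStructure or -format.
            args.append(f"-{name}={value}")
    return args


# Example: option names and values taken from the list above.
example_args = format_options(
    {"coordStructure": "prague", "prepAsHead": True, "iobj": False, "format": "conll2008"}
)
command = ["java", "-jar", "pennconverter.jar"] + example_args
```

The resulting `command` list can be passed to `subprocess.run`, with the treebank supplied on standard input as in the usage examples.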

References

  • Richard Johansson and Pierre Nugues. Extended Constituent-to-dependency Conversion for English. In Proceedings of NODALIDA 2007. Tartu, Estonia, 2007. [PDF]
  • Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL). 2008. [PDF]