
Use Case: treeler-fgen

Simple tool for monitoring feature generation:

  • Reads patterns from a file and outputs feature vectors for parts instantiable in a pattern
  • The tool registers models that specify the class of patterns, parts and features

Example: first order dependency parsing

  • Spec: feature templates
  • Input: a sentence, e.g. "The cat eats fish" tagged with pos and other attributes
  • Output: for each part (h,m,l), the feature vector representing it
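As a rough illustration of what "each part (h,m,l)" covers, the sketch below enumerates the candidate parts of the example sentence. It is not treeler code: the function name, the label set, and the convention that index 0 is an artificial root are all assumptions made for this example.

```python
from itertools import product

def first_order_parts(n_tokens, labels):
    """Enumerate candidate parts (h, m, l): head index, modifier index, label.

    Tokens are numbered 1..n_tokens. Index 0 is an artificial root that can
    only act as a head (an assumption of this sketch).
    """
    parts = []
    for h, m in product(range(n_tokens + 1), range(1, n_tokens + 1)):
        if h == m:
            continue  # a token cannot be its own head
        for l in labels:
            parts.append((h, m, l))
    return parts

sentence = ["The", "cat", "eats", "fish"]
labels = ["SBJ", "OBJ", "NMOD"]  # hypothetical label set
parts = first_order_parts(len(sentence), labels)
print(len(parts), parts[:3])
```

A feature generator would then be asked for one feature vector per element of `parts`.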

Indicator Features

Many NLP tasks use so-called indicator features, i.e. features that test for the presence of a particular value, or a conjunction of values, in a part. We define such features using templates: a template specifies which values to compute from a pattern-part pair. An n-ary template is a template that takes n primitive values of the part. In NLP, unary and binary templates are the most common, although ternary and quaternary templates (n = 3, 4) are also used.
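A minimal sketch of how templates of different arity could be instantiated on a pattern-part pair follows. The template names, the dictionary-based token representation, the POS tags and the label are illustrative assumptions, not the treeler API.

```python
# A pattern is a list of tokens; each token is a dict of primitive attributes.
# Index 0 is an artificial root (an assumption of this sketch).
sentence = [
    {"word": "<root>", "pos": "ROOT"},
    {"word": "The", "pos": "DT"},
    {"word": "cat", "pos": "NN"},
    {"word": "eats", "pos": "VBZ"},
    {"word": "fish", "pos": "NN"},
]

# Each template is a tuple of value extractors; its length is the arity n.
TEMPLATES = {
    "head_pos": (lambda x, h, m, l: x[h]["pos"],),                # unary
    "mod_word": (lambda x, h, m, l: x[m]["word"],),               # unary
    "head_pos+mod_pos": (lambda x, h, m, l: x[h]["pos"],
                         lambda x, h, m, l: x[m]["pos"]),         # binary
    "head_pos+mod_pos+label": (lambda x, h, m, l: x[h]["pos"],
                               lambda x, h, m, l: x[m]["pos"],
                               lambda x, h, m, l: l),             # ternary
}

def instantiate(x, part):
    """Return one (template name, instantiated values) pair per template."""
    h, m, l = part
    return [
        (name, tuple(f(x, h, m, l) for f in extractors))
        for name, extractors in TEMPLATES.items()
    ]

# Part: "eats" (3) is the head of "cat" (2) with hypothetical label "SBJ".
features = instantiate(sentence, (3, 2, "SBJ"))
for name, values in features:
    print(name, values)
```

Each printed pair is one instance of a template: the template name together with the values it picked out of the pattern-part pair.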

Each instance of a template corresponds to one dimension of the feature space constructed by the FGen. We therefore call an instance of a feature template a feature index. A feature index encodes the type of the template that generated it together with the instantiated values.

For efficiency reasons, it is critical to represent feature indices compactly. We consider two implementations of feature indices: ints (requiring dictionaries) and bitstrings (requiring encoding/decoding routines).
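The two options can be sketched as follows. This is only an illustration of the trade-off, not the treeler implementation: the class name, the function names and the bit widths (8 bits for the template type, 12 bits per value) are assumptions chosen for the example.

```python
class FeatureDictionary:
    """Int indices: each distinct (template, values) pair gets the next int."""

    def __init__(self):
        self._ids = {}

    def index(self, template, values):
        key = (template, tuple(values))
        return self._ids.setdefault(key, len(self._ids))


TYPE_BITS = 8    # assumed width of the template-type field
VALUE_BITS = 12  # assumed width of each value field

def encode(template_id, value_ids):
    """Bitstring indices: pack the template type and value ids into one int."""
    if not 0 <= template_id < (1 << TYPE_BITS):
        raise ValueError("template id does not fit in TYPE_BITS")
    code = 0
    for v in reversed(value_ids):
        if not 0 <= v < (1 << VALUE_BITS):
            raise ValueError("value id does not fit in VALUE_BITS")
        code = (code << VALUE_BITS) | v
    return (code << TYPE_BITS) | template_id

def decode(code, arity):
    """Inverse of encode; the arity is assumed to be known from the template."""
    template_id = code & ((1 << TYPE_BITS) - 1)
    code >>= TYPE_BITS
    values = []
    for _ in range(arity):
        values.append(code & ((1 << VALUE_BITS) - 1))
        code >>= VALUE_BITS
    return template_id, tuple(values)


d = FeatureDictionary()
first = d.index("head_pos", ("VBZ",))
second = d.index("head_pos", ("NN",))
again = d.index("head_pos", ("VBZ",))   # same feature, same index as `first`

packed = encode(3, (5, 7))
unpacked = decode(packed, arity=2)      # recovers (3, (5, 7))
```

The dictionary yields dense indices but must be stored and looked up; the bitstring needs no table but requires every value to fit in a fixed number of bits.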

Typically, the value of an indicator feature for a pattern-part pair is the number of times the feature appears in that pair.
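Counts larger than one arise when a template can fire several times on the same part. The sketch below uses a hypothetical "between_pos" template that fires once for every token lying between the head and the modifier; the template, the function name and the tag sequence are assumptions made for illustration.

```python
from collections import Counter

def between_pos_features(pos_tags, h, m):
    """One ("between_pos", (tag,)) instance per token strictly between h and m."""
    lo, hi = sorted((h, m))
    return [("between_pos", (pos_tags[i],)) for i in range(lo + 1, hi)]

# Illustrative tag sequence for a five-token sentence plus an artificial root.
pos_tags = ["ROOT", "DT", "JJ", "JJ", "NN", "VBZ"]

# Part with head 5 (VBZ) and modifier 1 (DT): tokens 2, 3 and 4 lie in between.
vector = Counter(between_pos_features(pos_tags, h=5, m=1))
print(vector)
```

Here the feature for JJ has value 2 and the feature for NN has value 1; every feature that does not fire has value 0 and is simply absent from the sparse vector.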

Other Features

Other popular representations in NLP are:

  • A feature vector, computed externally
  • kernels, i.e. functions computing similarities between two parts.
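As a minimal example of the second option, the simplest kernel between two parts is the dot product of their sparse feature vectors (a linear kernel). The function name and the feature keys below are assumptions for illustration; real kernels over parts may be considerably richer.

```python
from collections import Counter

def linear_kernel(u, v):
    """Dot product of two sparse feature vectors (feature -> count mappings)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v.get(f, 0) for f, count in u.items())

part_a = Counter({"head_pos=VBZ": 1, "between_pos=JJ": 2})
part_b = Counter({"head_pos=VBZ": 1, "between_pos=JJ": 3, "mod_pos=DT": 1})
similarity = linear_kernel(part_a, part_b)  # 1*1 + 2*3 = 7
```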

Use Case: parsing CoNLL datasets

  • Tools for training parsers and testing them on CoNLL data.
  • The main focus is on multilingual dependency parsing from the CoNLL 2007 Shared Task. See Results from CoNLL 2007.
  • Other CoNLL datasets: named entity extraction, SRL, coreference.
  • Constituent parsing as well, using similar formats.

Use Case: pos+dep pipe

  • A classic pipeline for processing free text, built on FreeLing: the tagging and parsing FreeLing modules are implemented using treeler.