Lingua::StanfordCoreNLP - A Perl interface to Stanford's CoreNLP tool set.
# Note that Lingua::StanfordCoreNLP can't be instantiated.
use Lingua::StanfordCoreNLP;
# Create a new NLP pipeline (silence messages, make corefs bidirectional)
my $pipeline = new Lingua::StanfordCoreNLP::Pipeline(1, 1);
# Process text
# (Unless the pipeline is silenced, as above, the Java classes write lots of debug info to STDERR.)
my $result = $pipeline->process(
'Jane looked at the IBM computer. She turned it off.'
);
my @seen_corefs;
# Print results
for my $sentence (@{$result->toArray}) {
    print "\n[Sentence ID: ", $sentence->getIDString, "]:\n";
    print "Original sentence:\n\t", $sentence->getSentence, "\n";
    print "Tagged text:\n";
    for my $token (@{$sentence->getTokens->toArray}) {
        printf "\t%s/%s/%s [%s]\n",
            $token->getWord,
            $token->getPOSTag,
            $token->getNERTag,
            $token->getLemma;
    }
    print "Dependencies:\n";
    for my $dep (@{$sentence->getDependencies->toArray}) {
        printf "\t%s(%s-%d, %s-%d) [%s]\n",
            $dep->getRelation,
            $dep->getGovernor->getWord,
            $dep->getGovernorIndex,
            $dep->getDependent->getWord,
            $dep->getDependentIndex,
            $dep->getLongRelation;
    }
    print "Coreferences:\n";
    for my $coref (@{$sentence->getCoreferences->toArray}) {
        printf "\t%s [%d, %d] <=> %s [%d, %d]\n",
            $coref->getSourceToken->getWord,
            $coref->getSourceSentence,
            $coref->getSourceHead,
            $coref->getTargetToken->getWord,
            $coref->getTargetSentence,
            $coref->getTargetHead;
        # With bidirectional corefs the same pair can show up twice.
        print "\t\t(Duplicate)\n"
            if(grep { $_->equals($coref) } @seen_corefs);
        push @seen_corefs, $coref;
    }
}
This module implements a StanfordCoreNLP pipeline for annotating
text with part-of-speech tags, dependencies, lemmas, named-entity tags, and coreferences.
(Note that the archive contains the CoreNLP annotation models, which is why it's so darned big.)
To install, the following should do the job:
$ perl Build.PL
$ ./Build test
$ sudo ./Build install
Lingua::StanfordCoreNLP consists mainly of Java code, and thus needs Inline::Java installed to function.
Lingua::StanfordCoreNLP exports the following Java classes via Inline::Java:
Lingua::StanfordCoreNLP::Pipeline
The main interface to StanfordCoreNLP. This class is the only one you
should need to instantiate yourself.
new($silent, $bidirectionalCorefs)
Creates a new Lingua::StanfordCoreNLP::Pipeline object. The optional
boolean parameter $silent silences the output from annotators if true,
while the optional parameter $bidirectionalCorefs makes coreferences bidirectional;
that is to say, each coreference is added to both its source and its target
sentence (if the two sentences are different).
$silent and $bidirectionalCorefs default to false.
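To make the effect of $bidirectionalCorefs concrete, here is a minimal sketch in plain Perl. It does not call the module: the hashes, their field names, and the attach_corefs helper are hypothetical stand-ins. It also assumes that a one-way coreference is stored only on its source sentence, which the description above implies but does not state.

```perl
use strict;
use warnings;

# Hypothetical stand-in data for "Jane looked at the IBM computer. She
# turned it off." Sentence and head indices are zero-based.
my @corefs = (
    { source_sentence => 1, source_head => 0, target_sentence => 0, target_head => 0 },
    { source_sentence => 1, source_head => 2, target_sentence => 0, target_head => 5 },
);

# Group coreferences by sentence index. Assumption: one-way coreferences
# live on the source sentence only.
sub attach_corefs {
    my ($corefs, $bidirectional) = @_;
    my %by_sentence;
    for my $coref (@$corefs) {
        push @{ $by_sentence{ $coref->{source_sentence} } }, $coref;
        # Add to the target sentence too, but only if it is a different one.
        if ($bidirectional
            && $coref->{target_sentence} != $coref->{source_sentence}) {
            push @{ $by_sentence{ $coref->{target_sentence} } }, $coref;
        }
    }
    return \%by_sentence;
}

my $one_way  = attach_corefs(\@corefs, 0);
my $two_ways = attach_corefs(\@corefs, 1);

printf "one-way:  sentence 0 has %d, sentence 1 has %d\n",
    scalar @{ $one_way->{0} // [] }, scalar @{ $one_way->{1} // [] };
printf "two-ways: sentence 0 has %d, sentence 1 has %d\n",
    scalar @{ $two_ways->{0} // [] }, scalar @{ $two_ways->{1} // [] };
```

Under these assumptions, the bidirectional setting makes each cross-sentence coreference visible from both sentences, which is also why the same pair can be encountered twice when iterating over all sentences.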
If the pipeline was created to be $silent, returns the logged messages as a string.
Otherwise, or if no output has been logged, returns an empty string.
Returns a reference to the StanfordCoreNLP pipeline used for annotation.
You probably won't want to touch this.
process($str)
Processes a string. Returns a Lingua::StanfordCoreNLP::PipelineSentenceList.
Lingua::StanfordCoreNLP::PipelineItem
Abstract superclass of Pipeline{Coreference,Dependency,Sentence,Token}. Contains an ID
and methods for getting and comparing it.
getID
Returns a java.util.UUID object which represents the item's ID.
getIDString
Returns the ID as a string.
identicalTo($b)
Returns true if $b has an identical ID to this item.
Lingua::StanfordCoreNLP::PipelineCoreference
An object representing a coreference between head-word W1 in sentence S1 and head-word W2 in sentence S2. Note that both sentences and words are zero-indexed, unlike the default outputs of Stanford's tools.
getSourceSentence
Index of sentence S1.
getTargetSentence
Index of sentence S2.
getSourceHead
Index of word W1 (in S1).
getTargetHead
Index of word W2 (in S2).
getSourceToken
The Lingua::StanfordCoreNLP::PipelineToken representing W1.
getTargetToken
The Lingua::StanfordCoreNLP::PipelineToken representing W2.
equals($b)
Returns true if this PipelineCoreference matches $b, that is, if
their getSourceToken and getTargetToken have the same ID.
Note that it returns true even if the orders of the
coreferences are reversed (if $a->getSourceToken->getID == $b->getTargetToken->getID
and $a->getTargetToken->getID == $b->getSourceToken->getID).
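The order-insensitive comparison described above can be sketched in plain Perl. This is not the module's implementation: each coreference is reduced to a hypothetical hash holding the ID strings of its source and target tokens, and coref_equals is an illustrative helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: a coreference is just the ID strings of its
# source and target tokens (made-up IDs, not real UUIDs).
sub coref_equals {
    my ($x, $y) = @_;
    my $same_order = $x->{source_id} eq $y->{source_id}
                  && $x->{target_id} eq $y->{target_id};
    my $reversed   = $x->{source_id} eq $y->{target_id}
                  && $x->{target_id} eq $y->{source_id};
    return ($same_order || $reversed) ? 1 : 0;
}

my $she_jane = { source_id => 'tok-she',  target_id => 'tok-jane' };
my $jane_she = { source_id => 'tok-jane', target_id => 'tok-she' };
my $it_comp  = { source_id => 'tok-it',   target_id => 'tok-computer' };

# The reversed pair still matches; an unrelated pair does not.
print coref_equals($she_jane, $jane_she) ? "match\n" : "no match\n";
print coref_equals($she_jane, $it_comp)  ? "match\n" : "no match\n";
```

Treating a pair and its reverse as equal is what lets a caller detect duplicates when coreferences are bidirectional.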
toCompactString
A compact String representation of the coreference: "Word/Sentence:Head <=> Word/Sentence:Head".
toString
A String representation of the coreference: "Word/POS-tag [sentence, head] <=> Word/POS-tag [sentence, head]".
Lingua::StanfordCoreNLP::PipelineDependency
Represents a dependency in the Stanford Typed Dependency format. For example, in the fragment "Walk hard", "Walk" is the governor and "hard" is the dependent in the relationship "advmod" ("hard" is an adverbial modifier of "Walk").
getGovernor
The governor in the relation as a Lingua::StanfordCoreNLP::PipelineToken.
getGovernorIndex
The index of the governor within the sentence.
getDependent
The dependent in the relation as a Lingua::StanfordCoreNLP::PipelineToken.
getDependentIndex
The index of the dependent within the sentence.
getRelation
Short name of the relation.
getLongRelation
Long description of the relation.
toCompactString($includeIndices)
toString($includeIndices)
Returns a String representation of the dependency: "relation(governor-N, dependent-N) [description]".
toCompactString does not include the description. The optional parameter $includeIndices controls
whether governor and dependent indices are included, and defaults to true.
(Note that unlike those of, e.g., the Stanford Parser, these indices start at zero, not one.)
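As an illustration of the string format and of the zero-based indices, here is a small self-contained sketch. It does not use the module: the hash, the dep_to_string helper, and its option names are hypothetical, and the long description text is a guess at what the module would return.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the "Walk hard" dependency.
my %dep = (
    relation        => 'advmod',
    long_relation   => 'adverbial modifier',
    governor        => 'Walk',
    governor_index  => 0,   # zero-based, as in this module
    dependent       => 'hard',
    dependent_index => 1,
);

# Format a dependency. Options (all hypothetical):
#   compact         => omit the "[description]" suffix
#   include_indices => append "-N" to each word (default: true)
#   one_based       => shift indices by one, Stanford Parser style
sub dep_to_string {
    my ($dep, %opt) = @_;
    my $include = exists $opt{include_indices} ? $opt{include_indices} : 1;
    my $offset  = $opt{one_based} ? 1 : 0;

    my $word = sub {
        my ($text, $index) = @_;
        return $include ? sprintf('%s-%d', $text, $index + $offset) : $text;
    };

    my $str = sprintf '%s(%s, %s)',
        $dep->{relation},
        $word->($dep->{governor},  $dep->{governor_index}),
        $word->($dep->{dependent}, $dep->{dependent_index});
    $str .= sprintf(' [%s]', $dep->{long_relation}) unless $opt{compact};
    return $str;
}

print dep_to_string(\%dep), "\n";
print dep_to_string(\%dep, compact => 1, one_based => 1), "\n";
print dep_to_string(\%dep, compact => 1, include_indices => 0), "\n";
```

The one_based option is only there to show the shift needed when comparing against tools that count words from one.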
Lingua::StanfordCoreNLP::PipelineSentence
An annotated sentence, containing the sentence itself, its dependencies, POS- and NER-tagged tokens, and coreferences.
getSentence
Returns a string containing the original sentence.
getTokens
A Lingua::StanfordCoreNLP::PipelineTokenList containing the POS- and
NER-tagged and lemmatized tokens of the sentence.
getDependencies
A Lingua::StanfordCoreNLP::PipelineDependencyList containing the dependencies
found in the sentence.
getCoreferences
A Lingua::StanfordCoreNLP::PipelineCoreferenceList of the coreferences between
this and other sentences.
toCompactString
toString
A String representation of the sentence, its coreferences, dependencies, and tokens.
toCompactString separates fields by "\n", whereas toString separates them by
"\n\n".
Lingua::StanfordCoreNLP::PipelineToken
A token, with POS- and NER-tag and lemma.
getWord
The textual representation of the token (i.e. the word).
getPOSTag
The token's Part-of-Speech tag.
getNERTag
The token's Named-Entity tag.
getLemma
The lemma of the token.
toCompactString($lemmaize)
A compact String representation of the token: "word/POS-tag". If the
optional argument $lemmaize is true, returns "lemma/POS-tag".
toString
A String representation of the token: "word/POS-tag/NER-tag [lemma]".
Lingua::StanfordCoreNLP::PipelineList is a generic list class which
extends java.util.ArrayList. It is in turn extended by
Pipeline{Coreference,Dependency,Sentence,Token}List (which are the
list-types that Pipeline returns). Note that all lists are zero-indexed.
joinList($sep)
joinListCompact($sep)
Returns a string containing the output of either the toString or
toCompactString methods of the elements in PipelineList, separated
by $sep.
Returns the elements of the list as an array reference.
Returns the list as a java.util.HashMap<String,PipelineItem>, with
the items' stringified IDs as keys.
Returns the elements of the PipelineList as a string containing the output
of either their toCompactString or toString methods, separated by the
default separator (which is "\n" for all lists except PipelineTokenList
which uses " ").
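The token string formats and the default separators can be pictured with a short plain-Perl sketch. It does not call the module: the token hashes, the tags and lemmas in them, and the token_compact and token_full helpers are all hypothetical and only mirror the formats described in this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in tokens; tags and lemmas are illustrative,
# not actual CoreNLP output.
my @tokens = (
    { word => 'She',    pos => 'PRP', ner => 'O', lemma => 'she'  },
    { word => 'turned', pos => 'VBD', ner => 'O', lemma => 'turn' },
    { word => 'it',     pos => 'PRP', ner => 'O', lemma => 'it'   },
    { word => 'off',    pos => 'RP',  ner => 'O', lemma => 'off'  },
);

# "word/POS-tag", or "lemma/POS-tag" when $lemmaize is true.
sub token_compact {
    my ($tok, $lemmaize) = @_;
    return join '/', ($lemmaize ? $tok->{lemma} : $tok->{word}), $tok->{pos};
}

# "word/POS-tag/NER-tag [lemma]"
sub token_full {
    my ($tok) = @_;
    return sprintf '%s/%s/%s [%s]', @{$tok}{qw(word pos ner lemma)};
}

# Token lists default to " " as separator; other lists default to "\n".
print join(' ',  map { token_compact($_) }    @tokens), "\n";
print join(' ',  map { token_compact($_, 1) } @tokens), "\n";
print join("\n", map { token_full($_) }       @tokens), "\n";
```

A single-space separator keeps a compact token list readable as one tagged sentence, while the newline separator suits the longer per-item strings of the other list types.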
Custom annotator combinations, so you won't have to load up six different annotator models just to POS-tag some text.
Mail any bug-reports or feature-requests to <StanfordCoreNLP@fivebyfive.be>.
Kalle Räisänen <kal@cpan.org>.
Copyright © 2011 Kalle Räisänen.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Copyright © 2010-2011 The Board of Trustees of The Leland Stanford Junior University.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.
http://nlp.stanford.edu/software/corenlp.shtml, Text::NLP::Stanford::EntityExtract, NLP::StanfordParser, Inline::Java.