Nit wrapper for Stanford CoreNLP

Stanford CoreNLP provides a set of natural language analysis tools that take raw text as input. They can give the base forms of words and their parts of speech, recognize names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, and indicate sentiment. This wrapper needs the Stanford CoreNLP jars, which run on Java 1.8+. See http://nlp.stanford.edu/software/corenlp.shtml.

Usage

~~~nitish
var proc = new NLPProcessor("path/to/StanfordCoreNLP/jars")

var doc = proc.process("String to analyze")

for sentence in doc.sentences do
	for token in sentence.tokens do
		print "{token.lemma}: {token.pos}"
	end
end
~~~

Nit API

For ease of use, this wrapper introduces a Nit model to handle CoreNLP XML results.

NLPDocument

`NLPDocument`: represents a text input given to the NLP processor.

Once processed, it contains a list of sentences that contain tokens.

`from_xml`: initializes self from an XML element.

var xml = """
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Stanford</word>
            <lemma>Stanford</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>8</CharacterOffsetEnd>
            <POS>NNP</POS>
          </token>
          <token id="2">
            <word>University</word>
            <lemma>University</lemma>
            <CharacterOffsetBegin>9</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>NNP</POS>
          </token>
        </tokens>
      </sentence>
      <sentence id="2">
        <tokens>
          <token id="1">
            <word>UQAM</word>
            <lemma>UQAM</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>4</CharacterOffsetEnd>
            <POS>NNP</POS>
          </token>
          <token id="2">
            <word>University</word>
            <lemma>University</lemma>
            <CharacterOffsetBegin>5</CharacterOffsetBegin>
            <CharacterOffsetEnd>15</CharacterOffsetEnd>
            <POS>NNP</POS>
          </token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>""".to_xml.as(XMLDocument)

var document = new NLPDocument.from_xml(xml)
assert document.sentences.length == 2
assert document.sentences.first.tokens.first.word == "Stanford"
assert document.sentences.last.tokens.first.word == "UQAM"

`sentences`: NLPSentences contained in self.

NLPSentence

`NLPSentence`: represents one sentence in an NLPDocument.

`tokens`: NLPTokens contained in self.

NLPToken

`NLPToken`: represents one word (or punctuation mark) in an NLPSentence.

`word`: original word.

`pos`: part-of-speech tag.
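
The document, sentence, and token model described above can be pictured with a small, language-neutral sketch. It is written in Python rather than Nit and is not the wrapper's implementation: the names `Document`, `Sentence`, `Token`, and `document_from_xml` are hypothetical, and the only thing taken from this README is the XML layout (`sentence` elements containing `token` elements with `word`, `lemma`, and `POS` children).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Token:
    word: str   # original word
    lemma: str  # base form
    pos: str    # part-of-speech tag


@dataclass
class Sentence:
    tokens: list = field(default_factory=list)


@dataclass
class Document:
    sentences: list = field(default_factory=list)


def document_from_xml(xml_text):
    """Build a Document from CoreNLP-style XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    doc = Document()
    # iter("sentence") matches the tag exactly, so <sentences> is skipped.
    for s in root.iter("sentence"):
        sentence = Sentence()
        for t in s.iter("token"):
            sentence.tokens.append(Token(
                word=t.findtext("word", default=""),
                lemma=t.findtext("lemma", default=""),
                pos=t.findtext("POS", default=""),
            ))
        doc.sentences.append(sentence)
    return doc


# Shortened version of the XML shown for from_xml.
SAMPLE = """<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>Stanford</word><lemma>Stanford</lemma><POS>NNP</POS></token>
    <token id="2"><word>University</word><lemma>University</lemma><POS>NNP</POS></token>
  </tokens></sentence>
  <sentence id="2"><tokens>
    <token id="1"><word>UQAM</word><lemma>UQAM</lemma><POS>NNP</POS></token>
  </tokens></sentence>
</sentences></document></root>"""

doc = document_from_xml(SAMPLE)
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.lemma}: {token.pos}")
```

The nested loop at the end mirrors the traversal in the Usage section: a document holds sentences, and each sentence holds tokens.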

NLP Processor

`NLPProcessor`: wrapper around the StanfordNLP jar.

NLPProcessor provides natural language processing of input text files and an API to handle analysis results.

FIXME this should use the Java FFI.

`java_cp`: classpath to give to Java when loading the StanfordNLP jars.

`process`: processes a string and returns a new NLPDocument from it.

`process_file`: processes the input file and returns a new NLPDocument from it.

`process_files`: batch mode.

Returns a map associating each file path with its NLPDocument.

Vector Space Model

`vector`: NLPVector representing self.

`cosine_similarity`: cosine similarity of self and other.

Gives the proximity in the range [0.0 .. 1.0], where 0.0 means that the two vectors are orthogonal and 1.0 means that they point in the same direction (identical vectors being one such case).

var v1 = new NLPVector
v1["x"] = 1
v1["y"] = 2
v1["z"] = 3

var v2 = new NLPVector
v2["x"] = 1
v2["y"] = 2
v2["z"] = 3

var v3 = new NLPVector
v3["a"] = 1
v3["b"] = 2
v3["c"] = 3

print v1.cosine_similarity(v2)
#assert v1.cosine_similarity(v2) == 1.0
print v1.cosine_similarity(v3)
assert v1.cosine_similarity(v3) == 0.0
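
As a rough illustration of the computation behind `cosine_similarity`, here is a minimal sketch in Python over sparse vectors stored as dictionaries. It is not the Nit implementation: the function below and the treatment of empty vectors are assumptions made for the example. It uses the standard definition, the dot product divided by the product of the two Euclidean norms.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts.

    Keys missing from a vector count as 0.
    """
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty (all-zero) vector is orthogonal to everything.
        return 0.0
    return dot / (norm_a * norm_b)


v1 = {"x": 1.0, "y": 2.0, "z": 3.0}
v2 = {"x": 1.0, "y": 2.0, "z": 3.0}
v3 = {"a": 1.0, "b": 2.0, "c": 3.0}

# Same direction: compare with a tolerance rather than `== 1.0`,
# since floating-point rounding can leave the result a hair off 1.0.
assert math.isclose(cosine_similarity(v1, v2), 1.0)

# No shared keys: every product in the dot product is 0.
assert cosine_similarity(v1, v3) == 0.0

# One shared key out of three: 1 / sqrt(3).
partial = cosine_similarity({"sort": 1.0}, {"sort": 1.0, "array": 1.0, "data": 1.0})
print(round(partial, 3))  # prints 0.577
```

The last case shows a partial overlap. If two inputs were reduced to the lemma counts `{sort}` and `{sort, array, data}` (an assumption about how vectors are built, not something this README states), the similarity would be 1/sqrt(3), about 0.577.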

NitNLP binary

The `nitnlp` binary is given as an example of a NitNLP client. It compares two strings and displays their cosine similarity value.

Usage:

~~~raw
nitnlp --cp "/path/to/jars" "sort" "Sorting array data"
0.577
~~~

TODO

* Use JWrapper
* Use options to choose CoreNLP analyzers
* Analyze sentence dependencies
* Analyze sentiment