Natural Language Processing with Java

System.out.println(tagger.tagString("AFAIK she H8 cth!")); 
System.out.println(tagger.tagString( 
    "BTW had a GR8 tym at the party BBIAM."));

mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords

try (InputStream is = new FileInputStream( 
        new File(getModelDir(), "en-token.bin"))){ 
    // Insert code to tokenize the text 
} catch (FileNotFoundException ex) { 
    ... 
} catch (IOException ex) { 
    ... 
} 

TokenizerModel model = new TokenizerModel(is); 
Tokenizer tokenizer = new TokenizerME(model); 

String tokens[] = tokenizer.tokenize("He lives at 1511 W. " 
  + "Randolph."); 

for (String a : tokens) { 
  System.out.print("[" + a + "] "); 
} 
System.out.println(); 

[He] [lives] [at] [1511] [W.] [Randolph] [.]  

PTBTokenizer ptb = new PTBTokenizer( 
new StringReader("He lives at 1511 W. Randolph."), 
new CoreLabelTokenFactory(), null); 
while (ptb.hasNext()) { 
  System.out.println(ptb.next()); 
} 

He
lives
at
1511
W.
Randolph
.  

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 

String text = "A sample sentence processed \nby \tthe " + 
    "LingPipe tokenizer."; 

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE. 
tokenizer(text.toCharArray(), 0, text.length()); 

tokenizer.tokenize(tokenList, whiteList); 

for(String element : tokenList) { 
  System.out.print(element + " "); 
} 
System.out.println(); 

A sample sentence processed by the LingPipe tokenizer

String text = "Mr. Smith went to 123 Washington avenue."; 

String tokens[] = text.split("\\s+"); 

for(String token : tokens) { 
  System.out.println(token); 
} 

Mr.
Smith
went
to
123
Washington
avenue.  

String paragraph = "The first sentence. The second sentence."; 

Reader reader = new StringReader(paragraph); 
DocumentPreprocessor documentPreprocessor =  
new DocumentPreprocessor(reader); 

List<String> sentenceList = new LinkedList<String>(); 

for (List<HasWord> element : documentPreprocessor) { 
  StringBuilder sentence = new StringBuilder(); 
  List<HasWord> hasWordList = element; 
  for (HasWord token : hasWordList) { 
      sentence.append(token).append(" "); 
  } 
  sentenceList.add(sentence.toString()); 
} 

for (String sentence : sentenceList) { 
  System.out.println(sentence); 
} 

The first sentence . 
The second sentence .   

String text = "Mr. Smith went to 123 Washington avenue."; 
String target = "Washington"; 
int index = text.indexOf(target); 
System.out.println(index); 

22
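
`indexOf` reports only the first match. The following is a minimal sketch, using only the JDK, of collecting the offset of every occurrence of a target string. The class and method names are illustrative and do not come from the original listings.

```java
import java.util.ArrayList;
import java.util.List;

class AllOccurrencesSketch {
    // Returns the start offset of every (possibly overlapping)
    // occurrence of target in text.
    static List<Integer> findAll(String text, String target) {
        List<Integer> offsets = new ArrayList<>();
        if (target.isEmpty()) {
            return offsets; // an empty target would match at every position
        }
        int index = text.indexOf(target);
        while (index >= 0) {
            offsets.add(index);
            // Resume the search one character past the last match.
            index = text.indexOf(target, index + 1);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        System.out.println(findAll(text, "Washington"));
    }
}
```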

try { 
    String[] sentences = { 
         "Tim was a good neighbor. Perhaps not as good a Bob " +  
        "Haywood, but still pretty good. Of course Mr. Adam " +  
        "took the cake!"}; 
    // Insert code to find the names here 
} catch (IOException ex) { 
    ex.printStackTrace(); 
}

Tokenizer tokenizer = SimpleTokenizer.INSTANCE; 

TokenNameFinderModel model = new TokenNameFinderModel( 
new File("C:\\OpenNLP Models", "en-ner-person.bin")); 

NameFinderME finder = new NameFinderME(model); 

for (String sentence : sentences) { 
    String[] tokens = tokenizer.tokenize(sentence); 
    Span[] nameSpans = finder.find(tokens); 
    System.out.println(Arrays.toString( 
    Span.spansToStrings(nameSpans, tokens))); 
} 

[Tim, Bob Haywood, Adam]  

POSModel model = new POSModelLoader().load( 
    new File("../OpenNLP Models/", "en-pos-maxent.bin")); 

POSTaggerME tagger = new POSTaggerME(model); 

String sentence = "POS processing is useful for enhancing the "  
   + "quality of data sent to other elements of a pipeline."; 

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(sentence); 

String[] tags = tagger.tag(tokens); 

for(int i=0; i<tokens.length; i++) { 
    System.out.print(tokens[i] + "[" + tags[i] + "] "); 
} 

    POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]  

Properties properties = new Properties();         
properties.put("annotators", "tokenize, ssplit, parse"); 

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties); 

Annotation annotation = new Annotation( 
    "The meaning and purpose of life is plain to see."); 

pipeline.annotate(annotation); 
pipeline.prettyPrint(annotation, System.out); 

    Sentence #1 (11 tokens):
    The meaning and purpose of life is plain to see.
    [Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT] [Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN] [Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC] [Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN] [Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN] [Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN] [Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ] [Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ] [Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO] [Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB] [Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.] 
    (ROOT
      (S
        (NP
          (NP (DT The) (NN meaning)
            (CC and)
            (NN purpose))
          (PP (IN of)
            (NP (NN life))))
        (VP (VBZ is)
          (ADJP (JJ plain)
            (S
              (VP (TO to)
                (VP (VB see))))))
        (. .)))

    root(ROOT-0, plain-8)
    det(meaning-2, The-1)
    nsubj(plain-8, meaning-2)
    conj_and(meaning-2, purpose-4)
    prep_of(meaning-2, life-6)
    cop(plain-8, is-7)
    aux(see-10, to-9)
    xcomp(plain-8, see-10)

prep_of(meaning-2, life-6)  

javac -encoding Big5

Scanner scanner = new Scanner("Let's pause, and then "
    + " reflect."); 
List<String> list = new ArrayList<>(); 
while(scanner.hasNext()) { 
    String token = scanner.next(); 
    list.add(token); 
} 
for(String token : list) { 
    System.out.println(token); 
} 

Let's
pause,
and
then
reflect.

scanner.useDelimiter("[ ,.]"); 

Let's
pause

and
then
reflect  
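
The blank line in the output above is an empty token: the comma and the space after it are treated as two separate delimiters. One possible way to avoid this, sketched here with a hypothetical helper class, is to let the delimiter pattern match a run of delimiter characters (`[ ,.]+`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterSketch {
    // Tokenizes text, treating any run of spaces, commas, and periods
    // as a single delimiter so that no empty tokens are produced.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("[ ,.]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```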

String text = "Mr. Smith went to 123 Washington avenue."; 
String tokens[] = text.split("\\s+"); 
for (String token : tokens) { 
    System.out.println(token); 
} 

Mr.
Smith
went
to
123
Washington
avenue.
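
A caveat with `split("\\s+")`: if the text begins with whitespace, the resulting array starts with an empty string. The sketch below, whose class name is purely illustrative, trims the input first so that only real tokens are returned.

```java
import java.util.Arrays;

class WhitespaceSplitSketch {
    // Splits on runs of whitespace after trimming, so leading blanks
    // do not yield an empty first token.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "  Mr. Smith went to 123 Washington avenue.";
        System.out.println(Arrays.toString(tokenize(text)));
    }
}
```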

BreakIterator wordIterator = BreakIterator.getWordInstance(); 
String text = "Let's pause, and then reflect."; 

wordIterator.setText(text); 
int boundary = wordIterator.first();

while (boundary != BreakIterator.DONE) { 
    int begin = boundary; 
    System.out.print(boundary + "-"); 
    boundary = wordIterator.next(); 
    int end = boundary; 
    if(end == BreakIterator.DONE) break; 
    System.out.println(boundary + " [" 
    + text.substring(begin, end) + "]"); 
} 

0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]  
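
The word iterator reports every boundary, so spaces and punctuation appear as segments of their own. If only the words are wanted, one option is to keep the segments that start with a letter or digit. This is a minimal sketch under that assumption; the class name and the filtering rule are not part of the original listing.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Collects only the segments that begin with a letter or digit,
    // discarding whitespace and punctuation segments.
    static List<String> words(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);
        List<String> words = new ArrayList<>();
        int begin = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            String segment = text.substring(begin, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
            begin = end;
            end = iterator.next();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Let's pause, and then reflect."));
    }
}
```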

try { 
    StreamTokenizer tokenizer = new StreamTokenizer( 
          new StringReader("Let's pause, and then reflect.")); 
    boolean isEOF = false; 
    while (!isEOF) { 
        int token = tokenizer.nextToken(); 
        switch (token) { 
            case StreamTokenizer.TT_EOF: 
                isEOF = true; 
                break; 
            case StreamTokenizer.TT_EOL: 
                break; 
            case StreamTokenizer.TT_WORD: 
                System.out.println(tokenizer.sval); 
                break; 
            case StreamTokenizer.TT_NUMBER: 
                System.out.println(tokenizer.nval); 
                break; 
            default: 
                System.out.println((char) token); 
        } 
    } 
} catch (IOException ex) { 
    // Handle the exception 
} 

Let
'  

tokenizer.ordinaryChar('\''); 
tokenizer.ordinaryChar(','); 

Let
'
s
pause
,
and
then
reflect.  
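
For reference, the two fragments above can be combined into one self-contained sketch that collects the tokens into a list. The class name is illustrative, and wrapping `IOException` in an unchecked exception is a choice made here for brevity rather than anything the original prescribes.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {
    // Tokenizes text with StreamTokenizer, treating the apostrophe and
    // the comma as ordinary single-character tokens.
    static List<String> tokenize(String text) {
        StreamTokenizer tokenizer =
            new StreamTokenizer(new StringReader(text));
        tokenizer.ordinaryChar('\'');
        tokenizer.ordinaryChar(',');
        List<String> tokens = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken())
                    != StreamTokenizer.TT_EOF) {
                switch (token) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add(tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add(String.valueOf(tokenizer.nval));
                        break;
                    case StreamTokenizer.TT_EOL:
                        break;
                    default:
                        // Ordinary characters are returned as their char value.
                        tokens.add(String.valueOf((char) token));
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("Let's pause, and then reflect.")
            .forEach(System.out::println);
    }
}
```

Note that the final period stays attached to `reflect.` because `StreamTokenizer` treats `.` as a numeric character by default, and numeric characters are allowed inside words.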

StringTokenizer st = new StringTokenizer("Let's pause, and "
     + "then reflect."); 
while (st.hasMoreTokens()) { 
    System.out.println(st.nextToken()); 
}

Let's
pause,
and
then
reflect.

private String paragraph = "Let's pause, \nand then "
     + "reflect.";

SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE; 
String tokens[] = simpleTokenizer.tokenize(paragraph); 
for(String token : tokens) { 
    System.out.println(token); 
} 

    Let
    '
    s
    pause
    ,
    and
    then
    reflect
    .  

String tokens[] = 
 WhitespaceTokenizer.INSTANCE.tokenize(paragraph); 
for (String token : tokens) { 
    System.out.println(token); 
} 

    Let's
    pause,
    and
    then
    reflect.  

try { 
    InputStream modelInputStream = new FileInputStream( 
        new File(getModelDir(), "en-token.bin")); 
    TokenizerModel model = new 
         TokenizerModel(modelInputStream); 
    Tokenizer tokenizer = new TokenizerME(model); 
    String tokens[] = tokenizer.tokenize(paragraph); 
    for (String token : tokens) { 
        System.out.println(token); 
    } 
} catch (IOException ex) { 
    // Handle the exception 
} 

Let
's
pause
,
and
then
reflect
.  

PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph), 
    new CoreLabelTokenFactory(), null); 
while (ptb.hasNext()) { 
    System.out.println(ptb.next()); 
} 

Let
's
pause
,
and
then
reflect
.  

PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph), new WordTokenFactory(), null);

CoreLabelTokenFactory ctf = new CoreLabelTokenFactory(); 
PTBTokenizer ptb = new PTBTokenizer( 
    new StringReader(paragraph),ctf,"invertible=true"); 
while (ptb.hasNext()) { 
    CoreLabel cl = (CoreLabel)ptb.next(); 
    System.out.println(cl.originalText() + " (" +  
        cl.beginPosition() + "-" + cl.endPosition() + ")"); 
} 

Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)  

Reader reader = new StringReader(paragraph);

DocumentPreprocessor documentPreprocessor = 
      new DocumentPreprocessor(reader); 

Iterator<List<HasWord>> it = documentPreprocessor.iterator(); 
while (it.hasNext()) { 
    List<HasWord> sentence = it.next(); 
    for (HasWord token : sentence) { 
        System.out.println(token); 
    } 
} 

Let
's
pause
,
and
then
reflect
.  

Properties properties = new Properties(); 
properties.put("annotators", "tokenize, ssplit");

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties); 
Annotation annotation = new Annotation(paragraph); 

pipeline.annotate(annotation); 
pipeline.prettyPrint(annotation, System.out); 

    Sentence #1 (8 tokens):
    Let's pause, 
    and then reflect.
    [Text=Let CharacterOffsetBegin=0 CharacterOffsetEnd=3] [Text='s CharacterOffsetBegin=3 CharacterOffsetEnd=5] [Text=pause CharacterOffsetBegin=6 CharacterOffsetEnd=11] [Text=, CharacterOffsetBegin=11 CharacterOffsetEnd=12] [Text=and CharacterOffsetBegin=14 CharacterOffsetEnd=17] [Text=then CharacterOffsetBegin=18 CharacterOffsetEnd=22] [Text=reflect CharacterOffsetBegin=23 CharacterOffsetEnd=30] [Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31]

char text[] = paragraph.toCharArray(); 
TokenizerFactory tokenizerFactory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
Tokenizer tokenizer = tokenizerFactory.tokenizer(text, 0, 
 text.length); 
for (String token : tokenizer) { 
    System.out.println(token); 
}

Let
'
s
pause
,
and
then
reflect
.  

These fields are used to provide further information about how tokens should be identified<SPLIT>.  
They can help identify breaks between numbers<SPLIT>, such as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>. 

BufferedOutputStream modelOutputStream = null; 
try { 
    ... 
} catch (UnsupportedEncodingException ex) { 
    // Handle the exception 
} catch (IOException ex) { 
    // Handle the exception 
} 

ObjectStream<String> lineStream = new PlainTextByLineStream( 
    new FileInputStream("training-data.train"), "UTF-8"); 
ObjectStream<TokenSample> sampleStream =  
    new TokenSampleStream(lineStream); 

TokenizerModel model = TokenizerME.train( 
    "en", sampleStream, true, 5, 100);

BufferedOutputStream modelOutputStream = new 
 BufferedOutputStream( 
    new FileOutputStream(new File("mymodel.bin"))); 
model.serialize(modelOutputStream); 

    Indexing events using cutoff of 5

    Dropped event F:[p=2, s=3.6,, p1=2, p1_num, p2=bok, p1f1=23, f1=3, f1_num, f2=., f2_eos, f12=3.]
    Dropped event F:[p=23, s=.6,, p1=3, p1_num, p2=2, p2_num, p21=23, p1f1=3., f1=., f1_eos, f2=6, f2_num, f12=.6]
    Dropped event F:[p=23., s=6,, p1=., p1_eos, p2=3, p2_num, p21=3., p1f1=.6, f1=6, f1_num, f2=,, f12=6,]
      Computing event counts...  done. 27 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 23 events to 4.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 4
          Number of Outcomes: 2
        Number of Predicates: 4
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ...loglikelihood=-15.942385152878742  0.8695652173913043
      2:  ...loglikelihood=-9.223608340603953  0.8695652173913043
      3:  ...loglikelihood=-8.222154969329086  0.8695652173913043
      4:  ...loglikelihood=-7.885816898591612  0.8695652173913043
      5:  ...loglikelihood=-7.674336804488621  0.8695652173913043
      6:  ...loglikelihood=-7.494512270303332  0.8695652173913043
    Dropped event T:[p=23.6, s=,, p1=6, p1_num, p2=., p2_eos, p21=.6, p1f1=6,, f1=,, f2=bok]
      7:  ...loglikelihood=-7.327098298508153  0.8695652173913043
      8:  ...loglikelihood=-7.1676028756216965  0.8695652173913043
      9:  ...loglikelihood=-7.014728408489079  0.8695652173913043
    ...
    100:  ...loglikelihood=-2.3177060257465376  1.0

try { 
    paragraph = "A demonstration of how to train a " 
        + "tokenizer."; 
    InputStream modelIn = new FileInputStream(new File( 
        ".", "mymodel.bin")); 
    TokenizerModel model = new TokenizerModel(modelIn); 
    Tokenizer tokenizer = new TokenizerME(model); 
    String tokens[] = tokenizer.tokenize(paragraph); 
    for (String token : tokens) { 
        System.out.println(token); 
    } 
} catch (IOException ex) { 
    ex.printStackTrace(); 
} 

A
demonstration
of
how
to
train
a
tokenizer
.

String text = "A Sample string with acronyms, IBM, and UPPER " 
   + "and lowercase letters."; 
String result = text.toLowerCase(); 
System.out.println(result); 

    a sample string with acronyms, ibm, and upper and lowercase letters.

public class StopWords { 

    private String[] defaultStopWords = {"i", "a", "about", "an", 
       "are", "as", "at", "be", "by", "com", "for", "from", "how", 
       "in", "is", "it", "of", "on", "or", "that", "the", "this", 
       "to", "was", "what", "when", "where", "who", "will", "with"}; 

    private static HashSet<String> stopWords = new HashSet<>(); 
    ... 
} 

public StopWords() { 
    stopWords.addAll(Arrays.asList(defaultStopWords)); 
} 

public StopWords(String fileName) { 
    try { 
        BufferedReader bufferedreader =  
                new BufferedReader(new FileReader(fileName)); 
        while (bufferedreader.ready()) { 
            stopWords.add(bufferedreader.readLine()); 
        } 
    } catch (IOException ex) { 
        ex.printStackTrace(); 
    } 
}

public void addStopWord(String word) { 
    stopWords.add(word); 
}

public String[] removeStopWords(String[] words) { 
    ArrayList<String> tokens =  
        new ArrayList<String>(Arrays.asList(words)); 
    // Iterate backward so that removing an element does not 
    // shift an unexamined token past the loop index. 
    for (int i = tokens.size() - 1; i >= 0; i--) { 
        if (stopWords.contains(tokens.get(i))) { 
            tokens.remove(i); 
        } 
    } 
    return tokens.toArray(new String[tokens.size()]); 
} 

StopWords stopWords = new StopWords(); 
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE; 
paragraph = "A simple approach is to create a class " 
    + "to hold and remove stopwords."; 

String tokens[] = simpleTokenizer.tokenize(paragraph); 
String list[] = stopWords.removeStopWords(tokens); 
for (String word : list) { 
    System.out.println(word); 
}

A
simple
approach
create
class
hold
remove
stopwords
.  
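
When elements are removed from a list by index inside a loop, adjacent stop words need care, because each removal shifts the remaining tokens. The sketch below sidesteps the issue with `Collection.removeIf`. It is a self-contained illustration, not the `StopWords` class itself: the class name and the varargs constructor are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    private final Set<String> stopWords;

    // Hypothetical constructor: takes the stop words directly.
    StopWordFilterSketch(String... words) {
        stopWords = new HashSet<>(Arrays.asList(words));
    }

    // Returns the tokens that are not stop words (case-sensitive),
    // preserving their original order.
    String[] removeStopWords(String[] tokens) {
        List<String> kept = new ArrayList<>(Arrays.asList(tokens));
        kept.removeIf(stopWords::contains);
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        StopWordFilterSketch filter =
            new StopWordFilterSketch("is", "to", "a");
        String[] tokens = {"approach", "is", "to", "create", "a", "class"};
        System.out.println(
            Arrays.toString(filter.removeStopWords(tokens)));
    }
}
```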

String paragraph = "A simple approach is to create a class "  
    + "to hold and remove stopwords."; 

TokenizerFactory factory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
factory = new EnglishStopTokenizerFactory(factory); 

Tokenizer tokenizer = factory.tokenizer(paragraph.toCharArray(), 
   0, paragraph.length());

for (String token : tokenizer) { 
    System.out.println(token); 
} 

A
simple
approach
create
class
hold
remove
stopwords
.  

String words[] = {"bank", "banking", "banks", "banker", "banked", 
     "bankart"}; 
PorterStemmer ps = new PorterStemmer(); 
for(String word : words) { 
    String stem = ps.stem(word); 
    System.out.println("Word: " + word + "  Stem: " + stem); 
} 

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart  

TokenizerFactory tokenizerFactory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
TokenizerFactory porterFactory =  
    new PorterStemmerTokenizerFactory(tokenizerFactory); 

String[] stems = new String[words.length]; 
for (int i = 0; i < words.length; i++) { 
    Tokenization tokenizer = new Tokenization(words[i],porterFactory); 
    stems = tokenizer.tokens(); 
    System.out.print("Word: " + words[i]); 
    for (String stem : stems) { 
        System.out.println("  Stem: " + stem); 
    } 
} 

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart  

StanfordCoreNLP pipeline; 
Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos, lemma"); 
pipeline = new StanfordCoreNLP(props);

String paragraph = "Similar to stemming is Lemmatization. "  
    + "This is the process of finding its lemma, its form "  
    + "as found in a dictionary."; 
Annotation document = new Annotation(paragraph); 
pipeline.annotate(document); 

List<CoreMap> sentences = 
     document.get(SentencesAnnotation.class); 
List<String> lemmas = new LinkedList<>(); 

for (CoreMap sentence : sentences) { 
    for (CoreLabel word : sentence.get(TokensAnnotation.class)) { 
        lemmas.add(word.get(LemmaAnnotation.class)); 
    } 
} 

System.out.print("[");

for (String element : lemmas) { 
    System.out.print(element + " "); 
} 
System.out.println("]"); 

    [similar to stem be lemmatization . this be the process of find its lemma , its form as find in a dictionary . ]

    Similar to stemming is Lemmatization. This is the process of finding its lemma, its form as found in a dictionary. 

try { 
    dictionary = new JWNLDictionary("...\\dict\\"); 
    paragraph = "Eat, drink, and be merry, for life is but a dream"; 
    ... 
} catch (IOException | JWNLException ex) { 
    // Handle the exception 
}

String tokens[] = 
     WhitespaceTokenizer.INSTANCE.tokenize(paragraph); 
for (String token : tokens) { 
    String[] lemmas = dictionary.getLemmas(token, ""); 
    for (String lemma : lemmas) { 
        System.out.println("Token: " + token + "  Lemma: " 
             + lemma); 
    } 
} 

Token: Eat,  Lemma: at
Token: drink,  Lemma: drink
Token: be  Lemma: be
Token: life  Lemma: life
Token: is  Lemma: is
Token: is  Lemma: i
Token: a  Lemma: a
Token: dream  Lemma: dream  

paragraph = "A simple approach is to create a class " 
     + "to hold and remove stopwords."; 
TokenizerFactory factory = 
     IndoEuropeanTokenizerFactory.INSTANCE; 
factory = new LowerCaseTokenizerFactory(factory); 
factory = new EnglishStopTokenizerFactory(factory); 
factory = new PorterStemmerTokenizerFactory(factory); 
Tokenizer tokenizer = 
     factory.tokenizer(paragraph.toCharArray(), 0, 
     paragraph.length()); 
for (String token : tokenizer) { 
    System.out.println(token); 
} 

simpl
approach
creat
class

hold
remov
stopword
.  

private static String paragraph = "When determining the end of sentences " 
    + "we need to consider several factors. Sentences may end with " 
    + "exclamation marks! Or possibly questions marks? Within " 
    + "sentences we may find numbers like 3.14159, abbreviations " 
    + "such as found in Mr. Smith, and possibly ellipses either " 
    + "within a sentence ..., or at the end of a sentence..."; 

String simple = "[.?!]"; 
String[] splitString = (paragraph.split(simple)); 
for (String string : splitString) { 
    System.out.println(string); 
}

    When determining the end of sentences we need to consider several factors
     Sentences may end with exclamation marks
     Or possibly questions marks
     Within sentences we may find numbers like 3
    14159, abbreviations such as found in Mr
     Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...

    [^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)

Pattern sentencePattern = Pattern.compile( 
    "# Match a sentence ending in punctuation or EOS.\n" 
    + "[^.!?\\s]    # First char is non-punct, non-ws\n" 
    + "[^.!?]*      # Greedily consume up to punctuation.\n" 
    + "(?:          # Group for unrolling the loop.\n" 
    + "  [.!?]      # (special) inner punctuation ok if\n" 
    + "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n" 
    + "  [^.!?]*    # Greedily consume up to punctuation.\n" 
    + ")*           # Zero or more (special normal*)\n" 
    + "[.!?]?       # Optional ending punctuation.\n" 
    + "['\"]?       # Optional closing quote.\n" 
    + "(?=\\s|$)", 
    Pattern.MULTILINE | Pattern.COMMENTS); 

Matcher matcher = sentencePattern.matcher(paragraph); 
while (matcher.find()) { 
    System.out.println(matcher.group()); 
} 

    When determining the end of sentences we need to consider several factors.
    Sentences may end with exclamation marks!
    Or possibly questions marks?
    Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr.
    Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...
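
As the output shows, the expression keeps `3.14159` intact but still splits after `Mr.`. For experimentation, here is the same expression written on one line (without `Pattern.COMMENTS` or `Pattern.MULTILINE`) inside a small helper that returns the matches as a list. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceSketch {
    // A sentence starts with a non-punctuation, non-whitespace character
    // and runs up to punctuation that is followed by whitespace or the
    // end of the input; punctuation followed by other text is kept inside.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*"
        + "[.!?]?['\"]?(?=\\s|$)");

    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        sentences("First one. Second one! Pi is 3.14 today.")
            .forEach(System.out::println);
    }
}
```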

BreakIterator sentenceIterator = 
 BreakIterator.getSentenceInstance(); 

Locale currentLocale = new Locale("en", "US"); 
BreakIterator sentenceIterator =  
    BreakIterator.getSentenceInstance(currentLocale); 

sentenceIterator.setText(paragraph); 

int boundary = sentenceIterator.first(); 
while (boundary != BreakIterator.DONE) { 
    int begin = boundary; 
    System.out.print(boundary + "-"); 
    boundary = sentenceIterator.next(); 
    int end = boundary; 
    if (end == BreakIterator.DONE) { 
        break; 
    } 
    System.out.println(boundary + " [" 
        + paragraph.substring(begin, end) + "]"); 
} 

    0-75 [When determining the end of sentences we need to consider several factors. ]
    75-117 [Sentences may end with exclamation marks! ]
    117-146 [Or possibly questions marks? ]
    146-233 [Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. ]
    233-319 [Smith, and possibly ellipses either within a sentence ... , or at the end of a sentence...]
    319-
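
The trailing `319-` appears because the start offset is printed before the loop detects `BreakIterator.DONE`. A variant that collects the sentences into a list, and trims the trailing space each segment carries, might look like the following sketch. The class name is an assumption made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBreakSketch {
    // Splits text at the boundaries reported by the JDK sentence
    // iterator and returns the trimmed, non-empty segments.
    static List<String> sentences(String text) {
        BreakIterator iterator =
            BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int begin = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            String sentence = text.substring(begin, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
            begin = end;
            end = iterator.next();
        }
        return result;
    }

    public static void main(String[] args) {
        sentences("The first sentence. The second sentence.")
            .forEach(System.out::println);
    }
}
```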

try (InputStream is = new FileInputStream( 
        new File(getModelDir(), "en-sent.bin"))) { 
    SentenceModel model = new SentenceModel(is); 
    SentenceDetectorME detector = new SentenceDetectorME(model); 
    String sentences[] = detector.sentDetect(paragraph); 
    for (String sentence : sentences) { 
        System.out.println(sentence); 
    } 
} catch (FileNotFoundException ex) { 
    // Handle exception 
} catch (IOException ex) { 
    // Handle exception 
}

    When determining the end of sentences we need to consider several factors.
    Sentences may end with exclamation marks!
    Or possibly questions marks?
    Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...

paragraph = " This sentence starts with spaces and ends with "  
    + "spaces . This sentence has no spaces between the next " 
    + "one.This is the next one."; 

    This sentence starts with spaces and ends with spaces  .
    This sentence has no spaces between the next one.This is the next one.

double probabilities[] = detector.getSentenceProbabilities(); 
for (double probability : probabilities) { 
    System.out.println(probability); 
} 

    0.9841708738988814
    0.908052385070974
    0.9130082376342675
    1.0

Span spans[] = detector.sentPosDetect(paragraph); 
for (Span span : spans) { 
    System.out.println(span); 
} 

    [0..74)
    [75..116)
    [117..145)
    [146..317)  

for (Span span : spans) { 
    System.out.println(span + "[" + paragraph.substring( 
        span.getStart(), span.getEnd()) +"]"); 
} 

    [0..74)[When determining the end of sentences we need to consider several factors.]
    [75..116)[Sentences may end with exclamation marks!]
    [117..145)[Or possibly questions marks?]
    [146..317)[Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...]

PTBTokenizer ptb = new PTBTokenizer(new StringReader(paragraph), 
     new CoreLabelTokenFactory(), null); 

WordToSentenceProcessor wtsp = new WordToSentenceProcessor(); 
List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());

for (List<CoreLabel> sent : sents) { 
    System.out.println(sent); 
} 

    [When, determining, the, end, of, sentences, we, need, to, consider, several, factors, .]
    [Sentences, may, end, with, exclamation, marks, !]
    [Or, possibly, questions, marks, ?]
    [Within, sentences, we, may, find, numbers, like, 3.14159, ,, abbreviations, such, as, found, in, Mr., Smith, ,, and, possibly, ellipses, either, within, a, sentence, ..., ,, or, at, the, end, of, a, sentence, ...]  

for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element + " "); 
     } 
    System.out.println(); 
} 

    When determining the end of sentences we need to consider several factors . 
    Sentences may end with exclamation marks ! 
    Or possibly questions marks ? 
    Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. Smith , and possibly ellipses either within a sentence ... , or at the end of a sentence ... 

for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element.endPosition() + " "); 
     } 
    System.out.println(); 
} 

    4 16 20 24 27 37 40 45 48 57 65 73 74 
    84 88 92 97 109 115 116 
    119 128 138 144 145 
    152 162 165 169 174 182 187 195 196 210 215 218 224 227 231 237 238 242 251 260 267 274 276 285 287 288 291 294 298 302 305 307 316 317

for (List<CoreLabel> sent : sents) { 
    System.out.println(sent.get(0) + " "  
        + sent.get(0).beginPosition()); 
} 

    When 0
    Sentences 75
    Or 117
    Within 146

for (List<CoreLabel> sent : sents) { 
    int size = sent.size(); 
    System.out.println(sent.get(size-1) + " "  
        + sent.get(size-1).endPosition()); 
} 

    . 74
    ! 116
    ? 145
    ... 317  

"americanize=true,normalizeFractions=true,asciiQuotes=true".

paragraph = "The colour of money is green. Common fraction " 
    + "characters such as ½  are converted to the long form 1/2. " 
    + "Quotes such as “cat” are converted to their simpler form."; 
ptb = new PTBTokenizer( 
    new StringReader(paragraph), new CoreLabelTokenFactory(), 
    "americanize=true,normalizeFractions=true,asciiQuotes=true"); 
wtsp = new WordToSentenceProcessor(); 
sents = wtsp.process(ptb.tokenize()); 
for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element + " "); 
    } 
    System.out.println(); 
} 

    The color of money is green . 
    Common fraction characters such as 1/2 are converted to the long form 1/2 . 
    Quotes such as " cat " are converted to their simpler form . 

Reader reader = new StringReader(paragraph); 
DocumentPreprocessor dp = new DocumentPreprocessor(reader); 
for (List sentence : dp) { 
    System.out.println(sentence); 
} 

    [When, determining, the, end, of, sentences, we, need, to, consider, several, factors, .]
    [Sentences, may, end, with, exclamation, marks, !]
    [Or, possibly, questions, marks, ?]
    [Within, sentences, we, may, find, numbers, like, 3.14159, ,, abbreviations, such, as, found, in, Mr., Smith, ,, and, possibly, ellipses, either, within, a, sentence, ..., ,, or, at, the, end, of, a, sentence, ...]  

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet type="text/xsl"?> 
<document> 
    <sentences> 
        <sentence id="1"> 
            <word>When</word> 
            <word>the</word> 
            <word>day</word> 
            <word>is</word> 
            <word>done</word> 
            <word>we</word> 
            <word>can</word> 
            <word>sleep</word> 
            <word>.</word> 
        </sentence> 
        <sentence id="2"> 
            <word>When</word> 
            <word>the</word> 
            <word>morning</word> 
            <word>comes</word> 
            <word>we</word> 
            <word>can</word> 
            <word>wake</word> 
            <word>.</word> 
        </sentence> 
        <sentence id="3"> 
            <word>After</word> 
            <word>that</word> 
            <word>who</word> 
            <word>knows</word> 
            <word>.</word> 
        </sentence> 
    </sentences> 
</document> 

try { 
    Reader reader = new FileReader("XMLText.xml"); 
    DocumentPreprocessor dp = new DocumentPreprocessor( 
        reader, DocumentPreprocessor.DocType.XML); 
    dp.setElementDelimiter("sentence"); 
    for (List sentence : dp) { 
        System.out.println(sentence); 
    } 
} catch (FileNotFoundException ex) { 
    // Handle exception 
} 

    [When, the, day, is, done, we, can, sleep, .] 
    [When, the, morning, comes, we, can, wake, .]
    [After, that, who, knows, .]  

for (List sentence : dp) { 
    ListIterator list = sentence.listIterator(); 
     while (list.hasNext()) { 
        System.out.print(list.next() + " "); 
    } 
    System.out.println(); 
} 

    When the day is done we can sleep . 
    When the morning comes we can wake . 
    After that who knows . 

    [When]
    [the]
    [day]
    [is]
    [done]
    ...
    [who]
    [knows]
    [.]

Properties properties = new Properties(); 
properties.put("annotators", "tokenize, ssplit"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(properties); 
Annotation annotation = new Annotation(paragraph); 
pipeline.annotate(annotation); 

    Sentence #1 (13 tokens):
    When determining the end of sentences we need to consider several factors.
    [Text=When CharacterOffsetBegin=0 CharacterOffsetEnd=4] [Text=determining CharacterOffsetBegin=5 CharacterOffsetEnd=16] [Text=the CharacterOffsetBegin=17 CharacterOffsetEnd=20] [Text=end CharacterOffsetBegin=21 CharacterOffsetEnd=24] [Text=of CharacterOffsetBegin=25 CharacterOffsetEnd=27] [Text=sentences CharacterOffsetBegin=28 CharacterOffsetEnd=37] [Text=we CharacterOffsetBegin=38 CharacterOffsetEnd=40] [Text=need CharacterOffsetBegin=41 CharacterOffsetEnd=45] [Text=to CharacterOffsetBegin=46 CharacterOffsetEnd=48] [Text=consider CharacterOffsetBegin=49 CharacterOffsetEnd=57] [Text=several CharacterOffsetBegin=58 CharacterOffsetEnd=65] [Text=factors CharacterOffsetBegin=66 CharacterOffsetEnd=73] [Text=. CharacterOffsetBegin=73 CharacterOffsetEnd=74] 

try { 
    pipeline.xmlPrint(annotation, System.out); 
} catch (IOException ex) { 
    // Handle exception 
}

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?> 
<root> 
  <document> 
    <sentences> 
      <sentence id="1"> 
        <tokens> 
          <token id="1"> 
            <word>When</word> 
            <CharacterOffsetBegin>0</CharacterOffsetBegin> 
            <CharacterOffsetEnd>4</CharacterOffsetEnd> 
          </token> 
... 
         <token id="34"> 
            <word>...</word> 
            <CharacterOffsetBegin>316</CharacterOffsetBegin> 
            <CharacterOffsetEnd>317</CharacterOffsetEnd> 
          </token> 
        </tokens> 
      </sentence> 
    </sentences> 
  </document> 
</root> 

TokenizerFactory TOKENIZER_FACTORY= 
 IndoEuropeanTokenizerFactory.INSTANCE; 
com.aliasi.sentences.SentenceModel sentenceModel = new IndoEuropeanSentenceModel(); 

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 
Tokenizer tokenizer= TOKENIZER_FACTORY.tokenizer( 
    paragraph.toCharArray(),0, paragraph.length()); 
tokenizer.tokenize(tokenList, whiteList);

String[] tokens = new String[tokenList.size()]; 
String[] whites = new String[whiteList.size()]; 
tokenList.toArray(tokens); 
whiteList.toArray(whites); 

int[] sentenceBoundaries= 
 sentenceModel.boundaryIndices(tokens, whites); 
for(int boundary : sentenceBoundaries) { 
    System.out.println(boundary); 
} 

    12
    19
    24  

int start = 0; 
for(int boundary : sentenceBoundaries) { 
    while(start<=boundary) { 
        System.out.print(tokenList.get(start) 
     + whiteList.get(start+1)); 
        start++; 
    } 
    System.out.println(); 
} 

    When determining the end of sentences we need to consider several factors. 
    Sentences may end with exclamation marks! 
    Or possibly questions marks?
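
The loop above pairs each token with the whitespace entry that follows it. The sketch below reproduces that indexing with plain arrays so it can be run without LingPipe. It assumes the whitespace array has one more entry than the token array, with entry 0 preceding the first token. The class name, method name, and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundarySentences {

    // Rebuilds one string per sentence from token/whitespace arrays and
    // the indices of the sentence-final tokens.
    // Assumption: whites.length == tokens.length + 1.
    public static List<String> rebuild(String[] tokens, String[] whites,
            int[] boundaries) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int boundary : boundaries) {
            StringBuilder sb = new StringBuilder();
            while (start <= boundary) {
                // whites[start + 1] is the whitespace after tokens[start]
                sb.append(tokens[start]).append(whites[start + 1]);
                start++;
            }
            sentences.add(sb.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hand-written sample data, not produced by a real tokenizer
        String[] tokens = {"Hello", "world", ".", "Bye", "!"};
        String[] whites = {"", " ", "", " ", "", ""};
        int[] boundaries = {2, 4};
        for (String sentence : rebuild(tokens, whites, boundaries)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```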

    When determining the end of sentences we need to consider several factors. 
    Sentences may end with exclamation marks! 
    Or possibly questions marks? 
    Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence....

TokenizerFactory tokenizerfactory = 
 IndoEuropeanTokenizerFactory.INSTANCE; 
SentenceModel sentenceModel = new IndoEuropeanSentenceModel(); 

SentenceChunker sentenceChunker =  
    new SentenceChunker(tokenizerfactory, sentenceModel); 

Chunking chunking = sentenceChunker.chunk( 
    paragraph.toCharArray(),0, paragraph.length()); 

Set<Chunk> sentences = chunking.chunkSet(); 
String slice = chunking.charSequence().toString();

for (Chunk sentence : sentences) { 
    System.out.println("[" + slice.substring(sentence.start(), 
       sentence.end()) + "]"); 
} 

    [When determining the end of sentences we need to consider several factors.]
    [Sentences may end with exclamation marks!]
    [Or possibly questions marks?]
    [Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence....]

paragraph = "HepG2 cells were obtained from the American Type Culture " 
    + "Collection (Rockville, MD, USA) and were used only until " 
    + "passage 30. They were routinely grown at 37°C in Dulbecco's " 
    + "modified Eagle's medium (DMEM) containing 10 % fetal bovine " 
    + "serum (FBS), 2 mM glutamine, 1 mM sodium pyruvate, and 25 " 
    + "mM glucose (Invitrogen, Carlsbad, CA, USA) in a humidified " 
    + "atmosphere containing 5% CO2. For precursor and 13C-sugar " 
    + "experiments, tissue culture treated polystyrene 35 mm " 
    + "dishes (Corning Inc, Lowell, MA, USA) were seeded with 2 " 
    + "× 106 cells and grown to confluency in DMEM."; 

TokenizerFactory tokenizerfactory = 
    IndoEuropeanTokenizerFactory.INSTANCE; 
MedlineSentenceModel sentenceModel = new MedlineSentenceModel(); 
SentenceChunker sentenceChunker = 
    new SentenceChunker(tokenizerfactory, sentenceModel); 
Chunking chunking = sentenceChunker.chunk( 
    paragraph.toCharArray(), 0, paragraph.length()); 
Set<Chunk> sentences = chunking.chunkSet(); 
String slice = chunking.charSequence().toString(); 
for (Chunk sentence : sentences) { 
    System.out.println("[" 
        + slice.substring(sentence.start(), sentence.end()) 
        + "]"); 
} 

    [HepG2 cells were obtained from the American Type Culture Collection (Rockville, MD, USA) and were used only until passage 30.]
    [They were routinely grown at 37°C in Dulbecco's modified Eagle's medium (DMEM) containing 10 % fetal bovine serum (FBS), 2 mM glutamine, 1 mM sodium pyruvate, and 25 mM glucose (Invitrogen, Carlsbad, CA, USA) in a humidified atmosphere containing 5% CO2.]
    [For precursor and 13C-sugar experiments, tissue culture treated polystyrene 35 mm dishes (Corning Inc, Lowell, MA, USA) were seeded with 2 × 106 cells and grown to confluency in DMEM.] 

try { 
    ObjectStream<String> lineStream = new PlainTextByLineStream( 
        new FileReader("sentence.train")); 
    ObjectStream<SentenceSample> sampleStream 
        = new SentenceSampleStream(lineStream); 
    ... 
} catch (FileNotFoundException ex) { 
    ex.printStackTrace(); 
    // Handle exception 
} catch (IOException ex) { 
    ex.printStackTrace(); 
    // Handle exception 
} 

SentenceModel model = SentenceDetectorME.train("en", 
     sampleStream, true, 
    null, TrainingParameters.defaultParams());

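// try-with-resources flushes and closes the stream after serialization 
try (OutputStream modelStream = new BufferedOutputStream( 
        new FileOutputStream("modelFile"))) { 
    model.serialize(modelStream); 
} catch (IOException ex) { 
    // Handle exception 
} 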

    Indexing events using cutoff of 5

        Computing event counts...  done. 93 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 93 events to 63.
    Done indexing.
    Incorporating indexed data for training...  
    done.
        Number of Event Tokens: 63
            Number of Outcomes: 2
          Number of Predicates: 21
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-64.4626877920749    0.9032258064516129
      2:  ... loglikelihood=-31.11084296202819    0.9032258064516129
      3:  ... loglikelihood=-26.418795734248626    0.9032258064516129
      4:  ... loglikelihood=-24.327956749903198    0.9032258064516129
      5:  ... loglikelihood=-22.766489585258565    0.9032258064516129
      6:  ... loglikelihood=-21.46379347841989    0.9139784946236559
      7:  ... loglikelihood=-20.356036369911394    0.9139784946236559
      8:  ... loglikelihood=-19.406935608514992    0.9139784946236559
      9:  ... loglikelihood=-18.58725539754483    0.9139784946236559
     10:  ... loglikelihood=-17.873030559849326    0.9139784946236559
     ...
     99:  ... loglikelihood=-7.214933901940582    0.978494623655914
    100:  ... loglikelihood=-7.183774954664058    0.978494623655914

try (InputStream is = new FileInputStream( 
        new File(getModelDir(), "modelFile"))) { 
    SentenceModel model = new SentenceModel(is); 
    SentenceDetectorME detector = new SentenceDetectorME(model); 
    String sentences[] = detector.sentDetect(paragraph); 
    for (String sentence : sentences) { 
        System.out.println(sentence); 
    } 
} catch (FileNotFoundException ex) { 
    // Handle exception 
} catch (IOException ex) { 
    // Handle exception 
} 

    When determining the end of sentences we need to consider several factors.
    Sentences may end with exclamation marks! Or possibly questions marks?
    Within sentences we may find numbers like 3.14159,
    abbreviations such as found in Mr.
    Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...

lineStream = new PlainTextByLineStream(
     new FileReader("evalSample")); 
sampleStream = new SentenceSampleStream(lineStream); 

SentenceDetectorEvaluator sentenceDetectorEvaluator 
    = new SentenceDetectorEvaluator(detector, null); 
sentenceDetectorEvaluator.evaluate(sampleStream); 

System.out.println(sentenceDetectorEvaluator.getFMeasure()); 

    Precision: 0.8181818181818182
    Recall: 0.9
    F-Measure: 0.8571428571428572

private static String regularExpressionText 
    = "He left his email address (rgb@colorworks.com) and his " 
    + "phone number,800-555-1234. We believe his current address " 
    + "is 100 Washington Place, Seattle, CO 12345-1234. I " 
    + "understand you can also call at 123-555-1234 between " 
    + "8:00 AM and 4:30 most days. His URL is http://example.com " 
    + "and he was born on February 25, 1954 or 2/25/1954.";

String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}"; 

Pattern pattern = Pattern.compile(phoneNumberRE); 
Matcher matcher = pattern.matcher(regularExpressionText); 
while (matcher.find()) { 
    System.out.println(matcher.group() + " [" + matcher.start() 
        + ":" + matcher.end() + "]"); 
} 

    800-555-1234 [68:80]
    123-555-1234 [196:208]

regularExpressionText =  
    "(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S"; 

    888-555-2222 [27:39]
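
The match above is a false positive: `888-555-2222` is only the front part of the longer string `888-555-2222-J88-W3S`. One possible way to reject such embedded matches is to wrap the same pattern in lookarounds. The following is a minimal sketch using only `java.util.regex`. The class name `StrictPhoneFinder` and the stricter pattern are illustrative assumptions, not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictPhoneFinder {

    // Hypothetical stricter variant of phoneNumberRE: the lookbehind and
    // lookahead reject a match that touches a word character or a hyphen.
    private static final Pattern STRICT_PHONE = Pattern.compile(
        "(?<![\\w-])\\d{3}-\\d{3}-\\d{4}(?![\\w-])");

    // Returns each match followed by its [start:end] offsets.
    public static List<String> find(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = STRICT_PHONE.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group() + " [" + matcher.start()
                + ":" + matcher.end() + "]");
        }
        return matches;
    }

    public static void main(String[] args) {
        // The embedded number is no longer reported
        System.out.println(
            find("(888)555-1111 888-SEL-HIGH 888-555-2222-J88-W3S"));
        // Stand-alone numbers are still found
        System.out.println(find("Call 800-555-1234 or 123-555-1234."));
    }
}
```

Whether a trailing hyphen should disqualify a number depends on the data, so the character class in the lookarounds would need adjusting for other inputs.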

Pattern pattern = Pattern.compile(phoneNumberRE + "|"  
    + timeRE + "|" + emailRegEx); 

    rgb@colorworks.com [27:45]
    800-555-1234 [68:80]
    123-555-1234 [196:208]
    8:00 [217:221]
    4:30 [229:233]

String timeRE = 
    "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?"; 
Chunker chunker = new RegExChunker(timeRE, "time", 1.0); 

Chunking chunking = chunker.chunk(regularExpressionText); 
Set<Chunk> chunkSet = chunking.chunkSet(); 
displayChunkSet(chunker, regularExpressionText); 

public void displayChunkSet(Chunker chunker, String text) { 
    Chunking chunking = chunker.chunk(text); 
    Set<Chunk> set = chunking.chunkSet(); 
    for (Chunk chunk : set) { 
        System.out.println("Type: " + chunk.type() + " Entity: [" 
             + text.substring(chunk.start(), chunk.end()) 
             + "] Score: " + chunk.score()); 
    } 
} 

    Type: time Entity: [8:00] Score: 1.0
    Type: time Entity: [4:30] Score: 1.0

public class TimeRegexChunker extends RegExChunker { 
    private final static String TIME_RE =  
      "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?"; 
    private final static String CHUNK_TYPE = "time"; 
    private final static double CHUNK_SCORE = 1.0; 

    public TimeRegexChunker() { 
        super(TIME_RE,CHUNK_TYPE,CHUNK_SCORE); 
    } 
} 

Chunker chunker = new TimeRegexChunker(); 

String sentences[] = {"Joe was the last person to see Fred. ", 
  "He saw him in Boston at McKenzie's pub at 3:00 where he " 
  + " paid $2.45 for an ale. ", 
  "Joe wanted to go to Vermont for the day to visit a cousin who " 
  + "works at IBM, but Sally and he had to look for Fred"}; 

String sentence = "He was the last person to see Fred."; 

try (InputStream tokenStream = new FileInputStream( 
        new File(getModelDir(), "en-token.bin")); 
        InputStream modelStream = new FileInputStream( 
            new File(getModelDir(), "en-ner-person.bin"));) { 
    ... 

} catch (Exception ex) { 
    // Handle exceptions 
} 

    TokenizerModel tokenModel = new TokenizerModel(tokenStream); 
    Tokenizer tokenizer = new TokenizerME(tokenModel); 

TokenNameFinderModel entityModel =  
    new TokenNameFinderModel(modelStream); 
NameFinderME nameFinder = new NameFinderME(entityModel); 

String tokens[] = tokenizer.tokenize(sentence); 
Span nameSpans[] = nameFinder.find(tokens);

for (int i = 0; i < nameSpans.length; i++) { 
    System.out.println("Span: " + nameSpans[i].toString()); 
    System.out.println("Entity: " 
        + tokens[nameSpans[i].getStart()]); 
} 

    Span: [7..9) person
    Entity: Fred

for (String sentence : sentences) { 
    String tokens[] = tokenizer.tokenize(sentence); 
    Span nameSpans[] = nameFinder.find(tokens); 
    for (int i = 0; i < nameSpans.length; i++) { 
        System.out.println("Span: " + nameSpans[i].toString()); 
        System.out.println("Entity: "  
            + tokens[nameSpans[i].getStart()]); 
    } 
    System.out.println(); 
} 

    Span: [0..1) person
    Entity: Joe
    Span: [7..9) person
    Entity: Fred

    Span: [0..1) person
    Entity: Joe
    Span: [19..20) person
    Entity: Sally
    Span: [26..27) person
    Entity: Fred

double[] spanProbs = nameFinder.probs(nameSpans); 

System.out.println("Probability: " + spanProbs[i]); 

    Span: [0..1) person
    Entity: Joe
    Probability: 0.8052914774025202
    Span: [7..9) person
    Entity: Fred
    Probability: 0.9042160889302772

    Span: [0..1) person
    Entity: Joe
    Probability: 0.9620970782763985
    Span: [19..20) person
    Entity: Sally
    Probability: 0.964568603518126
    Span: [26..27) person
    Entity: Fred
    Probability: 0.990383039618594

InputStream modelStream = new FileInputStream( 
    new File(getModelDir(), "en-ner-time.bin"));) { 

try { 
    InputStream tokenStream = new FileInputStream( 
        new File(getModelDir(), "en-token.bin")); 
    TokenizerModel tokenModel = new TokenizerModel(tokenStream); 
    Tokenizer tokenizer = new TokenizerME(tokenModel); 
    ... 
} catch (Exception ex) { 
    // Handle exceptions 
} 

String modelNames[] = {"en-ner-person.bin",  
    "en-ner-location.bin", "en-ner-organization.bin"}; 

ArrayList<String> list = new ArrayList<>(); 

for(String name : modelNames) { 
    TokenNameFinderModel entityModel = new TokenNameFinderModel( 
        new FileInputStream(new File(getModelDir(), name))); 
    NameFinderME nameFinder = new NameFinderME(entityModel); 
    ... 
} 

for (int index = 0; index < sentences.length; index++) { 
    String tokens[] = tokenizer.tokenize(sentences[index]); 
    Span nameSpans[] = nameFinder.find(tokens); 
    for(Span span : nameSpans) { 
        list.add("Sentence: " + index 
            + " Span: " + span.toString() + " Entity: " 
            + tokens[span.getStart()]); 
    } 
} 

for(String element : list) { 
    System.out.println(element); 
} 

Sentence: 0 Span: [0..1) person Entity: Joe
Sentence: 0 Span: [7..9) person Entity: Fred
Sentence: 2 Span: [0..1) person Entity: Joe
Sentence: 2 Span: [19..20) person Entity: Sally
Sentence: 2 Span: [26..27) person Entity: Fred
Sentence: 1 Span: [4..5) location Entity: Boston
Sentence: 2 Span: [5..6) location Entity: Vermont
Sentence: 2 Span: [16..17) organization Entity: IBM  

String model = getModelDir() +  
    "\\english.conll.4class.distsim.crf.ser.gz"; 

CRFClassifier<CoreLabel> classifier = 
    CRFClassifier.getClassifierNoExceptions(model);

String sentence = ""; 
for (String element : sentences) { 
    sentence += element; 
} 

List<List<CoreLabel>> entityList = classifier.classify(sentence); 

for (List<CoreLabel> internalList: entityList) { 
    for (CoreLabel coreLabel : internalList) { 
        String word = coreLabel.word(); 
        String category = coreLabel.get( 
            CoreAnnotations.AnswerAnnotation.class); 
        System.out.println(word + ":" + category); 
    } 
} 

    Joe:PERSON
    was:O
    the:O
    last:O
    person:O
    to:O
    see:O
    Fred:PERSON
    .:O

 He:O ... look:O for:O Fred:PERSON

if (!"O".equals(category)) { 
    System.out.println(word + ":" + category); 
} 

Joe:PERSON
Fred:PERSON
Boston:LOCATION
McKenzie:PERSON
Joe:PERSON
Vermont:LOCATION
IBM:ORGANIZATION
Sally:PERSON
Fred:PERSON  

try { 
    File modelFile = new File(getModelDir(),  
        "ne-en-news-muc6.AbstractCharLmRescoringChunker"); 
     Chunker chunker = (Chunker)  
        AbstractExternalizable.readObject(modelFile); 
    ... 
} catch (IOException | ClassNotFoundException ex) { 
    // Handle exception 
} 

for (int i = 0; i < sentences.length; ++i) { 
    Chunking chunking = chunker.chunk(sentences[i]); 
    System.out.println("Chunking=" + chunking); 
} 

    Chunking=Joe was the last person to see Fred.  : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity]
    Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity]
    Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity]

for (String sentence : sentences) { 
    displayChunkSet(chunker, sentence); 
} 

Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity
Type: LOCATION Entity: [Boston] Score: -Infinity
Type: PERSON Entity: [McKenzie] Score: -Infinity
Type: PERSON Entity: [Joe] Score: -Infinity
Type: ORGANIZATION Entity: [Vermont] Score: -Infinity
Type: ORGANIZATION Entity: [IBM] Score: -Infinity
Type: ORGANIZATION Entity: [Fred] Score: -Infinity  

// static because it is assigned from the static initializeDictionary method 
private static MapDictionary<String> dictionary;

private static void initializeDictionary() { 
    dictionary = new MapDictionary<String>(); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Joe","PERSON",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Fred","PERSON",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Boston","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("pub","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Vermont","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("IBM","ORGANIZATION",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Sally","PERSON",1.0)); 
} 

initializeDictionary(); 
ExactDictionaryChunker dictionaryChunker 
    = new ExactDictionaryChunker(dictionary, 
        IndoEuropeanTokenizerFactory.INSTANCE, true, false); 

for (String sentence : sentences) { 
    System.out.println("\nTEXT=" + sentence); 
    displayChunkSet(dictionaryChunker, sentence); 
} 

TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0

TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0

TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0  

Joe was the last person to see Fred. He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred.

> java -jar annotator.jar

<START:person> Joe <END> was the last person to see <START:person> Fred <END>.  
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>. 

try (OutputStream modelOutputStream = new BufferedOutputStream( 
        new FileOutputStream(new File("modelFile")));) { 
    ... 
} catch (IOException ex) { 
    // Handle exception 
} 

ObjectStream<String> lineStream = new PlainTextByLineStream( 
    new FileInputStream("en-ner-person.train"), "UTF-8"); 

ObjectStream<NameSample> sampleStream =  
    new NameSampleDataStream(lineStream); 

TokenNameFinderModel model = NameFinderME.train( 
    "en", "person",  sampleStream,  
    Collections.<String, Object>emptyMap(), 100, 5);

model.serialize(modelOutputStream); 

    Indexing events using cutoff of 5

      Computing event counts...  done. 53 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 53 events to 46.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 46
          Number of Outcomes: 2
        Number of Predicates: 34
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-36.73680056967707  0.05660377358490566
      2:  ... loglikelihood=-17.499660626361216  0.9433962264150944
      3:  ... loglikelihood=-13.216835449617108  0.9433962264150944
      4:  ... loglikelihood=-11.461783667999262  0.9433962264150944
      5:  ... loglikelihood=-10.380239416084963  0.9433962264150944
      6:  ... loglikelihood=-9.570622475692486  0.9433962264150944
      7:  ... loglikelihood=-8.919945779143012  0.9433962264150944
    ...
     99:  ... loglikelihood=-3.513810438211968  0.9622641509433962
    100:  ... loglikelihood=-3.507213816708068  0.9622641509433962

<START:person> Bill <END> went to the farm to see <START:person> Sally <END>.  
Unable to find <START:person> Sally <END> he went to town. 
There he saw <START:person> Fred <END> who had seen <START:person> Sally <END> at the book store with <START:person> Mary <END>. 

TokenNameFinderEvaluator evaluator =  
    new TokenNameFinderEvaluator(new NameFinderME(model));     
lineStream = new PlainTextByLineStream( 
    new FileInputStream("en-ner-person.eval"), "UTF-8"); 
sampleStream = new NameSampleDataStream(lineStream); 
evaluator.evaluate(sampleStream); 

FMeasure result = evaluator.getFMeasure(); 
System.out.println(result.toString()); 

Precision: 0.5 Recall: 0.25 F-Measure: 0.3333333333333333  
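
The three figures printed above are consistent with the balanced F-measure, the harmonic mean of precision and recall: F = 2PR / (P + R). The sketch below recomputes it from the printed precision and recall. The class and method names are hypothetical, and the code does not call OpenNLP.

```java
public class FMeasureCheck {

    // Balanced F-measure (harmonic mean of precision and recall).
    // Returns 0 when both inputs are 0 to avoid dividing by zero.
    public static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall copied from the evaluator output above
        System.out.println(fMeasure(0.5, 0.25));
    }
}
```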

 The/DT cow/NN jumped/VBD over/IN the/DT moon./NN

    Well/UH what/WP do/VBP you/PRP think/VB about/IN
    the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG
    to/TO do/VB public/JJ service/NN work/NN for/IN a/DT
    year/NN ?/.

    Bill/NNP used/VBD the/DT force/NN to/TO force/VB the/DT manger/NN to/TO tear/VB the/DT bill/NN in/IN two./PRP$

    AFAIK/NNS she/PRP H8/CD cth!/.
    BTW/NNP had/VBD a/DT GR8/CD tym/NN at/IN the/DT party/NN BBIAM./.

Whether/IN "Blue"/NNP was/VBD correct/JJ or/CC not/RB (it's/JJ not)/NN is/VBZ debatable/VBG

private String[] sentence = {"The", "voyage", "of", "the",  
    "Abraham", "Lincoln", "was", "for", "a", "long", "time", "marked",  
    "by", "no", "special", "incident."};

String theSentence = "The voyage of the Abraham Lincoln was for a "  
    + "long time marked by no special incident.";

public String[] tokenizeSentence(String sentence) { 
    // Split on runs of whitespace 
    String words[] = sentence.split("\\s+"); 
    return words; 
}

String words[] = tokenizeSentence(theSentence); 
for(String word : words) { 
    System.out.print(word + " ");  
} 
System.out.println(); 

The voyage of the Abraham Lincoln was for a long time marked by no special incident.

String words[] = 
    WhitespaceTokenizer.INSTANCE.tokenize(theSentence);

try (InputStream modelIn = new FileInputStream( 
    new File(getModelDir(), "en-pos-maxent.bin"));) { 
    ... 
} 
catch (IOException e) { 
    // Handle exceptions 
} 

POSModel model = new POSModel(modelIn); 
POSTaggerME tagger = new POSTaggerME(model); 

String tags[] = tagger.tag(sentence); 

for (int i = 0; i<sentence.length; i++) { 
    System.out.print(sentence[i] + "/" + tags[i] + " "); 
} 

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN

Sequence topSequences[] = tagger.topKSequences(sentence); 
for (int i = 0; i < topSequences.length; i++) { 
    System.out.println(topSequences[i]); 
}

    -0.5563571615737618 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, NN]
    -2.9886144610050907 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, .]
    -3.771930515521527 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, NN, NN]

for (int i = 0; i<topSequences.length; i++) { 
    List<String> outcomes = topSequences[i].getOutcomes(); 
    double probabilities[] = topSequences[i].getProbs(); 
    for (int j = 0; j <outcomes.size(); j++) {  
        System.out.printf("%s/%5.3f ",outcomes.get(j), 
        probabilities[j]); 
    } 
    System.out.println(); 
} 
System.out.println();

    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 NN/0.832 
    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 ./0.073 
    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 NN/0.073 NN/0.419

try ( 
        InputStream posModelStream = new FileInputStream( 
            getModelDir() + "\\en-pos-maxent.bin"); 
        InputStream chunkerStream = new FileInputStream( 
            getModelDir() + "\\en-chunker.bin");) { 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
}

POSModel model = new POSModel(posModelStream); 
POSTaggerME tagger = new POSTaggerME(model); 

String tags[] = tagger.tag(sentence); 
for(int i=0; i<tags.length; i++) { 
    System.out.print(sentence[i] + "/" + tags[i] + " "); 
} 
System.out.println();

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN

ChunkerModel chunkerModel = new ChunkerModel(chunkerStream); 
ChunkerME chunkerME = new ChunkerME(chunkerModel); 
String result[] = chunkerME.chunk(sentence, tags);

for (int i = 0; i < result.length; i++) { 
    System.out.println("[" + sentence[i] + "] " + result[i]); 
}

    [The] B-NP
    [voyage] I-NP
    [of] B-PP
    [the] B-NP
    [Abraham] I-NP
    [Lincoln] I-NP
    [was] B-VP
    [for] B-PP
    [a] B-NP
    [long] I-NP
    [time] I-NP
    [marked] B-VP
    [by] B-PP
    [no] B-NP
    [special] I-NP
    [incident.] I-NP

Span[] spans = chunkerME.chunkAsSpans(sentence, tags); 
for (Span span : spans) { 
    System.out.print("Type: " + span.getType() + " - "  
        + " Begin: " + span.getStart()  
        + " End:" + span.getEnd() 
        + " Length: " + span.length() + "  ["); 
    for (int j = span.getStart(); j < span.getEnd(); j++) { 
        System.out.print(sentence[j] + " "); 
    } 
    System.out.println("]"); 
}

    Type: NP -  Begin: 0 End:2 Length: 2  [The voyage ]
    Type: PP -  Begin: 2 End:3 Length: 1  [of ]
    Type: NP -  Begin: 3 End:6 Length: 3  [the Abraham Lincoln ]
    Type: VP -  Begin: 6 End:7 Length: 1  [was ]
    Type: PP -  Begin: 7 End:8 Length: 1  [for ]
    Type: NP -  Begin: 8 End:11 Length: 3  [a long time ]
    Type: VP -  Begin: 11 End:12 Length: 1  [marked ]
    Type: PP -  Begin: 12 End:13 Length: 1  [by ]
    Type: NP -  Begin: 13 End:16 Length: 3  [no special incident. ]
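
The spans above carry the same information as the `B-`/`I-` tags printed earlier: a `B-` tag opens a chunk and the following `I-` tags of the same type extend it. The sketch below decodes such a tag array into half-open spans using only the standard library. It illustrates the tag scheme and is not how `chunkAsSpans` is implemented. The class name `BioSpans` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BioSpans {

    // Converts B-/I- chunk tags into "TYPE [begin..end)" strings.
    // An I- tag that does not continue the open chunk is treated
    // leniently as the start of a new chunk.
    public static List<String> toSpans(String[] tags) {
        List<String> spans = new ArrayList<>();
        int begin = -1;
        String type = null;
        // One extra iteration with a sentinel "O" closes the last chunk
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    spans.add(type + " [" + begin + ".." + i + ")");
                    type = null;
                }
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    begin = i;
                    type = tag.substring(2);
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tags copied from the chunker output above
        String[] tags = {"B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP",
            "B-VP", "B-PP", "B-NP", "I-NP", "I-NP", "B-VP", "B-PP",
            "B-NP", "I-NP", "I-NP"};
        for (String span : toSpans(tags)) {
            System.out.println(span);
        }
    }
}
```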

try (InputStream modelIn = new FileInputStream( 
        new File(getModelDir(), "en-pos-maxent.bin"));) { 
    POSModel model = new POSModel(modelIn); 
    POSTaggerFactory posTaggerFactory = model.getFactory(); 
    ... 
} catch (IOException e) { 
    //Handle exceptions 
}

MutableTagDictionary tagDictionary =  
  (MutableTagDictionary)posTaggerFactory.getTagDictionary(); 

String tags[] = tagDictionary.getTags("force"); 
for (String tag : tags) { 
    System.out.print("/" + tag); 
} 
System.out.println(); 

/NN/VBP/VB

String oldTags[] = tagDictionary.put("force", "newTag"); 
for (String tag : oldTags) { 
    System.out.print("/" + tag); 
} 
System.out.println();

/NN/VBP/VB

tags = tagDictionary.getTags("force"); 
for (String tag : tags) { 
    System.out.print("/" + tag); 
} 
System.out.println();

 /newTag

String newTags[] = new String[tags.length+1]; 
for (int i=0; i<tags.length; i++) { 
    newTags[i] = tags[i]; 
} 
newTags[tags.length] = "newTag"; 
oldTags = tagDictionary.put("force", newTags);

 /NN/VBP/VB/newTag  

POSTaggerFactory newFactory = new POSTaggerFactory(); 
newFactory.setTagDictionary(tagDictionary); 

tags = newFactory.getTagDictionary().getTags("force"); 
for (String tag : tags) { 
    System.out.print("/" + tag); 
} 
System.out.println(); 

 /NN/VBP/VB/newTag

<dictionary case_sensitive="false"> 
    <entry tags="JJ VB"> 
        <token>strong</token> 
    </entry> 
    <entry tags="NN VBP VB"> 
        <token>force</token> 
    </entry> 
</dictionary>

try (InputStream dictionaryIn =  
      new FileInputStream(new File("dictionary.txt"));) { 
    POSDictionary dictionary = 
     POSDictionary.create(dictionaryIn); 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
}

Iterator<String> iterator = dictionary.iterator(); 
while (iterator.hasNext()) { 
    String entry = iterator.next(); 
    String tags[] = dictionary.getTags(entry); 
    System.out.print(entry + " "); 
    for (String tag : tags) { 
        System.out.print("/" + tag); 
    } 
    System.out.println(); 
}

  strong /JJ/VB
  force /NN/VBP/VB

try { 
    MaxentTagger tagger = new MaxentTagger(getModelDir() +  
        "//wsj-0-18-bidirectional-distsim.tagger"); 
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText( 
        new BufferedReader(new FileReader("sentences.txt"))); 
    ... 
} catch (FileNotFoundException ex) { 
    // Handle exceptions 
}

The voyage of the Abraham Lincoln was for a long time marked by no special incident. 
But one circumstance happened which showed the wonderful dexterity of Ned Land, and proved what confidence we might place in him. 
The 30th of June, the frigate spoke some American whalers, from whom we learned that they knew nothing about the narwhal. 
But one of them, the captain of the Monroe, knowing that Ned Land had shipped on board the Abraham Lincoln, begged for his help in chasing a whale they had in sight.

List<TaggedWord> taggedSentence = 
    tagger.tagSentence(sentence); 

for (List<HasWord> sentence : sentences) { 
    List<TaggedWord> taggedSentence = 
        tagger.tagSentence(sentence); 
    System.out.println(taggedSentence); 
}

    [The/DT, voyage/NN, of/IN, the/DT, Abraham/NNP, Lincoln/NNP, was/VBD, for/IN, a/DT, long/JJ, time/NN, marked/VBN, by/IN, no/DT, special/JJ, incident/NN, ./.]
    [But/CC, one/CD, circumstance/NN, happened/VBD, which/WDT, showed/VBD, the/DT, wonderful/JJ, dexterity/NN, of/IN, Ned/NNP, Land/NNP, ,/,, and/CC, proved/VBD, what/WP, confidence/NN, we/PRP, might/MD, place/VB, in/IN, him/PRP, ./.]
    [The/DT, 30th/JJ, of/IN, June/NNP, ,/,, the/DT, frigate/NN, spoke/VBD, some/DT, American/JJ, whalers/NNS, ,/,, from/IN, whom/WP, we/PRP, learned/VBD, that/IN, they/PRP, knew/VBD, nothing/NN, about/IN, the/DT, narwhal/NN, ./.]
    [But/CC, one/CD, of/IN, them/PRP, ,/,, the/DT, captain/NN, of/IN, the/DT, Monroe/NNP, ,/,, knowing/VBG, that/IN, Ned/NNP, Land/NNP, had/VBD, shipped/VBN, on/IN, board/NN, the/DT, Abraham/NNP, Lincoln/NNP, ,/,, begged/VBN, for/IN, his/PRP$, help/NN, in/IN, chasing/VBG, a/DT, whale/NN, they/PRP, had/VBD, in/IN, sight/NN, ./.]

for (List<HasWord> sentence : sentences) { 
    List<TaggedWord> taggedSentence = 
        tagger.tagSentence(sentence); 
    System.out.println(Sentence.listToString(taggedSentence, false)); 
}

    The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.
    But/CC one/CD circumstance/NN happened/VBD which/WDT showed/VBD the/DT wonderful/JJ dexterity/NN of/IN Ned/NNP Land/NNP ,/, and/CC proved/VBD what/WP confidence/NN we/PRP might/MD place/VB in/IN him/PRP ./.
    The/DT 30th/JJ of/IN June/NNP ,/, the/DT frigate/NN spoke/VBD some/DT American/JJ whalers/NNS ,/, from/IN whom/WP we/PRP learned/VBD that/IN they/PRP knew/VBD nothing/NN about/IN the/DT narwhal/NN ./.
    But/CC one/CD of/IN them/PRP ,/, the/DT captain/NN of/IN the/DT Monroe/NNP ,/, knowing/VBG that/IN Ned/NNP Land/NNP had/VBD shipped/VBN on/IN board/NN the/DT Abraham/NNP Lincoln/NNP ,/, begged/VBN for/IN his/PRP$ help/NN in/IN chasing/VBG a/DT whale/NN they/PRP had/VBD in/IN sight/NN ./.

List<TaggedWord> taggedSentence = 
     tagger.tagSentence(sentence); 
for (TaggedWord taggedWord : taggedSentence) { 
    System.out.print(taggedWord.word() + "/" + 
         taggedWord.tag() + " "); 
} 
System.out.println();

List<TaggedWord> taggedSentence = 
    tagger.tagSentence(sentence); 
// Label each output line, then print only the words tagged as nouns 
System.out.print("NN Tagged: "); 
for (TaggedWord taggedWord : taggedSentence) { 
    if (taggedWord.tag().startsWith("NN")) { 
        System.out.print(taggedWord.word() + " "); 
    } 
} 
System.out.println();

    NN Tagged: voyage Abraham Lincoln time incident 
    NN Tagged: circumstance dexterity Ned Land confidence 
    NN Tagged: June frigate whalers nothing narwhal 
    NN Tagged: captain Monroe Ned Land board Abraham Lincoln help whale sight
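
The same noun filter can be applied to text that has already been tagged in `word/TAG` form, such as the `listToString` output shown earlier. The sketch below uses only the standard library and splits each item at its last slash. The class name `NounFilter` is hypothetical, and no Stanford classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

public class NounFilter {

    // Returns the words whose tag starts with "NN" from a line of
    // whitespace-separated word/TAG items.
    public static List<String> nouns(String taggedLine) {
        List<String> result = new ArrayList<>();
        for (String item : taggedLine.trim().split("\\s+")) {
            // Split at the last slash so that "./." parses correctly
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no tag, or an empty word
            }
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (tag.startsWith("NN")) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "The/DT voyage/NN of/IN the/DT Abraham/NNP "
            + "Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN ./.";
        System.out.println("NN Tagged: " + String.join(" ", nouns(line)));
    }
}
```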

MaxentTagger tagger = new MaxentTagger(getModelDir()  
    + "//gate-EN-twitter.model"); 

System.out.println(tagger.tagString("AFAIK she H8 cth!")); 
System.out.println(tagger.tagString( 
    "BTW had a GR8 tym at the party BBIAM.")); 

    AFAIK_NNP she_PRP H8_VBP cth!_NN 
    BTW_UH had_VBD a_DT GR8_NNP tym_NNP at_IN the_DT party_NN BBIAM._NNP

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 

Annotation document = new Annotation(theSentence); 
pipeline.annotate(document);

List<CoreMap> sentences = 
     document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) { 
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) { 
        String word = token.get(TextAnnotation.class); 
        String pos = token.get(PartOfSpeechAnnotation.class); 
        System.out.print(word + "/" + pos + " "); 
    } 
    System.out.println(); 
}

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.

props.put("pos.model", 
    "C:/.../Models/english-caseless-left3words-distsim.tagger"); 

try { 
    pipeline.xmlPrint(document, System.out); 
} catch (IOException ex) { 
    // Handle exceptions 
}

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
    <root>
    <document>
    <sentences>
    <sentence id="1">
    <tokens>
    <token id="1">
    <word>The</word>
    <CharacterOffsetBegin>0</CharacterOffsetBegin>
    <CharacterOffsetEnd>3</CharacterOffsetEnd>
    <POS>DT</POS>
    </token>
    <token id="2">
    <word>voyage</word>
    <CharacterOffsetBegin>4</CharacterOffsetBegin>
    <CharacterOffsetEnd>10</CharacterOffsetEnd>
    <POS>NN</POS>
    </token>
             ...
    <token id="17">
    <word>.</word>
    <CharacterOffsetBegin>83</CharacterOffsetBegin>
    <CharacterOffsetEnd>84</CharacterOffsetEnd>
    <POS>.</POS>
    </token>
    </tokens>
    </sentence>
    </sentences>
    </document>
    </root>

pipeline.prettyPrint(document, System.out); 

    The voyage of the Abraham Lincoln was for a long time marked by no special incident.
    [Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT]
    [Text=voyage CharacterOffsetBegin=4 CharacterOffsetEnd=10 PartOfSpeech=NN]
    [Text=of CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=IN]
    [Text=the CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=DT]
    [Text=Abraham CharacterOffsetBegin=18 CharacterOffsetEnd=25 PartOfSpeech=NNP]
    [Text=Lincoln CharacterOffsetBegin=26 CharacterOffsetEnd=33 PartOfSpeech=NNP]
    [Text=was CharacterOffsetBegin=34 CharacterOffsetEnd=37 PartOfSpeech=VBD]
    [Text=for CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=IN]
    [Text=a CharacterOffsetBegin=42 CharacterOffsetEnd=43 PartOfSpeech=DT]
    [Text=long CharacterOffsetBegin=44 CharacterOffsetEnd=48 PartOfSpeech=JJ]
    [Text=time CharacterOffsetBegin=49 CharacterOffsetEnd=53 PartOfSpeech=NN]
    [Text=marked CharacterOffsetBegin=54 CharacterOffsetEnd=60 PartOfSpeech=VBN]
    [Text=by CharacterOffsetBegin=61 CharacterOffsetEnd=63 PartOfSpeech=IN]
    [Text=no CharacterOffsetBegin=64 CharacterOffsetEnd=66 PartOfSpeech=DT]
    [Text=special CharacterOffsetBegin=67 CharacterOffsetEnd=74 PartOfSpeech=JJ]
    [Text=incident CharacterOffsetBegin=75 CharacterOffsetEnd=83 PartOfSpeech=NN]
    [Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=.]
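
The begin and end offsets in the listing above are character positions in the original string, with the end offset exclusive. As a rough illustration of where such numbers come from, here is a minimal plain-Java sketch that uses a regular expression instead of the CoreNLP tokenizer; the class name `TokenOffsetSketch`, the method `describeTokens`, and the pattern itself are assumptions made for this example, and real tokenizers handle many more cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenOffsetSketch {

    // A run of word characters, or a single non-space symbol such as '.' or ','.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Formats each token with its begin (inclusive) and end (exclusive) offset.
    static List<String> describeTokens(String text) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            lines.add("[Text=" + matcher.group()
                + " CharacterOffsetBegin=" + matcher.start()
                + " CharacterOffsetEnd=" + matcher.end() + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "The voyage of the Abraham Lincoln was for a "
            + "long time marked by no special incident.";
        describeTokens(text).forEach(System.out::println);
    }
}
```

For this particular sentence the simple pattern yields the same 17 tokens and offsets as the listing, including the final period at 83 to 84; it carries no part-of-speech information.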

try ( 
        FileInputStream inputStream =  
            new FileInputStream(getModelDir() 
            + "//pos-en-general-brown.HiddenMarkovModel"); 
        ObjectInputStream objectStream = 
            new ObjectInputStream(inputStream);) { 
    HiddenMarkovModel hmm = (HiddenMarkovModel) 
        objectStream.readObject(); 
    HmmDecoder decoder = new HmmDecoder(hmm); 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} catch (ClassNotFoundException ex) { 
    // Handle exceptions 
}

TokenizerFactory TOKENIZER_FACTORY =  
    IndoEuropeanTokenizerFactory.INSTANCE; 
char[] charArray = theSentence.toCharArray(); 
Tokenizer tokenizer =  
    TOKENIZER_FACTORY.tokenizer( 
      charArray, 0, charArray.length); 
String[] tokens = tokenizer.tokenize();

List<String> tokenList = Arrays.asList(tokens); 
Tagging<String> tagString = decoder.tag(tokenList);

for (int i = 0; i < tagString.size(); ++i) { 
    System.out.print(tagString.token(i) + "/"  
    + tagString.tag(i) + " "); 
}

The/at voyage/nn of/in the/at Abraham/np Lincoln/np was/bedz for/in a/at long/jj time/nn marked/vbn by/in no/at special/jj incident/nn ./.

String[] sentence = {"Bill", "used", "the", "force", 
    "to", "force", "the", "manager", "to", 
    "tear", "the", "bill", "in", "two."}; 
List<String> tokenList = Arrays.asList(sentence); 

int maxResults = 5;

Iterator<ScoredTagging<String>> iterator =  
    decoder.tagNBest(tokenList, maxResults); 

while (iterator.hasNext()) { 
    ScoredTagging<String> scoredTagging = iterator.next(); 
    System.out.printf("Score: %7.3f   Sequence: ", 
        scoredTagging.score()); 
    for (int i = 0; i < tokenList.size(); ++i) { 
        System.out.print(scoredTagging.token(i) + "/"  
            + scoredTagging.tag(i) + " "); 
    } 
    System.out.println(); 
}

    Score: -148.796   Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
    Score: -154.434   Sequence: Bill/np used/vbn the/at force/nn to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
    Score: -154.781   Sequence: Bill/np used/vbd the/at force/nn to/in force/nn the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn 
    Score: -157.126   Sequence: Bill/np used/vbd the/at force/nn to/to force/vb the/at manager/jj to/to tear/vb the/at bill/nn in/in two./nn 
    Score: -157.340   Sequence: Bill/np used/vbd the/at force/jj to/to force/vb the/at manager/nn to/to tear/vb the/at bill/nn in/in two./nn  

TagLattice<String> lattice = decoder.tagMarginal(tokenList); 
for (int index = 0; index < tokenList.size(); index++) { 
    ConditionalClassification classification =  
        lattice.tokenClassification(index); 
    ... 
}

System.out.printf("%-8s",tokenList.get(index)); 
for (int i = 0; i < 4; ++i) { 
    double score = classification.score(i); 
    String tag = classification.category(i); 
    System.out.printf("%7.3f/%-3s ",score,tag); 
} 
System.out.println(); 

    Bill      0.974/np    0.018/nn    0.006/rb    0.001/nps 
    used      0.935/vbd   0.065/vbn   0.000/jj    0.000/rb  
    the       1.000/at    0.000/jj    0.000/pps   0.000/pp$$ 
    force     0.977/nn    0.016/jj    0.006/vb    0.001/rb  
    to        0.944/to    0.055/in    0.000/rb    0.000/nn  
    force     0.945/vb    0.053/nn    0.002/rb    0.001/jj  
    the       1.000/at    0.000/jj    0.000/vb    0.000/nn  
    manager   0.982/nn    0.018/jj    0.000/nn$   0.000/vb  
    to        0.988/to    0.012/in    0.000/rb    0.000/nn  
    tear      0.991/vb    0.007/nn    0.001/rb    0.001/jj  
    the       1.000/at    0.000/jj    0.000/vb    0.000/nn  
    bill      0.994/nn    0.003/jj    0.002/rb    0.001/nns 
    in        0.990/in    0.004/rp    0.002/nn    0.001/jj  
    two.      0.960/nn    0.013/np    0.011/nns   0.008/rb

    The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN
    But_CC one_CD circumstance_NN happened_VBD which_WDT showed_VBD the_DT wonderful_JJ dexterity_NN of_IN Ned_NNP Land,_NNP and_CC proved_VBD what_WP confidence_NN we_PRP might_MD place_VB in_IN him._PRP$ 
    The_DT 30th_JJ of_IN June,_NNP the_DT frigate_NN spoke_VBD some_DT American_NNP whalers,_, from_IN whom_WP we_PRP learned_VBD that_IN they_PRP knew_VBD nothing_NN about_IN the_DT narwhal._NN 
    But_CC one_CD of_IN them,_PRP$ the_DT captain_NN of_IN the_DT Monroe,_NNP knowing_VBG that_IN Ned_NNP Land_NNP had_VBD shipped_VBN on_IN board_NN the_DT Abraham_NNP Lincoln,_NNP begged_VBD for_IN his_PRP$ help_NN in_IN chasing_VBG a_DT whale_NN they_PRP had_VBD in_IN sight._NN

POSModel model = null;

try (InputStream dataIn = new FileInputStream("sample.train");) { 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
}

ObjectStream<String> lineStream =  
    new PlainTextByLineStream(dataIn, "UTF-8"); 
ObjectStream<POSSample> sampleStream =  
    new WordTagSampleStream(lineStream); 

model = POSTaggerME.train("en", sampleStream, 
    TrainingParameters.defaultParams(), null, null); 

    Indexing events using cutoff of 5

      Computing event counts...  done. 90 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 90 events to 82.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 82
          Number of Outcomes: 17
        Number of Predicates: 45
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-254.98920096505964  0.14444444444444443
      2:  ... loglikelihood=-201.19283975630537  0.6
      3:  ... loglikelihood=-174.8849213436524  0.6111111111111112
      4:  ... loglikelihood=-157.58164262220754  0.6333333333333333
      5:  ... loglikelihood=-144.69272379986646  0.6555555555555556
    ...
     99:  ... loglikelihood=-33.461128002846024  0.9333333333333333
    100:  ... loglikelihood=-33.29073273669207  0.9333333333333333

try (OutputStream modelOut = new BufferedOutputStream( 
        new FileOutputStream(new File("en_pos_verne.bin")));) { 
    model.serialize(modelOut); 
} catch (IOException e) { 
    // Handle exceptions 
}

        String sampletext = "This is n-gram model";
        System.out.println(sampletext);

        StringList tokens = new StringList(WhitespaceTokenizer.INSTANCE.tokenize(sampletext));
        System.out.println("Tokens " + tokens);

        NGramModel nGramModel = new NGramModel();
        nGramModel.add(tokens,3,4); 

        System.out.println("Total ngrams: " + nGramModel.numberOfGrams());
        for (StringList ngram : nGramModel) {
            System.out.println(nGramModel.getCount(ngram) + " - " + ngram);
        }

This is n-gram model
Tokens [This,is,n-gram,model]
Total ngrams: 3
1 - [is,n-gram,model]
1 - [This,is,n-gram]
1 - [This,is,n-gram,model]

This is n-gram model
Tokens [This,is,n-gram,model]
Total ngrams: 6
1 - [is,n-gram,model]
1 - [n-gram,model]
1 - [This,is,n-gram]
1 - [This,is,n-gram,model]
1 - [is,n-gram]
1 - [This,is]
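
The two listings above differ only in the range of n-gram lengths collected: three n-grams for lengths 3 to 4, and six when the minimum length appears to be 2 (an inference from the output, since only the `add(tokens,3,4)` call is shown). The counting itself can be sketched in plain Java without OpenNLP. The class name `NGramSketch` and the method `countNGrams` are hypothetical, and the printed order and list formatting will differ from `NGramModel`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NGramSketch {

    // Counts every contiguous n-gram whose length lies in [minLength, maxLength].
    static Map<List<String>, Integer> countNGrams(
            List<String> tokens, int minLength, int maxLength) {
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        for (int n = minLength; n <= maxLength; n++) {
            // Slide a window of n tokens across the sentence.
            for (int start = 0; start + n <= tokens.size(); start++) {
                List<String> ngram =
                    new ArrayList<>(tokens.subList(start, start + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("This is n-gram model".split("\\s+"));
        Map<List<String>, Integer> counts = countNGrams(tokens, 3, 4);
        System.out.println("Total ngrams: " + counts.size());
        counts.forEach((ngram, count) ->
            System.out.println(count + " - " + ngram));
    }
}
```

With four tokens there are two 3-grams and one 4-gram, which gives the total of 3; lowering the minimum length to 2 adds three bigrams for a total of 6.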

human interface computer
survey user computer system response time
eps user interface system
system human system eps
user response time
trees
graph trees
graph minors trees
graph minors survey
I like graph and stuff
I like trees and stuff
Sometimes I build a graph
Sometimes I build trees

INFO: Building vocabulary complete.. There are 19 terms
Iteration #1 , cost = 0.4109707480627031
Iteration #2 , cost = 0.37748817335537205
Iteration #3 , cost = 0.3563396433036622
Iteration #4 , cost = 0.3483667149265019
Iteration #5 , cost = 0.3434632969758875
Iteration #6 , cost = 0.33917154339742045
Iteration #7 , cost = 0.3304641363014488
Iteration #8 , cost = 0.32717383183159243
Iteration #9 , cost = 0.3240225514512226
Iteration #10 , cost = 0.32196412138868596
@trees
@minors
@computer
@a
@like
@survey
@eps
@interface
@and
@human
@user
@time
@response
@system
@Sometimes

        String file = "test.txt";

        Options options = new Options(); 
        options.debug = true;

        Vocabulary vocab = GloVe.build_vocabulary(file, options);

        options.window_size = 3;
        List<Cooccurrence> c =  GloVe.build_cooccurrence(vocab, file, options);

        options.iterations = 10;
        options.vector_size = 10;
        options.debug = true;
        DoubleMatrix W = GloVe.train(vocab, c, options);  

        List<String> similars = Methods.most_similar(W, vocab, "graph", 15);
        for(String similar : similars) {
            System.out.println("@" + similar);
        }

loading embeddings and creating word2vec...
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 2
[main] INFO org.reflections.Reflections - Reflections took 410 ms to scan 1 urls, producing 29 keys and 189 values 
[main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 2
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [4]; Memory: [5.3GB];
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
[main] INFO org.reflections.Reflections - Reflections took 373 ms to scan 1 urls, producing 373 keys and 1449 values 
done...
kill    1.0000001192092896
kills    0.6048964262008667
killing    0.6003166437149048
destroy    0.5964594483375549
exterminate    0.5908634066581726
decapitate    0.5677944421768188
assassinate    0.5450955629348755
behead    0.532557487487793
terrorize    0.5281200408935547
commit_suicide    0.5269641280174255
0.10049013048410416
0.1868356168270111

> git clone https://github.com/lejon/T-SNE-Java.git
> cd T-SNE-Java
> mvn install
> cd tsne-demo
> java -jar target/tsne-demos-2.4.0.jar -nohdr -nolbls src/main/resources/datasets/iris_X.txt 

TSneCsv: Running 2000 iterations of t-SNE on src/main/resources/datasets/iris_X.txt
NA string is: null
Loaded CSV with: 150 rows and 4 columns.
Dataset types:[class java.lang.Double, class java.lang.Double, class java.lang.Double, class java.lang.Double]
 V0             V1             V2             V3
 0     5.10000000     3.50000000     1.40000000     0.20000000
 1     4.90000000     3.00000000     1.40000000     0.20000000
 2     4.70000000     3.20000000     1.30000000     0.20000000
 3     4.60000000     3.10000000     1.50000000     0.20000000
 4     5.00000000     3.60000000     1.40000000     0.20000000
 5     5.40000000     3.90000000     1.70000000     0.40000000
 6     4.60000000     3.40000000     1.40000000     0.30000000
 7     5.00000000     3.40000000     1.50000000     0.20000000
 8     4.40000000     2.90000000     1.40000000     0.20000000
 9     4.90000000     3.10000000     1.50000000     0.10000000

Dim:150 x 4
000: [5.1000, 3.5000, 1.4000, 0.2000...]
001: [4.9000, 3.0000, 1.4000, 0.2000...]
002: [4.7000, 3.2000, 1.3000, 0.2000...]
003: [4.6000, 3.1000, 1.5000, 0.2000...]
004: [5.0000, 3.6000, 1.4000, 0.2000...]
 .
 .
 .
145: [6.7000, 3.0000, 5.2000, 2.3000]
146: [6.3000, 2.5000, 5.0000, 1.9000]
147: [6.5000, 3.0000, 5.2000, 2.0000]
148: [6.2000, 3.4000, 5.4000, 2.3000]
149: [5.9000, 3.0000, 5.1000, 1.8000]
X:Shape is = 150 x 4
Using no_dims = 2, perplexity = 20.000000, and theta = 0.500000
Computing input similarities...
Done in 0.06 seconds (sparsity = 0.472756)!
Learning embedding...
Iteration 50: error is 64.67259135061494 (50 iterations in 0.19 seconds)
Iteration 100: error is 61.50118570075227 (50 iterations in 0.20 seconds)
Iteration 150: error is 61.373758889762875 (50 iterations in 0.20 seconds)
Iteration 200: error is 55.78219488135168 (50 iterations in 0.09 seconds)
Iteration 250: error is 2.3581173593529687 (50 iterations in 0.09 seconds)
Iteration 300: error is 2.2349608757095827 (50 iterations in 0.07 seconds)
Iteration 350: error is 1.9906437450336596 (50 iterations in 0.07 seconds)
Iteration 400: error is 1.8958764344779482 (50 iterations in 0.08 seconds)
Iteration 450: error is 1.7360726540960958 (50 iterations in 0.08 seconds)
Iteration 500: error is 1.553250634564741 (50 iterations in 0.09 seconds)
Iteration 550: error is 1.294981722012944 (50 iterations in 0.06 seconds)
Iteration 600: error is 1.0985607573299603 (50 iterations in 0.03 seconds)
Iteration 650: error is 1.0810715645272573 (50 iterations in 0.04 seconds)
Iteration 700: error is 0.8168399675722107 (50 iterations in 0.05 seconds)
Iteration 750: error is 0.7158739920771124 (50 iterations in 0.03 seconds)
Iteration 800: error is 0.6911748222330966 (50 iterations in 0.04 seconds)
Iteration 850: error is 0.6123536061655738 (50 iterations in 0.04 seconds)
Iteration 900: error is 0.5631133416913786 (50 iterations in 0.04 seconds)
Iteration 950: error is 0.5905547118496892 (50 iterations in 0.03 seconds)
Iteration 1000: error is 0.5053631170520657 (50 iterations in 0.04 seconds)
Iteration 1050: error is 0.44752244538411406 (50 iterations in 0.04 seconds)
Iteration 1100: error is 0.40661841893114614 (50 iterations in 0.03 seconds)
Iteration 1150: error is 0.3267394426152807 (50 iterations in 0.05 seconds)
Iteration 1200: error is 0.3393774577158965 (50 iterations in 0.03 seconds)
Iteration 1250: error is 0.37023103950965025 (50 iterations in 0.04 seconds)
Iteration 1300: error is 0.3192975790641602 (50 iterations in 0.04 seconds)
Iteration 1350: error is 0.28140161036965816 (50 iterations in 0.03 seconds)
Iteration 1400: error is 0.30413739839879855 (50 iterations in 0.04 seconds)
Iteration 1450: error is 0.31755361125826165 (50 iterations in 0.04 seconds)
Iteration 1500: error is 0.36301524742916624 (50 iterations in 0.04 seconds)
Iteration 1550: error is 0.3063801941900375 (50 iterations in 0.03 seconds)
Iteration 1600: error is 0.2928584822753138 (50 iterations in 0.03 seconds)
Iteration 1650: error is 0.2867502934852756 (50 iterations in 0.03 seconds)
Iteration 1700: error is 0.470469997545481 (50 iterations in 0.04 seconds)
Iteration 1750: error is 0.4792376115843584 (50 iterations in 0.04 seconds)
Iteration 1800: error is 0.5100126924750723 (50 iterations in 0.06 seconds)
Iteration 1850: error is 0.37855035406353427 (50 iterations in 0.04 seconds)
Iteration 1900: error is 0.32776847081948496 (50 iterations in 0.04 seconds)
Iteration 1950: error is 0.3875134029990107 (50 iterations in 0.04 seconds)
Iteration 1999: error is 0.32560416632168365 (50 iterations in 0.04 seconds)
Fitting performed in 2.29 seconds.
TSne took: 2.43 seconds

dog The most interesting feature of a dog is its ...

dog The most widespread form of interspecies bonding occurs ... 
dog There have been two major trends in the changing status of  ... 
dog There are a vast range of commodity forms available to  ... 
dog An Australian Cattle Dog in reindeer antlers sits on Santa's lap ... 
dog A pet dog taking part in Christmas traditions ... 
dog The majority of contemporary people with dogs describe their  ... 
dog Another study of dogs' roles in families showed many dogs have  ... 
dog According to statistics published by the American Pet Products  ... 
dog The latest study using Magnetic resonance imaging (MRI) ... 
cat Cats are common pets in Europe and North America, and their  ... 
cat Although cat ownership has commonly been associated  ... 
cat The concept of a cat breed appeared in Britain during ... 
cat Cats come in a variety of colors and patterns. These are physical  ... 
cat A natural behavior in cats is to hook their front claws periodically  ... 
cat Although scratching can serve cats to keep their claws from growing  ... 

DoccatModel model = null; 
try (InputStream dataIn =  
            new FileInputStream("en-animal.train"); 
        OutputStream dataOut =  
            new FileOutputStream("en-animal.model");) { 
    ObjectStream<String> lineStream 
        = new PlainTextByLineStream(dataIn, "UTF-8"); 
    ObjectStream<DocumentSample> sampleStream =  
        new DocumentSampleStream(lineStream);             
    model = DocumentCategorizerME.train("en", sampleStream); 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
} 

    Indexing events using cutoff of 5

      Computing event counts...  done. 12 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 12 events to 12.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 12
          Number of Outcomes: 2
        Number of Predicates: 30
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-8.317766166719343  0.75
      2:  ... loglikelihood=-7.1439957443937265  0.75
      3:  ... loglikelihood=-6.560690872956419  0.75
      4:  ... loglikelihood=-6.106743124066829  0.75
      5:  ... loglikelihood=-5.721805583104927  0.8333333333333334
      6:  ... loglikelihood=-5.3891508904777785  0.8333333333333334
      7:  ... loglikelihood=-5.098768040466029  0.8333333333333334
    ...
     98:  ... loglikelihood=-1.4117372921765519  1.0
     99:  ... loglikelihood=-1.4052738190352423  1.0
    100:  ... loglikelihood=-1.398916120150312  1.0

OutputStream modelOut = new BufferedOutputStream(dataOut); 
model.serialize(modelOut);

try (InputStream modelIn =  
        new FileInputStream(new File("en-animal.model"));) { 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} 

DoccatModel model = new DoccatModel(modelIn); 
DocumentCategorizerME categorizer =  
    new DocumentCategorizerME(model); 

double[] outcomes = categorizer.categorize(inputText); 
for (int i = 0; i < categorizer.getNumberOfCategories(); i++) { 
    String category = categorizer.getCategory(i); 
    System.out.println(category + " - " + outcomes[i]); 
} 

String toto = "Toto belongs to Dorothy Gale, the heroine of "  
        + "the first and many subsequent books. In the first " 
        + "book, he never spoke, although other animals, native " 
        + "to Oz, did. In subsequent books, other animals " 
        + "gained the ability to speak upon reaching Oz or " 
        + "similar lands, but Toto remained speechless."; 

String calico = "This cat is also known as a calimanco cat or " 
        + "clouded tiger cat, and by the abbreviation 'tortie'. " 
        + "In the cat fancy, a tortoiseshell cat is patched " 
        + "over with red (or its dilute form, cream) and black " 
        + "(or its dilute blue) mottled throughout the coat.";  

    dog - 0.5870711529777994
    cat - 0.41292884702220056  

    dog - 0.28960436044424276
    cat - 0.7103956395557574

System.out.println(categorizer.getBestCategory(outcomes)); 
System.out.println(categorizer.getAllResults(outcomes)); 

cat
dog[0.2896]  cat[0.7104]
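
Choosing the best category from the `outcomes` array amounts to picking the index with the highest score. The following plain-Java sketch shows that selection, along with a formatted summary that resembles the line above. It is an illustration only, not OpenNLP's implementation; the names `BestCategorySketch`, `bestCategory`, and `allResults` are hypothetical, and the scores are copied from the earlier output.

```java
import java.util.Locale;

final class BestCategorySketch {

    // Returns the category with the highest score (the first one wins on ties).
    static String bestCategory(String[] categories, double[] outcomes) {
        int best = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    // Renders every category with its score rounded to four decimals.
    static String allResults(String[] categories, double[] outcomes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < categories.length; i++) {
            if (i > 0) {
                sb.append("  ");
            }
            sb.append(String.format(Locale.ROOT, "%s[%.4f]",
                categories[i], outcomes[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] categories = {"dog", "cat"};
        double[] outcomes = {0.28960436044424276, 0.7103956395557574};
        System.out.println(bestCategory(categories, outcomes));
        System.out.println(allResults(categories, outcomes));
    }
}
```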

useClassFeature=true 
1.realValued=true 
2.realValued=true 
3.realValued=true 
trainFile=.box.train 
testFile=.box.test 

small  2.34  1.60  1.50

ColumnDataClassifier cdc =  
    new ColumnDataClassifier("box.prop"); 
Classifier<String, String> classifier =  
    cdc.makeClassifier(cdc.readTrainingExamples("box.train")); 

    3.realValued = true
    testFile = .box.test
    ...
    trainFile = .box.train

    Reading dataset from box.train ... done [0.1s, 60 items].
    numDatums: 60
    numLabels: 3 [small, medium, large]
    ...
    AVEIMPROVE     The average improvement / current value
    EVALSCORE      The last available eval score
    Iter ## evals ## <SCALING> [LINESEARCH] VALUE TIME |GNORM| {RELNORM} AVEIMPROVE EVALSCORE

    Iter 1 evals 1 <D> [113M 3.107E-4] 5.985E1 0.00s |3.829E1| {1.959E-1} 0.000E0 - 
    Iter 2 evals 5 <D> [M 1.000E0] 5.949E1 0.01s |1.862E1| {9.525E-2} 3.058E-3 - 
    Iter 3 evals 6 <D> [M 1.000E0] 5.923E1 0.01s |1.741E1| {8.904E-2} 3.485E-3 - 
    ...
    Iter 21 evals 24 <D> [1M 2.850E-1] 3.306E1 0.02s |4.149E-1| {2.122E-3} 1.775E-4 - 
    Iter 22 evals 26 <D> [M 1.000E0] 3.306E1 0.02s
    QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL 
    Total time spent in optimization: 0.07s

for (String line :  
        ObjectBank.getLineIterator("box.test", "utf-8")) { 
    ... 
} 

Datum<String, String> datum = cdc.makeDatumFromLine(line); 
System.out.println("Datum: {"  
    + line + "]\tPredicted Category: "  
    + classifier.classOf(datum)); 

    Datum: {small  1.33  3.50  5.43]  Predicted Category: medium
    Datum: {small  1.18  1.73  3.14]  Predicted Category: small
    ...
    Datum: {large  6.01  9.35  16.64]  Predicted Category: large
    Datum: {large  6.76  9.66  15.44]  Predicted Category: large

String sample[] = {"", "6.90", "9.8", "15.69"}; 
Datum<String, String> datum =  
    cdc.makeDatumFromStrings(sample); 
System.out.println("Category: " + classifier.classOf(datum)); 

Category: large

String review = "An overly sentimental film with a somewhat " 
    + "problematic message, but its sweetness and charm " 
    + "are occasionally enough to approximate true depth " 
    + "and grace. "; 

String sam = "Sam was an odd sort of fellow. Not prone " 
    + "to angry and not prone to merriment. Overall, " 
    + "an odd fellow."; 

String mary = "Mary thought that custard pie was the " 
    + "best pie in the world. However, she loathed " 
    + "chocolate pie."; 

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, parse, sentiment"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 

Annotation annotation = new Annotation(review); 
pipeline.annotate(annotation); 

String[] sentimentText = {"Very Negative", "Negative",  
    "Neutral", "Positive", "Very Positive"};

for (CoreMap sentence : annotation.get( 
        CoreAnnotations.SentencesAnnotation.class)) { 
    Tree tree = sentence.get( 
        SentimentCoreAnnotations.AnnotatedTree.class); 
    int score = RNNCoreAnnotations.getPredictedClass(tree); 
    System.out.println(sentimentText[score]); 
} 

Positive  

Neutral
Negative
Neutral  

Positive
Neutral  

String[] categories = {"soc.religion.christian", 
    "talk.religion.misc","alt.atheism","misc.forsale"}; 

int nGramSize = 6; 
DynamicLMClassifier<NGramProcessLM> classifier =  
    DynamicLMClassifier.createNGramProcess( 
        categories, nGramSize); 

String directory = ".../demos"; 
File trainingDirectory = new File(directory  
    + "/data/fourNewsGroups/4news-train"); 

for (int i = 0; i < categories.length; ++i) { 
    File classDir =  
        new File(trainingDirectory, categories[i]); 
    String[] trainingFiles = classDir.list(); 
    // Inner for-loop 
} 

for (int j = 0; j < trainingFiles.length; ++j) { 
    try { 
        File file = new File(classDir, trainingFiles[j]); 
        String text = Files.readFromFile(file, "ISO-8859-1"); 
        Classification classification =  
            new Classification(categories[i]); 
        Classified<CharSequence> classified =  
            new Classified<>(text, classification); 
        classifier.handle(classified); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
} 

try { 
    AbstractExternalizable.compileTo( (Compilable) classifier, 
        new File("classifier.model"));

} catch (IOException ex) { 
    // Handle exceptions 
} 

String forSale =  
    "Finding a home for sale has never been " 
    + "easier. With Homes.com, you can search new " 
    + "homes, foreclosures, multi-family homes, " 
    + "as well as condos and townhouses for sale. " 
    + "You can even search our real estate agent " 
    + "directory to work with a professional " 
    + "Realtor and find your perfect home."; 
String martinLuther =  
    "Luther taught that salvation and subsequently " 
    + "eternity in heaven is not earned by good deeds " 
    + "but is received only as a free gift of God's " 
    + "grace through faith in Jesus Christ as redeemer " 
    + "from sin and subsequently eternity in Hell."; 

LMClassifier classifier = null; 
try { 
    classifier = (LMClassifier)  
        AbstractExternalizable.readObject( 
            new File("classifier.model")); 
} catch (IOException | ClassNotFoundException ex) { 
    // Handle exceptions 
} 

JointClassification classification =  
    classifier.classify(text); 
System.out.println("Text: " + text); 
String bestCategory = classification.bestCategory(); 
System.out.println("Best Category: " + bestCategory);

    Text: Finding a home for sale has never been easier. With Homes.com, you can search new homes, foreclosures, multi-family homes, as well as condos and townhouses for sale. You can even search our real estate agent directory to work with a professional Realtor and find your perfect home.
    Best Category: misc.forsale

    Text: Luther taught that salvation and subsequently eternity in heaven is not earned by good deeds but is received only as a free gift of God's grace through faith in Jesus Christ as redeemer from sin and subsequently eternity in Hell.
    Best Category: soc.religion.christian

categories = new String[2]; 
categories[0] = "neg"; 
categories[1] = "pos"; 
nGramSize = 8; 
classifier = DynamicLMClassifier.createNGramProcess( 
    categories, nGramSize); 

String directory = "..."; 
File trainingDirectory = new File(directory, "txt_sentoken"); 
for (int i = 0; i < categories.length; ++i) { 
    Classification classification =  
        new Classification(categories[i]); 
    File file = new File(trainingDirectory, categories[i]); 
    File[] trainingFiles = file.listFiles(); 
    for (int j = 0; j < trainingFiles.length; ++j) { 
        try { 
            String review = Files.readFromFile( 
                trainingFiles[j], "ISO-8859-1"); 
            Classified<CharSequence> classified =  
                new Classified<>(review, classification); 
            classifier.handle(classified); 
        } catch (IOException ex) { 
            ex.printStackTrace(); 
        } 
    } 
} 

String review = "An overly sentimental film with a somewhat " 
    + "problematic message, but its sweetness and charm " 
    + "are occasionally enough to approximate true depth " 
    + "and grace. "; 

Classification classification = classifier.classify(review); 
String bestCategory = classification.bestCategory(); 
System.out.println("Best Category: " + bestCategory); 

Best Category: pos  

String text = "An overly sentimental film with a somewhat " 
    + "problematic message, but its sweetness and charm " 
    + "are occasionally enough to approximate true depth " 
    + "and grace. "; 
System.out.println("Text: " + text); 

LMClassifier classifier = null; 
try { 
    classifier = (LMClassifier)  
        AbstractExternalizable.readObject( 
            new File(".../langid-leipzig.classifier")); 
} catch (IOException | ClassNotFoundException ex) { 
    // Handle exceptions 
}

Classification classification = classifier.classify(text); 
String bestCategory = classification.bestCategory(); 
System.out.println("Best Language: " + bestCategory); 

    Text: An overly sentimental film with a somewhat problematic message, but its sweetness and charm are occasionally enough to approximate true depth and grace. 
    Best Language: en

text = "Svenska är ett östnordiskt språk som talas av cirka " 
    + "tio miljoner personer[1], främst i Finland " 
    + "och Sverige."; 

    Text: Svenska är ett östnordiskt språk som talas av cirka tio miljoner personer[1], främst i Finland och Sverige.
    Best Language: se

mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords

mallet-2.0.6$ bin/mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_compostion.txt

mallet-2.0.6$ bin/mallet import-dir --input mydata/ --output mytutorial.mallet --keep-sequence --remove-stopwords

mallet-2.0.6$ bin/mallet train-topics  --input mytutorial.mallet --num-topics 2 --output-state mytopic-state.gz --output-topic-keys mytutorial_keys.txt --output-doc-topics mytutorial_compostion.txt

He was the last person to see Fred. 

Span: [7..9) person
Entity: Fred 

    (TOP (S (NP (PRP He)) (VP (VBD was) (NP (NP (DT the) (JJ last) (NN person)) (SBAR (S (VP (TO to) (VP (VB see))))))) (. Fred.)))  

The cow jumped over the moon. 

    (TOP (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))

The cow jumped over the moon. 

    (ROOT
      (S
        (NP (DT The) (NN cow))
        (VP (VBD jumped)
          (PP (IN over)
            (NP (DT the) (NN moon))))
        (. .)))

    (TOP (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))
    (TOP (S (NP (DT The) (NN cow)) (VP (VP (VBD jumped) (PRT (RP over))) (NP (DT the) (NN moon)))))
    (TOP (S (NP (DT The) (NNS cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon)))))) 

String fileLocation = getModelDir() +  
    "/en-parser-chunking.bin"; 
try (InputStream modelInputStream =  
            new FileInputStream(fileLocation);) { 
    ParserModel model = new ParserModel(modelInputStream); 
    Parser parser = ParserFactory.create(model); 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} 

String sentence = "The cow jumped over the moon"; 
Parse parses[] = ParserTool.parseLine(sentence, parser, 3); 

for(Parse parse : parses) { 
    parse.show(); 
    System.out.println("Probability: " + parse.getProb()); 
} 

    (TOP (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))
    Probability: -1.043506016751117
    (TOP (S (NP (DT The) (NN cow)) (VP (VP (VBD jumped) (PRT (RP over))) (NP (DT the) (NN moon)))))
    Probability: -4.248553665013661
    (TOP (S (NP (DT The) (NNS cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon))))))
    Probability: -4.761071294573854
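
The negative "Probability" values above are easier to compare once they are mapped back to the 0..1 range. The following is a minimal sketch, assuming the scores are natural-log probabilities (which is how the OpenNLP Parse.getProb() documentation describes them). The class name ParseProbabilities and its normalize method are hypothetical helpers, not part of OpenNLP.

```java
class ParseProbabilities {

    // Converts natural-log scores to probabilities and rescales them
    // so that the returned values sum to 1 across the candidates.
    static double[] normalize(double[] logProbs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : logProbs) {
            max = Math.max(max, lp);
        }
        double[] probs = new double[logProbs.length];
        double sum = 0.0;
        for (int i = 0; i < logProbs.length; i++) {
            // Subtracting the maximum first avoids underflow for very small scores
            probs[i] = Math.exp(logProbs[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        // The three scores printed for "The cow jumped over the moon"
        double[] scores = {-1.043506016751117,
            -4.248553665013661, -4.761071294573854};
        double[] relative = normalize(scores);
        for (int i = 0; i < scores.length; i++) {
            System.out.println("exp(score) = " + Math.exp(scores[i])
                + "  share among candidates = " + relative[i]);
        }
    }
}
```

The normalized values only rank the three returned parses against each other; they are not probabilities over all possible parses.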

    (TOP 
          (S 
              (NP 
                   (DT The) 
                   (NN cow)
              )
              (VP 
                   (VBD jumped) 
                   (PP 
                        (IN over)
                        (NP 
                             (DT the)
                             (NN moon)
                         )
                   )
               )
         )
    )

parse.showCodeTree(); 

[0] S -929208263 -> -929208263 TOP The cow jumped over the moon
[0.0] NP -929237012 -> -929208263 S The cow
[0.0.0] DT -929242488 -> -929237012 NP The
[0.0.0.0] TK -929242488 -> -929242488 DT The
[0.0.1] NN -929034400 -> -929237012 NP cow
[0.0.1.0] TK -929034400 -> -929034400 NN cow
[0.1] VP -928803039 -> -929208263 S jumped over the moon
[0.1.0] VBD -928822205 -> -928803039 VP jumped
[0.1.0.0] TK -928822205 -> -928822205 VBD jumped
[0.1.1] PP -928448468 -> -928803039 VP over the moon
[0.1.1.0] IN -928460789 -> -928448468 PP over
[0.1.1.0.0] TK -928460789 -> -928460789 IN over
[0.1.1.1] NP -928195203 -> -928448468 PP the moon
[0.1.1.1.0] DT -928202048 -> -928195203 NP the
[0.1.1.1.0.0] TK -928202048 -> -928202048 DT the
[0.1.1.1.1] NN -927992591 -> -928195203 NP moon
[0.1.1.1.1.0] TK -927992591 -> -927992591 NN moon  

Parse children[] = parse.getChildren(); 
for (Parse parseElement : children) { 
    System.out.println(parseElement.getText()); 
    System.out.println(parseElement.getType()); 
    Parse tags[] = parseElement.getTagNodes(); 
    System.out.println("Tags"); 
    for (Parse tag : tags) { 
        System.out.println("[" + tag + "]"  
            + " type: " + tag.getType()  
            + "  Probability: " + tag.getProb()  
            + "  Label: " + tag.getLabel()); 
    } 
} 

The cow jumped over the moon
S
Tags
[The] type: DT  Probability: 0.9380626549164167  Label: null
[cow] type: NN  Probability: 0.9574993337971017  Label: null
[jumped] type: VBD  Probability: 0.9652983971550483  Label: S-VP
[over] type: IN  Probability: 0.7990638213315913  Label: S-PP
[the] type: DT  Probability: 0.9848023215770413  Label: null
[moon] type: NN  Probability: 0.9942338356992393  Label: null  

String parserModel = ".../models/lexparser/englishPCFG.ser.gz"; 
LexicalizedParser lexicalizedParser =  
   LexicalizedParser.loadModel(parserModel);

String[] sentenceArray = {"The", "cow", "jumped", "over",  
    "the", "moon", "."}; 
List<CoreLabel> words =  
    Sentence.toCoreLabelList(sentenceArray); 

Tree parseTree = lexicalizedParser.apply(words); 

parseTree.pennPrint(); 

    (ROOT
      (S
        (NP (DT The) (NN cow))
        (VP (VBD jumped)
          (PP (IN over)
            (NP (DT the) (NN moon))))
        (. .)))

TreePrint treePrint =  
    new TreePrint("typedDependenciesCollapsed"); 
treePrint.printTree(parseTree); 

det(cow-2, The-1)
nsubj(jumped-3, cow-2)
root(ROOT-0, jumped-3)
det(moon-6, the-5)
prep_over(jumped-3, moon-6)  

    (ROOT (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon)))) (. .)))

    dep(cow-2,The-1)
    dep(jumped-3,cow-2)
    dep(null-0,jumped-3,root)
    dep(jumped-3,over-4)
    dep(moon-6,the-5)
    dep(over-4,moon-6)

    "penn,typedDependenciesCollapsed"  

String sentence = "The cow jumped over the moon."; 
TokenizerFactory<CoreLabel> tokenizerFactory =  
    PTBTokenizer.factory(new CoreLabelTokenFactory(), ""); 
Tokenizer<CoreLabel> tokenizer =  
    tokenizerFactory.getTokenizer(new StringReader(sentence)); 
List<CoreLabel> wordList = tokenizer.tokenize(); 
parseTree = lexicalizedParser.apply(wordList); 

TreebankLanguagePack tlp =  
    lexicalizedParser.treebankLanguagePack(); 
GrammaticalStructureFactory gsf =  
    tlp.grammaticalStructureFactory(); 
GrammaticalStructure gs =  
    gsf.newGrammaticalStructure(parseTree); 
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed(); 

System.out.println(tdl);

    [det(cow-2, The-1), nsubj(jumped-3, cow-2), root(ROOT-0, jumped-3), det(moon-6, the-5), prep_over(jumped-3, moon-6)]  

for(TypedDependency dependency : tdl) { 
    System.out.println("Governor Word: [" + dependency.gov()  
        + "] Relation: [" + dependency.reln().getLongName() 
        + "] Dependent Word: [" + dependency.dep() + "]"); 
} 

    Governor Word: [cow/NN] Relation: [determiner] Dependent Word: [The/DT]
    Governor Word: [jumped/VBD] Relation: [nominal subject] Dependent Word: [cow/NN]
    Governor Word: [ROOT] Relation: [root] Dependent Word: [jumped/VBD]
    Governor Word: [moon/NN] Relation: [determiner] Dependent Word: [the/DT]
    Governor Word: [jumped/VBD] Relation: [prep_collapsed] Dependent Word: [moon/NN]  

String sentence = "He took his cash and she took her change "  
    + "and together they bought their lunch."; 
Properties props = new Properties(); 
props.put("annotators",  
    "tokenize, ssplit, pos, lemma, ner, parse, dcoref"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 
Annotation annotation = new Annotation(sentence); 
pipeline.annotate(annotation); 

Map<Integer, CorefChain> corefChainMap =  
    annotation.get(CorefChainAnnotation.class); 

Set<Integer> set = corefChainMap.keySet(); 
Iterator<Integer> setIterator = set.iterator(); 
while(setIterator.hasNext()) { 
    CorefChain corefChain =  
        corefChainMap.get(setIterator.next()); 
    System.out.println("CorefChain: " + corefChain); 
} 

CorefChain: CHAIN1-["He" in sentence 1, "his" in sentence 1]
CorefChain: CHAIN2-["his cash" in sentence 1]
CorefChain: CHAIN4-["she" in sentence 1, "her" in sentence 1]
CorefChain: CHAIN5-["her change" in sentence 1]
CorefChain: CHAIN7-["they" in sentence 1, "their" in sentence 1]
CorefChain: CHAIN8-["their lunch" in sentence 1]

System.out.print("ClusterId: " + corefChain.getChainID()); 
CorefMention mention = corefChain.getRepresentativeMention(); 
System.out.println(" CorefMention: " + mention  
    + " Span: [" + mention.mentionSpan + "]"); 

List<CorefMention> mentionList =  
    corefChain.getMentionsInTextualOrder(); 
Iterator<CorefMention> mentionIterator =  
    mentionList.iterator(); 
while(mentionIterator.hasNext()) { 
    CorefMention cfm = mentionIterator.next(); 
    // Print this mention's own span, not the representative mention's 
    System.out.println("\tMention: " + cfm  
        + " Span: [" + cfm.mentionSpan + "]"); 
    System.out.print("\tMention Type: "  
        + cfm.mentionType + " Gender: " + cfm.gender); 
    System.out.println(" Start: " + cfm.startIndex  
        + " End: " + cfm.endIndex); 
} 
System.out.println(); 

    CorefChain: CHAIN1-["He" in sentence 1, "his" in sentence 1]
    ClusterId: 1 CorefMention: "He" in sentence 1 Span: [He]
      Mention: "He" in sentence 1 Span: [He]
      Mention Type: PRONOMINAL Gender: MALE Start: 1 End: 2
      Mention: "his" in sentence 1 Span: [his]
      Mention Type: PRONOMINAL Gender: MALE Start: 3 End: 4
    ...
    CorefChain: CHAIN8-["their lunch" in sentence 1]
    ClusterId: 8 CorefMention: "their lunch" in sentence 1 Span: [their lunch]
      Mention: "their lunch" in sentence 1 Span: [their lunch]
      Mention Type: NOMINAL Gender: UNKNOWN Start: 14 End: 16

String question =  
    "Who is the 32nd president of the United States?";

String parserModel = ".../englishPCFG.ser.gz"; 
LexicalizedParser lexicalizedParser =  
    LexicalizedParser.loadModel(parserModel); 

TokenizerFactory<CoreLabel> tokenizerFactory =  
    PTBTokenizer.factory(new CoreLabelTokenFactory(), ""); 
Tokenizer<CoreLabel> tokenizer =  
    tokenizerFactory.getTokenizer(new StringReader(question)); 
List<CoreLabel> wordList = tokenizer.tokenize(); 
Tree parseTree = lexicalizedParser.apply(wordList); 

TreebankLanguagePack tlp =  
    lexicalizedParser.treebankLanguagePack(); 
GrammaticalStructureFactory gsf =  
    tlp.grammaticalStructureFactory(); 
GrammaticalStructure gs =  
    gsf.newGrammaticalStructure(parseTree); 
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed(); 
System.out.println(tdl); 
for (TypedDependency dependency : tdl) { 
    System.out.println("Governor Word: [" + dependency.gov()  
        + "] Relation: [" + dependency.reln().getLongName() 
        + "] Dependent Word: [" + dependency.dep() + "]"); 
} 

    [root(ROOT-0, Who-1), cop(Who-1, is-2), det(president-5, the-3), amod(president-5, 32nd-4), nsubj(Who-1, president-5), det(States-9, the-7), nn(States-9, United-8), prep_of(president-5, States-9)]
    Governor Word: [ROOT] Relation: [root] Dependent Word: [Who/WP]
    Governor Word: [Who/WP] Relation: [copula] Dependent Word: [is/VBZ]
    Governor Word: [president/NN] Relation: [determiner] Dependent Word: [the/DT]
    Governor Word: [president/NN] Relation: [adjectival modifier] Dependent Word: [32nd/JJ]
    Governor Word: [Who/WP] Relation: [nominal subject] Dependent Word: [president/NN]
    Governor Word: [States/NNPS] Relation: [determiner] Dependent Word: [the/DT]
    Governor Word: [States/NNPS] Relation: [nn modifier] Dependent Word: [United/NNP]
    Governor Word: [president/NN] Relation: [prep_collapsed] Dependent Word: [States/NNPS]

for (TypedDependency dependency : tdl) { 
    if ("nominal subject".equals( dependency.reln().getLongName()) 
        && "who".equalsIgnoreCase( dependency.gov().originalText())) { 
        processWhoQuestion(tdl); 
    } 
} 

    Who is the 32nd president of the United States?
    Who was the 32nd president of the United States?
    The 32nd president of the United States was who?
    The 32nd president is who of the United States?

    What was the 3rd President's party?
    When was the 12th president inaugurated?
    Where is the 30th president's home town?

    George Washington   (1789-1797) 

public List<President> createPresidentList() { 
    ArrayList<President> list = new ArrayList<>(); 
    String line = null; 
    try (FileReader reader = new FileReader("PresidentList"); 
            BufferedReader br = new BufferedReader(reader)) { 
        while ((line = br.readLine()) != null) { 
            SimpleTokenizer simpleTokenizer =  
                SimpleTokenizer.INSTANCE; 
            String tokens[] = simpleTokenizer.tokenize(line); 
            String name = ""; 
            String start = ""; 
            String end = ""; 
            int i = 0; 
            while (!"(".equals(tokens[i])) { 
                name += tokens[i] + " "; 
                i++; 
            } 
            start = tokens[i + 1]; 
            end = tokens[i + 3]; 
            if (end.equalsIgnoreCase("present")) { 
                end = start; 
            } 
            list.add(new President(name,  
                Integer.parseInt(start), 
                Integer.parseInt(end))); 
        } 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
    return list; 
} 

public class President { 
    private String name; 
    private int start; 
    private int end; 

    public President(String name, int start, int end) { 
        this.name = name; 
        this.start = start; 
        this.end = end; 
    } 
    ... 
} 
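
The list file uses one line per president in the form shown earlier: a name followed by a year range in parentheses. As an alternative to walking the tokens one by one, the same fields can be pulled out with a regular expression. This is a minimal sketch under stated assumptions: the class PresidentLineParser and its nested Entry type are hypothetical, the range is assumed to use an ASCII hyphen, and an end value of "present" reuses the start year, mirroring createPresidentList.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PresidentLineParser {

    // Name, optional whitespace, then "(start-end)"; end may be "present"
    private static final Pattern LINE = Pattern.compile(
        "^(.+?)\\s*\\((\\d{4})\\s*-\\s*(\\d{4}|present)\\)\\s*$",
        Pattern.CASE_INSENSITIVE);

    // Simple value holder for one parsed line
    static final class Entry {
        final String name;
        final int start;
        final int end;

        Entry(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the parsed entry, or null when the line does not match
    static Entry parse(String line) {
        if (line == null) {
            return null;
        }
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        int start = Integer.parseInt(m.group(2));
        // An open-ended term reuses the start year, as in the listing above
        int end = "present".equalsIgnoreCase(m.group(3))
            ? start : Integer.parseInt(m.group(3));
        return new Entry(m.group(1).trim(), start, end);
    }

    public static void main(String[] args) {
        Entry e = parse("George Washington   (1789-1797)");
        System.out.println(e.name + " " + e.start + " " + e.end);
    }
}
```

Because the name is captured as one group and trimmed, it keeps its original spacing instead of being rebuilt from tokens with a space after each one.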

public void processWhoQuestion(List<TypedDependency> tdl) { 
    List<President> list = createPresidentList(); 
    for (TypedDependency dependency : tdl) { 
        if ("president".equalsIgnoreCase( 
                dependency.gov().originalText()) 
                && "adjectival modifier".equals( 
                  dependency.reln().getLongName())) { 
            String positionText =  
                dependency.dep().originalText(); 
            int position = getOrder(positionText)-1; 
            System.out.println("The president is "  
                + list.get(position).getName()); 
        } 
    } 
}

private static int getOrder(String position) { 
    String tmp = ""; 
    int i = 0; 
    while (Character.isDigit(position.charAt(i))) { 
        tmp += position.charAt(i++); 
    } 
    return Integer.parseInt(tmp); 
} 

The president is Franklin D . Roosevelt
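
The getOrder helper works for input such as "32nd", but it reads past the end of the string when the input is all digits (for example "32") and calls Integer.parseInt on an empty string when the input has no leading digit (for example "third"). Below is a minimal sketch of a more defensive variant. The class name OrdinalParser and the method parseOrdinal are hypothetical, and spelled-out ordinals are simply reported as not parseable rather than translated.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrdinalParser {

    // Digits followed by an optional English ordinal suffix
    private static final Pattern ORDINAL = Pattern.compile(
        "^(\\d+)(st|nd|rd|th)?$", Pattern.CASE_INSENSITIVE);

    // Returns the numeric part of text such as "32nd", or an empty
    // OptionalInt when the text is not a digit-based ordinal
    static OptionalInt parseOrdinal(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        Matcher m = ORDINAL.matcher(text.trim());
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException ex) {
            // More digits than fit in an int
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrdinal("32nd"));
        System.out.println(parseOrdinal("32"));
        System.out.println(parseOrdinal("third"));
    }
}
```

A caller such as processWhoQuestion could check isPresent() before indexing into the president list, so that an unexpected modifier is skipped instead of raising an exception.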

try {
    URL url = new URL("https://en.wikipedia.org/wiki/Berlin");
    HTMLDocument htmldoc = HTMLFetcher.fetch(url);
    InputSource is = htmldoc.toInputSource();
    TextDocument document = 
        new BoilerpipeSAXInput(is).getTextDocument();
    System.out.println(document.getText(true, true));
} catch (MalformedURLException ex) {
    System.out.println(ex);
} catch (IOException ex) {
    System.out.println(ex);
} catch (SAXException | BoilerpipeProcessingException ex) {
    System.out.println(ex);
}

Berlin
From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
This article is about the capital of Germany. For other uses, see Berlin (disambiguation) .
State of Germany in Germany
Berlin
State of Germany
From top: Skyline including the TV Tower ,
City West skyline with Kaiser Wilhelm Memorial Church , Brandenburg Gate ,
East Side Gallery ( Berlin Wall ),
Oberbaum Bridge over the Spree ,
Reichstag building ( Bundestag )
.......
This page was last edited on 18 June 2018, at 11:18 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License ; additional terms may apply.  By using this site, you agree to the Terms of Use and Privacy Policy . Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. , a non-profit organization.
Privacy policy
About Wikipedia
Disclaimers
Contact Wikipedia
Developers
Cookie statement
Mobile view

private static String getResourcePath() {
    File currDir = new File(".");
    String path = currDir.getAbsolutePath();
    // Drop the trailing separator and "." from the absolute path
    path = path.substring(0, path.length() - 2);
    String resourcePath = path + File.separator 
        + "src/chapter11/TestDocument.docx";
    return resourcePath;
}

public static void main(String args[]) {
    try {
        FileInputStream fis = new FileInputStream(getResourcePath());
        POITextExtractor textExtractor = 
            ExtractorFactory.createExtractor(fis);
        System.out.println(textExtractor.getText());
    } catch (FileNotFoundException ex) {
        Logger.getLogger(WordDocExtractor.class.getName())
            .log(Level.SEVERE, null, ex);
    } catch (IOException ex) {
        System.out.println(ex);
    } catch (OpenXML4JException ex) {
        System.out.println(ex);
    } catch (XmlException ex) {
        System.out.println(ex);
    }
}

Jump to navigation Jump to search
Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
5,673,388 articles in English
Arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals
From today's featured article George Steiner The Portage to San Cristobal of A.H. is a 1981 literary and philosophical novella by George Steiner (pictured). The story is about Jewish Nazi hunters who find a fictional Adolf Hitler (A.H.) alive in the Amazon jungle thirty years after the end of World War II. The book was controversial, particularly among reviewers and Jewish scholars, because the author allows Hitler to defend himself when he is put on trial in the jungle by his captors. There Hitler maintains that Israel owes its existence to the Holocaust and that he is the "benefactor of the Jews". A central theme of The Portage is the nature of language, and revolves around Steiner's lifelong work on the subject and his fascination in the power and terror of human speech. Other themes include the philosophical and moral analysis of history, justice, guilt and revenge. Despite the controversy, it was a 1983 finalist in the PEN/Faulkner Award for Fiction. It was adapted for the theatre by British playwright Christopher Hampton. (Full article...) Recently featured: Monroe Edwards C. R. M. F. Cruttwell Russulaceae Archive By email More featured articles Did you know... Maria Bengtsson ... that a reviewer found Maria Bengtsson (pictured) believable and expressive when she first performed the title role of Arabella by Strauss? ... that the 2018 Osaka earthquake disrupted train services during the morning rush hour, forcing passengers to walk between the tracks? ... that funding for Celia Brackenridge's research into child protection in football was ended because the sport "was not ready for a gay former lacrosse international rummaging through its dirty linen"? ... that the multi-armed Heliaster helianthus sheds several of its arms when attacked by the six-armed predatory starfish Meyenaster gelatinosus? ... that if elected, Democratic candidate Deb Haaland would be the first Native American woman to become a member of the United States House of Representatives? ... 
that 145 Vietnamese civilians were killed during the 1967 Thuy Bo massacre? ... that Velvl Greene, a University of Minnesota professor of public health, taught more than 30,000 students? ... that a group of Fijians placed a newspaper ad to recruit skiers for Fiji at the 2002 Olympic Games after discussing it at a New Year's Eve party? Archive Start a new article Nominate an article In the news Lake Toba Saudi Arabia lifts its ban on women driving. Canada legalizes the cultivation of cannabis for recreational use with effect from October 2018, making it the second country to do so. An overloaded tourist ferry capsizes in Lake Toba (pictured), Indonesia, killing at least 3 people and leaving 193 others missing. In golf, Brooks Koepka wins the U.S. Open at the Shinnecock Hills Golf Club. Ongoing: FIFA World Cup Recent deaths: Joe Jackson Richard Harrison Yan Jizhou John Mack Nominate an article On this day June 28: Vidovdan in Serbia Anna Pavlova as Giselle 1776 – American Revolutionary War: South Carolina militia repelled a British attack on Charleston. 1841 – Giselle (Anna Pavlova pictured in the title role), a ballet by French composer Adolphe Adam, was first performed at the Théâtre de l'Académie Royale de Musique in Paris. 1911 – The first meteorite to suggest signs of aqueous processes on Mars fell to Earth in Abu Hummus, Egypt. 1978 – In Regents of the Univ. of Cal. v. Bakke, the U.S. Supreme Court barred quota systems in college admissions but declared that affirmative action programs giving advantage to minorities are constitutional. 2016 – Gunmen attacked Istanbul's Atatürk Airport, killing 45 people and injuring more than 230 others. Primož Trubar (d. 1586) · Paul Broca (b. 1824) · Yvonne Sylvain (b. 1907) More anniversaries: June 27 June 28 June 29 Archive By email List of historical anniversaries

Today's featured picture
    Henry VIII of England (1491–1547) was King of England from 1509 until his death. Henry was the second Tudor monarch, succeeding his father, Henry VII. Perhaps best known for his six marriages, his disagreement with the Pope on the question of annulment led Henry to initiate the English Reformation, separating the Church of England from papal authority and making the English monarch the Supreme Head of the Church of England. He also instituted radical changes to the English Constitution, expanded royal power, dissolved monasteries, and united England and Wales. In this, he spent lavishly and frequently quelled unrest using charges of treason and heresy. Painting: Workshop of Hans Holbein the Younger Recently featured: Lion of Al-lāt Sagittarius Japanese destroyer Yamakaze (1936) Archive More featured pictures

Other areas of Wikipedia
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.

POITextExtractor metaExtractor = 
    textExtractor.getMetadataTextExtractor();
System.out.println(metaExtractor.getText());

Created = Thu Jun 28 06:36:00 UTC 2018
CreatedString = 2018-06-28T06:36:00Z
Creator = Ashish
LastModifiedBy = Ashish
LastPrintedString = 
Modified = Thu Jun 28 06:37:00 UTC 2018
ModifiedString = 2018-06-28T06:37:00Z
Revision = 1
Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 26588
CharactersWithSpaces = 31190
Company = 
HyperlinksChanged = false
Lines = 221
LinksUpToDate = false
Pages = 8
Paragraphs = 62
Template = Normal.dotm
TotalTime = 1

fis = new FileInputStream(getResourcePath());
POIXMLPropertiesTextExtractor properties = 
    new POIXMLPropertiesTextExtractor(new XWPFDocument(fis));
CoreProperties coreProperties = properties.getCoreProperties();
System.out.println(properties.getCorePropertiesText());

ExtendedProperties extendedProperties = 
    properties.getExtendedProperties();
System.out.println(properties.getExtendedPropertiesText());

Created = Thu Jun 28 06:36:00 UTC 2018
CreatedString = 2018-06-28T06:36:00Z
Creator = Ashish
LastModifiedBy = Ashish
LastPrintedString = 
Modified = Thu Jun 28 06:37:00 UTC 2018
ModifiedString = 2018-06-28T06:37:00Z
Revision = 1

Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 26588
CharactersWithSpaces = 31190
Company = 
HyperlinksChanged = false
Lines = 221
LinksUpToDate = false
Pages = 8
Paragraphs = 62
Template = Normal.dotm
TotalTime = 1

File file = new File(getResourcePath());
// PDDocument is Closeable; release it once the text has been extracted
try (PDDocument pd = PDDocument.load(file)) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pd);
    System.out.println(text);
}

Jump to navigation Jump to search  
Welcome to Wikipedia, 
the free encyclopedia that anyone can edit. 
5,673,388 articles in English 
 Arts 
 Biography 
 Geography 
 History 
 Mathematics 
 Science 
 Society 
 Technology 
 All portals 
From today's featured article 

George Steiner 
The Portage to San Cristobal of A.H. is a 1981 
literary and philosophical novella by George Steiner 
(pictured). The story is about Jewish Nazi hunters 
who find a fictional Adolf Hitler (A.H.) alive in the 
Amazon jungle thirty years after the end of World 
War II. The book was controversial, particularly 
among reviewers and Jewish scholars, because the 
author allows Hitler to defend himself when he is 
put on trial in the jungle by his captors. There Hitler 
maintains that Israel owes its existence to the 
Holocaust and that he is the "benefactor of the 
Jews". A central theme of The Portage is the nature 
of language, and revolves around Steiner's lifelong 
work on the subject and his fascination in the power 
and terror of human speech. Other themes include 
the philosophical and moral analysis of history, 
justice, guilt and revenge. Despite the controversy, it 
was a 1983 finalist in the PEN/Faulkner Award for 
Fiction. It was adapted for the theatre by British 

In the news 

Lake Toba 
 Saudi Arabia lifts its ban on 
women driving. 
 Canada legalizes the cultivation of 
cannabis for recreational use 
with effect from October 2018, 
making it the second country to do 
so. 
 An overloaded tourist ferry 
capsizes in Lake Toba (pictured), 
Indonesia, killing at least 3 people 
and leaving 193 others missing. 
 In golf, Brooks Koepka wins the 
U.S. Open at the Shinnecock Hills 
Golf Club. 
Ongoing:  
 FIFA World Cup
.....

File file = new File("TestDocument.pdf");            
Tika tika = new Tika();
String filetype = tika.detect(file);

System.out.println(filetype);
System.out.println(tika.parseToString(file));            

application/pdf
Jump to navigation Jump to search  

Welcome to Wikipedia, 
the free encyclopedia that anyone can edit. 

5,673,388 articles in English 

 Arts 

 Biography 

 Geography 

 History 

 Mathematics 

 Science 

 Society 

 Technology 

 All portals 

From today's featured article 

George Steiner 

The Portage to San Cristobal of A.H. is a 1981 

literary and philosophical novella by George Steiner 

(pictured). The story is about Jewish Nazi hunters 

who find a fictional Adolf Hitler (A.H.) alive in the 

Amazon jungle thirty years after the end of World 

War II. The book was controversial, particularly 
....

props.put("annotators", "tokenize, ssplit, pos");

java.lang.IllegalArgumentException: annotator "pos" requires annotator "ssplit"

String text = "The robber took the cash and ran";
Properties props = new Properties();
props.put("annotators", 
    "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(text);

// List the annotation keys present before the pipeline runs
System.out.println("Before annotate method executed ");
Set<Class<?>> annotationSet = annotation.keySet();
for (Class<?> c : annotationSet) {
    System.out.println("\tClass: " + c.getName());
}

pipeline.annotate(annotation);

// List the annotation keys added by the pipeline
System.out.println("After annotate method executed ");
annotationSet = annotation.keySet();
for (Class<?> c : annotationSet) {
    System.out.println("\tClass: " + c.getName());
}

// Print each token followed by its part-of-speech tag
List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class); 
        String pos = token.get(PartOfSpeechAnnotation.class); 
        System.out.println(word);
        System.out.println(pos);
    }
}

Before annotate method executed 
    Class: edu.stanford.nlp.ling.CoreAnnotations$TextAnnotation
After annotate method executed 
    Class: edu.stanford.nlp.ling.CoreAnnotations$TextAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations$SentencesAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations$MentionsAnnotation
    Class: edu.stanford.nlp.coref.CorefCoreAnnotations$CorefMentionsAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations$CorefMentionToEntityMentionMappingAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations$EntityMentionToCorefMentionMappingAnnotation
    Class: edu.stanford.nlp.coref.CorefCoreAnnotations$CorefChainAnnotation
The
DT
robber
NN
took
VBD
the
DT
cash
NN
and
CC
ran
VBD

Annotation annotation1 = new Annotation("The robber took the cash and ran.");
Annotation annotation2 = new Annotation("The policeman chased him down the street.");
Annotation annotation3 = new Annotation("A passerby, watching the action, tripped the thief "
            + "as he passed by.");
Annotation annotation4 = new Annotation("They all lived happily ever after, except for the thief "
            + "of course.");

ArrayList<Annotation> list = new ArrayList<>();
list.add(annotation1);
list.add(annotation2);
list.add(annotation3);
list.add(annotation4);
Iterable<Annotation> iterable = list;
pipeline.annotate(iterable);
List<CoreMap> sentences1 = annotation2.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences1) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String pos = token.get(PartOfSpeechAnnotation.class);
        System.out.println("Word: " + word + " POS Tag: " + pos);
    }
}

Word: The POS Tag: DT
Word: policeman POS Tag: NN
Word: chased POS Tag: VBD
Word: him POS Tag: PRP
Word: down POS Tag: RP
Word: the POS Tag: DT
Word: street POS Tag: NN
Word: . POS Tag: .

try (InputStream is = new FileInputStream(
            new File(getResourcePath() + "en-sent.bin"));
        BufferedReader br = new BufferedReader(
            new FileReader(getResourcePath() + "pg164.txt"))) {
    System.out.println(getResourcePath() + "en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(model);

    // Join the file's lines into one string for sentence detection
    String line;
    StringBuilder sb = new StringBuilder();
    while ((line = br.readLine()) != null) {
        sb.append(line + " ");
    }
    String sentences[] = detector.sentDetect(sb.toString());
    for (int i = 0; i < sentences.length; i++) {
        sentences[i] = sentences[i].toLowerCase();
    }

//    StopWords stopWords = new StopWords("stop-words_english_2_en.txt");
//    for (int i = 0; i < sentences.length; i++) {
//        sentences[i] = stopWords.removeStopWords(sentences[i]);
//    }

    // Index every word by sentence number and position in the sentence
    HashMap<String, Word> wordMap = new HashMap<>();
    for (int sentenceIndex = 0; 
            sentenceIndex < sentences.length; sentenceIndex++) {
        String words[] = WhitespaceTokenizer.INSTANCE.tokenize(
            sentences[sentenceIndex]);
        Word word;
        for (int wordIndex = 0; 
                wordIndex < words.length; wordIndex++) {
            String newWord = words[wordIndex];
            if (wordMap.containsKey(newWord)) {
                word = wordMap.remove(newWord);
            } else {
                word = new Word();
            }
            word.addWord(newWord, sentenceIndex, wordIndex);
            wordMap.put(newWord, word);
        }
    }

    // Search only after the whole text has been indexed
    Word sword = wordMap.get("sea");
    ArrayList<Positions> positions = sword.getPositions();
    for (Positions position : positions) {
        System.out.println(sword.getWord() + " is found at line " 
            + position.sentence + ", word " 
            + position.position);
    }
} catch (FileNotFoundException ex) {
    Logger.getLogger(SearchText.class.getName())
        .log(Level.SEVERE, null, ex);
} catch (IOException ex) {
    Logger.getLogger(SearchText.class.getName())
        .log(Level.SEVERE, null, ex);
}

class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}

public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, 
            int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}

SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);

String line;
StringBuilder sb = new StringBuilder();
while((line = br.readLine())!=null){
    sb.append(line + " ");
}
String sentences[] = detector.sentDetect(sb.toString());
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}

class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}

public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, 
            int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}

HashMap<String, Word> wordMap = new HashMap<>();
for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++) {
    String words[] = WhitespaceTokenizer.INSTANCE.tokenize(sentences[sentenceIndex]);
    Word word;
    for (int wordIndex = 0; wordIndex < words.length; wordIndex++) {
        String newWord = words[wordIndex];
        // Reuse the entry already stored for this word, or start a new one
        if (wordMap.containsKey(newWord)) {
            word = wordMap.remove(newWord);
        } else {
            word = new Word();
        }
        word.addWord(newWord, sentenceIndex, wordIndex);
        wordMap.put(newWord, word);
    }
}

Word sword = wordMap.get("sea");
// get() returns null when the word never occurs in the text
if (sword != null) {
    ArrayList<Positions> positions = sword.getPositions();
    for (Positions position : positions) {
        System.out.println(sword.getWord() + " is found at line "
            + position.sentence + ", word "
            + position.position);
    }
}

sea is found at line 0, word 7
sea is found at line 2, word 6
sea is found at line 2, word 37
sea is found at line 3, word 5
sea is found at line 20, word 11
sea is found at line 39, word 3
sea is found at line 46, word 6
sea is found at line 57, word 4
sea is found at line 133, word 2
sea is found at line 229, word 3
sea is found at line 281, word 14
sea is found at line 292, word 12
sea is found at line 320, word 22
sea is found at line 328, word 21
sea is found at line 355, word 22
sea is found at line 363, word 1
sea is found at line 391, word 13
sea is found at line 395, word 6
sea is found at line 450, word 12
sea is found at line 460, word 6
.....
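
The word index built above can also be sketched with the JDK alone. The following is a minimal illustration, not the book's implementation: it assumes whitespace splitting in place of the OpenNLP tokenizer and sentence detector, and the names `WordIndexSketch`, `buildIndex`, and `describe` are hypothetical. It uses `Map.computeIfAbsent` so that the lookup-then-put step collapses into one call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordIndexSketch {

    // Maps each lowercased word to its {sentenceIndex, wordIndex} occurrences.
    static Map<String, List<int[]>> buildIndex(String[] sentences) {
        Map<String, List<int[]>> index = new HashMap<>();
        for (int s = 0; s < sentences.length; s++) {
            String sentence = sentences[s].toLowerCase().trim();
            if (sentence.isEmpty()) {
                continue; // nothing to index in a blank sentence
            }
            // Assumption: whitespace splitting stands in for a real tokenizer.
            String[] words = sentence.split("\\s+");
            for (int w = 0; w < words.length; w++) {
                index.computeIfAbsent(words[w], k -> new ArrayList<>())
                     .add(new int[] {s, w});
            }
        }
        return index;
    }

    // Formats the occurrences of one word in the same style as the output above.
    static List<String> describe(Map<String, List<int[]>> index, String word) {
        List<String> lines = new ArrayList<>();
        for (int[] p : index.getOrDefault(word, new ArrayList<>())) {
            lines.add(word + " is found at line " + p[0] + ", word " + p[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] sentences = {"The sea is calm", "We sail at dawn", "Back to the sea"};
        describe(buildIndex(sentences), "sea").forEach(System.out::println);
    }
}
```

A word that never occurs simply yields an empty list here, so the caller does not need a null check.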

<?xml version="1.0" encoding="UTF-8"?>
<aiml>
</aiml>

<?xml version="1.0" encoding="UTF-8"?>
<aiml>
    <category>
        <pattern>Hello</pattern>
        <template> Hello, How are you ? </template>
    </category>
</aiml>

<?xml version="1.0" encoding="UTF-8"?>
<aiml>
    <category>
        <pattern>I like *.</pattern>
        <template>Ok, so you like <star/></template>
    </category>
</aiml>

<?xml version="1.0" encoding="UTF-8"?>
<aiml>
    <category>
        <pattern>I like * and *</pattern>
        <template> Ok, so you like <star index="1"/> and <star index="2"/></template>
    </category>
</aiml>
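
To make the wildcard binding in the pattern above concrete, here is a rough Java sketch of how `*` captures could be mapped to `<star index="n"/>` placeholders. It is a simplification built on `java.util.regex`, not how Program AB actually matches patterns; the class `WildcardSketch` and its methods are hypothetical names, and each `*` is assumed to match one or more characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WildcardSketch {

    // Returns the text bound to each * in the pattern, or null if the input does not match.
    static List<String> stars(String aimlPattern, String input) {
        StringBuilder regex = new StringBuilder();
        String[] literals = aimlPattern.trim().split("\\*", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("(.+)"); // one capture group per wildcard
            }
            regex.append(Pattern.quote(literals[i]));
        }
        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE)
                           .matcher(input.trim());
        if (!m.matches()) {
            return null;
        }
        List<String> bound = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) {
            bound.add(m.group(g).trim());
        }
        return bound;
    }

    // Fills <star/> and <star index="n"/> placeholders in a template, or returns null on no match.
    static String respond(String aimlPattern, String template, String input) {
        List<String> bound = stars(aimlPattern, input);
        if (bound == null) {
            return null;
        }
        String reply = template;
        for (int i = 0; i < bound.size(); i++) {
            reply = reply.replace("<star index=\"" + (i + 1) + "\"/>", bound.get(i));
        }
        if (!bound.isEmpty()) {
            reply = reply.replace("<star/>", bound.get(0)); // <star/> means the first wildcard
        }
        return reply.trim();
    }

    public static void main(String[] args) {
        System.out.println(respond("I like * and *",
            "Ok, so you like <star index=\"1\"/> and <star index=\"2\"/>",
            "I like tea and coffee"));
    }
}
```

A real AIML interpreter normalizes the input and walks a pattern graph instead of compiling regular expressions, so treat this only as a model of what the two `star` indexes refer to.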

<?xml version="1.0" encoding="UTF-8"?>
<aiml>
    <category>
        <pattern>I WANT TO BOOK AN APPOINTMENT</pattern>
        <template>Are you sure</template>
    </category>
    <category>
        <pattern>Can I *</pattern>
        <template><srai>I want to <star/></srai></template>
    </category>    
    <category>
        <pattern>May I * </pattern>
        <template>
            <srai>I want to <star/></srai>
        </template>
    </category>
</aiml>

program-ab-0.0.4.3$ java -cp lib/Ab.jar Main bot=test action=chat trace=false

Human :

<?xml version="1.0" encoding="UTF-8"?>
<aiml>

<category><pattern>I WANT TO BOOK AN APPOINTMENT</pattern>
<template>Are you sure you want to book an appointment</template>
</category>
<category><pattern>YES</pattern><that>ARE YOU SURE YOU WANT TO BOOK AN APPOINTMENT</that>
<template>Can you tell me date and time</template>
</category>
<category><pattern>NO</pattern><that>ARE YOU SURE YOU WANT TO BOOK AN APPOINTMENT</that>
<template>No Worries.</template>
</category>
<category><pattern>DATE * TIME *</pattern><that>CAN YOU TELL ME DATE AND TIME</that>
<template>You want appointment on <set name="udate"><star index="1"/> </set> and time <set name="utime"><star index="2"/></set>. Should i confirm.</template>
</category>
<category><pattern>YES</pattern><that>SHOULD I CONFIRM</that>
<template><get name="username"/>, your appointment is confirmed for <get name="udate"/> : <get name="utime"/></template>
</category>
<category><pattern>I AM *</pattern>
<template>Hello <set name="username"> <star/>! </set></template>
</category>
<category><pattern>BYE</pattern>
<template>Bye <get name="username"/> Thanks for the conversation!</template>
</category>
</aiml>

<category><pattern>I AM *</pattern>
<template>Hello <set name="username"> <star/>! </set></template>
</category>

<category><pattern>I WANT TO BOOK AN APPOINTMENT</pattern>
<template>Are you sure you want to book an appointment</template>
</category>

<category><pattern>YES</pattern><that>ARE YOU SURE YOU WANT TO BOOK AN APPOINTMENT</that>
<template>Can you tell me date and time</template>
</category>
<category><pattern>NO</pattern><that>ARE YOU SURE YOU WANT TO BOOK AN APPOINTMENT</that>
<template>No Worries.</template>
</category>

<category><pattern>DATE * TIME *</pattern><that>CAN YOU TELL ME DATE AND TIME</that>
<template>You want appointment on <set name="udate"><star index="1"/> </set> and time <set name="utime"><star index="2"/></set>. Should i confirm.</template>
</category>
<category><pattern>YES</pattern><that>SHOULD I CONFIRM</that>
<template><get name="username"/>, your appointment is confirmed for <get name="udate"/> : <get name="utime"/></template>
</category>

Robot : Hello, I am your appointment scheduler May i know your name
Human : 
I am ashish
Robot : Hello ashish!
Human : 
I want to book an appointment
Robot : Are you sure you want to book an appointment
Human : 
yes
Robot : Can you tell me date and time
Human : 
Date 24/06/2018 time 4 pm
Robot : You want appointment on 24/06/2018 and time 4 pm. Should i confirm.
Human : 
yes
Robot : ashish!, your appointment is confirmed for 24/06/2018 : 4 pm

import java.io.File;

import org.alicebot.ab.Bot;
import org.alicebot.ab.MagicBooleans;

public class GenerateAIML {

    private static final boolean TRACE_MODE = false;
    static String botName = "appointment";

    public static void main(String[] args) {
        try {
            String resourcesPath = getResourcesPath();
            System.out.println(resourcesPath);
            MagicBooleans.trace_mode = TRACE_MODE;
            Bot bot = new Bot(botName, resourcesPath);

            // Write the bot's categories out as AIML files
            bot.writeAIMLFiles();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Builds the bot resource directory relative to the working directory
    private static String getResourcesPath() {
        File currDir = new File(".");
        String path = currDir.getAbsolutePath();
        // Drop the trailing separator and "." from the absolute path
        path = path.substring(0, path.length() - 2);
        System.out.println(path);
        String resourcePath = path + File.separator + "src/chapter12/mybot";
        return resourcePath;
    }
}

import java.io.File;

import org.alicebot.ab.Bot;
import org.alicebot.ab.Chat;
import org.alicebot.ab.History;
import org.alicebot.ab.MagicBooleans;
import org.alicebot.ab.MagicStrings;
import org.alicebot.ab.utils.IOUtils;

public class Mychatbotdemo {
    private static final boolean TRACE_MODE = false;
    static String botName = "appointment";

    // Builds the bot resource directory relative to the working directory
    private static String getResourcePath() {
        File currDir = new File(".");
        String path = currDir.getAbsolutePath();
        // Drop the trailing separator and "." from the absolute path
        path = path.substring(0, path.length() - 2);
        System.out.println(path);
        String resourcePath = path + File.separator + "src/chapter12/mybot";
        return resourcePath;
    }

    public static void main(String args[]) {
        try {
            String resourcePath = getResourcePath();
            System.out.println(resourcePath);
            MagicBooleans.trace_mode = TRACE_MODE;
            Bot bot = new Bot(botName, resourcePath);
            Chat chatSession = new Chat(bot);
            bot.brain.nodeStats();
            String textLine = "";
            System.out.println("Robot : Hello, I am your appointment scheduler May i know your name");
            while (true) {
                System.out.println("Human : ");
                textLine = IOUtils.readInputTextLine();
                if ((textLine == null) || (textLine.length() < 1)) {
                    textLine = MagicStrings.null_input;
                }
                if (textLine.equals("q")) {
                    System.exit(0);
                } else if (textLine.equals("wq")) {
                    bot.writeQuit();
                } else {
                    String request = textLine;
                    if (MagicBooleans.trace_mode) {
                        System.out.println("STATE=" + request + ":THAT" + ((History) chatSession.thatHistory.get(0)).get(0) + ": Topic" + chatSession.predicates.get("topic"));
                    }
                    String response = chatSession.multisentenceRespond(request);
                    // Undo the XML escaping of angle brackets in the reply
                    response = response.replace("&lt;", "<").replace("&gt;", ">");
                    System.out.println("Robot : " + response);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

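Character	Meaning
Unicode space character	(space_separator, line_separator, or paragraph_separator)
`\t`	U+0009 horizontal tabulation
`\n`	U+000A line feed
`\u000B`	U+000B vertical tabulation
`\f`	U+000C form feed
`\r`	U+000D carriage return
`\u001C`	U+001C file separator
`\u001D`	U+001D group separator
`\u001E`	U+001E record separator
`\u001F`	U+001F unit separator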
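
The characters in this table are the ones that the JDK method `Character.isWhitespace` accepts. As a quick check, the sketch below prints the result for each of them; the class name `WhitespaceCheck` and the helper `countWhitespace` are hypothetical. The non-breaking space U+00A0 is added as a contrast case, since the JDK documentation excludes non-breaking spaces even though they are Unicode space separators.

```java
final class WhitespaceCheck {

    // Counts the characters in text that Character.isWhitespace accepts.
    static int countWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The table entries, followed by U+00A0 (non-breaking space) for contrast.
        char[] samples = {' ', '\t', '\n', '\u000B', '\f', '\r',
                          '\u001C', '\u001D', '\u001E', '\u001F', '\u00A0'};
        for (char c : samples) {
            System.out.printf("U+%04X -> %b%n", (int) c, Character.isWhitespace(c));
        }
    }
}
```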

Annotator	Operation performed
`tokenize`	Tokenization
`ssplit`	Sentence splitting
`pos`	POS tagging
`lemma`	Lemmatization
`ner`	NER
`parse`	Syntactic parsing
`dcoref`	Coreference resolution

Tag	Description
JJ	Adjective
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
POS	Possessive ending
PRP	Personal pronoun
RB	Adverb
RP	Particle
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle

Entity type	Regular expression	Output
URL	`\b(https?|ftp|file|ldap)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]`	`http://example.com [256:274]`
ZIP code	`[0-9]{5}(\\-?[0-9]{4})?`	`12345-1234 [150:160]`
Email	`[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}`	`rgb@colorworks.com [27:45]`
Time	`(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?`	`8:00 [217:221]`, `4:30 [229:233]`
Date	`((0?[13578]|10|12)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[01]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1}))|(0?[2469]|11)(-|\/)(([1-9])|(0[1-9])|([12])([0-9]?)|(3[0]?))(-|\/)((19)([2-9])(\d{1})|(20)([01])(\d{1})|([8901])(\d{1})))`	`2/25/1954 [315:324]`
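
The output column above pairs each match with its `[start:end]` character offsets. A minimal sketch of how such output could be produced with `java.util.regex` follows; the class `RegexEntitySketch`, the `find` helper, and the sample text are assumptions made for illustration, while the two patterns are the ZIP code and email expressions from the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexEntitySketch {

    // Patterns copied from the table, written as Java string literals.
    static final String ZIP = "[0-9]{5}(\\-?[0-9]{4})?";
    static final String EMAIL = "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}";

    // Returns every match of regex in text, formatted as "match [start:end]".
    static List<String> find(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group() + " [" + m.start() + ":" + m.end() + "]");
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample text; the offsets depend entirely on this string.
        String text = "Mail rgb@colorworks.com or write to ZIP 12345-1234.";
        find(EMAIL, text).forEach(System.out::println);
        find(ZIP, text).forEach(System.out::println);
    }
}
```

The end offset is exclusive, which is why an 18-character address reported as `[27:45]` in the table spans exactly 45 - 27 characters.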

Model	Output
`en-ner-location.bin`	Span: [4..5) location; Entity: Boston; Probability: 0.8656908776583051; Span: [5..6) location; Entity: Vermont; Probability: 0.9732488014011262
`en-ner-money.bin`	Span: [14..16) money; Entity: 2.45; Probability: 0.7200919701507937
`en-ner-organization.bin`	Span: [16..17) organization; Entity: IBM; Probability: 0.9256970736336729
`en-ner-time.bin`	The model was not able to detect time in this text sequence

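Tag	Description	Tag	Description
CC	Coordinating conjunction	PRP$	Possessive pronoun
CD	Cardinal number	RB	Adverb
DT	Determiner	RBR	Adverb, comparative
EX	Existential there	RBS	Adverb, superlative
FW	Foreign word	RP	Particle
IN	Preposition or subordinating conjunction	SYM	Symbol
JJ	Adjective	TO	to
JJR	Adjective, comparative	UH	Interjection
JJS	Adjective, superlative	VB	Verb, base form
LS	List item marker	VBD	Verb, past tense
MD	Modal	VBG	Verb, gerund or present participle
NN	Noun, singular or mass	VBN	Verb, past participle
NNS	Noun, plural	VBP	Verb, non-3rd person singular present
NNP	Proper noun, singular	VBZ	Verb, 3rd person singular present
NNPS	Proper noun, plural	WDT	Wh-determiner
PDT	Predeterminer	WP	Wh-pronoun
POS	Possessive ending	WP$	Possessive wh-pronoun
PRP	Personal pronoun	WRB	Wh-adverb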

Natural Language Processing with Java (complete)

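Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews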

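1. Introduction to NLP

What is NLP?

Why use NLP?

Why is NLP so hard?

Survey of NLP tools

Apache OpenNLP

Stanford NLP

LingPipe

GATE

UIMA

Apache Lucene Core

Deep learning for Java

Overview of text-processing tasks

Finding parts of text

Finding sentences

Feature engineering

Finding people and things

Detecting parts of speech

Classifying text and documents

Extracting relationships

Using combined approaches

Understanding NLP models

Identifying the task

Selecting a model

Building and training the model

Verifying the model

Using the model

Preparing data

Summary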

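2. Finding Parts of Text

Understanding the parts of text

What is tokenization?

Uses of tokenizers

Simple Java tokenizers

Using the Scanner class

Specifying the delimiter

Using the split method

Using the BreakIterator class

Using the StreamTokenizer class

Using the StringTokenizer class

Performance considerations with Java core tokenization

NLP tokenizer APIs

Using the OpenNLPTokenizer class

Using the SimpleTokenizer class

Using the WhitespaceTokenizer class

Using the TokenizerME class

Using the Stanford tokenizer

Using the PTBTokenizer class

Using the DocumentPreprocessor class

Using a pipeline

Using LingPipe tokenizers

Training a tokenizer to find parts of text

Comparing tokenizers

Understanding normalization

Converting to lowercase

Removing stopwords

Creating a StopWords class

Using LingPipe to remove stopwords

Using stemming

Using the Porter Stemmer

Stemming with LingPipe

Using lemmatization

Using the StanfordLemmatizer class

Using lemmatization in OpenNLP

Normalizing using a pipeline

Summary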

3. Finding Sentences