Is there a Java implementation of the Porter2 stemmer?

Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here:

http://tartarus.org/~martin/PorterStemmer/java.txt

but on http://tartarus.org/~martin/PorterStemmer/ the author mentions that Porter is a bit outdated and recommends using Porter2, available at

http://snowball.tartarus.org/algorithms/english/stemmer.html

However, my problem is that this Porter2 is written in Snowball (I had never heard of it before, so I don't know anything about it). What I am looking for is a Java version of it.

Thanks. Your help will be highly appreciated.

-------------Problems Reply------------

The Snowball algorithm is available as a Java download.

And from snowball.tartarus.org:

Feb 2002 - Java support: Richard has modified the snowball code generator to produce Java output as well as ANSI C output. This means that pure Java systems can now use the snowball stemmers.

This is what you want, right?

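You can create an instance of it like so (lang is the language name, e.g. "English" for Porter2; the exact capitalization depends on the Snowball build you use):

// Load the stemmer class for the chosen language by name
Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." + lang + "Stemmer");
SnowballProgram stemmer = (SnowballProgram) stemClass.newInstance();
stemmer.setCurrent("your_word");
stemmer.stem();
String your_stemmed_word = stemmer.getCurrent();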
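
If you would rather not compile against the Snowball classes, the same calls can probably be made purely through reflection. The sketch below is only an illustration: SnowballReflect and stemWith are made-up names, and the class name you pass in (for example org.tartarus.snowball.ext.englishStemmer) is an assumption that depends on which Snowball build is on your classpath. It assumes the stemmer exposes public setCurrent(String), stem() and getCurrent() methods, as in the snippet above, and it returns an empty Optional when the class or one of those methods cannot be found.

```java
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical helper: calls a Snowball-style stemmer without a compile-time dependency.
class SnowballReflect {

    /**
     * Tries to stem a word with the stemmer class of the given fully qualified name.
     * Returns Optional.empty() if the class or its methods are not available.
     */
    static Optional<String> stemWith(String stemmerClassName, String word) {
        try {
            Class<?> cls = Class.forName(stemmerClassName);
            Object stemmer = cls.getDeclaredConstructor().newInstance();

            // Assumed API, taken from the snippet above: setCurrent, stem, getCurrent
            Method setCurrent = cls.getMethod("setCurrent", String.class);
            Method stem = cls.getMethod("stem");
            Method getCurrent = cls.getMethod("getCurrent");

            setCurrent.invoke(stemmer, word);
            stem.invoke(stemmer);
            return Optional.of((String) getCurrent.invoke(stemmer));
        } catch (ReflectiveOperationException e) {
            // Class missing, no default constructor, or a method is absent or failed
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // The class name here is an assumption; adjust it to your Snowball distribution.
        String className = args.length > 0 ? args[0] : "org.tartarus.snowball.ext.englishStemmer";
        Optional<String> result = stemWith(className, "running");
        System.out.println(result.isPresent() ? result.get() : "stemmer class not found: " + className);
    }
}
```

With a Snowball jar on the classpath this should print the stem of "running"; without it, the program only reports that the class could not be loaded.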

/*

Porter stemmer in Java. The original paper is in

Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137,

See also http://www.tartarus.org/~martin/PorterStemmer

History:

Release 1

Bug 1 (reported by Gonzalo Parra 16/10/99) fixed as marked below.
The words 'aed', 'eed', 'oed' leave k at 'a' for step 3, and b[k-1]
is then outside the bounds of b.

Release 2

Similarly,

Bug 2 (reported by Steve Dyrdahl 22/2/00) fixed as marked below.
'ion' by itself leaves j = -1 in the test for 'ion' in step 5, and
b[j] is then outside the bounds of b.

Release 3

Considerably revised 4/9/00 in the light of many helpful suggestions
from Brian Goetz of Quiotix Corporation.

Release 4

*/

import java.io.*;

/**
* Stemmer, implementing the Porter Stemming Algorithm
*
* The Stemmer class transforms a word into its root form. The input
* word can be provided a character at a time (by calling add(char)), or
* several characters at once (by calling add(char[], int)); stem() then stems it.
*/

class Stemmer
{ private char[] b;
private int i, /* offset into b */
i_end, /* offset to end of stemmed word */
j, k;
private static final int INC = 50;
/* unit of size whereby b is increased */
public Stemmer()
{ b = new char[INC];
i = 0;
i_end = 0;
}

/**
* Add a character to the word being stemmed. When you are finished
* adding characters, you can call stem() to stem the word.
*/

public void add(char ch)
{ if (i == b.length)
{ char[] new_b = new char[i+INC];
for (int c = 0; c < i; c++) new_b[c] = b[c];
b = new_b;
}
b[i++] = ch;
}

/** Adds wLen characters to the word being stemmed contained in a portion
* of a char[] array. This is like repeated calls of add(char ch), but
* faster.
*/

public void add(char[] w, int wLen)
{ if (i+wLen >= b.length)
{ char[] new_b = new char[i+wLen+INC];
for (int c = 0; c < i; c++) new_b[c] = b[c];
b = new_b;
}
for (int c = 0; c < wLen; c++) b[i++] = w[c];
}

/**
* After a word has been stemmed, it can be retrieved by toString(),
* or a reference to the internal buffer can be retrieved by getResultBuffer
* and getResultLength (which is generally more efficient.)
*/
public String toString() { return new String(b,0,i_end); }

/**
* Returns the length of the word resulting from the stemming process.
*/
public int getResultLength() { return i_end; }

/**
* Returns a reference to a character buffer containing the results of
* the stemming process. You also need to consult getResultLength()
* to determine the length of the result.
*/
public char[] getResultBuffer() { return b; }

/* cons(i) is true <=> b[i] is a consonant. */

private final boolean cons(int i)
{ switch (b[i])
{ case 'a': case 'e': case 'i': case 'o': case 'u': return false;
case 'y': return (i==0) ? true : !cons(i-1);
default: return true;
}
}

/* m() measures the number of consonant sequences between 0 and j. if c is
a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
presence,

<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
....
*/

private final int m()
{ int n = 0;
int i = 0;
while(true)
{ if (i > j) return n;
if (! cons(i)) break; i++;
}
i++;
while(true)
{ while(true)
{ if (i > j) return n;
if (cons(i)) break;
i++;
}
i++;
n++;
while(true)
{ if (i > j) return n;
if (! cons(i)) break;
i++;
}
i++;
}
}

/* vowelinstem() is true <=> 0,...j contains a vowel */

private final boolean vowelinstem()
{ int i; for (i = 0; i <= j; i++) if (! cons(i)) return true;
return false;
}

/* doublec(j) is true <=> j,(j-1) contain a double consonant. */

private final boolean doublec(int j)
{ if (j < 1) return false;
if (b[j] != b[j-1]) return false;
return cons(j);
}

/* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. this is used when trying to
restore an e at the end of a short word. e.g.

cav(e), lov(e), hop(e), crim(e), but
snow, box, tray.

*/

private final boolean cvc(int i)
{ if (i < 2 || !cons(i) || cons(i-1) || !cons(i-2)) return false;
{ int ch = b[i];
if (ch == 'w' || ch == 'x' || ch == 'y') return false;
}
return true;
}

private final boolean ends(String s)
{ int l = s.length();
int o = k-l+1;
if (o < 0) return false;
for (int i = 0; i < l; i++) if (b[o+i] != s.charAt(i)) return false;
j = k-l;
return true;
}

/* setto(s) sets (j+1),...k to the characters in the string s, readjusting
k. */

private final void setto(String s)
{ int l = s.length();
int o = j+1;
for (int i = 0; i < l; i++) b[o+i] = s.charAt(i);
k = j+l;
}

/* r(s) is used further down. */

private final void r(String s) { if (m() > 0) setto(s); }

/* step1() gets rid of plurals and -ed or -ing. e.g.

caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat

feed -> feed
agreed -> agree
disabled -> disable

matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess

meetings -> meet

*/

private final void step1()
{ if (b[k] == 's')
{ if (ends("sses")) k -= 2; else
if (ends("ies")) setto("i"); else
if (b[k-1] != 's') k--;
}
if (ends("eed")) { if (m() > 0) k--; } else
if ((ends("ed") || ends("ing")) && vowelinstem())
{ k = j;
if (ends("at")) setto("ate"); else
if (ends("bl")) setto("ble"); else
if (ends("iz")) setto("ize"); else
if (doublec(k))
{ k--;
{ int ch = b[k];
if (ch == 'l' || ch == 's' || ch == 'z') k++;
}
}
else if (m() == 1 && cvc(k)) setto("e");
}
}

/* step2() turns terminal y to i when there is another vowel in the stem. */

private final void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; }

/* step3() maps double suffices to single ones. so -ization ( = -ize plus
-ation) maps to -ize etc. note that the string before the suffix must give
m() > 0. */

private final void step3() { if (k == 0) return; /* For Bug 1 */ switch (b[k-1])
{
case 'a': if (ends("ational")) { r("ate"); break; }
if (ends("tional")) { r("tion"); break; }
break;
case 'c': if (ends("enci")) { r("ence"); break; }
if (ends("anci")) { r("ance"); break; }
break;
case 'e': if (ends("izer")) { r("ize"); break; }
break;
case 'l': if (ends("bli")) { r("ble"); break; }
if (ends("alli")) { r("al"); break; }
if (ends("entli")) { r("ent"); break; }
if (ends("eli")) { r("e"); break; }
if (ends("ousli")) { r("ous"); break; }
break;
case 'o': if (ends("ization")) { r("ize"); break; }
if (ends("ation")) { r("ate"); break; }
if (ends("ator")) { r("ate"); break; }
break;
case 's': if (ends("alism")) { r("al"); break; }
if (ends("iveness")) { r("ive"); break; }
if (ends("fulness")) { r("ful"); break; }
if (ends("ousness")) { r("ous"); break; }
break;
case 't': if (ends("aliti")) { r("al"); break; }
if (ends("iviti")) { r("ive"); break; }
if (ends("biliti")) { r("ble"); break; }
break;
case 'g': if (ends("logi")) { r("log"); break; }
} }

/* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */

private final void step4() { switch (b[k])
{
case 'e': if (ends("icate")) { r("ic"); break; }
if (ends("ative")) { r(""); break; }
if (ends("alize")) { r("al"); break; }
break;
case 'i': if (ends("iciti")) { r("ic"); break; }
break;
case 'l': if (ends("ical")) { r("ic"); break; }
if (ends("ful")) { r(""); break; }
break;
case 's': if (ends("ness")) { r(""); break; }
break;
} }

/* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */

private final void step5()
{ if (k == 0) return; /* for Bug 1 */ switch (b[k-1])
{ case 'a': if (ends("al")) break; return;
case 'c': if (ends("ance")) break;
if (ends("ence")) break; return;
case 'e': if (ends("er")) break; return;
case 'i': if (ends("ic")) break; return;
case 'l': if (ends("able")) break;
if (ends("ible")) break; return;
case 'n': if (ends("ant")) break;
if (ends("ement")) break;
if (ends("ment")) break;
/* element etc. not stripped before the m */
if (ends("ent")) break; return;
case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
/* j >= 0 fixes Bug 2 */
if (ends("ou")) break; return;
/* takes care of -ous */
case 's': if (ends("ism")) break; return;
case 't': if (ends("ate")) break;
if (ends("iti")) break; return;
case 'u': if (ends("ous")) break; return;
case 'v': if (ends("ive")) break; return;
case 'z': if (ends("ize")) break; return;
default: return;
}
if (m() > 1) k = j;
}

/* step6() removes a final -e if m() > 1. */

private final void step6()
{ j = k;
if (b[k] == 'e')
{ int a = m();
if (a > 1 || a == 1 && !cvc(k-1)) k--;
}
if (b[k] == 'l' && doublec(k) && m() > 1) k--;
}

/** Stem the word placed into the Stemmer buffer through calls to add().
* The method returns nothing; retrieve the result with
* getResultLength()/getResultBuffer() or toString().
*/
public void stem()
{ k = i - 1;
if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); }
i_end = k+1; i = 0;
}

/** Test program for demonstrating the Stemmer. It reads text from
* a list of files, stems each word, and writes the result to standard
* output. Note that the word stemmed is expected to be in lower case:
* forcing lower case must be done outside the Stemmer class.
* Usage: Stemmer file-name file-name ...
*/
public static void main(String[] args)
{
char[] w = new char[501];
Stemmer s = new Stemmer();
for (int i = 0; i < args.length; i++)
try
{
FileInputStream in = new FileInputStream(args[i]);

try
{ while(true)

{ int ch = in.read();
if (Character.isLetter((char) ch))
{
int j = 0;
while(true)
{ ch = Character.toLowerCase((char) ch);
w[j] = (char) ch;
if (j < 500) j++;
ch = in.read();
if (!Character.isLetter((char) ch))
{
/* to test add(char ch) */
for (int c = 0; c < j; c++) s.add(w[c]);

/* or, to test add(char[] w, int j) */
/* s.add(w, j); */

s.stem();
{ String u;

/* and now, to test toString() : */
u = s.toString();

/* to test getResultBuffer(), getResultLength() : */
/* u = new String(s.getResultBuffer(), 0, s.getResultLength()); */

System.out.print(u);
}
break;
}
}
}
if (ch < 0) break;
System.out.print((char)ch);
}
}
catch (IOException e)
{ System.out.println("error reading " + args[i]);
break;
}
}
catch (FileNotFoundException e)
{ System.out.println("file " + args[i] + " not found");
break;
}
}
}
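
To make the m() and cvc() comments in the listing above a bit more concrete, here is a small standalone sketch of both conditions. It is not part of the original Stemmer class: PorterSketch, measure and endsCvc are names made up for illustration, and unlike m(), which only looks at positions 0..j of the internal buffer, this version measures a whole lower-case String.

```java
// Standalone illustration of two Porter helper conditions (hypothetical names).
class PorterSketch {

    /** True if the character at index i of w counts as a consonant. */
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                // 'y' is a consonant at the start of a word or directly after a vowel
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    /** Counts vowel-to-consonant transitions, i.e. the number of "vc" groups in <c>vc...vc<v>. */
    static int measure(String w) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean consonant = isConsonant(w, i);
            if (consonant && previousWasVowel) {
                m++; // a vowel run has just been closed by a consonant
            }
            previousWasVowel = !consonant;
        }
        return m;
    }

    /** True if w ends consonant-vowel-consonant and the final letter is not w, x or y. */
    static boolean endsCvc(String w) {
        int i = w.length() - 1;
        if (i < 2 || !isConsonant(w, i) || isConsonant(w, i - 1) || !isConsonant(w, i - 2)) {
            return false;
        }
        char ch = w.charAt(i);
        return ch != 'w' && ch != 'x' && ch != 'y';
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tree", "trouble", "troubles", "hop", "snow"}) {
            System.out.println(w + ": m=" + measure(w) + ", cvc=" + endsCvc(w));
        }
    }
}
```

For example, "tree" has measure 0, "trouble" has measure 1 and "troubles" has measure 2, while "hop" satisfies the cvc condition and "snow" does not because it ends in w.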

It is available as a part of MG4J.

See the documentation for EnglishStemmer (i.e. Porter2) and use the method processTerm(MutableString ms).

MG4J also gives you Java versions of other stemmers. See the snowball package. All these stemmers can be used independently.

Maybe not a direct answer, but there are stemmers in many NLP toolkits; see http://en.wikipedia.org/wiki/Natural_language_processing_toolkits. There is also a related question, "Tokenizer, Stop Word Removal, Stemming in Java", with several answers that might be useful.

We use OpenNLP, which is written in Java and may provide the functionality. I wouldn't expect the variation between stemmers to be critical if you are working in English.

It seems that Lucene integrates some stemming algorithms in one form or another. You may find what you're looking for starting at the package org.apache.lucene.analysis. However, I fear the stemming code is deeply integrated into the analysis components, which would make it quite hard to extract.

The following link contains a Snowball stemmer API, which includes the Porter2 stemmer implementation: http://preciselyconcise.com/apis_and_installations/snowball_stemmer.php

Here is a lightweight wrapper I made that is easy to re-use and available on Maven Central.

Category:java Views:0 Time:2010-12-09
