Monday, July 9, 2012

Handling the unrecognizable characters in Stanford Parser

The Stanford Parser and Tokenizer may log a warning like this when they hit a character they do not recognize:
WARNING: Unrecognizable .... U+FFD ... for some character.
The solution is described here:
http://nlp.stanford.edu/software/tokenizer.shtml
It is the tokenizer's untokenizable option, which controls what to do with untokenizable characters (ones not known to the tokenizer). There are six values, combining whether to log a warning for none, the first, or all of them, and whether to delete them or to include them as single-character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete". To silence the warnings entirely, choose noneDelete or noneKeep.
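
As a rough illustration of how the six values are put together, here is a small Java sketch that builds the option string from a logging policy (none, first, all) and an action (Delete, Keep). The class name UntokenizableOption and its build method are made up for this example and are not part of the Stanford library; only the option name and its six values come from the tokenizer page.

```java
import java.util.Arrays;
import java.util.List;

public class UntokenizableOption {
    // The three logging policies and the two actions listed on the tokenizer page.
    private static final List<String> WARN = Arrays.asList("none", "first", "all");
    private static final List<String> ACTION = Arrays.asList("Delete", "Keep");

    /** Builds an option string such as "untokenizable=noneDelete". */
    public static String build(String warn, String action) {
        if (!WARN.contains(warn) || !ACTION.contains(action)) {
            throw new IllegalArgumentException(
                "Unknown untokenizable setting: " + warn + action);
        }
        return "untokenizable=" + warn + action;
    }

    public static void main(String[] args) {
        // Log no warnings and drop the offending characters.
        String options = build("none", "Delete");
        System.out.println(options);
        // Assumption: this string is what you would hand to the tokenizer as
        // its options, e.g. as the options argument of PTBTokenizer. Check the
        // page linked above for how your version expects to receive it.
    }
}
```

If you would rather keep the odd characters in the output as single-character tokens, build("none", "Keep") gives "untokenizable=noneKeep" instead.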
