

Showing posts from October, 2012

Cleaning, Parsing and Extracting with Tika

I have normally used HtmlCleaner or jsoup for parsing and cleaning. I have also tried Boilerpipe, but wasn't satisfied with it for various reasons. Each of these has its pros and cons, and one of them can be chosen according to the needs of the application.
I got a chance to try Apache Tika, and it is very good at parsing and cleaning HTML and scripts. Apart from this, it can also parse many other formats such as PDF, DOC, ODF, etc.
Solr also uses Tika for all these tasks, so staying in sync with Solr is more appropriate if something is being built on top of Solr.
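
For my own reference, here is a minimal sketch of pulling plain text out of an HTML snippet with Tika. It assumes the tika-core and tika-parsers jars are on the classpath; the class name `TikaTextExtract`, the helper `extractText` and the sample HTML are made up for illustration. The same `AutoDetectParser` should handle PDF, DOC, ODF and so on, since it detects the type from the stream.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika itself.
class TikaTextExtract {

    // Parses any stream Tika can detect and returns the body text.
    static String extractText(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption for this sketch): markup plus a script tag.
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Hello Tika</p></body></html>";
        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            // Prints the extracted text with the markup stripped.
            System.out.println(extractText(in).trim());
        }
    }
}
```

If only the text is needed, the `org.apache.tika.Tika` facade with `parseToString` is a shorter route; the handler version above is useful when the metadata is wanted as well.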

Know more about the Tika parser:

[Logs for myself  :) ]