Cleaning, Parsing, and Extracting with Tika

I normally use HtmlCleaner or jsoup for parsing and cleaning HTML. I have also tried Boilerpipe, but wasn't satisfied with it for various reasons. Each of these has its pros and cons, and one can be chosen according to the needs of the application.
I got a chance to try Apache Tika, and it is very good at parsing and cleaning HTML and scripts. Apart from this, it can also handle many other formats such as PDF, DOC, and ODF.
Solr also uses Tika for all these tasks, so staying in sync with Solr is more appropriate if something is being built on top of Solr.
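
A quick way to try this without writing any Java is the tika-app command-line jar. The sketch below assumes the jar has already been downloaded locally and that its path is stored in a variable I am calling TIKA_APP_JAR (my own name, not something Tika defines). The helper name tika_text is also just for illustration.

```shell
#!/bin/sh
# Sketch: extract plain text from a document with the Tika command-line app.
# TIKA_APP_JAR is an assumed variable holding the path to a downloaded
# tika-app jar; tika_text is a hypothetical helper name.

tika_text() {
    input="${1:-}"   # path to the input file (HTML, PDF, DOC, ODF, ...)
    if [ -z "${TIKA_APP_JAR:-}" ] || [ ! -f "$TIKA_APP_JAR" ]; then
        echo "tika_text: set TIKA_APP_JAR to the path of the tika-app jar" >&2
        return 2
    fi
    if [ ! -f "$input" ]; then
        echo "tika_text: no such file: $input" >&2
        return 1
    fi
    # --text asks tika-app to print the plain-text content of the document.
    java -jar "$TIKA_APP_JAR" --text "$input"
}
```

With the jar path set, something like `tika_text page.html > page.txt` should leave the extracted text in page.txt. The same function can be pointed at a PDF or DOC file, since Tika detects the format itself.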

Learn more about the Tika parser: http://tika.apache.org/
 

[Logs for myself  :) ]

Popular posts from this blog

Publishing business basics

Basic Steps:
1. Decide name for the company
2. Register the company with the ministry - you will need an attorney (lawyer) for that
3. Register with the Registrar of Newspapers for India if it's a magazine or newspaper.
4. Study the relevant acts in general, or have the lawyer explain them
5. Start publishing

The following are details regarding the same (not that well written):

-----
Some starting points and books:
* Start Your Own Self-Publishing Business (Entrepreneur Magazine's Start Up) by Entrepreneur Press
* How To Start And Run A Small Book Publishing Company: A Small Business Guide To Self-Publishing And Independent Publishing by Peter I. Hupalo
* Art & Science Of Book Publishing by Herbert S., Jr. Bailey
* This Business of Books: A Complete Overview of the Industry from Concept Through Sales by Claudia Suzanne
Raja Rammohun Roy National Agency for ISBN
West Block-I, Wing-6, 2nd Floor,
Sector -I, R.K. Puram,
New Delhi-110066


Some new initiatives in this area: Pothi.com

Starting it is …

Some beautiful Marathi songs

>Suhasya tuze manasi mohi
http://www.esnips.com/displayimage.php?pid=22969903

>Jenvha Tuzya batanna udawi mujor wara
http://www.esnips.com/displayimage.php?pid=22969806

>Bhay ithale sampat nahi
http://www.esnips.com/displayimage.php?pid=22969806

>Pahile na mi tula
http://www.esnips.com/displayimage.php?pid=22969877

>Te sparsha chandanyanche
http://www.esnips.com/displayimage.php?pid=2086815

Installing SyntaxNet on Ubuntu - Deep learning - TensorFlow

1. Install Java 8 (Java 7 is deprecated)
2. Install Bazel:
$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ curl https://storage.googleapis.com/bazel-apt/doc/apt-key.pub.gpg | sudo apt-key add -
$ sudo apt-get update && sudo apt-get install bazel
3. sudo apt-get install swig
4. sudo pip install -U protobuf==3.0.0b2
5. sudo pip install asciitree
6. sudo pip install numpy

You must also have git installed:
$ sudo apt-get install git

Then build and test:
$ git clone --recursive https://github.com/tensorflow/models.git
$ cd models/syntaxnet/tensorflow
$ ./configure
$ cd ..
$ bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
$ bazel test --linkopt=-headerpad_max_install_names \
    syntaxnet/... util/utf8/...
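
Before kicking off the long bazel test run, it can save time to confirm that the tools from the steps above are actually on the PATH. This is only a small sketch; missing_tools is a helper name I made up, and the tool list is my reading of the steps above.

```shell
#!/bin/sh
# Sketch: report which of the given commands are not found on PATH.
# missing_tools is a hypothetical helper, not part of SyntaxNet or Bazel.

missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"      # print the name of each missing command
            missing=1
        fi
    done
    return "$missing"         # 0 if everything was found, 1 otherwise
}

# Example call for the tools used in the install steps:
#   missing_tools java bazel swig pip git
```

If the call prints nothing and returns 0, all of the listed commands were found. It does not check versions, so Java 8 and the pinned protobuf release still need to be verified by hand.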