Ontotext provides a crawling tool called RSS Feeder, which is integrated with KIM: you can manage the RSS feeds that are crawled and whose content is populated into a running KIM server. The RSS Feeder is nevertheless a standalone application, which is distributed and run separately from KIM.
The KIM download page should also provide a download link for the RSS Feeder.
After downloading the zip file, extract it to a folder of your choice. To start the RSS Feeder, navigate to the extracted folder and execute:
- bin/ncf.sh for Linux/Mac
- bin/ncf.bat for Windows
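If you start the RSS Feeder from a script rather than by hand, the choice between the two launchers can be automated. The following is a minimal Python sketch, not part of the distribution: the helper names launcher_for and start_feeder are made up for illustration, and it assumes that bin/ncf.sh is executable on Linux/Mac.

```python
import platform
import subprocess
from pathlib import Path


def launcher_for(system_name: str) -> str:
    """Return the start script, relative to the extracted folder, for an OS name."""
    # bin/ncf.bat is the Windows launcher; bin/ncf.sh covers Linux and Mac.
    return "bin/ncf.bat" if system_name == "Windows" else "bin/ncf.sh"


def start_feeder(feeder_home: str) -> subprocess.Popen:
    """Start the RSS Feeder from its extracted folder and return the process handle."""
    home = Path(feeder_home).resolve()
    # platform.system() reports "Windows", "Linux" or "Darwin" (Mac).
    script = home / launcher_for(platform.system())
    # Assumption: on Linux/Mac the script has its executable bit set.
    return subprocess.Popen([str(script)], cwd=home)
```

Calling start_feeder with the path of the extracted folder runs the launcher from that folder, which mirrors the manual steps above.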
By default the RSS Feeder starts with no configured feeds. You can configure feeds through KIM's management section.
The RSS Feeder runs a universal boilerplate removal algorithm, which is based on a patched Java port of Arc90's Readability.
This tutorial assumes that you already have a local KIM server running.
- Navigate to the KIM UI - http://localhost:8080/KIM
- Click on Manage, as shown in the picture below. The default credentials are user: admin, password: admin.
- Now click on Manage RSS feeds.
- If your setup is correct, you will see a page like this, where you can add, remove and search through the currently configured feeds.
- If the RSS Feeder isn't connected to KIM, the following page will be displayed:
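Before walking through the steps above, it can help to confirm that the KIM UI answers at all. The sketch below is one way to do that in Python; the function name kim_ui_reachable is hypothetical. It only checks that something responds over HTTP at the URL, and says nothing about whether the RSS Feeder is connected to KIM.

```python
import urllib.error
import urllib.request


def kim_ui_reachable(url: str = "http://localhost:8080/KIM", timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status such as 401 or 404.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, unknown host, timeout and similar failures.
        return False
```

If the function returns False for the default URL, the KIM server is probably not running locally or listens on a different host or port.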
Very often the KIM server and the RSS Feeder do not run on their default host and port, and are located on different machines. To tell the RSS Feeder where the KIM server is running, edit the file kim_connection.properties, which is located in the config folder of the RSS Feeder distribution.
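The property names inside kim_connection.properties are not listed here, so a first step is simply to see what the file defines. The following Python sketch prints the keys and values it finds. It is a simplified reader written for illustration (the names read_properties and show_connection_settings are hypothetical), and it does not handle line continuations or escape sequences.

```python
from pathlib import Path


def read_properties(text: str) -> dict:
    """Parse simple key=value or key:value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue  # blank line or a '#' / '!' comment
        # Split on the first '=' or ':', whichever comes first.
        positions = [i for i in (line.find("="), line.find(":")) if i != -1]
        if not positions:
            props[line] = ""  # a key with no value
            continue
        cut = min(positions)
        props[line[:cut].strip()] = line[cut + 1:].strip()
    return props


def show_connection_settings(feeder_home: str) -> None:
    """Print every property found in config/kim_connection.properties."""
    path = Path(feeder_home) / "config" / "kim_connection.properties"
    # Java .properties files are traditionally encoded as ISO-8859-1.
    for key, value in read_properties(path.read_text(encoding="latin-1")).items():
        print(f"{key} = {value}")
```

Running show_connection_settings on the extracted folder shows which entries are available to edit before you change the host or port by hand.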
You may not want the default feeds, which are numerous and cover general topics such as finance, healthcare, technology and politics, and may prefer to subscribe to only a few specific feeds. The default feeds are configured in the file feeds.xml, which is also located in the config folder of the RSS Feeder distribution. To erase them, remove all feed tags defined in that file.
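Removing the feed tags by hand is tedious when there are many of them, so the cleanup can be scripted. The Python sketch below assumes that each feed is an element literally named feed and that the file uses no XML namespaces; the layout of feeds.xml is not described here, so check the file first. The function names are hypothetical. Note that the standard ElementTree parser drops comments and the XML declaration when the file is rewritten, which is why a backup copy is kept.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Union


def strip_feeds(xml_source: Union[str, bytes], tag: str = "feed") -> str:
    """Return the XML with every element named `tag` removed, at any depth."""
    root = ET.fromstring(xml_source)
    # Visit every element and drop its direct children named `tag`.
    for parent in list(root.iter()):
        for child in [c for c in parent if c.tag == tag]:
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")


def clear_feeds_file(path: str) -> None:
    """Rewrite feeds.xml without its feed elements, keeping a .bak copy."""
    source = Path(path)
    # Keep a backup next to the original before overwriting it.
    shutil.copy2(source, source.with_name(source.name + ".bak"))
    # Reading bytes lets the parser honor any encoding declared in the file.
    source.write_text(strip_feeds(source.read_bytes()), encoding="utf-8")
```

For example, clear_feeds_file("config/feeds.xml"), run from the extracted folder, leaves the surrounding structure in place and removes only the feed elements.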