SIOC - Incremental crawling
In a recent blog post [1] I described a SIMILE Timeline based on SIOC
data.
[1] http://captsolo.net/info/blog_a.php/2006/07/14/sioc_sparql_and_timeline
The post contains more information about the timeline (e.g., the
scripts used) and about the problems encountered. One of the problems
is that SIOC data, once crawled, gets old quickly. An obvious solution
is incremental crawling: download only the new data.
Now incremental crawling is available in our SIOC / RDF crawler [2].
Other features:
- can limit crawling to the same domain (default: on)
- can exclude comments / replies (default: off)
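
To illustrate what a same-domain limit means in practice, here is a
minimal sketch in Python. It is not the crawler's actual code: the
function names (same_domain, filter_links) and the same_domain_only
flag are made up for this example, and it compares host names only.

```python
from urllib.parse import urlparse


def same_domain(seed_url, candidate_url):
    """Return True if both URLs have the same host name.

    urlparse().hostname is lower-cased, so the comparison is
    case-insensitive. Hypothetical helper, not part of the crawler.
    """
    return urlparse(seed_url).hostname == urlparse(candidate_url).hostname


def filter_links(seed_url, links, same_domain_only=True):
    """Drop links that point to other hosts when the limit is on."""
    if not same_domain_only:
        return list(links)
    return [url for url in links if same_domain(seed_url, url)]


if __name__ == "__main__":
    links = [
        "http://captsolo.net/info/blog_a.php",
        "http://example.org/other",
    ]
    print(filter_links("http://captsolo.net/info/", links))
```

With the limit on, only the first link is kept; with
same_domain_only=False both links are returned unchanged.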
How it works:
- run the crawler (./run); its crawling results are saved to
'result.rdf'
- for incremental crawling, copy the result file 'result.rdf' to
'input.rdf'
- run the crawler again; only new posts should be crawled.
(incremental crawling is on by default, but only has an effect if
'input.rdf' is present)
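
The idea behind the steps above can be sketched roughly as follows.
This is only an illustration under assumptions, not the crawler's
real implementation: the crawler uses Redland, while this sketch uses
Python's standard XML parser, assumes 'input.rdf' is RDF/XML, and
treats every rdf:about URI in it as "already crawled". The function
names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Qualified name of the rdf:about attribute in RDF/XML.
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"


def load_known_uris(path="input.rdf"):
    """Collect all rdf:about URIs from a previous crawl result.

    Returns an empty set if the file is missing, so that the first
    run behaves like a full crawl.
    """
    if not os.path.exists(path):
        return set()
    tree = ET.parse(path)
    return {el.attrib[RDF_ABOUT] for el in tree.iter() if RDF_ABOUT in el.attrib}


def uris_to_crawl(discovered, known):
    """Keep only URIs not seen in the previous crawl, preserving order."""
    return [uri for uri in discovered if uri not in known]


if __name__ == "__main__":
    known = load_known_uris("input.rdf")
    discovered = ["http://example.org/post/1", "http://example.org/post/2"]
    for uri in uris_to_crawl(discovered, known):
        print("would crawl:", uri)
```

Without 'input.rdf' the set of known URIs is empty and everything is
crawled, which matches the note above that incremental crawling only
has an effect when that file is present.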
Please try it out. :)
If you want to know more about how it works and what its limitations
are, please write or look at the code. Bugs can be recorded at:
http://esw.w3.org/topic/SIOC/ToDoList#crawler
[2] http://sw.deri.org/svn/sw/2005/08/sioc/crawler/releases/crawler_v0.7.tar.gz
(requires Python and Redland)
Uldis
[ http://captsolo.net/info/ ]
2006-07-25T19:02:31+01:00