#Try the sample flow ReadDocumentBasic.flow
Run it with both the nullWriter and the ESPWriter.
Do not use ctsAnnotationsImporter or the related scope-processing stages (scopifier and xmlifier), especially for CJK content.
They do not seem to work right now; using them produces the error "FIXML has illegal UTF-8 byte sequences".
Create a pipeline named CrawlerCts, using the sitesearch pipeline as a template:
1. Create an instance of a stage named CtsParserCrawler, based on CtsParser, with this parameter:
GenerateScopesFromAnnotations 0
2. Put CtsParserCrawler after Docinit.
3. Remove the following stages:
DocumentRetriever
URLProcessor
Decompressor
FormatDetector
SimpleConverter
FlashConverter
PDFConverter
XPSConverter
SearchExportConverter
FastHTMLParser
LanguageAndEncodingDetector
EncodingNormalizer
#If you need WebAnalyser, do not remove these:
WAAttributeLookup
WALinkRankAnchorTextFormatter
WACrawlerLinkFilter
WARankDocument
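
The pipeline edits above can be sketched as a small list transformation. This is only an illustration: the stage names come from these notes, but the template order in the example is hypothetical and this is not an ESP API.

```python
# Stages to remove from the sitesearch-based template (names from the notes above).
REMOVE = [
    "DocumentRetriever", "URLProcessor", "Decompressor", "FormatDetector",
    "SimpleConverter", "FlashConverter", "PDFConverter", "XPSConverter",
    "SearchExportConverter", "FastHTMLParser",
    "LanguageAndEncodingDetector", "EncodingNormalizer",
]
# Stages to keep only when WebAnalyser is needed.
WEB_ANALYSER = [
    "WAAttributeLookup", "WALinkRankAnchorTextFormatter",
    "WACrawlerLinkFilter", "WARankDocument",
]


def build_crawler_cts(template, keep_web_analyser=True):
    """Return the CrawlerCts stage list derived from a template stage list."""
    drop = set(REMOVE)
    if not keep_web_analyser:
        drop |= set(WEB_ANALYSER)
    stages = [s for s in template if s not in drop]
    # Place CtsParserCrawler directly after Docinit.
    stages.insert(stages.index("Docinit") + 1, "CtsParserCrawler")
    return stages


if __name__ == "__main__":
    # Hypothetical excerpt of a template, for illustration only.
    template = ["Docinit", "DocumentRetriever", "FastHTMLParser", "WARankDocument"]
    print(build_crawler_cts(template))
```
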
Tip: define your collection name in the Mapper operator.
#Using ESP Crawler with CTS
Edit c:/esp/etc/CrawlerGlobalDefaults.xml:
...
<section name="cde">
<attrib name="contentdistributors" type="list-string">
<member> localhost:17078 </member>
</attrib>
</section>
...
nctrl stop crawler
nctrl start crawler
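
To double-check the edit before restarting, the cde section can be read back with Python's standard XML parser. A minimal sketch, assuming the section looks exactly like the fragment above (the real file contains more sections, elided here as in the notes):

```python
import xml.etree.ElementTree as ET

# Fragment copied from the notes above.
FRAGMENT = """
<section name="cde">
  <attrib name="contentdistributors" type="list-string">
    <member> localhost:17078 </member>
  </attrib>
</section>
"""


def content_distributors(section_xml):
    """Return the stripped <member> values of the contentdistributors attrib."""
    root = ET.fromstring(section_xml.strip())
    attrib = root.find("attrib[@name='contentdistributors']")
    if attrib is None:
        return []
    return [m.text.strip() for m in attrib.findall("member") if m.text]


print(content_distributors(FRAGMENT))  # prints ['localhost:17078']
```
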
Configure the crawler's feeding destination parameters in the Admin GUI.
(The procedure in the FSIS documentation will not work: if no feeding destination is defined, the
exported config file will be empty for this group of parameters.)
name:cde
Target Collection:cntv1;fsistraining.crawlingvideo
Destination:cde
Pause ESP feeding:no
Primary:yes
crawleradmin.exe -G cntv1 > crawler_cntv1.xml
notepad ./crawler_cntv1.xml
#Confirm the feeding destination parameter
<section name="feeding">
<section name="cde">
<attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
<attrib name="destination" type="string"> cde </attrib>
<attrib name="paused" type="boolean"> no </attrib>
<attrib name="primary" type="boolean"> yes </attrib>
</section>
</section>
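
The same check can be scripted instead of done in notepad. A rough sketch that lists the feeding destinations found in a feeding section; the XML is the fragment from these notes, and the helper name is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Feeding section as it should appear in the exported collection config.
FEEDING = """
<section name="feeding">
  <section name="cde">
    <attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
    <attrib name="destination" type="string"> cde </attrib>
    <attrib name="paused" type="boolean"> no </attrib>
    <attrib name="primary" type="boolean"> yes </attrib>
  </section>
</section>
"""


def feeding_destinations(feeding_xml):
    """Map each destination name to its attrib name/value pairs (values stripped)."""
    root = ET.fromstring(feeding_xml.strip())
    result = {}
    for dest in root.findall("section"):
        result[dest.get("name")] = {
            a.get("name"): (a.text or "").strip() for a in dest.findall("attrib")
        }
    return result


dests = feeding_destinations(FEEDING)
if not dests:
    print("no feeding destination defined")
for name, params in dests.items():
    print(name, params)
```

An empty result corresponds to the case described above, where no feeding destination was defined in the Admin GUI.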
#Change start_uris and include_uris to define where you are going to crawl
<attrib name="start_uris" type="list-string">
<member>
http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
<section name="include_uris">
<attrib name="exact" type="list-string">
<member>
http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
</section>
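
A quick sanity check for this step is to confirm that every start URI also appears in the include_uris exact list. The sketch below does only that; the wrapping root element is an assumption added so the fragment parses on its own, and any other include rule types are ignored.

```python
import xml.etree.ElementTree as ET

URL = "http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml"

# <root> is a hypothetical wrapper; the inner elements follow the notes above.
CONFIG = f"""
<root>
  <attrib name="start_uris" type="list-string">
    <member>
      {URL} </member>
  </attrib>
  <section name="include_uris">
    <attrib name="exact" type="list-string">
      <member>
        {URL} </member>
    </attrib>
  </section>
</root>
"""


def members(node):
    """Return the non-empty, stripped <member> values under node as a set."""
    if node is None:
        return set()
    return {m.text.strip() for m in node.findall("member") if m.text and m.text.strip()}


def uncovered_start_uris(config_xml):
    """Return start URIs that are missing from the include_uris exact list."""
    root = ET.fromstring(config_xml.strip())
    start = members(root.find("attrib[@name='start_uris']"))
    exact = members(root.find("section[@name='include_uris']/attrib[@name='exact']"))
    return sorted(start - exact)


print(uncovered_start_uris(CONFIG))  # prints []
```
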
Run the flow from VS or the FSIS Admin GUI.
Remove the crawler data source definition from the collection cntv1.
crawleradmin -f ./crawler_cntv1.xml
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Added collection config(s): Scheduled collection for crawling
#Watching the crawler from a command window
crawleradmin --status
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Collection    Status    Feed Status    Active Sites    Stored Docs    Doc Rate
-------------------------------------------------------------------------------
cntv10        Idle      Feeding        0               1              N/A
cntv11        Idle      Feeding        0               1              N/A
cntv12        Idle      Feeding        0               1              N/A
cntv13        Idle      Feeding        0               1              N/A
cntv14        Idle      Feeding        0               1              N/A
cntv8         Idle      Feeding        0               2              N/A
cntv9         Idle      Feeding        0               5              N/A
                                       0               12             0.0 dps
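
If the status has to be checked repeatedly, the text output can be parsed with a few lines of Python. This is a sketch against the sample output above only; it assumes each collection row has six whitespace-separated fields, which would not hold if a status value contained spaces.

```python
# Sample output copied from the crawleradmin --status run above.
STATUS = """\
Collection Status Feed Status Active Sites Stored Docs Doc Rate
-------------------------------------------------------------------------------
cntv10 Idle Feeding 0 1 N/A
cntv11 Idle Feeding 0 1 N/A
cntv12 Idle Feeding 0 1 N/A
cntv13 Idle Feeding 0 1 N/A
cntv14 Idle Feeding 0 1 N/A
cntv8 Idle Feeding 0 2 N/A
cntv9 Idle Feeding 0 5 N/A
0 12 0.0 dps
"""


def parse_status(text):
    """Return one dict per collection row; header, rule and totals lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[3].isdigit() and parts[4].isdigit():
            rows.append({
                "collection": parts[0],
                "status": parts[1],
                "feed_status": parts[2],
                "active_sites": int(parts[3]),
                "stored_docs": int(parts[4]),
                "doc_rate": parts[5],
            })
    return rows


rows = parse_status(STATUS)
print(len(rows), sum(r["stored_docs"] for r in rows))  # prints 7 12
```

The summed stored-docs count matches the totals line of the sample (12).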
#Watching doclog
doclog -l
doclog -a http://xxx/xxxx/
#Watching the CTS flow log in
C:/Users/FSIS Service/AppData/Local/FSIS/Nodes/Fsis/ContentEngineNode1/Logs/ContentProcessing
#Watching the crawler log in
C:/esp/var/log/crawler
#Adding a Spy stage to the ESP pipeline for monitoring