
Tips for Using CTS Flow

September 18, 2013 ⁄ General
#Run CTS Flow
#Try the Sample flow ReadDocumentBasic.flow
#Develop a CTS flow using ESP crawler
Callbacks need to be enabled for both the nullWriter and the ESPWriter.
 
#Configure a pipeline to work with the CTS flow
Get the CTS stages ctsAnnotationsImporter and ctsParser from the feedingOverlay package.
Don't use ctsAnnotationsImporter or the related scope-processing stages (scopifier and xmlifier), especially for CJK content.
They do not seem to work at the moment; using them produces the error "FIXML has illegal UTF-8 byte sequences".

Create a pipeline named CrawlerCts, using the sitesearch pipeline as a template:
1. Create a stage instance named CtsParserCrawler based on CtsParser, with this setting:
GenerateScopesFromAnnotations 0

2. Put CtsParserCrawler after Docinit.
3. Remove the following stages:
DocumentRetriever
URLProcessor
Decompressor
FormatDetector
SimpleConverter
FlashConverter
PDFConverter
XPSConverter
SearchExportConverter
FastHTMLParser

 
#For CJK content, do not remove:

LanguageAndEncodingDetector
EncodingNormalizer
 
#If you need WebAnalyser, do not remove:
WAAttributeLookup
WALinkRankAnchorTextFormatter
WACrawlerLinkFilter
WARankDocument
4. The CtsAnnotationsImporter stage is not necessary if you don't need scope searching.
 
Tip: define your collection name in the Mapper operator.
 
#Using ESP Crawler with CTS
In c:/esp/etc/CrawlerGlobalDefaults.xml:
...
<section name="cde"> 
  <attrib name="contentdistributors" type="list-string">
    <member> localhost:17078 </member>
  </attrib>
</section>
...
nctrl stop crawler
nctrl start crawler
Configure the crawler's feeding destinations parameter in the Admin GUI.
(The approach described in the FSIS documentation will not work: if no feeding destination is defined, the exported config file will be empty for this group of parameters.)
name:cde
Target Collection:cntv1;fsistraining.crawlingvideo
Destination:cde
Pause ESP feeding:no
Primary:yes
 

crawleradmin.exe -G cntv1 > crawler_cntv1.xml
notepad ./crawler_cntv1.xml
#Confirm the feeding destination parameter
<section name="feeding">
            <section name="cde">
                <attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
                <attrib name="destination" type="string"> cde </attrib>
                <attrib name="paused" type="boolean"> no </attrib>
                <attrib name="primary" type="boolean"> yes </attrib>
            </section>
</section>
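
Instead of checking the exported file by eye, the feeding section can be read with a short script. This is only a minimal sketch: the <config> wrapper in the sample and the function name feeding_destinations are assumptions made for illustration, and only the inner "feeding" section comes from the export shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <config> wrapper is an assumption; the "feeding"
# section is copied from the exported crawler configuration.
SAMPLE_FEEDING = """<config>
  <section name="feeding">
    <section name="cde">
      <attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
      <attrib name="destination" type="string"> cde </attrib>
      <attrib name="paused" type="boolean"> no </attrib>
      <attrib name="primary" type="boolean"> yes </attrib>
    </section>
  </section>
</config>"""


def feeding_destinations(xml_text):
    """Return {destination name: {attrib name: value}} for the first
    section named "feeding", or {} if there is no such section."""
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):  # iter() also visits the root itself
        if section.get("name") == "feeding":
            return {
                dest.get("name"): {
                    attrib.get("name"): (attrib.text or "").strip()
                    for attrib in dest.findall("attrib")
                }
                for dest in section.findall("section")
            }
    return {}


if __name__ == "__main__":
    for name, params in feeding_destinations(SAMPLE_FEEDING).items():
        print(name, params)
```

To check a real export, read crawler_cntv1.xml into a string and pass it to the function. An empty result means the feeding destination was not exported.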
#Change start_uris and include_uris to define what you are going to crawl
<attrib name="start_uris" type="list-string">
    <member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
<section name="include_uris">
    <attrib name="exact" type="list-string">
        <member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
    </attrib>
</section>
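
If the same URL has to go into both start_uris and the exact list of include_uris, the edit can also be scripted. The following is a rough sketch under stated assumptions: the <config> wrapper, the function names, and the example.com URL are hypothetical, and the real export may contain other elements around these two fragments. Note that ElementTree does not preserve comments or a DOCTYPE when writing a file back, so compare the result with the original before loading it with crawleradmin -f.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <config> wrapper is an assumption; the two inner
# fragments follow the layout of the exported crawler configuration.
SAMPLE_SCOPE = """<config>
  <attrib name="start_uris" type="list-string">
    <member> http://old.example.com/ </member>
  </attrib>
  <section name="include_uris">
    <attrib name="exact" type="list-string">
      <member> http://old.example.com/ </member>
    </attrib>
  </section>
</config>"""


def replace_members(attrib, values):
    """Replace all <member> children of a list-string attrib with new values."""
    for member in list(attrib):
        attrib.remove(member)
    for value in values:
        ET.SubElement(attrib, "member").text = f" {value} "


def set_crawl_scope(root, url):
    """Point both start_uris and include_uris/exact at a single URL."""
    start = root.find(".//attrib[@name='start_uris']")
    exact = root.find(".//section[@name='include_uris']/attrib[@name='exact']")
    if start is None or exact is None:
        raise ValueError("start_uris or include_uris/exact not found")
    replace_members(start, [url])
    replace_members(exact, [url])


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_SCOPE)
    set_crawl_scope(root, "http://www.example.com/video/index.shtml")
    print(ET.tostring(root, encoding="unicode"))
```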
Run the flow from Visual Studio or the FSIS Admin GUI.

Remove the crawler datasource definition from the collection cntv1
crawleradmin -f ./crawler_cntv1.xml
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Added collection config(s): Scheduled collection for crawling
#Watching the crawler from a command window
crawleradmin --status
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Collection       Status     Feed Status  Active Sites  Stored Docs     Doc Rate
-------------------------------------------------------------------------------
cntv10           Idle       Feeding      0             1               N/A
cntv11           Idle       Feeding      0             1               N/A
cntv12           Idle       Feeding      0             1               N/A
cntv13           Idle       Feeding      0             1               N/A
cntv14           Idle       Feeding      0             1               N/A
cntv8            Idle       Feeding      0             2               N/A
cntv9            Idle       Feeding      0             5               N/A
                                         0             12              0.0 dps
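
When many collections are listed, the status table can be turned into records for quick checks. This is a minimal sketch that assumes the column layout shown above and single-word values in the Status and Feed Status columns; the function name parse_status is hypothetical, and the sample text is copied from the output above.

```python
SAMPLE_STATUS = """\
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Collection       Status     Feed Status  Active Sites  Stored Docs     Doc Rate
-------------------------------------------------------------------------------
cntv10           Idle       Feeding      0             1               N/A
cntv11           Idle       Feeding      0             1               N/A
cntv12           Idle       Feeding      0             1               N/A
cntv13           Idle       Feeding      0             1               N/A
cntv14           Idle       Feeding      0             1               N/A
cntv8            Idle       Feeding      0             2               N/A
cntv9            Idle       Feeding      0             5               N/A
                                         0             12              0.0 dps
"""


def parse_status(output):
    """Parse the per-collection rows of the status table into dictionaries.

    A data row is taken to be any line with exactly six whitespace-separated
    fields whose 4th and 5th fields are integers; the banner, header,
    separator, and totals lines do not match and are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[3].isdigit() and parts[4].isdigit():
            rows.append({
                "collection": parts[0],
                "status": parts[1],
                "feed_status": parts[2],
                "active_sites": int(parts[3]),
                "stored_docs": int(parts[4]),
                "doc_rate": parts[5],
            })
    return rows


if __name__ == "__main__":
    for row in parse_status(SAMPLE_STATUS):
        print(row["collection"], row["status"], row["stored_docs"])
```

Summing stored_docs over the parsed rows should match the total on the last line of the table, which is a quick sanity check on the parse.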
          
          
          
#Watching doclog
doclog -l
doclog -a http://xxx/xxxx/

#Watching CTS Flow log from
C:/Users/FSIS Service/AppData/Local/FSIS/Nodes/Fsis/ContentEngineNode1/Logs/ContentProcessing
#Watching Crawler log from
C:/esp/var/log/crawler
#Adding a Spy stage to the ESP pipeline for monitoring
