This book introduces the key issues in information retrieval (IR) and how those issues affect the design and implementation of search engines, and it reinforces the important concepts with mathematical models. On the important topic of web search engines, the book mainly covers search techniques that are widely used on the Web.

The book is suitable for undergraduate and graduate students in computer science or computer engineering, and it also serves as an ideal introductory text for professionals.

About the author:
W. Bruce Croft is a Distinguished Professor of Computer Science at the University of Massachusetts Amherst and an ACM Fellow. He founded the Center for Intelligent Information Retrieval, has published more than 200 papers, and has received numerous awards, including the Gerard Salton Award presented by ACM SIGIR in 2003.

Contents:
1 Search Engines and Information Retrieval
1.1 What Is Information Retrieval?
1.2 The Big Issues
1.3 Search Engines
1.4 Search Engineers
2 Architecture of a Search Engine
2.1 What Is an Architecture?
2.2 Basic Building Blocks
2.3 Breaking It Down
2.3.1 Text Acquisition
2.3.2 Text Transformation
2.3.3 Index Creation
2.3.4 User Interaction
2.3.5 Ranking
2.3.6 Evaluation
2.4 How Does It Really Work?
3 Crawls and Feeds
3.1 Deciding What to Search
3.2 Crawling the Web
3.2.1 Retrieving Web Pages
3.2.2 The Web Crawler
3.2.3 Freshness
3.2.4 Focused Crawling
3.2.5 Deep Web
3.2.6 Sitemaps
3.2.7 Distributed Crawling
3.3 Crawling Documents and Email
3.4 Document Feeds
3.5 The Conversion Problem
3.5.1 Character Encodings
3.6 Storing the Documents
3.6.1 Using a Database System
3.6.2 Random Access
3.6.3 Compression and Large Files
3.6.4 Update
3.6.5 BigTable
3.7 Detecting Duplicates
3.8 Removing Noise
4 Processing Text
4.1 From Words to Terms
4.2 Text Statistics
4.2.1 Vocabulary Growth
4.2.2 Estimating Collection and Result Set Sizes
4.3 Document Parsing
4.3.1 Overview
4.3.2 Tokenizing
4.3.3 Stopping
4.3.4 Stemming
4.3.5 Phrases and N-grams
4.4 Document Structure and Markup
4.5 Link Analysis
4.5.1 Anchor Text
4.5.2 PageRank
4.5.3 Link Quality
4.6 Information Extraction
4.6.1 Hidden Markov Models for Extraction
4.7 Internationalization
5 Ranking with Indexes
6 Queries and Interfaces
7 Retrieval Models
8 Evaluating Search Engines
9 Classification and Clustering
10 Social Search
11 Beyond Bag of Words
References
Index