Searching local Word documents by keyword with Lucene.Net

2020-12-13 01:50


Tags: winform   Lucene   style   blog   class   code

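I'm currently building a small WinForms tool, and one of its features is searching locally stored Word documents by keyword. My first attempt read the documents through COM (see the previous post): walk the folder for Word documents, then for each document loop over the keywords looking for matches. Predictably, this was very slow, and the search results showed up one at a time.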

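After a quick look around: since I work in C#, the framework to use is Lucene.Net, and an analyzer (tokenizer) is needed as well. Lucene.Net can be found in the NuGet package manager and installed straight into the project, and analyzers are available on NuGet too. The rough approach is: walk the folder, read each document's content, tokenize it, build the index, and so on.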

Code:

 

1. Declare the index location, name, and analyzer

    // Folder holding the documents, and folder where the index is stored
    string filesDirectory = System.IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Files");
    static string indexDirectory = System.IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Index");
    // Analyzer analyzer = new Lucene.Net.Analysis.Cn.ChineseAnalyzer();
    static Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30); // use the StandardAnalyzer

Difference between StandardAnalyzer and ChineseAnalyzer:

Tokenizing the phrase “123木头人”, StandardAnalyzer gives: 1 2 3 木 头 人; ChineseAnalyzer gives: 木 头 人, i.e. it filters out the digits.
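
To see what a given analyzer actually emits for a phrase, its token stream can be dumped directly. The following is a minimal sketch, assuming the Lucene.Net 3.0.3 NuGet package (where terms are exposed through `ITermAttribute`); the class and method names `AnalyzerProbe` and `Tokenize` are made up for illustration and are not part of the original project.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical helper, not from the original post.
public static class AnalyzerProbe
{
    // Runs the text through the analyzer and returns the terms it emits, in order.
    public static List<string> Tokenize(Analyzer analyzer, string text)
    {
        var terms = new List<string>();
        // The field name only matters for per-field analyzers; "Content" is used
        // here because that is the field the post tokenizes.
        TokenStream stream = analyzer.TokenStream("Content", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();
        try
        {
            stream.Reset();
            while (stream.IncrementToken())
            {
                terms.Add(term.Term);
            }
            stream.End();
        }
        finally
        {
            stream.Dispose();
        }
        return terms;
    }
}
```

Calling `AnalyzerProbe.Tokenize(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), "123木头人")` and printing the list makes it easy to compare analyzers on your own documents before committing to one; the exact tokens may differ between analyzer versions.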

 

2. Walk the folders and files, and create the index

    /// <summary>
    /// Click handler for the "create index" button
    /// </summary>
    private void btnSetIndex_Click(object sender, EventArgs e)
    {
        // Do nothing if the user cancels the folder dialog
        if (this.folderBrowserDialog1.ShowDialog() != DialogResult.OK)
            return;
        string rootPath = this.folderBrowserDialog1.SelectedPath;
        if (rootPath != "")
        {
            GetAllFiles(rootPath);
            MessageBox.Show("Index created!");
        }
    }

    /// <summary>
    /// Collects the documents under the given root directory and its subdirectories, then indexes them
    /// </summary>
    /// <param name="rootPath">Root directory of the documents to search</param>
    private void GetAllFiles(string rootPath)
    {
        List<FileInfo> files = new List<FileInfo>(); // holds the Word documents found during traversal
        if (!System.IO.Directory.Exists(rootPath))
        {
            MessageBox.Show("The specified directory does not exist");
            return;
        }
        GetAllFiles(rootPath, files);
        CreateIndex(files); // build the index
    }

    /// <summary>
    /// Recursively collects the documents under the given directory and its subdirectories
    /// </summary>
    /// <param name="rootPath">Root directory path</param>
    /// <param name="files">List that receives the Word documents</param>
    private void GetAllFiles(string rootPath, List<FileInfo> files)
    {
        DirectoryInfo dir = new DirectoryInfo(rootPath);
        string[] dirs = System.IO.Directory.GetDirectories(rootPath); // all immediate subdirectories
        foreach (string di in dirs)
        {
            GetAllFiles(di, files); // recursive call
        }
        FileInfo[] file = dir.GetFiles("*.doc"); // find Word files
        // add each Word document to the list
        foreach (FileInfo fi in file)
        {
            files.Add(fi);
        }
    }

    /// <summary>
    /// Creates the index
    /// </summary>
    /// <param name="files">The collected documents</param>
    private void CreateIndex(List<FileInfo> files)
    {
        // Create a new index if none exists yet; otherwise update the existing one incrementally
        bool isCreate = !System.IO.Directory.Exists(indexDirectory);
        // FSDirectory keeps the index on disk; RAMDirectory would keep it in memory
        IndexWriter writer = new IndexWriter(FSDirectory.Open(indexDirectory), analyzer, isCreate, IndexWriter.MaxFieldLength.UNLIMITED);
        // Start Word once and reuse it for every file; launching it per document is much slower
        Microsoft.Office.Interop.Word.ApplicationClass wordapp = new Microsoft.Office.Interop.Word.ApplicationClass();
        object nullobj = System.Reflection.Missing.Value;
        object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
        try
        {
            for (int i = 0; i < files.Count; i++)
            {
                // Read the content of the Word document
                string filename = files[i].Name;
                object file = files[i].FullName;
                object isreadonly = true;
                Microsoft.Office.Interop.Word._Document doct = wordapp.Documents.Open(ref file, ref nullobj, ref isreadonly, ref nullobj, ref nullobj, ref nullobj,
                    ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj);
                //doct.ActiveWindow.Selection.WholeStory();
                //doct.ActiveWindow.Selection.Copy();
                //IDataObject data = Clipboard.GetDataObject();
                ////assign the text that was read to the content variable
                //string content = data.GetData(DataFormats.Text).ToString();
                string content = doct.Content.Text;
                string createTime = files[i].CreationTime.ToString();
                // Key identifying this file in the index. The full path is used rather than the
                // directory alone, so that files created in the same second do not collide.
                string filemark = files[i].FullName + createTime;
                // Close the document without saving
                doct.Close(ref saveChanges, ref nullobj, ref nullobj);
                // StreamReader reader = new StreamReader(fileInfo.FullName); works for .txt files, but produces garbled text for Word files

                Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
                // If the index already holds an entry with the same filemark, delete it before adding, to avoid duplicates
                writer.DeleteDocuments(new Term("filemark", filemark));
                doc.Add(new Lucene.Net.Documents.Field("filemark", filemark, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED)); // indexed without tokenizing
                doc.Add(new Lucene.Net.Documents.Field("FileName", filename, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED)); // ANALYZED: tokenized, then indexed
                doc.Add(new Lucene.Net.Documents.Field("Content", content, Lucene.Net.Documents.Field.Store.NO, Lucene.Net.Documents.Field.Index.ANALYZED));
                doc.Add(new Lucene.Net.Documents.Field("Path", file.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
            writer.Optimize(); // optimize the index once, after all documents have been added
        }
        finally
        {
            // Quit Word and release the index even if a document failed to open
            wordapp.Quit(ref saveChanges, ref nullobj, ref nullobj);
            writer.Dispose();
        }
    }
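
The recursive folder walk can also be written with the framework's built-in recursive enumeration. The sketch below is an alternative under stated assumptions, not the author's code: it needs .NET 4.0 or later for `Directory.EnumerateFiles`, the names `WordFileFinder` and `FindWordFiles` are hypothetical, and it makes two choices of its own. It accepts both `.doc` and `.docx` explicitly (on the .NET Framework a `*.doc` pattern also matches `.docx` because of how three-character extensions are handled, but that is easy to overlook), and it skips the `~$` owner files that Word keeps next to open documents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not from the original post.
public static class WordFileFinder
{
    // Returns every .doc/.docx file under rootPath, including subdirectories.
    public static List<FileInfo> FindWordFiles(string rootPath)
    {
        var result = new List<FileInfo>();
        foreach (string path in Directory.EnumerateFiles(rootPath, "*", SearchOption.AllDirectories))
        {
            string ext = Path.GetExtension(path);
            bool isWord = string.Equals(ext, ".doc", StringComparison.OrdinalIgnoreCase)
                       || string.Equals(ext, ".docx", StringComparison.OrdinalIgnoreCase);
            // "~$name.doc" files are Word's temporary owner files, not real documents.
            bool isOwnerFile = Path.GetFileName(path).StartsWith("~$", StringComparison.Ordinal);
            if (isWord && !isOwnerFile)
            {
                result.Add(new FileInfo(path));
            }
        }
        return result;
    }
}
```

One trade-off: with `SearchOption.AllDirectories`, a single folder the user cannot read raises `UnauthorizedAccessException` for the whole enumeration, whereas a hand-written recursion can catch the exception per folder and carry on.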

 

3. Search by keyword

    /// <summary>
    /// Searches for the given keywords
    /// </summary>
    /// <param name="strKey">List of keywords</param>
    private void SearchKey(List<string> strKey)
    {
        int num = 10; // maximum number of hits to collect
        if (strKey.Count != 0)
        {
            IndexReader reader = null;
            IndexSearcher searcher = null;
            try
            {
                if (!System.IO.Directory.Exists(indexDirectory))
                {
                    MessageBox.Show("An index must be created before the first search!" + "\r\n" + "Click the [Create Index] button on the right and choose the folder to search.");
                    return;
                }
                // Opened with readOnly = false so that stale entries can be deleted below
                reader = IndexReader.Open(FSDirectory.Open(new DirectoryInfo(indexDirectory)), false);
                searcher = new IndexSearcher(reader);

                // Build the query
                PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(analyzer);
                wrapper.AddAnalyzer("FileName", analyzer);
                wrapper.AddAnalyzer("Content", analyzer);
                wrapper.AddAnalyzer("Path", analyzer);
                string[] fields = { "FileName", "Content", "Path" };
                QueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, fields, wrapper);
                BooleanQuery Bquery = new BooleanQuery();
                for (int i = 0; i < strKey.Count; i++)
                {
                    Query query = parser.Parse(strKey[i]);
                    Bquery.Add(query, Occur.MUST); // every keyword must match
                }

                TopScoreDocCollector collector = TopScoreDocCollector.Create(num, true);
                searcher.Search(Bquery, collector);
                var hits = collector.TopDocs().ScoreDocs;
                // From here on the collected hits can be processed
                int resultNum = 0; // number of search results
                for (int i = 0; i < hits.Length; i++)
                {
                    var hit = hits[i];
                    Lucene.Net.Documents.Document doc = searcher.Doc(hit.Doc);
                    Lucene.Net.Documents.Field fileNameField = doc.GetField("FileName");
                    Lucene.Net.Documents.Field pathField = doc.GetField("Path");
                    // "Content" is indexed with Store.NO, so it cannot be read back from the hit
                    // Show the hit only if the file still exists locally; otherwise drop it from the index
                    if (!System.IO.File.Exists(pathField.StringValue))
                    {
                        int docId = hit.Doc; // Lucene's document number, assigned in order as documents enter the index
                        reader.DeleteDocument(docId); // delete the stale entry
                        reader.Commit();
                        continue;
                    }
                    dtBGMC.Rows.Add(fileNameField.StringValue, pathField.StringValue);
                    resultNum++;
                }
                MessageBox.Show("Search complete! Found " + resultNum + " matching result(s).", "success!");
            }
            finally
            {
                if (searcher != null)
                    searcher.Dispose();

                if (reader != null)
                    reader.Dispose();
            }
        }
        else
        {
            MessageBox.Show("Please enter a keyword to search for!");
            return;
        }
    }
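
The index-then-search flow can be tried out without Word or the file system by using an in-memory `RAMDirectory`. This is a minimal sketch, assuming the Lucene.Net 3.0.3 NuGet package; `MiniIndex`, `AddOrReplace`, and `Search` are hypothetical names. It differs from the code above in two labeled ways: it uses `IndexWriter.UpdateDocument`, which deletes any document matching the term and adds the new one in a single call, and it passes each keyword through `QueryParser.Escape` so that characters such as `:` or `(` in user input are treated literally instead of as query syntax.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not from the original post.
public static class MiniIndex
{
    static readonly Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

    // Adds a document, replacing any earlier document stored under the same key.
    public static void AddOrReplace(IndexWriter writer, string key, string fileName, string content)
    {
        var doc = new Document();
        doc.Add(new Field("filemark", key, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("FileName", fileName, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("Content", content, Field.Store.NO, Field.Index.ANALYZED));
        writer.UpdateDocument(new Term("filemark", key), doc);
    }

    // Returns the stored file names of documents that match ALL keywords.
    public static List<string> Search(Directory dir, IEnumerable<string> keywords, int maxHits)
    {
        var names = new List<string>();
        string[] fields = { "FileName", "Content" };
        var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, fields, analyzer);
        var query = new BooleanQuery();
        foreach (string keyword in keywords)
        {
            // Escape query syntax characters so the keyword is searched literally.
            query.Add(parser.Parse(QueryParser.Escape(keyword)), Occur.MUST);
        }
        using (var searcher = new IndexSearcher(dir, true)) // read-only searcher
        {
            foreach (ScoreDoc hit in searcher.Search(query, maxHits).ScoreDocs)
            {
                names.Add(searcher.Doc(hit.Doc).Get("FileName"));
            }
        }
        return names;
    }
}
```

To use it, create a `RAMDirectory`, open an `IndexWriter` on it with the same analyzer, call `AddOrReplace` for each document, dispose the writer, and then call `Search` with the keyword list. Swapping the `RAMDirectory` for `FSDirectory.Open(...)` gives the on-disk behaviour used in the post.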

 

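Author: goodgirlmia
Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a clearly visible link to the original must be given on the article page; otherwise the right to pursue legal liability is reserved.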


Original post: http://www.cnblogs.com/goodgirlmia/p/3712116.html

