Approximate Search in Very Large Files Using the Pigeonhole Principle

(*) Corresponding author

Authors' affiliations

DOI's assignment:
the author of the article can submit here a request for assignment of a DOI number to this resource!
Cost of the service: euros 10,00 (for a DOI)


This paper presents a new technique for efficient searching with fuzzy criteria in very large information systems. The suggested technique uses the Pigeonhole Principle approach. This approach can be utilized with different embodiments, but the most effective realization would be to amplify some already given intrinsic approximate matching capabilities, like those in the FuzzyFind method [1][2]. Considering the following problem, a data to be searched is presented as a bit-attribute vector. The searching operation consists of finding a subset of this bit-attribute vector that is within particular Hamming distance. Normally, this search with approximate matching criteria requires sequential lookup for the whole collection of the attribute vector. This process can be easily parallelized, but in very large information systems this still would be slow and energy consuming. The suggested method, in this paper, of approximate search in very large files using the Pigeonhole Principle, circumvents the sequential search operations and reduces the calculations tremendously
Copyright © 2013 Praise Worthy Prize - All rights reserved.


Algorithms and Data Structures; Big Data; Information Retrieval; Approximate Matching; Pigeonhole Principle

Full Text:



E. Berkovich, Method of and system for searching a data dictionary with fault tolerant indexing, Jan. 23 2007. US Patent 7,168,025.

S. Y. Berkovich, E. Berkovich, B. Beroukhim, and G. M. Lapir. Organization of automatic spelling correction: Towards the design of intelligent information retrieval systems. The 21st National Conference of the ASEM, (Washington DC, pages 525– 527, 2000).

M. Crochemore and G. Tischler. The gapped suffix array: a new index structure for fast approximate matching. In String Processing and Information Retrieval, pages 359–364. Springer, 2010.

G. Navarro. Approximate text searching. PhD thesis, PhD thesis, Dept. of Computer Science, Univ. of Chile, 1998.

Z. Liu, X. Chen, J. Borneman, and T. Jiang. A fast algorithm for approximate string matching on gene sequences. In Combinatorial Pattern Matching, pages 79–90. Springer, 2005.

S. Berkovich, E. El-Qawasmeh, G. Lapir, M. Mack, and C. Zincke. Organization of near matching in bit attribute matrix applied to associative access methods in information retrieval. In APPLIED INFORMATICS-PROCEEDINGS-, pages 62–64, 1998.

B. Chazelle. Discrepancy theory and computational geometry. In Algorithms and Data Structures, pages 1–2. Springer, 1997.

R. Baeza-Yates and G. H. Gonnet. “A new approach to text searching”. Communications of the ACM, 35(10):74–82, 1992.

P. Clifford and R. Clifford. Simple deterministic wildcard matching. Information Processing Letters, 101(2):53–54, 2007.

R. Cole and R. Hariharan. Tree pattern matching and subset matching in randomized o (nlog 3 m) time. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 66–75. ACM, 1997.

M. Hadjieleftheriou and C. Li. Efficient approximate search on string collections. Proceedings of the VLDB Endowment, 2(2): 1660–1661, 2009.

S. Anuradha, G. Raghu Ram, Evolution Of New Searching Technique In Unicast Routing Algorithm Of Data Networks, (2009) International Review on Computers and Software (IRECOS), 4 (3), pp. 321-330.

A. Linari. Models and techniques for approximate similarity search in large databases 2007.

S. Brin. Near neighbor search in large metric spaces 1995.

T. Havens, J. Bezdek, C. Leckie, L. Hall, and M. Palaniswami. Fuzzy c-means algorithms for very large data. Fuzzy Systems, IEEE Transactions on, 20(6):1130–1146, 2012.

Z. Liu, X. Chen, J. Borneman, and T. Jiang. A fast algorithm for approximate string matching on gene sequences. In Combinatorial Pattern Matching, pages 79–90. Springer, 2005.

S. Y. Berkovich and A. Hegazy. Matching string patterns in large textual files. In IEEE International Symposium on New Directions in Computing, pages 122–127, 1985.

S. Wu and U. Manber. Agrep—a fast approximate patternmatching tool.

R. Farivar, S. Venkataraman, Y. Li, E. Chan, A. Verma, and R. H. Campbell. Accurate sequence alignment using distributed filtering on gpu clusters.

S. Y. Berkovich and E. El-Qawasmeh. Reversing the error correction scheme for a fault-tolerant indexing. The Computer Journal, 43(1): 54–64, 2000.

M.W. Berry. Survey of Text Mining I: Clustering, Classification, and Retrieval, volume 1. (Springer, 2004).

J. Byun. Organization of information retrieval procedures for approximate attribute matching. PhD thesis, Dept. of Computer Science, George Washington University, 2010.

S. B. Maurer and A. Ralston. Discrete algorithmic mathematics. (Addison-Wesley Reading MA, 1991).

E. K. Donald. The art of computer programming. Sorting and searching, 3:426–458, 1999.

S. K. Park and K. W. Miller. Random number generators: good ones are hard to find. Communications of the ACM, 31(10): 1192–1201, 1988.

P. Kvasnica, I. Kvasnica, Parallel Modelling of Fault-Tolerant Software Systems, (2012) International Review on Computers and Software (IRECOS), 7 (2), pp. 621-625.


  • There are currently no refbacks.

Please send any question about this web site to
Copyright © 2005-2024 Praise Worthy Prize