I was trying to get into code search in Pagure, and what I landed on turned out to be really interesting and amazing. If you want a code search mechanism on your website, you need to look into something called indexing.
Think of the way search happens on e-commerce sites like Amazon, or the search happening on Google; with Google it is web scraping and then indexing the results. The point is the response time: when you search for something, you get results in a fraction of a second.
Now imagine going through such a huge database in a fraction of a second. However much computing power you have, what you really need is a clever way to manage the data. I was watching a CS50 video in which Mark Zuckerberg talked about how he managed his DB: the first architectural decision he took was to have a different MySQL instance for each school, to reduce the time taken to search and form relations.
That was a really clever move.
While I was searching for ways to add a code search feature to Pagure, I landed on a Python-based library called Whoosh. It blew me away with the way it does its searches and maintains its database. I also looked through a lot of tutorials to understand how indexing works.
I landed on Building Search Engines using Python, and I liked the way he explained things like N-grams, edge N-grams, and how different files store different index words along with their frequency and the path to the documents. I am yet to analyze git grep v/s
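To make N-grams and edge N-grams a bit more concrete, here is a minimal sketch in plain Python. The function names `ngrams` and `edge_ngrams` are my own for illustration and are not taken from Whoosh or from the tutorial; real libraries usually expose these as configurable tokenizers or filters.

```python
def ngrams(text, n):
    """Return every contiguous substring of length n (character N-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def edge_ngrams(text, min_n=1, max_n=None):
    """Return only the prefixes of text, with lengths from min_n to max_n.

    Edge N-grams are anchored at the start of the word, which is what
    makes them handy for prefix matching ("search as you type").
    """
    if max_n is None:
        max_n = len(text)
    upper = min(max_n, len(text))
    return [text[:n] for n in range(min_n, upper + 1)]


print(ngrams("search", 3))          # ['sea', 'ear', 'arc', 'rch']
print(edge_ngrams("search", 2, 4))  # ['se', 'sea', 'sear']
```

Indexing these pieces instead of whole words lets a query match part of a word, at the cost of a larger index.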
While I was going through Whoosh I saw that it has performance issues, and I started contemplating the fact that if search is not fast enough, there is no point in having it. I looked into HyperKitty and figured out that they were using Whoosh before; I assumed they too suffered from performance issues, or maybe it was because Django has Haystack. As the name suggests, you can use it to find the needle in the haystack.
Yeah, you are right! I started looking for a Haystack equivalent in Flask and found Flask-whoosh. Again, the drawback was that it is meant to search through databases and not files, whereas my application needs to search through files on the system.
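For comparison, the simplest way to search through files on the system is to walk the directory tree and scan every line on every query. The sketch below shows that baseline; `scan_files` is a hypothetical helper of mine, not part of Pagure, Whoosh or Flask-whoosh, and it assumes the files are UTF-8 text.

```python
from pathlib import Path


def scan_files(root, needle):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), line_number, line.strip()))
    return hits
```

Every query re-reads every file, so the cost grows with the size of the tree. That repeated work is exactly what an index is meant to avoid.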
Then came Xapian. There are a lot of core concepts involved in using or writing utilities with Xapian. I went through the Xapian documentation; it covers a lot of concepts and gives examples for them, but the bottleneck still persists when it comes to file searching and performance. I found a nice application, Building Document Search, which might give me some hope, but a lot of work is still required there.
The whole concept is that, at a really high level, you need to do two things:
Indexing: go through each file or record and build something called an index. The search words are extracted, stop words are filtered out, and a new database is built holding the frequency and location of each word. This is the most time-consuming process.
Searching: form a query, search through the database built above, and return the documents in which the word or phrase is found.
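The two steps can be sketched with a tiny in-memory inverted index. This is only an illustration under my own assumptions: the stop-word list is a toy one, the names `build_index` and `search` are hypothetical, and real engines such as Whoosh or Xapian store the index on disk and rank results in far more sophisticated ways.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(documents):
    """Indexing step: map each word to {path: [positions of the word]}.

    The length of a position list is the word's frequency in that document.
    """
    index = defaultdict(dict)
    for path, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(path, []).append(position)
    return index


def search(index, query):
    """Searching step: return paths containing every query word,
    most frequent matches first."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    matches = set(postings[0]).intersection(*postings[1:])

    def score(path):
        return sum(len(p[path]) for p in postings)

    return sorted(matches, key=lambda path: (-score(path), path))


docs = {
    "a.py": "def search(index): return index",
    "b.py": "print the index",
    "c.txt": "nothing here",
}
index = build_index(docs)
print(search(index, "index"))         # ['a.py', 'b.py']
print(search(index, "search index"))  # ['a.py']
```

Building the index touches every document once, which is the slow part; after that, a query is mostly a few dictionary lookups.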
If you need to see a demo.
Till then, Happy Coding and Bingo!