You know what, search engines frown upon duplicate and similar content.
Or, rather, scratch that. What I wanted to say is that certain problems become utterly easy when you know which algorithms to apply. And since I was lucky to touch some graph search stuff recently, I'd like to share a few insights with you, using a simple yet practical example. For brevity, some details are omitted (they should be easy to grasp from the source code, though), but if anything is unclear – don't hesitate to post a comment or two!
So, back to search engines and duplicated content. The "threshold similarity" will obviously depend on the concrete search engine, but generally pages are penalized with lower ranks for being similar to other pages. And even if you don't deliberately copy-paste, there's still a chance to screw up, so one needs to stay informed and keep the similarity low.
Therefore, let's help our friendly neighborhood SEO optimizer and build him a SharpShooter report about duplicate content! Say, a list of page URLs grouped or highlighted when they're dangerously similar.
For this task, we need two algorithms: string similarity calculation and graph connectivity search. This might sound scary, but the whole idea is pretty simple:
- Grab all the pages, and for every pair of pages compute a number describing their similarity.
- Build a graph where the pages are nodes which get connected only when their similarity number exceeds a certain threshold (specified by us), and find all connected components in the graph.
- Present the result as a SharpShooter report – say, a list of page URLs grouped or highlighted when they’re similar.
(If you want to experiment with code while following this humble narrative arc – feel free to grab the source code here at bitbucket!)
String similarity.
Having a list of HTML pages, we need a method to determine how similar any two pages are. In general, there are two approaches: equivalence-based and ranking-based.
- Equivalence methods give you a "true/false" answer for two strings. One example is stemming – say, both "consistency" and "consistently" get reduced to "consist" and are treated as similar.
- Ranking methods, instead of a boolean, give you a decimal number which is a sort of "distance" between strings – the smaller the number, the greater the similarity. One example is edit distance – the number of letter insertions or deletions needed to transform one string into another.
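To make the "distance" idea tangible, here's a minimal sketch of the classic edit distance (the Levenshtein variant, which also counts substitutions). It's for illustration only – the report below uses a different, chunk-based ranking:

```csharp
using System;

static class EditDistanceDemo
{
    // Classic dynamic-programming edit distance: the minimal number of
    // letter insertions, deletions and substitutions needed to turn
    // "from" into "to". The smaller the result, the more similar the strings.
    public static int EditDistance(string from, string to)
    {
        var d = new int[from.Length + 1, to.Length + 1];
        for (var i = 0; i <= from.Length; i++) d[i, 0] = i; // delete everything
        for (var j = 0; j <= to.Length; j++) d[0, j] = j;   // insert everything
        for (var i = 1; i <= from.Length; i++)
        {
            for (var j = 1; j <= to.Length; j++)
            {
                var substitutionCost = from[i - 1] == to[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + substitutionCost);
            }
        }
        return d[from.Length, to.Length];
    }
}
```

For example, turning "consistency" into "consistently" takes two operations (substitute one letter, insert one), so the distance is 2.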
It seems that ranking is more promising, so we'll use a ranking algorithm which calculates the number of letter chunks that two strings have in common. (Note that despite the method's name, the formula below actually yields a similarity score between 0 and 1 – the higher, the more similar.) Here's a code snippet:
```csharp
private class SimilarityChecker
{
    public decimal Distance(string from, string to)
    {
        var fromChunks = GetLetterChunks(from);
        var toChunks = GetLetterChunks(to);
        var totalChunks = fromChunks.Count + toChunks.Count;
        int commonChunks = fromChunks.Sum(x => ComputeCommonChunks(x, toChunks));
        return (2m * commonChunks) / totalChunks;
    }
    ...
}
```
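The snippet elides GetLetterChunks and ComputeCommonChunks. Here's one plausible way to fill them in – a sketch under assumptions (a "letter chunk" is a contiguous run of letters, and each chunk counts as common at most once per occurrence); the actual implementation in the downloadable source may differ:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical fill-in for the helpers elided above; the Distance method
// is repeated from the snippet so the class is self-contained.
internal class SimilarityChecker
{
    public decimal Distance(string from, string to)
    {
        var fromChunks = GetLetterChunks(from);
        var toChunks = GetLetterChunks(to);
        var totalChunks = fromChunks.Count + toChunks.Count;
        int commonChunks = fromChunks.Sum(x => ComputeCommonChunks(x, toChunks));
        return (2m * commonChunks) / totalChunks;
    }

    // Assumption: a "letter chunk" is a contiguous run of letters, lower-cased.
    private static List<string> GetLetterChunks(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), "[a-z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    // Assumption: a chunk is "common" once per matching occurrence; removing
    // the match prevents one chunk from being counted twice.
    private static int ComputeCommonChunks(string chunk, List<string> otherChunks)
    {
        return otherChunks.Remove(chunk) ? 1 : 0;
    }
}
```

With helpers like these, Distance returns a value between 0 and 1: 1 means the two pages share all their letter chunks, 0 means they share none.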
Connected components in a graph.
Now that we know how similar any two pages are, we build a graph where the pages (nodes) get connected only when their similarity exceeds a certain threshold. After that, the only thing left is identifying all connected nodes – we'll use breadth-first search (BFS) for this purpose.
For any given node, the algorithm will visit all its adjacent nodes and give them the same "component id". Later on, we'll identify connected components (read: groups of similar pages) via this id. Here's a snippet:
```csharp
while (queue.Count != 0)
{
    var page = queue.Dequeue();
    if (page.IsExplored)
    {
        continue;
    }
    foreach (var relatedPage in page.RelatedPages.Where(x => !x.IsExplored))
    {
        queue.Enqueue(relatedPage);
    }
    page.IsExplored = true;
    page.ComponentId = componentId;
}
componentId++;
```
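The snippet above shows the BFS for a single starting node; the full pass wraps it in a loop over all pages and bumps componentId whenever a fresh BFS starts. Here's a self-contained sketch – the Page class below is a minimal stand-in for the node type used in the project:

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the page node used by the snippet above.
internal class Page
{
    public List<Page> RelatedPages { get; } = new List<Page>();
    public bool IsExplored { get; set; }
    public int ComponentId { get; set; }
}

internal static class ConnectedComponents
{
    // Gives every page reachable from the same starting point the same
    // ComponentId; each new unexplored page starts a new component.
    public static void Assign(IEnumerable<Page> pages)
    {
        var componentId = 0;
        foreach (var start in pages)
        {
            if (start.IsExplored) continue;
            var queue = new Queue<Page>();
            queue.Enqueue(start);
            while (queue.Count != 0)
            {
                var page = queue.Dequeue();
                if (page.IsExplored) continue;
                foreach (var relatedPage in page.RelatedPages.Where(x => !x.IsExplored))
                {
                    queue.Enqueue(relatedPage);
                }
                page.IsExplored = true;
                page.ComponentId = componentId;
            }
            componentId++;
        }
    }
}
```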
That’s pretty much it for the algorithms part, and now we’re ready to unleash SharpShooter powers!
SharpShooter report from a console app.
We don't need a full-fledged WinForms app for our example – a console app will fit just fine. We just need to make sure we add a reference to the PerpetuumSoft.Reporting library:
Next, we define the Main method. (Make sure it's marked with STAThread: SharpShooter is built on WinForms, which isn't thread-safe, so we need an STA thread.)
First off, we grab the pages' HTML – for illustration purposes I've just hardcoded some small texts – and define a "minimal similarity" threshold:
```csharp
var urlsToPages = new Dictionary<string, string>
{
    {"http://page1", "This is content of page 1"},
    {"http://page2", "This is content of page 2"},
    {"http://page3", "This is content, still pretty similar, of page 3"},
    {"http://page4", "Really different text, not a duplicate"}
};
var threshold = 0.7m;
var pages = new WebPageSimilarityChecker(threshold).Calculate(urlsToPages);
```
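The WebPageSimilarityChecker used above ships with the downloadable source; if you're curious how it could be assembled from the two algorithms, here's a rough, self-contained sketch (the word-level similarity and the member names here are assumptions, not necessarily the project's actual code):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of WebPageSimilarityChecker, inferred from how it's
// called above - the real implementation may differ.
internal class Page
{
    public string Url { get; set; }
    public string Content { get; set; }
    public List<Page> RelatedPages { get; } = new List<Page>();
    public bool IsExplored { get; set; }
    public int ComponentId { get; set; }
}

internal class WebPageSimilarityChecker
{
    private readonly decimal _threshold;

    public WebPageSimilarityChecker(decimal threshold)
    {
        _threshold = threshold;
    }

    public IList<Page> Calculate(Dictionary<string, string> urlsToPages)
    {
        var pages = urlsToPages
            .Select(kv => new Page { Url = kv.Key, Content = kv.Value })
            .ToList();

        // 1. Connect every pair of pages whose similarity reaches the threshold.
        for (var i = 0; i < pages.Count; i++)
        {
            for (var j = i + 1; j < pages.Count; j++)
            {
                if (Similarity(pages[i].Content, pages[j].Content) >= _threshold)
                {
                    pages[i].RelatedPages.Add(pages[j]);
                    pages[j].RelatedPages.Add(pages[i]);
                }
            }
        }

        // 2. Assign a component id to each group of connected pages (BFS).
        var componentId = 0;
        foreach (var start in pages.Where(p => !p.IsExplored))
        {
            var queue = new Queue<Page>();
            queue.Enqueue(start);
            while (queue.Count != 0)
            {
                var page = queue.Dequeue();
                if (page.IsExplored) continue;
                foreach (var related in page.RelatedPages.Where(x => !x.IsExplored))
                {
                    queue.Enqueue(related);
                }
                page.IsExplored = true;
                page.ComponentId = componentId;
            }
            componentId++;
        }
        return pages;
    }

    // Chunk-based similarity in the spirit of SimilarityChecker above:
    // 2 * common word chunks / total word chunks.
    private static decimal Similarity(string from, string to)
    {
        var fromChunks = from.ToLowerInvariant().Split(' ').ToList();
        var toChunks = to.ToLowerInvariant().Split(' ').ToList();
        var total = fromChunks.Count + toChunks.Count;
        var common = fromChunks.Sum(x => toChunks.Remove(x) ? 1 : 0);
        return (2m * common) / total;
    }
}
```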
This code gives us our data source in the pages variable; now we just need to pass it over to SharpShooter's report slot and report manager.
There are two problems with this, however. First, we don't have SharpShooter objects defined in our program (and we cannot drag-and-drop them, because it's a console app). Second, the data is dynamic – available only at runtime – so we won't get a data source schema if we design the report from Visual Studio.
The good news is that both stumbling blocks can be removed with five lines of code:
```csharp
public static void Main()
{
    ...
    var reportSlot = new FileReportSlot();
    var reportManager = new ReportManager();
    reportManager.Reports.Add(reportSlot);
    reportManager.DataSources.Add("Pages", pages);
    reportSlot.DesignTemplate();
}
```
The last line in the above snippet does the trick: when we run the app, it will eventually open the SharpShooter Designer, where both the data schema and the actual data are already available (because it's runtime).
Having done that, we create styles in the Report Designer to emphasize the different components:
…then drop a DataBand onto the page, drop a TextBox into it, and set up its bindings: we need to make sure the textbox displays the correct URL with the correct style:
And now, lo and behold, when we hit the "Preview Report" button in the top left corner, we see that the three pages are indeed pretty similar:
We can also play with the threshold – the report above used 0.7 (i.e., highlight pages that are at least 70% similar), and when we change the threshold to 0.8, we get a different picture:
Summary.
There are certain interesting grouping and sorting strategies which are not natively supported by any reporting tool, but which can make reports more informative and clear. These strategies are not limited to the string similarity calculation and connected components search we implemented today – you may find a decent use for similar algorithms in your own reports.
Another takeaway from this article is that at times it may be nearly impossible to design reports from Visual Studio (because the data schema is only available at runtime). While this is rather inconvenient, there's a workaround: use FileReportSlot, call DesignTemplate before the report gets rendered, and save the template designed at runtime under the name you provided at design time.
As usual, you can flick through the source code for the project we've created – give it a try, play with it, and let us know what you think!