Clean Content to help searching

One of the issues we all face is that, when a user searches for content, we allow them to search for anything. This can result in unnecessary searching, or cause the backend to run queries that open the door to DoS attacks via SQL.

I have covered DoS attacks via SQL before:

https://bryanavery.co.uk/post/2013/10/01/Denial-Of-Service-attacks-via-SQL-Wildcards-should-be-prevented/

SQL Wildcard attacks force the underlying database to carry out CPU-intensive queries by using several wildcards. This vulnerability generally exists in search functionalities of web applications. Successful exploitation of this attack will cause Denial of Service.

Depending on the connection pooling settings of the application and the time taken for the attack query to execute, an attacker might be able to consume all connections in the connection pool, causing database queries to fail for legitimate users.

By default in ASP.NET, the maximum number of connections in the pool is 100 and the timeout is 30 seconds. Thus, if an attacker can start 100 queries within 30 seconds, each taking 30+ seconds to execute, no one else will be able to use the database-related parts of the application.
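
To put a number on that, here is a minimal sketch of the arithmetic. It assumes each attack query holds exactly one pooled connection for its full duration; the class and method names are made up for illustration.

```csharp
using System;

public static class PoolSaturation
{
    // Hypothetical helper: the sustained request rate needed to keep every
    // pooled connection busy, assuming one connection per running query.
    public static double RequestsPerSecond(int maxPoolSize, double querySeconds)
    {
        if (querySeconds <= 0)
        {
            throw new ArgumentOutOfRangeException("querySeconds");
        }

        // Each connection frees up once per querySeconds, so the attacker
        // only has to replace maxPoolSize queries in that window.
        return maxPoolSize / querySeconds;
    }
}
```

With the defaults quoted above, `PoolSaturation.RequestsPerSecond(100, 30)` comes out at roughly 3.3 requests per second, which is a very low bar for an attacker.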

Recommendation:

If the application does not require this sort of advanced search, all wildcards should be escaped or filtered.
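
As an illustration of the "escaped" option, here is a minimal sketch that assumes SQL Server's LIKE syntax, where a wildcard wrapped in square brackets is matched literally. `LikePattern`, `Escape` and `Contains` are hypothetical names, and the resulting pattern is assumed to be passed to the query as a parameter rather than concatenated into the SQL text.

```csharp
using System;
using System.Text;

public static class LikePattern
{
    // Hypothetical helper: wraps each SQL Server LIKE wildcard (% _ [)
    // in brackets so the database treats it as a literal character.
    public static string Escape(string term)
    {
        if (string.IsNullOrEmpty(term))
        {
            return string.Empty;
        }

        var sb = new StringBuilder(term.Length);
        foreach (var c in term)
        {
            if (c == '%' || c == '_' || c == '[')
            {
                sb.Append('[').Append(c).Append(']');
            }
            else
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }

    // Builds a "contains" pattern in which only the outer % signs act as wildcards.
    public static string Contains(string term)
    {
        return "%" + Escape(term) + "%";
    }
}
```

For example, `LikePattern.Contains("50%")` gives `%50[%]%`, so the user's own `%` no longer widens the search.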

References:

OWASP Testing for SQL Wildcard Attacks
https://www.owasp.org/index.php/Testing_for_SQL_Wildcard_Attacks_(OWASP-DS-001)

DoS Attacks using SQL Wildcards
http://www.zdnet.com/blog/security/dos-attacks-using-sql-wildcards-revealed/1134

So I went to work and produced a sequence diagram of what we need to do to clean the content.

I’ve written this class for cleaning content, which helps remove unwanted characters:

using System;
using System.Collections.Specialized;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public class Search
{
    /// <summary>
    /// Matches any HTML tag so that it can be stripped from the content.
    /// </summary>
    private readonly Regex RegexStripHtml = new Regex("<[^>]*>", RegexOptions.Compiled);

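    /// <summary>
    /// The stop words to drop from the content. Empty until populated.
    /// Held in a field so the collection is not rebuilt on every lookup.
    /// </summary>
    private readonly StringCollection stopWords = new StringCollection();

    private StringCollection StopWords
    {
        get
        {
            return this.stopWords;
        }
    }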

    /// <summary>
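    /// Removes HTML (optionally), wildcard and regex characters, single-character
    /// words and stop words from the specified string, and lower-cases the rest.
    /// </summary>
    /// <param name="content">
    /// The raw search content.
    /// </param>
    /// <param name="removeHtml">
    /// Whether to strip HTML tags before cleaning.
    /// </param>
    /// <returns>
    /// The cleaned content, with words separated by single spaces.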
    /// </returns>
    public string CleanContent(string content, bool removeHtml)
    {
        if (removeHtml)
        {
            content = this.StripHtml(content);
        }

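        // Nothing to clean; this also avoids a NullReferenceException when
        // removeHtml is false and content is null.
        if (string.IsNullOrEmpty(content))
        {
            return string.Empty;
        }

        // Strip regex metacharacters and the SQL LIKE wildcards (% _ [ ] ^).
        content = content
            .Replace("\\", string.Empty)
            .Replace("|", string.Empty)
            .Replace("(", string.Empty)
            .Replace(")", string.Empty)
            .Replace("[", string.Empty)
            .Replace("]", string.Empty)
            .Replace("*", string.Empty)
            .Replace("?", string.Empty)
            .Replace("}", string.Empty)
            .Replace("{", string.Empty)
            .Replace("^", string.Empty)
            .Replace("+", string.Empty)
            .Replace("%", string.Empty)
            .Replace("_", string.Empty);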

        var words = content.Split(new[] { ' ', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
        var sb = new StringBuilder();
        foreach (var word in words
            .Select(t => t.ToLowerInvariant().Trim())
            .Where(w => this.RemoveSingleCharacters(w) && this.RemoveStopWords(w)))
        {
            sb.AppendFormat("{0} ", word);
        }

        return sb.ToString().Trim();
    }

    /// <summary>
    /// Strips all HTML tags from the specified string.
    /// </summary>
    /// <param name="html">
    /// The string containing HTML
    /// </param>
    /// <returns>
    /// A string without HTML tags
    /// </returns>
    public string StripHtml(string html)
    {
        return this.StringIsNullOrWhitespace(html) ? string.Empty : this.RegexStripHtml.Replace(html, string.Empty).Trim();
    }

    /// <summary>
    /// Returns whether a string is null, empty, or whitespace. Same behaviour as String.IsNullOrWhiteSpace in .NET 4.0.
    /// </summary>
    /// <param name="value">The string to test.</param>
    /// <returns>True if the string is null, empty, or consists only of whitespace.</returns>
    public bool StringIsNullOrWhitespace(string value)
    {
        return value == null || value.Trim().Length == 0;
    }

    /// <summary>
    /// Filter used to drop single-character words.
    /// </summary>
    /// <param name="word">The word.</param>
    /// <returns>True if the word is longer than one character and should be kept.</returns>
    private bool RemoveSingleCharacters(string word)
    {
        return word.Length > 1;
    }

    /// <summary>
    /// Filter used to drop stop words.
    /// </summary>
    /// <param name="word">The word.</param>
    /// <returns>True if the word is not a stop word and should be kept.</returns>
    private bool RemoveStopWords(string word)
    {
        return !this.StopWords.Contains(word);
    }
}
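
Note that the `StopWords` collection in the class above starts out empty, so the stop-word filter does nothing until the list is populated. As a rough sketch of that filtering step in isolation, here is a standalone version with a small, purely illustrative English list. `StopWordFilter` and its word list are my own assumptions and not part of the class above; a real list would more likely come from configuration or a resource file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Illustrative list only (an assumption, not from the original class).
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "of", "the", "to"
        };

    // Lower-cases the input, then drops single-character words and stop words.
    public static string Filter(string content)
    {
        if (string.IsNullOrWhiteSpace(content))
        {
            return string.Empty;
        }

        var kept = content
            .Split(new[] { ' ', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.ToLowerInvariant())
            .Where(w => w.Length > 1 && !StopWords.Contains(w));

        return string.Join(" ", kept);
    }
}
```

For example, `StopWordFilter.Filter("The history of the Roman Empire")` returns `history roman empire`. A `HashSet<string>` is used here rather than a `StringCollection` so that each lookup does not scan the whole list.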