Friday, December 12, 2008

Building a profanity filter with ASP.NET

One of the disadvantages of a web site that relies heavily on user generated content is unwanted content and profanity. Since there will always be more users than moderators, you will have to rely on community policing to bring down unwanted content. But rather than rely only on community policing (which might work well for a web 2.0 site), you might also want to build a basic profanity filter for your web site or blog (where there can't be any moderator other than the blog administrators). On the current .net 2.0 site that we worked on, we had to build a basic profanity filter using a custom dictionary; since the only places where you would want the profanity filter to work are input text elements and textareas (TextBox), I could think of three approaches:

  1. Sanitize the user input every time and on every page, i.e. remove the unwanted text on every page and for every textbox by calling a method in a class library. This approach is the least scalable of all, as it requires every developer to diligently call the required method and unnecessarily adds to the code bloat.
  2. Extend the textbox control and create your own control which internally calls the sanitize method in the getter of the overridden Text property; developers are only required to use the extended textbox instead of the base textbox on their pages. In this case you can also add additional properties like SanitizeText (bool), which can be set to false in cases where you don't want the text to be sanitized. Also, you might want to check that the textbox is not of Password type before running it through your profanity filter!
  3. Use tag mapping to substitute the base textbox with the extended textbox; tag mapping works great when you are already in the middle of your development cycle and have to implement such logic after the fact.
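A minimal sketch of approaches 2 and 3 (the class and namespace names are made up, and ProfanityFilter is a stand-in for the in-house helper; this only compiles inside a web project referencing System.Web):

```csharp
using System.Web.UI.WebControls;

namespace MyApp.Web.Controls   // hypothetical namespace
{
    // Stand-in for the in-house filter.
    public static class ProfanityFilter
    {
        public static string Sanitize(string input)
        {
            return input; // real version: Regex.Replace against the custom dictionary
        }
    }

    // Approach 2: a drop-in replacement for the standard TextBox.
    public class SafeTextBox : TextBox
    {
        private bool sanitizeText = true;

        // Set to false on pages where the raw text is needed.
        public bool SanitizeText
        {
            get { return sanitizeText; }
            set { sanitizeText = value; }
        }

        public override string Text
        {
            get
            {
                // Don't run password fields through the filter.
                if (sanitizeText && TextMode != TextBoxMode.Password)
                    return ProfanityFilter.Sanitize(base.Text);
                return base.Text;
            }
            set { base.Text = value; }
        }
    }
}
```

Approach 3 then becomes a one-time web.config entry, so every existing TextBox on the site silently becomes a SafeTextBox (assembly name is hypothetical):

```xml
<system.web>
  <pages>
    <tagMapping>
      <add tagType="System.Web.UI.WebControls.TextBox"
           mappedTagType="MyApp.Web.Controls.SafeTextBox, MyApp.Web" />
    </tagMapping>
  </pages>
</system.web>
```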

Given that we had to implement the profanity filter after the fact, we went with approach 3 and so far it has worked great for us. By the way, the profanity filter was built in-house with a custom dictionary (since we needed it for multiple languages) by running the user-entered text through the custom dictionary and doing a simple Regex.Replace.
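The filter itself can be as simple as one Regex.Replace over the word list; a minimal sketch (the word list and masking scheme here are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class ProfanityFilterDemo
{
    // Builds one alternation pattern from the dictionary; \b keeps
    // "class" from matching inside "classic", IgnoreCase handles casing.
    // Real dictionary entries should be run through Regex.Escape first.
    static string Sanitize(string input, string[] dictionary)
    {
        string pattern = @"\b(" + string.Join("|", dictionary) + @")\b";
        return Regex.Replace(input, pattern,
            delegate(Match m) { return new string('*', m.Length); },
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string[] badWords = { "darn", "heck" };  // stand-in dictionary
        Console.WriteLine(Sanitize("Darn, that hurt!", badWords));  // → "****, that hurt!"
    }
}
```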

Saturday, November 29, 2008

learn, learn, learn!

The beauty of spirituality and philosophy is that anything and everything can be attributed to life's lessons that one has to learn in order to grow and evolve. The more painful the physical manifestation of the event, the greater the lesson you've learned. This model works great for one or two events where you can try to see the brighter side of things, but soon you get tired of convincing yourself that everything is for the good! The fact is there are some events in life which, no matter how hard we try to turn them around, always come around to haunt us and remind us that life at times can be hurtful; and there is no point in fighting your feelings or tricking your mind into not feeling the pain: it's alright to fall apart, sometimes.

Friday, October 24, 2008

building a high performance site

Recently, I was involved with building a web 2.0 site where the most important criterion was performance! The pages had to load super fast (average response time < 2 sec). Here's what worked for us:
Caching: The biggest bottleneck in any website is database access, so you need to cache data aggressively; also, unless you are building a site which requires real-time data, you can live with showing stale data to the user (we make sure that the update pages always pull real-time data so that we are not displaying stale data to the user there). Caching alone (esp. lookup data) can increase the performance of your site multi-fold, and at times this alone can be enough to get to decent response times. The only caveat with caching is in a distributed environment: unfortunately, in ASP.Net 2.0 there is no built-in way of having distributed cache objects, i.e. every node in the cluster has its own copy of cached data, which can be out of sync with other nodes. You can avoid this either by using a third-party cache provider like memcached (I haven't personally used it, so I'm not sure how it gels with ASP.Net), by upgrading to .net 3.5 and using Velocity, or by making your load balancing sticky, i.e. a user always hits the same node on the farm for his session.
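In ASP.Net you'd normally just use Cache.Insert with an expiration, but the idea behind caching lookup data can be sketched in plain C# (a toy absolute-expiry cache; names are made up):

```csharp
using System;
using System.Collections.Generic;

class LookupCache
{
    public delegate object Loader();

    private struct Entry { public object Value; public DateTime Expires; }
    private Dictionary<string, Entry> store = new Dictionary<string, Entry>();

    // Returns the cached value if it hasn't expired; otherwise calls
    // the loader (i.e. hits the database) and caches the result.
    public object Get(string key, TimeSpan ttl, Loader load)
    {
        Entry e;
        if (store.TryGetValue(key, out e) && e.Expires > DateTime.UtcNow)
            return e.Value;                      // cache hit: no db trip
        e.Value = load();                        // cache miss: one db trip
        e.Expires = DateTime.UtcNow + ttl;
        store[key] = e;
        return e.Value;
    }
}

class CacheDemo
{
    static int dbTrips = 0;

    static object LoadCountries()
    {
        dbTrips++;                               // stand-in for a SELECT
        return new string[] { "India", "USA", "China" };
    }

    static void Main()
    {
        LookupCache cache = new LookupCache();
        cache.Get("countries", TimeSpan.FromMinutes(5), LoadCountries);
        cache.Get("countries", TimeSpan.FromMinutes(5), LoadCountries);
        Console.WriteLine(dbTrips);              // 1: second call was a cache hit
    }
}
```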
Web.config debug attribute: Make sure you set debug="false" in the production web.config; you don't want debug builds and symbols loaded in the production environment.
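For reference, the relevant bit of web.config:

```xml
<system.web>
  <!-- debug="true" disables batch compilation and request timeouts,
       and bloats every page; never ship it to production -->
  <compilation debug="false" />
</system.web>
```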
Ajax: Use Ajax where appropriate to reduce the perceived response time; but do keep in mind that one of the biggest issues with using Ajax for ASP.Net (aka Atlas) is that you can't use a CDN to serve ScriptResource.axd, which means all the javascripts are served by your servers (and there are quite a few of them, around 9!); this means that if your servers are located in the US and a user is accessing your site from China, he will have to make 9 requests just to download the javascript files, and this greatly increases byte download time.
Offline Tools: There are a lot of things that can be calculated/processed asynchronously; you can easily offload these processes to an offline tool and reduce the load on the web servers (for instance, calculating the tag cloud can easily be done once a day or week as part of an offline tool).
Lazy DB Writes: Keep in mind that db writes are the slowest, so it might be a good idea to do some db operations in a batch; the batch data can be written to a flat file or an MQ or even to a distributed cache environment, where it can be picked up by the offline tool.
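The batching idea can be sketched like this (a toy write-behind queue; in real code the flush would also run on a timer or inside the offline tool, and "row" would be whatever your schema needs):

```csharp
using System;
using System.Collections.Generic;

// Callers enqueue rows; the db (or flat file / MQ) only gets hit
// once per batch instead of once per row.
class LazyWriter
{
    private List<string> pending = new List<string>();
    private int batchSize;
    public int FlushCount = 0;   // how many batched writes happened

    public LazyWriter(int batchSize) { this.batchSize = batchSize; }

    public void Write(string row)
    {
        pending.Add(row);
        if (pending.Count >= batchSize)
            Flush();
    }

    public void Flush()
    {
        if (pending.Count == 0) return;
        // here: one multi-row INSERT / one file append / one MQ send
        FlushCount++;
        pending.Clear();
    }
}

class LazyWriterDemo
{
    static void Main()
    {
        LazyWriter writer = new LazyWriter(5);
        for (int i = 0; i < 10; i++)
            writer.Write("row " + i);
        Console.WriteLine(writer.FlushCount);   // 2: ten rows, two batched writes
    }
}
```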
Page Compression: Since there is no direct way of gzipping pages in IIS, people don't even think of compressing their pages; compression can drastically reduce your page sizes. Here is the link which provides directions to setup page compression in IIS 6.
Optimize Images and JS: Images should be optimized for the web and js files should be minified.

That's all that we have done so far to achieve pretty decent response times across our site; what do you think I missed out on?

Sunday, July 27, 2008


C# trivia: what happens if you wrap an object which doesn't implement IDisposable within a using block?
Ans: It doesn't compile: the C# compiler requires the type used in a using block to be implicitly convertible to IDisposable (error CS1674), so you get a compile-time error rather than a silent no-op. The subtler trap is assuming the converse: that because an object can be wrapped in a using block, its Dispose() will free every resource associated with it. That's the mistake I made when I wrapped the XmlReader returned by ExecuteXmlReader() within a using block, (wrongfully) thinking that the connection associated with it would automagically be closed once the reader went out of scope. Everything worked just fine till one day I started getting connection pool exceptions and realized that the connection used by the reader was never getting closed: disposing an XmlReader closes the reader, not the underlying connection. I had to explicitly close the connection in a finally block (or, better, wrap the connection itself in its own using block).
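To make the guarantee concrete, here's a minimal sketch (the Resource class is made up): using ensures Dispose() runs, but Dispose() only releases whatever it was actually written to release:

```csharp
using System;

class Resource : IDisposable
{
    public bool Disposed = false;

    // using only guarantees this method runs; if it doesn't close
    // an associated connection, nothing else will.
    public void Dispose() { Disposed = true; }
}

class Program
{
    static void Main()
    {
        Resource r = new Resource();
        using (r)
        {
            Console.WriteLine(r.Disposed);  // False: still inside the block
        }
        Console.WriteLine(r.Disposed);      // True: Dispose ran on exit
    }
}
```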

Aside: I guess you can wrap me within the using{}, cause I-sure-am-Disposable!

my thoughts on solr

Recently, we had a requirement to provide some advanced searching capabilities for a public web site; one of the most important features was to support faceted navigation on a set of hierarchical fields. After having done my research on the search providers which currently exist in the market (Google search for enterprise, Mercado, Endeca, Omniture etc.), we decided to go with solr (yes, the price was one of the major deciding factors: solr is free!) and I have to admit I haven't been disappointed with it. Since this is a windows/.net shop, I chose to use solrsharp as the bridge. Installing Jetty+solr was a breeze, but I ran into issues with installing Jetty as a service (so that we didn't have to run solr in console mode); also, I read somewhere that Jetty is not great at handling unicode characters (and since ours is a multi-lingual site, this was a big negative): so Jetty gave way to Tomcat. Installing Tomcat as a service was easy on Win32, but getting it to run on 64-bit Windows as a 64-bit application required patching Tomcat with files from the svn repo; after patching Tomcat with the required executables, it installed just fine on a 64-bit machine.
Since solr supports faceted navigation out of the box (along with complex boolean queries), we had absolutely no issues in meeting the requirements. I did run into issues with solrsharp not handling unicode characters (had to patch it) and not supporting NOT searches (had to patch it again), but once I patched solrsharp things have been going great. Currently, I use the Standard request handler (and it looks like solrsharp doesn't support specifying a different request handler at query time), which works fine for most cases; but sometimes I've seen my search results come back quite off, especially when searching on multiple terms: e.g. a search on "Joan of Arc" (w/o the quotes) should ideally first return documents where Joan of Arc appears in close proximity, ahead of other documents; but the resultset returned by solr seems to be ranked more on term frequency (the number of times a word appears in a doc). Also, with the standard request handler you can't assign different weights to different fields in the document, so all fields are treated the same (you might want to assign more weight to the Title field than to Description). The bad news is the standard request handler can't handle all this complex stuff; the good news is there's another request handler just for this: Dismax. So I just need to patch solrsharp so that I can tell it which request handler to use (or have it pick one up from a config file) and we should be able to use Dismax as our request handler. All in all, it's been a breeze using solr: it's quite fast, supports clustering, runs just fine on a windows architecture and above all is free & open source!
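For the record, switching handlers in solr 1.x is just a query parameter; here's a sketch of the kind of select URL we'd want solrsharp to emit (the field names, boosts and facet field are made up for illustration):

```csharp
using System;

class SolrQueryDemo
{
    // Builds a dismax select URL: qt picks the request handler, qf
    // assigns per-field boosts (here Title counts 3x more than
    // Description), and the facet params turn on faceted navigation.
    static string BuildQuery(string baseUrl, string terms)
    {
        return baseUrl + "/select?qt=dismax"
             + "&q=" + Uri.EscapeDataString(terms)
             + "&qf=" + Uri.EscapeDataString("title^3 description")
             + "&facet=true&facet.field=category";
    }

    static void Main()
    {
        Console.WriteLine(BuildQuery("http://localhost:8983/solr", "Joan of Arc"));
    }
}
```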
Moral of the story: If you are planning to provide search functionality in your next site/application: use solr!

Sunday, July 06, 2008

Miserable failure

This stemmed from a talk that I had with someone, so below is what I'd said: if you want to fail, at least fail trying to achieve something that you wanted to achieve; that way, in hindsight, you can always tell yourself that you did your best. So, what's a miserable failure? If you fail trying to achieve what you never wanted anyway, you just end up spending your energy, your time and your emotions on something which really doesn't matter; and the more you get involved, the more you get sucked in... Going by this definition, at least I can't call myself a miserable failure; a failure, yes, but not one of the miserable kind.

Saturday, June 28, 2008

ask a question only if at least one of the possible answers is what you want to hear

So many times I have people asking me questions and then showing their displeasure (either obviously or subtly) when the answer is not what they wanted to hear; the problem is, any other answer wouldn't have satisfied them either; so why do they even ask such questions? Maybe that's their way of confirming what they'd known all along; or, if something goes wrong later, they can always claim that they knew it already: it's a win-win situation for them.

Saturday, May 17, 2008

Miss Mary's last words

It's been long since I last got a chance to write something down; the "new" job has made me work long hours on the same old sh*t! I guess it's either my lack of faith or no longer being under a delusion: when people high up in the food chain talk about commitment, DNA (and other 3-4 letter buzzwords), career growth etc., everything sounds to me like the same old crap. I remind myself every day at work: there is no charity out here; you work for them because they pay you, and they pay you because you work for them and they believe they can mint more money from you than they pay you. I guess the above statement holds true for almost 90% of corporations (& for 60% of the people who work for them) out in the market, and I don't blame them either; the primary reason they are in the market is to be profitable and make money (look ma! no charity). But should this profitability come at the cost of work ethics and professionalism? I doubt that any corporation would accept that they've been unprofessional, sloppy and perhaps even unethical at times, when the fact is that most of them are. So shall I really care? Nahh, this is not my first job and this will not be my last job: been there, done that!

Saturday, March 22, 2008

Another One Bites The Dust

A couple of days back I was looking for a song to download, and given how well my esnips hack(s) have worked in the recent past, I searched for the song on esnips and found out that esnips now uses a flash player rather than the WMP plugin; for some reason my bookmarklet didn't work. Well, like any good netizen I searched online for some other way of downloading audio from esnips and found this greasemonkey script (standard disclaimer: the script is obfuscated so there's no way to tell if it does more than just letting you download the song, so use it at your own discretion), which does the job. By the way, the Process Explorer way of finding the mp3 in your temporary files still works on IE; for some reason I couldn't find any mp3s in the process monitor list when I used Fx.

Saturday, March 08, 2008

It's a beautiful day

Sometimes you want to write/share a lot of stuff; you're tired of keeping it all inside your head, but then you ask yourself: how does it matter? The ones who care will somehow already know how you feel and why, and they'll understand...

Saturday, February 23, 2008

Gmail crash on Firefox

In case you've been bugged by frequent gmail crashes on fx (9 out of 10 times for me), you can safely assume it to be some extension not gelling well with the big G; or at least that was the case for me. Finding the real culprit was a bit tricky though: I disabled AdBlock+ on gmail, and uninstalled YSlow and Firebug without any luck; in the end, the less pesky HTML Validator turned out to be the problem, and disabling HTML Validator for gmail has fixed it for now.

Thursday, February 21, 2008

Spammy Yummy!

Another one of those spam-hatred posts; I didn't check my gmail a/c for three days and now I see 165 spam messages: I have no inclination to sift through them even though there might be a few false positives. At least with gmail I was very careful about posting it on public forums; heck, I would try to use a throw-away address to register on public forums in case I really had to register, but the darn Google Newsgroups didn't let me use any a/c except Google to post to their forums (there are some disadvantages to SSO after all); so now my gmail address is open for spammers to harvest, and I guess the monsoon was really good this year, cause the harvesting seems to be going on in full swing!
Digressing a little: I believe there are two ways your email gets spammed: (1) it's somewhere on a public-facing site and is harvested by bots/spammers; (2) you provide your email address to a site and they sell it off to harvesters. For (1), if you indeed have to make your email address public, there's not much you can do except use throw-away email addresses like mailinator.
For (2), if you have your own email provider with a default catch-all email a/c then it's easy: create a unique email address for every site that you register with; for instance, my registration email address could be for the Ameritrade site and for site; this way, at least once I start getting spammed, I know who sold my email address (and can perhaps create junk filters). Unfortunately, this doesn't work for the majority of people, who don't have their own email provider. I think it would be great if public email providers like Yahoo, Google & MS etc. supported some kind of "catch-all" functionality with the email addresses they provide: something like, if I prefix or suffix my actual email address with some text and a delimiter, the mail would still be forwarded to my real email address, e.g. (here # is the delimiter and ameritrade is the suffix) should fwd mails to and then I would have a way to know in case Ameritrade ever sells my email to some spammers. (As it turns out, gmail already does something along these lines: mail sent to yourname+anything@gmail.com lands in yourname@gmail.com.)

Saturday, February 02, 2008


I've been noticing that I have become quite infrequent with my posting of late; but once you haven't been doing anything really worthwhile for some time, it's a little tough to post regularly without sounding redundant! Anyway, after a long hiatus (7 months to be precise), I again have a day job: it's the same run-of-the-mill stuff that I've been doing all through these years. It's not that I didn't know this job wasn't going to be any different from the older ones before I took it; it's just that I felt I gotta be doing something(!), and since even 7 months of break didn't really help me figure out what I wanted out of my life, the best option was to get a job which would at least keep my motor running! The ironic bit about this job is that the first thing they do is send me back to the states for a few weeks, a place I didn't really want to return to; and the biggest irony being that my return is on the 8th of March (if you've been a regular to my blog, you will know the significance of this day in my life). This is my first trip to the west coast, so maybe I will spend time trying to see the places nearby: I can see the towering (literally & metaphorically) Space Needle from my hotel: got to go to the top; I have plans to visit the Woodland Park Zoo and, if they still have orca watching this time of year, hopefully sight a few orcas. Visited the Aquarium today, which I think is passable (the Boston one is any day better); but then perhaps all these places wouldn't really appeal to me without having someone to share the experiences with!