replace or remove script

Solution Title: Regular Expressions to remove or replace
Author: pmengal
Points: 500   Grade: A
Date: 05/12/2003 01:18AM PDT

Hello,

I want to use a regular expression to replace or remove some text.

Replace
-------

I want to be able to replace > with &gt; in the following HTML text:

"<p><strong>Superman is greater > than Spiderman</strong></p>"

The same code should work also for this text without any change :

"<span class="thisname<isinvalid"><b>a > b ?</b></span>"

As you may have guessed, this is for use with a custom HTML encoder.

Remove
------

I want to remove everything (tags included) between <script> and </script>, for example:

"<script> some malicious code </script>"

The same code (without any change, but of course different from the replace code) should work on this too:

"<script language="javascript"> some malicious code </script>"

and this one

"<script language='javascript'> some malicious code </script>"

and this one too

"<script dull="dull" language="javascript"> some malicious code</script>"

and this one too ...

"<SCRIPT Language="JavaScript"> some malicious code </ScripT>"

Sorry to be so thorough, but I have posted some 500- and 250-point questions and got incomplete answers because the questions were not complete enough.

Thanks in advance!

Comment from pmengal
Date: 05/12/2003 01:19AM PDT
Author Comment

Forgot to say :

Can you provide ALL the code to achieve this? Giving me just the regular expression will not help me. I'm not familiar with regular expressions at all...

If you have time, pointers to websites where I can learn more would be welcome ;)

Comment from AvonWyss
Date: 05/12/2003 05:22AM PDT
Comment

I will.... stay tuned.

Comment from testn
Date: 05/12/2003 06:37AM PDT
Comment

Hi,

my previous regex should work with >

pattern = "((?<!(<([^>])*))(>))|((?<=((<(\/)?[^A-Z,a-z,/,]){1}([^>,<])*))(>))"

About tutorials,

http://www.c-sharpcorner.com/3/RegExpPSD.asp
http://www.wellho.net/regex/dotnet.html
http://windows.oreilly.com/news/csharp_0101.html

If you want a comprehensive book, you might consider buying the PDF from Amazon:
http://www.amazon.com/exec/obidos/tg/detail/-/B0000632ZU/102-4200309-1247344?vi=glance


Accepted Answer from testn
Date: 05/12/2003 06:42AM PDT
Accepted Answer

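This is the code for removing malicious code:

using System.Text.RegularExpressions;

// Removes every <script ...>...</script> block; tag names match case-insensitively.
public string removeMaliciousCode(string oldStr) {
    // (?i) = ignore case; (\w|\W)* = any characters, newlines included (greedy)
    string pattern = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";
    string newStr = Regex.Replace(oldStr, pattern, "");
    return newStr;
}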

This function returns the input string with the malicious code removed.

Comment from testn
Date: 05/12/2003 06:47AM PDT
Comment

Explanation of the function:

(?i) means case-insensitive matching.

<script([^>])*> matches any string that starts with "<script" and has 0 or more characters before the closing >.

(\w|\W)* matches whatever lies between <script> and </script> (0 or more characters of anything, newlines included).

</script([^>])*> matches any string that starts with "</script" and has 0 or more characters before the closing >.

However, this pattern may be too greedy: (\w|\W)* runs to the last </script> in the input, so it matches the whole of the following string and does not leave Hello behind:

<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>

Comment from testn
Date: 05/12/2003 07:22AM PDT
Comment

You can make it better by using (after the same (?i) prefix)

<script[^>]*>.*?</script[^>]*>

It will reduce

<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>

to

Hello

since .*? is a non-greedy (lazy) match: it matches as few characters as possible, so each match ends at the nearest </script>.
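
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the sample string from this thread. It is written as C# 9 top-level statements for brevity; the variable names are only illustrative, and the (?i) prefix is kept on the lazy pattern so that <SCRIPT> still matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample string from the thread: two script blocks with " Hello " between them.
string input = "<SCRIPT Language=\"JavaScript\"> some malicious code</ScripT> Hello <script ></script>";

// Greedy: (\w|\W)* runs to the LAST </script>, so one match covers everything.
string greedyPattern = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";

// Lazy: .*? stops at the NEAREST </script>, so each block is matched separately.
string lazyPattern = @"(?i)<script[^>]*>.*?</script[^>]*>";

string greedyResult = Regex.Replace(input, greedyPattern, "");
string lazyResult = Regex.Replace(input, lazyPattern, "");

Console.WriteLine("greedy: [" + greedyResult + "]"); // the whole string is removed
Console.WriteLine("lazy:   [" + lazyResult + "]");   // the text between the blocks survives
```

Note that the lazy version keeps the spaces around Hello, since only the script blocks themselves are removed.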

Comment from testn
Date: 05/12/2003 07:33AM PDT
Comment

Please also test this against multi-line data.

You might need to change it to

<script[^>]*>(\w|\W)*?</script[^>]*>

or to

(?m)<script[^>]*>(\w|\W)*?</script[^>]*>
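
A short sketch of why multi-line input matters, assuming a script block that spans several lines (the sample string is made up for illustration). In .NET, . does not match a newline unless RegexOptions.Singleline (inline: (?s)) is set, whereas (\w|\W) matches any character regardless. As far as I know, (?m) only changes how ^ and $ behave, so it should not make a difference for these patterns. Written as C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical multi-line input: the script body contains newlines.
string input = "<script>\nalert(1);\n</script>after";

// 1. Lazy dot without Singleline: . cannot cross the newlines, so nothing matches.
string dotDefault = Regex.Replace(input, @"(?i)<script[^>]*>.*?</script[^>]*>", "");

// 2. Same pattern with Singleline: . now matches newlines too.
string dotSingleline = Regex.Replace(input, @"(?i)<script[^>]*>.*?</script[^>]*>", "",
                                     RegexOptions.Singleline);

// 3. (\w|\W)*? matches any character, newlines included, with no extra option.
string anyChar = Regex.Replace(input, @"(?i)<script[^>]*>(\w|\W)*?</script[^>]*>", "");

// 4. Adding (?m) to the dot pattern changes nothing here: it only affects ^ and $.
string dotMultiline = Regex.Replace(input, @"(?im)<script[^>]*>.*?</script[^>]*>", "");

Console.WriteLine(dotDefault == input);   // script block still present
Console.WriteLine(dotSingleline);         // script block removed
Console.WriteLine(anyChar);               // script block removed
Console.WriteLine(dotMultiline == input); // script block still present
```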

Comment from AvonWyss
Date: 05/12/2003 09:39PM PDT
Comment

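private string ReplaceMatch(Match match) {
    if (match.Groups["script"].Success)
        return "";          // a whole <script>...</script> block: drop it
    else if (match.Groups["gt"].Value == ">")
        return "&gt;";      // a lone > outside any tag: encode it
    else
        return match.Value; // a complete tag: keep it unchanged
}

public string CleanupHtml(string html) {
    // Quoted attribute values ([^""]* and [^']*) are matched as a unit, so a < or > inside them does not end the tag.
    return Regex.Replace(html, @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)", new MatchEvaluator(ReplaceMatch), RegexOptions.ExplicitCapture|RegexOptions.IgnoreCase|RegexOptions.Singleline);
}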


This regex handles both of your tasks at the same time. The first part is pretty similar to testn's suggestion, but I also provide the code to find single > chars (with no matching < before).
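
As a rough check, the combined approach can be tried against the sample strings from the question. The sketch below is a standalone variant, not the exact code above: it uses C# 9 top-level statements and a lambda in place of the ReplaceMatch method, and it assumes the quoted-attribute alternatives are written as ""[^""]*"" and '[^']*' so that they accept values of any length. The last sample appends the word Hello, which is not in the question, to show what survives.

```csharp
using System;
using System.Text.RegularExpressions;

// Named group "script": a whole script block. Named group "gt": a complete tag or a lone >.
string pattern =
    @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)";

RegexOptions options =
    RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline;

// Illustrative helper mirroring CleanupHtml from the comment above.
string CleanupHtml(string html) =>
    Regex.Replace(html, pattern, m =>
    {
        if (m.Groups["script"].Success) return "";   // drop script blocks
        if (m.Groups["gt"].Value == ">") return "&gt;"; // encode a lone >
        return m.Value;                              // leave real tags alone
    }, options);

string r1 = CleanupHtml("<p><strong>Superman is greater > than Spiderman</strong></p>");
string r2 = CleanupHtml("<span class=\"thisname<isinvalid\"><b>a > b ?</b></span>");
string r3 = CleanupHtml("<SCRIPT Language=\"JavaScript\"> some malicious code </ScripT>Hello");

Console.WriteLine(r1);
Console.WriteLine(r2);
Console.WriteLine(r3);
```

The script alternative comes first in the pattern so that a script block is removed as a whole before its opening tag can be matched as an ordinary tag.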

Comment from osxmaster
Date: 02/23/2004 02:19PM PST
Comment

Hi, do you know how I can strip the HTML and script tags from this page?

http://www.wipo.org

Seems to be very complicated.

thanks

Comment from AvonWyss
Date: 02/24/2004 09:59PM PST
Comment

Yes I do. Just a few days ago I answered a very similar question; have a look at recent posts in C# http:Q_20892954.html

Or you can of course also post a new Q.

From: http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_20613142.html#8508997
posted @ 2004-03-09 09:32  dudu