replace or remove script
Solution Title: Regular Expressions to remove or replace Author: pmengal Points: 500 Grade: A Date: 05/12/2003 01:18AM PDT
Hello,

I want to use a regular expression to replace or remove some text.

Replace
-------
I want to replace > with &gt; in the following HTML text:
"<p><strong>Superman is greater > than Spiderman</strong></p>"

The same code should also work, without any change, for this text:
"<span class="thisname<isinvalid"><b>a > b ?</b></span>"

As you may have guessed, this is for use with a custom HTML encoder.

Remove
------
I want to remove everything between <script> and </script>, tags included, as in:
"<script> some malicious code </script>"

The same code (without any change, though of course different from the replace code) should work on this too:
"<script language="javascript"> some malicious code </script>"
and this one:
"<script language='javascript'> some malicious code </script>"
and this one:
"<script dull="dull" language="javascript"> some malicious code</script>"
and this one:
"<SCRIPT Language="JavaScript"> some malicious code </ScripT>"

Sorry to be so thorough, but I have posted some 500- and 250-point questions and received incomplete answers because the questions were not complete enough.

Thanks in advance!
Comment from pmengal Date: 05/12/2003 01:19AM PDT
Author Comment
Forgot to say: can you provide ALL the code to achieve this? Giving me just the regular expression will not help me. I'm not familiar with regular expressions at all... If you have time, giving me some
Comment from AvonWyss Date: 05/12/2003 05:22AM PDT
I will... stay tuned.
Comment from testn Date: 05/12/2003 06:37AM PDT
Hi, my previous regex should work with >

pattern = "((?<!(<([^>])*))(>))|((?<=((<(\/)?[^A-Z,a-z,/,]){1}([^>,<])*))(>))"

About tutorials:
http://www.c-sharpcorner.com/3/RegExpPSD.asp
http://www.wellho.net/regex/dotnet.html
http://windows.oreilly.com/news/csharp_0101.html

If you want a comprehensive book, you might consider buying the PDF from Amazon:
http://www.amazon.com/exec/obidos/tg/detail/-/B0000632ZU/102-4200309-1247344?vi=glance
Accepted Answer from testn Date: 05/12/2003 06:42AM PDT
This is the code for removing malicious code.

using System.Text.RegularExpressions;

public string removeMaliciousCode(string oldStr)
{
    // (?i) makes the match case-insensitive; (\w|\W)* matches any characters, newlines included.
    string pattern = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";
    string newStr = Regex.Replace(oldStr, pattern, "");
    return newStr;
}

This function returns the input string with the script blocks removed.
Comment from testn Date: 05/12/2003 06:47AM PDT
Explanation of the function:

(?i) means case-insensitive matching.
<script([^>])*> matches "<script" followed by zero or more characters other than ">", then ">".
(\w|\W)* matches whatever lies between <script> and </script> (zero or more characters of any kind).
</script([^>])*> matches "</script" followed by zero or more characters other than ">", then ">".

However, this pattern may be too aggressive: because (\w|\W)* is greedy, it runs to the last </script> in the input. It will match the whole of
<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>
so "Hello" is removed as well.
Comment from testn Date: 05/12/2003 07:22AM PDT
You can improve it by using

<script[^>]*>.*?</script[^>]*>

(keep the (?i) prefix, or pass RegexOptions.IgnoreCase, so that mixed-case tags still match). It will reduce
<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>
to Hello, because .*? is non-greedy: it matches as few characters as the pattern allows, so each match stops at the first </script>.
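The difference between the greedy and non-greedy patterns can be checked with a small console program. This is only a sketch: the class name ScriptStripDemo and its members are made up for illustration and are not part of testn's answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class ScriptStripDemo
{
    // Greedy: (\w|\W)* runs to the LAST </script> in the input.
    public const string GreedyPattern = @"(?i)<script[^>]*>(\w|\W)*</script[^>]*>";

    // Non-greedy: .*? stops at the FIRST </script> after each opening tag.
    public const string LazyPattern = @"(?i)<script[^>]*>.*?</script[^>]*>";

    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string input =
            "<SCRIPT Language=\"JavaScript\"> some malicious code</ScripT> Hello <script ></script>";

        Console.WriteLine("[" + Strip(input, GreedyPattern) + "]"); // prints "[]"
        Console.WriteLine("[" + Strip(input, LazyPattern) + "]");   // prints "[ Hello ]"
    }
}
```

The brackets in the output are there only to make the surviving spaces around Hello visible.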
Comment from testn Date: 05/12/2003 07:33AM PDT
Please also test this against multi-line data. By default . does not match a newline, so you might need to change the pattern to

<script[^>]*>(\w|\W)*?</script[^>]*>

or to

(?s)<script[^>]*>.*?</script[^>]*>

Note that it is (?s) (RegexOptions.Singleline) that lets . match newlines; (?m) only changes how ^ and $ behave.
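For multi-line input, one possible way to write this in C# is to pass the options as RegexOptions flags instead of inline prefixes. A minimal sketch, assuming a made-up class name MultiLineScriptDemo and method name RemoveScripts:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class MultiLineScriptDemo
{
    public static string RemoveScripts(string html)
    {
        // Singleline lets '.' match '\n'; IgnoreCase handles <SCRIPT>, </ScripT>, etc.
        return Regex.Replace(
            html,
            @"<script[^>]*>.*?</script[^>]*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "before<script>\nalert(1);\n</script>after";
        Console.WriteLine(RemoveScripts(html)); // prints "beforeafter"
    }
}
```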
Comment from AvonWyss Date: 05/12/2003 09:39PM PDT
private string ReplaceMatch(Match match)
{
    // A whole <script>...</script> block: drop it.
    if (match.Groups["script"].Success)
        return "";
    // A lone > that is not the end of a tag: encode it.
    else if (match.Groups["gt"].Value == ">")
        return "&gt;";
    // A regular tag: leave it untouched.
    else
        return match.Value;
}

public string CleanupHtml(string html)
{
    return Regex.Replace(html,
        @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)",
        new MatchEvaluator(ReplaceMatch),
        RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

This regex does both of your tasks in a single pass. The first part is much like testn's suggestion; the second part finds single > characters (those with no matching < before them), skipping over complete tags, including quoted attribute values.
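To try the combined approach on the sample strings from the question, something like the following console program could be used. It is a sketch under assumptions: the wrapper class HtmlCleanupDemo and the static modifiers are added for illustration, the quoted-attribute branches are assumed to use * (zero or more characters), and the replacement for a lone > is assumed to be &gt;.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class around the two methods from the comment above.
public static class HtmlCleanupDemo
{
    private static string ReplaceMatch(Match match)
    {
        if (match.Groups["script"].Success)
            return "";                      // whole script block: remove
        else if (match.Groups["gt"].Value == ">")
            return "&gt;";                  // lone '>': encode
        else
            return match.Value;             // ordinary tag: keep as-is
    }

    public static string CleanupHtml(string html)
    {
        return Regex.Replace(html,
            @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])+)?>)",
            new MatchEvaluator(ReplaceMatch),
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Sample inputs taken from the question.
        Console.WriteLine(CleanupHtml(
            "<p><strong>Superman is greater > than Spiderman</strong></p>"));
        Console.WriteLine(CleanupHtml(
            "<span class=\"thisname<isinvalid\"><b>a > b ?</b></span>"));
        Console.WriteLine(CleanupHtml(
            "<SCRIPT Language=\"JavaScript\"> some malicious code </ScripT>Hi"));
    }
}
```

In each of the first two samples the tags should come back unchanged with only the free-standing > encoded; in the third, only Hi should remain.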
Comment from osxmaster Date: 02/23/2004 02:19PM PST
Hi, do you know how I can strip the HTML and script tags from this page? http://www.wipo.org It seems to be very complicated. Thanks.
Comment from AvonWyss Date: 02/24/2004 09:59PM PST
Yes, I do. Just a few days ago I answered a very similar question; have a look at recent posts in C#: http:Q_20892954.html Or you can of course also post a new question.

From: http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_20613142.html#8508997