Disclaimer
I just brought the site back to life so that people can stop reminding me that it's down. Please note that most, if not all, of the content is probably outdated. I'll try to update things as soon as possible.
Best, Ingo
trouble with xhtml_cleaning
While working on the trackback implementation of timtab I stumbled upon a bug (feature?) in TYPO3's xhtml_cleaning function. In XHTML every tag name must be lowercase, so the xhtml_cleaning function of TYPO3 does this job for us, even inside comments.
Trackback requires us to put a piece of code into an HTML comment to advertise the trackback capabilities of the site.
Now the problem is that the code advertising our trackback capabilities gets lowercased and cleaned, too. Other blogging tools cannot find this "cleaned" code, so they can't "see" that our site is capable of receiving trackbacks.

[Figure: the differences between xhtml_cleaning off/on]
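To make the breakage concrete, here is a minimal sketch in Python (not TYPO3's PHP). The `naive_lowercase_tags` function is only a hypothetical stand-in for the cleaner, the URLs are placeholders, and `find_ping_url` assumes a strict, case-sensitive client; real blogging tools may search differently.

```python
import re

# Hypothetical trackback autodiscovery block inside an HTML comment.
# All example.org URLs are placeholders.
PAGE = """<p>My post</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description rdf:about="http://example.org/post/1"
    dc:identifier="http://example.org/post/1"
    dc:title="My post"
    trackback:ping="http://example.org/trackback/1" />
</rdf:RDF>
-->"""


def naive_lowercase_tags(html):
    """Stand-in cleaner (assumption): lowercases every tag name,
    including tags that sit inside comments."""
    return re.sub(
        r"(</?)([A-Za-z][\w:.-]*)",
        lambda m: m.group(1) + m.group(2).lower(),
        html,
    )


def find_ping_url(html):
    """Case-sensitive discovery, as a strict client might do it."""
    m = re.search(r'<rdf:Description\b[^>]*\btrackback:ping="([^"]+)"', html)
    return m.group(1) if m else None


print(find_ping_url(PAGE))                        # ping URL is found
print(find_ping_url(naive_lowercase_tags(PAGE)))  # None: tag is now <rdf:description
```

With cleaning off, the client finds the ping URL; with the stand-in cleaner applied, `<rdf:Description` has become `<rdf:description` and the case-sensitive search comes back empty.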
I found the function that is responsible for cleaning the source: it's HTMLcleaner() in class.t3lib_parsehtml.php. I'll try to teach it to leave comments as they are. Nobody cares about comments when it comes to validating anyway.
To find the comments, strip them before cleaning, and put them back afterwards, I needed a regular expression. Here's what I got, and it works pretty well even with the source code of sites like spiegel.de.
The regular expression explained:
- Match the characters "<!--" literally
- Match the regular expression below
  - Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - Match either the regular expression below (attempting the next alternative only if this one fails)
    - Match a single character present in the list below
      - One of the characters "<>=+.:;,{}"'?#*%|&()@/"
      - A [ character
      - A ] character
      - Match a single character that is a "word character" (letters, digits, etc.)
      - Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
      - A - character
      - A \ character
    - Match a single character present in the list below
  - Or match regular expression number 2 below (attempting the next alternative only if this one fails)
    - Match the characters "!important" literally, which could be used in CSS
  - Or match regular expression number 3 below (the entire group fails if this one fails to match)
    - Match the characters "!=" literally, which could be used in JavaScript as "not equal"
- Match the characters "-->" literally
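The explanation above, together with the strip-and-restore idea, can be sketched roughly as follows. This is Python rather than TYPO3's PHP, and several things are assumptions: the pattern is assembled from the step-by-step explanation and may differ from the original, `lowercase_tags` is a hypothetical stand-in for HTMLcleaner(), and the `@@commentN@@` placeholder format is made up for illustration (it must not occur in the page itself).

```python
import re

# Assumption: pattern assembled from the step-by-step explanation above.
COMMENT_RE = re.compile(
    r"""<!--(?:[<>=+.:;,{}"'?#*%|&()@/\[\]\w\s\-\\]|!important|!=)*-->"""
)


def lowercase_tags(html):
    """Hypothetical stand-in for the cleaner: lowercases tag names only."""
    return re.sub(
        r"(</?)([A-Za-z][\w:.-]*)",
        lambda m: m.group(1) + m.group(2).lower(),
        html,
    )


def clean_preserving_comments(html, cleaner=lowercase_tags):
    """Stash comments, run the cleaner, then put the comments back."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        # Lowercase placeholder that the stand-in cleaner will not alter.
        return "@@comment%d@@" % (len(saved) - 1)

    protected = COMMENT_RE.sub(stash, html)
    cleaned = cleaner(protected)
    # A function replacement keeps backslashes in comments untouched.
    return re.sub(r"@@comment(\d+)@@", lambda m: saved[int(m.group(1))], cleaned)


src = '<DIV><!-- <rdf:RDF trackback:ping="http://example.org/tb/1"/> --><P>Hi</P></DIV>'
print(clean_preserving_comments(src))
```

The tags outside the comment come out lowercased while the comment body keeps its original case. Because "!" is only allowed as part of "!important" or "!=", a match cannot run across the "<!--" of a following comment; the flip side is that a comment containing a bare "!" is not matched by this pattern at all.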
[update] Hmm...
The regex is nice, but I managed to shorten it "a bit". The following regex does exactly the same, provided that . also matches newline characters, which can be configured.
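A minimal sketch of what such a shortened pattern might look like, in Python. The pattern `<!--.*?-->` is an assumption, not necessarily the one meant here; `re.DOTALL` is Python's way of letting `.` match newlines, and the non-greedy `.*?` is also an assumption, chosen so that a match stops at the first "-->".

```python
import re

# Assumed short form: anything between "<!--" and the nearest "-->".
# re.DOTALL makes "." match newline characters as well.
SHORT_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def find_comments(html):
    """Return every HTML comment found in the given source."""
    return SHORT_COMMENT_RE.findall(html)


html = "<p>a</p><!-- first\ncomment --><p>b</p><!-- second! -->"
print(find_comments(html))
# ['<!-- first\ncomment -->', '<!-- second! -->']
```

A greedy `.*` in the same place would swallow everything from the first "<!--" to the last "-->" on the page, so the non-greedy form seems the safer reading. Unlike the longer character-class pattern, this one also accepts comments that contain a bare "!".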
Arrrgghhhh...