Sweeper is an HTML code cleaner based on Mihai Şucan's ReTidy. It is written in PHP and mostly uses regular expressions, DOM, XPath and LOM.
Use run_sweeper.php or paste_sweep.php while selecting the profile and other options for sweeper to use. Options:
You must have PHP installed. For Windows, try WampServer. Under Linux, PHP is probably already installed.
For output of messages as sweeper is running:
...
; Default Value: Off
; Development Value: 4096
; Production Value: 4096
; http://php.net/output-buffering
output_buffering = 0 ; was On
...
Sweeper can sometimes take a long time to run, especially when sweeping many or long files or using the clean_word profile.
...
; Maximum execution time of each script, in seconds
; http://php.net/max-execution-time
; Note: This directive is hardcoded to 0 for the CLI SAPI
max_execution_time = 3600 ; was 30

; Maximum amount of time each script may spend parsing request data. It's a good
; idea to limit this time on productions servers in order to eliminate unexpectedly
; long running scripts.
; Note: This directive is hardcoded to -1 for the CLI SAPI
; Default Value: -1 (Unlimited)
; Development Value: 60 (60 seconds)
; Production Value: 60 (60 seconds)
; http://php.net/max-input-time
max_input_time = 3600 ; was 60

; Maximum input variable nesting level
; http://php.net/max-input-nesting-level
;max_input_nesting_level = 64

; Maximum amount of memory a script may consume (128MB)
; http://php.net/memory-limit
memory_limit = 1024M ; was 128M
...
To suppress notices about coding quality (which were also suppressed while sweeper was being written):
...
; Default Value: E_ALL & ~E_NOTICE
; Development Value: E_ALL | E_STRICT
; Production Value: E_ALL & ~E_DEPRECATED
; http://php.net/error-reporting
error_reporting = E_ALL & ~E_NOTICE & ~E_STRICT
...
The correct timezone must be chosen so that the versioning of abbreviations files will work properly (see http://php.net/manual/en/timezones.php for a list of acceptable values):
...
[Date]
; Defines the default timezone used by the date functions
; http://php.net/date.timezone
date.timezone = (your value here) ; was UTC
...
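The php.ini changes above can be double-checked from PHP itself. The following is a minimal sketch, not part of sweeper: the function names `iniSizeToBytes` and `checkSweeperSettings` are hypothetical, and the thresholds (3600 seconds, 1024M) are assumed from the values shown above.

```php
<?php
/** Convert a php.ini shorthand size such as "128M" to bytes ("-1" stays -1, meaning unlimited). */
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    $number = (int) $value; // an explicit cast reads the leading digits: "128M" gives 128
    switch (strtoupper(substr($value, -1))) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

/** Return a list of human-readable problems; an empty list means the settings look fine. */
function checkSweeperSettings(array $ini): array
{
    $problems = [];

    // "", "0" and "Off" all mean that output buffering is disabled.
    $buffering = strtolower(trim((string) ($ini['output_buffering'] ?? '0')));
    if (!in_array($buffering, ['', '0', 'off'], true)) {
        $problems[] = 'output_buffering is enabled; messages will not appear while sweeper runs';
    }

    // 0 means "no limit" for max_execution_time.
    $seconds = (int) ($ini['max_execution_time'] ?? 0);
    if ($seconds !== 0 && $seconds < 3600) {
        $problems[] = 'max_execution_time is below 3600 seconds';
    }

    // -1 means "no limit" for memory_limit.
    $bytes = iniSizeToBytes((string) ($ini['memory_limit'] ?? '-1'));
    if ($bytes !== -1 && $bytes < iniSizeToBytes('1024M')) {
        $problems[] = 'memory_limit is below 1024M';
    }

    if (trim((string) ($ini['date.timezone'] ?? '')) === '') {
        $problems[] = 'date.timezone is not set';
    }

    return $problems;
}

// Usage: read the live values with ini_get() and report anything suspicious.
$current = [];
foreach (['output_buffering', 'max_execution_time', 'memory_limit', 'date.timezone'] as $name) {
    $current[$name] = ini_get($name);
}
foreach (checkSweeperSettings($current) as $problem) {
    echo $problem, PHP_EOL;
}
```

Note that the CLI hardcodes max_execution_time to 0, as the php.ini comments above say, so that check only matters when sweeper runs through a web server.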
Profiles configure what sweeper does. They specify what functions run and how to run them.
Applies abbreviations, either for the specified government department or from an abbreviations file given by its path. Use after the find_abbr profile.
It may sometimes be useful to render the output with paste_sweep.php and click the "Flip abbr and acronyms" checkbox to verify that abbreviations have been properly applied.
Adds a Byte Order Mark to files; make sure your files are UTF-8.
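As an illustration of what adding a Byte Order Mark amounts to, here is a minimal sketch. It is not sweeper's own code, and the function name `addUtf8Bom` is hypothetical. It assumes the input is already UTF-8 and avoids adding a second BOM.

```php
<?php
/** Prepend the UTF-8 Byte Order Mark unless the contents already start with one. */
function addUtf8Bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes that encode U+FEFF in UTF-8
    return strncmp($contents, $bom, 3) === 0 ? $contents : $bom . $contents;
}
```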
This profile facilitates processing of arbitrary code that is located in the arbitrary_sweep() function.
Does some basic cleaning that should be safe for most files.
Does some basic cleaning that should be safe for most files, with the addition of clf2 changes and strip_xmllang.
Converts CSS classes to inline styles by finding the style information for a given class in the stylesheets.
Mostly syntax formatting of CSS code.
This profile is intended to be used on HTML generated from Adobe Acrobat's “Save As... HTML (*.html, *.htm)” option.
This profile is intended to be used on XML generated from Microsoft Excel's “Save As... XML Spreadsheet 2003 (*.xml)” option. You'll probably want to run clean_word afterwards, since this profile brings the code into a format similar to what that profile expects.
Does a couple simple things to clean web feed files, such as generating an ID for an entry.
Converts common InDesign-generated class names to HTML code.
This profile is part of a process intended to enable persons without HTML editing skills to edit a webpage in a word processor instead. The process is as follows:
Cleans up a few common code artifacts from Wix.
This profile is intended to be used on HTML generated from Microsoft Word's “Save As... Web Page, Filtered (*.htm; *.html)” option. This profile can take a long time to run, because it strips the many unnecessary wrappers that Word generates. Ensuring that the document stays semantically the same while using cleaner code (fewer tags) can take a long time. A significant contribution to the required time is:
Same as clean_word, with the addition of clf2 changes.
Uses clf2 specific code, such as clf2 class names.
Converts the character set of files to iso-8859-1.
Converts the character set of files to UTF-8.
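The two character set conversions above can be sketched with PHP's `iconv()` function. This is an assumption for illustration only: the source does not say which conversion function sweeper uses, and the wrapper name `convertCharset` is hypothetical.

```php
<?php
/**
 * Convert text from one character set to another.
 * Throws if iconv() reports failure, for example when a character
 * cannot be represented in the target character set.
 */
function convertCharset(string $text, string $from, string $to): string
{
    $converted = iconv($from, $to, $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $from to $to");
    }
    return $converted;
}

// Usage: the byte 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes two bytes.
$utf8 = convertCharset("\xE9", 'ISO-8859-1', 'UTF-8');
$latin1 = convertCharset($utf8, 'UTF-8', 'ISO-8859-1');
```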
Transforms character entities to their raw character equivalents.
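A minimal sketch of this kind of transformation follows. It is not sweeper's implementation; the function name `decodeEntitiesKeepMarkup` is hypothetical, and leaving the markup-significant entities (`&amp;`, `&lt;`, `&gt;` and the quote entities) untouched is an assumption made so that the HTML stays valid.

```php
<?php
/**
 * Replace character entities with raw UTF-8 characters, except for
 * entities whose raw form would change the meaning of the markup.
 */
function decodeEntitiesKeepMarkup(string $html): string
{
    $result = preg_replace_callback(
        '/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function (array $match): string {
            $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Keep the entity when decoding would produce a markup character.
            return in_array($decoded, ['<', '>', '&', '"', "'"], true) ? $match[0] : $decoded;
        },
        $html
    );
    return $result ?? $html; // preg_replace_callback() returns null on a regex error
}
```

Unknown names such as `&foo;` are left as they are, because `html_entity_decode()` does not change entities it does not recognize.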
Transforms appropriately-marked sections into definition lists.
Undoes kerning in which a single character entity (a ligature) appears in place of two or more letters, because plain letters are more widely supported across computer programs than the HTML character entities of their ligatures.
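The idea can be sketched as a simple lookup table. The list below covers only the five Latin f-ligatures (U+FB00 to U+FB04) in raw and decimal-entity form; which ligatures sweeper actually handles is not stated in the source, and the function name `expandLigatures` is hypothetical.

```php
<?php
/** Replace common Latin ligatures (raw or as decimal entities) with plain letters. */
function expandLigatures(string $text): string
{
    $map = [
        "\u{FB00}" => 'ff',  '&#64256;' => 'ff',
        "\u{FB01}" => 'fi',  '&#64257;' => 'fi',
        "\u{FB02}" => 'fl',  '&#64258;' => 'fl',
        "\u{FB03}" => 'ffi', '&#64259;' => 'ffi',
        "\u{FB04}" => 'ffl', '&#64260;' => 'ffl',
    ];
    return strtr($text, $map);
}
```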
Makes tables accessible. This amounts to making the data relations explicit, so that user agents other than a browser on a typically sized PC monitor can also determine how to represent the data. Specific actions this profile takes:
Same as dom_table_accessibility, but also runs strip_tbody, strip_xmllang, strip_caption_pre and strip_span1. Uses complex table accessibility (ids and headers).
Makes tables accessible. This amounts to making the data relations explicit, so that user agents other than a browser on a typically sized PC monitor can also determine how to represent the data. Freshly restructures the tables and applies scope to the headers.
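To show what applying scope to headers can look like with DOM and XPath, here is a rough sketch under simple assumptions: header cells in the first row of a table get scope="col", and header cells in later rows get scope="row". This heuristic and the function name `addScopeToHeaders` are illustrative only; sweeper's actual rules are not described here. The sketch takes an HTML fragment with ASCII or entity-encoded content and does not treat nested tables separately.

```php
<?php
/** Add a scope attribute to <th> cells that do not have one yet. */
function addScopeToHeaders(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//table') as $table) {
        $rowIndex = 0;
        foreach ($xpath->query('.//tr', $table) as $row) {
            foreach ($xpath->query('./th', $row) as $th) {
                if (!$th->hasAttribute('scope')) {
                    $th->setAttribute('scope', $rowIndex === 0 ? 'col' : 'row');
                }
            }
            $rowIndex++;
        }
    }

    // Serialize only the children of <body>, so the wrapper added above is dropped.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```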
Same as dom_table_accessibility, but also runs strip_tbody, strip_xmllang, strip_caption_pre and strip_span1. Uses simple table accessibility (scope, colspan and rowspan).
Transforms character entities and raw characters to their named character entity equivalents.
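For the raw-character half of this transformation, a minimal sketch can be built on PHP's `get_html_translation_table()`. The function name `rawToNamedEntities` is hypothetical, and removing `<`, `>` and `&` from the table is an assumption made so that existing tags and entities are left alone. Converting numeric entities to named ones is not covered by this sketch.

```php
<?php
/** Replace raw UTF-8 characters with named HTML 4.01 entities, leaving markup untouched. */
function rawToNamedEntities(string $html): string
{
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
    // Do not touch the characters that make up tags and existing entities.
    unset($table['<'], $table['>'], $table['&']);
    return strtr($html, $table);
}
```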
Nearly the same as update_to_WET4.
Looks in the code for potential abbreviations and writes them to a file. A human then reviews the potentially new abbreviations and adds the ones they want to the abbreviations file.
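One simple way to look for candidates is to collect runs of two or more capital letters from the text. The sketch below does only that; the heuristic and the function name `findPotentialAbbreviations` are assumptions for illustration, and sweeper's own detection is likely more involved.

```php
<?php
/** Return a sorted list of distinct all-caps words found in the text of an HTML string. */
function findPotentialAbbreviations(string $html): array
{
    $text = strip_tags($html); // ignore tag and attribute names
    preg_match_all('/\b[A-Z]{2,}\b/', $text, $matches);
    $found = array_values(array_unique($matches[0]));
    sort($found);
    return $found;
}
```

The result could then be written out for human review, for example with `file_put_contents()` and one candidate per line.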
Any lists marked with a fliplist class will be flipped.
Attempts to transform a table so that its x and y axes are reversed.
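Reversing the axes of a table is a transposition of its cells. The sketch below shows the core step on a plain array of rows; it is an illustration only, the function name `transposeRows` is hypothetical, and it ignores colspan and rowspan, which is probably why the profile is described as an attempt.

```php
<?php
/** Transpose a rectangular array of rows: cell [row][column] moves to [column][row]. */
function transposeRows(array $rows): array
{
    $transposed = [];
    foreach ($rows as $r => $row) {
        foreach ($row as $c => $cell) {
            $transposed[$c][$r] = $cell;
        }
    }
    return $transposed;
}

// Usage: two rows of three cells become three rows of two cells.
$flipped = transposeRows([
    ['Year', '2020', '2021'],
    ['Total', '5', '7'],
]);
```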
Tries to force some changes to footnotes harder, including turning endnotes into footnotes.
Saves to files the HTML metadata that is incompatible with word processors, so that it can be reintegrated into the HTML as part of the process that uses the clean_openoffice profile.
Removes link title attributes that contain the same text as the link text.
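This rule maps naturally onto DOM and XPath. The following is a sketch rather than sweeper's code: the function name `stripRedundantLinkTitles` is hypothetical, comparing the trimmed title with the trimmed link text is an assumption, and the input is expected to be an HTML fragment with ASCII or entity-encoded content.

```php
<?php
/** Remove the title attribute from links whose title only repeats the link text. */
function stripRedundantLinkTitles(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@title]') as $link) {
        if (trim($link->getAttribute('title')) === trim($link->textContent)) {
            $link->removeAttribute('title');
        }
    }

    // Serialize only the children of <body>, so the wrapper added above is dropped.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```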
Since the start attribute is invalid on <ol> in XHTML, this profile adds JavaScript that sets the start attribute at run time. The page then validates with the W3C validator, because the validator does not evaluate JavaScript.
Counts things for diagnostic purposes rather than changing them.
Adds <q> tags and normalizes quote characters to the appropriately oriented quote character for the language. Note that there is no way to guarantee which orientation a non-oriented quote character should be given, for example when quote characters are used improperly, so additional work may be necessary.
Turns embedded stylesheets into inline styles.
Shifts footnotes down by one.
Shifts footnotes up by one.
Shifts headings down by one.
Shifts headings up by one.
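Both heading shifts can be sketched with one regular expression and a signed offset. This is an illustration, not sweeper's code: the function name `shiftHeadings` is hypothetical, the clamping to h1 through h6 is an assumption, and so is the reading that "down by one" means a larger heading number (h2 becomes h3).

```php
<?php
/**
 * Shift every heading tag by $delta levels, clamped to the range h1..h6.
 * A positive $delta turns h2 into h3; a negative one turns h3 into h2.
 */
function shiftHeadings(string $html, int $delta): string
{
    $result = preg_replace_callback(
        '/<(\/?)h([1-6])\b/i',
        function (array $match) use ($delta): string {
            $level = max(1, min(6, (int) $match[2] + $delta));
            return '<' . $match[1] . 'h' . $level;
        },
        $html
    );
    return $result ?? $html; // preg_replace_callback() returns null on a regex error
}
```

Attributes on the heading tags are preserved, because only the tag name at the start of each opening or closing tag is rewritten.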
Applies structure (matching tables of contents and headings) to documents.
Applies structure (matching tables of contents and headings) to documents, while regenerating the TOC from the headings.
Turns inline styles into classes by looking at the stylesheets and matching inline style information to stylesheet style information.
Templates the code according to what templates have been specified.
Cleans the syntax of code.
Mostly changes classes to WET4.
Cleans many things; intended to catch WordPress code artifacts.