Sweeper

About

Sweeper is an HTML code cleaner based on Mihai Şucan's ReTidy. It is written in PHP and mostly uses regular expressions, DOM, XPath and LOM.

Usage

Run run_sweeper.php or paste_sweep.php, then select the profile and other options for Sweeper to use. The options are:

Profile: controls what Sweeper will do.
Government department: used to locate the abbreviations files and, if a template is needed but not specified, the template.
Path: the path to the abbreviations file, for clients other than government departments.
English template: the English template that Sweeper will use (if required by the profile).
French template: the French template that Sweeper will use (if required by the profile).
Source: the path to the files to be swept.
Target: the path the swept files will be saved to.
Language: the language of the code, which affects some of the changes that are applied.
Table headers starting id: the table headers id to start at (for table accessibility).
Render output: renders the output of the changed code in the browser.

You must have PHP installed. On Windows, try WampServer. On Linux, PHP is probably already installed.

PHP configuration (in php.ini)

To output messages while Sweeper is running:

...
; Default Value: Off
; Development Value: 4096
; Production Value: 4096
; http://php.net/output-buffering
output_buffering = 0
...

Sweeper can sometimes take a long time to run, especially when sweeping many or long files or using the clean_word profile.

...
; Maximum execution time of each script, in seconds
; http://php.net/max-execution-time
; Note: This directive is hardcoded to 0 for the CLI SAPI
max_execution_time = 3600

; Maximum amount of time each script may spend parsing request data. It's a good
; idea to limit this time on productions servers in order to eliminate unexpectedly
; long running scripts. 
; Note: This directive is hardcoded to -1 for the CLI SAPI
; Default Value: -1 (Unlimited)
; Development Value: 60 (60 seconds)
; Production Value: 60 (60 seconds)
; http://php.net/max-input-time
max_input_time = 3600

; Maximum input variable nesting level
; http://php.net/max-input-nesting-level
;max_input_nesting_level = 64

; Maximum amount of memory a script may consume (128MB)
; http://php.net/memory-limit
memory_limit = 1024M
...

To suppress notices about coding quality (these were suppressed when Sweeper was written):

...
; Default Value: E_ALL & ~E_NOTICE
; Development Value: E_ALL | E_STRICT
; Production Value: E_ALL & ~E_DEPRECATED
; http://php.net/error-reporting
error_reporting = E_ALL & ~E_NOTICE & ~E_STRICT
...

The correct timezone must be chosen so that the versioning of abbreviations files will work properly (see http://php.net/manual/en/timezones.php for a list of acceptable values):

...
[Date]
; Defines the default timezone used by the date functions
; http://php.net/date.timezone
date.timezone = (your value here)
...
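
To confirm that the settings above are the ones PHP actually picked up, a short script can print the effective values. This is only a sketch: the function name sweeper_config_report is hypothetical and is not part of Sweeper. It relies solely on PHP's built-in ini_get().

```php
<?php
// Hypothetical helper (not part of Sweeper): report the effective values
// of the php.ini directives discussed in this section.
function sweeper_config_report(array $directives): array
{
    $report = [];
    foreach ($directives as $name) {
        // ini_get() returns false when the directive does not exist.
        $value = ini_get($name);
        $report[$name] = ($value === false) ? '(unknown directive)' : $value;
    }
    return $report;
}

$directives = [
    'output_buffering',
    'max_execution_time',
    'max_input_time',
    'memory_limit',
    'error_reporting',
    'date.timezone',
];

foreach (sweeper_config_report($directives) as $name => $value) {
    echo $name, ' = ', $value, PHP_EOL;
}
```

Run it through the same web server that serves Sweeper, since the command-line PHP may read a different php.ini and hardcodes some of these directives.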

Profiles

Profiles configure what Sweeper does: they specify which functions run and how they run.

abbr

Applies abbreviations, using either the specified government department or the path to the abbreviations file. Use after the find_abbr profile.

It may sometimes be useful to render the output with paste_sweep.php and click the "Flip abbr and acronyms" checkbox to verify that abbreviations have been properly applied.

add_BOM

Adds a Byte Order Mark to files; make sure your files are UTF-8.
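
As an illustration of what adding a BOM involves, here is a minimal sketch. It assumes the file contents are already UTF-8. The function name add_utf8_bom is hypothetical and this is not Sweeper's own code.

```php
<?php
// The UTF-8 Byte Order Mark is the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper (not part of Sweeper): prepend a BOM unless the
// contents already start with one, so running it twice is harmless.
function add_utf8_bom(string $contents): string
{
    if (strncmp($contents, UTF8_BOM, 3) === 0) {
        return $contents;
    }
    return UTF8_BOM . $contents;
}

// Show the leading bytes of the result in hexadecimal.
echo bin2hex(substr(add_utf8_bom('<p>Hello</p>'), 0, 3)), PHP_EOL;
```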

arbitrary_sweep

Runs the arbitrary code that you have placed in the arbitrary_sweep() function.

basic

Does some basic cleaning that should be safe for most files.

basic_with_clf2

Does some basic cleaning that should be safe for most files, with the addition of clf2 changes and strip_xmllang.

classes_to_styles

Converts CSS classes to inline styles by finding the style information for a given class in the stylesheets.

clean_CSS

Mostly syntax formatting of CSS code.

clean_PDF

This profile is intended to be used on HTML generated from Adobe Acrobat's “Save As... HTML (*.html, *.htm)” option.

clean_excel

This profile is intended to be used on XML generated from Microsoft Excel's “Save As... XML Spreadsheet 2003 (*.xml)” option. You'll probably want to run clean_word afterwards, since this profile brings the code into a format similar to the one clean_word expects.

clean_feeds

Does a couple of simple things to clean web feed files, such as generating an ID for an entry.

clean_indesign

Converts common InDesign-generated class names to HTML code.

clean_openoffice

This profile is part of a process intended to enable persons without HTML editing skills to edit a webpage in a word processor instead. The process is as follows:

Sweep using the html_to_word profile, which creates the abbr files and metadata files.
Copy the HTML into OpenOffice from a browser, excluding the template, and save as .odt.
Make any desired edits in OpenOffice or Microsoft Word.
Save as HTML in OpenOffice.
Place the abbr files, the metadata files and the HTML saved from OpenOffice into the not-swept folder and use the clean_openoffice profile.
Run other profiles as normal, if needed (for example dom_table_accessibility or basic).

clean_wix

Cleans up a few common code artifacts from Wix.

clean_word

This profile is intended to be used on HTML generated from Microsoft Word's “Save As... Web Page, Filtered (*.htm; *.html)” option. It can take a long time to run because it strips the many unnecessary wrappers that Word produces. Ensuring that the document stays semantically the same while using cleaner code (fewer tags) takes time. A significant contributor to the run time is:

Sections that become one block with much styling; example: a table of contents that becomes a long paragraph with links formatted by line breaks.

clean_word_clf2

Same as clean_word, with the addition of clf2 changes.

clf2

Uses clf2 specific code, such as clf2 class names.

convert_to_iso_8859_1

Converts the character set of files to ISO-8859-1.

convert_to_utf8

Converts the character set of files to UTF-8.
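
A character set conversion of this kind can be sketched as follows. The assumptions are loud ones: the source encoding is taken to be ISO-8859-1 unless stated otherwise, the function name to_utf8 is hypothetical, and Sweeper's own conversion logic may differ.

```php
<?php
// Hypothetical helper (not part of Sweeper): convert a byte string to UTF-8.
// $assumedSource is an assumption; the true source encoding must be known.
function to_utf8(string $bytes, string $assumedSource = 'ISO-8859-1'): string
{
    // An empty pattern with the /u modifier only matches valid UTF-8,
    // so already-converted input is returned unchanged.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    $converted = iconv($assumedSource, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $assumedSource to UTF-8");
    }
    return $converted;
}

// "café" encoded as ISO-8859-1 (the byte E9 is "é").
echo bin2hex(to_utf8("caf\xE9")), PHP_EOL;
```

The validity check avoids double-encoding files that are already UTF-8, which is a common failure when a conversion is run twice.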

decode_character_entities

Transforms character entities to their raw character equivalents.
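
The idea can be sketched with PHP's html_entity_decode(). The sketch assumes, without confirmation from Sweeper's source, that entities for markup-significant characters (&amp;, &lt;, &gt;, &quot;) should be left alone so the HTML stays well-formed. The function name is hypothetical.

```php
<?php
// Hypothetical helper (not part of Sweeper): decode character entities to
// raw UTF-8 characters, except those whose raw form would alter the markup.
function decode_entities_keep_markup(string $html): string
{
    // Named, decimal and hexadecimal entity references.
    $pattern = '/&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/';

    $result = preg_replace_callback($pattern, function (array $match): string {
        $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Assumption: keep the entity when decoding would yield a character
        // that is significant in HTML syntax.
        if (in_array($decoded, ['&', '<', '>', '"'], true)) {
            return $match[0];
        }
        return $decoded;
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo decode_entities_keep_markup('caf&eacute; &amp; &lt;b&gt; &#233;'), PHP_EOL;
```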

definition_listify

Transforms appropriately-marked sections into definition lists.

dekern

Undoes the substitution in which a single character entity (a ligature) appears in place of two or more letters. Plain letters are more widely supported across computer programs than the HTML character entities of their ligatures.
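
A minimal sketch of such a replacement, covering only the five Latin f-ligatures in the Unicode range U+FB00 to U+FB04, in raw and decimal-entity form. The function name and the choice of ligatures are assumptions. The set that Sweeper handles may be different.

```php
<?php
// Hypothetical helper (not part of Sweeper): replace f-ligatures with
// their plain-letter equivalents.
function expand_ligatures(string $text): string
{
    $map = [
        "\u{FB00}" => 'ff',  '&#64256;' => 'ff',
        "\u{FB01}" => 'fi',  '&#64257;' => 'fi',
        "\u{FB02}" => 'fl',  '&#64258;' => 'fl',
        "\u{FB03}" => 'ffi', '&#64259;' => 'ffi',
        "\u{FB04}" => 'ffl', '&#64260;' => 'ffl',
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}

echo expand_ligatures("of&#64257;ce \u{FB02}ow"), PHP_EOL;
```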

dom_table_accessibility

Makes tables accessible. This amounts to making the data relations explicit, so that user agents other than a browser on a typically sized PC monitor can also determine how to represent the data. Specific actions this profile takes:

Apply scope or ids and headers (works best when provided with <th>s)
Generate other table structure (<thead>, <tbody>, <tfoot>)
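
The first action, applying scope, can be illustrated with PHP's DOM and XPath extensions. This is a simplified sketch with an assumed heuristic (a <th> inside <thead>, or in a row with no <td>, is a column header; any other <th> is a row header). The function name is hypothetical, and Sweeper's actual rules are likely more thorough.

```php
<?php
// Hypothetical helper (not part of Sweeper): add a scope attribute to every
// <th> that lacks one. Returns the number of header cells changed.
function add_th_scope(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    $changed = 0;
    foreach ($xpath->query('//th[not(@scope)]') as $th) {
        // Assumed heuristic, see the lead-in above.
        $inHead = $xpath->evaluate('boolean(ancestor::thead)', $th);
        $rowHasData = $xpath->evaluate('boolean(parent::tr/td)', $th);
        $th->setAttribute('scope', ($inHead || !$rowHasData) ? 'col' : 'row');
        $changed++;
    }
    return $changed;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><table>'
    . '<tr><th>Name</th><th>Age</th></tr>'
    . '<tr><th>Ann</th><td>30</td></tr>'
    . '</table></body></html>'
);
add_th_scope($doc);
echo $doc->saveHTML();
```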

dom_table_accessibility_complex

Same as dom_table_accessibility but also does strip_tbody, strip_xmllang, strip_caption_pre, strip_span1. Uses complex table accessibility (ids and headers).

dom_table_accessibility_fresh

dom_table_accessibility_simple

Same as dom_table_accessibility but also does strip_tbody, strip_xmllang, strip_caption_pre, strip_span1. Uses simple table accessibility (scope and colspan and rowspan).

encode_character_entities

Transforms character entities and raw characters to their named character entity equivalents.

final_clean

Nearly the same as update_to_WET4.

find_abbr

Looks in the code for potential abbreviations and writes them to a file. A person then reviews the potential new abbreviations and adds the ones they want to the abbreviations file.
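
To give a feel for the detection step, the sketch below treats any run of two or more uppercase letters in the text content as a candidate. That rule and the function name are assumptions made for illustration. The criteria Sweeper actually applies are not described here.

```php
<?php
// Hypothetical helper (not part of Sweeper): list candidate abbreviations
// found in the text content of an HTML string.
function find_abbreviation_candidates(string $html): array
{
    // Look only at text, not at tag names or attributes.
    $text = strip_tags($html);

    // Assumed rule: two or more uppercase letters not adjacent to other letters.
    preg_match_all('/(?<!\p{L})\p{Lu}{2,}(?!\p{L})/u', $text, $matches);

    $candidates = array_values(array_unique($matches[0]));
    sort($candidates, SORT_STRING);
    return $candidates;
}

foreach (find_abbreviation_candidates('<p>The WHO and the UN met; WHO agreed.</p>') as $abbr) {
    echo $abbr, PHP_EOL;
}
```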

flip_lists

Any lists marked with a fliplist class will be flipped.

flip_tables

Attempts to transform a table so that its x and y axes are reversed.

force_footnotes

Applies some footnote changes more forcefully, including turning endnotes into footnotes.

html_to_word

Saves to files the metadata from the HTML that is incompatible with word processors, so that it can be reintegrated into the HTML by the process described under the clean_openoffice profile.

link_titles

Removes link title attributes that contain the same text as the link text.
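
The check can be sketched with PHP's DOM and XPath extensions. The function name is hypothetical, and the comparison (exact match after trimming whitespace) is an assumption. Sweeper may compare the two strings differently.

```php
<?php
// Hypothetical helper (not part of Sweeper): remove title attributes that
// merely repeat the link text. Returns the number of attributes removed.
function remove_redundant_link_titles(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    $removed = 0;
    foreach ($xpath->query('//a[@title]') as $link) {
        $title = trim($link->getAttribute('title'));
        $text = trim($link->textContent);
        if ($title === $text) {
            $link->removeAttribute('title');
            $removed++;
        }
    }
    return $removed;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p>'
    . '<a href="/home" title="Home">Home</a> '
    . '<a href="/more" title="More information">Read</a>'
    . '</p></body></html>'
);
echo remove_redundant_link_titles($doc), ' title attribute(s) removed', PHP_EOL;
```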

ol_start

Because the start attribute is invalid on <ol> in XHTML, this profile adds JavaScript that sets the start attribute instead. The page then validates with the W3C validator, because the validator does not check JavaScript.

quality_assurance

Counts things for diagnostic purposes rather than changing anything.

quotation

Adds <q> tags and normalizes quote characters to the appropriately oriented quote character for the language. Note that there is no way to guarantee which orientation a non-oriented quote character should be transformed into, for example when quote characters are used improperly, so additional work may be necessary.

remove_embedded_stylesheets

Turns embedded stylesheets into inline styles.

shiftFootnotesDown

Shifts footnotes down by one.

shiftFootnotesUp

Shifts footnotes up by one.

shiftHeadingsDown

Shifts headings down by one.

shiftHeadingsUp

Shifts headings up by one.
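
Shifting heading levels can be sketched with a regular expression. The function below takes a signed offset rather than assuming which direction "up" and "down" correspond to in Sweeper. Both the function name and the clamping to the range h1 to h6 are assumptions.

```php
<?php
// Hypothetical helper (not part of Sweeper): shift every heading tag by
// $offset levels, keeping the result within h1..h6.
function shift_headings(string $html, int $offset): string
{
    // Matches the start of opening and closing heading tags, e.g. "<h2" or "</h2".
    $result = preg_replace_callback('/<(\/?)h([1-6])\b/i', function (array $m) use ($offset): string {
        $level = max(1, min(6, (int) $m[2] + $offset));
        // Tag names are written back in lowercase.
        return '<' . $m[1] . 'h' . $level;
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo shift_headings('<h1>Title</h1><h2 class="x">Part</h2>', 1), PHP_EOL;
```

Clamping means that shifting an <h6> further, or an <h1> in the other direction, leaves it unchanged, so two headings can end up at the same level.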

structure

Applies structure (matching tables of contents and headings) to documents.

structure_generate_TOC

Applies structure (matching tables of contents and headings) to documents, while regenerating the TOC from the headings.

styles_to_classes_full

Turns inline styles into classes by looking at the stylesheets and matching inline style information to stylesheet style information.

templateCode

Templates the code according to what templates have been specified.

tidy

Cleans the syntax of code.

update_to_WET4

Mostly changes classes to WET4.

wordpress

Cleans many things; intended to catch WordPress code artifacts.