HTML::TableParser
HTML::TableParser uses HTML::Parser to extract data from an HTML table.
The data is returned via a series of user defined callback functions or
methods. Specific tables may be selected either by matching a unique
table id or by matching against the column names. Multiple (even nested)
tables may be parsed in a document in one pass.
Table Identification
Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top level table has id 1, the second 2,
etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
the tables' column names, may be used to identify which tables to parse.
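The numbering scheme can be sketched in a few lines of Perl. This is only an illustration of the rule described above, not the module's own code: the helper table_ids and its 'open'/'close' event list are hypothetical stand-ins for the <table> and </table> tags met while scanning a document.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableParser): assign ids to
# tables from a stream of 'open'/'close' events, one per table tag.
sub table_ids {
    my (@events) = @_;
    my @ids;            # ids in the order the tables were opened
    my @path;           # id components of the currently open tables
    my @count = (0);    # tables seen so far at each nesting depth
    for my $ev (@events) {
        if ( $ev eq 'open' ) {
            push @path,  ++$count[-1];    # next sibling number at this depth
            push @count, 0;               # fresh counter for nested tables
            push @ids,   join '.', @path;
        }
        else {
            pop @count;
            pop @path;
        }
    }
    return @ids;
}

# Table 1 contains 1.1 (which contains 1.1.1) and 1.2; table 2 follows.
my @ids = table_ids(
    qw( open open open close close open close close open close ) );
print "@ids\n";    # prints "1 1.1 1.1.1 1.2 2"
```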
Data Extraction
As the parser traverses a selected table, it will pass data to user
provided callback functions or methods after it has digested particular
structures in the table. All functions are passed the table id (as
described above), the line number in the HTML source where the table was
found, and a reference to any table specific user provided data.
Table Start
The start callback is invoked when a matched table has been
found.
Table End
The end callback is invoked after a matched table has been
parsed.
Header The hdr callback is invoked after the table header has been read
in. Some tables do not use the <th> tag to indicate a header, so
this function may not be called. It is passed the column names.
Row The row callback is invoked after a row in the table has been
read. It is passed the column data.
Warn The warn callback is invoked when a non-fatal error occurs
during parsing. Fatal errors croak.
New This is the class method to call to create a new object when
HTML::TableParser is supposed to create new objects upon table
start.
Callback API
Callbacks may be functions or methods or a mixture of both. In the
latter case, an object must be passed to the constructor. (More on that
later.)
The callbacks are invoked as follows:
start( $tbl_id, $line_no, $udata );
end( $tbl_id, $line_no, $udata );
hdr( $tbl_id, $line_no, \@col_names, $udata );
row( $tbl_id, $line_no, \@data, $udata );
warn( $tbl_id, $line_no, $message, $udata );
new( $tbl_id, $udata );
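As a rough sketch of what function callbacks might look like, the example below defines hdr and row subroutines that unpack their arguments in the order listed above. The direct calls at the bottom merely stand in for the parser so that the example runs on its own; the sample values, the %udata layout and the @rows accumulator are assumptions made for illustration.

```perl
use strict;
use warnings;

my @rows;    # records collected by the row callback

# Header callback: remember the column names in the user data.
sub hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    $udata->{cols} = [@$col_names];
}

# Row callback: pair each value with its column name.
sub row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    my %rec;
    @rec{ @{ $udata->{cols} } } = @$data;
    push @rows, \%rec;
}

# Stand-in for the parser: invoke the callbacks by hand with made-up data.
my %udata;
hdr( '1', 12, [ 'No', 'Object' ], \%udata );
row( '1', 14, [ 1, 'NGC 1068' ], \%udata );
row( '1', 15, [ 2, 'M 87' ],     \%udata );

print "$_->{No}: $_->{Object}\n" for @rows;
```

Copying the names and values out of the passed array references keeps the collected records independent of whatever the caller does with those arrays afterwards.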
Data Cleanup
There are several cleanup operations that may be performed
automatically:
Chomp chomp() the data
Decode Run the data through HTML::Entities::decode.
DecodeNBSP
Normally HTML::Entities::decode changes a non-breaking space
into the character U+00A0, which Perl's whitespace regexp (\s)
does not always match. Setting this attribute changes the HTML
"nbsp" character to a plain blank.
Trim remove leading and trailing white space.
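To make the effect of these operations concrete, here is a hand-written approximation of Chomp, DecodeNBSP and Trim applied to a single cell value. The helper clean_cell is hypothetical and is not how the module implements them; it assumes the "nbsp" entity has already been decoded to U+00A0, and the order of the steps is a guess.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableParser): apply the
# Chomp, DecodeNBSP and Trim steps to one cell value by hand.
sub clean_cell {
    my ($text) = @_;
    chomp $text;                # Chomp: drop one trailing newline
    $text =~ s/\x{A0}/ /g;      # DecodeNBSP: U+00A0 becomes a plain blank
    $text =~ s/^\s+|\s+$//g;    # Trim: strip leading and trailing white space
    return $text;
}

my $cleaned = clean_cell("  NGC\x{A0}1068 \n");
print "[$cleaned]\n";    # prints "[NGC 1068]"
```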
Data Organization
Column names are derived from cells delimited by the <th> and </th>
tags. Some tables have header cells which span one or more columns or
rows to make things look nice. HTML::TableParser determines the actual
number of columns used and provides column names for each column,
repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:
+----+--------+----------+-------------+-------------------+
|    |        | Eq J2000 |             | Velocity/Redshift |
| No | Object |----------| Object Type |-------------------|
|    |        | RA | Dec |             | km/s | z | Qual   |
+----+--------+----------+-------------+-------------------+
The columns will be:
No
Object
Eq J2000 RA
Eq J2000 Dec
Object Type
Velocity/Redshift km/s
Velocity/Redshift z
Velocity/Redshift Qual
Row data are derived from cells delimited by the <td> and </td> tags.
Cells which span more than one column or row are handled correctly, i.e.
the values are duplicated in the appropriate places.
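The derivation of column names from spanned header cells can be approximated as follows. This is a minimal sketch of the idea, not the module's algorithm: the helper column_names is hypothetical, and each header cell is assumed to be given as [ text, colspan, rowspan ]. The input reproduces the example header above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableParser): build one name per
# column from header rows whose cells are [ text, colspan, rowspan ].
sub column_names {
    my (@rows) = @_;
    my @grid;    # $grid[$r][$c] holds the header text covering that slot
    for my $r ( 0 .. $#rows ) {
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            # skip slots already claimed by a cell spanning down from above
            $c++ while defined $grid[$r][$c];
            for my $dr ( 0 .. $rowspan - 1 ) {
                for my $dc ( 0 .. $colspan - 1 ) {
                    # repeat the text across spanned columns, but record it
                    # only in the first row of a spanned set of rows
                    $grid[ $r + $dr ][ $c + $dc ] = $dr == 0 ? $text : '';
                }
            }
            $c += $colspan;
        }
    }
    my $ncols = @{ $grid[0] || [] };
    return map {
        my $c = $_;
        # concatenate the non-empty pieces from top to bottom
        join ' ', grep { defined $_ && length $_ } map { $_->[$c] } @grid;
    } 0 .. $ncols - 1;
}

my @names = column_names(
    [
        [ 'No',          1, 2 ], [ 'Object',            1, 2 ],
        [ 'Eq J2000',    2, 1 ], [ 'Object Type',       1, 2 ],
        [ 'Velocity/Redshift', 3, 1 ],
    ],
    [
        [ 'RA',   1, 1 ], [ 'Dec', 1, 1 ],
        [ 'km/s', 1, 1 ], [ 'z',   1, 1 ], [ 'Qual', 1, 1 ],
    ],
);
print "$_\n" for @names;
```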
INSTALLATION
This is a Perl module distribution. It should be installed with whichever
tool you use to manage your installation of Perl, e.g. any of
cpanm .
cpan .
cpanp -i .
Consult http://www.cpan.org/modules/INSTALL.html for further instruction.
Should you wish to install this module manually, the procedure is
perl Makefile.PL
make
make test
make install
COPYRIGHT AND LICENSE
This software is Copyright (c) 2018 by Smithsonian Astrophysical
Observatory.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007