HTML::TableParser

HTML::TableParser uses HTML::Parser to extract data from an HTML table.
The data is returned via a series of user defined callback functions or
methods. Specific tables may be selected either by matching a unique
table id or by matching against the column names. Multiple (even nested)
tables may be parsed in a document in one pass.

Table Identification

Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top level table has id 1, the second 2,
etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
the tables' column names, may be used to identify which tables to parse.

Data Extraction

As the parser traverses a selected table, it will pass data to user
provided callback functions or methods after it has digested particular
structures in the table. All functions are passed the table id (as
described above), the line number in the HTML source where the table was
found, and a reference to any table specific user provided data.

Table Start
    The start callback is invoked when a matched table has been found.

Table End
    The end callback is invoked after a matched table has been parsed.

Header
    The hdr callback is invoked after the table header has been read in.
    Some tables do not use the <th> tag to indicate a header, so this
    function may not be called. It is passed the column names.

Row
    The row callback is invoked after a row in the table has been read.
    It is passed the column data.

Warn
    The warn callback is invoked when a non-fatal error occurs during
    parsing. Fatal errors croak.

New
    This is the class method to call to create a new object when
    HTML::TableParser is supposed to create new objects upon table
    start.

Callback API

Callbacks may be functions or methods or a mixture of both. In the
latter case, an object must be passed to the constructor. (More on that
later.)

The callbacks are invoked as follows:

    start( $tbl_id, $line_no, $udata );
    end( $tbl_id, $line_no, $udata );
    hdr( $tbl_id, $line_no, \@col_names, $udata );
    row( $tbl_id, $line_no, \@data, $udata );
    warn( $tbl_id, $line_no, $message, $udata );
    new( $tbl_id, $udata );

Data Cleanup

There are several cleanup operations that may be performed
automatically:

Chomp
    chomp() the data.

Decode
    Run the data through HTML::Entities::decode.

DecodeNBSP
    Normally HTML::Entities::decode changes a non-breaking space into a
    character which doesn't seem to be matched by Perl's whitespace
    regexp. Setting this attribute changes the HTML "nbsp" character to
    an ordinary space.

Trim
    Remove leading and trailing white space.

Data Organization

Column names are derived from cells delimited by the <th> and </th>
tags. Some tables have header cells which span one or more columns or
rows to make things look nice. HTML::TableParser determines the actual
number of columns used and provides column names for each column,
repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:

    +----+--------+----------+-------------+-------------------+
    |    |        | Eq J2000 |             | Velocity/Redshift |
    | No | Object |----------| Object Type |-------------------|
    |    |        | RA | Dec |             | km/s |  z  | Qual |
    +----+--------+----------+-------------+-------------------+

The columns will be:

    No
    Object
    Eq J2000 RA
    Eq J2000 Dec
    Object Type
    Velocity/Redshift km/s
    Velocity/Redshift z
    Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags.
Cells which span more than one column or row are handled correctly, i.e.
the values are duplicated in the appropriate places.

INSTALLATION

This is a Perl module distribution. It should be installed with
whichever tool you use to manage your installation of Perl, e.g. any of

    cpanm .
    cpan  .
    cpanp -i .

Consult http://www.cpan.org/modules/INSTALL.html for further
instructions.

Should you wish to install this module manually, the procedure is

    perl Makefile.PL
    make
    make test
    make install

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical
Observatory.

This is free software, licensed under:

    The GNU General Public License, Version 3, June 2007
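
EXAMPLE

The callback signatures and cleanup attributes described under
"Callback API" and "Data Cleanup" can be tied together in a short
sketch. This README does not spell out the constructor arguments, so
the request keys used here (id, hdr, row, udata), the attribute hash
passed to new(), and the parse()/eof() calls inherited from
HTML::Parser are assumptions taken from the module's own
documentation. The sample HTML and the %store hash are made up for
illustration.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Hypothetical input: a single top-level table, so its id is 1.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td> apple </td><td>3</td></tr>
  <tr><td> pear </td><td>5</td></tr>
</table>
HTML

# User data shared by the callbacks; passed back to them as $udata.
my %store = ( cols => [], rows => [] );

# hdr( $tbl_id, $line_no, \@col_names, $udata )
sub hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    $udata->{cols} = [ @$col_names ];
}

# row( $tbl_id, $line_no, \@data, $udata )
sub row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    push @{ $udata->{rows} }, [ @$data ];
}

# One request: match the table with id 1 (assumed request layout).
my @reqs = (
    {
        id    => 1,
        hdr   => \&hdr,
        row   => \&row,
        udata => \%store,
    },
);

# Cleanup attributes as named in "Data Cleanup".
my $p = HTML::TableParser->new( \@reqs,
    { Decode => 1, Trim => 1, Chomp => 1 } );

$p->parse($html);    # parse() and eof() come from HTML::Parser
$p->eof;

print join( ', ', @{ $store{cols} } ), "\n";
print join( ', ', @$_ ), "\n" for @{ $store{rows} };
```

Keeping the collected data in the udata hash, rather than in globals,
lets the same hdr and row functions serve several table requests, each
with its own storage.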