HtmlQuery Wyam.Html

Queries HTML content of the input documents and creates new documents with content and metadata from the results.
Once you provide a DOM query selector, the module creates new output documents for each query result and allows you to set the new document content and/or set new metadata based on the query result.

Package

This module exists in the Wyam.Html package, which is not part of the core distribution. Add the following preprocessor directive to your configuration file to use it:
#n Wyam.Html
Alternatively, you can add all modules at once with the following preprocessor directive:
#n Wyam.All

Usage

  • HtmlQuery(string querySelector)

    Creates the module with the specified query selector.

    • querySelector

      The query selector to use.
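
As a minimal sketch of the constructor in use, the configuration fragment below reads HTML files and emits one document per matching element. The pipeline name, the file pattern, and the h1 selector are illustrative assumptions, and ReadFiles is assumed to be the core file-reading module available in a typical configuration.

```csharp
#n Wyam.Html

// Hypothetical pipeline name and file pattern, chosen for illustration only.
Pipelines.Add("Headings",
    ReadFiles("*.html"),   // read the HTML input documents
    HtmlQuery("h1")        // one output document per matching <h1> element
);
```

With no fluent methods chained, the result documents carry neither content nor extra metadata from the query, so in practice the constructor is usually followed by at least one of the methods below.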

Fluent Methods

Chain these methods together after the constructor to modify behavior.

  • First(bool first = true)

    Specifies that only the first query result should be processed. If this method is not called, the module default is false and all query results are processed.

    • first

      If set to true, only the first result is processed.

  • GetAll()

    Gets all information for each query result and sets the metadata of the corresponding result document(s). This is equivalent to calling GetOuterHtml(), GetInnerHtml(), GetTextContent(), and GetAttributeValues() with default arguments.

  • GetAttributeValue(string attributeName, string metadataKey = null)

    Gets the specified attribute value of each query result and sets it in the metadata of the corresponding result document(s). If the attribute is not found for a given query result, no metadata is set. If metadataKey is null, the attribute name will be used as the metadata key, otherwise the specified metadata key will be used.

    • attributeName

      Name of the attribute to get.

    • metadataKey

      The metadata key in which to place the attribute value.

  • GetAttributeValues()

    Gets the values of all attributes of each query result and sets them in the metadata of the corresponding result document(s), with key names equal to each attribute's local name.

  • GetInnerHtml(string metadataKey = "InnerHtml")

    Gets the inner HTML of each query result and sets it in the metadata of the corresponding result document(s) with the specified key.

    • metadataKey

      The metadata key in which to place the inner HTML.

  • GetOuterHtml(string metadataKey = "OuterHtml")

    Gets the outer HTML of each query result and sets it in the metadata of the corresponding result document(s) with the specified key.

    • metadataKey

      The metadata key in which to place the outer HTML.

  • GetTextContent(string metadataKey = "TextContent")

    Gets the text content of each query result and sets it in the metadata of the corresponding result document(s) with the specified key.

    • metadataKey

      The metadata key in which to place the text content.

  • SetContent(bool? outerHtml = true)

    Sets the content of the result document(s) to the content of the corresponding query result, optionally specifying whether inner or outer HTML content should be used. If this method is not called, the module default is null, which does not add any content to the result documents (only metadata). Calling it with no argument uses outer HTML.

    • outerHtml

      If set to true, outer HTML content is used for the document content. If set to false, inner HTML content is used for the document content. If null, no document content is set.
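
The fluent methods above can be combined in a single chain. The fragment below is a sketch under stated assumptions: the pipeline name, the file pattern, and the metadata keys LinkTarget and LinkText are hypothetical names chosen for illustration, and ReadFiles is assumed to be the core file-reading module.

```csharp
#n Wyam.Html

Pipelines.Add("Links",
    ReadFiles("*.html"),
    HtmlQuery("a")
        .GetAttributeValue("href", "LinkTarget")  // href attribute -> "LinkTarget" (hypothetical key)
        .GetTextContent("LinkText")               // text content -> "LinkText" (hypothetical key)
        .SetContent(false)                        // false: use inner HTML as the document content
);
```

Per the descriptions above, an anchor element without an href attribute would still produce a result document, only without the LinkTarget metadata.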

Output Metadata

The metadata values listed below apply to individual documents and are created and set by the module as indicated in their descriptions.

  • HtmlKeys.InnerHtml: System.String

    Contains the inner HTML of the query result (unless an alternate metadata key is specified).

  • HtmlKeys.OuterHtml: System.String

    Contains the outer HTML of the query result (unless an alternate metadata key is specified).

  • HtmlKeys.TextContent: System.String

    Contains the text content of the query result (unless an alternate metadata key is specified).
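
A later module in the same pipeline can read these keys from each result document. The sketch below assumes the core Meta module and the String metadata accessor are available as in typical Wyam configurations, and that HtmlKeys is in scope once the package is loaded. The pipeline name and the PageTitle key are hypothetical.

```csharp
#n Wyam.Html

Pipelines.Add("Titles",
    ReadFiles("*.html"),
    HtmlQuery("title")
        .First()             // only the first matching element
        .GetTextContent(),   // stores the text under the default TextContent key
    // Copy the extracted text into a hypothetical "PageTitle" key.
    Meta("PageTitle", (doc, ctx) => doc.String(HtmlKeys.TextContent))
);
```

If an alternate key was passed to GetTextContent(), that key would need to be read instead of HtmlKeys.TextContent.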
