Python UDF

This walkthrough sets up a sample Hive table whose rows contain HTML tags, so that you can later test a Python UDF that removes those tags. First define the table with a simple schema, then insert a few rows of HTML-formatted sample data.

### Step 1: Define the Hive Table

Open your Hive interface and run the following command to create a table. This table will include an `id` column and a `description` column that will contain HTML-formatted text:

```sql
CREATE TABLE html_description_table (
  id INT,
  description STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```

### Step 2: Insert Sample Data into the Table

Now, insert some sample data into this table. The descriptions include HTML tags to simulate the data you might need to clean:

```sql
INSERT INTO TABLE html_description_table VALUES
(1, '<p>This is a <strong>strong</strong> tag</p>'),
(2, '<a href="http://example.com">Example Link</a>'),
(3, '<div>Styled text with <span>color</span></div>'),
(4, '<ul><li>Item 1</li><li>Item 2</li></ul>');
```

### Using the Table

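With the table and data in place, you can use `html_description_table` to test the Python UDF. Hive runs the script as an external process through the `TRANSFORM` clause, so the script path must be readable wherever the query executes; on a multi-node cluster, `ADD FILE` is the usual way to ship the script to each node. Run the following query to apply the Python script you've created for removing HTML tags: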

```sql
SELECT TRANSFORM(description)
USING 'python3 /home/arvind/hivetest1/remove_html_tags.py'
AS clean_description
FROM html_description_table;
```
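
The query above assumes that `remove_html_tags.py` already exists, but the script itself is not shown here. As a rough guide, here is a minimal sketch of what such a script might look like. It relies on Hive's default `TRANSFORM` behaviour of streaming each row to the script as one line on standard input and reading the result back from standard output. The class `_TextExtractor` and the function `strip_tags` are illustrative names chosen for this sketch, not part of Hive or of any existing script.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of remove_html_tags.py for use with Hive TRANSFORM."""
import sys
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with tags removed and character references decoded."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def main():
    # Assumption: a single input column, so each stdin line is one description.
    for line in sys.stdin:
        value = line.rstrip("\n")
        if value == "\\N":
            # Hive passes NULL as the literal \N; hand it back unchanged.
            print(value)
        else:
            print(strip_tags(value))


if __name__ == "__main__":
    main()
```

With the sample rows above, this sketch would turn row 1 into `This is a strong tag` and row 2 into `Example Link`. Adjacent elements are joined without a separator, so row 4 becomes `Item 1Item 2`; if you want a space between list items, that needs extra handling.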

Wed Feb 12, 2025

arvind agrawal