Programming Notes – lxml & html on Python

Date: May 10, 2020

During the CB (Circuit Breaker) measures, there is plenty of time at home. So, I wrote a small Python script yesterday to grab data from two websites and display it on my Raspberry Pi 4 with a 3.5″ LCD screen.

This is my portable Raspberry Pi 4 with 3.5″ LCD screen.

So, I am now displaying the Singapore Covid19 info and the World Covid19 info on it. It refreshes every 60 seconds.

My son Jay asked why it does not show the difference between updates. So, I added that in.

Here are some programming notes for myself so that next time, I know how to do such programming tasks.

Covid19 Websites

This website is good in terms of visuals, but it is bad for extracting data. I think this website wants to protect its “IP” (which is public data anyway), so it heavily uses JavaScript to pull data from a database. It allows you to download a JSON file or CSV file, but that is dynamically generated by the server.

In other words, there is no DATA you can obtain from the dynamically generated webpage. In the generated HTML file, you cannot see the figures as text, such as “4,100,609” or “4100609”. So, basically this website is useless for extracting data.

Next, there is this Coronavirus COVID19 API website where you can get all the DATA you need worldwide.

And you can simply use Curl to fetch the data.

So, it is just using the command line tool “Curl” to get the XML info from the server.

You would think this is probably the best way to get the DATA? But the DATA stored on this server is 1-2 days old. So, no good for me, as I wanted REAL-TIME data.

You can forget about the MOH website too, because it has a lot of useless DATA formats. Difficult to “decode” the figures. Hahahahaha

After searching for some time, I decided to use this website: https://co.vid19.sg

This website has all the data I need.

The returned HTML page has all the Covid19 data in it.

This is the total cases in Singapore (very up to date).

Here you have the critical, active, discharged and death info. All inside the generated HTML file.

As for the Reported Cases, the figure is in a variable found within the JavaScript SCRIPT sections of the generated HTML file.

Wait, where is all the World DATA? Well, this website only gives you Singapore DATA. No World Data. So, we need another website to pull that data from.

And I used this website: https://worldometers.info

As you can see in the generated HTML, there are the World total reported cases (4,029,543 cases) and the deaths (276,484 cases).

Tkinter Library

OK, I need to quickly present the figures on the screen. So, in order to do quick prototyping, I use the Tkinter library for Python.

I use Python because it is new to me. I used to write C, C++, Perl scripts, etc., but all programming is the same. Once you master the programming concepts, the rest is just learning the language and syntax.

I will output the data into a TK frame text label.

Don’t bother using the time.sleep() function to wait for 60 seconds. It won’t work, because it blocks the Tk main loop and the window stops updating.

So, instead, use root.after() within a Refresher() function so that it calls itself every second and updates the TEXT using the text.configure() function.

Below is a sample of such. And it works.

https://www.tutorialspoint.com/python3/tk_label.htm
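
As a rough sketch of the root.after() idea (not my actual script), the self-rescheduling function can look like the block below. The names make_refresher, get_text and interval_ms are made up for this illustration, and the clock text is only a stand-in for the Covid19 figures.

```python
import time


def make_refresher(root, label, get_text, interval_ms=1000):
    """Return a function that updates `label` and then re-schedules itself."""
    def refresher():
        # Put the latest text on the label.
        label.configure(text=get_text())
        # after() returns immediately; Tk calls refresher again later,
        # so the main loop keeps running and the window never freezes.
        root.after(interval_ms, refresher)
    return refresher


def main():
    # Imported here so make_refresher() can be used without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Refresher demo")
    label = tk.Label(root, text="", font=("Helvetica", 24))
    label.pack(padx=20, pady=20)

    # Stand-in data source: the current time instead of scraped figures.
    refresher = make_refresher(root, label, lambda: time.strftime("%H:%M:%S"))
    refresher()      # first update; later ones are scheduled by after()
    root.mainloop()
```

Call main() on a machine with a screen to see it run. For a 60 seconds update, pass interval_ms=60000 instead.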

LXML & HTML

Since we are dealing with URLs & HTML, the best is to use the lxml.html library.

requests.get() fetches the entire HTML file. The whole file is stored in response.text.

lxml.html.fromstring() parses that text into a tree of elements.

tree.xpath() lets you search the tree for TAGS. For example, if you are looking for <h2>blah blah</h2>, you call tree.xpath('//h2'), which returns a list of all the matching <h2> elements. To get the text inside each one, call .text_content() on it.

Yup, it is somewhat powerful, but I think I will do it the stupid way, which is faster and better for me. It is stupid because it uses a lot of steps. But for rapid prototyping, it does not matter.
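
Here is a small sketch of the fromstring() and xpath() steps. To keep it runnable without a network connection, it parses a made-up HTML string (SAMPLE_HTML) instead of a real page; in a real script the text would come from requests.get(url).text. The helper name heading_texts is also just for illustration.

```python
import lxml.html

# Made-up HTML for the example; a real page would come from requests.get().
SAMPLE_HTML = """
<html><body>
  <h2>Total Cases</h2>
  <h2>Active <b>Cases</b></h2>
  <p>Not a heading</p>
</body></html>
"""


def heading_texts(html_text):
    """Return the text of every <h2> element in the given HTML."""
    # Build the element tree from the raw HTML text.
    tree = lxml.html.fromstring(html_text)
    # xpath('//h2') returns a list of <h2> elements, not strings,
    # so text_content() is used to pull out the text of each one.
    return [h2.text_content().strip() for h2 in tree.xpath("//h2")]


print(heading_texts(SAMPLE_HTML))
```

Note that text_content() also includes text inside nested tags, like the <b> in the second heading.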

Example: I want to extract the following four figures for critical, active, discharged, deceased.

So, as in the script above, you see that I use the “re” (regular expression) library.

First, split on the text “var breakdowndatapie =” as a separator. There is only a single occurrence of this text, so the split returns 2 items in the list: everything before “var breakdowndatapie =” and everything after it.

Since the useful data is at the back, we use result1[1], the latter item.

Now, we need to look for another separator. So, let’s split on “data: [”. This is the text just in front of our Critical figure = 22.

So, again, stupidly, we narrow down our search to two pieces of data, and again we use the latter piece.

Then, we do another SPLIT on “]”, the closing square bracket. And there you go, you have found “22, 19647, 2040, 20” in the front part of the split.

And you continue by splitting on the “,” comma. You can of course split on “, ” (a comma plus a space), which gives you all 4 numbers (as strings) cleanly. But I stupidly didn’t do that. And then, I stupidly and lazily didn’t correct my mistake. And without thinking, I stupidly used the string.strip() function to strip off all the spaces. hahahahaha

But when you do rapid prototyping, your goal is to get what you want. Not to design the shortest and prettiest code.

That is how you get things done in rapid prototyping.
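
The split steps above can be sketched roughly like this. SAMPLE_PAGE is a made-up excerpt, so the real page source may look different, and the function name extract_breakdown is only for this example. One thing to watch: “[” is a special character in a regular expression, so it has to be escaped as “\[” in the pattern.

```python
import re

# Made-up excerpt of the page source; the real markup may differ.
SAMPLE_PAGE = """
<script>
var otherchart = { data: [1, 2, 3] };
var breakdowndatapie = {
    datasets: [{ data: [22, 19647, 2040, 20] }]
};
</script>
"""


def extract_breakdown(page_text):
    """Return [critical, active, discharged, deceased] as integers."""
    # Step 1: keep everything after the variable name (index 1 = latter part).
    after_var = re.split(r"var breakdowndatapie =", page_text, maxsplit=1)[1]
    # Step 2: keep everything after "data: [" (the "[" must be escaped).
    after_data = re.split(r"data: \[", after_var, maxsplit=1)[1]
    # Step 3: keep everything before the first closing bracket.
    numbers_text = re.split(r"\]", after_data, maxsplit=1)[0]
    # Step 4: split on commas and strip the leftover spaces.
    return [int(part.strip()) for part in numbers_text.split(",")]


critical, active, discharged, deceased = extract_breakdown(SAMPLE_PAGE)
print(critical, active, discharged, deceased)
```

It is still the many-steps way, but each step is easy to check by printing the intermediate text.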

And I managed to also run my Raspberry Pi 4 on a Virtual Desktop using VNC.

Of course I did everything for a reason.

Not only did I fulfill my task, I also showed Jay how VNC works and how it can remote control another computer. And VNC is now installed on his learning device.

I also showed Jay how you can do “Rapid Prototyping” of software using Python. It does not matter if you program in a stupid way, as long as it serves the purpose of prototyping. Proof of concept.

It is a fun project. And now I can monitor Covid19 without going to the computer to search for info.

The data is nicely shown.

So programming is not that hard. If I can write this without prior knowledge of Python, I am sure anyone can do the job too.

This is my Corona Monitoring station. Hahahaha


miniLiew

A geeky dad blog about his gadgets, his sons and family, his opinions, and lots of fun stuffs.
