Friday, July 17, 2015

Reading a Wikipedia Dump

So I wanted to embark on a journey to find out if I could make a natural language processor by analysing how grammar works, but I needed some examples. I needed to see how sentences were put together in terms of parts of speech, and also which words could be removed from a sentence without altering the meaning while making it easier to process. I figured Wikipedia was the largest source of text and context that I could simply download and scan to answer my questions. The problem is that the Wikipedia dump is encoded in XML so that it can be easily parsed and used offline, which is not the format I wanted for analysing the language.

My solution to the problem was to parse the Wikipedia dump myself and turn each page into its own file. For the initial testing, I decided to use a dump of Simple Wiki since it's quite a few gigabytes smaller. I also wrote a small Python script to step through the XML file 1000 characters at a time, which gave me some insight into how the XML file was structured.

Between the <page> and </page> tags sits a full page, which includes the title, sections, subsections, references and all those things. The title of each page is found between the <title> and </title> tags. The main text, which houses the relevant information, is kept between the <text> and </text> tags. Words surrounded by three single quotes (''' ''') are bolded. Words between two sets of square brackets ([[ ]]) are links to other pages. Sections and subsections are signalled by words between two (== ==) and three (=== ===) consecutive equal signs respectively. The asterisk symbol (*) indicates a bullet point. And {{ndash}} is simply a dash (-).
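For the curious, a few of these markup rules can be sketched as regex substitutions in Python. This `strip_markup` helper is hypothetical and only handles the handful of patterns listed above, not full wikitext:

```python
import re

def strip_markup(text):
    """Strip a few common wikitext patterns (a sketch, not a full parser)."""
    text = text.replace("{{ndash}}", "-")                 # {{ndash}} -> dash
    text = re.sub(r"'''(.*?)'''", r"\1", text)            # '''bold''' -> bold
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # == Section == / === Subsection === -> bare heading text
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)
    text = re.sub(r"^\*\s*", "- ", text, flags=re.M)      # * bullet -> - bullet
    return text
```
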

Knowing all this, I was able to construct a program that built a Python list of every page in the dump and then made a text file for each one, named after the page's title and containing the page's text.

Turns out some of the text files contained only a single line describing a redirect, or were otherwise empty, so I had to write a little Python script that went through and removed all of those, leaving only files with actually useful text.

Beyond that, I had to go through the files individually and remove ones with names in foreign languages and names of things that aren't useful for my purposes... 140,000+ text files... it's fun...

Anyway, the code for the program that pulls out each page and turns it into a text file is at the very bottom, and the general algorithm is just below this block of text. The program was designed to parse the entire 400+ MB simplewiki file without loading it all into memory... not sure why... it made sense at the time. It scans through the whole file line by line to find the page, title and text tags. Just remember to actually download a Wiki dump if you want to try this; use SimpleWiki, since it's quite a few gigabytes smaller than the actual EnWiki dump.


open simplewiki.xml
set text variable to " "
while lines haven't run out{
    page content = ""
    while you haven't found the <page> tag{
        text variable = readline()}
    while you haven't hit the </page> tag{
        add the current line to the page content variable}
    set a title variable equal to the page title by searching between <title> tags
    open a file for writing with the title variable's title
    set a text variable equal to the text between <text> tags
    write that text to the file
    close the file}
close the simplewiki.xml file


#!/usr/bin/env python
f = open("simplewiki.xml")
x = " "
while x != "": #while still reading
    page = "" #hold page
    while x.find("<page") == -1: #skip through lines until find page start
        x = f.readline()
        if x == "": break
    while x.find("</page") == -1: #get lines between <page> and </page>
        x = f.readline()
        page += x
        if x == "": break
    if page.find("<title") == -1: continue #hit end of file without a full page
    title = page[page.find("<title")+7:page.find("</title")].replace("/","-") #get title
    g = open(title+".txt","w") #open file with title
    start = page.find(">", page.find("<text")) + 1 #skip past the <text ...> tag itself
    text = page[start:page.find("</text")]
    g.write(text) #write the page text to the file
    g.close() #close the file
f.close()
print("Done") #give confirmation of finishing


Here are some pictures. This first one is a screenshot of the finished code in my favorite text editor, Gedit. I also use Ninja-IDE in extreme cases where I'm working on complicated projects.

This second picture shows all of my processed text files. Since there are so many, I actually sorted them into folders based on the letter they start with. The text files you can see all start with weird/foreign characters and were sorted manually after this picture was taken. But there you go: I now house every topic of SimpleWiki on my computer in text file format... come at me, internet blackout.

Wednesday, July 15, 2015

Gravity Visualizer

Hello world, today I will be sharing a little project I have been working on involving physics, C++ and a crazy question.

A few months ago, I was talking to a friend about the destruction of Earth by a huge meteor impact, and I had a crazy idea. What if a meteor shot through Earth and split it perfectly in half, leaving two hemispheres? Suppose the two halves moved apart due to the energy of the impact and then gravitationally attracted each other and collided; what would be the energy and destruction level of something like that? Now I know it sounds crazy, but it's a purely theoretical question that could possibly be simulated.

In order to begin answering my question, I had to know how non-spherical objects attract each other, and I figured a square would be a pretty easy shape to experiment with. I am working with 2D shapes because I don't know how to program 3D graphics, and the math for it would get pretty messy.

Initially I just wanted to take a shape, have a small particle circle around it and collect information about the force the shape exerts on it; all simulated, of course.

The first version was coded in Python using the PyGame graphics library, since I was already familiar with it and Python runs on both the Linux computers I use at home and the Windows computers in the classes where I developed the first version.

Below is an image of the square I used in the first version. The square is 50x50 pixels, each being 1 mass unit for a total of 2500 mass units. Forces ≥20 are a light blue, forces between 10 and 20 are yellow, forces between 4 and 10 are a hot pink(?), forces between 3 and 4 are red, forces between 2 and 3 are blue and forces between 1 and 2 are green.
This helped answer part of the question: shapes have gravity fields that roughly match their own shape at very close distances, but as you get farther away, the field becomes spherical. I tested other shapes and the same held true, but the distance at which the field became spherical depended on how irregular the shape was. Below is a hemisphere; it has a mostly spherical gravity field, but it is just a bit lopsided.

This was all well and good for a while, but the output looked horrible and took forever to render in Python. So, about two days ago, I rewrote it in C++ and made a few changes. Instead of scanning in a circular pattern, it now calculates the force present at every pixel of the 600x600 space. It also maps the force to a grayscale color by taking it modulo 256, keeping the value between 0 and 255 as RGB colors dictate; this means that once the value surpasses 255, it wraps back to 0, which shows up as the alternating rings. The result of the new output is shown below.
Below are more shapes which have had their gravity fields calculated, notice how the fields become more or less spherical as they go farther.

For the more code-savvy people out there, I will post the general algorithm for the C++ gravity visualizer:

make an empty array for the coordinates and mass of the points
add points to the array corresponding to the shape you want
calculate the center of mass and move the shape so the center is at the origin
draw your shape on the screen
for every x pixel{
    for every y pixel{
        set force2 to 0
        for every point{
            calculate the distance between the pixel and the point
            set force = (gconst * point mass) / distance
            set force2 = force2 + force}
        set pixel intensity to force2 % 256}
    update screen}
draw your shape on the screen
save screenshot
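The nested loops above translate pretty directly into Python if you want to play with the idea without the C++ source. This is only a sketch: the grid is tiny, the graphics are skipped, and `gconst` here is an arbitrary scaling constant rather than the real gravitational constant:

```python
import math

gconst = 1.0  # arbitrary scaling constant, not the real G

def field_intensities(points, width, height):
    """Compute the grayscale intensity (force2 % 256) for every pixel.
    points is a list of (x, y, mass) tuples describing the shape."""
    grid = []
    for py in range(height):
        row = []
        for px in range(width):
            force2 = 0.0
            for (x, y, mass) in points:
                distance = math.hypot(px - x, py - y)
                if distance == 0:
                    continue  # skip a pixel sitting exactly on a point mass
                force2 += (gconst * mass) / distance
            # wrap the summed force into a 0-255 grayscale value
            row.append(int(force2) % 256)
        grid.append(row)
    return grid
```

For the real thing you would build `points` from a 50x50 block of unit masses and draw `grid` onto a 600x600 surface pixel by pixel.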

The source code can be found here.

Tuesday, July 14, 2015

It has begun...

Hello world, today I am beginning my blog.

This blog will mostly be about my adventures in writing software and designing hardware in an attempt to spread ideas around. This blog will also be used as somewhat of a portfolio to document my projects in computer science. I will also be occasionally exploring algorithms and theory involved in programming.

A lot of my programming projects will be based around Python and C++, but there may also be a few in other languages, as I know around 8 languages fluently and have tinkered around with a few others.