CASE STUDY
How I did it
Importing Data into phpBB By Hand
THE PROBLEM
Client had a website in two sections: twenty-odd static pages and a forum section with thousands of posts. In the spirit of academic freedom, and against best advice, he allowed anyone viewing the site to post into the forum as anonymous. The site was up and moderately active from 2002 until 2005. The client had deleted spam as it appeared on the forum, but his living situation changed and he no longer was able to supervise and prune as actively as before. After that, spamdexing robots killed the forum in about two days.
Client had been basically a charity case who had worn out his welcome with the website host. No more support, no more answering emails. Client wanted his old site transferred to a new host and wanted to see if the old forum posts could be resurrected. Client had obtained a tarball of the old website files containing, among other things, a mySQL database.
The old website used webadmin software called herc. It appears to have been custom-written by the site administrator, as I can't find any reference to it having been used elsewhere. It combined cPanel-type web admininstration with web-page editing and some other features; many appear to have been planned for but not implemented, including searching.
Client wanted what he had before with the ability to search the forums and a few other incidental requirements, such as being able to sort topics into different forums or categories.
THE WRECKAGE
I never bothered to investigate herc very far, but looking at the tarball, it appears that most if not all of the web content, including the static pages, were served up from the mySQL database. I picked up a book about mySQL a year or two ago, but never had done much with it. Looking through the data from the mySQL command line, I found the old forum data in a myISAM table called messages, and a list of users called lcUsers. I wrote a webpage utility in php that would allow me to browse the old data based on the primary key field (pkey). It just showed every field in a particular record, with a NEXT button that would show the next record unless I typed a new pkey into a text box. If the pkey didn’t exist, the fields would be changed to x’s and “invalid data” text. I didn’t find much else that appeared to be useful, except for old versions of the static pages not served up by herc, and I stole all the artwork. I downloaded a utility that would save a page from my browser as a .JPG, visited the website, and took ‘snapshots’ of the pages that still existed on the website so I could match the old setup as closely as possible. For the forum itself, I’d just have to rely on my brief glimpse of it a year and a half ago, because it's dead, Jim...
GETTING A NEW WEBHOST
I'd never picked a webhost before, so we went shopping for one that had PHP, mySQL, and was local to our area physically. SSL, a secure command-line interface to the host computer, is available but it cost $3 more per month and I felt that we could probably do without it -- we could add the extra cost later if it turned out to be necessary. The site uses phpMyAdmin to perform MySQL adminstrative tasks, which is a bit of a pain but manageable. I recreated the twenty-odd static pages with PHP and HTML4.01, and created a “We suffered a major crash and switched hosts/This section is under construction/Check back later” stub for the unimplemented stuff (Download PDF, Support This Site, Forum, etc.). The pages all looked the same as before with the addition of “Welcome to the NEW ...” at the top of each page so the client would know when he was looking at the new site. We filled out the sitehost form, gave the credit card number, got the new passwords, used an FTP client to upload the files to the new host, and changed the DNS servers to point to the new website. This took about a week.
SELECTING A FORUM
Client didn’t get much visual indication of activity over the next couple of weeks as I investigated what to use as a new forum. I’d either have to use something off-the-shelf or write my own, which would give me more control, but might be as buggy, quirky and insecure as the setup he had just left, not to mention taking longer to develop. Our new web host won’t allow third-party software, at least, not easily. They did have YABB (Yet Another Bulletin Board) and phpBB 2.0.17 available. I installed both of them on our site and played around with them for a bit. I wasn't too impressed with YABB, but phpBB had all the features we needed and more right out of the box: session cookie management, CAPTCHA, COPPA, category/forum/topic heirarchy with the ability to move topics around, email integration, plus more that we might use in the future: moderators, user groups, ban user. It’s a mature product which is widely used, and there are a lot of color schemes, styles and modifications out there. There are importers which are designed to migrate postings from different types of forum software (I couldn’t find one for herc), and a lot of phpBB technical data is easy to find on the Web. The more I thought about how long it would take for me to do what phpBB already did, the less I liked the idea of writing my own. I ordered a book about phpbb and how to modify it.
CONVINCING THE CLIENT
I pointed the client to the test area of the website and asked him to play with posting, editing, using the adminstrative tools, etc. His initial reaction was extremely negative. I should have forseen this, because phpBB out of the box does look a bit cheesy. He expected to see what he was used to seeing, a forum that looked like the rest of the red-and-black website. One of his objections was that the size of the textbox used for posting was much smaller than he was used to, and, to be fair, he usually wrote lengthy postings. What he saw was the default phpBB silver/blue theme with the phpBB logo on top. I went out and downloaded a red-and-black theme that fit well with the rest of the site scheme, and changed the colors and logo. I dug around in the code until I found the textbox size and made it ridiculously large. I then asked him to go and look again. He was much happier and I took the opportunity to point out that any objection he had to do with how the forum would look were just layout issues and could be dealt with, rather than the mechanics of phpBB. The issue would continue to surface from time to time, but if I told him a particular issue was “just layout”, he would come around, especially after actually making the tweak.
INVESTIGATING phpBB
In order to look under the hood of phpBB, I installed it to my local Linux/Apache machine. I then installed the same color/layout style, called kPower, that I had installed on the website. I found a list of all phpBB tables and a description of all the fields on the Web. There were three, maybe four phpBB tables that I would have to deal with -- phpBB_posts, phpBB_posts_text, phpBB_topics and possibly phpBB_users. I created the same sort of webpage data browsers to examine every field in every record as I did for the old forum: one for phpbb_posts and phpbb_posts_text (there’s a one-to-one relationship between them; they’re basically the same table split up into two pieces, I assume for efficiency), and one to investigate phpbb_topics.
DEVELOPING A PLAN
My basic objective would be to modify the existing forum database by adding columns to it which would be in the proper phpBB format and be filled either with the appropriate data from the old format or the proper default value for our situation. Then I’d just transfer those new relevant columns to the new database by appending or inserting those records into the existing-but-mostly-empty phpbb database. I knew how the phpBB Search feature worked by browsing the Web, and knew that any imported data would not be searchable unless the search tables were rebuilt so I installed a MOD, rebuild_search, on my local machine to make sure I could do it on the website when the time came, and that the MOD actually worked as intended. So the rough outline becomes:
- Strip the bogus messages out of old data
- Add new-style columns to old database
- Fill all new-style columns with the appropriate stuff
- Insert new-style column data into phpBB
DEALING WITH SPAM
The messages table had about 10,000 records, but the last several thousand were nothing but spam, with a few legitimate postings interspersed, so I couldn’t just delete all after record number such-and-such. I created a new column in the messages table, called it PORN and set the default to ‘N’. Then I used my browser and a mySQL terminal window to look at each record. When I found some spam, I took the Primary Key Number from the browser and used mySQL to UPDATE messages SET PORN=‘Y’ WHERE pkey=n. After browsing all the records, I used mySQL to DELETE FROM messages WHERE PORN=‘Y’. This left about 1600 legitimate postings spread over about 250 topics.
MAKING A MAP
I found it useful to draw out a list of all the columns I had in old.messages, and a list of all the columns in phpbb_posts, phpbb_posts_text and phpbb_topics. I won’t reproduce it here, but I drew arrows showing which data from the old columns would go into which columns of phpBB. I used mySQL’s command line to SHOW COLUMNS FROM messages. I used phpBBdoctor.com’s excellent reference to list the columns in phpBB. There are three possibilities for each column in the old data:
- Straightforward data copying (with the possibility of truncation/padding because of a different definition of the column data type): For instance, the TEXT data contained in my body field (the text of the post itself), appears to go directly into the TEXT field called phpbb_posts_text.post_text. In my case, the direct transfers are: message (the post subject), and body (the post text). I used solid arrows to draw these lines.
- Data won’t be copied because the data is not relevant to new situation or field is always empty, always contains the same value, or will increase complexity past the point I’m willing to deal with: The fields I will ignore are: fid, ip (always 0), replies, recentpost, views (phpBB has the same fields but they bind together in a particular way and I don’t want to mess with them) and status (always seems to be ‘Y’). I marked these with X’s on the list. I also decided not to use the primary key (pkey INT(11)) from the old table because I could generate new unique numbers easily enough and, after all that spam, I’d be leaving lots of large gaps in my numbering, and, yes, I’m just too anal to allow that.
MORE DECISIONS
- Data I need to handle but which is not a straightforward copying process:
There are two similar fields in the old database: timedate and stamptime. Which to use? Are they the same data in different formats (One is INT(11) the other is DATETIME)? I decided I didn't care and since I needed to fill an INT(11) column (phpbb_posts.post_time) I decided to use stamptime.
What about the parent INT(11) field? Here the way the old database worked was different from how it needed to be in phpBB. In the old database, when a topic was started, the parent field was set to 0. Any posts that continued that particular thread or topic had it's parent set to the primary key (pkey INT(11)) of the original posting. In phpBB every posting in a particular thread has the same topic_id in phpbb_posts. As a further complication there are two other fields, both in phpbb_topics: topic_first_post_id, and topic_last_post_id. These must be set correctly because, from my internet reading, they can get skewed on heavily used boards and cause problems in properly displaying topics. So the data needs to be used, but not by straightforward copying. I drew dashed lines from parent to phpbb_posts.topic_id, phpbb_topics.topic_id/first_post_id/last_post_id. I won't give the howto details here, but if you need to do something similar, refer to the procedure mentioned later.
What about the username field? There are two fields in phpBB that might be relevant - phpbb_topics.topic_poster (MEDIUMINT(8)) and phpbb_posts.post_username (VARCHAR(25)). Further investigation showed that normal phpbb procedure was to use the topic_poster id number from another phpbb table, phpbb_users. The reason that post_username exists (it’s normally NULL) turns out to be that when a user is deleted from phpbb_users, all the posts with that topic poster id get filled with the now-unused post_username -- it’s a cubbyhole for users who are no longer around. This creates a new level of complexity. I either need to recreate all the accounts for old users, creating 43 new entries in the phpbb_users table or just ignore this data and pretend they’re all extinct. What are the ramifications of this decision? Well, I’m already working on three separate phpbb tables and I’m not anxious to deconstuct a fourth. If I ignore this data, then any old user who comes back to the forum and continues to post will not get credited with these postings (phpBB places emphasis on marking heavily-posting users with stars or labels or some sort of emphasis). As we’re an academic discussion area, we probably don’t plan to do much with this phpbb capability. So I discussed this with the client. He answered about what I expected (“If doing it this way makes it simpler for you and I get my forum back faster, go ahead”). So we'll forget topic_poster and use post_username.
These are the sorts of decisions you will have to make based on your own situation.
SELECTING DEFAULTS AND PHPBB DEPENDENCIES
To complete my chart, I need to figure how to fill all the rest of the phpbb fields. Many come from relationships within phpbb itself: phpbb_posts.post_id corresponds to phpbb_posts_text.post_id, phpbb_posts.topic_id is the same as phpbb_topics.topic_id, etc...
What about phpbb_posts.forum_id and phpbb_topics.forum_id? I need a value for that. No problem. I went to the phpbb installation on the site and created a new forum called “Restored Data Forum -- Data from the old site will appear here”. Then I just used my browser to look at the new site. Oh, it's forum number 17? Now I know how to fill that column...
enable_html, enable_smilies, enable_bbcode: We don’t plan to use these features, and there sure isn’t any bbcode in the old forum anyway, so we’ll set them to disable. You can leave most of these at their default value unless you plan to do something differently.
What’s phpbb_posts_text.bbCode_uid? Well, it”s interesting reading on the net, but not well explained and pretty arcane. Especially when importing data from forums that don't use bbcode, just leave it the default NULL.
What about phpbb_posts.poster_ip? I’ve got an old ip field, but it’s always blank -- must be an “unimplemented feature”. I used DEFAULT “7F000001” so everything I’m importing comes from IP address 127.0.0.1, meaning ’localhost’.
A SPECIAL NOTE FOR ANYONE USING post_username INSTEAD OF topic_poster: Set topic_poster to -1, representing “anonymous” user instead of the stated default value of NULL. This will save you some head-banging later...
BLAZING A TRAIL AND LEAVING BREADCRUMBS, uh, I mean DOCUMENTING
Since I’m not extremely fluent in mySQL and I wanted to be able to undo any mistakes, I backed up everything before I started. I created a dataset of the first couple hundred of records so I wasn’t working with the whole big glob, and I logged all my SQL statements into a text file so I could recreate every single step later with the entire database once I knew I had all the bugs worked out and everything worked. This turned out to be mostly wasted effort, since it worked the first time (after changing post_username to ’anonymous’, see above, and learning that I had to log into phpbb as admin and ’resync’ the Restored Data Forum), but I didn’t know that at the start. I imported the small test database into my local machine so I wouldn't have to risk corrupting the database on the site. I had to ask for help at a mySQL forum at one point, but got the answer I needed in less than an hour (thanks Alan Larkin!).
DOING IT FOR REAL
After successfully completing the test, it should be a simple matter to do it for real. I installed the rebuild_search MOD on the site, just as I did locally, before I began. For me, since I have dialup, it was easier to download the three tables from the site and make the changes locally. My favorite little text editor, gEdit, didn’t have good enough memory management to handle huge copy/paste operations well; I had better luck with OpenOffice. Since I have no SSL access to the site, I had to use phpMyAdmin as importer/exporter, but this didn’t present any problems. Once the data is all there, don't forget to rebuild your search tables with the rebuild_search MOD, so that the data is searchable.
AFTER THE AFTERGLOW
There’s nothing like the feeling of making something work that forces you stretch your abilities. I know a lot more arcane stuff than I did before I started. The rest of admin-ing this site will be ho-hum in comparison, mostly tweaking the layout. If you are doing or need to do something similar -- shoot me an email! I'm available and looking for the next fun project...









Textures courtesy of