I used to save tweet data collection in sqlite using python, A memo that I researched variously because I wanted to do it at high speed while multithreading with java.
When dealing with sqlite sqlite-jdbc is easy to use, so I fetch it from maven and use it.
Initially, I was collecting data on the HDD,
I wondered why it was so slow, and it took a lot of time to add DBs rather than collecting them.
I searched for the cause.
Solve one by one
prepareStatement takes a long time, so change it to ʻINSERT OR IGNORE.journal_mode to MEMORY and set sync_mode to ʻOFF`.Example of saving the tweet that was said first
DataBase.java
    /**
     *Driver registration
     */
    public static void dbInit() {
        try {
            Class.forName("org.sqlite.JDBC");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
    /**
     *pragma settings
     *Here 3.Jornal_mode and sync_mode is set
     */
    public static Properties getProperties() {
        Properties prop = new Properties();
        prop.put("journal_mode", "MEMORY");
        prop.put("sync_mode", "OFF");
        return prop;
    }
    /**
     *Process to add to database
     *If you do not put synchronized, deadlock will occur when writing hard with multithread
     */
    public static synchronized void putTweet2SQL(File dbFile, List<Status> tweet) {
        Statement stmt;
        String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
        PreparedStatement pstmt;
        dbInit();
        try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) { //try-with-resources
            conn.setAutoCommit(false);
            stmt = conn.createStatement();
            //Create if you don't have a database
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, user_screen_name TEXT,tweet_text TEXT)");
            //Tweet ID, user ID, screen name, tweet text
            //2.Preparing to add data in bulk
            pstmt = conn.prepareStatement("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)");
            for (Status status : tweet) {
                place = status.getPlace().getFullName();
                pstmt.setLong(1, status.getId());
                pstmt.setLong(2, status.getUser().getId());
                pstmt.setString(3, status.getUser().getScreenName());
                pstmt.setString(4, status.getText());
                pstmt.addBatch();//1.Add to processing
            }
            pstmt.executeBatch();//1.The actual database addition process is done here
            conn.commit();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
Even when reading from the DB, setFetchSize will call the specified number at once, so it will be comfortable to some extent.
DataBase.java
    /**
     *Get Tweets
     *If you pull huge data, the heap will be insufficient, so in that case rewrite appropriately
     * @return Tweet list
     */
    public static List<Status> getTweetsFromSQL(File dbFile) {
        String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
        PreparedStatement pstmt;
        List<Status> userDetails = new ArrayList<>();//Appropriate list
        dbInit();
        try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) {
            pstmt = conn.prepareStatement("SELECT * FROM tweets");
            pstmt.setFetchSize(1000);//If you set 5000 trillion or something, you will get it all at once
            ResultSet rs = pstmt.executeQuery();
            while (rs.next()) {//Tweet ID, user ID, screen name, tweet text
                Status status = new status();
                status.setId(rs.getLong(1));
                status.setUserId(rs.getLong(2));
                status.setScreenName(rs.getString(3));
                status.setTweetText(rs.getString(4));
                
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return userDetails;
    }
It's much faster and I'm happy because I could buy more SSDs
Recommended Posts